Why data tokenization is insecure

By Priyadarshan Kolte | May 13, 2020

(Why you can’t stop data breaches – Part II)

This blog, the second in a series on “Why you can’t stop data breaches,” details attack methods that can be used to compromise several common data tokenization methods. The first entry in the series, linked here, discussed database-level encryption, encryption at rest, and gaps in the threat model.

For years, the cybersecurity and cloud security industry has relied on data tokenization and data masking to secure and de-identify sensitive data at the field level, mitigate the risk of data breaches, and preserve privacy.

In fact, many privacy regulations and security standards outline the use of data tokenization as a method to achieve Safe Harbor and data privacy. Common use cases include retailers and e-commerce sites, which are under regulatory requirements to protect credit card information and cardholder data and to maintain PCI DSS (Payment Card Industry Data Security Standard) compliance. Alongside PCI DSS, regulations such as HIPAA-HITECH, GDPR, and ITAR all describe pseudonymization of data.

Without these compliance standards, breaches would be even more prevalent than they already are. For example, without a token service, attackers could easily uncover personal data (PII) or the primary account number (PAN) of a customer’s payment card.

The Efficacy of Tokenization

While we explore the efficacy of tokenization, it is essential to remember that it has advantages in the market. Tokenized data can be processed by legacy applications, and tokenization can handle both structured and unstructured data, making it more versatile than traditional encryption. The use of format-preserving encryption algorithms also protects data while preserving its original formatting and length.

This blog looks at specific tokenization methods and addresses how an adversary or hacker could execute attacks to gain knowledge of the transform method and ultimately reverse tokens to steal the protected data and sensitive information.

Vendors and service providers have introduced various methods of data tokenization. Two of the more prevalent are vault-based tokenization (a token vault service) and vaultless tokenization.

Token Vaults

Tokenization vaults or services use a database- or file-based method that replaces the original data value with a token and stores the original plaintext value alongside its token inside a file or database. When an application needs to de-tokenize a tokenized value, it makes a call to the token store or vault, and a de-tokenization lookup occurs.
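
As a rough illustration, the sketch below models a vault-backed tokenize/de-tokenize flow in Python, with an in-memory SQLite table standing in for the vault. The function names, schema, and random token format are assumptions chosen for illustration, not any vendor’s actual implementation.

```python
# Minimal sketch of vault-based tokenization (illustrative only).
import secrets
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vault (token TEXT PRIMARY KEY, plaintext TEXT UNIQUE)")

def tokenize(plaintext: str) -> str:
    """Return the existing token for a value, or mint and store a new one."""
    row = conn.execute("SELECT token FROM vault WHERE plaintext = ?", (plaintext,)).fetchone()
    if row:
        return row[0]
    token = secrets.token_hex(8)  # random surrogate value
    conn.execute("INSERT INTO vault VALUES (?, ?)", (token, plaintext))
    return token

def detokenize(token: str) -> str:
    """Look the token back up in the vault; the plaintext lives right alongside it."""
    row = conn.execute("SELECT plaintext FROM vault WHERE token = ?", (token,)).fetchone()
    if row is None:
        raise KeyError("unknown token")
    return row[0]

t = tokenize("4111111111111111")
assert detokenize(t) == "4111111111111111"  # the vault itself holds the sensitive value
```

Note that the vault is now a second copy of the sensitive data, which is exactly the concern described below.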

One of the main challenges with this approach is that it creates a copy of your sensitive data and simply moves it to another location. Some in the industry refer to this practice as “moving the problem.” In many ways, it creates another attack surface for someone trying to steal data. Vault-based tokenization methods also do not scale or perform well and are difficult to operationalize in distributed data environments.

Vaultless Tokenization

Vaultless tokenization, on the other hand, does not use a file or data store but instead creates an in-memory codebook of the tokenization mappings to plaintext values. Within the tokenization domain, this may be referred to as table-connected tokenization. This method performs better and overcomes some of the challenges associated with the above-referenced vault-based tokenization. Still, upon closer examination, the vaultless approach also has significant security gaps.

More specifically, data tokenization methods that leverage a codebook approach can be vulnerable to a chosen-plaintext or chosen-ciphertext attack. In these attacks, an adversary produces tokenization requests in order to gain knowledge of the transform or encryption method. Performing this type of attack may seem exhaustive and infeasible, but against codebook or vaultless tokenization it can be quite effective, depending on the size of the lookup tables.

Codebook Approach

To begin with, the codebook approach utilizes static mappings of plaintext values to tokens as described above. The codebook needs to be populated with all possible range values in a data set in order to return unique tokens in a performant manner.
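
As a minimal sketch of that idea, the example below builds a complete static codebook for a small four-digit domain using a seeded shuffle. The domain size and construction are assumptions chosen to keep the example tiny, not the method of any particular product.

```python
# Minimal sketch of a full static codebook over a small numeric domain.
import random

DIGITS = 4  # 10**4 possible values; real data ranges are far larger
domain = [f"{i:0{DIGITS}d}" for i in range(10**DIGITS)]

rng = random.Random(12345)  # stand-in for a secret key or seed
shuffled = domain[:]
rng.shuffle(shuffled)

codebook = dict(zip(domain, shuffled))                    # plaintext -> token
inverse = {token: pt for pt, token in codebook.items()}   # token -> plaintext

assert inverse[codebook["0042"]] == "0042"
# Covering a twelve-digit range this way would need 10**12 entries, which is
# why practical implementations cascade several small tables instead.
```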

For example, a range of phone numbers, social security numbers, or credit card numbers would, in theory, create a static lookup table with trillions of entries. However, codebook approaches commonly do not create a single static mapping but instead cascade a series of small static tables, and it is the smallness of these static tables that introduces the weakness in the model. Smaller tables yield higher performance and require less memory but increase the chances of a successful attack.

Unmasking The Token

An attacker executing a chosen-plaintext attack can choose plaintexts that exploit the smaller static tables, discover the relationships between the plaintexts and their tokens, and invert the static mapping tables. Depending on the size of the codebook tables, this lets the attacker completely detokenize any given token with high probability. An attacker executing a chosen-ciphertext attack would compromise the tokenization system essentially every time.

For example, a numeric data range of twelve digits would create a mapping of one trillion entries. However, by using smaller static tables for scale and performance, the codebook mappings can be successfully attacked with roughly two million tokenization attempts. While that may seem like a large number, in the world of cryptographic attacks, it would be trivial to execute and invert the tables.
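
For a rough sense of scale, the back-of-the-envelope arithmetic below compares a single twelve-digit static mapping with the work needed to enumerate two cascaded million-entry tables; the figures are the rounded ones quoted above.

```python
# Back-of-the-envelope comparison (illustrative arithmetic only).
single_table_entries = 10 ** 12   # one static mapping over a twelve-digit range
cascaded_table_size = 10 ** 6     # each small table covers six digits
tables_to_invert = 2              # two token tables, as in the example below
attack_queries = tables_to_invert * cascaded_table_size

print(f"{single_table_entries:,} entries in a single static mapping")
print(f"~{attack_queries:,} tokenization requests to invert the cascaded tables")
```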

To execute this attack, consider the following flow for a vaultless tokenization method. In this example, a sixteen-digit credit card number with digits d0 through d15 is tokenized such that the middle six digits, d6..d11, are transformed to g6..g11, while the first six digits and last four digits remain in the clear.

As stated above, conventional tokenization of the credit card variations would require a lookup table with a trillion entries. The codebook approach instead splits this single lookup into a series of small table lookups. The first table maps d0..d5 through a million-entry lookup table, which references the second stage: e6..e11 is mapped through a million-entry token table 1 to f6..f11, which is then looked up in another million-entry token table 2 to produce the token g6..g11. Token tables 1 and 2 are selected from token table sets 1 and 2 based on the last four digits, d12..d15.
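
The toy sketch below models one possible reading of this cascaded flow. The table construction, the mixing step keyed by the leading digits, and the shrunken two-digit middle field are all assumptions made so the example stays small and runnable; it illustrates the shape of the scheme, not the exact implementation of any product.

```python
# Toy model of a cascaded codebook (illustrative reading of the flow above).
import random

MID = 2             # middle-field width (six digits in the text above)
SPACE = 10 ** MID   # 100 values here, a million in the real scheme

def permutation(seed: str) -> list[int]:
    vals = list(range(SPACE))
    random.Random(seed).shuffle(vals)
    return vals

# Token tables 1 and 2 are selected from table *sets* keyed by the last four digits.
token_table_set_1 = {last4: permutation(f"t1-{last4:04d}") for last4 in range(10_000)}
token_table_set_2 = {last4: permutation(f"t2-{last4:04d}") for last4 in range(10_000)}

def tokenize(pan: str) -> str:
    first6, mid, last4 = pan[:6], int(pan[6:6 + MID]), int(pan[-4:])
    e = (mid + int(first6)) % SPACE              # assumed mixing step keyed by the leading digits
    f = token_table_set_1[last4][e]              # token table 1: e -> f
    g = token_table_set_2[last4][f]              # token table 2: f -> g
    return pan[:6] + f"{g:0{MID}d}" + pan[-4:]   # leading six and trailing four stay in the clear

print(tokenize("411111" + "99" + "1234"))        # only the middle field is transformed
```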

Given an arbitrary token with digits T0..T15, an attacker can compromise the tokenized value by issuing a number of requests roughly a factor of one million smaller than the entire tokenization mapping space. It is worth noting that if more digits were tokenized or hidden, the scheme becomes easier to break.

To execute these attacks, the adversary fixes the first six and last four digits of a plaintext and submits chosen values for tokenization and de-tokenization of the middle six digits. With high probability, the attacker learns which tokens the system will choose for specific requests. So instead of a trillion tokenization attempts to detokenize T0…T15, the number of requests is reduced by a factor of one million. These attacks are detailed further in the white paper, “A Security Analysis of Tokenization Systems”.
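
The sketch below illustrates the shape of such a chosen-plaintext attack against a toy tokenization oracle. The oracle is a stand-in with a two-digit middle field so the enumeration finishes instantly; against a real six-digit field the same loop would issue on the order of a million requests, as described above.

```python
# Sketch of a chosen-plaintext attack against a toy tokenization oracle.
import random

MID = 2             # middle-field width (six digits in the real scheme)
SPACE = 10 ** MID

def oracle_tokenize(pan: str) -> str:
    """Stand-in for the tokenization service the attacker is allowed to query."""
    first6, mid, last4 = pan[:6], int(pan[6:6 + MID]), pan[-4:]
    table = list(range(SPACE))
    random.Random(f"{first6}-{last4}").shuffle(table)  # mapping depends on the clear digits
    return first6 + f"{table[mid]:0{MID}d}" + last4

# The attacker fixes the clear portions and enumerates every possible middle value.
first6, last4 = "411111", "1234"
inverse = {}
for mid in range(SPACE):                               # ~10**6 requests for a six-digit field
    token = oracle_tokenize(f"{first6}{mid:0{MID}d}{last4}")
    inverse[token[6:6 + MID]] = f"{mid:0{MID}d}"       # token middle -> plaintext middle

# Any token later observed with the same clear digits can now be reversed offline.
stolen = oracle_tokenize(f"{first6}42{last4}")
print(inverse[stolen[6:6 + MID]])                      # -> 42
```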

In summary, codebook or vaultless tokenization methods, while offering improved scale and performance, can be readily compromised by an attacker using a variety of attack approaches.

You Can’t Stop Breaches

Returning to our theme of “why you can’t stop data breaches”: we continue to find gaps in data security methods that are in widespread use and considered best practice and/or compliant by many in the IT security industry.

We recommend re-evaluating the threat model as data distribution, service creation, cloud storage, and SaaS adoption all continue to increase. You may find that some data protection methods need to be refreshed or replaced to keep pace with how data is getting breached today.