Why data tokenization is insecure

By Ameesh Divatia, CEO and Co-founder | May 13, 2020

(Why you can’t stop data breaches – Part II)

This blog, the second in a series on “Why you can’t stop data breaches,” details attack methods that can compromise a set of common data tokenization methods. The first entry in the series, linked here, discussed database-level encryption, encryption at rest, and gaps in the threat model.

For years, the security industry has relied on data tokenization to secure or de-identify sensitive data at the field level, both to mitigate the risk of data breaches and to ensure privacy. In fact, many security standards and privacy regulations, such as PCI DSS and regulations that call for pseudonymization of data, cite data tokenization as a way to achieve compliance and qualify for Safe Harbor.

This blog looks at specific tokenization methods and addresses how an adversary could execute attacks to gain knowledge of the transform method and ultimately reverse tokens to steal the protected data.  

Vendors have introduced a variety of data tokenization methods. Two of the more prevalent are vault-based tokenization (a tokenization vault or service) and vaultless tokenization.

Tokenization vaults or services replace the original data value with a token and store the original plaintext value together with its token in a file or database. When an application needs to de-tokenize a value, it calls the token store or vault, which looks up the token and returns the original value.
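To make the vault model concrete, here is a minimal Python sketch. It is an illustration only: the class name `TokenVault`, its methods, and the in-memory dictionaries are hypothetical stand-ins for a real product's database or file store, and the token format (random digits of the same length) is an assumption.

```python
import secrets


class TokenVault:
    """Toy vault: keeps token <-> plaintext pairs in memory.

    A real vault would persist these pairs in a database or file;
    the dictionaries here are only a stand-in for that store.
    """

    def __init__(self):
        self._token_to_plain = {}
        self._plain_to_token = {}

    def tokenize(self, plaintext: str) -> str:
        # Reuse the existing token so the same value always maps to the same token.
        if plaintext in self._plain_to_token:
            return self._plain_to_token[plaintext]
        # Draw random digit strings until one is unused and differs from the input.
        while True:
            token = "".join(secrets.choice("0123456789") for _ in range(len(plaintext)))
            if token not in self._token_to_plain and token != plaintext:
                break
        self._token_to_plain[token] = plaintext
        self._plain_to_token[plaintext] = token
        return token

    def detokenize(self, token: str) -> str:
        # De-tokenization is a lookup: the plaintext never left the vault.
        return self._token_to_plain[token]


vault = TokenVault()
token = vault.tokenize("4111111111111111")
print(token, "->", vault.detokenize(token))
```

Note that the vault itself holds every plaintext value, so whoever can read its store can read the protected data.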

One of the main challenges with this approach is that it creates a copy of your sensitive data and simply moves it to another location.  Some in the industry refer to this practice as “moving the problem” and in many ways it creates another attack surface for someone trying to steal data. Vault-based tokenization methods also do not scale or perform very well and are difficult to operationalize in distributed data environments.

Vaultless tokenization, on the other hand, does not use a file or data store. Instead, it creates an in-memory codebook that maps plaintext values to tokens. Within the tokenization domain, this may be referred to as table-connected tokenization. This method performs better and overcomes some of the challenges of vault-based tokenization, but on closer examination the vaultless approach has significant security gaps of its own.

More specifically, data tokenization methods that leverage a codebook approach can be vulnerable to a chosen-plaintext or chosen-ciphertext attack. In these attacks, an adversary submits tokenization or de-tokenization requests of their choosing in order to gain knowledge of the transform or encryption method. It may seem that such an attack would require an exhaustive, and therefore infeasible, number of requests, but against codebook or vaultless tokenization it can be quite effective, depending on the size of the lookup tables.

To begin with, the codebook approach uses static mappings of plaintext values to tokens, as described above. The codebook needs to be populated with every possible value in the data range in order to return unique tokens in a performant manner. For example, a range of phone numbers, social security numbers, or credit card numbers would, in theory, create a static lookup table with trillions of entries. However, codebook approaches commonly do not create a single static mapping. Instead, they cascade a series of small static tables, and it is the smallness of these tables that introduces the weakness in the model. Smaller tables yield higher performance and require less memory, but they increase the chances of a successful attack.

An attacker executing a chosen-plaintext attack can choose plaintexts that exploit the smaller static tables, discover the relationships between plaintexts and their tokens, and invert the static mapping tables. The attacker can then detokenize any given token with a high probability that depends on the size of the codebook tables. An attacker executing a chosen-ciphertext attack would compromise the tokenization system essentially every time.

For example, a numeric data range of twelve digits would create a mapping of one trillion entries. However, because smaller static tables are used for scale and performance, the codebook mappings can be successfully attacked with roughly two million tokenization attempts. While that may seem like a large number, in the world of cryptographic attacks it is trivial to execute, and it is enough to invert the tables.
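The gap between those two numbers is easy to check. The short Python calculation below uses the figures from the text. The split into two one-million-entry tables is an assumption made here to account for "roughly two million"; the text does not spell out how that figure is derived.

```python
# Sizes taken from the example in the text.
digits = 12
full_table_entries = 10 ** digits           # one entry per 12-digit value

# ASSUMPTION: two cascaded token tables of one million entries each,
# which is one way to arrive at "roughly two million" attempts.
small_table_entries = 10 ** 6
attack_requests = 2 * small_table_entries

reduction = full_table_entries // attack_requests

print(f"full codebook : {full_table_entries:,} entries")
print(f"attack budget : {attack_requests:,} requests")
print(f"reduction     : {reduction:,}x fewer")
```

Under that assumption the attacker needs about 500,000 times fewer requests than a single trillion-entry mapping would demand.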

To execute this attack, consider the following flow for a vaultless tokenization method. In this example, a sixteen-digit credit card number with digits d0 through d15 is tokenized such that only the middle six digits, d6..d11, are transformed, producing g6..g11. The first six digits and the last four digits remain in the clear.

As stated above, tokenizing every credit card variation with a single table would require a lookup table with a trillion entries. The codebook approach instead splits this single lookup into a series of small table lookups. The first six digits, d0..d5, index a million-entry lookup table whose result feeds the second stage. There, e6..e11 is mapped through a million-entry table (token table 1) to f6..f11, which is then looked up in another million-entry table (token table 2) to produce the token g6..g11. Token tables 1 and 2 are selected from token table sets 1 and 2 based on the last four digits, d12..d15.
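The cascade can be sketched in Python at toy scale. This is a sketch under stated assumptions, not any vendor's implementation: the blocks are three digits (1,000 entries) rather than six, the suffix is one digit rather than four, and the way the prefix lookup is combined with the middle digits (modular addition) is an assumption, since the flow above does not specify it. All names (`prefix_table`, `token_set_1`, `token_set_2`, `tokenize`, `detokenize`) are hypothetical.

```python
import random

# Toy sizes: 3-digit blocks (1,000 entries) stand in for the article's
# 6-digit, million-entry tables; a 1-digit suffix stands in for d12..d15.
BLOCK = 1000
SUFFIXES = 10

rng = random.Random(2020)  # fixed seed so the tables are reproducible


def random_permutation(n):
    """Return a random one-to-one mapping of 0..n-1 onto itself."""
    table = list(range(n))
    rng.shuffle(table)
    return table


# First table: indexed by the cleartext prefix (d0..d5 in the article).
prefix_table = [rng.randrange(BLOCK) for _ in range(BLOCK)]
# Token table sets 1 and 2: one table per suffix value (d12..d15 in the article).
token_set_1 = [random_permutation(BLOCK) for _ in range(SUFFIXES)]
token_set_2 = [random_permutation(BLOCK) for _ in range(SUFFIXES)]


def tokenize(prefix, middle, suffix):
    # ASSUMPTION: the prefix lookup is mixed into the middle digits by modular addition.
    e = (middle + prefix_table[prefix]) % BLOCK
    f = token_set_1[suffix][e]   # token table 1, selected by the suffix
    g = token_set_2[suffix][f]   # token table 2, selected by the suffix
    return g


def detokenize(prefix, token_middle, suffix):
    f = token_set_2[suffix].index(token_middle)  # invert token table 2
    e = token_set_1[suffix].index(f)             # invert token table 1
    return (e - prefix_table[prefix]) % BLOCK    # undo the assumed prefix offset


g = tokenize(prefix=411, middle=123, suffix=7)
print(f"middle 123 -> token {g:03d} -> {detokenize(411, g, 7):03d}")
```

The point of the sketch is the memory footprint: a handful of small tables replaces one table covering every prefix and middle combination, and every table is static.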

Given the digits T0..T15 of an arbitrary token, an attacker can compromise the tokenized value by issuing a million times fewer requests than the full tokenization mapping space would require. It’s worth noting that if more digits were tokenized or hidden, the scheme becomes easier to break. To execute these attacks, the adversary fixes the first six and last four digits of a plaintext and chooses values for tokenization and de-tokenization of the middle six digits. This creates a high probability that the attacker learns which tokens the system will choose for specific requests. So instead of the trillion tokenization attempts otherwise needed to detokenize T0..T15, the number of requests is reduced by a factor of one million. These attacks are detailed further in the white paper, “A Security Analysis of Tokenization Systems”.
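The simplest form of this idea can be sketched in Python. The example is deliberately reduced and rests on assumptions: the tokenizer is a hypothetical black box (`make_oracle`) that applies a fixed secret mapping to the middle digits for each cleartext prefix and suffix, the hidden block is three digits instead of six, and the search is plain enumeration. The attacks in the white paper are more refined than this.

```python
import random

MIDDLE_SPACE = 1000  # toy: 3 hidden digits; the article's example hides 6 (10**6 values)


def make_oracle(seed=7):
    """Hypothetical black-box tokenizer.

    For each cleartext (prefix, suffix) pair, the middle digits pass
    through a fixed, secret one-to-one table that the attacker cannot read.
    """
    perms = {}

    def tokenize(prefix, middle, suffix):
        key = (prefix, suffix)
        if key not in perms:
            table = list(range(MIDDLE_SPACE))
            random.Random(f"{seed}:{prefix}:{suffix}").shuffle(table)
            perms[key] = table
        return perms[key][middle]

    return tokenize


def recover_middle(tokenize, prefix, token_middle, suffix):
    """Chosen-plaintext search: hold prefix and suffix fixed, vary only the middle."""
    for requests, guess in enumerate(range(MIDDLE_SPACE), start=1):
        if tokenize(prefix, guess, suffix) == token_middle:
            return guess, requests
    raise ValueError("token was not produced by this oracle")


oracle = make_oracle()
secret_middle = 421
# The token the attacker has observed; prefix and suffix are visible in the clear.
observed = oracle("411111", secret_middle, "1111")

recovered, requests = recover_middle(oracle, "411111", observed, "1111")
print(f"recovered {recovered:03d} after {requests} of at most {MIDDLE_SPACE} requests")
```

Because the prefix and suffix are visible in the token, the attacker only ever searches the hidden block, which is why the work scales with the small table size and not with the full mapping space.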

In summary, codebook or vaultless tokenization methods, while offering enhanced scale and performance, can be easily compromised by an attacker using several different attack approaches.

Returning to our theme of “why you can’t stop data breaches,” we continue to find gaps in data security methods that are in widespread use and that many in the IT security industry consider best practice, compliant, or both. We recommend re-evaluating the threat model as data distribution, services creation, cloud storage, and SaaS adoption all continue to increase. You may find that some data protection methods need to be refreshed or replaced to keep pace with how data is getting breached today.