Data de-identification is the process used to prevent a person’s identity or private information in a database, from being revealed. Processes include masking personal identifiers or replacing real information with temporary data. Oftentimes de-identified data is then put through a process of re-identification for further use.
De-identification is part of the overall practice of loss prevention and data containment, and is used in nearly every industry to protect consumer data, provide adherence to HIPPA privacy laws, protect the privacy of research participants, and many other use cases.
Data De-Identification and the Cloud
Enterprise data growth has continued exponentially, and the migration to cloud data lakes has become a major trend, allowing big data environments to expand flexibly and without limit. As more organizations use cloud data lakes for business analytics, machine learning and artificial intelligence, specific data privacy challenges are emerging.
The explosion of data has led to another trend – the inadvertent exposure or misconfiguration of data in some cloud-based environments. There is an obvious need to de-identify the underlying data or the data inside cloud data lakes, while still permitting the business to take advantage of big data technologies such as analytics and AI modeling.
In the cloud providers’ “shared responsibility model”, the provider is responsible for providing security of the underlying infrastructure and the data centers, while the customer is responsible for the data put into the cloud. So de-identification means de-identifying the actual data values or protecting the underlying data values that are being put in.
Unfortunately, many de-identification methods require additional development or altering the data pipeline, and as a result either slow down the use of cloud-based analytics or leave data potentially exposed. Further, de-identification represents only part of the challenge as new methods to access and warehouse data can limit scenarios where authorized re-identification or analysis of data may be required.
Common Methods for Data De-Identification
- Data-centric encryption, including field-level encryption / record-level encryption - actually protecting specific data values with some forms of encryption
- Data tokenization, which made use a tokenization library, or a tokenization, or what is known as a boltless tokenization, or some type of codebook method – for example, methods of replacing an account number with another value
- Format-preserving encryption - another variant of a data type and length-preserving encryption mode. Data such as social security number, or a medical record number, or a credit card number, or email address would then be preserved from a formatting perspective, but randomized in terms of the actual underlying data values.
- Data masking - used for controlling the presentation layer, so instead of seeing somebody’s full credit card number, you might only see the last four digits, or values can be completely obfuscated.
- Role-based data masking – allows certain roles within an organization, or certain parties that are in a more privileged state or third parties that may be external, but basically get presented with different views of the data that is dynamically rendered based on your group membership.
- Advanced Encryption schemes – includes enclave-based encryption, where data is secured much like the fingerprint or face ID on a mobile device, extended to a server class compute mode; and homomorphic encryption, and multi-party computation.
Baffle’s Data Protection Services allows for easy de-identification of data on-the-fly and selective re-identification of data based on authorized roles via a no-code model.
Benefits of Objects Encryption vs. Data-Centric Encryption
- De-identify, tokenize or encrypt data INSIDE objects and files
- Safe harbor from accidental data leaks from key privacy and compliance regulations
- Accelerate cloud-based data analytics programs by addressing key security and privacy concerns
Baffle allows you to push data to the cloud on a continual basis, de-identify it on the fly, and then allow members of your organization to run the types of analytics that they would need to run on that respective data set:
Cloud data lakes give organizations a lot more flexibility to accommodate massive data growth for the future. The risk of data breaches continues to be with us, but companies can leverage data-centric protection methods and Baffle’s solution to reduce your risk, unlock the value in the data and help to monetize it.
Baffle also supports headless deployments via Docker images. If you’d prefer to deploy via a Docker image and get up and running quickly. Please email [email protected].
Shared Responsibility Model
Sample De-Identification of Data vs Clear Text
Baffle supports multiple database and file encryption modes including NIST certified and FIPS validated AES modes.
Attend this webinar to learn how data can be easily de-identified or tokenized as part of your data pipeline engineering. Learn how you can establish an S3 cloud data lake fully de-identified and use AWS Athena to easily access and run analytics on data sets for authorized users.
This session will provide a review of some key data security gaps related to a data analytics pipeline, and provide a live demonstration of de-identifying data on the fly during a migration, as well as subsequent data analysis via AWS Athena.
Data Protection Services
The Baffle Data Protection Service provides a transparent data-centric security layer that offers several data protection modes. Capabilities include data de-identification, tokenization, field level encryption, record level encryption, format preserving encryption (FPE) BYOK for SaaS, dynamic data masking, database encryption solutions such as file encryption, file content encryption, encryption API services, role-based access control (RBAC), privacy preserving analytics and secure data sharing.
Monitor access to databases to identify patterns or anomalous behavior and profile applications
Role-Based Access Control
Define which systems, users or groups can access data stores and dynamically entitle who can see what data