Securing Cloud AI Data Pipelines

By Ameesh Divatia, CEO and co-founder | June 14, 2022

Recent attacks on Jupyter Notebook show that threat actors are targeting cloud-based data science platforms. As companies transition to cloud working environments, they need data security controls to mitigate the risks of unauthorized access and theft posed by the expanded attack surface. Cloud AI data pipelines, in particular, give companies tremendous flexibility and opportunity, but the extensive data sets required to perform the analytics significantly heighten those risks.

Let’s look at the AI data pipeline and what organizations should be doing to protect it.

Gartner’s take on AI data security

Examining this issue for a 2021 report, Gartner clarified that an AI data pipeline is only as secure as its weakest link. In the spirit of “shift left” security in DevSecOps, Gartner recommends that AI data pipelines build in security controls from the very beginning of the development process. Entitlement, Gartner’s term for user access management, must be carefully implemented across all stages of the pipeline before platforms and analytics tools are even considered.

Gartner’s analysis divided AI data pipelines into three primary security concerns:

  • Model security refers to the continuous analysis of data to identify any potential tampering or data poisoning that might harm data-driven systems.
  • Infrastructure security focuses on securing networks and data storage and ingestion points, which can typically be handled with mature technologies.
  • Data security, the most prominent of the three security concerns, is the protection of data at all points in the AI pipeline from malicious or unintended access.

The report also points out that regulatory compliance is one of the more pressing data security issues companies should consider when designing an AI pipeline. Now that regulations covering customer data privacy have passed in five U.S. states—and many more are in the works—best-in-class data protection is no longer a differentiator but a requirement for many organizations.  

Creating an AI data pipeline security stack

Gartner emphasizes two subsets of data security: data theft/unintended disclosure and compliance requirements. The report recommends implementing metadata creation, differential privacy, data masking, tokenization, and fully homomorphic encryption to mitigate these risks and facilitate compliance.
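Of the techniques Gartner lists, differential privacy is perhaps the least familiar, so a small illustration may help. The sketch below is only one common way to approach it, the Laplace mechanism applied to a counting query; it is not taken from the Gartner report or from any vendor product. The function names, the sample ages, and the epsilon values are all assumptions made for this example.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def private_count(records, predicate, epsilon, rng=None):
    """Return a noisy count of the records that satisfy `predicate`.

    Adding or removing one person changes a count by at most 1 (sensitivity 1),
    so Laplace noise with scale 1/epsilon is the textbook choice here.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if rng is None:
        rng = random.Random()  # illustration only, not a hardened noise source
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Made-up sample data: how many people are 40 or older?
ages = [23, 37, 41, 58, 62, 19, 45]
print(private_count(ages, lambda age: age >= 40, epsilon=0.5))
```

A smaller epsilon adds more noise and gives stronger privacy, while a larger epsilon gives more accurate answers. A production system would also need to track the total privacy budget spent across queries, which this sketch does not attempt.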

For companies looking to follow Gartner’s recommendations and develop an AI data pipeline security stack, finding the right security vendor is the place to start. The ideal vendor offers continuous defense against data theft through three mutually reinforcing methods: 

  • Masking: If a piece of data contains sensitive information (a Social Security number, for example) but will not undergo analytics, it is masked. Masking obfuscates the actual value of the data, but it comes at a price: the data can never be processed downstream. Masked data is typically stored for a fixed period for compliance purposes.
  • Tokenization: Data that contains sensitive information but will be analyzed is tokenized. This protection method replaces a piece of data with other characters in the same format. For example, a tokenized Social Security number would consist of nine random digits arranged to look like an actual Social Security number. Tokenization allows existing applications to analyze the data, even though it is not real data. If a hacker accesses tokenized data, the actual values remain protected from exposure.
  • Privacy-enhancing computation: Sensitive data can be fully encrypted and safely processed through the analytics pipeline in the cloud without risk of exposure; that is, the data is never decrypted, even while in use. When an authorized user performs analysis, the output is identical to the result that would be obtained had the analytics and data transformations been applied to plain text.
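To make the three methods above more concrete, here is a minimal sketch of each. It is an illustration under stated assumptions, not any vendor's implementation. The names `mask`, `TokenVault`, and the `paillier_*` helpers are hypothetical. The tokenizer uses a simple in-memory lookup table, where a real deployment would more likely use a hardened vault or format-preserving encryption. The third part uses a toy Paillier scheme with tiny primes, which supports only addition on encrypted values and so stands in for, but is not, fully homomorphic encryption.

```python
import math
import secrets


def mask(value):
    """Masking: irreversibly hide every letter and digit, keeping only the layout."""
    return "".join("X" if ch.isalnum() else ch for ch in value)


class TokenVault:
    """Tokenization: swap digits for random digits and remember the mapping."""

    def __init__(self):
        self._by_token = {}
        self._by_value = {}

    def tokenize(self, value):
        if not any(ch.isdigit() for ch in value):
            raise ValueError("value has no digits to tokenize")
        if value in self._by_value:  # same input always yields the same token
            return self._by_value[value]
        while True:
            token = "".join(
                str(secrets.randbelow(10)) if ch.isdigit() else ch for ch in value
            )
            if token != value and token not in self._by_token:
                break
        self._by_token[token] = value
        self._by_value[value] = token
        return token

    def detokenize(self, token):
        return self._by_token[token]


def paillier_keygen(p=1009, q=1013):
    """Toy key pair; real keys use primes that are far larger."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # modular inverse (Python 3.8+), valid for g = n + 1
    return n, (lam, mu)


def paillier_encrypt(n, m):
    n2 = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1  # random r in [1, n - 1]
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2


def paillier_decrypt(n, private_key, c):
    lam, mu = private_key
    return ((pow(c, lam, n * n) - 1) // n * mu) % n


def paillier_add(n, c1, c2):
    """Multiplying two ciphertexts adds the plaintexts hidden inside them."""
    return (c1 * c2) % (n * n)


ssn = "123-45-6789"  # made-up example value
print(mask(ssn))  # XXX-XX-XXXX

vault = TokenVault()
token = vault.tokenize(ssn)
print(token, vault.detokenize(token) == ssn)

n, private_key = paillier_keygen()
total = paillier_add(n, paillier_encrypt(n, 1200), paillier_encrypt(n, 345))
print(paillier_decrypt(n, private_key, total))  # 1545
```

The masked value cannot be recovered, the token can be mapped back only by whoever holds the vault, and the encrypted sum is computed without either input ever being decrypted along the way.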

While it may seem cumbersome and challenging to integrate these methods into an existing security stack, companies can do so through a no-code or low-code deployment. When exploring a data protection solution, especially for AI projects, companies should look for the following attributes:

  • Continuous data protection at the file level throughout the AI data pipeline, as defined by Gartner.
  • Smooth integration without disrupting existing applications.
  • Ability to maintain compliance with industry and government regulations.
  • Data protection without requiring clones of data or schema changes. This minimizes the risk of data polymorphism and of data exposure by legacy encryption systems that were not designed for the cloud.

Data’s value is ever-increasing, and companies rely on AI projects to pinpoint areas where they can increase market share. Protecting data in a manner that does not interrupt or slow down AI data analysis is critical to success. Baffle’s Data Protection Services handles large AI data sets at speed, including environments containing more than 100 billion records. For more about the challenges of de-identifying cloud data pipelines and how Baffle can help you secure your data analytics pipeline, watch this webinar: De-identifying Cloud Data Pipelines: Approaches, Best Practices, and Learnings.