An Unprotected Data Analytics Pipeline Undermines The Value Of Data

By Ameesh Divatia, CEO and co-founder | November 10, 2020

Some estimate that 90% of the world’s data has been produced in the past two years alone. This tidal wave of information positions businesses to make decisions that optimize operations, attract and retain customers, and create significant market differentiation. The challenge is making sense of data that arrives in multiple formats from disparate sources.

For data to provide value, it must flow through what is referred to as the analytics pipeline: the infrastructure used to collect, store and process data in an IT environment. In the analytics pipeline, unstructured data — such as emails, Excel spreadsheets, Word documents, presentations, instant messages, photos, audio and video — enters “upstream.” As data moves “downstream” toward the end of the pipeline, it is cleansed, organized and analyzed via predictive analytics and machine learning. 

At this point, data is at its peak value and, consequently, at its most attractive to hackers. For this reason, data protection must be an integral part of the data analytics pipeline to prevent incidents that can offset the many benefits that analyzed data can provide.

 

Cloud Storage And Data Sharing Risk

Before exploring how to secure the analytics pipeline, let’s look at two important business trends that will benefit from such protections: cloud storage and data sharing.

The cloud’s limitless storage capabilities are prompting enterprises to migrate data from on-premises environments, store it in data lakes and extract useful data into warehouses for analysis. Downstream data stored in the cloud is a prime target for criminals because of its high value and because it is often improperly protected. Many organizations relax security controls to temporarily enable easier access, then forget to restore them.

Further, Palo Alto Networks (via Help Net Security) found that 43% of cloud storage is left unencrypted, even though cloud providers offer built-in encryption for storage buckets. This is an alarming statistic for organizations incorporating the cloud into their analytics pipelines, especially those tasked with complying with regulations like GDPR and CCPA: it leaves almost half of their crown jewels ready to be stolen.

Risk of exposure is further compounded when data moves outside of an organization. Many enterprises rely on data sharing as an integral part of their operations, collaborating with other organizations to solve problems and gain insight. The Ponemon Institute found that, on average, companies share data with 583 third parties. The same study crystallized the risk of this practice, with 61% of U.S. CISOs experiencing data leakage via a third party.

This creates a conundrum: Stop sharing data, or share it in an insecure manner. Without a secure analytics pipeline, these two critical elements now represent unnecessary risk.

 

Securing Data Throughout The Pipeline

Security controls for the analytics pipeline can be categorized into two groups: visibility and entitlement. According to Gartner, visibility pertains to implementing “controls that remove ambiguity and increase visibility” of data, while entitlement is the management of data access. 

Organizations can address visibility and entitlement through strategies such as data discovery, protection and monitoring.

  • Data Discovery: Data varies in levels of sensitivity, and it is important to determine how each piece of unstructured data will be protected and processed as it enters the pipeline upstream. In discovery, each file or record is examined to identify the data it contains and assigned a protection policy based on its level of sensitivity, applicable compliance standards and eventual use. Applying detailed metadata also helps clear up any confusion about what a piece of data is and who should have access to it (see the discovery sketch after this list).
  • Data Protection: Once data has been classified as sensitive, it must be protected as it moves downstream via the following actions, illustrated in the protection sketch after this list:
    • Masking occurs after data has entered upstream and involves disguising a piece of data (e.g., a Social Security number) with dummy characters. Masking prevents anyone from seeing the real value, but it comes at a steep price: masked data can never be processed downstream.
    • Tokenization is a midstream protection that replaces a piece of data with other characters in the same format. For example, a tokenized Social Security number consists of nine random digits, so it looks like an actual SSN. This allows existing applications to analyze the tokenized data even though it is not the real data. If a hacker accessed tokenized data, any analytics they performed would be inaccurate because the data does not reflect real values.
    • Encryption is a downstream protection that ensures data, once cleansed and ready for analysis, cannot be compromised. In this process, plaintext data is converted into unreadable ciphertext that can be deciphered only with a key held by a privileged few. Innovative techniques are now emerging that allow multiple parties to analyze encrypted data without ever decrypting it.
  • Data Monitoring: Even with the appropriate protections in place, oversights can occur, and it is necessary to implement monitoring to identify data that may have moved through the pipeline unprotected. However, monitoring is not limited solely to data; it is just as important to know who is accessing data throughout the pipeline. Only those with authorized access should be able to view it (see the monitoring sketch after this list).
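
To make the discovery step concrete, below is a minimal Python sketch of pattern-based classification. The PATTERNS and POLICIES tables and the discover helper are names invented for illustration; real discovery tools rely on far richer classifiers and policy engines than two regular expressions.

```python
import re

# Hypothetical pattern set; production classifiers go well beyond regexes.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}

# Illustrative policy map: data type -> sensitivity, protection, regulation.
POLICIES = {
    "ssn": {"sensitivity": "high", "protection": "tokenize", "regulation": "CCPA"},
    "email": {"sensitivity": "medium", "protection": "mask", "regulation": "GDPR"},
}

def discover(record: str) -> list[dict]:
    """Scan one upstream record and return the protection policies it triggers."""
    return [
        {"type": label, **POLICIES[label]}
        for label, pattern in PATTERNS.items()
        if pattern.search(record)
    ]

# Each finding becomes metadata that travels with the record downstream.
print(discover("Contact jane@example.com, SSN 123-45-6789"))
```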
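
A minimal sketch of the three protections follows, assuming Python and the open-source cryptography package (pip install cryptography) for the encryption step. The mask and tokenize helpers and the in-memory token vault are illustrative stand-ins, not a production design.

```python
import secrets
from cryptography.fernet import Fernet  # pip install cryptography

def mask(ssn: str) -> str:
    """Masking: replace the real value with dummy characters; irreversible."""
    return "XXX-XX-" + ssn[-4:]

_vault: dict[str, str] = {}  # toy token vault; real systems store this securely

def tokenize(ssn: str) -> str:
    """Tokenization: swap the value for nine random digits in the same format."""
    digits = f"{secrets.randbelow(10**9):09d}"
    token = f"{digits[:3]}-{digits[3:5]}-{digits[5:]}"
    _vault[token] = ssn  # kept so authorized systems can detokenize later
    return token

# Encryption: the key is held by a privileged few; everyone else sees ciphertext.
key = Fernet.generate_key()
cipher = Fernet(key)

ssn = "123-45-6789"
print(mask(ssn))                             # XXX-XX-6789
print(tokenize(ssn))                         # e.g. 482-07-5529: looks real, isn't
ciphertext = cipher.encrypt(ssn.encode())
print(cipher.decrypt(ciphertext).decode())   # 123-45-6789, for key holders only
```

Note that a hacker running analytics over the tokenized column would get plausible-looking but wrong answers, which is exactly the point.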
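
Finally, a monitoring sketch under the same assumptions: one check flags records whose discovery metadata demands a protection that was never applied, and another logs every access attempt against an allow-list. The AUTHORIZED set and the record layout are invented for illustration.

```python
import logging

logging.basicConfig(format="%(levelname)s %(message)s", level=logging.INFO)
audit = logging.getLogger("pipeline.audit")

AUTHORIZED = {"analytics-svc", "compliance-svc"}  # hypothetical allow-list

def check_protection(record: dict) -> None:
    """Flag data that moved downstream without the protection its policy demands."""
    if record["policy"] != "none" and not record["protected"]:
        audit.warning("record %s requires %s but is unprotected",
                      record["id"], record["policy"])

def read_record(user: str, record: dict) -> bool:
    """Log every access attempt and enforce the allow-list."""
    allowed = user in AUTHORIZED
    audit.info("user=%s record=%s allowed=%s", user, record["id"], allowed)
    return allowed

record = {"id": "r-101", "policy": "tokenize", "protected": False}
check_protection(record)               # warning: this record slipped through
read_record("marketing-app", record)   # logged, and denied by the allow-list
```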

As the volume of data continues to grow, so does the value of the insights businesses can draw from it. But the reputational and financial fallout of a breach can overshadow the benefits of the analytics pipeline. By implementing the protection strategies above, organizations can use the data they have worked to generate, collect and analyze while reducing the risk of unintentional or nefarious exposure.

 

This article originally appeared in Forbes.