Sampling Unstructured Data Brings Risk of Silent Failure

Michael Ness
Security Researcher
October 19, 2022

Have you ever had a hard time finding something really important to you? Imagine it’s Friday night, you’re unbelievably hungry, and you are on the way out the door to meet a friend for dinner. You need your car keys. You glance in the drawer where you usually stash them, but they’re not there. You move on to see if they might be in your coat pocket, or left on a counter, but they’re nowhere to be found. Frustrated and in a rush, you come back to the first drawer and, after finally launching an exhaustive search, you find your keys snugly nestled there between the tool kit and an empty tape dispenser.

Finding sensitive data in your cloud repositories can be similarly challenging, and requires a thorough approach at a much larger scale. Unfortunately, scanning large data repositories can be expensive, and too often organizations rely on techniques that trade off accuracy for speed and throughput. This allows them to “check the box” that says they have scanned their entire enterprise for sensitive data, but it also runs the very real risk of silent failure, missing important data that’s lurking in the cracks and crevices. 

Open Raven’s ability to fully scan data ensures that our customers are able to confidently find sensitive data wherever it hides, without compromise.

Scanning and classifying data at scale with data sampling

As cloud-based data repositories and services proliferate across the enterprise, many organizations struggle to achieve complete visibility into what types of data they have stored and where. Classifying data at terabyte and petabyte scale is resource intensive in both time and compute, which makes achieving data transparency across a large estate extremely difficult.

In search of a workable compromise, organizations may trade accuracy for throughput. In these situations, it’s common to employ data sampling techniques in place of comprehensive scanning.

In data sampling, rather than classifying every element of a data repository, only a subset of the data is examined and classified. This approach can work well when the data is structured or semi-structured, following the same format from the beginning of the file through to the end. For these types of data, we can examine a few representative samples and be assured that the rest of the data is substantially similar.

A perfect example of where the sampling approach works is a relational DBMS such as Postgres, or a semi-structured data store such as the JSON-based Elasticsearch. If you query any given table or index in these datastores, you can be confident that every entry, from the first through to the last, contains the same data types and structure. There is no need to scan the entire table to obtain a complete picture of what it contains.
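
To make the structured case concrete, a sampling scan against Postgres might look something like the minimal sketch below. The customers table, connection string, and card-number pattern are hypothetical stand-ins, and PostgreSQL’s TABLESAMPLE clause requires version 9.5 or later:

```python
# Minimal sketch: sampling a structured store. Assumes PostgreSQL 9.5+
# (for TABLESAMPLE) and the psycopg2 driver. The "customers" table and
# the card-number regex are illustrative placeholders.
import re
import psycopg2

CARD_RE = re.compile(r"\b(?:\d[ -]?){13,16}\b")  # rough card-number pattern

conn = psycopg2.connect("dbname=example")  # hypothetical connection string
with conn, conn.cursor() as cur:
    # Every row shares the same schema, so inspecting ~1% of rows is
    # usually enough to learn which columns hold sensitive data.
    cur.execute("SELECT * FROM customers TABLESAMPLE SYSTEM (1)")
    columns = [d[0] for d in cur.description]
    sensitive = set()
    for row in cur:
        for col, value in zip(columns, row):
            if value is not None and CARD_RE.search(str(value)):
                sensitive.add(col)

print(f"columns that appear to hold card numbers: {sorted(sensitive)}")
```

Because every row follows the same schema, whatever the sample reveals about a column holds for the rest of the table, which is exactly why sampling is safe here.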

[Figure: a table of records with name, country, card #, and status columns, showing that sampling structured records can easily identify sensitive information]
Sampling is effective at finding sensitive data in structured repositories such as a database

Data sampling leaves visibility gaps

While data sampling can work very well for structured data, it can have serious accuracy issues when the data is unstructured. In these cases there are often significant differences between the kinds of information at the start of a file and those in the middle or at the end. Employing sampling for unstructured data brings a high risk of false negatives, where data containing valid findings is never examined at all.

Common examples of unstructured data include log files and PDF files. Classifying these types of data requires a different approach. Imagine, for example, that you have identified an uncategorized S3 bucket in AWS and want to enumerate and classify the data in it. You discover a wide range of file types, most of which are unstructured PDFs. These may be benign, but they may also include NDAs, employment contracts, financial transactions, or other sensitive information. Sampling files in this situation leaves finding the sensitive data up to chance, creating a real risk of missed classification findings and potentially exposing your organization to compliance violations or cyber threats.
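
A minimal sketch makes the failure mode concrete. Assuming an illustrative SSN pattern and a 64 KB sample size (both stand-ins, not Open Raven’s classifiers), a head-only sample scan and a full streaming scan of the same file can disagree:

```python
# Minimal sketch contrasting head-only sampling with a full streaming
# scan. The SSN pattern and 64 KB sample size are illustrative choices.
import re

SSN_RE = re.compile(rb"\b\d{3}-\d{2}-\d{4}\b")  # US SSN, for illustration

def sample_scan(path: str, sample_bytes: int = 64 * 1024) -> bool:
    """Scan only the first sample_bytes of the file."""
    with open(path, "rb") as f:
        return bool(SSN_RE.search(f.read(sample_bytes)))

def full_scan(path: str, chunk_size: int = 1 << 20) -> bool:
    """Stream the entire file, overlapping chunks so a match split
    across a chunk boundary is still found."""
    tail = b""
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            if SSN_RE.search(tail + chunk):
                return True
            tail = chunk[-16:]  # longer than any possible SSN match
    return False

# On a large log file whose only SSN sits on the last line, sample_scan()
# returns False (a silent miss) while full_scan() returns True.
```

The full scan streams the file in fixed-size chunks, so even multi-gigabyte objects can be covered end to end without holding them in memory.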

[Figure: an offer letter where random sampling fails to detect the sensitive information it contains (address, bank account number, insurance number)]
Sampling can easily miss sensitive information in unstructured data stores such as PDFs or log files

Full file scanning with Open Raven eliminates blind spots

Open Raven’s cloud-native data classification platform was designed with both structured and unstructured data in mind, cleanly solving the issues in classifying unstructured data without having to trade accuracy for throughput. 

Open Raven delivers cost-effective full file coverage in data classification scans by leveraging a unique combination of serverless infrastructure and controllable scans. The serverless architecture allows classification to occur in a distributed manner, directly within the customer environment where the data resides. Controllable scans give administrators a high degree of control over the resources and time allocated to scanning operations. Together, these capabilities remove performance barriers, rein in runaway costs and high monthly compute bills for scanning, and eliminate the security and privacy concerns that come with backhauling data.
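
As a rough illustration of this pattern (and only an illustration; this is not Open Raven’s implementation), a per-object serverless worker might stream and classify one S3 object in place, so the data never leaves the environment where it resides. The event shape and classifier below are assumptions:

```python
# Hypothetical sketch of the serverless pattern: one function invocation
# fully scans one S3 object where it lives. The event shape and the
# card-number classifier are assumptions, not Open Raven's code.
import re
import boto3

CARD_RE = re.compile(rb"\b(?:\d[ -]?){13,16}\b")  # illustrative classifier
s3 = boto3.client("s3")

def handler(event, context):
    """Entry point for one (bucket, key) unit of work."""
    bucket, key = event["bucket"], event["key"]
    body = s3.get_object(Bucket=bucket, Key=key)["Body"]
    findings, tail = 0, b""
    for chunk in body.iter_chunks(chunk_size=1 << 20):
        window = tail + chunk
        # Count only matches extending past the carried-over tail so
        # the overlapping windows don't double-count the same match.
        findings += sum(1 for m in CARD_RE.finditer(window)
                        if m.end() > len(tail))
        tail = window[-32:]
    return {"bucket": bucket, "key": key, "card_findings": findings}
```

Fanning such a function out across a bucket’s keys parallelizes naturally, and capping that fan-out’s concurrency is one simple way to keep a scan’s resource usage, and therefore its cost, under an administrator’s control.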

Open Raven helps organizations answer fundamental questions such as “Where’s our data?”, “What types of data do we have?”, and “Is it secured properly?”, without compromise. If these are questions that keep you up at night, check out our technical blog series, Designing and Building Data Classification Systems for Security and Privacy, which goes deeper into designing classification systems for security, privacy, and performance.
