Operationalizing Data Security in the Cloud at Scale - Data Classification Explained Part 1

Jeff Binder
Security Researcher
May 25, 2023

Data Classification - The Open Raven Way 

To some, the word “classification” evokes shadowy documents stamped “Top Secret.” At Open Raven, classifying data means determining what types of information are stored in certain places. Does a database table contain credit card information? Does a chat transcript contain someone’s social security number? Does a log file contain a password or access key? Knowing where sensitive data are stored is an essential first step toward remediating issues that could lead to data leaks or breaches.

Open Raven provides over 200 built-in data classes—that is, definitions for identifying types of data with potential security implications, from driver’s license numbers to private health information to encryption keys. Our platform also enables users to design custom data classes to look for specialized or company-specific data types. And, we regularly add new data classes based on research and customer needs. Our platform offers a robust solution for classifying petabytes of data accurately and efficiently.

The aim of this blog series is to explain our core principles, systems, and methods for classifying sensitive data residing in unstructured and structured cloud data stores at scale. In this first post, we’ll look at the principles behind Open Raven’s data classification system, our methods, and improvements made since we launched the Open Raven Data Security Platform in 2020. In future posts, we’ll go into more depth about these topics and show how Open Raven’s classification system is at the forefront of data security.

Our Principles

Open Raven’s data classification methods follow three main principles: security, efficiency, and accuracy.


Security

Open Raven’s platform is built from the ground up with security in mind. The design of our classification engine ensures that sensitive information never leaves customer environments—all processing is done within customer accounts and without creating persistent copies of the data.


Efficiency

In today’s cloud environments, data classification speed is not just a matter of convenience. Many of our customers have multiple petabytes of data in their accounts. If analysis takes more than a few microseconds per kilobyte, getting through such a large volume of data becomes impractical. Our data classification is fast enough to handle petabytes of data without sampling, so our users can be assured that nothing was missed.
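To make the scale concrete, here is a rough back-of-envelope calculation. The per-kilobyte costs below are illustrative figures, not measured numbers from our system:

```python
# Illustrative arithmetic: how long a full scan of 1 PB takes on a single
# worker at a given per-kilobyte analysis cost.
PETABYTE_KB = 10**12  # 1 PB expressed in kilobytes

def scan_days(microseconds_per_kb: float) -> float:
    """Total wall-clock days to scan 1 PB at the given per-KB cost."""
    total_seconds = PETABYTE_KB * microseconds_per_kb / 1_000_000
    return total_seconds / 86_400  # seconds per day

print(scan_days(1))   # ~11.6 days at 1 microsecond per kilobyte
print(scan_days(10))  # ~116 days at 10 microseconds per kilobyte
```

Even with heavy parallelism, per-kilobyte costs in the tens of microseconds quickly translate into unacceptable scan times and compute bills at petabyte scale.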


Accuracy

Robust, secure data analysis is of little use if the results aren’t accurate. Effective data classification must have a high recall rate, meaning it must not miss sensitive data in the objects it scans (false negatives). Equally important is precision: avoiding false positives. Since no one wants to wade through hundreds of incorrect results to find what they really care about, data classification should only raise alarms for real findings. At Open Raven, we routinely run experiments on real and synthetic data sets to ensure that our data classes are highly accurate.

How Our Data Classification System Works

In a past blog series, we discussed some of the complexities that make finding sensitive data a difficult problem. We won’t rehash the details here, but we do want to highlight some of the ways we’ve refined and expanded our data classification capabilities since 2021.

To begin with, consider one of the most basic types of sensitive data: social security numbers. US social security numbers (SSNs) are nine-digit numbers whose exposure raises serious concerns about privacy and identity theft. These numbers are commonly written with a distinctive pattern of hyphens (374-64-1263), but databases sometimes represent them with digits alone (374641263). In fact, around 88% of nine-digit numbers could be valid SSNs.
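A minimal structural check can be sketched in a few lines. The exclusions below reflect ranges the Social Security Administration never issues (area 000, 666, or 900–999; group 00; serial 0000); the regex and function are illustrative, not our production matcher:

```python
import re

# Match both common SSN formats: hyphenated and bare nine digits.
SSN_PATTERN = re.compile(r"\b(\d{3})-?(\d{2})-?(\d{4})\b")

def could_be_ssn(candidate: str) -> bool:
    """Structural plausibility check: format plus never-issued ranges."""
    m = SSN_PATTERN.fullmatch(candidate)
    if not m:
        return False
    area, group, serial = m.groups()
    if area in ("000", "666") or area >= "900":  # never-issued areas
        return False
    return group != "00" and serial != "0000"    # never-issued group/serial
```

These exclusions are also where the 88% figure comes from: 898 of 1,000 areas, 99 of 100 groups, and 9,999 of 10,000 serials are issuable, and 0.898 × 0.99 × 0.9999 ≈ 0.889.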

Yet the number 374641263 could also be many other things. It might be a passport number. It might be an account ID. It might be a number in an analytics report. If a scanner flags all of these numbers as SSNs, the results would be more noise than signal.

For this reason, accurate data classification requires context. We must look at where a nine-digit number occurs and what other content surrounds it to determine whether it is an SSN. A very simple yet highly efficient method is to check for keywords that are associated with certain data types. Our scanner employs a range of keyword matching methods tailored to particular data types, for instance parsing JSON data to extract field names and identifying relevant columns in databases.
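As a toy illustration of keyword-based context matching: the keyword list and 50-character window below are arbitrary choices for the sketch, not our production configuration, which tailors matching to the file format:

```python
import re

# Only report a nine-digit match as an SSN when an associated
# keyword appears within a small window of surrounding text.
SSN_KEYWORDS = ("ssn", "social security", "soc sec")
NINE_DIGITS = re.compile(r"\b\d{3}-?\d{2}-?\d{4}\b")

def find_ssn_candidates(text: str, window: int = 50) -> list[str]:
    hits = []
    lowered = text.lower()
    for m in NINE_DIGITS.finditer(text):
        context = lowered[max(0, m.start() - window): m.end() + window]
        if any(kw in context for kw in SSN_KEYWORDS):
            hits.append(m.group())
    return hits
```

With structured inputs the "context" is sharper than a character window: a JSON field named `ssn` or a database column named `social_security_number` is strong evidence on its own.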

In addition to keywords, our scanner employs validator functions, which are bits of code that perform arbitrary computations tailored to particular data classes. For some types of sensitive data, such as credit card numbers, there is a check digit that can be used to determine whether a number is valid. Testing the check digit—most often using the Luhn algorithm—allows the scanner to eliminate around 90% of false positives.

Since our last blog series, we’ve implemented a range of more advanced validator logic to make our scanner more accurate and versatile. Our validator functions can now:

  • Run ML models for identifying complex data types
  • Analyze code in several programming languages to identify developer secrets more accurately
  • Categorize geospatial data by region
  • Deploy language models to avoid false positives when scanning for data types with very general patterns

We’ve also implemented a number of more advanced classification features, such as metadata scanning and composite data classes, which detect when certain combinations of data types occur together. And we’ve made great strides in optimizing the performance of our system, saving our customers money and delivering fast results even for cloud accounts containing massive amounts of data. We will cover these topics in future posts.
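A composite data class can be pictured as a conjunction of simpler matchers. This is a minimal sketch with hypothetical constituent patterns; the real system’s class definitions and interfaces differ:

```python
import re

# Hypothetical constituents of a composite "identity record" class.
CONSTITUENTS = {
    "ssn":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def composite_match(text: str) -> bool:
    """Fire only when every constituent class appears in the same object."""
    return all(pattern.search(text) for pattern in CONSTITUENTS.values())
```

The intuition: an email address and an SSN appearing in the same object is far stronger evidence of regulated PII than either signal alone, so co-occurrence deserves its own finding.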

Next Up

Now that we've laid out our core principles, explained how our data classification system works, and reviewed the improvements made over the years, our next post will address our approach to using ML and AI, with real-world examples that highlight the advantages of combining them with human expertise. If you're interested in learning more about what it takes to write data classes, check out our guide, Introduction to Regex Based Data Classification for the Cloud.
