Discover and Classify Data

Data Classification Techniques

Mike Andrews
Head of Engineering
May 19, 2021

In this, part six of a now eight-part series (it started off as a six-part series; see part one, part two, part three, part four and part five), I describe what you need to know about data classification itself.

You will hopefully find this refreshingly honest and transparent for the security industry. Spoiler alert: while I will talk about ML and complex data science (and indeed we do use it), the reality is that the state-of-the-art, and often the best, technique for the majority of data classes is regular expressions. Of course, as you would hopefully expect of us, we will point to other work we have done to prove that.

For this post I am joined by our Chief Architect, Mike Andrews, who designed the core data classification engine. He has a PhD in code analysis and led the platform team for Microsoft's Cortana (the Siri equivalent, for the Apple fans), so he knows his stuff.

The structure (pun intended) of this post is:

  • Data classification techniques, i.e., how to classify sensitive data
  • Data classes, i.e., what to classify
  • Data collections, i.e., how to organize data

Data Classification Techniques 

When I talk to a lot of people, especially computer science graduates, about data classification, they often think about Natural Language Processing, or NLP. But NLP is about making human sense of text rather than identifying whether a set of data or an individual piece of data is of a particular type (a class of data), such as a credit card number, a social security number, or an AWS API key.

In the field (my phrase for a broad generalization), you will find that this data is generally stored as one of the following:

  • Structured data
  • Unstructured data
  • Semi-structured data

Structured data is usually contained in rows and columns of a relational database or in a spreadsheet, where its elements can be mapped into fixed, predefined fields such as total or date. Structured data is the easiest to classify and was historically the main way all data was stored, but estimates today are that it now accounts for less than 20% of the world's data, and I personally think that's a massive overestimate.

Unstructured data is data that is not contained in a row-column database or similar and doesn't have an associated data model. Think of things like text in a Word document or an email. Unstructured data may have sensitive data embedded in it, but that data is much harder to find. Think of a text file dumped in an AWS S3 bucket that happens to be open to the world, with hundreds of thousands of sensitive records in it, and you will be spot on. This is the type of data you now find in the data lakes I covered in part two of the series, and it makes up the vast majority of data today.

There is an "in between" type as well: semi-structured data. This is generally where fixed data types sit within a structure, but that structure is not as well defined as, say, a database table. An example is a data file like a JSON document or perhaps a log file.

Knowing the types of data sets us up to better understand which classification technique is best at determining which data classes are present. Although previously discussed, let's cover the four data classification techniques, one by one:

  1. Regex - simple data pattern matching
  2. Keywords and Data Adjacency - analysis of patterns and their surrounding context
  3. Machine Learning - sophisticated analysis to look for specific data 
  4. Data Validation - checking that matches are actually real data and not just theoretical pattern matches 

Regular Expressions (Regex)

Regexes are deterministic ways of matching definite strings under a fixed memory constraint. They are well suited for scanning enormous amounts of data very cheaply with high degrees of accuracy, because they can be reduced to state machines (finite state automata), where, based on an input (e.g., "abc"), the regex matcher moves along allowable paths, as determined by the regex statement, until it either gets to an end state or has no more input.

Consider a regex that distinguishes a social security number from other possible nine-digit numbers. SSNs can't start with 000 or 666, so the machine will not accept strings starting with those sequences as valid social security numbers, and anything longer than nine digits is sent to a "Reject" state.
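
As a rough sketch of what such a pattern can look like (this is an illustration we put together for this post, not our production pattern, and the test values are arbitrary examples):

// Sketch of an SSN-style pattern: nine digits, optionally hyphenated, that
// do not start with 000, 666, or 9xx. The word boundaries (\b) stop it from
// matching inside longer runs of digits.
const ssnPattern = /\b(?!000|666|9\d{2})\d{3}-?(?!00)\d{2}-?(?!0000)\d{4}\b/;

console.log(ssnPattern.test("078-05-1120")); // true  - structurally valid
console.log(ssnPattern.test("666-12-3456")); // false - cannot start with 666
console.log(ssnPattern.test("1234567890"));  // false - ten digits, rejected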

For the computer scientists among you: while regexes are not Turing complete, they are very adaptable to unstructured data. However, because they are not Turing complete, they can't perform computations like checksums, which we will talk about later.

A regex describes all the seemingly endless possibilities of strings that meet some specified structure through a formal language, one that can easily be parsed and turned into a state machine as described above. In more common terminology, it's a set of rules that a string has to pass in order to be considered valid, such as validating that a user's birthday is structurally correct, or validating that a password meets security standards by having a number, a special symbol, upper case letters, and a minimum character length.
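
For instance, a minimal sketch of those two rule checks (the exact patterns and the eight-character minimum are assumptions we chose for illustration):

// Birthday in YYYY-MM-DD form: four digits, a month of 01-12, a day of 01-31.
const birthday = /^\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$/;

// Password policy: at least one digit, one special symbol, one upper case
// letter, and a minimum of eight characters overall.
const password = /^(?=.*\d)(?=.*[!@#$%^&*])(?=.*[A-Z]).{8,}$/;

console.log(birthday.test("1987-11-03"));  // true
console.log(password.test("S3cure!pass")); // true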

For the field of data classification, regular expressions suit a very different need for us as developers tackling a problem at petabyte scale: they provide a deterministic means of scanning across large quantities of data (i.e., we get the same answer each time) with a predictable memory overhead (i.e., the size of the automaton and its state while processing the input data). Because regular expressions are not Turing complete and are bound by finite state machines, that memory overhead stays predictable (provided we don't abuse repetition and avoid expressions that require lots of backtracking) and, in most cases, can be kept tightly within the bounds of the data classes we need to match against. Since regular expressions are deterministic in nature, false positives seldom occur (contrary to an ML approach, which we'll talk about later), and where they do occur they are more a function of "over matching" due to a regex pattern that is too permissive and matches more data than desired. That is something we've taken into account, and we will discuss the significant importance of our validator function later in this post.

Keywords and Adjacency

Keyword matching and adjacency is where we consider matches of certain data classes only if they are surrounded by designated keywords (users can also set their own keywords). This is fundamentally important within the domain of large-scale data classification: not all 15-digit numbers that may be matched are credit cards, and it would be naive to assume so. Likewise, not all 9-digit numbers are social security numbers, so requiring "social security number," "ssn," or "ss#" to appear near a 9-digit number is a technique that improves the accuracy of data classification results. The keyword distance plays an important role: too high a keyword distance and the memory expenditure of the regex goes up exponentially, and it will eventually generate false positives of its own.
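
Here is a minimal sketch of the keyword-proximity idea (the keyword list, the 40-character window, and the sample strings are assumptions for this example, not our engine's implementation):

// Report a nine-digit candidate only if an SSN-related keyword appears
// within a fixed window of characters around the match.
const candidate = /\b\d{3}-?\d{2}-?\d{4}\b/g;
const keywords = /social security number|\bssn\b|\bss#/i;
const WINDOW = 40; // characters of surrounding context to inspect

function findLikelySSNs(text) {
  const hits = [];
  for (const match of text.matchAll(candidate)) {
    const start = Math.max(0, match.index - WINDOW);
    const end = Math.min(text.length, match.index + match[0].length + WINDOW);
    if (keywords.test(text.slice(start, end))) {
      hits.push(match[0]);
    }
  }
  return hits;
}

console.log(findLikelySSNs("Order total: 123456789"));    // []
console.log(findLikelySSNs("Employee SSN: 078-05-1120")); // ["078-05-1120"]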

As well as keywords providing "hints" about the context of data for matching, other matches around the data can also provide additional context that helps increase the accuracy of matches. For example, if we find a date in close proximity to a credit card number, it's more likely to be an expiry date than someone's birthday. This "data adjacency" has similar issues to keyword distance, as described above, and it also vastly increases the complexity of the data scanner, along with the state that it has to maintain, so it must be used very sparingly.

Machine Learning

Ask any college graduate what they're interested in studying or specializing in, and it'll probably be machine learning: it's what companies and employers think they're looking for, but in reality (especially in the security field) ML has a niche in a few (but potent) use cases. That's not to say we're not using it at Open Raven; we are, but we use it gracefully and know when to use it and when not to. We could add it everywhere as a marketing buzzword, as we often see others do, but we're better than that.

In our eyes, machine learning will have a larger role in data classification once:

  • Enough data exists to train a model to high enough efficacy and to give meaningful probabilities to its predictions.
  • A dataset can be procured that is representative of the sensitive data stored within customer environments and large enough for machine learning algorithms to be trained upon.
  • It can be done affordably enough to scale across petabytes of data.

Unless something radically changes in the short term, none of these criteria are attainable given current (and probably future) laws and regulations, as well as the limitations of classical machine learning technology. The goal of obtaining large amounts of data to train on is (correctly) limited by privacy and data sovereignty requirements, which prevent the mass movement and distribution of a company's internal, sensitive data. Furthermore, the problem of finding a representative data set that will generalize across unseen data only compounds the data sovereignty issue, as environments and file formats are often unique to a company's needs, making it hard to produce a suitably sized, one-size-fits-all training set from which an ML model can learn. To top it off, machine learning is an inherently expensive task. Scaled across quintillions of bits, the high memory requirements and computational overhead cost more money while often yielding poor heuristic answers, whereas algorithmic solutions (regexes) can currently provide better answers at a fraction of the cost.

However, machine learning can be usefully applied to aid algorithmic solutions rather than replacing them entirely. Unsupervised learning plays a large role in our ongoing work on the scalability and speed of our scanning engine, which uses the known metadata properties of files within a bucket to categorize files into different clusters and then samples those clusters strategically, rather than sampling the entire bucket at random. For example, if a "log cluster" of files named "k8-log1.txt, k8-log2.txt, etc." all share a similar naming structure and metadata properties, sampling only a few representative files from it, along with a few files from the "worksheet cluster" consisting of "sheet1.csv, sheet2.csv", allows customers to do surface-level dives into their buckets, leaving "deep scans" for targeted resources. This is a more cost-effective approach to previewing the contents of buckets at extremely large scale, as it ensures that buckets are probed strategically rather than with the imbalance of randomized probing. For example, suppose a bucket consists of 90% CSV files and 10% JSON files: both the CSV cluster and the JSON cluster should get equal probing, rather than the 9/1 split they would receive if sampling were truly random.
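
As a rough sketch of that stratified sampling idea (clustering here is just by file extension, and the function name and sample size are our own inventions for illustration, not the engine's actual clustering logic):

// Group object keys into clusters by a simple metadata property (the file
// extension), then take the same number of samples from every cluster
// instead of sampling the whole bucket at random.
function sampleByCluster(keys, samplesPerCluster) {
  const clusters = new Map();
  for (const key of keys) {
    const ext = key.includes(".") ? key.slice(key.lastIndexOf(".")) : "(none)";
    if (!clusters.has(ext)) clusters.set(ext, []);
    clusters.get(ext).push(key);
  }
  const sampled = [];
  for (const files of clusters.values()) {
    sampled.push(...files.slice(0, samplesPerCluster));
  }
  return sampled;
}

const keys = ["k8-log1.txt", "k8-log2.txt", "k8-log3.txt", "sheet1.csv", "sheet2.csv"];
console.log(sampleByCluster(keys, 1)); // ["k8-log1.txt", "sheet1.csv"]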

Data Validation

One of the complaints you always hear about old-school DLP systems (and frankly, about our competitors) is that these systems are full of false positives. We don't claim to be perfect, just very, very good. One of the things we designed early on was the concept of data validators, which work to address two questions, given a data class match:

  1. Is this real data (i.e., a real SSN versus one that matches the format)?
  2. Is this one of my customers' SSNs?

This concept applies to everything from game validation codes to credit cards to employee numbers to passports and the list goes on. Data validation is an important weapon in the ongoing fight to reduce noise and improve signal. 

Validation works by passing pieces of matched data to a third-party (or internal) API to check whether they are valid.

Here is an example:

// Pass the matched value to the validation endpoint and treat anything
// other than an explicit rejection as a valid HICN.
function validate(input) {
    const auth = "AUTH-CODE";
    const token = "AUTH-TOKEN";
    const request = new XMLHttpRequest();
    const url = "https://internal.health-api.example.com/lookup?"
        + "auth-id=" + auth
        + "&auth-token=" + token
        + "&HICN=" + encodeURIComponent(input);
    request.open("GET", url, false); // synchronous request, for illustration
    request.send();
    return !request.responseText.includes("Not a valid HICN");
}

In this example we are using an internal API to check whether a Health Insurance Claim Number (HICN) data match is a valid one in the very specific context of that company's data. Some validators work across multiple customers, and it's just a matter of plugging in your own API key. Others may require a custom validator that points to your private APIs or to services we don't support.

Data Classes

Data classes (also called data identifiers or data types) are the specific "bits" of data you will want to find. A canonical example is the credit card class. We organize data classes into default categories of Personal Data, Health Care Data, Financial Data and Developer Secrets using a grouping concept we call collections (see below). In general, there are standard data classes that everyone will be interested in, such as credit cards or social security numbers, and custom classes specific to a company, such as employee records or intellectual property.

Let's dive deep into a payment card as an example. It's a simple 16-digit number, right? Well, yes and no; it's not quite that simple, at least if you want to do it right. Please note that this section is long and detailed, mainly to prove a point. Unless you are genuinely interested in the magic behind a credit card number, you can happily skip ahead.

Also known as primary account numbers or card numbers, payment card numbers are found on credit cards, debit cards, stored-value cards, gift cards, and an array of similar bits of plastic we carried in our wallets before Apple Pay and Bitcoin (sarcasm).

Detection requires the data to be a 13–19 digit sequence that adheres to the Luhn check formula and uses a standard card number prefix for any of the following types of credit cards: American Express, Dankort, Diner's Club, Discover, Electron, Japan Credit Bureau (JCB), Mastercard, UnionPay, and Visa. There is even a standard, ISO/IEC 7812.

The payment card number starts with the Issuer Identification Number (IIN), 6-8 digits. The first digit is the Major Industry Identifier (MII), which ranges from 0-9. The next set is the individual account number, 1-10 digits. The last digit is the check digit (maximum length of 1). The check digit is computed according to the Luhn formula for modulus-10 check digit.

The Luhn "double-add-double" modulus-10 formula uses a check digit calculated over the digits of the IIN and all of the digits of the individual account number (variable, up to 10 digits).

The following steps are involved in this calculation:

Step 1
Double the value of alternate digits beginning with the first right-hand digit (low order).

Step 2
Add the individual digits comprising the products obtained in Step 1 to each of the unaffected digits in the original number.

Step 3
Subtract the total obtained in Step 2 from the next higher number ending in 0 [this is the equivalent of calculating the “tens complement” of the low-order digit (unit digit) of the total]. If the total obtained in Step 2 is a number ending in zero (30, 40, etc.), the check digit is 0.
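
As a concrete illustration, here is a minimal sketch of that check in code (verifying an existing number rather than generating a check digit; the function name and test values are ours, and 4111 1111 1111 1111 is a widely used test card number):

// Luhn "double-add-double" check: walk the digits from the right, double
// every second digit, reduce any two-digit product by adding its digits
// (equivalent to subtracting 9), and accept if the total is a multiple of 10.
function passesLuhn(cardNumber) {
  const digits = cardNumber.replace(/\D/g, "");
  if (digits.length < 13 || digits.length > 19) return false;
  let total = 0;
  let doubleIt = false; // the rightmost (check) digit is not doubled
  for (let i = digits.length - 1; i >= 0; i--) {
    let d = Number(digits[i]);
    if (doubleIt) {
      d *= 2;
      if (d > 9) d -= 9;
    }
    total += d;
    doubleIt = !doubleIt;
  }
  return total % 10 === 0;
}

console.log(passesLuhn("4111 1111 1111 1111")); // true  - passes the checksum
console.log(passesLuhn("4111 1111 1111 1112")); // false - fails the checksum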

IINs beginning with "00" are not for card issuers. IINs beginning with "80" are for use by healthcare institutions; the format of these is "80[CCC]", where "CCC" is the three-digit numeric country code as defined by ISO 3166-1*. IINs beginning with "89" are for use by telecommunications administrations. IINs beginning with "9" are reserved for use by national standards bodies; the format of these is "9[CCC]", where "CCC" is the three-digit numeric country code as defined by ISO 3166-1*. Any 9-series IIN assigned by the Registration Authority (RA) is 9 digits in length. IINs assigned under a national numbering system should be a minimum of 8 digits in length.

* US numeric country code is 840

* US (minor outlying) numeric country code is 581

* UK numeric country code is 826

The full list of country codes is in the ISO standard here, and of course we live in a global economy, so you have to consider them all. Knowing the country code is also useful (but not reliable) for determining the geographical location of the card holder.

The Major Industry Identifier or MII is the first digit of the IIN.

0 - ISO/TC 68 and other industry assignments
1 - Airlines
2 - Airlines and other future industry assignments
3 - Travel and entertainment and banking/financial
4 - Banking and financial
5 - Banking and financial
6 - Merchandising and banking/financial
7 - Petroleum and other future industry assignments
8 - Healthcare, telecommunications and other future industry assignments
9 - For assignment by national standards bodies

And as if that isn't enough, here is another level of detail:

(Table: IIN range, length, and an example number for American Express, Discover Card, Mastercard, Maestro, Visa, Diners Club, UATP Corporate Card, JCB, InstaPayment, InterPayment, and UnionPay.)

As you can see, what you might think is simple ("Oh, it's a 16-digit integer, isn't it?") has a lot of complexity and rules behind the number, and the same is generally true of almost every "standard" data class you will find. We ship over 200 out-of-the-box data classes and did a lot of work to research them all.

Note: This is why many of the Git scanning tools for developer secrets, and early DLP tools, are unfortunately very inaccurate and generate a lot of noise.

Data Collections

You have probably heard the term PII, or Personally Identifiable Information. I am told by Chief Privacy Officers that this is an antiquated term that went out of fashion in privacy circles a decade ago, but the security industry still uses it, and so will I to illustrate the point. PII is a collection of data classes. Personally identifiable information is any data that can be used to identify a specific individual, and is usually considered present where you have at least three pieces of data that tie together to an individual: Social Security numbers, mailing and email addresses, phone numbers, etc.

The term PII arose from a need to group the growing number of types of data that were considered sensitive to an individual, including things like cookies, IP addresses, login IDs, airline numbers, college IDs, and on and on it goes. That grouping is useful, and in recent years has become more useful, for companies both in defining their data handling guidance and in reducing noise when performing data classification. Let's take a theoretical example of a future transport company that rents personal taxi drones. Sensitive data for that company may include the lease number, flying taxi hardware number, firmware version, customer number, and GPS coordinates for home pickup. All such data may be defined in a data handling policy as "Taxi Lease Humans Sensitive Data." Grouping data into collections enables companies to provide clear rules such as "all Taxi Lease Humans Sensitive Data must be encrypted at rest and in transit."
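
Purely to illustrate the grouping idea (this is a hypothetical structure invented for this post, not Open Raven's actual configuration format), such a collection might be sketched as:

// Hypothetical data collection: a named group of data classes plus the
// handling rule the company attaches to that group.
const taxiLeaseHumansSensitiveData = {
  name: "Taxi Lease Humans Sensitive Data",
  dataClasses: [
    "lease-number",
    "flying-taxi-hardware-number",
    "firmware-version",
    "customer-number",
    "home-pickup-gps-coordinates",
  ],
  handlingRule: "encrypt at rest and in transit",
};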

The key point here is that every company does, and usually should, define sensitive data differently, because their business needs are different. There is a base generic definition, i.e., PII, but lots of sensitive data is added to that definition based on the data a company collects or generates. We need to protect all of that sensitive data, not just a base level of generic data.

Coming up next

In the next installment I will talk about how to design scanning systems that will scale to scan petabyte-sized files in a reasonable amount of time. We use Lambdas, the AWS serverless architecture that allows us to spin up thousands of scanners in parallel and chunk through petabytes of data. But to do this, you must be able to track which parts of the file you are scanning and how much of the file each Lambda has burned down, and manage things like Lambda resource pools so you don't take down other systems. We will explain it all. We will also explain techniques for sampling data and for using the data structures themselves to optimize scanning speeds.
