
Introduction to Regex Based Data Classification for the Cloud

January 13, 2022

Abstract

Intro

Data classification using regular expressions (regex) is proving to be a scalable solution to the problem of classifying data in large environments. Many products that take a regular expression based approach to data classification employ similar techniques, such as keywords, keyword distance, and validators. This paper assumes little knowledge of contemporary data classification research, and focuses on consolidating what is known, informed by our own work on the problem. We then give an example of how to write dataclasses with regular expressions using the process we’ve developed internally, which has proven effective at increasing the efficacy of a dataclass.


Introduction: Data Classification

Some History

Data classification is a fairly new field, and has gone by different names in the past, such as “Data Identification” when used within a data-loss-prevention (DLP) context. The field has grown beyond DLP, as organizations are no longer concerned only with “data in motion” (data going through API endpoints). “Data at rest” plays a larger role in new and growing fields such as data compliance (governed by regulations such as GDPR and CCPA) and data lineage, and cloud storage has turned the question of how data is allowed to “live” into a national security concern. Whereas the input data for endpoint security is ephemeral (traffic doesn’t require storage: it can be left in a cache until processed, then discarded), cloud security poses a different set of issues: it’s rarely the case that a CFO reports that their cloud storage bill decreased year over year. Environments only continue to grow, and scanners need to monitor them continuously, as they change faster than security teams can keep track of.

Problem Statement: “Data Classification” 

Without diving into the semantics of English, we take the Oxford definition, “a piece of information”, as the best fitting definition of data. That said, the data we’re interested in classifying is data that has informational value to a human or a computer system in one way or another. Under this interpretation, an Adobe PDF file can be seen as data, a comma separated value (CSV) spreadsheet can be seen as a structured collection of data, and a text file containing grandma’s recipes can be seen as a piece of data as well.

At the broadest level, data classification as a field concerns itself with the problem of knowing what data contains. For example, knowing if a PDF file contains information such as social security numbers or phone numbers, or if a CSV file has credit card numbers in one of its columns. While the problem statement is deceptively simple, the task becomes considerably more difficult as the data one wishes to capture gets more complex. For example, how does one differentiate social security numbers from 9 digit long phone numbers? How does one separate a hash digest from a string of text that happens to be 56 characters long?

Furthermore, the time and cost of scanning grow linearly with the amount of data scanned. While this is obvious, suppose it takes 1 minute to scan 1 GB: scanning a petabyte of data will then take nearly two years. For individuals familiar with large scale cloud storage, a petabyte is on the smaller end of what a cloud environment can grow to. Given both the complexity of the data one is trying to classify and the magnitude of cloud environments, data classification doesn’t offer trivial solutions to complex questions. Any scan that takes a few years to return results is worthless; at the same time, it’s essential that the results are accurate. Data classification for modern data has thus become an uphill battle on two fronts: cloud storage is clunky to search and traverse, and the amount of data is growing faster than cloud infrastructure can scan it.
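The scan time estimate above can be checked with a few lines of arithmetic. This is only a sketch: the function name is ours, the rate of 1 minute per GB is the one supposed in the text, and decimal units (1 PB = 1,000,000 GB) are assumed.

```python
GB_PER_PB = 1_000_000            # assumption: decimal units, 1 PB = 10**6 GB
MINUTES_PER_YEAR = 60 * 24 * 365

def scan_time_years(petabytes: float, minutes_per_gb: float = 1.0) -> float:
    """Years needed to scan `petabytes` of data sequentially at a fixed rate."""
    total_minutes = petabytes * GB_PER_PB * minutes_per_gb
    return total_minutes / MINUTES_PER_YEAR

# One petabyte at 1 minute per GB: roughly 1.9 years.
print(round(scan_time_years(1), 1))
```

Because the relationship is linear, a ten petabyte environment at the same rate would take about nineteen years on a single sequential scanner.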

Prior Art

Here we discuss some notable examples of regex based data classification used in different products and organizations. What is worth noting is that the way these products claim to operate is remarkably consistent: different companies with different approaches seem to have converged on similar design principles, which we’ll discuss in a later section.

Symantec's DLP concerns itself with data in motion, i.e. detecting and classifying data moving through a given API endpoint. Symantec describes its service as using a regex based approach, in conjunction with optional keywords and keyword distances, as well as a validator function.

Amazon Macie concerns itself with the classification of data at rest, and provides a means of scanning and classifying data within Amazon S3 (Amazon’s cloud storage service). Macie takes advantage of keywords and keyword distances, but makes no mention of a validator system. Furthermore, some dataclasses (target data) for Macie don't require keywords (e.g. a credit card number doesn’t necessarily need a keyword near it).

Regular Expression Based Classification: Dataclass Anatomy

A “dataclass” is the abstraction that a data classification engine uses to produce a classification. Analogous to how we use height, name, and birthdate to identify humans, dataclasses use an atomic set of features to make classifications, which are generally human defined (we’ll talk about this later in the “Dataclass Design” section).

Regex based data classification has historically relied on the following atomic units to drive a classification engine. Based on the documentation that others have published, as well as what we use internally (at a surface level), a dataclass can be described by these characteristics. There may be more atomic features that can be used in a dataclass (which we are sure will happen as the field evolves), but these are the most universal, given our work in the field so far.

Regular Expression(s): One or more regular expressions describing the appearance of the target data. 

Keyword(s): One or more keywords to use as context to match regular expressions with. For example, only match a regular expression if it’s within N characters of an included keyword. 

Validator(s): Rules written in a Turing complete language that allow matched target data to be evaluated. The validator accepts a string matched by a regular expression and performs various computations on it, resulting in a boolean “yes” or “no” on whether or not the string is satisfactory.
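As a rough illustration, the three atomic units could be grouped into a single structure like the one below. This is a minimal sketch, not a description of any product's schema: the class name, field names, keyword distance value, and placeholder validator are all hypothetical.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class DataclassDefinition:
    """Hypothetical container for the atomic units of a dataclass."""
    name: str
    patterns: List[re.Pattern]                      # regular expression(s)
    keywords: List[str] = field(default_factory=list)
    keyword_distance: Optional[int] = None          # max characters between keyword and match
    validators: List[Callable[[str], bool]] = field(default_factory=list)

ssn_definition = DataclassDefinition(
    name="US Social Security Number",
    patterns=[re.compile(r"[0-9]{9}")],
    keywords=["ssn", "social security number"],
    keyword_distance=40,                            # arbitrary illustrative value
    validators=[lambda s: s != "000000000"],        # placeholder rule, for illustration only
)
```

Keeping the validator as a plain function mirrors the description above: it receives the matched string and answers “yes” or “no”.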

Uses of Regular Expressions in Data Classification

Regular expressions, often abbreviated as “regex”, are a formal language for modeling and describing sets of string patterns. For example, all (modern) social security numbers can be captured by [0-9]{9}, which translates to “all social security numbers are 9 digits long”. Having a mathematical language readily available to describe the set of all possible strings of a given target data is desirable for many reasons: mathematical languages provide rigor and unambiguous definitions when used properly. Likewise, it allows for the modeling of overlapping patterns (something we’ll discuss in a future paper), which is a way of understanding overlaps between dataclasses.

Explaining regexes will not be covered in this paper, as the topic goes fairly deep. When examples are used, the regular expression “[0-9]{9}” (match 9 digits) will be used to keep the required understanding of regexes to a minimum. Likewise, we’ll use social security numbers (9 digits) as the go-to example for data classification, when applicable.

The most telling reason why we believe regexes to be the most popular and practical solution for data classification at scale is that regexes can be compiled and processed fairly cheaply. A regex converted to a finite state machine is inexpensive in both memory and computation to run at scale, provided that unbounded repetition (such as the “wildcard” operator, which allows for an unlimited number of characters) isn’t abused. It follows that the data does not have to be processed holistically (i.e. by loading an entire file), but can instead be streamed byte by byte, a very convenient feature for working with data at scale. When scanning large files (e.g. very large big-data formats such as Parquet or Avro), this optimization is essential for cost-effective scanning. This is in stark contrast to most neural network based approaches, which require a holistic view of the data in order to generate a prediction.
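To make the streaming property concrete, here is a hand-written state machine for “[0-9]{9}” that consumes data one byte at a time. It is a sketch under stated assumptions: Python’s built-in re module does not expose a streaming interface, so the machine is written by hand, and the function name and chunking are ours.

```python
from typing import Iterable, Iterator

def stream_nine_digit_runs(chunks: Iterable[bytes]) -> Iterator[str]:
    """Hand-rolled state machine for [0-9]{9}, fed one byte at a time.

    The only state kept between bytes is the current run of digits
    (at most 9 characters), so memory use does not depend on file size.
    """
    run = bytearray()
    for chunk in chunks:
        for byte in chunk:
            if 48 <= byte <= 57:          # ASCII '0'..'9'
                run.append(byte)
                if len(run) == 9:         # accepting state reached
                    yield run.decode("ascii")
                    run.clear()
            else:
                run.clear()               # any other byte resets the machine

# The match spans a chunk boundary, yet is still found.
chunks = [b"ssn: 5555", b"55555 end"]
print(list(stream_nine_digit_runs(chunks)))  # ['555555555']
```

The scanner never needs the whole file in memory, which is the property that makes this style of matching attractive for very large objects.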

One of the biggest drawbacks of regexes is that they are not Turing complete (they cannot perform arbitrary computations). This is generally compensated for by some use of a validator function, which we’ll discuss in a later section. Computations are needed in many data classification scenarios, especially when capturing data that uses some sort of checksum (error detection) to catch user error.

Uses of Keywords and Keyword Distance

Using regular expressions on their own is generally ill advised, as it leads to endless noise by design. For example, suppose a developer uses “[0-9]{9}” to try and capture social security numbers. This regex will inadvertently capture many phone numbers with the same format: it would be foolish to consider every 9 digit number to be a social security number.

Most data classification tools pair a regular expression with a list of “keywords” that we, as humans, generally associate with the target data. For example, “ssn” appearing near a match for “[0-9]{9}” likely indicates a social security number. This is a pretty straightforward example, but it becomes more nuanced once different languages and file formats are introduced.

The notion of a “keyword distance” refers to the distance between the regex match and the keyword. Within unstructured data, keywords may (and commonly do) appear near unrelated data by coincidence, so restricting the distance is a way of reducing false positives. For example, suppose a text document contains a message between two users, where one user sends a social security number over email:

“As per request, my social security number is 555555555.”

This is an example of a short keyword distance between the target data and a keyword. Note that the notion of keyword distance is not used in structured data, as there’s no notion of distance between characters.

Choosing keyword distances carefully is important for reducing the opportunity for false positives. For example, an abnormally high keyword distance (say 500 characters) would produce an undesirable result in the example below:

“Addressing your request, I will not be sending over my social security number. I will instead be sending over my phone number at the end of this paragraph. I hope you can read 9 digits of numbers well without any dashes in it, as that’s how I send them. Well, ok here it is. 555555555. This is my phone number.”

Choosing the distance is a balancing act: if it’s too short, it’ll miss a lot of examples, but if it’s too large, it will generate noisy, undesirable results.
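The effect of keyword distance on the two messages above can be sketched as follows. The function name and the way distance is measured (characters between the keyword and the match) are assumptions made for illustration; real engines may define distance differently.

```python
import re
from typing import List

def matches_near_keyword(text: str, pattern: str, keywords: List[str],
                         max_distance: int) -> List[str]:
    """Return regex matches that have any keyword within max_distance characters."""
    keyword_spans = [
        k.span()
        for kw in keywords
        for k in re.finditer(re.escape(kw), text, re.IGNORECASE)
    ]
    hits = []
    for m in re.finditer(pattern, text):
        # Gap in characters between the match and each keyword occurrence.
        if any(max(m.start() - end, start - m.end(), 0) <= max_distance
               for start, end in keyword_spans):
            hits.append(m.group())
    return hits

short_text = "As per request, my social security number is 555555555."
long_text = (
    "Addressing your request, I will not be sending over my social security "
    "number. I will instead be sending over my phone number at the end of this "
    "paragraph. I hope you can read 9 digits of numbers well without any dashes "
    "in it, as that's how I send them. Well, ok here it is. 555555555. "
    "This is my phone number."
)
keywords = ["social security number"]

print(matches_near_keyword(short_text, r"[0-9]{9}", keywords, 30))   # ['555555555']
print(matches_near_keyword(long_text, r"[0-9]{9}", keywords, 30))    # []
print(matches_near_keyword(long_text, r"[0-9]{9}", keywords, 500))   # ['555555555']
```

With a distance of 30 the phone number in the second message is ignored, while a distance of 500 reports it as a social security number.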

Validators

We mentioned that regular expressions are not “Turing complete”, but didn’t expand on what this implies and how validators can remediate the issue. For example, suppose we wanted to capture target data consisting of 6 digit long strings, with the added requirement that each digit be larger than the previous one. A string passing this requirement would be:

“134789”

It’s self-evident that “[0-9]{6}” would not successfully differentiate strings given the requirements I laid out above. A toy example like this seems useless, but if a validation component can compute something like this, it can also perform arbitrary computational checks, such as checksums.

Because the set of strings meeting this requirement is finite, a regular expression could in principle capture it by brute force enumeration, but the result would be unwieldy, and many real checks (such as checksums over long strings) are not practical to express as a regex. Any Turing complete language, on the other hand, can express the check directly. This is the primary motivation behind a validator: providing a means of carrying out computations on the string a regular expression has captured.
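A validator for the toy requirement above might look like the following sketch, in which the regex proposes candidates and a small function decides. The function names are ours.

```python
import re
from typing import List

CANDIDATE = re.compile(r"[0-9]{6}")

def digits_strictly_increase(candidate: str) -> bool:
    """Validator: every digit must be larger than the one before it."""
    return all(a < b for a, b in zip(candidate, candidate[1:]))

def classify(text: str) -> List[str]:
    """Keep only the regex matches that the validator accepts."""
    return [m.group() for m in CANDIDATE.finditer(text)
            if digits_strictly_increase(m.group())]

print(classify("134789 and 134779"))  # ['134789']
```

Both strings match the regex, but only the first one survives validation.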

Many forms of personally identifiable information (PII) have a computational aspect for verification, mostly to catch human errors. For example, credit card numbers include a Luhn check digit, which reduces the prevalence of human error when handling credit card numbers. For us as developers seeking to classify target data containing this computational feature, being able to perform these computational checks is a core mechanism in how our data classification engine reduces noise. Some validators can get even more elaborate, such as calling an external API to check if a string is a valid key.
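For reference, the Luhn check can be written in a few lines. This is a sketch of the algorithm as it is commonly described, not the validator used by any particular product; the function name is ours, and the number used below is a widely published test number rather than a real card.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn check."""
    # Separators such as spaces or dashes are ignored.
    digits = [int(c) for c in number if c in "0123456789"]
    if len(digits) < 2:
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

print(luhn_valid("4111 1111 1111 1111"))  # True
print(luhn_valid("4111 1111 1111 1112"))  # False
```

A 16 digit string that fails this check can be discarded without consulting keywords at all, which is why validators are so effective at cutting noise.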

Regular Expression Based Classification: Understanding Dataclasses For the Cloud - Writing, Maintaining, and Testing Dataclasses

Section Overview

What needs to be understood upfront is that dataclasses share an ambiguity problem with antivirus software: what constitutes a positive hit is human defined and somewhat arbitrary, much as it can be hard to pinpoint what separates a harmless program from a malicious one. Furthermore, concept drift is prevalent in both fields, making the benchmarking of efficacy an imperfect art.

For example, anti-cheat software (Riot Vanguard, ESEA, etc.) sometimes gets flagged as malware: kernel level (highest permission level) anti-cheats typically monitor open processes, search for files on disk, and analyze user input (such as mouse and keyboard presses). Some individuals would consider this behavior similar to malware, while others would argue that it’s just a piece of software. The distinction is left to whoever needs to categorize the program.

The creation of dataclasses bears the same fundamental issues. While there are some obvious cases where a dataclass can be defined in terms of a regular expression and validators, such as SSH keys (these have a very structured format that was purposefully designed), there are also less well thought out structures, such as driver’s license numbers, which vary from state to state and country to country. For example, the string “driver license 72” is ambiguous, as “72” is a valid driver’s license number in some states. Does this refer to “72 driver licenses”, or to the driver’s license number “72”? We would find it useless to call “72” a driver’s license number, but at the same time it would be incorrect to say that “72” is not a valid driver’s license number somewhere (according to documentation in some states, it is).

Likewise, what constitutes a match may change or be updated. If a dataclass claims to capture “bitcoin addresses”, what constitutes a bitcoin address is not static and changes based on community behavior: new address standards can be adopted by the community as deemed necessary. This is a perfect demonstration of the concept drift that can occur within dataclasses. A dataclass capturing bitcoin addresses in 2011 (for example, “1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa”, a hash encoded in Base58Check) needs a different regular expression than one capturing segregated witness addresses (which start with “bc1” and use lowercase alphanumeric characters only, e.g. “bc1qar0srrr7xfkvy5l643lydnw9re59gtzzwf5mdq”). The “efficacy” of a dataclass can change at any given moment depending on what the agreed upon definition is, which can be left up to independent individuals, decentralized communities, or official governments.
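The drift between the two address formats can be seen by running a regex written for the older format against a newer address. The patterns below are simplified assumptions for illustration (the length bounds in particular are ours), and a production dataclass would also verify the checksum each encoding carries.

```python
import re

# Simplified, assumed patterns; not complete definitions of either format.
LEGACY_2011 = re.compile(r"[13][1-9A-HJ-NP-Za-km-z]{25,34}")    # Base58 alphabet
SEGWIT_BECH32 = re.compile(r"bc1[ac-hj-np-z02-9]{11,71}")       # bech32 alphabet

legacy = "1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa"
segwit = "bc1qar0srrr7xfkvy5l643lydnw9re59gtzzwf5mdq"

print(bool(LEGACY_2011.fullmatch(legacy)))    # True
print(bool(LEGACY_2011.fullmatch(segwit)))    # False: the 2011 definition misses it
print(bool(SEGWIT_BECH32.fullmatch(segwit)))  # True
```

A dataclass that was complete in 2011 silently loses recall once the newer format is in circulation, without a single line of it having changed.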

This is where the ambiguity of what is considered a dataclass comes into play, and it serves as the primary motivation of this chapter: understanding how to write and maintain dataclasses, and learning the different types of errors one can make.

Dataclass Concepts: Logical Classes and Types of Errors

Dataclasses can be broken into a few logical classes, each with its own caveats and distinctions that make it easier or more difficult to work with. Each distinction deserves some recognition, in that dataclasses aren’t always directly comparable with, let's say, labeled images of animals, which bear unambiguous classifications (given that the photograph actually shows an animal of that class).

Class 1: Patternless Human Data

Human data can be the hardest to match. First names, last names, and dates of birth all follow different conventions around the world. What is considered a name 50 years from now may not be seen as a name today: concept drift is prevalent within this class of data, and inherently makes consistently writing dataclasses for this data difficult.

The most notable example of this class is human names. The string “hunter” belongs to multiple lexical categories, as it can be a common noun or a proper noun. Other examples include street addresses (there’s some nuance here in that this can be defined as loosely defined data, but street names and city names fall into this logical category due to their arbitrary and usually cultural origins).

This obviously poses difficulties for those trying to create a science out of dataclasses, as one country’s way of writing dates may be entirely different from the way Americans write theirs. To reiterate, for this category there can be many interpretations of the same data, depending on who is interpreting the data source. This class of data is difficult to capture effectively, and is generally not pursued due to the low signal to noise ratio that capturing these types of data usually yields. This category is also difficult to grade without some form of human intervention.

Class 2: Loosely Patterned Data

We’ve defined loosely patterned data to be: 

Data that has a structure assigned to it, but the structure is:

  • loosely enforced, and/or
  • so commonplace that it cannot be differentiated from other strings without further context.

For example, social security numbers (nine digits) fit into this category, as United States bank account numbers may be nine digits as well. Without further context (there are some exceptions, e.g. checksums and invalid SSN patterns, as 666-xx-xxxx is never issued), the loosely patterned structure of either dataclass is extremely vague; hence context is required, and finding a matching pattern in a document on its own isn’t enough to draw conclusions (i.e. specific keywords need to be employed to accurately classify the target data).

Other examples:

  • Credit Card Numbers (ignoring the Luhn check)
  • US Passport Numbers
  • Phone Numbers

To make matters worse, Class 2 data generally doesn’t have a single defined format that the strings must appear in. For example, phone numbers can be formatted in a number of different ways, depending on how the user or system stores them. Below are some examples of the same phone number being stored in a few different ways.

  • +55 98765–4321
  • +55 98765 4321
  • 55987654321
  • +55 987654321

From a development perspective, ambiguities in Class 2 data can be a nightmare to deal with, as trying to incorporate different formats eventually leads to errors (Type 1 Errors, which we’ll discuss later). Defining specific cases in regular expressions is painful to debug and maintain.

Furthermore, concept drift is relevant in this class, although it is not as problematic as it used to be. To the best of our knowledge, concept drift appears when the issuers of a format underestimate the growth of an industry: automobile license plates used to be only a few digits, and progressively gained more digits and letters as the number of cars increased. Most modern issuers are aware of this design challenge, and proactively define formats that scale as the number of issued values increases. For example, IPv6 was designed to allow for a far larger range of IP addresses, as we began to run out of IPv4 addresses.
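One way to see that the four phone number strings above are the same value is to strip everything that is not a digit before comparing. This is only a sketch of a normalization step (the function name is ours), not a recommendation for how a dataclass should match phone numbers.

```python
import re

def normalize_phone(raw: str) -> str:
    """Drop every non-digit character (plus signs, spaces, dashes)."""
    return re.sub(r"\D", "", raw)

variants = ["+55 98765–4321", "+55 98765 4321", "55987654321", "+55 987654321"]
print({normalize_phone(v) for v in variants})  # {'55987654321'}
```

The trade-off is that normalization throws away the very formatting cues that might have helped separate a phone number from any other 11 digit string.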

Class 3: Well-Patterned Data

In direct contrast to loosely patterned data, well-patterned data embodies the opposite characteristics: data that was methodically designed to be differentiable from other strings, with a uniquely defined format. The “IBAN”, or International Bank Account Number, is vastly different from US bank account numbers, in that there’s a very definite structure to the format (the IBAN embeds metadata within the number, including a country code and check digits), which makes it easy to discern what is an IBAN and what is not.

Other examples of strictly defined data:

  • PGP keys (while the body of a PGP key contains a large semi-prime number in base64, the headers and declaration leave the string unambiguous as to its intended meaning)
  • IBAN numbers
  • SSH keys
  • AWS keys (all AWS access keys begin with a deliberate string, such as AKIA)

These strings are the easiest to capture, and a dataclass for them can typically be written in a matter of minutes.
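Two of the examples above can be sketched in a few lines each. The AWS pattern assumes an access key ID is “AKIA” followed by 16 uppercase letters or digits, which is an assumption on our part based on published sample keys. The IBAN function implements the standard mod-97 check; the function names are ours.

```python
import re

# Assumed shape: the "AKIA" prefix followed by 16 uppercase letters or digits.
AWS_ACCESS_KEY_ID = re.compile(r"AKIA[0-9A-Z]{16}")

def iban_checksum_ok(iban: str) -> bool:
    """Mod-97 check: move the first four characters to the end, map letters
    to numbers (A=10 ... Z=35), and test that the remainder is 1."""
    compact = iban.replace(" ", "").upper()
    if not re.fullmatch(r"[A-Z]{2}[0-9]{2}[A-Z0-9]{1,30}", compact):
        return False
    rearranged = compact[4:] + compact[:4]
    as_digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(as_digits) % 97 == 1

# Sample key from AWS documentation and a commonly published example IBAN.
print(bool(AWS_ACCESS_KEY_ID.fullmatch("AKIAIOSFODNN7EXAMPLE")))  # True
print(iban_checksum_ok("GB82 WEST 1234 5698 7654 32"))            # True
print(iban_checksum_ok("GB82 WEST 1234 5698 7654 33"))            # False
```

In both cases the format itself carries enough signal that keywords become optional rather than essential.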

Types of Errors

Errors occur when using written dataclasses for a number of reasons. Given the ambiguity over what dataclasses seek to capture, differentiating the types of errors has been a useful tool for debugging and decision making (as noted before, the domain of classification we’re operating in is a bit more complicated than binary classification).

Type 1 Error: Overly Permissive

Type 1 Errors are the most commonly encountered type of error, as they produce the most noise for the end user (in contrast to false negatives, which aren’t noisy). “Overly Permissive” is a general description given to a dataclass, but it most commonly relates back to the regular expressions, and is a blanket term for “the regex is trying to capture too much”.

An example of this would be a developer trying to capture all social security numbers (which fall under our definition of Class 2) by allowing any sequence of 9 digits, with delimiters allowed between characters. While this will find a wide range of social security numbers, the developer has made the regular expression too permissive, and finds themselves inadvertently capturing ZIP+4 codes.

  • 555 55-5555
  • 55555-4444

These errors are found mostly in the Class 1 and Class 2 categories, as the lack of an explicit format can lead developers to write a “capture all” regex that tries to incorporate all the variations in which a given target pattern can be expressed.
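A minimal sketch of such an overly permissive regex is shown below. The pattern is our own illustration of “any 9 digits with optional delimiters”, and the last sample is a ZIP+4 style string chosen by us.

```python
import re

# Nine digits with an optional space or dash between any two of them.
PERMISSIVE_SSN = re.compile(r"[0-9](?:[ -]?[0-9]){8}")

samples = ["555-55-5555", "555 55-5555", "55555-4444"]
for sample in samples:
    print(sample, bool(PERMISSIVE_SSN.fullmatch(sample)))
# All three match, including the ZIP+4 style string.
```

Every variation the developer wanted is captured, but so is a string that was never meant to be a social security number.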

Digression/Clarification

We distinguish the Type 1 Errors defined here from Type 1 Errors in a traditional data science setting: whereas a confusion matrix assumes a binary ground truth for which category a piece of information belongs to, Type 1 Errors within the scope of data classification rely on arbitrary “goalposts” for what the target data should not look like. An overly permissive error differs from a false positive in that all the scanning mechanisms operated as expected; when a Type 1 Error occurs, a human has to manually decide whether a given string is a piece of target data or not. Even then, it may be impractical for a human to decide whether a string categorically falls into a dataclass or not.

For example, if a 16 digit number is found near the words “credit card” in a PDF file and also passes the Luhn check, and someone declares that it is not a credit card number, then one needs to re-evaluate the keywords used, the matching patterns used, or the validator itself. In other words, everything is working as designed, and the real issue is reducing the overly permissive nature of the dataclass.

Type 2 Error: Overly Strict

In stark contrast to Type 1 Errors, Type 2 Errors mean the dataclass is not capturing enough. This error is seen less often by the end user, as a regular expression that doesn’t capture enough generally doesn’t draw attention to itself. That is not to say, however, that Type 2 Errors are less important.

Type 2 Errors are found mostly in Class 2 dataclasses, as these loosely defined classes are generally ambiguous about how to format a given piece of data. For example, social security numbers can be formatted in a variety of ways, to enumerate a few:

  • 555-55-5555
  • 555 55 5555
  • 555555555
  • 555 - 55 - 5555

In these examples, simply matching 9 digits (no spaces or separators) will result in a Type 2 Error, as the ambiguity of how a user can store a social security number leaves arbitrary decisions on what to try and match.

These are analogous to false negatives. They are most common in the Class 2 and Class 3 categories; they are less frequent in Class 3 due to its explicitly defined formats, but still arise.
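The overly strict case can be illustrated with the plain 9 digit regex used throughout this paper. The sample list is ours, and the snippet only sketches how missed formats show up.

```python
import re

STRICT_SSN = re.compile(r"[0-9]{9}")

samples = ["555-55-5555", "555 55 5555", "555555555"]
missed = [s for s in samples if not STRICT_SSN.search(s)]
print(missed)  # ['555-55-5555', '555 55 5555']
```

Nothing is reported for the two delimited strings, which is exactly why this kind of error tends to go unnoticed by the end user.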

Type 3 Error: Definition Error

Type 3 Errors are more a fault of human error than of implementation, in that the definition of what the dataclass is trying to classify is either poorly defined or impossible to label correctly. A loose definition of a Type 3 Error is: “the dataclass or target has multiple definitions, with no universally agreed upon definition.”

For example, an API key whose format isn’t defined by the API provider (due to mismanagement) would be hard to classify, as would one where the developers accept many different key formats, none of which agree with the documentation. Another example would be an attempt to create a “Covid Vaccine Passport ID” dataclass: there’s (currently) no universally accepted definition of what this entails, and attempting to create a dataclass for it would be impossible.

Type 3 Errors generally occur when one doesn’t understand the type of data they’re trying to target and as a result writes a dataclass that doesn’t provide the desired functionality, as what constitutes the dataclass isn’t well defined (in contrast to social security numbers or AWS keys). They mostly contribute to Type 1 Errors, as developers instructed to make these dataclasses may try to over-capture in order to compensate for the lack of an agreed upon definition.

Writing Dataclasses

Considering dataclasses are the “meat” of what drives a data classification engine, being judicious with the research and development of the dataclasses themselves is as important as the scanning engine deployed. Hence, the practice of writing well thought out and well maintained dataclasses is of vital importance to any data classification software.

Having defined the various logical classes and types of errors gives us a simplified but structured means of expressing the concerns and steps one should take when writing a new dataclass, as most decisions end up becoming a balancing act that one reconsiders with each addition or removal of a dataclass’s features. We’ll discuss the mathematical modeling of dataclasses, and how it provides value to the development of current and future dataclasses, in a future paper.

Step 0: Pre-Research and Scoping

Identifying which logical class the dataclass belongs to is the first step when writing a dataclass. Analogous to Big O notation for time complexity, each logical class carries a different cost of labor to produce a meaningful dataclass: Class 1 dataclasses are the most difficult, while Class 3 are the easiest. The goal of pre-research is to get a feel for the problem at hand, as well as to understand one’s priorities and intentions for the dataclass itself. For example, one of the factors to consider when scoping is one’s priorities with respect to a confusion matrix (true positives, false positives, false negatives, and true negatives).



While most individuals (outside of data science) would naively shoot for a perfect F1 score, the field of data classification innately has arbitrarily chosen parameters as to what constitutes a true positive, hence we fall back on the types of errors defined in the previous section. Internally, we have to communicate between teams that “we never shoot for perfect precision, because that means we’ll have low recall”, which has historically been true for most classification models. Taking one’s priorities with respect to a confusion matrix into account during the scoping phase makes it much more explicit how to answer engineering questions about what to prioritize.

For example:

  • If one is aiming for high precision, more Type 2 Errors will naturally occur due to the tighter regex (provided the type of dataclass permits such errors).
  • If shooting for high recall, Type 1 Errors may occur, as including too many keywords and too wide a regular expression naturally produces more overlaps with other dataclasses.
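The trade-off in the two bullets above can be made concrete with the usual definitions of precision, recall, and F1 score. The counts below are hypothetical and chosen only to illustrate the two extremes; the function name is ours.

```python
from typing import Tuple

def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Standard metrics computed from true positive, false positive,
    and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical counts: a tight regex (no noise, many misses) versus
# a wide regex (few misses, a lot of noise).
tight = precision_recall_f1(tp=60, fp=0, fn=40)
wide = precision_recall_f1(tp=95, fp=95, fn=5)
print("tight:", [round(x, 2) for x in tight])
print("wide: ", [round(x, 2) for x in wide])
```

Neither configuration is “correct” on its own; which one is preferable depends on the priorities agreed upon during scoping.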

Step 1: Researching References

Compiling a list of references to use during the creation of a dataclass is the foundation of creating quality dataclasses, as well as of maintaining them for years down the road. Most dataclass repositories hold dataclasses in the double if not triple digits, so maintaining a library of how each dataclass was created is required for proper maintenance, as well as for others to correct previous dataclasses. A side effect of Step 1 is that it helps minimize Type 3 Errors, as there may be competing definitions of what constitutes a classification for a given data type. Gathering resources and developing an agreed upon definition of what a true positive looks like lets one understand what others agree on, and anticipate the Type 3 Errors that may occur (references may disagree about what is and is not an instance of a type of data). Without such a record, which definition a developer used for a given dataclass, and why, is quickly lost to tribal knowledge.

On a related matter, the inability to find quality references during this step is generally indicative of a poor dataclass down the road: a poor foundation here may mean that one should not commit the time to developing the dataclass, as the end result may be poor due to the lack of resources. A lack of documentation has been a telltale sign that a dataclass will perform poorly for us at OpenRaven; conversely, well documented dataclasses have had the highest efficacy.

Step 2: Developing a Corpus

The next step involves generating a corpus of true positives, either by researching examples online, or synthetically crafting them oneself using definitions. This can be the most difficult process, as it requires a lot of examples, is prone to mistakes, and is inherently arbitrary (deciding on what’s considered a true positive can have many interpretations). 

Look online and try to develop a corpus of examples governing what a true positive looks like. This part generally has a ranging degree of difficulty for various reasons: sometimes, examples of target data aren’t open for the public to view, sometimes examples are readily available for viewing, sometimes they’re redacted for security reasons, or sometimes examples just can't be found without requesting them from a vendor specifically. For example, a real ITIN number isn’t given on an ITIN information website, but this redacted string is: 9XX-7X-XXXX. On the other hand, many complete IBAN numbers can be found on various banking websites, giving complete examples of what valid IBAN numbers look like. Likewise, sample AWS keys are available on AWS’s website. 

Finding examples such as these and hand-crafting true positive examples to use in the later steps is tedious, but in our experience it has been essential for producing quality dataclasses.

Step 3: Identify and Decide Variation Acceptance

In the previous example, we gave “9XX-7X-XXXX” as a sample ITIN number. Depending on whether we decide to favor higher recall or higher precision, we’ll decide for or against allowing “9XX 7X XXXX”, “9XX7XXXXX”, “9XX-7X XXXX”, or variations thereof. By allowing any of these into the mix, the quantities of Type 1 Errors and Type 2 Errors will inevitably go up or down.
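To make the trade-off concrete, here is a minimal sketch comparing a strict pattern with a permissive one. Both patterns only follow the shape of the redacted sample “9XX-7X-XXXX”; they are assumptions for illustration, not the official ITIN rule and not a production dataclass. The candidate strings are synthetic.

```python
import re

# Shape copied from the redacted sample "9XX-7X-XXXX".
# Assumption: this is an illustrative shape only, NOT the official ITIN rule.
STRICT = re.compile(r"9\d{2}-7\d-\d{4}")

# Permissive variant: each separator may be a hyphen, a space, or absent.
PERMISSIVE = re.compile(r"9\d{2}[- ]?7\d[- ]?\d{4}")

# Synthetic candidates mirroring the variations discussed in the text.
candidates = ["912-70-1234", "912 70 1234", "912701234", "912-70 1234"]


def accepted(pattern, strings):
    """Return the strings that the pattern matches in full."""
    return [s for s in strings if pattern.fullmatch(s) is not None]


if __name__ == "__main__":
    print("strict:    ", accepted(STRICT, candidates))
    print("permissive:", accepted(PERMISSIVE, candidates))
```

The strict pattern accepts only the hyphenated form, favoring precision; the permissive pattern accepts all four variations, favoring recall at the cost of matching more unrelated nine-digit strings.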

The work needed in Step 3 is most pronounced for Class 2 dataclasses, as these adhere to a “structured format” without being explicit about how that format should be written across the board. For Class 3 dataclasses, this is generally less prominent, as there’s a very explicit structure governing exactly how the string should appear, though exceptions do occur. Likewise, for Class 1 this isn’t as much of an issue, as ambiguous data innately varies and is expected to allow variation.

Step 4: Crafting True Negatives

After having decided on true positives, and having added our own true positives with variations in mind, crafting a set of true negatives gives us a way to check that our regexes perform as intended and do not overmatch. For example, while we may craft a regex that perfectly captures the target data we’re interested in, it’s just as crucial to make sure it doesn’t capture anything more.

Crafting true negatives requires some human creativity, as the approach will vary from data type to data type. For example, a social security number would never have a letter in it, so taking items from our sample corpus and embedding various strings before, in the middle of, or after true positive strings in effect creates true negatives. Some examples of true negative social security numbers would be:

  • 55a-55-5555
  • A555-55-5555
  • 555-55a5555

This step has no natural upper bound: the more true negatives, the better, since the data produced here will be used when crafting the regular expressions. Tests like these provide a means of “unit testing” the regular expressions written in the following step.
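True negatives of the kind listed above can also be produced mechanically. The following is a minimal sketch, assuming a hypothetical helper named make_true_negatives that corrupts a true positive by substituting a letter at each position and by adding a letter before and after it. It covers only this one kind of corruption and does not replace hand-crafted cases.

```python
def make_true_negatives(true_positive, letter="a"):
    """Corrupt one true positive into a list of true negative candidates.

    Hypothetical helper for illustration: it substitutes `letter` for each
    character in turn, then adds `letter` as a prefix and as a suffix.
    """
    substituted = [
        true_positive[:i] + letter + true_positive[i + 1:]
        for i in range(len(true_positive))
    ]
    return substituted + [letter + true_positive, true_positive + letter]


if __name__ == "__main__":
    # Synthetic social security number shape taken from the examples above.
    for negative in make_true_negatives("555-55-5555"):
        print(negative)
```

Whether a prefixed or suffixed string really counts as a true negative depends on the definition agreed upon in Step 1, so the generated list should be reviewed before it is added to the corpus.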

Step 5: Crafting Regular Expressions

We won’t talk about how to actually write regexes in this section, as that’s a study of its own, but rather about how to use the tools we’ve developed here to best write regexes with a focus on data classification.

Having completed all the prior steps, we now have a corpus of true positives and true negatives, and we can see why we write the regular expressions last. Having visibility into what we SHOULD match and what we SHOULD NOT is much easier to work with than writing the regular expression first and then crafting true positives and true negatives. Although there are some exceptions, the true negative corpus should be as important a driver when crafting regular expressions as the true positive set.

This step involves using a regex editor and seeing what the regexes do and do not capture from the true positive and true negative sets. There’s not much to describe in this step other than writing regexes and ensuring that the true negatives aren’t being caught (each one caught is a false positive) and that the true positives are.
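The same check can be scripted so that it is repeatable. This is a minimal sketch, assuming a hypothetical evaluate helper and two illustrative social security number patterns; neither pattern is an actual Open Raven dataclass, and the corpora are the synthetic strings from Step 4 plus one made-up true positive.

```python
import re


def evaluate(pattern, true_positives, true_negatives):
    """Return (missed, overmatched) for a candidate regex.

    missed:      true positives the pattern fails to find
    overmatched: true negatives the pattern finds anyway (false positives)
    """
    missed = [s for s in true_positives if pattern.search(s) is None]
    overmatched = [s for s in true_negatives if pattern.search(s) is not None]
    return missed, overmatched


# Synthetic corpora; the true negatives come from the Step 4 examples.
true_positives = ["555-55-5555", "123-45-6789"]
true_negatives = ["55a-55-5555", "A555-55-5555", "555-55a5555"]

# Illustrative patterns only (assumptions, not production dataclasses).
DRAFT = re.compile(r"\d{3}-\d{2}-\d{4}")
REVISED = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

if __name__ == "__main__":
    for name, pattern in [("draft", DRAFT), ("revised", REVISED)]:
        missed, overmatched = evaluate(pattern, true_positives, true_negatives)
        print(name, "missed:", missed, "overmatched:", overmatched)
```

The draft pattern finds every true positive but also matches inside “A555-55-5555”, which only the true negative corpus reveals. Adding word boundaries in the revised pattern removes that overmatch without losing any true positives.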

Conclusion

In this paper, we shared the internal knowledge we at Open Raven have collected on data classification. We briefly discussed why we believe a regex-based model is the best approach for cloud data classification, as it has properties that scale with how large files can be in cloud environments: regex-based classification complements how cloud architecture works (streaming bytes) and requires very little memory.

The remainder of the paper was spent discussing a model-theoretic view of dataclasses (dataclasses are an abstraction of how we and others use regular expressions coupled with other atomic features to make classifications), which gives us a metamodel of sorts to describe the different types of data we’re interested in capturing. Although imperfect, having a metamodel allows developers to discuss and quantify issues that naturally arise when using dataclasses to perform data classification on cloud environments. We then showed how the metamodel is useful as a language of sorts when writing dataclasses, as prioritizing needs and wants based on the abstractions described allows dataclasses to be written more efficiently.