Open Raven vs Macie: Getting the Data for Benchmarking - Part 2

Bele

Chief Corvus Officer

April 6, 2021

In the early days of image processing, a photo of Lena Forsén, a 1970s-era Playboy model, inadvertently became the de facto test image used to benchmark image processing algorithms. Unbeknownst to Forsén, and to her later regret, the photo (widely known as “Lenna”) had all the properties that the image processing research community needed at the time: a human face and a vibrant, dynamic range of colors. However, the ubiquitous use of an erotic image to benchmark software contributed to an ongoing 50-year debate surrounding sexism in computer science. Fortunately, controversy over the Lenna image has also catalyzed many attempts to change the default benchmark for image processing to something less degrading and more inclusive.

The story of Lenna also highlights a pertinent issue within the relatively new field of data classification: unbiased and repeatable testing standards for data classification tools have yet to be set. For Open Raven, this situation poses a problem when we try to make objective statements about how well our software is performing. What we and others in our industry urgently need is a set of non-controversial, distributable data with the desired file properties from which to draw conclusions about the efficacy of a piece of software. Whereas the field of image processing has trillions of images to choose from for a benchmarking collection, the options available to those in the field of data classification are much more limited. Manually searching for data leaks online is neither suitable nor ethical. Similarly, finding enough data for a quality assurance team to run quantitative tests and deliver our product with confidence remains a significant challenge.

The absence of reliable test data compounds one of the biggest problems the team at Open Raven faces when testing our scanner: what data do we benchmark against? We need a wide distribution of file types, file sizes, and file contents to test against, and we need to answer questions such as how data sources affect efficacy, performance, and stability. However, to preserve privacy and data sovereignty, the data we use must also never leave a customer's environment. With that in mind, we sought to build our own corpus of source data that contains all the properties needed to ensure our program works in bounded but unforeseeable scenarios. In doing so, we also hope to give you the ability to distribute these files without fear of backlash — no piece of test data should cause controversy like the infamous Lenna test image did.

Seeing the unseeable

Bele sitting on an iceberg with 'data' hiding underneath the water's surface.

Our solution for a problem that could be defined as “seeing the unseeable” is Mockingbird, our open-source data benchmarking project, available on GitHub. We use Mockingbird to generate randomized documents containing embedded data (such as Social Security numbers) in large enough quantities to effectively test our program inside and out. Rather than trying to amass a small collection of realistic testing documents, we test against large volumes of randomized files. These files may or may not be representative of how customers store their structured or unstructured documents. Ultimately, however, they give us the ability to test against a large and diverse enough collection to gain visibility into the unknown.

Three CSV files with randomized information. Three distinct permutations of CSV files, all of which vary in placement and size. Both expected and unexpected behaviors start to show patterns once files are generated and tested in large enough quantities.

Large circle with 'All Permutations' label. Inside is a Venn diagram with Mockingbird and User on either side and Environments in the overlapping section in the middle. This is labeled as 'Area of Visibility.' The sizes, structures, and layouts of the files Mockingbird generates are a small subset of all possible permutations of files, and Mockingbird can only generate a bounded number of these possibilities. Nonetheless, it casts a wide enough net of realistic testing scenarios to test our program to a high degree of confidence.

Because every file Mockingbird produces is generated from verifiable seed data, we control exactly what content is inserted, which allows us to freely distribute the files used during testing. Since our benchmarking files are completely free of association (and controversy) with any real-life individuals or organizations, we are able to release them so others may validate both our results and other solutions’ claimed performance.

Inside visibility

A key benefit of using a synthetic dataset rather than an “in-the-wild” one is that no time is spent manually labeling test data. The process of checking individual documents and finding the exact counts of target data occurrences is automated during Mockingbird’s file generation process. When Mockingbird produces a file, it logs the quantity of injected target data placed within each document into a metadata file. This automatic data logging gives us a quantitative ground-truth view of what each file contains. Because performing statistical calculations requires knowing the ground truth of a file's contents, this feature also plays a fundamental role during benchmarking.

Metadata from a file with a JSON file selected showing "credit card": 66 in the comments. Since Mockingbird tracks each file's contents, the resulting metadata can be used for many different internal purposes. Being able to metaphorically "right-click" and know what's inside any file produced by Mockingbird allows us to perform tests and experiments.
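The bookkeeping described above can be sketched in a few lines of JavaScript. This is a minimal illustration under stated assumptions, not Mockingbird's actual implementation: the function names (`makeRng`, `generateCsv`), the metadata shape, and the placeholder SSN-shaped values are all invented for the example.

```javascript
// Small seeded PRNG (mulberry32) so a given seed always yields the same file.
function makeRng(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Hypothetical seed data: made-up, SSN-shaped placeholder values.
const SEED_VALUES = ["900-00-0001", "900-00-0002", "900-00-0003"];

// Build a CSV of filler cells, then overwrite `count` distinct cells
// with target data chosen from the seed values.
function generateCsv({ seed, rows, cols, count }) {
  if (count > rows * cols) throw new RangeError("count exceeds number of cells");
  const rng = makeRng(seed);
  const grid = Array.from({ length: rows }, () =>
    Array.from({ length: cols }, () => "filler" + Math.floor(rng() * 1000))
  );
  const used = new Set();
  while (used.size < count) {
    const cell = Math.floor(rng() * rows * cols);
    if (used.has(cell)) continue; // placement is random but never repeated
    used.add(cell);
    const value = SEED_VALUES[Math.floor(rng() * SEED_VALUES.length)];
    grid[Math.floor(cell / cols)][cell % cols] = value;
  }
  const content = grid.map((row) => row.join(",")).join("\n");
  // The metadata records the ground truth a benchmark later compares against.
  const metadata = { seed, rows, cols, injected: { ssn: count } };
  return { content, metadata };
}

const file = generateCsv({ seed: 42, rows: 5, cols: 4, count: 3 });
console.log(file.content);
console.log(JSON.stringify(file.metadata));
```

Because the generator is seeded, anyone holding the seed and parameters can regenerate the same file and confirm that the logged counts match its contents.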

Mockingbird also allows large volumes of files to be generated within a bounded set of parameters, enabling scanner benchmarking over a given distribution of files. For example, you can configure how many Social Security numbers to embed within a document or how many rows to generate within a .xlsx file. Because the exact counts of embedded target data are known, over-identifying or under-identifying counts within a file becomes a measurable statistic, a constant in a sea of variables. Furthermore, by generating files in large enough quantities, we can attach a margin of error to the expected accuracy of any given classification, since the metrics are computed over a wide range of files. Missed files can also be flagged and manually inspected to see where and why our scanner failed, allowing us to redesign our scanning approach appropriately.
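As a rough sketch of how such a margin of error might be computed, the snippet below compares the known count for each file with the count a scanner reported. The `summarize` function and the result shape are hypothetical, and the margin uses the common normal approximation for a proportion at roughly 95% confidence, which is one reasonable choice rather than necessarily the method used in our benchmarks.

```javascript
// Each result pairs the ground-truth count from the metadata (`expected`)
// with the count a scanner reported (`detected`). Field names are illustrative.
function summarize(results) {
  const n = results.length;
  if (n === 0) throw new RangeError("no results to summarize");
  let exact = 0;
  let over = 0;
  let under = 0;
  const flagged = []; // files to inspect manually
  for (const { name, expected, detected } of results) {
    if (detected === expected) {
      exact++;
    } else {
      if (detected > expected) over++;
      else under++;
      flagged.push(name);
    }
  }
  const accuracy = exact / n;
  // Normal-approximation margin of error for a proportion (z = 1.96, ~95%).
  const marginOfError = 1.96 * Math.sqrt((accuracy * (1 - accuracy)) / n);
  return { n, exact, over, under, accuracy, marginOfError, flagged };
}

const summary = summarize([
  { name: "a.csv", expected: 3, detected: 3 },
  { name: "b.csv", expected: 5, detected: 6 }, // over-identified
  { name: "c.csv", expected: 2, detected: 1 }, // under-identified
  { name: "d.csv", expected: 4, detected: 4 },
]);
console.log(summary);
```

The margin shrinks as the number of files grows, which is why generating files in bulk matters: four files say very little, while thousands give a tight estimate.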

Addressing false positives with null files

In addition to generating documents with embedded information, Mockingbird can also be used to generate documents containing harmless information, i.e., documents that keep the surrounding keywords but contain no sensitive data. By embedding the string “null” into documents in place of actual target data, we effectively create “placebo” documents for which the scanner should return no positive labels.

Files that show a chat: one contains the full conversation asking for and giving a SSN. The other replaces any mention of SSN or a string of 9 numbers with 'nulldata'. Two nearly identical documents, one containing a Social Security number and the other containing "null data." Being able to generate files that do NOT contain anything of significance is crucial for ensuring that false positives will not be produced regularly.

The ability to generate these “null” or “clean” files helps us determine the scanner’s rate of false positives, i.e., instances when the scanner finds sensitive data even though the document doesn’t contain anything sensitive. When it comes to reducing the meaningless alerts that cause alert fatigue, ensuring that a scanner works against null files is just as important as detecting positive files.
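The placebo idea can be illustrated with a small sketch. Everything here is hypothetical: the template, the `{{ssn}}` placeholder syntax, and the two toy scanners are stand-ins invented for the example, not Mockingbird's templates or Open Raven's scanner.

```javascript
// One template, two renderings: a positive document and a null document.
const TEMPLATE =
  "Agent: Can you confirm your SSN?\nCaller: Sure, it is {{ssn}}.";

function render(template, value) {
  return template.split("{{ssn}}").join(value);
}

// Toy scanner that only flags values shaped like NNN-NN-NNNN.
function patternScanner(text) {
  return /\b\d{3}-\d{2}-\d{4}\b/.test(text);
}

// Naive toy scanner that flags on the keyword alone.
function keywordScanner(text) {
  return /ssn/i.test(text);
}

// Share of null documents that a scanner wrongly flags.
function falsePositiveRate(nullDocs, scanner) {
  const flagged = nullDocs.filter((doc) => scanner(doc)).length;
  return flagged / nullDocs.length;
}

const positiveDoc = render(TEMPLATE, "900-00-0001"); // made-up placeholder value
const nullDoc = render(TEMPLATE, "nulldata");

console.log(patternScanner(positiveDoc));
console.log(falsePositiveRate([nullDoc], patternScanner));
console.log(falsePositiveRate([nullDoc], keywordScanner));
```

Because the null document keeps the keyword context, a scanner that keys on "SSN" alone is exposed immediately, while one that inspects the value itself stays quiet.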

Generating seed data

As developers, not only are we faced with the challenge of generating documents containing sensitive information, but we also have to create harmless but still applicable data for our scanner to find. As mentioned before, sourcing this data legally is not feasible. Luckily, much of the data we’re looking for is documented online, in human-verifiable form.

Current format section of DEA numbers from Wikipedia. The specifications for a DEA number (the U.S. Drug Enforcement Administration's registration number) are easily found online and provided on the DEA's website. Translating these specifications into human-readable JavaScript to generate target data is an ongoing effort.

Finding human-readable formats and creating engines to generate target data for testing data classes is an important part of Open Raven’s ongoing open-source effort to make our scientific approach as transparent as possible. To increase transparency, users can download the same JavaScript code we use to generate benign target data to both verify that our implementation is correct and create their own target data to verify our classification engines.

JS file titled 'dea_generator.js'. Translating the DEA number's specifications into human-readable code that generates new, harmless numbers is how we produce testing data to inject into documents.
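To give a sense of what such a generator can look like, here is a minimal sketch based on the commonly published DEA number format: two leading characters, six digits, and a check digit equal to the last digit of (first + third + fifth digits) plus twice (second + fourth + sixth digits). It is not the contents of the `dea_generator.js` file pictured above; the function names and the small subset of registrant-type letters are assumptions made for illustration.

```javascript
// Illustrative subset of registrant-type letters (assumption, not a full list).
const TYPE_LETTERS = ["A", "B", "F", "M"];
const ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";

// Check digit: last digit of (d1 + d3 + d5) + 2 * (d2 + d4 + d6).
function checkDigit(sixDigits) {
  const d = sixDigits.split("").map(Number);
  const sum = d[0] + d[2] + d[4] + 2 * (d[1] + d[3] + d[5]);
  return sum % 10;
}

// `random` is injectable so a seeded generator can make output repeatable.
function generateDeaNumber(random = Math.random) {
  const pick = (items) => items[Math.floor(random() * items.length)];
  const type = pick(TYPE_LETTERS);
  const initial = pick(ALPHABET); // stands in for a registrant's last initial
  let digits = "";
  for (let i = 0; i < 6; i++) digits += Math.floor(random() * 10);
  return type + initial + digits + checkDigit(digits);
}

// Structural check only: shape plus checksum. The second character is
// allowed to be "9" here, as seen in published example numbers.
function isValidDeaNumber(value) {
  const match = /^[A-Z][A-Z9](\d{6})(\d)$/.exec(value);
  return match !== null && checkDigit(match[1]) === Number(match[2]);
}

const sample = generateDeaNumber();
console.log(sample, isValidDeaNumber(sample));
```

Pairing the generator with a validator built from the same specification lets a reader confirm that every generated value is well-formed without needing any real registration data.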

As we release our benchmarks on specific data classes in the future, we will continue to share both our generation scripts and the actual documents we generate on GitHub.

When we release the final white paper, we will also release a repository on GitHub containing the data and instructions needed to repeat our scientific process independently.

Up Next: Using Testing Documents: Benchmarking, Metrics, and More.

Now that we’ve covered some fairly technical details, let’s get back to the original problem: collecting data to perform and validate our scanner’s benchmarking. The approach we’ve taken allows us to circumvent unintended controversy surrounding the data being tested and create a robust framework we’ll incorporate into our software development cycle.

The diagram below shows how the tools we described in this post are used to benchmark our product (and other products) against a set of specific cases. This benchmarking process provides confidence in our program through meaningful metrics in categories such as precision (our real-world ability to reduce false alerts) and recall (our ability to avoid missing files that contain target data).

The end-to-end framework we use to benchmark software with our generated data.
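For readers less familiar with the two metrics named above, the following sketch shows how they can be computed at the file level once ground truth is known. The `precisionRecall` function and the `containsTarget`/`flagged` field names are hypothetical and chosen for this example.

```javascript
// Each result records whether the file truly contains target data
// (from generation metadata) and whether the scanner flagged it.
function precisionRecall(results) {
  let tp = 0; // flagged and truly contains target data
  let fp = 0; // flagged but clean (a false alert)
  let fn = 0; // contains target data but was missed
  for (const { containsTarget, flagged } of results) {
    if (flagged && containsTarget) tp++;
    else if (flagged && !containsTarget) fp++;
    else if (!flagged && containsTarget) fn++;
  }
  // NaN signals that a metric is undefined for this set of results.
  const precision = tp + fp === 0 ? NaN : tp / (tp + fp);
  const recall = tp + fn === 0 ? NaN : tp / (tp + fn);
  return { tp, fp, fn, precision, recall };
}

const metrics = precisionRecall([
  { containsTarget: true, flagged: true },
  { containsTarget: true, flagged: true },
  { containsTarget: true, flagged: true },
  { containsTarget: false, flagged: true }, // false alert
  { containsTarget: true, flagged: false }, // missed file
  { containsTarget: true, flagged: false }, // missed file
  { containsTarget: false, flagged: false },
]);
console.log(metrics);
```

Null files feed the false-alert count that lowers precision, while files with embedded target data feed the missed count that lowers recall, so both kinds of generated file are needed to report the pair.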

While this is just a preview of the architecture we’ve designed, all of it is made possible by the plethora of testing data we can quickly and ethically distribute, which in turn is made possible by our open-source initiative. Throughout this series, we will expand upon how we perform the benchmarking and show you more of the exciting experiments we’ve been doing here at Open Raven.

If you haven’t already, check out our previous post, Open Sourcing Mockingbird, and the respective GitHub repository for our code.