In the early days of image processing, Lenna Forsén, a 1970’s era Playboy model, inadvertently became the de facto test image used to benchmark image processing algorithms. Unbeknownst to Lenna and to her later regret, her photo had all the properties that the image processing research community needed at the time: a human face and a vibrant, dynamic range of colors. However, the ubiquitous use of an erotic image to benchmark software contributed to an ongoing 50-year debate surrounding sexism in computer science. Fortunately, controversy over Lenna’s image has also catalyzed many attempts to change the default benchmark for image processing to something less degrading and more inclusive.
The story of Lena also highlights a pertinent issue within the relatively new field of data classification — that unbiased and repeatable testing standards for data classification tools have yet to be set. For Open Raven, this situation poses a problem when we try to make objective statements about how well our software is performing. What we and others in our industry urgently need is a set of non-controversial, distributable data with the desired file properties to form conclusions about the efficacy of a piece of software. Whereas the field of image processing has trillions of images to choose from for a benchmarking collection, the options available to those in the field of data classification are much more limited. Manually searching for data leaks online is not suitable or ethical. Similarly, finding enough data to perform quantitative tests for a quality assurance team to deliver our product with confidence remains a significant challenge.
The absence of reliable test data compounds one of the biggest problems the team at Open Raven faces when testing our scanner: what data do we benchmark against? We need a wide distribution of file types, file sizes, and file contents to test against, and we need to answer questions such as how data sources affect efficacy, performance, and stability. However, to preserve privacy and data sovereignty, the data we use must also never leave a customer's environment. With that in mind, we sought to build our own corpus of source data that contains all the properties needed to ensure our program works in bounded but unforeseeable scenarios. In doing so, we also hope to give you the ability to distribute these files without fear of backlash — no piece of test data should cause controversy like the infamous Lenna test image did.
Our solution for a problem that could be defined as “seeing the unseeable” is Mockingbird (GitHub), our open-source data benchmarking project. We use Mockingbird to generate randomized documents containing embedded data (such as Social Security numbers) in large enough quantities to effectively test our program inside and out. Rather than trying to amass a small collection of realistic testing documents, we test against large magnitudes of randomized files. These files may or may not be representative of how customers store their structured or unstructured documents. Ultimately, however, they give us the ability to test against a large and diverse enough collection to gain visibility into the unknown.
Since all the files produced by Mockingbird are generated using verifiable seed data, having control over what content is inserted allows us to freely distribute the files used during testing. Because our benchmarking files are completely free of association (and controversy) with any real-life individuals or organizations, we are able to release them so others may validate both our results and other solutions’ claimed performance.
A key benefit of using a synthetic versus an “in-the-wild” dataset is the elimination of time spent manually labeling test data as such. The process of checking individual documents and finding the exact counts of target data occurrences is automated during Mockingbird’s file generation process. When Mockingbird produces a file, it logs the quantity of injected target data placed within each document into a metadata file. This automatic data logging gives us a quantitative ground-truth view over what each file contains. Because performing statistical calculations requires having the source of truth of a file's contents, this feature also plays a fundamental role during benchmarking.
Mockingbird also allows large volumes of files to be generated within a bounded set of parameters to enable scanner benchmarking over a given distribution of files. For example, you can configure how many Social Security numbers to embed within a document or how many rows to generate within a .xlsx file. With this capability, issues such as over-identifying or under-identifying counts within a file (having known the exact counts of embedded target data) become a known statistic, a constant in a sea of variables. Furthermore, by allowing files to be generated in large enough quantities, we can give a margin of error on the expected accuracy for any given classification because of given metrics over a wide range of files. Missed files can also be flagged and manually inspected to see where and why our scanner failed, thus allowing us to redesign our scanning approach appropriately.
In addition to generating documents with embedded information, Mockingbird can also be used to generate documents containing harmless information, i.e., keywords and target data that contain no sensitive data. By embedding the string “null” into documents in place of actual target data, we effectively create “placebo” documents that the scanner should not return any positive labelings on.
The ability to generate these “null” or “clean” files help us determine the scanner’s rate of false positives, i.e., instances when the scanner finds sensitive data even though the document doesn’t contain anything sensitive. When it comes to reducing the meaningless alerts that cause alert fatigue, ensuring that a scanner works against null files is just as important as detecting positive files.
As developers, not only are we faced with the challenge of generating documents containing sensitive information, but we also have to create harmless but still applicable data for our scanner to find. As mentioned before, sourcing this data legally is not feasible. Luckily, much of the data we’re looking for is documented online, in human-verifiable form.
As we release our benchmarks on specific data classes in the future, we will continue to share both our generation scripts and the actual documents we generate on GitHub.
When we release the final white paper, we will release a repository on GitHub that will contain data and instructions to repeat our scientific process, independently.
Now that we’ve covered some fairly technical details let’s get back to the original problem: collecting data to perform and validate our scanner’s benchmarking. The approach we’ve taken allows us to circumvent unintended controversy surrounding the data being tested and create a robust framework we’ll incorporate into our software development cycle.
The diagram below shows how the tools we described in this post are used to benchmark our product (and other products) against a set of specific cases. This benching marking process provides confidence in our program through meaningful metrics in categories such as precision (our real-world ability to reduce false alerts) and recall (how many missed files we have).
While just a preview of the architecture we’ve designed, all of this is made possible with the plethora of testing data we can quickly and ethically distribute, something that’s made possible by our OpenSource initiative. Throughout this series, we will expand upon how we perform the benchmarking and show you more of the exciting experiments we’ve been doing here at Open Raven.