We’re pleased to announce the open source release of Mockingbird, a Python library for generating mock documents that are used to test data classification software.
The Mockingbird package is now available for download on PyPI, installable via `pip install mockingbird` on the command line, and the source code has been released under the permissive Apache v2.0 license on GitHub.
Writing data classification software for locating sensitive data in both unstructured and structured text is a challenging technical problem, not unlike finding a needle in a haystack.
When tackled at a large scale, the problem becomes even more complex: imagine that the position of the needle changes, the needle may jump around various haystacks, and the haystack contains extra additives that look like needles, too. What began as a simple metaphor for finding a regex match in a file is now non-trivial, and we have the added complexity of needing to handle different formatting standards and file formats.
Data is required to evaluate data classification software. To increase our accuracy and performance, it’s important to test against documents that contain unique and random characteristics or, continuing our metaphor, belong to different haystacks.
But where do test documents come from? Using real-world data for the purposes of testing can raise ethical and privacy concerns. Testing on an organization’s internal sensitive files can be an operational security nightmare. And often, organizations don’t know where their sensitive data is stored to begin with.
At Open Raven, our approach to testing our data classification software has been to use synthetic data. This data is generated by Mockingbird, and with this open source release we’re making it easier for others to do the same.
The process of generating documents with Mockingbird begins with inputting seed data that’s user-defined or fabricated, and ends by embedding it within mocked documents.
Seed data can be any information that the user wishes to embed within documents; typically it ranges from falsified social security numbers to credit card or other personal information. Mockingbird does not ship with seed data, but instead has a simple interface that allows users to provide this information themselves.
Users can provide a CSV file as input, and Mockingbird also supports Mockaroo as a way of generating labeled data and easily embedding it within documents. If a user is calling the Mockingbird package from within a running Python program, they can also input structured lists. In the future we’d like to support the Faker Python library as an additional way to generate seed data.
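To make the idea of labeled seed data concrete, here is a minimal sketch that reads a CSV into a dictionary mapping each column label to its values, using only the standard library. The column names, the sample values, and the `load_seed_csv` helper are hypothetical illustrations; they are not Mockingbird's API or its required file layout.

```python
import csv
import io


def load_seed_csv(handle):
    """Read labeled seed data: one column per label, one value per row."""
    reader = csv.DictReader(handle)
    seeds = {label: [] for label in (reader.fieldnames or [])}
    for row in reader:
        for label, value in row.items():
            # Skip empty cells so that columns may differ in length.
            if label in seeds and value:
                seeds[label].append(value)
    return seeds


# Fabricated example values; in practice this would be an open file handle.
sample = io.StringIO("ssn,email\n000-00-0000,jane@example.com\n000-00-0001,\n")
seeds = load_seed_csv(sample)
```

The same dictionary-of-lists shape could just as easily be built by hand in Python, which is roughly what "inputting structured lists" amounts to.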
With seed data as its input, Mockingbird generates documents in multiple formats, placing the seed data in randomized locations within each document: for example, in the footnote of a docx file, or at the tail end of a yaml configuration. Mockingbird’s output is a unique set of documents that can be used for testing our data classification software.
A different way of thinking about this process of randomly shuffling-in seed data is that it’s a kind of fuzz testing for data classification engines. Randomly embedding seed data not only helps identify edge cases in our data classification process, but brings to light potential false positives by synthetically embedding noise into documents.
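The shuffling idea can be sketched in a few lines. This is an illustration of the general technique under simple assumptions (a document is a list of lines, and seed values are unique), not Mockingbird's actual implementation; `embed_seeds` is a hypothetical name.

```python
import random


def embed_seeds(lines, seeds, rng):
    """Insert each seed value at a random line position and log where it went."""
    document = list(lines)
    for value in seeds:
        position = rng.randint(0, len(document))  # len(document) appends at the end
        document.insert(position, value)
    # Record final positions only after all insertions have shifted the lines.
    log = {value: document.index(value) for value in seeds}
    return document, log


rng = random.Random(42)  # fixed seed so a failing case can be reproduced
filler = ["Meeting notes", "Budget review at 3pm", "See attached report"]
document, log = embed_seeds(filler, ["000-00-0000", "4111 1111 1111 1111"], rng)
```

Keeping the log alongside the document is what turns random output into a test case: a classifier's findings can be compared against the known positions.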
Mockingbird is responsible for generating both structured and unstructured documents (.json vs .docx), while ensuring that each document is randomized in both format and appearance.
To achieve this, Mockingbird acts as a coordinator throughout the process of generating new documents, and its logic is designed to keep randomness consistent across different document types. Mockingbird sets the direction of how a document should be produced: how long a document should be, how many pieces of embedded data are placed in each document, and where the embedded data should occur are all controlled by a central codebase.
A benefit of giving Mockingbird a bird’s-eye view over its files and folders is that it can ensure each file is properly produced and that the embedded contents are properly logged across a variety of different circumstances.
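One way to picture this coordinator role is a small plan object that is decided once, centrally, and then handed to whichever format writer produces the file. The sketch below is an assumption about how such a design could look; `DocumentPlan`, `make_plan`, and the length bounds are hypothetical and are not taken from Mockingbird's source.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class DocumentPlan:
    """Decisions made centrally, independent of the output format."""
    filename: str
    line_count: int
    embed_positions: tuple


def make_plan(filename, seed_count, rng, min_lines=10, max_lines=50):
    line_count = rng.randint(min_lines, max_lines)
    # Distinct positions, so each embedded value lands on its own line.
    positions = tuple(sorted(rng.sample(range(line_count), seed_count)))
    return DocumentPlan(filename, line_count, positions)


rng = random.Random(7)
plans = [make_plan(name, 3, rng) for name in ("notes.txt", "config.yaml")]
# A manifest of what was embedded where, kept alongside the generated files.
manifest = {plan.filename: plan.embed_positions for plan in plans}
```

Because every writer receives the same kind of plan, the randomness behaves the same way whether the output is a text file or a configuration file, and the manifest doubles as the log of embedded contents.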
Mockingbird generates unstructured data by simulating text commonly found in unstructured data formats. Unstructured data can come in many forms, such as plain text files, chat logs, or emails. User data comes in many different shapes and sizes, so being able to generate test documents in a multitude of formats is critical for evaluating a data classification engine.
In practice, any Python library capable of writing document files can quickly implement the needed methods to begin writing embedded unstructured data files in less than five minutes. The architecture of Mockingbird abstracts away the underlying mechanisms that randomize the length and style of the unstructured document under the hood, allowing for high-level modules to quickly and cleanly use Mockingbird’s interface.
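As a sketch of what such an interface might look like, the example below separates the randomized content (prepared upstream) from the format-specific serialization. The class and method names here are hypothetical and chosen for illustration; Mockingbird's real interface may differ.

```python
import io
from abc import ABC, abstractmethod


class UnstructuredWriter(ABC):
    """Hypothetical interface: a writer only decides how paragraphs are serialized."""

    @abstractmethod
    def write(self, paragraphs, handle):
        ...


class PlainTextWriter(UnstructuredWriter):
    def write(self, paragraphs, handle):
        handle.write("\n\n".join(paragraphs) + "\n")


class ChatLogWriter(UnstructuredWriter):
    def write(self, paragraphs, handle):
        # Alternate between two speakers to imitate a conversation.
        for index, text in enumerate(paragraphs):
            handle.write(f"[user{index % 2 + 1}] {text}\n")


# Assume length, style, and seed placement were already randomized upstream.
paragraphs = ["Hello there", "My SSN is 000-00-0000"]
buffer = io.StringIO()
ChatLogWriter().write(paragraphs, buffer)
```

With this split, supporting a new unstructured format means writing one small class rather than reimplementing the randomization logic.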
At the time of its initial release, Mockingbird can generate files in the following unstructured-data formats:
A general rule of thumb for Mockingbird: if there’s a Python library that will write a dictionary to your desired file format, then it can be wired up to Mockingbird in less than 10 lines of code. The process of randomizing column and row lengths is a large part of Mockingbird’s internal logic, allowing external structured-data libraries to quickly accept and convert any database-like object into a number of different formats.
The Python ecosystem already includes many libraries that provide this functionality, so Mockingbird’s compatibility with a larger set of file formats is already extensive.
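To illustrate the rule of thumb, the sketch below writes one column-oriented dictionary to both CSV and JSON with the standard library. The `to_csv` and `to_json` helpers and the table contents are assumptions made for this example; they show the general pattern, not Mockingbird's own writer code.

```python
import csv
import io
import json


def to_csv(table, handle):
    """Write a column-oriented dict as a header row followed by data rows."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(table.keys())
    writer.writerows(zip(*table.values()))


def to_json(table, handle):
    """Write the same dict as a list of row objects."""
    labels = list(table)
    rows = [dict(zip(labels, values)) for values in zip(*table.values())]
    json.dump(rows, handle)


# Fabricated values; columns are assumed to have equal length.
table = {"name": ["Jane Doe", "John Roe"], "ssn": ["000-00-0000", "000-00-0001"]}
csv_out, json_out = io.StringIO(), io.StringIO()
to_csv(table, csv_out)
to_json(table, json_out)
```

Each writer is only a few lines long because the table has already been shaped; the format library does the rest.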
At the time of its initial release, Mockingbird can generate files in the following structured-data formats:
We’ve broadly described Mockingbird as a tool for testing data classification; however, as a generic tool, it has multiple uses that we have in mind, including:
Today Mockingbird is being released as a mature piece of software with its 1.0 release; however, there’s still work to be done. We’ve identified several areas of future work, including:
As an open source project, Mockingbird will be developed in the open on GitHub. We encourage community contributions and your participation in the project!
Tyler Szeto served as lead engineer of the Mockingbird project and lead author of this blog post.
Additional members of the Open Raven team who supported the release include Brady Boyle, Matthew Daniel, Waverly Hsiao, Dave Lester, Oliver Ferrigni, and Igor Shvartser.
Thank you to AJ Venturella (@logix812) who graciously shared the Mockingbird package name on PyPI.