
Open Sourcing Mockingbird

Bele
Chief Corvus Officer
January 25, 2021

We’re pleased to announce the open source release of Mockingbird, a Python library for generating mock documents that are used to test data classification software.

The Mockingbird package is now available for download on PyPI, installable via `pip install mockingbird` on the command line, and the source code has been released under the permissive Apache v2.0 license on GitHub.

Tackling the Problem

Writing data classification software for locating sensitive data in both unstructured and structured text is a challenging technical problem, not unlike finding a needle in a haystack.

When tackled at a large scale, the problem becomes even more complex: imagine that the position of the needle changes, that the needle may jump between haystacks, and that the haystacks contain material that looks like needles, too. What began as a simple metaphor for finding a regex match in a file is now non-trivial, with the added complexity of handling different formatting standards and file formats.

Data is required to evaluate data classification software. To improve accuracy and performance, it’s important to test against documents with unique and randomized characteristics, or, continuing our metaphor, documents that belong to different haystacks.

But where do test documents come from? Using real-world data for the purposes of testing can raise ethical and privacy concerns. Testing on an organization’s internal sensitive files can be an operational security nightmare. And often, organizations don’t know where their sensitive data is stored to begin with.

At Open Raven, our approach has been to test our data classification software with synthetic data. This data is generated by Mockingbird, and with this open source release we’re making it easier for others to do the same.

How it Works

The Big Picture

The process of generating documents with Mockingbird begins with inputting seed data that’s user-defined or fabricated, and ends by embedding it within mocked documents.

Seed data can be any information that the user wishes to embed within documents; typically it ranges from falsified Social Security numbers to credit card details or other personal information. Mockingbird does not ship with seed data, but instead provides a simple interface for the user to supply this information themselves.

Users can provide a CSV file as input, and Mockingbird also supports Mockaroo as a way of generating labeled data and easily embedding it within documents. If a user is driving Mockingbird from Python code, they can also pass in structured lists directly. In the future we’d like to support the Faker Python library as an additional way to generate seed data.
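For concreteness, here's a minimal sketch of preparing a seed CSV with Python's standard library. The column layout is an assumption on our part; check the Mockingbird README for the exact input format it expects, and note that every value below is fabricated.

```python
# A minimal, runnable sketch of preparing seed data for Mockingbird.
# Assumption: one column per data label; consult the Mockingbird README
# for the exact CSV layout the library expects.
import csv

# Fabricated values only -- never use real personal data as seed input.
seed_rows = [
    {"ssn": "078-05-1120", "credit_card": "4111 1111 1111 1111"},
    {"ssn": "219-09-9999", "credit_card": "5500 0000 0000 0004"},
]

with open("seed_data.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["ssn", "credit_card"])
    writer.writeheader()
    writer.writerows(seed_rows)

# seed_data.csv can then be handed to Mockingbird as a CSV seed source, or
# the same dictionaries can be passed as structured lists from Python code.
```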

Image: Seed data sources (Mockaroo, the Faker PyPI library, and user-defined CSVs) feed into Mockingbird, which outputs generated documents (JSON, YAML, plain text, and more).

With seed data as its input, Mockingbird generates documents in multiple formats, placing the seed data at randomized locations within each document – for example, in the footnote of a .docx file or at the tail end of a YAML configuration. Mockingbird’s output is a unique set of documents that can be used to test data classification software.

A different way of thinking about this process of randomly shuffling in seed data is that it’s a kind of fuzz testing for data classification engines. Randomly embedding seed data not only helps identify edge cases in our data classification process, but also brings to light potential false positives by synthetically embedding noise into documents.

Under the Hood

Document Generation Process

Mockingbird is responsible for generating both structured and unstructured documents (.json vs .docx), while ensuring that each document is randomized in both format and appearance.

To do this, Mockingbird acts as a coordinator throughout the process of generating new documents, and its logic is designed to keep randomness consistent across different document types. Mockingbird sets the direction for how each document is produced: how long it should be, how many pieces of embedded data it contains, and where that data appears are all controlled by a central codebase.
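As a rough illustration of this coordinator idea (the names below are ours, not Mockingbird's actual internals), a central planner can randomize the shape of a document once and hand the same plan to every format writer:

```python
# Illustrative sketch only -- not Mockingbird's actual internals. One central
# place decides document length, how many seed values to embed, and where.
import random
from dataclasses import dataclass
from typing import List

@dataclass
class DocumentPlan:
    paragraph_count: int        # how long the document should be
    embed_positions: List[int]  # which paragraphs receive embedded seed data

def plan_document(seed_count: int, rng: random.Random) -> DocumentPlan:
    """Randomize the shape of one document, independent of its file format."""
    paragraph_count = rng.randint(5, 40)
    embeds = min(rng.randint(1, max(seed_count, 1)), paragraph_count)
    positions = sorted(rng.sample(range(paragraph_count), k=embeds))
    return DocumentPlan(paragraph_count, positions)

# The same plan can be handed to a .docx writer, a .txt writer, and so on,
# keeping randomness consistent across document types.
print(plan_document(seed_count=3, rng=random.Random(42)))
```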

Because Mockingbird has a bird’s-eye view over its files and folders, it can ensure that each file is properly produced and that the embedded contents are accurately logged across a variety of different circumstances.

Image: Mockingbird will track relevant metadata for each file generated. Knowing what embedded data was placed in each file is key to grading classification engines.
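The manifest shown in the image could look something like the record below; the field names are our own illustration, not Mockingbird's actual schema:

```python
# Illustrative only: the kind of per-file metadata the caption above refers
# to. Field names are assumptions, not Mockingbird's actual schema.
import json
from dataclasses import dataclass, asdict, field

@dataclass
class GeneratedFileRecord:
    path: str                    # where the file was written
    file_format: str             # e.g. "docx", "yaml", "avro"
    embedded_values: dict = field(default_factory=dict)  # label -> values placed in the file

record = GeneratedFileRecord(
    path="out/report_0001.docx",
    file_format="docx",
    embedded_values={"ssn": ["078-05-1120"]},
)

# Knowing exactly which values landed in which file is what makes it possible
# to grade a classification engine's hits and misses later.
with open("manifest.json", "w") as f:
    json.dump([asdict(record)], f, indent=2)
```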

Generating Unstructured Data

Mockingbird generates unstructured data by simulating text commonly found in unstructured formats. Unstructured data can come in many forms, such as plain text files, chat logs, or emails. Being able to generate data in a multitude of formats is crucial for testing data classification software: user data comes in many different shapes and sizes, so a robust corpus of test cases is critical for evaluating a data classification engine.

Image: Mockingbird can generate mocked Kubernetes log documents, mimicking how sensitive information can be leaked and potentially exposed in data breaches.
Image: Unstructured data is incredibly varied in how it can appear. In this example, numbers matching the pattern of Social Security numbers are embedded in a PowerPoint document, one of many potential sources of an unstructured data leak.

In practice, any Python library capable of writing document files can be wired up with the needed methods to begin producing embedded unstructured data files in less than five minutes. Mockingbird’s architecture abstracts away the underlying mechanisms that randomize the length and style of each unstructured document, allowing high-level modules to quickly and cleanly use Mockingbird’s interface.
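As a sketch of what such a format writer might look like (how Mockingbird actually calls its writers is an assumption here; only the python-docx usage is real):

```python
# Sketch of a pluggable unstructured-format writer. How Mockingbird invokes
# such a writer is assumed; the python-docx calls themselves are real.
# Requires: pip install python-docx
from typing import List
from docx import Document

def write_docx(output_path: str, paragraphs: List[str]) -> None:
    """Write already-randomized paragraphs (seed data included) to a .docx file."""
    document = Document()
    for text in paragraphs:
        document.add_paragraph(text)
    document.save(output_path)

# Mockingbird's core decides the paragraph contents and embed positions;
# the writer only persists them, which is why adding a new format is quick.
write_docx("example.docx", ["Quarterly notes.", "Contact SSN: 078-05-1120."])
```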

At the time of its initial release, Mockingbird can generate files in the following unstructured-data formats:

  • Microsoft Word (.docx)
  • Adobe PDF (.pdf)
  • Microsoft PowerPoint (.pptx)
  • Plain-Text files (.txt)
  • Kubernetes Log Files (.log)

Generating Structured Data

A general rule of thumb for Mockingbird: if there’s a Python library that will write a dictionary to your desired file format, then it can be wired up to Mockingbird in less than 10 lines of code. Randomizing column and row lengths is a large part of Mockingbird's internal logic, allowing external structured-data libraries to quickly accept and convert any database-like object into a number of different formats.

The Python ecosystem already includes many libraries that provide this functionality, so Mockingbird’s coverage of structured file formats is already extensive.

Code: The “avro” module for Mockingbird, which allows Mockingbird to generate Avro files in less than 10 lines of code. Each time it runs, a new and unique Avro file, with different headers and columns, is created.
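The original module isn't reproduced here; the snippet below is our own minimal sketch of the same idea using the fastavro package, to show how little code is needed to flush a dictionary-like payload to an Avro file:

```python
# Not the actual Mockingbird "avro" module -- a minimal sketch of the idea
# the caption describes, using fastavro (pip install fastavro).
from typing import Dict, List
from fastavro import parse_schema, writer

def write_avro(output_path: str, rows: List[Dict[str, str]]) -> None:
    # Derive a simple all-string schema from the (randomized) column names.
    fields = [{"name": key, "type": "string"} for key in rows[0]]
    schema = parse_schema({"type": "record", "name": "MockRecord", "fields": fields})
    with open(output_path, "wb") as out:
        writer(out, schema, rows)

write_avro("example.avro", [{"ssn": "078-05-1120", "note": "synthetic row"}])
```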

At the time of its initial release, Mockingbird can generate files in the following structured-data formats:

  • Avro
  • CSV
  • JSON
  • Kubernetes Logs
  • ODS Spreadsheets
  • XLSX Spreadsheets
  • YAML

Mockingbird Use Cases

We’ve broadly described Mockingbird as a tool for testing data classification; however, as a generic tool, we have multiple uses in mind, including:

  • Testing the Open Raven Platform. Mockingbird is what we use to test Open Raven’s data scanning and classification features. You can use it to generate your own demo data, or use a previously generated dataset to get started on Open Raven.
  • Sourcing privacy data. Mockingbird provides an alternative to using your production data for testing data workflows generally. You can now have endless amounts of data that appears sensitive, but isn’t.
  • CI and testing workflows. Mockingbird is lightweight and easily plugs into existing CI pipelines that would otherwise rely on real data. Regression, performance, and load testing are easier thanks to straightforward data generation (see the sketch after this list).
  • Machine learning. Mockingbird can be used to prevent overfitting by generating a larger variety of data for machine learning workflows. It can also generate a variety of test sets easily.
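As a generic illustration of the CI idea (not Open Raven tooling), a regression test can treat the generation manifest as ground truth and plug in whatever classifier you are evaluating. The manifest format here follows the earlier sketch, and `classify_file` is a placeholder for your own engine:

```python
# Generic sketch of a CI regression check. Assumptions: the manifest format
# from the earlier sketch, and a user-supplied classify_file() function.
import json
from typing import Set

def classify_file(path: str) -> Set[str]:
    """Placeholder: return the sensitive values your classifier found in `path`."""
    raise NotImplementedError

def test_classifier_against_manifest(manifest_path: str = "manifest.json") -> None:
    with open(manifest_path) as f:
        manifest = json.load(f)
    for entry in manifest:
        expected = {v for values in entry["embedded_values"].values() for v in values}
        missed = expected - classify_file(entry["path"])
        assert not missed, f"{entry['path']}: missed {missed}"
```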

Future Work and Open Source Extensibility

Today Mockingbird is being released as a mature piece of software with its 1.0 release; however, there’s still work to be done. We’ve identified several areas of future work, including:

  • Additional file types: More file formats, including Parquet files, synthetically generated websites, and image formats such as PNG.
  • Database records: In addition to files, the ability to generate databases of data.
  • Input data sources: We currently offer the ability to read data from CSV files as well as from the Mockaroo API, but the more ways to input seed data the merrier.
  • Setting arbitrary file sizes: Today Mockingbird can be configured to specify the number of files generated; future goals include the ability to specify the size of generated files, such as 500 MB CSV files or 20 MB PDFs.
  • Performance enhancements: Mockingbird was designed with future performance in mind, including an architecture to enable parallelization. Future work includes using multiple threads to generate files faster.

As an open source project, Mockingbird will be developed in the open on GitHub. We encourage community contributions and your participation in the project!

Acknowledgements

Tyler Szeto served as lead engineer of the Mockingbird project.

Additional members of the Open Raven team who supported the release include Brady Boyle, Matthew Daniel, Waverly Hsiao, Dave Lester, Oliver Ferrigni, and Igor Shvartser.

Thank you to AJ Venturella (@logix812) who graciously shared the Mockingbird package name on PyPI.
