
Transforming Data Classification Testing with Generative AI

Chief Corvus Officer
June 5, 2023

As the landscape of information technology continually evolves, data classification stands as a vital tool for enhancing data management, data security, and compliance. For those working on data classification software, testing is an unavoidable yet challenging task. The challenge stems primarily from the difficulty of procuring a diverse array of test documents. With the advent of advanced generative AI models such as GPT-4, however, this hurdle can be substantially overcome.

The Dilemma of Real Documents

Traditionally, real documents serve as the foundation for testing data classification software. Yet this approach raises both ethical and practical problems.

Ethically, real documents often house sensitive or confidential information. Utilizing these documents for testing purposes runs the risk of breaching privacy laws and ethical guidelines. Any inadvertent leak of such information during testing could provoke severe legal and reputational repercussions.

Practically, obtaining a diverse set of real documents is a formidable task. Gathering an ample variety of documents across different domains and formats, such as financial reports, emails, or vehicle shipment reports, is nearly impossible. Even if we gather a few, they may not cover the full range of potential data types and formats that our software might encounter in the real world.

Generative AI Solving an Ethical Dilemma

Generative AI emerges as a potent solution, effectively addressing these ethical and practical concerns. Using AI models like GPT-4, we can generate synthetic documents in various text formats, such as Markdown, plaintext, or LaTeX. These documents can then be converted into file formats like PDF, DOCX, TXT, and MD for testing purposes.

Generative AI has the flexibility to generate documents based on specified parameters, including the type and complexity of the document required. This adaptability allows us to simulate a range of scenarios, from generating technical financial reports to creating informal chat messages, and everything in between.
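To make the idea of parameter-driven generation concrete, here is a minimal sketch of how such parameters could be captured and rendered into a prompt. The `DocumentSpec` class, its fields, and the prompt wording are all hypothetical and are not taken from our engine; the sketch stops at building the prompt text and does not call any model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DocumentSpec:
    """Hypothetical knobs for one synthetic test document."""
    doc_type: str         # e.g. "financial report", "chat transcript"
    complexity: str       # e.g. "technical", "informal"
    output_format: str    # e.g. "markdown", "plaintext", "latex"
    sensitive_value: str  # the value the classifier is expected to find


def build_prompt(spec: DocumentSpec) -> str:
    """Render a spec into an instruction for a text-generation model."""
    return (
        f"Write a {spec.complexity} {spec.doc_type} in {spec.output_format}. "
        "All names, figures, and identifiers must be fictional. "
        f"Mention this value exactly once, verbatim: {spec.sensitive_value}"
    )


# Two specs that plant the same placeholder value in very different documents.
specs = [
    DocumentSpec("financial report", "technical", "markdown", "11111111111111111"),
    DocumentSpec("chat transcript", "informal", "plaintext", "11111111111111111"),
]
prompts = [build_prompt(spec) for spec in specs]
```

Keeping the planted value in the spec means every generated document carries its own ground truth, which is what makes the output usable as a test case.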

The use of generative AI presents numerous significant advantages:

1. Privacy Preservation: Since the documents are AI-generated and contain no real sensitive information, we can thoroughly test our software without the risk of exposing any private data.

2. Extensive Coverage: The ability to generate virtually any type of document ensures our testing process covers a broad spectrum of data types and formats, thereby enhancing the overall reliability of our data classification software.

3. Efficiency: The quick and automatic generation of testing files through AI saves time and resources compared to manually procuring and preparing testing files.


To illustrate the practical applications and benefits of using generative AI in data classification testing, let's explore some specific scenarios. These examples showcase how synthetic documents generated by advanced AI models like GPT-4 can simulate real-world data and enable thorough testing of data classification software. By leveraging the power of generative AI, these scenarios offer a glimpse into the vast potential of this technology to transform the testing landscape.

With this technology, we can generate a broad spectrum of documents that mirror those used in different industries. For instance, while serving a car manufacturing company, we can generate vehicle-related documents such as assembly instructions, parts inventories, quality control reports, and vehicle shipment schedules. Similarly, for a healthcare company, we can generate synthetic documents typical in the healthcare sector, including medical reports, prescriptions, patient records, and insurance claims. For this blog post, we focus on Vehicle Identification Numbers (VIN) and how the same piece of sensitive data can be embedded in different scenarios.
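As a rough illustration of embedding one sensitive value in several contexts, the sketch below builds a random VIN-shaped string with a valid check digit (the position-9 scheme used for North American VINs) and drops it into three small templates. The function names, the templates, and the log line layout are assumptions made for this example. The generator does not model manufacturer codes, model-year rules, or any other field semantics, so its output is only structurally plausible.

```python
import json
import random

# VINs never contain the letters I, O, or Q.
VIN_ALPHABET = "ABCDEFGHJKLMNPRSTUVWXYZ0123456789"
# Transliteration values for letters; digits stand for themselves.
LETTER_VALUES = dict(zip(
    "ABCDEFGHJKLMNPRSTUVWXYZ",
    [1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 7, 9, 2, 3, 4, 5, 6, 7, 8, 9],
))
# Positional weights; position 9 (index 8) is the check digit, weight 0.
WEIGHTS = [8, 7, 6, 5, 4, 3, 2, 10, 0, 9, 8, 7, 6, 5, 4, 3, 2]


def check_digit(vin: str) -> str:
    """Compute the check digit for a 17-character VIN-shaped string."""
    total = sum(
        (int(ch) if ch.isdigit() else LETTER_VALUES[ch]) * weight
        for ch, weight in zip(vin, WEIGHTS)
    )
    remainder = total % 11
    return "X" if remainder == 10 else str(remainder)


def synthetic_vin(rng: random.Random) -> str:
    """Random 17-character string with a consistent check digit."""
    chars = [rng.choice(VIN_ALPHABET) for _ in range(17)]
    chars[8] = "0"  # placeholder; this position has weight 0
    chars[8] = check_digit("".join(chars))
    return "".join(chars)


def embed(vin: str) -> dict[str, str]:
    """Place the same VIN into chat, JSON, and log-style text."""
    return {
        "chat": f"alice: found the car, VIN is {vin}\nbob: great, send the listing",
        "json": json.dumps({"vehicle": {"make": "Toyota", "model": "RAV4", "vin": vin}}),
        "log": f"ERROR c.e.OrderService - failed to process Order(vin={vin})",
    }


vin = synthetic_vin(random.Random(0))  # seeded for repeatable test data
documents = embed(vin)
```

Seeding the random generator keeps the test corpus reproducible from run to run.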

Plaintext Chat File with Mentioned VIN:

In this example, we simulate a chat conversation between two users discussing a car purchase. The conversation includes a mention of a VIN. By generating a synthetic plaintext file, our generative AI engine can create similar chat messages with various scenarios, enabling us to test our data classification software's ability to identify and classify sensitive information like VINs accurately.

Example of a chat conversation that contains a VIN for a Toyota RAV4.
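A check of this kind could look like the sketch below: a regular expression finds 17-character candidates in free text, and a check-digit test filters out look-alikes such as order references. This is an illustrative detector written for this post, not the matching logic of our product, and the chat text is invented. The VIN-shaped value in it is simply a string whose check digit works out under the North American scheme; it is not meant to be a real Toyota VIN.

```python
import re

# 17 characters drawn from digits and letters other than I, O, and Q.
VIN_PATTERN = re.compile(r"\b[A-HJ-NPR-Z0-9]{17}\b")
LETTER_VALUES = dict(zip(
    "ABCDEFGHJKLMNPRSTUVWXYZ",
    [1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 7, 9, 2, 3, 4, 5, 6, 7, 8, 9],
))
WEIGHTS = [8, 7, 6, 5, 4, 3, 2, 10, 0, 9, 8, 7, 6, 5, 4, 3, 2]


def has_valid_check_digit(vin: str) -> bool:
    """True if position 9 matches the weighted sum modulo 11."""
    total = sum(
        (int(ch) if ch.isdigit() else LETTER_VALUES[ch]) * weight
        for ch, weight in zip(vin, WEIGHTS)
    )
    remainder = total % 11
    expected = "X" if remainder == 10 else str(remainder)
    return vin[8] == expected


def find_vins(text: str) -> list[str]:
    """Return VIN candidates in the text that pass the check-digit test."""
    # Upper-case first, since chat users often type identifiers in lower case.
    return [
        match.group()
        for match in VIN_PATTERN.finditer(text.upper())
        if has_valid_check_digit(match.group())
    ]


chat = (
    "alice: The car looks good. VIN is 1M8GDM9AXKP042788\n"
    "bob: Order ref ABCDEFGH123456789 is unrelated."
)
found = find_vins(chat)
```

Here the order reference has the right shape but fails the check digit, so only the first value is reported. Note that VINs issued outside North America do not always carry a valid check digit, so a stricter filter like this trades some recall for precision.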

VIN Referenced in API Response:

This example represents an API endpoint response payload. It includes a sensitive exposure of a VIN within the text. By generating a synthetic JSON file, our engine can simulate similar API JSON responses with varying content, allowing us to assess the performance of our data classification software in detecting and protecting sensitive information within structured data.

Within the context of Open Raven, this helps ensure that our JSON parser handles a wide variety of JSON structures.
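To show why varied JSON structures matter, the following sketch walks an arbitrarily nested payload and reports where VIN-shaped strings occur. It is a simplified stand-in written for this post, not Open Raven's parser; the payload, the path notation, and the function names are assumptions. It matches on shape only and inspects string values, not keys.

```python
import json
import re
from typing import Any, Iterator

# 17 characters drawn from digits and letters other than I, O, and Q.
VIN_PATTERN = re.compile(r"\b[A-HJ-NPR-Z0-9]{17}\b")


def iter_strings(node: Any, path: str = "$") -> Iterator[tuple[str, str]]:
    """Yield (path, value) for every string value in a parsed JSON tree."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from iter_strings(value, f"{path}.{key}")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from iter_strings(value, f"{path}[{index}]")
    elif isinstance(node, str):
        yield path, node


def find_vins_in_json(payload: str) -> list[tuple[str, str]]:
    """Return (path, match) pairs for VIN-shaped strings in a JSON document."""
    hits = []
    for path, text in iter_strings(json.loads(payload)):
        for match in VIN_PATTERN.finditer(text):
            hits.append((path, match.group()))
    return hits


# A made-up API response with one VIN in free text and one in a nested field.
payload = json.dumps({
    "status": "ok",
    "results": [
        {"id": 7, "notes": "Customer asked about 1M8GDM9AXKP042788 on the phone"}
    ],
    "meta": {"vehicle": {"vin": "11111111111111111"}},
})
hits = find_vins_in_json(payload)
```

Reporting the path alongside the match makes it easy to confirm that a synthetic document exercised the nesting depth or array position it was generated to cover.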

Invoice Document with VIN:

The provided invoice contains a VIN along with other transactional details. By generating synthetic invoice documents, our generative AI engine can create diverse invoices, mimicking real-world scenarios. This allows us to evaluate the accuracy and reliability of our data classification software in identifying and safeguarding sensitive information within document formats like PDF or DOCX.

Misconfigured Spring Boot Log with Serialized Object: 

This example shows an error log produced by a misconfigured Spring Boot application. The application accidentally writes a serialized object that includes a VIN to the log. By generating similar log files with synthetic data, our generative AI engine can simulate various error scenarios, enabling us to test our data classification software's ability to detect and prevent inadvertent exposure of sensitive information within log files and other semi-structured formats.

These examples collectively demonstrate the power of generative AI in creating a wide range of testing documents. By leveraging synthetic data generated by our engine, we can simulate diverse data exposure scenarios and verify that our data classification software identifies and protects sensitive information across different domains, formats, and scenarios. This approach enhances the reliability, efficiency, and privacy of the data classification testing process, leading to robust software solutions in the field of data management, security, and compliance.
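Because every synthetic document is generated with a known planted value, classifier output can be scored against that ground truth. The sketch below computes precision and recall over a small corpus; the file names, the detections, and the `score` helper are invented for illustration and do not reflect measured results.

```python
from dataclasses import dataclass


@dataclass
class Score:
    precision: float
    recall: float


def score(expected: dict[str, set[str]], detected: dict[str, set[str]]) -> Score:
    """Compare detected values per document against the planted ones."""
    true_pos = false_pos = false_neg = 0
    # Cover documents that appear on either side of the comparison.
    for doc_id in expected.keys() | detected.keys():
        truth = expected.get(doc_id, set())
        found = detected.get(doc_id, set())
        true_pos += len(truth & found)
        false_pos += len(found - truth)
        false_neg += len(truth - found)
    # With nothing to find or nothing reported, treat the ratio as perfect.
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 1.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 1.0
    return Score(precision, recall)


# Hypothetical corpus: what was planted versus what a classifier reported.
expected = {
    "chat.txt": {"1M8GDM9AXKP042788"},
    "api.json": {"11111111111111111"},
    "invoice.md": set(),
}
detected = {
    "chat.txt": {"1M8GDM9AXKP042788"},
    "api.json": set(),                    # missed VIN
    "invoice.md": {"ABCDEFGH123456789"},  # false alarm
}
result = score(expected, detected)
```

In this made-up run there is one hit, one miss, and one false alarm, so both precision and recall come out at 0.5.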


In conclusion, the application of generative AI, like GPT-4, in the realm of data classification testing marks a significant milestone. Its ability to create diverse synthetic documents eliminates the need for real, sensitive documents in testing, addressing privacy concerns and ethical dilemmas that have long been associated with traditional testing methodologies.

Moreover, the time and resource efficiency brought forth by AI-generated test documents allows for an optimized testing process. By automating the generation of testing files, we can focus on refining the core features of our software, fostering continuous innovation, and reducing the time to market.

With generative AI's capability to create industry-specific documents, we can cater to the unique needs of different sectors, thereby offering a customized, high-performance data classification solution to our customers.
