
Open Raven vs. Macie: Data Classification Benchmarking (Part 1)

April 1, 2021

Open Raven’s core focus is securing our customers’ data by solving a key security problem: visibility. To build an effective, transparent security posture, security and cloud teams need ready answers to two questions: first “where is our data?” and then “what type of data do we have?” The answer to the second is found in classification.

Why Data Classification Matters

If cloud and security teams can see that various S3 buckets exist but not which types of sensitive data each one holds (financial, health, personal, etc.), they’ll work hard and still fall short of their ultimate goal: securing the data. Classification identifies the types of data within each bucket so that teams can define policies and monitor for mismatches between data types and those policies, such as a rule that payment card details must never sit in an open S3 bucket.
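To make the idea of a mismatch between data types and policies concrete, here is a minimal sketch in Python. It assumes a classifier has already labeled each bucket; the `Bucket` class, the label strings, and the `FORBIDDEN_IN_PUBLIC_BUCKETS` policy are hypothetical names chosen for illustration and are not part of Open Raven or Amazon Macie.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple

# Hypothetical policy: data types that must never appear in a public bucket.
FORBIDDEN_IN_PUBLIC_BUCKETS = {"payment_card", "health", "personal"}


@dataclass
class Bucket:
    """Illustrative stand-in for a bucket plus its classification results."""
    name: str
    is_public: bool
    data_types: Set[str] = field(default_factory=set)  # labels from a classifier


def find_policy_mismatches(buckets: List[Bucket]) -> List[Tuple[str, List[str]]]:
    """Return (bucket name, offending data types) for every public bucket
    that holds a data type the policy forbids."""
    mismatches = []
    for bucket in buckets:
        if not bucket.is_public:
            continue  # this policy only restricts public buckets
        offending = bucket.data_types & FORBIDDEN_IN_PUBLIC_BUCKETS
        if offending:
            mismatches.append((bucket.name, sorted(offending)))
    return mismatches


# Made-up inventory for the example.
buckets = [
    Bucket("marketing-assets", is_public=True, data_types={"images"}),
    Bucket("checkout-logs", is_public=True, data_types={"payment_card", "logs"}),
    Bucket("hr-records", is_public=False, data_types={"personal", "health"}),
]

print(find_policy_mismatches(buckets))
```

Only `checkout-logs` is flagged here: it is public and contains payment card data. Without the `data_types` labels, the check has nothing to compare against the policy, which is why classification has to come first.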

The Challenges of Data Classification

Data classification has two core challenges: accuracy (does it categorize data correctly, minimizing false positives?) and coverage (does it find every instance of the data category?). These challenges are universal, extending across every file format and structure.
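One common way to put numbers on these two challenges is with precision, recall, and F1, the metrics named for Part 3 of this series. Reading accuracy as precision and coverage as recall is an interpretation on our part here, and the sketch below is not the benchmarking tooling itself. The function name and the sample findings are hypothetical.

```python
from typing import Set, Tuple

Finding = Tuple[str, int]  # hypothetical shape: (object key, line number)


def precision_recall_f1(expected: Set[Finding], found: Set[Finding]):
    """Compare classifier findings against a known ground truth."""
    true_pos = len(expected & found)    # correctly reported instances
    false_pos = len(found - expected)   # reported, but not really there
    false_neg = len(expected - found)   # really there, but missed

    reported = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / reported if reported else 0.0
    recall = true_pos / actual if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up ground truth: where card numbers were planted in a test corpus.
expected = {("a.csv", 1), ("a.csv", 7), ("b.txt", 3), ("c.log", 12)}
# Made-up classifier output: two hits, one false positive, two misses.
found = {("a.csv", 1), ("b.txt", 3), ("b.txt", 9)}

precision, recall, f1 = precision_recall_f1(expected, found)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In this made-up case, two of the three reported findings are correct (precision 2/3) and two of the four planted instances are found (recall 1/2), giving an F1 of 4/7. A classifier can score well on one measure and poorly on the other, which is why both challenges need to be tracked.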

In this blog post series, we’ll describe how we benchmark data classification efficacy and share the supporting metrics in detail. We believe this transparency is core to building our customers’ trust and confidence in our product’s performance. Classification mistakes are costly: they can lead to data leaks that turn into serious incidents if not remedied, and false positives cause “alert fatigue”.

Why Our Industry Needs Independent Testing

Independent testing allows customers to trust their vendors, and vendors to improve their quality assurance. Anti-virus products have been put through the independent testing gauntlet for decades, giving potential customers invaluable insight into the real-world effectiveness of those products. Regrettably, no equivalent exists for data classification, visibility, or governance solutions. By taking a first step toward greater transparency in our industry, we hope to see independent benchmarking take shape and improve all solutions over time.

Changing the Status Quo

Open Raven is releasing, as open source, our methodology, tools, and data classification benchmarking results, comparing our product against one of the best-known solutions in the space: Amazon Macie. This blog series will detail every step, allowing both other solution providers and our customers to independently reproduce the test data, run the benchmarks, and come to their own conclusions. Initially, we’ll focus on unstructured data (objects in S3 buckets), with structured data to follow in a later series.

This series will consist of the following: 

  • Part 1 - Intro to data classification (this blog post)
  • Part 2 - Getting the data: How to generate a corpus of test data
  • Part 3 - How we perform the benchmarking: Precision/Recall/F1 metrics
  • Parts 4 through 7 - Benchmarking sensitive data types
  • Part 8 - Head-to-Head: Open Raven vs. Amazon Macie
  • Part 9 - Performance and Cost: How Fast and Costly Is Data Scanning
  • Part 10 - Future improvements and approaches in data classification 

author
Mike Andrews
Head of Engineering at Open Raven