This is the first part of a seven-part series about designing and building data classification systems for security and privacy. It is a technical series with code-level examples, written specifically for security professionals who want to peek under the hood. It is unapologetically about AWS, the undisputed king of the clouds, and about data warehouses, lakes and lake houses. The important data breaches these days just aren't coming from Word docs on your file server or a spreadsheet in an email inbox.
You could certainly take this series as a blueprint to roll your own, but as I am sure you will see, it just ain't as simple as you might think. Chapeau if anyone does. Let me know and I'll buy you a beer, or better still, hire you and pay you very well. You will deserve it. We've come to learn that if you want to look at big, sophisticated data then you have to build a big, sophisticated system; it is that simple. One of the most common phrases on our Slack is "we are gonna need a bigger boat", usually preceded by some expletive. There have also been some pleasant surprises, such as the fact that regexes perform better (we will define "better" in that section) on a lot of sensitive data classes than fancy pants ML, and we cover those too. But spoiler alert: you can't just run a regex and expect to get anywhere near useful results.
The series goes like this:
- Part One - Intro
- Part Two - Warehouse, lakes, lake houses and data technology landscape
- Part Three - Finding and determining what data to classify
- Part Four - Practical classification techniques
- Part Five - Scaling and optimizing analysis
- Part Six - Testing for accuracy and performance
- Part Seven - Open Raven vs AWS Macie
Part one, this post, is the only non-technical post written to frame up the series.
In part two we explain what has happened in the data world over the last decade, so you know where you need to be looking for sensitive data, why, and what technologies to keep an eye on. We explain why AWS S3, an effectively unbounded blob storage service that you have no doubt seen time and time again in data breaches, sits at the heart of most modern data systems.
In part three we describe how to find where the interesting data is stored. We start by explaining how to find all the S3 buckets in your AWS accounts, complete with code examples and IAM roles. We then dive into creating an object index across all of those buckets, something that can grow to billions and even trillions of objects in a decent-sized AWS infrastructure, and frankly something we are still fighting and likely always will be. We talk about bloom filters, finding the scale limits of indexing in Elasticsearch, using Apache Tika and other file-handling libraries to examine MIME types, and how to scope down the interesting files to look at. We finish by explaining why Apache Parquet, a columnar file format, is where most of the important data is stored today, and we look at how columnar file formats work.
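To give a flavor of the bloom filter idea ahead of part three, here is a minimal sketch (not Open Raven's implementation) of why they help with enormous object indexes: a few kilobytes of bits can answer "have we possibly seen this key before?" without touching the full index. The sizes and hash count below are illustrative.

```python
import hashlib


class BloomFilter:
    """Minimal bloom filter: a fixed bit array plus k hash positions per item.

    Lookups can return false positives but never false negatives, which
    makes it a cheap front-line check before querying a full object index.
    """

    def __init__(self, size_bits: int = 8192, num_hashes: int = 3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: str):
        # Derive k positions by salting a single hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # True means "possibly seen"; False is a guaranteed miss.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )
```

The guarantee that matters for an indexer is the absence of false negatives: anything you added will always report as possibly present, so a `False` lets you skip work safely.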
In part four we dive into how to classify data. We look at the differences between structured, semi-structured and unstructured data, the techniques used to match data, and why data adjacency is so important. We deep-dive into regular expressions, a technique that works surprisingly well, examining patterns to use and patterns to avoid. We then look at why data validation is needed to reduce noise, why we built a data validation API, and how it works.
Part five gets into how to design scanning systems that scale, and by 'scale' we mean being able to scan petabyte-scale files in a reasonable amount of time. We use AWS Lambda, the serverless compute service that lets us spin up thousands of scanners in parallel and chunk through petabytes of data. To do this you have to be able to track which parts of a file you are scanning and how much of the file each Lambda has burned down, and you have to manage things like Lambda resource pools so you don't take down other systems. We explain it all. We also explain techniques for sampling data and for using the data structures themselves to optimize scanning speeds.
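The chunk-tracking idea can be sketched in a few lines (an assumption-laden illustration, not our scheduler): split an object into inclusive byte ranges, one per parallel worker, each of which an individual Lambda could fetch with an S3 ranged GET. The 64 MB default below is arbitrary.

```python
def plan_chunks(object_size: int, chunk_size: int = 64 * 1024 * 1024) -> list[tuple[int, int]]:
    """Split an object into inclusive byte ranges, one per worker.

    Each (start, end) pair maps directly onto an S3 ranged GET header,
    e.g. "Range: bytes=0-67108863". Tracking which ranges are done is
    what tells you how much of the file each scanner has burned down.
    """
    if object_size <= 0:
        return []
    return [
        (start, min(start + chunk_size, object_size) - 1)
        for start in range(0, object_size, chunk_size)
    ]
```

A real scanner has to do more than this, of course: matches that straddle a chunk boundary, retries, and per-account Lambda concurrency limits all complicate the picture, which is what part five digs into.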
Part six then describes how we built a complete test harness to determine accuracy and performance across all of the data classes and all of the file types we support. If you want to learn about F-scores, precision and recall rates, this is the post. It's cool stuff, and one part of this work has resulted in an open source tool called MockingBird.
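For readers new to these metrics, the definitions are compact enough to show here (standard textbook formulas, not anything specific to our harness): precision is how many of your findings were real, recall is how many of the real things you found, and F1 is their harmonic mean.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Standard classification metrics from raw counts.

    precision = TP / (TP + FP)   -- of everything flagged, how much was real
    recall    = TP / (TP + FN)   -- of everything real, how much was flagged
    F1        = harmonic mean of precision and recall
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1
```

The harmonic mean matters because it punishes imbalance: a scanner that flags everything gets perfect recall but terrible precision, and its F1 collapses accordingly.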
Part seven is the bragging-rights post: Open Raven vs. AWS Macie. It's a summary of a whitepaper in which we take Open Raven and AWS Macie and compare them side by side. We took a scientific approach, with an experiment that you can repeat in your own lab (or in your home lab, I guess, these days). When we first built our testing framework we matched or beat Macie in most areas but lost in others. Today we beat it in almost every area, and in most cases massively.
We plan to publish the rest of this series about once a week. Throughout the series I will be co-authoring with other members of the Open Raven product team, including Mike Andrews (Chief Architect), Igor Shvarster (Technical Product Manager), Tyler Szeto (Engineer) and a number of other folks who have done the hard work.