Discover and Classify Data

Signal vs Noise - The Cycle of Innovation in Data Classification - Part 4 of 5

Chief Corvus Officer
April 8, 2021

This is the fourth part of a five-part series (Part One | Part Two | Part Three | Part Five).

I have been building security code scanning tools for over a decade. I started in the late 2000s at Microsoft, where I owned CAT.NET, an intra-procedural static analysis tool for .NET MSIL code that plugged into Visual Studio as an extension. Visual Studio already had a built-in security linter (see below). In recent years, at SourceClear (acquired by Veracode), our entire product was built on an inter-procedural static analysis engine that determined whether custom code was calling methods in vulnerable open source libraries.

I like to think I have been around the block enough to have seen things come in and go out of fashion. 

If you don’t know about code analysis, security analysis techniques have generally evolved in roughly the following innovation cycles.

Cycle one “Linting” - simple grep over parts of code

Cycle two “Intra-procedural” - analyzing parts of the code

Cycle three “Inter-procedural” - analyzing the entire code (including all the dependencies) 

Cycle four “Data flow analysis” - tracing data through the entire code base

Each technique built on the previous innovation cycle. Cycle one tools were fast and familiar but generated a lot of noise. Cycle two tools were better but slower and, as time went on, missed more and more of the sophisticated vulnerabilities that attackers were exploiting. Cycle three tools were able to find those more sophisticated vulnerabilities but in doing so generated lots of noise, which led to cycle four: improved efficacy at the expense of very slow scanning. For cycle four tools to work, the software had to build, and the analysis engine had to construct an inter-procedural call graph and a data flow graph and then run complex algorithms across them. Some analysis runs would literally run overnight.
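To make the gap between the first two cycles concrete, here is a minimal sketch in Python. It is an illustration only, not how CAT.NET or any commercial scanner works: `SOURCE`, `lint_grep` and `scan_ast` are hypothetical names, and the "vulnerability" is simply any call to `eval`. The cycle one check greps lines of text; the cycle two style check parses the code and looks inside each function for real calls.

```python
import ast
import re

# Hypothetical code under analysis (an assumption made for this example).
SOURCE = '''
def safe():
    # never call eval(user_input) here
    return "eval(x) is dangerous"

def risky(user_input):
    return eval(user_input)
'''


def lint_grep(source):
    """Cycle one: flag every line whose text contains 'eval('."""
    pattern = re.compile(r"\beval\(")
    return [
        number
        for number, line in enumerate(source.splitlines(), start=1)
        if pattern.search(line)
    ]


def scan_ast(source):
    """Cycle two (roughly): parse the code and flag real eval calls per function."""
    findings = []
    for func in ast.walk(ast.parse(source)):
        if not isinstance(func, ast.FunctionDef):
            continue
        for node in ast.walk(func):
            is_eval_call = (
                isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id == "eval"
            )
            if is_eval_call:
                findings.append((func.name, node.lineno))
    return findings


if __name__ == "__main__":
    print("grep findings (line numbers):", lint_grep(SOURCE))
    print("AST findings (function, line):", scan_ast(SOURCE))
```

The grep pass reports three lines, two of which are only a comment and a string literal, while the parsed pass reports the single real call in `risky`. Neither pass follows data across function boundaries, which is where the cost of cycles three and four comes from.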

When I take a step back, I see three important lessons from building security code scanners over the years.

  1. You can choose to do fast and cheap scanning, which generally produces low efficacy results with lots of false negatives and lots of false positives, unless you narrow scanning down to a very limited set of specific issues that are known to perform well using the technique. You can also choose to do slow and expensive scanning, which will generally produce higher efficacy results with fewer false positives and false negatives. The choices are not binary and there are many optimizations between the two approaches.
  2. A user's appetite for signal vs noise usually boils down to their risk tolerance. The willingness to accept the increased cost of scanning (time, money, friction, etc.) is in return for an improved signal to noise ratio. If the user has a low tolerance for risk, i.e., it is very important that results are as accurate as they can be, then they have to be willing to accept the increased cost of scanning. Conversely, if they have a high tolerance for risk, i.e., it's only really important to catch low-hanging fruit, then they trade efficacy for speed and cost.
  3. Lastly, what I learned was that developers generally didn’t care for security (maybe controversial, but I think it is a truism), and so to get developers to adopt code scanning tools you need to find a balance between cost, which usually matters most to developers, and efficacy, which usually matters most to security people. Some code analysis tool vendors, who shall remain nameless, failed miserably at providing this balance and became pariahs to developers.

What happened with code scanning tools is that the more sophisticated they got, the slower and more computationally expensive they became to run and the more they got in the way of innovation. The end result was people stopped using them and have started to swing back to a new generation of simpler cycle two tools. So why does this all matter for a data security company?

"Those who fail to learn from history are condemned to repeat it." – Winston Churchill

It is just over two years since Dave and I set out to build the best damn data security platform ™ we could imagine, and consistently from those early days until today, we have heard the same core use cases from users:

  1. Tell me where my data is
  2. Tell me what type of data I have 
  3. Tell me how my data is protected
  4. Tell me who has access to my data

“Tell me what type of data I have” is of course a platform feature called data classification and, call it what you will, that means data scanning. We are writing lots of content these days about designing and building data classification systems, including our Macie vs Open Raven series in which we openly boast about our results. To let you in on a little secret, our data classification engine design has leaned heavily on my experience building code scanning tools.

The evolution of data classification tools is following a path eerily similar to the cycles of code scanning tools.

Cycle One - Regex - simple data pattern matching

Cycle Two - Data Adjacency - analysis of patterns and their surrounding context

Cycle Three - ML - sophisticated analysis to look for specific data 

Cycle Four - Data Validation - checking that matches are actually real data and not just theoretical pattern matches 

Like code scanning tools, cycle one tools are fast but have a very low level of efficacy, generating a lot of noise. These are the privacy tool vendors, the first legacy generation of the technology. Cycle two tools are much better but are undoubtedly more expensive to run, as you need to perform additional processing. In reality this is today's state of the art. Cycle three is about ML: everyone claims to do it, and everyone knows the claims are largely marketing spin. In practice ML is not needed for most data classification, but there are of course legitimate uses for things like full names and addresses, and yes, we use it for those. As for cycle four, as far as I am aware Open Raven is the only current tool with live data validation, which enables significantly higher efficacy with low false positives, but at a cost.
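A rough sketch of how cycles one, two and four stack on top of each other, written in Python. Everything here is an assumption made for illustration: the names (`CARD_PATTERN`, `CONTEXT_WORDS`, `classify`), the 16-digit card pattern, the keyword list and the 40-character window are invented, and a Luhn checksum stands in for validation. It is not a description of Open Raven's engine, and live data validation as described above presumably goes further than a checksum.

```python
import re

# Cycle one: a bare pattern. Any run of exactly 16 digits is a candidate.
CARD_PATTERN = re.compile(r"\b\d{16}\b")

# Cycle two: hypothetical keywords that should appear near a real card number.
CONTEXT_WORDS = ("card", "visa", "mastercard", "payment")

# Hypothetical sample data (both numbers are well-known dummy values).
SAMPLE = "order id 1234567812345678 shipped; payment card 4111111111111111 on file"


def regex_matches(text):
    """Cycle one: return (value, start offset) for every 16-digit run."""
    return [(m.group(), m.start()) for m in CARD_PATTERN.finditer(text)]


def has_adjacent_context(text, start, window=40):
    """Cycle two: is a context keyword within `window` characters before the match?"""
    before = text[max(0, start - window):start].lower()
    return any(word in before for word in CONTEXT_WORDS)


def luhn_valid(number):
    """Checksum stand-in for cycle four: does the digit string pass the Luhn check?"""
    total = 0
    for index, char in enumerate(reversed(number)):
        digit = int(char)
        if index % 2 == 1:  # double every second digit, counting from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def classify(text):
    """Run all three checks and report what each one says about every candidate."""
    return [
        {
            "value": value,
            "context": has_adjacent_context(text, start),
            "checksum": luhn_valid(value),
        }
        for value, start in regex_matches(text)
    ]


if __name__ == "__main__":
    for finding in classify(SAMPLE):
        print(finding)
```

On the sample string the regex alone flags both numbers. The order id has no nearby keyword and fails the checksum, while the second number passes both checks. Each extra check costs more processing per match, which is the same speed versus efficacy trade-off seen in code scanning.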

So how does this apply to the lessons learned from the history of code scanning tools?

  1. No one wants or needs cycle one data classification tools. Never. Ever. They sucked then and they suck now. People deserve better.
  2. Users are sophisticated and understand the difference between cheap and expensive analysis. They want to be in control and be able to choose between them based on their tolerance for risk. A data classification tool has to cover innovation cycles two, three and four.
  3. Never get in the way of developers, DevOps or cloud architects. Imperfect but widespread adoption will have a much bigger impact on your security posture than shelfware. Make it easy and painless for people to do the right thing or they will just side-step you.