I have been building security code scanning tools for over a decade. I started in the late 2000s at Microsoft, where I owned CAT.NET, an intra-procedural static analysis tool for .NET MSIL code that plugged into Visual Studio as an extension. Visual Studio already had a built-in security linter (see below). In recent years, at SourceClear (acquired by Veracode), our entire product was built on an inter-procedural static analysis engine to determine if custom code was calling methods in vulnerable open source libraries.
I like to think I have been around the block enough to have seen things come in and go out of fashion.
If you don’t know about code analysis, there are generally a few types of security analysis techniques, and they have evolved in roughly the following innovation cycles.
Cycle one “Linting” - simple grep over parts of code
Cycle two “Intra-procedural” - analyzing parts of the code
Cycle three “Inter-procedural” - analyzing the entire code (including all the dependencies)
Cycle four “Data flow analysis” - tracing data through the entire code base
Each technique built on the previous innovation cycle. Cycle one tools were fast and familiar but generated a lot of noise. Cycle two tools were better but slower, and as time went on they started to miss the increasingly sophisticated vulnerabilities that attackers exploited. Cycle three tools were able to find more sophisticated vulnerabilities, but in doing so generated lots of noise, which led to cycle four, which improved efficacy at the expense of very slow scanning. For cycle four tools to work, the software had to build, and the analysis engine had to construct an inter-procedural call graph and a data flow graph and then run complex algorithms across them. Some analysis runs would literally run overnight.
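To make the noise problem concrete, here is a minimal sketch in Python. It is not how CAT.NET or any real scanner works; the sample source, the rule (flagging calls to `eval`) and the function names `grep_lint` and `ast_lint` are all made up for illustration. The first check is a cycle one style grep over raw text. The second parses the code and looks only at real call sites, which is closer in spirit to the syntax-aware analysis of cycle two, though far simpler than true intra-procedural analysis.

```python
import ast
import re

# Hypothetical sample code to scan: one comment that mentions eval( and one real call.
SOURCE = '''
def safe():
    # never call eval( on user input
    return 1

def risky(user_input):
    return eval(user_input)
'''


def grep_lint(source):
    """Cycle one style: flag every line whose raw text contains 'eval('."""
    pattern = re.compile(r"\beval\(")
    return [
        number
        for number, line in enumerate(source.splitlines(), start=1)
        if pattern.search(line)
    ]


def ast_lint(source):
    """Syntax-aware check: flag only real calls to a function named eval."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id == "eval"
        ):
            hits.append(node.lineno)
    return sorted(hits)


if __name__ == "__main__":
    print("grep hits on lines:", grep_lint(SOURCE))
    print("ast hits on lines:", ast_lint(SOURCE))
```

The grep version flags the comment as well as the real call, which is exactly the kind of noise that pushed tools toward deeper analysis. The parsed version is quieter, but it has to parse the code first, and that cost only grows as the analysis gets more ambitious.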
Taking a step back, I learned three important lessons from building security code scanners over the years.
What happened with code scanning tools is that the more sophisticated they got, the slower and more computationally expensive they became to run, and the more they got in the way of innovation. The end result was that people stopped using them and have started to swing back to a new generation of simpler cycle two tools. So why does this all matter for a data security company?
It is just over two years since Dave and I set out to build the best damn data security platform ™ we could imagine, and consistently from those early days until today, we have heard the same core use cases from users:
“Tell me what type of data I have” is, of course, a platform feature called data classification and, call it what you will, that means data scanning. We are writing lots of content these days about designing and building data classification systems, including our Macie vs Open Raven series, in which we openly boast about our results. To let you in on a little secret, our data classification engine design has leaned heavily on my experience building code scanning tools.
The evolution of data classification tools is following cycles eerily similar to those of code scanning tools.
Cycle One - Regex - simple data pattern matching
Cycle Two - Data Adjacency - analysis of patterns and their surrounding context
Cycle Three - ML - sophisticated analysis to look for specific data
Cycle Four - Data Validation - checking that matches are actually real data and not just theoretical pattern matches
Like code scanning tools, cycle one tools are fast but have a very low level of efficacy, generating a lot of noise. These are the privacy tool vendors, the first legacy generation of the technology. Cycle two tools are much better but are undoubtedly more expensive to run, as you need to perform additional processing. In reality, this is today's state of the art. As for cycle three, everyone claims to do ML and everyone knows it's largely marketing spin. In practice it's not needed for most data classification, but there are of course legitimate uses for things like full names and addresses, and yes, we use it for those. As for cycle four, as far as I am aware Open Raven is the only current tool with live data validation, which enables significantly higher efficacy with low false positives, but at a cost.
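As a rough illustration of how cycles one, two and four stack on top of each other, here is a small Python sketch for payment card numbers. It is not Open Raven's engine. The pattern, the keyword list, the 30-character context window and the function names are all assumptions made for the example, and the Luhn checksum is only an offline stand-in for validation, since live validation of data is a different and more expensive thing.

```python
import re

# Cycle one: a bare pattern for 16 digits, optionally separated by spaces or dashes.
# (Hypothetical pattern for illustration; real card formats vary in length.)
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){15}\d\b")

# Cycle two: words that, when found near a match, make it more plausible.
# (Assumed keyword list, not taken from any product.)
CONTEXT_KEYWORDS = ("card", "visa", "mastercard", "payment")


def luhn_valid(digits):
    """Cycle four stand-in: Luhn checksum over a string of digits."""
    total = 0
    for index, char in enumerate(reversed(digits)):
        value = int(char)
        if index % 2 == 1:  # double every second digit, counting from the right
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def classify(text, window=30):
    """Return one record per regex match, with adjacency and validation signals."""
    results = []
    for match in CARD_PATTERN.finditer(text):
        digits = re.sub(r"\D", "", match.group())
        before = text[max(0, match.start() - window):match.start()].lower()
        results.append({
            "match": match.group(),
            "adjacent_context": any(word in before for word in CONTEXT_KEYWORDS),
            "validated": luhn_valid(digits),
        })
    return results


if __name__ == "__main__":
    for sample in ("card number: 4111 1111 1111 1111", "order id 1234567812345678"):
        print(sample, "->", classify(sample))
```

Both samples match the bare pattern, which is the cycle one noise problem in miniature. Only the first has supporting context nearby and passes the checksum. Each extra signal costs more processing per match, which is the same speed versus efficacy trade-off described above.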
So how does this apply to the lessons learned from the history of code scanning tools?