In our previous post of this series, we covered the test data requirements and how to generate such datasets. Now, on to how we test and benchmark both Open Raven’s and AWS Macie’s data scanners for equitable comparison. To clarify our methodology, we will describe the approach and metrics used to document and understand the efficacy and performance of data classification.
To know if a data classification scanner is performing well, we must first understand if it identifies all the sensitive data in a given target data file vs. any incorrect or missed findings. We do this by monitoring the number of true and false positives and negatives generated.
For example, we may find that a data classification scanner flagged 70 “true positives” out of a possible 100 actual instances of sensitive data. While this information is useful, it’s just a part of an overall efficacy measure. To be useful for benchmarking, metrics must provide clear, unambiguous, comparable values at any scale, rather than single numbers that need additional context to interpret.
Precision and recall are commonly used by data scientists to indicate the effectiveness of the models they have produced. Precision is the percentage of reported findings that are actually correct (true positives divided by all reported findings), whereas recall is the percentage of all relevant instances that were found (true positives divided by all instances actually present).
While both metrics are vitally important, there is a subtle yet profound difference between the two. Depending on the role of the person consuming these metrics, and the use case at hand, an individual assessing a model's efficacy may care much more about one metric than the other. For example, operations teams may be more interested in precision because they must sift through, investigate, and validate each finding, so false positives are expensive. On the other hand, security and compliance staff are likely to care much more about recall, as they don’t want to miss any true findings: each missed record increases the associated risk for which they are responsible.
To explain how different metric priorities can produce different results, let’s take a file that contains credit card numbers. From our test data generation, we know that there are 100 such instances in the file. If the data scanner correctly finds 70 of these, incorrectly flags 5 values that are not credit card numbers, and does not find the remaining 30, its precision is approximately 93% (70 / (70 + 5)), whereas its recall is 70% (70 / (70 + 30)).
But what if the data scanner was overly permissive, finding many more “false positive” matches - say 30 - while still finding 70 “true positive” matches? In this case, the precision comes out to be 70%, but the recall is still 70%.
Now, let’s say the scanner still produces the same 30 “false positive” matches but doesn’t find as many of the “true positive” matches as it should - only 50 of them. Now, precision comes out to be about 63% (50 / (50 + 30)) and recall 50% (50 / (50 + 50)).
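The arithmetic in the three scenarios above can be reproduced with a few lines of Python. This is only a minimal sketch: the function name precision_recall and the scenario labels are ours, for illustration, and are not part of any Open Raven or Macie tooling.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Return (precision, recall) from true positive, false positive
    and false negative counts."""
    precision = tp / (tp + fp)  # share of reported findings that are real
    recall = tp / (tp + fn)     # share of real instances that were found
    return precision, recall


# The three scenarios for a file containing 100 real credit card numbers:
# label -> (true positives, false positives, false negatives)
scenarios = {
    "baseline": (70, 5, 30),
    "overly permissive": (70, 30, 30),
    "permissive, fewer hits": (50, 30, 50),
}

for label, (tp, fp, fn) in scenarios.items():
    p, r = precision_recall(tp, fp, fn)
    print(f"{label}: precision={p:.1%} recall={r:.1%}")
# baseline: precision=93.3% recall=70.0%
# overly permissive: precision=70.0% recall=70.0%
# permissive, fewer hits: precision=62.5% recall=50.0%
```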
Therefore, depending on your use case and tolerance for “noise”, it is certainly possible to optimize one of these metrics over the other to the detriment of the scanner’s overall effectiveness. If you are unconcerned about the amount of noise generated through false positive matches, you can easily get the recall metric to a perfect 100% by being very permissive about what is matched, so that every “true positive” is found. Conversely, if you optimize for less noise by matching only the most certain candidates, and consequently miss real data that your scanner should have discovered, the precision metric can be coaxed to approach 100%.
To avoid this kind of bias, our benchmarking process uses an aggregate metric called the F1-score, which is the harmonic mean of precision and recall. A good F1-score means that there are few false positives combined with few false negatives: the scanner correctly identifies data without burdening us with false alarms. An F1-score is considered perfect when it’s 1.0, while data classification is a total failure when it’s 0.
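As an illustration of how the harmonic mean penalizes an imbalance between the two metrics, here is a small sketch that computes the F1-score for the three credit card scenarios discussed earlier. The helper name f1_score is our own, and returning 0.0 when both inputs are zero is a convention we assume here rather than something the benchmark prescribes.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall.

    Assumption: returns 0.0 when both inputs are zero, to avoid
    dividing by zero.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) for the three scenarios described earlier
print(round(f1_score(70 / 75, 70 / 100), 2))  # 0.8
print(round(f1_score(70 / 100, 70 / 100), 2))  # 0.7
print(round(f1_score(50 / 80, 50 / 100), 2))  # 0.56
```

Note that the first scenario, despite its 93% precision, scores only 0.8 because the 70% recall pulls the harmonic mean down.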
It’s important to first describe the way that Open Raven is calculating precision and recall over our test data as others may use them slightly differently (and appropriately, for their use case).
Open Raven benchmarks our data classification engine’s efficacy over the full quantity of (possible) findings across all data sources (files) being scanned, e.g. 2643 credit card numbers found across 147 S3 bucket objects. Contrast this with a binary approach at the data source level where S3 buckets, or the objects contained within, are singularly classified as containing sensitive data or not, regardless of the number of findings in each data source.
Benchmarking data classification efficacy on a datasource-by-datasource level is appropriate when simply surveying an environment for where sensitive data is stored. It is very similar to the independent antivirus testing methodology, where one is concerned with identifying whether a file is malicious in any way, and therefore should be quarantined, rather than with reporting that the file may be infected in multiple ways. Open Raven’s stance is that customers need to prioritize finding, and possibly remediating, based on the number of data records exposed and not simply on whether a data store has sensitive data contained within it or not. The distinction is important because statutes, like the California Consumer Privacy Act (CCPA), can fine companies on a per-record basis for non-compliance. With such penalties, a low recall score means companies would severely underreport their risk, translating to less effective remediation efforts thereafter.
Finally, going back to the prior blog post where we describe the data generated to test classification, we also create “null files”, where no sensitive data findings should be expected, and in these cases Open Raven can and does revert to a binary classification approach. In fact, attempting to calculate the recall metric for a null file results in a divide-by-zero error, because the denominator (the number of relevant “positive” elements) is zero. Therefore, any files that are known to be in the “null” set, but have findings, are investigated.
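The special handling of null files can be sketched as follows. This is an illustrative example only: the function names are hypothetical, and returning None for an undefined recall is our assumption about one reasonable way to represent it, not a description of Open Raven’s internal code.

```python
from typing import Optional


def recall_or_none(tp: int, fn: int) -> Optional[float]:
    """Recall, or None when the file holds no real positives (a null file),
    since tp / (tp + fn) would otherwise divide by zero."""
    relevant = tp + fn
    return tp / relevant if relevant else None


def null_file_needs_review(findings: int) -> bool:
    """Binary check for a null file: any finding at all is unexpected."""
    return findings > 0


print(recall_or_none(70, 30))     # 0.7
print(recall_or_none(0, 0))       # None (null file, recall undefined)
print(null_file_needs_review(3))  # True (investigate this file)
```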
Now that we have explained how we measure and report benchmarking, let’s work through a specific example. Later in this series, we will go through several unique data categories - from financial data to developer secrets - benchmarking how each performs and comparing them against our AWS Macie benchmarking. For now, however, let’s take a single example - identifying credit card numbers, per above.
We first use Mockingbird to create a set of test files. In this example, we generate JSON documents - one of a number of different file formats and sizes Mockingbird allows users to generate.
In our example, we don’t know exactly where (or what) credit card numbers were placed in each file, but we do know how many such numbers are present. In addition to generating files where credit card numbers were added, we also generate “null files” - valid files where Mockingbird may inject data but expect zero findings to be discovered. We do this for two reasons:
In this way, “null files” allow us to create a “placebo” test - if we found a reaction to this test we’d want to verify the source material and Mockingbird’s code. In addition, it lets us examine the data classifier itself to understand why it is finding sensitive data that doesn't exist.
We then place both the “null files” and “vulnerable files” into a specific target S3 bucket. Thereafter, we create a data scan job in Open Raven to run data classification on the bucket using just the data class we are benchmarking. Similarly, we can go to AWS Macie and run a job against the same target S3 bucket. Once both jobs are complete, we can retrieve the findings.
To extract findings from Open Raven, we execute a query like the one below, which lists all the files in the target S3 bucket that recorded findings for the credit card number data class. For each test file, we compare the count found to the number of entries that Mockingbird actually injected within each file. We can now calculate the precision, recall, and F1-score from each of the test files before aggregating these results together for a mean average metric.
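The per-file comparison and aggregation described above could look something like the sketch below. Because only counts are known for each file (not the position of each injected value), this sketch assumes that true positives are the overlap between the expected and found counts, that any surplus is false positives, and that any shortfall is false negatives. That derivation, the file names, and the counts are all hypothetical and shown purely for illustration; null files are assumed to be handled separately.

```python
from statistics import mean


def file_metrics(expected: int, found: int) -> dict[str, float]:
    """Precision, recall and F1 for one file, from counts alone.

    Assumption: TP = min(expected, found); surplus findings are FP;
    missing findings are FN.
    """
    tp = min(expected, found)
    fp = max(found - expected, 0)
    fn = max(expected - found, 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical data: file -> (count injected by generator, count reported)
counts = {"a.json": (100, 100), "b.json": (50, 40), "c.json": (20, 25)}

per_file = {name: file_metrics(exp, got) for name, (exp, got) in counts.items()}

# Mean of each metric across all test files
summary = {
    key: mean(m[key] for m in per_file.values())
    for key in ("precision", "recall", "f1")
}
print({key: round(value, 3) for key, value in summary.items()})
```

A count-based comparison like this cannot detect the case where a false positive and a missed value cancel each other out within the same file, which is worth keeping in mind when reading the resulting numbers.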
To retrieve findings from within AWS Macie, we go to Macie’s main landing page, select “S3 Buckets,” find the target bucket with the test files, identify the last JobID that was run, and click “show findings.” Doing so leads us to a page like the one below, upon which we select all the findings and export the JSON. (Note: be very careful with this screen. We have found that using the “select all” checkbox DOES NOT actually select all items for export, only those that have been “seen” in the scrolling view, so you MUST scroll to the bottom of the list before selecting. To avoid this problem, we recommend you use the get-findings CLI command instead.)
The exported JSON will look like the example to the left, which we then filter for sections related to the data classification we are benchmarking (i.e. classificationDetails.result.sensitiveData.detections.type = “CREDIT_CARD_NUMBER” or “CREDIT_CARD_NUMBER_(NO_KEYWORD)”).
This is because Macie runs all of its data classifiers at once with no option to limit for a particular data class, and we don’t want to possibly pollute the benchmarking calculations with findings from other data classes.
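The filtering step described above might be sketched as follows. This assumes the export has been loaded as a list of finding objects, that each finding carries the object key under resourcesAffected.s3Object.key, and that sensitiveData is a list of groups each holding a list of detections with type and count fields. The sample finding below is made up for illustration, and the function name is ours.

```python
from collections import Counter

# Detection types that belong to the data class being benchmarked
CARD_TYPES = {"CREDIT_CARD_NUMBER", "CREDIT_CARD_NUMBER_(NO_KEYWORD)"}


def card_counts_per_object(findings: list[dict]) -> Counter:
    """Sum credit card detections per S3 object key, ignoring other types."""
    counts: Counter = Counter()
    for finding in findings:
        s3_object = finding.get("resourcesAffected", {}).get("s3Object", {})
        key = s3_object.get("key", "<unknown>")
        result = finding.get("classificationDetails", {}).get("result", {})
        for group in result.get("sensitiveData", []):
            for detection in group.get("detections", []):
                if detection.get("type") in CARD_TYPES:
                    counts[key] += detection.get("count", 0)
    return counts


# Made-up finding, trimmed to the fields this sketch reads
sample = [{
    "resourcesAffected": {"s3Object": {"key": "test/file-001.json"}},
    "classificationDetails": {"result": {"sensitiveData": [
        {"category": "FINANCIAL_INFORMATION", "detections": [
            {"type": "CREDIT_CARD_NUMBER", "count": 3},
            {"type": "CREDIT_CARD_NUMBER_(NO_KEYWORD)", "count": 2},
        ]},
        {"category": "PERSONAL_INFORMATION", "detections": [
            {"type": "NAME", "count": 7},
        ]},
    ]}},
}]

print(dict(card_counts_per_object(sample)))  # {'test/file-001.json': 5}
```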
We then process both Open Raven’s and Macie’s JSON files, calculate the precision, recall, and F1-scores, and output results into a table as seen below. We’ve re-included the diagram explaining precision and recall for reference.
From the results above, we can see that Open Raven has a slightly higher degree of efficacy than Macie, as reflected in our precision and balanced F1-score.
Now that we have a methodology, comparable benchmarking metrics, and an automated process, we find that Open Raven’s and Macie’s classification engines are quite well matched (on precision and recall). In the next blog post of this series, we’ll run through a matrix of different data classes (financial data, healthcare data, personal data, and developer secrets) across different file types (text, Office documents, Parquet, etc.) and continue to compare Open Raven’s results with those of AWS Macie, as well as going into performance (scanning speed) and cost. Stay tuned!