In 2020, the retail landscape changed almost overnight. Even though e-commerce was already growing steadily before COVID-19, last year saw e-commerce expand its share of overall retail spend by a massive 44% year-over-year. As online retail boomed in 2020, payment applications and services such as Venmo and Patreon saw similarly dramatic growth in both user numbers and transaction volume.
With people’s daily lives put on hold, many payments that were previously made in person shifted online, and more financial transactions and information started being stored in the cloud. Perhaps unsurprisingly, this paradigm shift in financial transactions also triggered a data breach pandemic. With businesses rushing to house an ever-increasing volume of sensitive data online, mismanaged configurations and unforeseen vulnerabilities continued to grow as a leading source of data breaches.
As many businesses have discovered, inadvertently leaking financial information can be incredibly costly. In 2014, Home Depot, the largest home improvement store in the world, suffered a hack that exposed over 50 million credit cards. As a result, Home Depot had to pay $134 million to financial institutions in reparations for compromised customer data and over $17 million in settlements. More recently, Marriott International was fined $23.8 million (£18.4 million) due to a breach in which hackers exposed traveling guests’ personal and payment information, an incident that had severe repercussions under the General Data Protection Regulation (GDPR).
With the growth in e-commerce unlikely to slow down after the pandemic ends and financial services firms transitioning more of what they do into the cloud, knowing where customers’ financial information lives will only become more important.
Finding out where sensitive customer data resides and whether any of it is exposed are among the most important questions that need to be answered by any business today. Answering these questions is the key to prioritizing the actionable steps that are vital for preventing the next big data breach. Unfortunately, finding exposed vulnerabilities across petabytes of files is no simple task.
Given its simplistic structure, financial data can be difficult to find, let alone classify. For example, U.S. bank account numbers can be anywhere from 6 to 17 digits long, so a given 12-digit sequence may or may not be a bank account number. Simply checking for any 6-17 digit number in an ocean of petabytes will result in many false positives and endless noise. On the other hand, IBANs (International Bank Account Numbers) consist of a country code (such as BR, DE, or GB), two check digits (redundancy digits), and a country-specific alphanumeric account identifier. IBANs are starkly different from U.S. bank account numbers, as they have a very structured format that’s explicitly defined, making it easier to identify them within a sea of possible strings. In the same vein, although credit card numbers can be 16 digits long, not all 16-digit numbers are credit card numbers: a valid number must also satisfy the Luhn checksum built into its digits.
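To make the point about IBAN structure concrete, here is a minimal sketch of the standard mod-97 check that IBAN check digits are designed for. The function name `is_valid_iban` and the loose `IBAN_SHAPE` pattern are our own illustrative choices, not Open Raven's or Macie's implementation, and the sketch assumes no per-country length table.

```python
import re

# Loose structural pattern (an assumption for this sketch): 2-letter country
# code, 2 check digits, then 11-30 alphanumeric characters.
IBAN_SHAPE = re.compile(r"^[A-Z]{2}[0-9]{2}[A-Z0-9]{11,30}$")


def is_valid_iban(candidate: str) -> bool:
    """Return True if the candidate has an IBAN-like shape and passes mod-97."""
    iban = candidate.replace(" ", "").upper()
    if not IBAN_SHAPE.match(iban):
        return False
    # Move the first four characters (country code + check digits) to the end.
    rearranged = iban[4:] + iban[:4]
    # Map digits to themselves and letters A-Z to 10-35, then read as one integer.
    as_digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(as_digits) % 97 == 1


if __name__ == "__main__":
    print(is_valid_iban("GB82 WEST 1234 5698 7654 32"))  # widely published example IBAN
    print(is_valid_iban("GB82 WEST 1234 5698 7654 33"))  # same string with one digit changed
    print(is_valid_iban("123456789012"))                 # plain digit run, no IBAN structure
```

A plain run of digits has nothing comparable to test against, which is why keyword context matters so much more for U.S. bank account numbers than for IBANs.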
In the past, log files have been the source of many financial breaches. As a result, being able to sort through toxic logs efficiently plays a considerable role in benchmarking. In addition, Parquet, Avro, and CSV files are all commonly used for long-term, big data storage, whereas XLSX files are widely used among financial institutions. Being able to quickly and accurately find financial data among these formats is key to preventing a financial data breach.
Open Raven’s validation function for data classes is a core feature that helps minimize false positives during data classification, which is especially useful for financial data. As many countries and organizations use some variety of checksum for financial data to reduce human mistakes, being able to accurately identify a data class is a core part of benchmarking.
Such a validation function is crucial for identifying credit card numbers with a high degree of efficacy: we match against all patterns that conform to some form of known credit card number (provided the match appears near a keyword), and we then parse the number to ensure it passes a Luhn check. Doing so gives us both high recall and high precision: we find as many credit-card-like numbers as we can and then run these numbers through a validation engine to ensure they’re actually valid credit card numbers.
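The two-pass idea described above (a wide pattern match near a keyword, followed by checksum validation) can be sketched roughly as follows. This is an illustration under stated assumptions, not Open Raven's actual engine: the `CARD_CANDIDATE` pattern, the `KEYWORDS` list, the 50-character `WINDOW`, and the function names are all hypothetical.

```python
import re

# Deliberately wide net (assumption): 13-19 digits, optionally separated by
# single spaces or hyphens.
CARD_CANDIDATE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")
KEYWORDS = ("credit card", "card number")  # hypothetical keyword list
WINDOW = 50  # hypothetical: characters before a match searched for a keyword


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` satisfy the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    # Walk from the rightmost digit, doubling every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_credit_cards(text: str) -> list[str]:
    """First pass: wide regex near a keyword. Second pass: Luhn validation."""
    lowered = text.lower()
    hits = []
    for m in CARD_CANDIDATE.finditer(text):
        context = lowered[max(0, m.start() - WINDOW):m.start()]
        if any(k in context for k in KEYWORDS) and luhn_valid(m.group()):
            hits.append(re.sub(r"[ -]", "", m.group()))
    return hits


if __name__ == "__main__":
    # 4111 1111 1111 1111 is a well-known test card number that passes Luhn.
    print(find_credit_cards("Credit card: 4111 1111 1111 1111"))
    print(find_credit_cards("Credit card: 4111 1111 1111 1112"))
    print(find_credit_cards("Order id 4111111111111111"))
```

Because the checksum does the filtering, the regex itself can stay permissive about length and separators rather than encoding issuer-specific prefixes.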
In order to perform a meaningful, side-by-side comparison between Open Raven and Macie, we modified our program to use the same keywords that Macie (reportedly) uses to find financial information. In other words, if Macie lists that they use “bank account” as a keyword to find bank account numbers, our modified data classification engine will also use “bank account” as a keyword and nothing else.
Our comparison, therefore, includes the same files and uses the same keywords, thus allowing us to objectively judge how both Open Raven and Macie perform over the exact same population of files. The data classes we’re comparing are:
(editor note: we removed Canada Bank Account due to complications of overlapping keywords/datasets when trying to make comparisons. See editorial comments at the end for transparency into the logistical decision to do this)
Comparing Open Raven with Macie is no trivial task due to the number of metrics we could present, as well as the number of data classes to cover. To keep this blog post to a reasonable length and to avoid overloading it with charts, we will use a dense heatmap that covers all of our data classes and their F1-scores (within the binary context), and then show the numerical recall on a file-by-file basis. If you haven’t already read our post on Performing the Benchmarking, we cover what F1-score means in that post and why we use it. To recap quickly: the F1-score is the harmonic mean of precision and recall, a single “one-size-fits-all” metric that says how effectively, not just accurately, the matching behaves.
We use the F1-score because it’s the metric most data scientists are accustomed to seeing. However, we also discuss and present recall on a numerical, file-by-file basis. We believe that in this context, recall carries the information we need to correctly prioritize buckets based on the projected magnitude of a data breach, a relevant statistic considering that fines are levied on a “per leaked record” basis.
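For readers who want the arithmetic behind the F1-score, here is a small sketch of how it could be computed in a binary, per-file setting. We assume here that "binary" means each file either should or should not be flagged for a data class; the function name and input shape are illustrative, not our benchmarking code.

```python
def f1_score(expected: list[bool], predicted: list[bool]) -> float:
    """F1 for one data class, where each list entry is one file (flagged or not)."""
    pairs = list(zip(expected, predicted))
    tp = sum(1 for e, p in pairs if e and p)        # correctly flagged files
    fp = sum(1 for e, p in pairs if not e and p)    # flagged, but should not be
    fn = sum(1 for e, p in pairs if e and not p)    # missed files
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Illustrative values only: four files, three of which contain the data class.
    truth = [True, True, True, False]
    flagged = [True, True, False, True]
    print(f1_score(truth, flagged))
```

In this made-up example, one miss and one false alarm pull precision and recall down to 2/3 each, so the F1-score is also 2/3.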
So, let’s take a look at F1-scores across the financial data classes (note: the higher the F1-score, the better).
From a basic F1-score comparison, we can immediately identify which data classes and file types Macie struggles with, and likewise, where Macie outperforms Open Raven. Open Raven struggles with the docx format consistently compared to Macie but provides more reliable results across other file formats. In contrast, Macie encounters issues with pptx. More interestingly, however, Macie struggles with plaintext formats for text files. This is unusual to us as developers, seeing that plaintext formats have the simplest structure of these file types. Inconsistent F1-scores for some data classes among Macie’s results are worth noting, as well.
Turning to the non-binary view of recall, which measures how well Open Raven and Macie each report the quantity of matched content in a file (rather than a simple yes or no), we see a much different story.
Macie struggles to report the correct counts across most, if not all, data classes. For credit cards, this hints that Macie’s regex is not robust enough to capture a wide variety of data. In practice, when each tool reports that a file contains X credit cards, Open Raven’s number is closer to the true count than Macie’s. For our customers, this has been an important feature, as setting actionable priorities based on the projected magnitude of a leak is how decisions are generally made.
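One way to express this count-based recall is sketched below. It assumes we know how many items were planted in each test file and caps each reported count at that number, so that over-reporting (a precision problem) cannot inflate recall. The function name, the capping rule, and the sample file names and counts are illustrative assumptions, not our published methodology or results.

```python
def per_file_recall(expected: dict[str, int], reported: dict[str, int]) -> dict[str, float]:
    """Fraction of planted items each scanner reported, file by file."""
    recalls = {}
    for name, truth in expected.items():
        if truth == 0:
            continue  # recall is undefined when nothing was planted in the file
        # Cap at the true count so extra matches do not push recall above 1.
        found = min(reported.get(name, 0), truth)
        recalls[name] = found / truth
    return recalls


if __name__ == "__main__":
    # Made-up numbers purely to show the calculation.
    planted = {"cards.csv": 100, "cards.txt": 50, "empty.log": 0}
    scanner_counts = {"cards.csv": 100, "cards.txt": 20}
    print(per_file_recall(planted, scanner_counts))
```

A file-level yes/no would score both files in this example as hits, while the count-based view shows that more than half of the records in one of them went unreported.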
We decided to focus on credit cards because we thought they would be a perfect data class to highlight Open Raven’s ability to use validation functions. When using the validation function, we see a much higher recall on a file-by-file basis in comparison with Macie. As discussed earlier, we’re able to perform computations on the matched credit cards we identify, which allows us to cast our regex net extremely wide (i.e., high recall) while also applying a second-pass verification that offers high precision.
We can see that Open Raven consistently meets the expected credit card count across all file extensions (see editorial comments for complications in this statement), which stems from our service’s ability to verify checksums. In contrast, Macie doesn’t seem to be able to capture all the credit cards placed near keywords. We speculate that this may be because Macie is using a stricter regex to compensate for its lack of (known) validation ability. That is, we presume Macie places the logic in its parsing methods by having a more rigid regex (i.e., directly matching the first digits of a credit card number to a set of issuing banks), opting for a less flexible way of telling arbitrary 15-digit sequences apart from credit card numbers.
This is the first post of many in which we plan on comparing Open Raven with Macie. We're happy to release the first public metrics on how well both platforms are performing through benchmarking, as we believe transparency such as this will only improve our and other companies’ software over time. As we continue to improve and refine our benchmarking process, our end goal will be for others to be able to conduct these tests independently as a part of our open-source initiative and to verify our results themselves.
Being able to share our results publicly is something that we take very seriously at Open Raven. As a company with a firm belief in transparency, we are very excited to release more benchmarking posts of this nature in the coming weeks.
We had to make some arbitrary decisions to make this comparison fair, even at the expense of making Macie look better. As we are the company generating the comparison, we realize there may be a conflict of interest, but to gain the reader’s confidence a fair comparison was needed nonetheless.
We did not use Macie’s “CREDIT_CARD_NUMBER_(NO_KEYWORD)” version of credit card matching. I (the author) ultimately think this is a subjective decision, and no person should assume a 15-digit number is a credit card number without the appearance of a keyword. However, Macie has a separate data class for finding credit cards without keywords. Since Open Raven doesn’t search for no-keyword credit cards, we don’t include them in the results.
This boosted Macie’s precision score in this category significantly, as random digits are placed in our testing documents to test for performance and misclassifications.
Macie groups US and Canada bank account numbers into the same category. However, since US and Canada bank account numbers are different data classes in our repository, this made it difficult to identify which dataset the target data was supposed to belong to in our classification schema. To make things even more difficult, Macie uses the same keywords for both countries, so differentiating the two from a benchmarking perspective was troublesome: we didn’t know which country a bank account number belonged to in Macie’s results, while US and Canadian bank account numbers are clearly differentiated in our system. Benchmarking only US bank accounts seemed to be the easiest workaround.