Open Raven vs Macie: Financial Data Results (Part 4)

Engineering
Product
May 18, 2021

Why locate and classify financial data?

In 2020, the retail landscape changed almost overnight. Even though e-commerce was already growing steadily pre-COVID-19, last year saw e-commerce expand its share of overall retail spend by a massive 44% year-over-year. As online retail boomed, payment applications and services such as Venmo and Patreon saw similarly dramatic growth in both user numbers and transaction volume.

With people’s daily lives put on hold, many payments that were previously made in person shifted online, and more financial transactions and information began being stored in the cloud. Perhaps unsurprisingly, this paradigm shift in financial transactions also triggered a pandemic of data breaches. With businesses rushing to house an ever-increasing volume of sensitive data online, misconfigurations and unforeseen vulnerabilities grew into a leading source of data breaches.

As many businesses have discovered, inadvertently leaking financial information can be incredibly costly. In 2014, Home Depot, the largest home improvement retailer in the world, suffered a hack that exposed over 50 million credit cards. As a result, Home Depot had to pay $134 million to financial institutions in reparations for compromised customer data, plus over $17 million in settlements. More recently, Marriott International was fined $23.8 million (£18.4 million) over a breach that exposed traveling guests’ personal and payment information, an incident that had severe repercussions under the General Data Protection Regulation (GDPR).

With e-commerce growth unlikely to slow once the pandemic ends, and with financial services firms moving more of their operations into the cloud, knowing where customers’ financial information lives will only become more important.

Where does sensitive customer data reside, and is any of it exposed? These are among the most important questions any business must answer today, and answering them is the key to prioritizing the actionable steps that prevent the next big data breach. Unfortunately, finding exposed sensitive data across petabytes of files is no simple task.

What does financial data look like, and where does it live? 

Because of its often simple structure, financial data can be difficult to find, let alone classify. For example, U.S. bank account numbers range anywhere from 6 to 17 digits, so any 12-digit sequence could be one, but most are not. Simply flagging every 6-to-17-digit number in an ocean of petabytes will produce many false positives and endless noise. IBANs (International Bank Account Numbers), on the other hand, consist of a country code (such as BR, US, or GB), check digits (for error detection), and alphanumeric characters. IBANs differ starkly from U.S. bank account numbers: their format is explicitly defined and highly structured, making them easier to identify within a sea of possible strings. In the same vein, although credit card numbers can be 16 digits long, not every 16-digit number is a credit card number, because each valid number must satisfy a Luhn checksum.
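The Luhn check mentioned above is easy to sketch. Here is an illustrative implementation (ours, not Open Raven's production code):

```javascript
// Luhn checksum: double every second digit from the right, subtract 9
// from any doubled result over 9, and require the total to be divisible
// by 10. Valid card numbers pass; most random digit strings do not.
function passesLuhn(candidate) {
  const digits = candidate.replace(/[\s-]/g, "");
  if (!/^\d+$/.test(digits)) return false;
  let sum = 0;
  for (let i = 0; i < digits.length; i++) {
    // Walk from the rightmost digit, doubling every second one.
    let d = Number(digits[digits.length - 1 - i]);
    if (i % 2 === 1) {
      d *= 2;
      if (d > 9) d -= 9;
    }
    sum += d;
  }
  return sum % 10 === 0;
}

console.log(passesLuhn("4111 1111 1111 1111")); // well-known test number: true
console.log(passesLuhn("4111111111111112"));    // off by one digit: false
```

A scanner can therefore cast a wide net for digit runs and let a check like this filter out the strings that merely look like card numbers.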

Financial data can appear in many different forms, some simpler than others. This variety poses a problem when trying to correctly identify financial information across millions of possible permutations.


Log files have historically been the source of many financial breaches, so being able to sort through toxic logs efficiently plays a considerable role in our benchmarking. Parquet, Avro, and CSV files are all commonly used for long-term big data storage, while XLSX files are widely used among financial institutions. Quickly and accurately finding financial data across these formats is key to preventing a financial data breach.

A screenshot published by Ars Technica during a 2020 Razer breach. We modeled some of our benchmarking documents (generated by Mockingbird; see our earlier post on how we perform benchmarking) to resemble actual data breaches we could ethically find in news reports.

Open Raven Use Case: Validation Function

Open Raven’s validation function for data classes is a core feature that minimizes false positives during data classification, which is especially useful for financial data. Since many countries and organizations build some variety of checksum into financial data to reduce human error, being able to accurately identify a data class is a core part of benchmarking.

Our classification engine allows users to provide a validation function in the form of JavaScript code. This means that matched strings can be cleaned and checked against common checksums and correction tests, which prevents noisy false positives from surfacing.
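To make this concrete, here is a sketch of the kind of validation function a user might supply (our illustration, not Open Raven's actual code): an IBAN mod-97 check per ISO 13616.

```javascript
// Hypothetical user-supplied validation function: verifies an IBAN's
// check digits. The name and shape are illustrative only.
function validateIban(candidate) {
  const iban = candidate.replace(/\s+/g, "").toUpperCase();
  // Country code, two check digits, then up to 30 alphanumerics.
  if (!/^[A-Z]{2}\d{2}[A-Z0-9]{1,30}$/.test(iban)) return false;
  // Move the country code and check digits to the end, then map
  // letters to numbers (A=10 ... Z=35).
  const rearranged = iban.slice(4) + iban.slice(0, 4);
  let remainder = 0;
  for (const ch of rearranged) {
    const value = ch >= "A" ? String(ch.charCodeAt(0) - 55) : ch;
    // Compute the huge number's value mod 97 incrementally, chunk by
    // chunk, to avoid overflowing a JavaScript Number.
    remainder = Number(String(remainder) + value) % 97;
  }
  return remainder === 1; // valid IBANs leave a remainder of 1
}

console.log(validateIban("GB82 WEST 1234 5698 7654 32")); // true
```

A second pass like this is what lets a wide-net regex stay quiet: a candidate string that fails the checksum is simply dropped instead of surfacing as an alert.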

Some of the source code for our credit card validation function. We can perform calculations, such as checking whether a number is a Discover Card or American Express (among others), as well as perform Luhn checks.

Such a validation function is crucial for identifying credit card numbers with a high degree of efficacy: we match every pattern that conforms to some known credit card format (provided it appears near a keyword), and we then parse the number to ensure it passes a Luhn check. Doing so yields both high recall and high precision: we find as many credit-card-like numbers as we can, then run them through the validation engine to confirm they are actually valid credit card numbers.

The ability to perform computations on discovered data allows us to trust the precision and accuracy of our scan results. Ensuring our alerts are meaningful is necessary when trying to prevent the next big data breach.

Benchmarking Procedures: Side-by-Side Comparisons

In order to perform a meaningful, side-by-side comparison between Open Raven and Macie, we modified our program to use the same keywords that Macie (reportedly) uses to find financial information. In other words, if Macie lists that they use “bank account” as a keyword to find bank account numbers, our modified data classification engine will also use “bank account” as a keyword and nothing else. 

“accountno#,” a keyword that both Open Raven and Amazon Macie use to detect U.S. bank account numbers, placed in a log file (a semi-structured file format) to benchmark our program’s efficiency. We modeled this leak after common Spring Framework log file leaks.

Our comparison, therefore, includes the same files and uses the same keywords, thus allowing us to objectively judge how both Open Raven and Macie perform over the exact same population of files. The data classes we’re comparing are:

  • US Bank Account
  • UK Bank Account
  • France Bank Account
  • Germany Bank Account
  • Italy Bank Account
  • Spain Bank Account
  • Card Magnetic Strip Data (Tracks 1 and 2)
  • Credit Card Number 
Macie lists the keywords it uses as part of its documentation. We configure our data classes to use the same keywords during head-to-head benchmarking.

(Editor’s note: we removed Canada Bank Account due to complications with overlapping keywords and datasets when making comparisons. See the editorial comments at the end for transparency into this decision.)

Side-by-Side Comparisons: Open Raven and Macie

Comparing Open Raven with Macie is no trivial task, given the number of metrics we could present and the number of data classes to cover. To keep this post to a reasonable length and avoid overloading it with charts, we will use a dense heatmap that covers all of our data classes and their F1-scores (in the binary context), and then show numerical recall on a file-by-file basis. If you haven’t already read our post on performing the benchmarking, we explain there what the F1-score means and why we use it. To recap quickly: the F1-score is the harmonic mean of precision and recall, a single metric that captures how effectively, not just how accurately, the matching behaves.

We use F1-scores because it’s the metric most data scientists are accustomed to seeing. However, we also present recall on a numerical, file-by-file basis. We believe that in this context, recall captures the information needed to correctly prioritize buckets by the projected magnitude of a data breach, a relevant statistic considering that fines are levied on a per-leaked-record basis.
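For reference, here is how these metrics relate, computed from hypothetical match counts (not our actual benchmark numbers):

```javascript
// Precision, recall, and F1 from raw match counts.
// tp = true positives, fp = false positives, fn = false negatives.
function f1Score(tp, fp, fn) {
  const precision = tp / (tp + fp); // of what we flagged, how much was real
  const recall = tp / (tp + fn);    // of what was real, how much we flagged
  // F1 is the harmonic mean of precision and recall: it is only high
  // when both are high, so it penalizes one-sided scanners.
  return (2 * precision * recall) / (precision + recall);
}

// A scanner that finds 80 of 100 planted records with 20 false alarms:
console.log(f1Score(80, 20, 20).toFixed(2)); // "0.80"
```

A scanner tuned purely for recall (flag everything) or purely for precision (flag almost nothing) scores poorly on F1, which is why we use it for the binary heatmaps.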

So, let’s take a look at F1-scores across the financial data classes (note: the higher the F1-score, the better).

[Interactive Bokeh heatmaps: F1-scores for each data class and file type, Open Raven and Macie]

From a basic F1-score comparison, we can immediately identify which data classes and file types Macie struggles with, and likewise where Macie outperforms Open Raven. Open Raven consistently struggles with the docx format compared to Macie but provides more reliable results across other file formats. Macie, in contrast, encounters issues with pptx. More interestingly, Macie struggles with plaintext formats for text files, which surprised us as developers, since plaintext has the simplest structure of these file types. The inconsistent F1-scores for some data classes in Macie’s results are also worth noting.

A file Macie missed: magnetic stripe data embedded within a text file, placed near randomized text.

Turning to the non-binary domain of recall, which measures how well Open Raven and Macie report the quantitative count of matched content rather than simply whether a file contains a match, we see a much different story.

[Interactive Bokeh plots: file-by-file recall counts for Open Raven and Macie]

Macie struggles to report correct counts across most, if not all, data classes, which hints that Macie’s credit card regex is not robust enough to capture a wide variety of data. In other words, when reporting that a file contains X credit card numbers, Open Raven’s count is closer to the true count than Macie’s. For our customers, this has been an important feature, as setting actionable priorities based on the projected magnitude of a leak is how decisions are generally made.

Validation Function In Action: High Recall in Credit Cards

Let’s focus on credit cards, a data class we chose because it perfectly highlights Open Raven’s ability to use validation functions. With the validation function in place, we see much higher recall on a file-by-file basis than Macie. As discussed earlier, we can perform computations on the credit card numbers we match, which lets us cast our regex net extremely wide (high recall) while a second-pass verification keeps precision high.

[Interactive Bokeh plot: credit card counts found per file, Open Raven vs. Macie]

We can see that Open Raven consistently meets the expected credit card count for every file extension (see the editorial comments for complications with this statement), which stems from our service’s ability to verify checksums. In contrast, Macie doesn’t seem to capture all the credit cards placed near keywords. We speculate that Macie uses a stricter regex to compensate for its (apparent) lack of validation ability: rather than validating matches, Macie presumably places the logic in its parsing by using a more rigid regex (i.e., directly matching the first digits of a credit card number against a set of issuing banks), a less flexible way of differentiating arbitrary digit runs from credit card numbers.
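To illustrate the difference between the two approaches (both patterns below are our own speculative sketches, not Macie's or Open Raven's actual regexes):

```javascript
// Rigid: issuer prefixes baked into the pattern (Visa, Mastercard,
// Amex, Discover). Anything not matching a known prefix, or split by
// separators, is never even considered.
const rigid = /\b(?:4\d{15}|5[1-5]\d{14}|3[47]\d{13}|6011\d{12})\b/g;

// Wide net: any 13-to-19-digit run, with optional separators, is a
// candidate. A validation pass (e.g., a Luhn check) then filters out
// the false positives.
const wide = /\b(?:\d[ -]?){13,19}\b/g;

const text = "card: 4111 1111 1111 1111";
console.log((text.match(wide) || []).length);  // 1 candidate for validation
console.log((text.match(rigid) || []).length); // 0: spaces break the rigid match
```

The rigid pattern never produces candidates that need cleanup, but it silently misses anything formatted unexpectedly; the wide pattern over-matches by design and depends on a second-pass checksum to stay precise.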

Conclusion + What’s Next

This is the first of many posts in which we plan to compare Open Raven with Macie. We’re happy to release the first public benchmarking metrics on how well both platforms perform, as we believe transparency like this will only improve our and other companies’ software over time. As we continue to refine our benchmarking process, our end goal is for others to run these tests independently, as part of our open-source initiative, and verify our results themselves.

Being able to share our results publicly is something that we take very seriously at Open Raven. As a company with a firm belief in transparency, we are very excited to release more benchmarking posts of this nature in the coming weeks.

Editorial Comments:

We had to make some judgment calls to keep this comparison fair, even when doing so made Macie look better. As the company producing the comparison, we recognize there may be a conflict of interest, but a fair comparison was needed nonetheless to earn the reader’s confidence.

Credit Card Matching

We did not use Macie’s “CREDIT_CARD_NUMBER_(NO_KEYWORD)” version of credit card matching. I (the author) ultimately think this is a subjective decision, and no one should assume a 15-digit number is a credit card number without a keyword nearby. However, Macie has a separate data class for finding credit cards without keywords, and since Open Raven doesn’t search for keyword-less credit cards, we don’t include them in the results.

This significantly boosted Macie’s precision score in this category, as random digits are placed throughout our testing documents to probe for misclassifications.

Canada Bank Account Matching

Macie groups U.S. and Canada bank account numbers into the same category. Since they are separate data classes in our repository, it was difficult to identify which dataset a piece of target data belonged to under our classification schema. To make matters worse, Macie uses the same keywords for both countries, so differentiating them from a benchmarking perspective was troublesome: we couldn’t tell which country a bank account number belonged to in Macie’s results, while U.S. and Canadian bank account numbers are clearly differentiated in our system. Benchmarking only U.S. bank accounts seemed to be the easiest workaround.

Tyler Szeto, Software Engineer
Mike Andrews, Head of Engineering