Open Raven vs Macie: Healthcare Data Results - Part 5

Bele

Chief Corvus Officer

June 2, 2021

Why Locate and Classify Health Data?

A few months ago, my mom was confirmed to be one of the many employees whose personal information was leaked in the 2020 UCSF Medical Center hack. Although this particular attack hit close to home, it was not an isolated incident.

Cybercriminals have always targeted medical institutions. Before 2017, the US Department of Health and Human Services Office for Civil Rights’ Breach Portal reported over 325,000 healthcare data breaches, each affecting 500 or more people. However, it was not until 2016, when 14 US hospitals were attacked with ransomware, that healthcare organizations became hackers’ favorite target. From then on, threat actors were emboldened. In 2017, a ransomware attack wreaked havoc on British health care services, shutting down non-critical hospital services for a few days. As recently as May 2021, Ireland’s health service experienced the “biggest ever cyber attack in the State’s history,” compromising the personal information of hundreds of thousands of Irish people and causing continued disruption to patient care.

Attacks such as these, which affect critical infrastructure, reveal how outdated much of the modern world’s medical care infrastructure is: it runs on systems that have passed their end-of-life dates. A scarily realistic portrayal of this status quo even plays a small role in the show Mr. Robot, in which the protagonist, realizing that the hospital he is admitted to runs on Windows 98, uses old, well-known exploits to access his own medical records and edit his way out of a hospitalization.

Modernizing healthcare is no simple task, but most hospitals and healthcare providers have already begun transitioning away from on-premises storage and into the cloud. Kaiser health insurance has taken the lead and partnered with Microsoft Azure to integrate cloud solutions into its healthcare system. The benefits of moving to modern infrastructure (the cloud) are numerous. By leveraging cloud technology, providers can benefit from features such as applied machine learning for a better healthcare experience, anomaly detection to help stop bad actors, and routine backups to help mitigate the impact of future ransomware attacks.

With digitization transforming critical industries like healthcare, ensuring a smooth and secure transition to the cloud is vital to both protecting individuals’ personal information and maximizing the ability of these organizations to provide critical, real-world patient care.

What Does Health Data Look Like?

Health data comes with many of the same caveats that make finding and classifying financial data difficult. Identifiers protected by checksums, which otherwise look like strings of random digits of variable length, make up the majority of the health data that we are using to benchmark Open Raven and AWS Macie.

Wikipedia page on how NHS numbers are set up
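As a rough illustration of the check-digit scheme for current-format NHS numbers, here is a minimal Python sketch of the publicly documented modulus 11 rule. The function name is made up for this example, and this is not Open Raven’s or Macie’s implementation; it only covers the 10-digit format, not older ones.

```python
def is_valid_nhs_number(raw: str) -> bool:
    """Check a current-format (10-digit) NHS number via its modulus 11 check digit."""
    # Clean the string: NHS numbers are often written as "943 476 5919".
    cleaned = raw.replace(" ", "").replace("-", "")
    if len(cleaned) != 10 or not (cleaned.isascii() and cleaned.isdigit()):
        return False
    digits = [int(ch) for ch in cleaned]
    # Weight the first nine digits 10, 9, ..., 2 and sum the products.
    total = sum(d * w for d, w in zip(digits[:9], range(10, 1, -1)))
    check = 11 - (total % 11)
    if check == 11:
        check = 0
    # A computed value of 10 can never equal a single digit, so such numbers are invalid.
    return check != 10 and check == digits[9]
```

A candidate string that matches the 10-digit shape but fails this computation can be discarded, which is how a check like this helps precision.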

Our last post in this series highlighted the usefulness of a checksum validation function, using a Luhn check to validate credit card numbers. Here too, performing computations on the collected data to verify its validity plays a significant role in the Open Raven scanner’s ability to deliver high precision.

The checksum we use in our program to validate “DEA” numbers. Cleaning the string and performing computations on the collected data is common to verifying many different types of health data.
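The original code is not reproduced here, so the following is only a sketch of the publicly documented DEA check-digit rule, written in Python. The function name and the shape pattern (two leading characters followed by seven digits) are assumptions for illustration, not Open Raven’s actual code.

```python
import re

# Assumed shape: a registrant-type letter, then a letter or "9", then seven digits.
_DEA_SHAPE = re.compile(r"[A-Z][A-Z9][0-9]{7}")


def is_valid_dea_number(raw: str) -> bool:
    """Validate a DEA registration number using its check digit."""
    # Clean the string first: drop whitespace and dashes, normalize case.
    cleaned = re.sub(r"[\s-]", "", raw).upper()
    if not _DEA_SHAPE.fullmatch(cleaned):
        return False
    d = [int(ch) for ch in cleaned[2:]]
    # Sum digits 1, 3, 5, then add twice the sum of digits 2, 4, 6.
    total = (d[0] + d[2] + d[4]) + 2 * (d[1] + d[3] + d[5])
    # The last digit of that total must equal the seventh digit.
    return total % 10 == d[6]
```

For the made-up value "AB1234563", the sums are 9 and 2 × 12 = 24, giving 33, whose last digit matches the final 3.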

The health data we’re classifying in this post are:

US National Provider Identifier
US National Drug Code
US Medical Beneficiary Number
US Healthcare Common Procedure
US Health Insurance Claim Number
US Drug Enforcement Agency Registration Number
UK National Health Service Number
EU Health Insurance Card Number
Canada Personal Health Number

Use Case: Patient-ID lookup via API Calls

In our recent post on data classification techniques, Open Raven’s founder, Mark Curphey, gives an example of how Open Raven’s validation function can make external API calls, something that is especially relevant to our healthcare customers. Many health institutions have some form of patient-ID lookup, which allows their employees to check whether a provided credential belongs to one of their patients.

Being able to apply logic to the strings collected during benchmarking lets users write branching conditionals, which is useful when dealing with two different sets of data. In this case, a string is sent to a different URL endpoint depending on whether it conforms to the older NHS format or the newer one, based on the length of the string.
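The branching described above might look something like the following Python sketch. Everything here is hypothetical: the endpoint URLs, the function name, and the rule that any cleaned string that is not 10 characters long is treated as the older format. It only builds the URL that would be called and does not perform the request.

```python
from urllib.parse import quote

# Hypothetical endpoints; a real deployment would point at the institution's own lookup service.
CURRENT_FORMAT_URL = "https://patients.example.org/lookup/nhs"
LEGACY_FORMAT_URL = "https://patients.example.org/lookup/nhs-legacy"
CURRENT_NHS_LENGTH = 10  # current-format NHS numbers are 10 digits long


def lookup_url(candidate: str) -> str:
    """Pick the lookup endpoint for a candidate NHS number based on its length."""
    # Remove whitespace and dashes so formatting does not affect the length test.
    cleaned = "".join(candidate.split()).replace("-", "")
    if len(cleaned) == CURRENT_NHS_LENGTH:
        base = CURRENT_FORMAT_URL
    else:
        # Assumption: anything else is routed to the older-format lookup.
        base = LEGACY_FORMAT_URL
    return f"{base}/{quote(cleaned, safe='')}"
```

Keeping the routing decision separate from the HTTP call makes the conditional easy to test without touching a live patient system.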

For medical institutions, the ability to verify that any NHS numbers discovered belong to their patients is vital for prioritizing actionable tasks and preventing patient data breaches.

Side by Side Comparison

In our Benchmarking Financial Data blog post, we discussed how we ran side-by-side comparisons against Macie, which metrics we used to compare the two, and why. If you want to revisit that reasoning or get a more detailed explanation of our benchmarking score methodology, you can read our preceding blog posts.
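As a quick refresher, precision, recall, and F1 follow the standard definitions over true positives, false positives, and false negatives. The sketch below uses a hypothetical helper name and made-up counts; how the binary and non-binary variants count a “match” is defined in the earlier posts and is not reproduced here.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up counts: 8 correct matches, 2 false alarms, 8 misses.
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
```

Because F1 is a harmonic mean, a scanner with high precision but low recall still ends up with a modest F1.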

Now, let’s dive straight into binary F1-scores:

[Figure: four Bokeh plots of binary F1-scores]

And now into non-binary recall:

[Figure: four Bokeh plots of non-binary recall]

Looking at why Macie’s recall falls short on an individual basis, we conjecture that Macie doesn’t match across the different variations of each data class. Rather, it appears to match only one fixed format per class, which would explain why its recall percentages generally sit near common fractions: if a class has, say, three equally represented formats and only one is matched, recall lands near one third. For example, for NHS numbers, we verified that Macie isn’t matching codes generated prior to 1995, which is when the NHS format changed.

[Figure: Bokeh plot]

Below is some of the data we tested against, showcasing the variety of both the content and the formats NHS codes can appear in. We embedded some of these NHS numbers in our testing documents. Notice how the numbers appear in multiple formats, including the outdated NHS format that was deprecated in 1995.

Analysis

We’ve strived to remain as impartial as possible during the entire benchmarking process. As a result, we set aside space in each article to point out the “consequences and effects” of how we perform benchmarking.

We have two teams working independently at Open Raven: one writing scripts to generate synthetic data, and one producing the regexes Open Raven uses to match the data those scripts generate. Siloing our two teams in this way helps minimize the influence of bias between them, and as both teams are in continuous competition, it also creates a cycle of continuous improvement.

Macie’s low recall on a per-file basis shows the benefits of this approach: there are extended variations of the same classification of health data that Macie does not cover. If we removed those from the benchmarking, the end results for Macie would be better. However, both teams independently researched and played war games with each other until we arrived at the scores we have now.

This pattern seems to hold in most instances, both where Macie performs poorly and where it performs well. For data classes that only have one format, Macie generally ties with Open Raven, but in data classes that have multiple formats (as a result of deprecation or changes over the years), Macie doesn’t account for older formats. This can be seen in US NDC Codes, HICN, and UK NHS.
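To make the multiple-format effect concrete, here is a small Python sketch using the publicly documented NDC layouts (three 10-digit forms plus the 11-digit 5-4-2 form). The sample values are made up, and the “single format” matcher is purely hypothetical: it does not represent the patterns Macie or Open Raven actually use.

```python
import re

# Publicly documented NDC layouts, written as labeler-product-package segment lengths.
NDC_FORMATS = {
    "4-4-2": re.compile(r"[0-9]{4}-[0-9]{4}-[0-9]{2}"),
    "5-3-2": re.compile(r"[0-9]{5}-[0-9]{3}-[0-9]{2}"),
    "5-4-1": re.compile(r"[0-9]{5}-[0-9]{4}-[0-9]"),
    "5-4-2": re.compile(r"[0-9]{5}-[0-9]{4}-[0-9]{2}"),
}


def recall(samples, patterns):
    """Fraction of known-positive samples matched by at least one pattern."""
    patterns = list(patterns)
    hits = sum(any(p.fullmatch(s) for p in patterns) for s in samples)
    return hits / len(samples)


# One made-up sample per layout.
samples = ["1234-5678-90", "12345-678-90", "12345-6789-0", "12345-6789-01"]

all_formats = recall(samples, NDC_FORMATS.values())        # every layout covered
single_format = recall(samples, [NDC_FORMATS["5-4-2"]])    # hypothetical one-layout matcher
```

With one sample per layout, covering all four layouts matches everything, while the one-layout matcher finds only a quarter of the samples, which is the kind of fractional recall discussed above.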

Editorial Comments

We didn’t conduct benchmarking on “Universal Device Identifier (UDI)” because of the structure of its strings: values such as "=/AVLogeSNJ4SFPMjg=,CKhshB=}009215" were illegal in the xlsx format. Rather than adding branching logic to the benchmarking for a single case, we opted for the simplest solution and removed this data class.

Furthermore, we didn’t benchmark specific countries such as “France Health Insurance Numbers” or “Finland Health Insurance Numbers” as they shared the same formats as EHIC and performed the same on both platforms, only differing by keywords.