
Operationalizing Data Security in the Cloud at Scale - Data Classification Explained Part 3

Jeffrey Binder
Security Researcher
March 8, 2024

This blog series aims to explain our core principles, systems, and methods for classifying sensitive data residing in unstructured and structured cloud data stores at scale. In part one, we revisited the principles behind Open Raven's data classification system, our methods, and improvements made since we launched our platform. In part two, we discussed the role of ML in data classification, as well as its benefits and drawbacks. In part three, we examine data classes and classification through the lens of SaaS DLP use cases. 

SaaS platforms like Google Workspace are integral to how many companies do business, increasing collaboration and productivity. For security teams, they are a new data surface that requires visibility and control. Far too often, employees inadvertently share sensitive information with people who shouldn't have access to it. SaaS platforms can also lead to trouble when offboarding employees or partners: merely revoking access and transferring file and folder ownership upon departure is insufficient to remediate all data security risks.

Gaining visibility into and control of sensitive data in SaaS platforms requires understanding what sorts of sensitive information are present and at risk. Sensitive data can mean different things to different companies: personal ID numbers, financial account details, API keys, health information, geolocation data, trade secrets, customer lists, and more. HR departments routinely gather personal information, such as social security numbers and tax IDs, that hackers can use in identity theft. The exposure of API keys can lead to broader data breaches and ransomware attacks. Code, CAD files, and design documents can contain intellectual property that must be kept from the eyes of competitors.

Even when stored in secure databases, sensitive data can spill out into SaaS platforms by accident. Google Sheets, for example, provides a convenient way of analyzing data, but it also creates cloud-based copies that can increase the attack surface. PII can also find its way into meeting notes, chat transcripts, and collaborative documents when employees paste in examples or cases they are working on. These copies may soon be forgotten, but they can remain in the cloud for years, creating the risk of a data breach due to incorrect permissions.

The Open Raven Data Security Platform can detect a wide range of data types that raise security concerns. We've developed over 300 data classes for detecting common and specialized types of personal, health, financial, and developer data, ranging from names, addresses, and credit card numbers to medical diagnostic codes, MAC addresses, and numerous API key types. Our customers can also create custom data classes to handle company-specific data types. Our data catalog gives visibility into where this information resides within IaaS, PaaS, and SaaS services. 

Open Raven approaches data classification in SaaS much like we do in IaaS and PaaS. We often use the same matching logic whether we are scanning a cloud storage bucket, a spreadsheet, or a database table: an RSA key is an RSA key, no matter where it occurs. In some cases, however, different contexts require different approaches. That's why we recently rolled out the ability to tailor matching logic to different platforms and file types. For example, when scanning databases, we can detect payment card numbers with high confidence without the need for keyword matching; we can thus find where credit card information is stored even if the column is simply called "column1."
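To make the idea of per-platform matching logic concrete, here is a minimal sketch. It is an illustration only, not Open Raven's actual rule format: the `MatchRule` class, the `RULES` table, the source-type names, and the keyword list are all hypothetical. The same pattern is reused for every source, and only the keyword requirement changes.

```python
import re
from dataclasses import dataclass

# Hypothetical rule shape for illustration; not Open Raven's real schema.
@dataclass(frozen=True)
class MatchRule:
    pattern: re.Pattern
    required_keywords: tuple = ()  # empty tuple: no keyword is required

# A 12-19 digit run that is not part of a longer digit string.
CARD = re.compile(r"(?<!\d)\d{12,19}(?!\d)")

# Same pattern everywhere; only the context requirement differs by source.
RULES = {
    "database": MatchRule(CARD),                       # column values: no keywords needed
    "document": MatchRule(CARD, ("card", "payment")),  # free text: require nearby wording
}

def candidates(text: str, source_type: str) -> list:
    """Return candidate card numbers using the rule for this source type."""
    rule = RULES[source_type]
    lowered = text.lower()
    if rule.required_keywords and not any(k in lowered for k in rule.required_keywords):
        return []
    return rule.pattern.findall(text)
```

With this table, a bare value from a column named "column1" is still reported under the `"database"` rule, while the same digits in a document are reported only when one of the assumed keywords appears in the text. A production rule would also validate the candidates rather than rely on the pattern alone.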

Free-form text files like Google Docs require their own data validation methods. We often encounter strings in Docs that look like payment card numbers but are not. For instance, URLs pasted from social media platforms often contain post IDs about as long as a card number (typically 12–19 digits). Although real card numbers can be verified using their check digit, a randomly chosen number of the right length will pass that test by coincidence around 10% of the time, leaving substantial room for false positives. For this and other reasons, when scanning Google Docs and other types of unstructured text, we examine the surrounding context to ensure that we've really found what we're looking for.
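The check-digit scheme used by payment cards is the Luhn algorithm, and the 10% figure follows from it: for any fixed prefix, exactly one of the ten possible final digits passes. The sketch below shows that check combined with a simple context test. The keyword list, the 40-character window, and the function names are assumptions made for illustration, not a description of Open Raven's implementation.

```python
import re

def luhn_valid(number: str) -> bool:
    """Luhn check: double every second digit from the right, sum, test mod 10."""
    total = 0
    for i, ch in enumerate(reversed(number)):
        d = int(ch)
        if i % 2 == 1:      # every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9      # same as adding the two digits of the product
        total += d
    return total % 10 == 0

CARD_RE = re.compile(r"(?<!\d)\d{12,19}(?!\d)")
# Hypothetical context words; a real list would be longer and tuned.
CONTEXT_KEYWORDS = ("card", "visa", "mastercard", "amex", "payment", "cvv")

def find_cards(text: str, window: int = 40) -> list:
    """Keep digit runs that pass Luhn and have a context keyword nearby."""
    hits = []
    for m in CARD_RE.finditer(text):
        if not luhn_valid(m.group()):
            continue
        start = max(0, m.start() - window)
        context = text[start:m.end() + window].lower()
        if any(k in context for k in CONTEXT_KEYWORDS):
            hits.append(m.group())
    return hits
```

Here `find_cards("Customer card number: 4111111111111111")` returns the number (a widely used test value), while `find_cards("https://social.example/posts/4111111111111111")` returns an empty list even though the ID passes the Luhn check. The sketch ignores card numbers written with spaces or dashes.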

This approach allows us to detect sensitive data even in places where it's unexpected. It's one thing to spot a database column called "customer_ssn." It's another to detect SSNs that find their way into more informal documents, such as transcripts of customer service chats or spreadsheets—or to find SSNs in a generically named database column like "field1." By tailoring our methods to the various types of cloud data companies have, we provide comprehensive classification in a wide range of places where sensitive data can be found.
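One way to find SSNs without relying on a column or field name is to validate the values themselves. The sketch below applies the publicly documented structural rules for SSNs (the area is never 000, 666, or 900–999; the group is never 00; the serial is never 0000) to any text, whether it is a chat line or a cell from "field1". The function names are hypothetical, and this is a simplified illustration rather than Open Raven's matching logic.

```python
import re

# Dashed SSN form only; a fuller matcher would also handle other separators.
SSN_RE = re.compile(r"(?<!\d)(\d{3})-(\d{2})-(\d{4})(?!\d)")

def plausible_ssn(area: str, group: str, serial: str) -> bool:
    """Reject number ranges that are never issued as SSNs."""
    if area in ("000", "666") or area.startswith("9"):
        return False
    return group != "00" and serial != "0000"

def find_ssns(text: str) -> list:
    """Return SSN-shaped strings whose parts fall in issuable ranges."""
    return [m.group() for m in SSN_RE.finditer(text) if plausible_ssn(*m.groups())]
```

Because the check looks only at the value, it behaves the same on a chat transcript and on a generically named column. In free text, a structural check like this would normally be paired with a context test to keep false positives down.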

Our platform can also scan file metadata for sensitive information. Identifying inherently sensitive file types, such as CAD/CAM drawings, and sensitive metadata, such as geolocation data, provides valuable context for understanding risk. Some types of sensitive data can also be spotted simply by checking the filename, since workplace file names often reflect a document's subject or business purpose. Red flags include terms such as Quarterly Earnings, Customer List, Employment Contract, and Board Update, which are likely signs that a file could lead to trouble if exposed to people who shouldn't see it.
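A filename check of this kind can be sketched in a few lines. The red-flag terms below are the ones named above. The extension list and the function names are assumptions added for illustration, not the platform's actual configuration.

```python
import re

# Terms taken from the examples in the text.
RED_FLAG_TERMS = (
    "quarterly earnings",
    "customer list",
    "employment contract",
    "board update",
)
# Hypothetical examples of CAD file extensions.
SENSITIVE_EXTENSIONS = (".dwg", ".dxf", ".step")

def normalize(filename: str) -> str:
    """Lowercase and turn underscores, hyphens, and dots into single spaces."""
    return re.sub(r"[\W_]+", " ", filename).strip().lower()

def filename_flags(filename: str) -> list:
    """Return the red flags raised by a filename, if any."""
    name = normalize(filename)
    flags = [term for term in RED_FLAG_TERMS if term in name]
    if filename.lower().endswith(SENSITIVE_EXTENSIONS):
        flags.append("sensitive file type")
    return flags
```

For example, `filename_flags("Q3_Quarterly-Earnings_draft.xlsx")` returns `["quarterly earnings"]`, and `filename_flags("bracket_v2.dwg")` returns `["sensitive file type"]`. Normalizing separators first lets "Customer_List" and "customer-list" match the same term.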

As more companies adopt SaaS platforms for their day-to-day work, cloud data security is becoming a must. Open Raven's platform provides much-needed visibility into the massive amounts of data residing in SaaS accounts, enabling our customers to find security risks before they become serious liabilities.
