The Field Guide to Modern Data Security
The rise of the data economy has driven the creation of a massive number of jobs for data scientists and engineers and created extraordinary wealth for the companies who equip them, from Alteryx to Snowflake. Meanwhile, on security and cloud teams, growth is equally evident but much of it has been chasing problems quite outside the rise of big data. Ransomware. Transition to Zero Trust architectures. Shift to remote work. Cloud migrations. And so on.
Yes, these efforts are related to what the data teams are doing. However, just as an entire market has evolved to equip organizations to manage the full life cycle of massive troves of data, security and cloud teams require focused solutions for data as well. Behind the mandate for data specialization is a series of eye-popping statistics such as this: the average person in 2020 created 1.7 MB of data per second (IBM). The challenge is at a scale we can scarcely comprehend, which means no solution can afford to be anything less than wholly focused on the modern data security problem.
Security, privacy and related companies have stepped forward to meet the challenge, but the result thus far has been more confusion than clarity, fueled by overheated marketing claims and overlapping solutions. This is certainly not unique to data security. The brief history of the cybersecurity industry shows that in times of considerable change and growth there is often a period where things are decidedly murky, and it’s hard to distinguish exactly who’s doing what and what, if anything, is sensible to do.
For example, the “next generation endpoint” market was difficult to comprehend early last decade, with a flood of new entrants (40+ companies) that all had different starting points (AI-driven antivirus, EDR, IoC scanning, etc.) and enough funding to create a haze of confusion. The fog lifted only years later when a clear, winning model emerged that consolidated the most successful concepts into unified offerings, at which point we simply returned to referring to it as “the endpoint security market”.
In an effort to drive greater clarity, this blog series breaks down the key problems modern data security products are aiming to solve. For each problem or challenge, we’ve created a decision tree that takes you through relevant criteria (cloud vs. on-premises, security vs. privacy, etc.) which would steer you towards one approach and away from other, less suitable options.
Below is a list of 6 key challenges in modern data security and privacy. Bluntly, the majority are carried over from the last generation of problems, but given the scale, speed, variety and myriad other changes related to data today, the solutions required for each have shifted dramatically in nearly every instance.
We’ll walk through each entry in order, starting with the first in this post which focuses on having enough visibility to take on the remaining 5 with confidence.
- Establish baseline data visibility through location, inventory and classification.
- Control data usage from a centralized policy across multiple locations and providers (governance across clouds), including data access control and policy enforcement.
- Detect and respond to data-specific threats (e.g., external attacks, insider) and anomalies (Next Gen DLP).
  - Incident management of data leaks
  - Remediation & clean-up
- Manage data security posture - assess and respond to risk by type.
  - Remediation & clean-up
  - Both internal & “external”
- Meet compliance requirements for data privacy, governance or security.
  - Privacy operations automation - key processes such as DSAR, RTBF, consent management and streamlined reporting
- Encrypt or tokenize data.
Establish baseline data visibility through location, inventory and classification
Arguably, any effort to improve security begins with an understanding of what you’re trying to protect. In simple terms, this means knowing what data you have, where it’s located and other basics. The effort can be broken down into 3 core activities:
- Location - the act of finding all data inside an environment. At its most comprehensive, it takes as little information as possible (e.g., the organization info of a public cloud, a domain name, etc.) and returns a complete list of active and inactive data services. Most location efforts are less complete than this idealized view, either omitting inactive services (a minor limitation) or covering data in motion only (a major one).
This is akin to the time-honored penetration testing practice of identifying all the assets that compose the network perimeter. Without it, a critical exposure can be missed. With everything located, a solid foundation is established for any actions that follow. As one person has framed it, the closing aperture of locating data is the opening aperture for the next step.
An example of the output here would be a visual map of all data services arranged by network (VPC, domain, etc.).
- Inventory - the act of listing all the data available. This often includes relevant metadata that can be harnessed in the following step, such as size, type, date last accessed, etc. Inventory tells you what data you have, but not necessarily what it is (or what’s inside it). An example of the result here would be a full listing of data objects (files, repositories, etc.).
- Classification - the act of identifying all of the data inventoried. Different types of data classification exist, from sensitivity labeling according to internal policy or objective standards to naming the type of data discovered. The latter approach of naming the data type (personal data, developer secret, payment card, etc.) is more generally applicable and will be the main focus of this guide. The most common and arguably most useful output of a classification initiative is a data catalog.
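To make the inventory and classification steps concrete, here is a minimal sketch in Python. It assumes the data has already been located and is readable as files under a local directory. The class labels, regular expressions and function names are hypothetical choices for illustration and do not reflect how any particular product works. Inventory collects metadata only, while classification looks inside each object and names the data types found.

```python
import re
from dataclasses import dataclass
from pathlib import Path

# Hypothetical data classes and patterns, chosen for illustration only.
PATTERNS = {
    "personal_data:email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "developer_secret:aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "payment_card": re.compile(r"\b(?:\d[ -]?){13,19}\b"),
}


def luhn_valid(number: str) -> bool:
    """Luhn checksum, used to cut false positives on card-like digit runs."""
    digits = [int(c) for c in number if c.isdigit()]
    if not 13 <= len(digits) <= 19:
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


@dataclass
class InventoryItem:
    path: str
    size_bytes: int
    extension: str
    last_accessed: float  # POSIX timestamp from the file system


def inventory(root: str) -> list:
    """Inventory: list every object with metadata, without opening it."""
    items = []
    for p in sorted(Path(root).rglob("*")):
        if p.is_file():
            st = p.stat()
            items.append(InventoryItem(str(p), st.st_size, p.suffix.lower(), st.st_atime))
    return items


def classify_text(text: str) -> set:
    """Classification: name the data types found inside one object."""
    found = set()
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            if label == "payment_card" and not luhn_valid(match.group()):
                continue  # digit run that fails the checksum
            found.add(label)
            break
    return found


def build_catalog(root: str) -> dict:
    """Data catalog: object path -> sorted list of data classes found."""
    return {
        item.path: sorted(classify_text(Path(item.path).read_text(errors="ignore")))
        for item in inventory(root)
    }
```

Calling `build_catalog("/some/directory")` returns a simple catalog keyed by object path. Pattern matching alone tends to be noisy, which is why the card pattern is paired with a checksum; a production classifier would need validation of this kind for most data classes.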
Data in motion or at rest?
The first question dictates just how much data you’re dealing with and how your solution will be deployed. It also typically drives how real-time your understanding of your data is.
Solving for data in motion means you’re likely going to deploy an inline or sidecar based solution such as a proxy or an agent to observe data in near real-time. It also means that any place you do not deploy such a solution will become a visibility gap unless remedied elsewhere.
Solving for data at rest offers the widest and most complete view available, but at the expense of real-time awareness, as there is typically a gap between analyses. Data at rest analysis is usually deployed as some form of scanning or monitoring solution.
Estimates of the share of “dark data” range from 70% to 90%. This is the stationary and often neglected data at rest that will be omitted from a data in motion strategy. Depending on your use case, this is either perfectly OK or a grave limitation.
Cloud vs. On premises?
Yes, there are a number of companies that claim to cover both cloud and on premises data visibility equally well. Technically, it is entirely possible. Practically speaking, there’s little evidence to suggest that any solution provider has an effective, hybrid solution.
On-premises solutions have struggled to keep pace as data exploded, with little means to match the increased volume, variety and velocity of data given their hardware and software constraints. The race to simply maintain a reasonable level of efficacy on premises means that cloud expansion, which requires an entirely new approach to development, architecture, pricing, etc., is often an afterthought or constrained by the pressing needs of the existing on-premises business.
Cloud solution providers also differ meaningfully, as seen in the next step. The sheer variety of cloud platform providers, SaaS services and data services means that few cloud data visibility solutions can claim to be comprehensive, let alone offer a hybrid solution that bridges both on-premises and all of cloud.
Vendors who are emblematic of on-premises data visibility include Varonis, Imperva, Informatica, Titus, Boldon James and many others. Cloud-born vendors range from companies such as Privacera and Immuta to Open Raven and platform “primitives” such as Google Data Loss Prevention (DLP) and AWS Macie.
SaaS vs IaaS/PaaS?
The early days of SaaS were eye-opening with respect to how quickly we could lose a grip on our data. User generated data slipped out to services like Box and Dropbox and completely outside the visibility of our tooling. The words “shadow IT” entered our lexicon and Cloud Access Security Brokers (CASB) were born shortly thereafter in an attempt to corral wayward Marketing, Sales and other teams’ SaaS services back into a basic level of manageability.
The introduction of IaaS/PaaS to replace on-premises infrastructure doubled down on the shadow IT problem, as developers could now stand up entire applications and data centers unbeknownst to IT and security teams alike. This problem in turn spawned a number of solutions, all of which functioned and felt different from CASB offerings, much to the chagrin of people looking for a “single pane of glass” for all things cloud.
Why are solutions for SaaS so different from those for PaaS and especially IaaS visibility and governance? The answer is in the nature of the underlying service itself. SaaS services are discrete offerings aimed at solving a specific problem, whereas IaaS and PaaS offer generalized services (e.g., compute, platforms, etc.) restricted mainly by the imagination of the developers building upon them. As such, solutions for SaaS (e.g., CASB) are largely based on using the APIs provided by a SaaS provider, whether it’s Figma or Asana. Their abilities are largely constrained by the service provider itself and what it permits through its APIs, the only view afforded into a SaaS application. Since the SaaS provider has an incentive to play nicely with others, the basics are usually covered, but there’s often little motivation for a SaaS service to hand over full-featured APIs that allow others to build truly advanced features.
In contrast to the limited capabilities available to CASB and related solutions via APIs, PaaS and especially IaaS allow for a full range of visibility and governance solutions. While the situation varies from Snowflake (cloud data platform) to AWS EC2 (generic compute) or Google Cloud Storage, these are offerings intended for developers to build upon, as opposed to focused services where the APIs are secondary. Thus, a full range of solutions is possible, from point-in-time scanners to always-on agent-based monitoring solutions and everything in between.
This fundamental difference means that there is a narrow set of offerings for SaaS security solutions, dominated largely by CASB with innovation “around the edges” by companies like AppOmni who endeavor to check for important misconfigurations and exposure. Unsurprisingly, this category of products is mature and stand-alone offerings like Netskope face pressure from products like Microsoft’s CASB offering which is included along with an E5 license and often “good enough” for many organizations.
IaaS and PaaS offerings are as diverse as the platform and Cloud Service Providers (CSPs) themselves. There are products such as Open Raven that operate much like a CSPM and focus on data analysis over serverless functions in combination with native APIs, offering the broadest view of what’s present. There are data governance offerings that are agent or sidecar based that focus on data in motion and deploy much like traditional data security products alongside or in front of data stores themselves. There are others that are part of the application code (as libraries or APIs) and used primarily by developers instead of cloud or security professionals. In summary, the solutions for IaaS/PaaS are as diverse as what you can do with the platform itself and what you choose should depend heavily on what you need. For example, if you’re concerned about baseline visibility and governance across your entire cloud environment, a solution like Open Raven’s fits the bill. If you’re primarily interested in examining application traffic on a few key data stores, a data in motion offering such as those offered by Satori or Cyral may be a fit.
Given these profound differences, one should be skeptical of a solution built for SaaS (CASB) which claims IaaS and PaaS capabilities. And to be fair, the reverse is likely also true. SaaS, IaaS and PaaS are all moving incredibly fast and require the focused attention of any vendor that endeavors to solve such a large problem. None can be done effectively as a part-time “hobby”. In fact, there are no solution providers who can credibly claim to cover even all of one service type equally well: meaningful differences exist across vendors’ coverage of AWS, GCP and Azure, for example, that reflect both the unique nature of the CSP and the vendor’s own priorities. The same is true for coverage of data services/stores themselves, which range broadly.
In summary, if you want to cover SaaS data issues, look closely at what the SaaS service providers offer (e.g., Box.com’s Shield offering) and if you need something across many services, a CASB is likely in order. If your primary interest is IaaS or PaaS, look past CASB’s coverage claims and instead look to solutions intentionally designed for the problem you’re aiming to solve. The more clearly you can pinpoint your exact needs and examine a vendor’s fit beyond the hyperbole, the better the fit is likely to be.
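The guidance in the sections so far can be loosely encoded as a small decision function. The sketch below is one reading of the text, not the decision tree from this series. The type names, field names and returned strings are all hypothetical, and the cloud vs. on-premises question is left out for brevity.

```python
from dataclasses import dataclass
from enum import Enum


class ServiceModel(Enum):
    SAAS = "saas"
    IAAS_PAAS = "iaas_paas"


class DataState(Enum):
    IN_MOTION = "in_motion"
    AT_REST = "at_rest"


@dataclass(frozen=True)
class Need:
    service_model: ServiceModel
    data_state: DataState
    many_services: bool = False  # only consulted for SaaS


def suggest_approach(need: Need) -> str:
    """Map a stated need to the category of solution the text points to."""
    if need.service_model is ServiceModel.SAAS:
        # SaaS visibility is bounded by what the provider's APIs expose.
        if need.many_services:
            return "CASB"
        return "provider-native controls"
    # IaaS/PaaS: look past CASB claims to purpose-built tooling.
    if need.data_state is DataState.IN_MOTION:
        return "inline proxy, agent or sidecar on key data stores"
    return "scanning or monitoring of data at rest across the environment"


print(suggest_approach(Need(ServiceModel.IAAS_PAAS, DataState.AT_REST)))
```

A real evaluation would weigh many more criteria than these three fields, including who is asking, which is the subject of the next section.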
Security vs. Privacy vs. Data Teams?
The final but perhaps the most important indicator of the right tool for the job is exactly who is trying to solve the problem. While each person may use similar words to express a pain point, they typically will mean different things given the diverse nature of their respective jobs. The table below is a crude, but clear summary of the varying motivations and interests:
There are undoubtedly similarities and overlap with each of their needs. Each requires some level of data discovery. Each would claim classification is important. Each would tell you reporting is essential. But a closer look reveals each one of these is quite different depending on who is asking for it.
Visibility for data teams generally starts with known data stores and inventorying their contents. Looking for unknown data is less interesting as there are already vast amounts of known data that need to be understood and processed. Data teams also typically “live closer to the applications” than security and privacy teams so they are less likely to be unaware of where data is located, as well as less interested in failure states and mishaps.
The type of data being inventoried for later usage will range across all types, but commonly biases towards semi-structured and structured data that are produced by applications vs. people.
Data team classification needs are likely to be specific to the business itself, aimed at whatever is needed to drive data science and business intelligence efforts. Flexibility is key here and some manual classification work, while not necessarily welcome, is acceptable as it falls well within the abilities of the typical data engineer.
Visibility solutions for data teams include companies such as Informatica, Collibra and Alation.
Visibility for privacy teams lies somewhere between the needs of data and security teams. While privacy teams are primarily concerned with making sure baseline processes work with known data sources, they also care about completeness when fulfilling key tasks such as data subject access requests. Thus, locating unknown data stores is of interest, albeit not a primary concern.
Data inventory and classification for privacy needs centers on personal data of all varieties. While this somewhat narrows the focus (at least compared to security teams), the broad definition of personal data (by industry, geography, etc.) means the net will still be cast broadly and likely need customization to be successful. The more international and complex the business, the greater the need for services and a robust, flexible solution.
The lack of skilled privacy professionals, and especially privacy engineers, means solutions will either need to be as automated and straightforward as possible or require a healthy amount of services assistance. The latter has more often been the case historically.
Privacy offerings in this category are offered by companies such as OneTrust, BigID, and Privaci.ai.
Unlike privacy and data teams, security teams’ principal need to manage risk means they care more about unknown data stores than any other group. Why? They are the most likely to be unmanaged and hence unprotected. It’s the data that was copied for test purposes and never deleted. The data used for a customer support incident that was never cleaned up… or properly protected. Hence, security teams have the most stringent requirements for data location: they begin with the assumption that all data stores must be proactively discovered. And given the dynamic nature of modern environments, discovery must be refreshed repeatedly (e.g., hourly, daily, weekly) to ensure results are still accurate.
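The repeated refresh described above amounts to comparing each new discovery run against the previous one and against the list of stores already under management. A minimal sketch, assuming each run yields a set of data store identifiers; the function names and store identifiers are hypothetical:

```python
from typing import Dict, Set


def find_unmanaged(located: Set[str], managed: Set[str]) -> Set[str]:
    """Data stores that were located but are not on the managed list."""
    return located - managed


def diff_snapshots(previous: Set[str], current: Set[str]) -> Dict[str, Set[str]]:
    """What appeared or disappeared between two discovery runs."""
    return {
        "appeared": current - previous,
        "disappeared": previous - current,
    }


# Hypothetical identifiers; a real run would come from a discovery tool.
monday = {"s3://billing-exports", "rds://orders-db"}
tuesday = {"s3://billing-exports", "rds://orders-db", "s3://support-case-copy"}
managed = {"s3://billing-exports", "rds://orders-db"}

changes = diff_snapshots(monday, tuesday)
print(sorted(changes["appeared"]))
print(sorted(find_unmanaged(tuesday, managed)))
```

The shorter the interval between runs, the smaller the window in which a copied or forgotten data store can sit unnoticed.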
Data inventory for security biases towards unstructured and semi-structured data, the opposite of most data teams’ interest. This is due to the typically more diverse, less controlled nature of unstructured data. Thus the solution must inventory the objects and determine their type prior to classification. And the types can vary wildly from a compressed 3GB JSON file to multimedia content and source code. The well-defined columns and rows of classic structured data are nowhere to be seen and few assumptions can be made safely.
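Determining an object’s type before classification can be sketched with a few well-known file signatures plus a fallback that tries to parse the content. This is an illustrative sketch only: the function names are hypothetical, only three signatures are checked, and it reads the whole object into memory, which would not be practical for something like the 3GB compressed JSON file mentioned above. A real implementation would stream and sample instead.

```python
import gzip
import json


def sniff_type(data: bytes) -> str:
    """Guess an object's type from its leading bytes, then from its content."""
    if data[:2] == b"\x1f\x8b":
        return "gzip"
    if data[:4] == b"%PDF":
        return "pdf"
    if data[:8] == b"\x89PNG\r\n\x1a\n":
        return "png"
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return "binary"  # not valid UTF-8 and no known signature
    try:
        json.loads(text)
        return "json"
    except ValueError:
        return "text"


def describe(data: bytes) -> str:
    """Unwrap one layer of gzip compression so the inner type is visible."""
    kind = sniff_type(data)
    if kind == "gzip":
        return "gzip/" + sniff_type(gzip.decompress(data))
    return kind


print(describe(gzip.compress(b'{"customer": "jane@example.com"}')))
```

Knowing that an object is compressed JSON rather than opaque binary tells the classifier how to read it, which is the point of determining type before classification.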
Data classification efforts are equally challenging as many types of data are of interest, from the personal data emphasized by privacy professionals to categories that are mainly of interest to security professionals such as developer secrets. Custom data classes are also common, such as special project names and application specific keys.
Locating data services for security teams is typically done via tools that are not specifically designed for the task, such as a CSPM or a cloud management or observability offering. Locating the information of interest is doable in this fashion, albeit time consuming and imperfect. Open Raven is one of the few, if not the only, service built specifically for locating known and unknown data stores.
There are more data inventory and classification offerings suitable for security teams, such as Google DLP and AWS Macie. Open Raven does both as well and like the previously mentioned platform primitives, emphasizes data services and classes of interest to security and cloud teams alike.
While there are overlaps among privacy, governance and security tools and services, the nuances found within each do require specific identification before a proper solution should be selected. We hope the first part of this guide serves the purpose of helping teams navigate through seemingly similar but different service types when it comes to addressing data security at the speed and scale of the cloud. To read more about what we’re building for security and cloud teams, visit our website or schedule a demo to see it for yourself.