Data Store Fingerprinting with DMAP

Jason Nichols
Principal Engineer
February 9, 2020

For all the time we spend worrying about getting hacked, it turns out the biggest threat to our data is, well…us. In 2019, accidental data breaches eclipsed intentional hacks in the sheer amount of data exposed. The blame game is expected to catch up: according to Gartner, by 2025, 99% of cloud security failures will be attributed to end users, not their cloud service providers.

Fixing the problem starts with a simple question that is deceptively hard to answer: Where’s our data? We’ve met few people inside organizations of any size who feel comfortable that they have a good answer, and even fewer who can answer the follow-on question: “And is it being protected appropriately?”

How did we get here?

  • Data is growing at unprecedented rates
  • Most organizations straddle on-premises and cloud infrastructure (IaaS, SaaS) with no unified view into what is moving between the two or how data is being stored
  • Data is commonly duplicated not only for backups, but also for technical support and for data science efforts to glean insights the organization can use across business functions, like customer support and sales lead generation
  • A large (and growing) number of people handle data: DevOps, IT, security teams and now even boardroom-level risk and insurance functions
  • Responsible data handling is often not a priority within an organization, which leaves training, tools and general awareness lacking

Building a map that identifies and plots the data stores of a modern organization can be incredibly challenging. You have to explore many different areas, from cloud to on-premises to partner networks, and along the way you encounter a highly diverse set of repositories, each with its own unique attributes. The tools we have historically used to tackle this problem typically leave us with a best guess at the operating system, along with the open ports and running services. This is a far cry from a clear label for a data repository, and it leaves considerable manual effort to the user to determine what’s actually running on a server, instance, container, etc.

If data is “the new oil”, why haven’t we put more effort into properly locating and labeling our oil wells?

We built DMAP to eliminate the guesswork in finding data repositories within cloud and corporate (a.k.a. on-premises) networks. DMAP is a machine-learning-based service that uses a wide variety of a data store’s attributes to determine what that data store is and give it a clear label. So instead of “Linux OS running 3 services on ports X, Y and Z” we tell you it’s MySQL, ElasticSearch, etc. It’s sort of like going from people being described by their height, weight and age to simply being told their name. Much easier.

DMAP works with whatever techniques and permissions you make available. Network scanning only? No problem. Cloud APIs? Great. Authenticated or not, DMAP uses the type and level of access provided to determine what the data repository is, and it reports the probability of that determination as well.
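To make the idea of “a label plus a probability from whatever evidence is available” concrete, here is a minimal Python sketch. It is not DMAP’s model or API: the `fingerprint` function, the `WEIGHTS` table, the attribute names and every number in it are hypothetical, and a hand-weighted softmax stands in for a trained machine-learning classifier.

```python
from math import exp

# Hypothetical evidence weights: (attribute, value) -> {label: score}.
# All names and numbers are invented for illustration; they are not DMAP's.
WEIGHTS = {
    ("port", 3306): {"MySQL": 2.0},
    ("port", 9200): {"Elasticsearch": 2.0},
    ("port", 5432): {"PostgreSQL": 2.0},
    ("banner_contains", "mysql"): {"MySQL": 3.0},
    ("banner_contains", "elasticsearch"): {"Elasticsearch": 3.0},
    ("cloud_api_engine", "mysql"): {"MySQL": 5.0},
    ("cloud_api_engine", "postgres"): {"PostgreSQL": 5.0},
}

# "Unknown" starts with a head start so that no evidence yields no label.
BASELINE = {"MySQL": 0.0, "Elasticsearch": 0.0, "PostgreSQL": 0.0, "Unknown": 1.0}


def fingerprint(observations):
    """Return (label, probability) for a collection of (attribute, value) pairs."""
    scores = dict(BASELINE)
    for obs in observations:
        # Observations with no known weight are simply ignored.
        for label, weight in WEIGHTS.get(obs, {}).items():
            scores[label] += weight

    # Softmax turns the raw scores into probabilities that sum to 1.
    top = max(scores.values())
    exps = {label: exp(score - top) for label, score in scores.items()}
    total = sum(exps.values())
    probs = {label: value / total for label, value in exps.items()}

    best = max(probs, key=probs.get)
    return best, probs[best]


# Network scan only: an open port is weak evidence.
scan_only = fingerprint([("port", 3306)])

# Scan plus a service banner: same label, higher probability.
scan_and_banner = fingerprint([("port", 3306), ("banner_contains", "mysql")])
```

In this sketch, adding evidence never changes how the function is called; it only sharpens the probability, which mirrors the point that the answer improves with the type and level of access provided.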
