Discover and Classify Data

Data Warehouses, Data Lakes, Data Lakehouses and the Data Engineering Landscape - Part 2 of 5

Bele
Chief Corvus Officer
February 23, 2021

This is the second part of a five-part (Part One | Part Three | Part Four | Part Five) series about designing and building data classification systems for security and privacy. It is a technical series written specifically for security professionals who want to peek under the hood. It is unapologetically about AWS, the undisputed king of the clouds, and about data warehouses, data lakes and data lakehouses. The important data breaches these days just aren't coming from Word docs on your file server or a spreadsheet in an inbox.

In this second part I want to lay the groundwork for the remaining posts, the technical ones, so that you have the context for why certain things are important and know what to look for, and where to look, in your environment.

To reiterate, these posts are written for security professionals, who work in an industry where it is simply impossible to stay up to date on every important topic, which is why I felt the first two posts of this series were needed to set the scene.

I will explain why data warehouses are more interesting to security people than data lakes today (breakers) and why that will change (builders). 

If you are a data engineer you can skip straight to the next post in the series where it gets fun. Nothing to learn here. 

I don't think I need to sell anyone on why security people should care about data, but it is worth pointing out that in what seems like a distant time, before the cloud, attempts to secure data were centered on the infrastructure that persists and provides access to it. As everything moved to the cloud, access and other security controls moved as well, but they are largely ephemeral in nature and you no longer have direct, let alone total, control. Combine that with the reality that companies want to enable as much access to data as possible to fuel innovation, and data is the new endpoint, one you simply can't put an agent on. It's a new world.

In this post I cover:

  • What are data warehouses, data lakes and data lakehouses?
  • Why is Snowflake such a big deal?
  • Why AWS S3 rules the roost
  • The resurgence of SQL

What are data warehouses, data lakes, and data lakehouses?

First came the database, a transactional system that creates and updates data in real time. Databases themselves have undergone significant change, not least from old-school Oracle systems that typically required a dedicated DBA to provide care and feeding (usually grumpy old bar stewards in my experience). Since then we have seen the rise of developer staples like MySQL and PostgreSQL, driven in part by open-source licensing and developer-led adoption, and the NoSQL movement, with technologies like MongoDB leading the way.

Sensitive data has always been stored in databases and likely will be for the foreseeable future. As I said, they create and update data in real time, and even in serverless and data-pipeline architectures a level of persistence close to the code is likely to remain for a long time. AWS offers Oracle, PostgreSQL and other engines as managed databases in a service called RDS (Amazon Relational Database Service). There is probably an entire blog series to be written about sensitive data in databases, but there is a much bigger and much more important fish to fry: the data warehouse.

In contrast to databases, a data warehouse stores aggregated, structured data that has been put through an extract, transform and load process known as ETL. Unlike a database, data doesn't originate in a data warehouse and isn't processed there; instead, specific data is extracted from several sources, transformed into a usable format, and stored in the warehouse in a time-series fashion so it can be used for analytics and reporting.

[Diagram: CRM, RDBMS, and supply chain systems feed into an ETL process, which loads the data warehouse.]

Data sources are usually much more diverse than just a database, including document stores, CRM systems, log files and streaming data. As you can see, a data warehouse can (but does not necessarily) house sensitive data, because only specific data is extracted from each source. And there are very few, if any, security controls (more on the data lakehouse later).
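To make the ETL idea concrete, here is a minimal sketch of a batch extract-transform-load step in Python. Everything in it is hypothetical: the CSV export standing in for a CRM source, the column names, and the warehouse connection string.

```python
import pandas as pd
from datetime import datetime, timezone
from sqlalchemy import create_engine

# Extract: pull raw records from a source system (a CSV export stands in
# for a CRM or transactional database here).
orders = pd.read_csv("crm_orders_export.csv")

# Transform: keep only the fields analysts need, normalize types, and stamp
# the batch so the warehouse builds up a time series of loads.
facts = orders[["order_id", "customer_id", "amount", "country"]].assign(
    amount=lambda d: d["amount"].astype(float),
    loaded_at=datetime.now(timezone.utc),
)

# Load: append the batch to a warehouse table. The connection string is a
# placeholder for whatever warehouse you actually run (Redshift, Postgres, ...).
engine = create_engine("postgresql://user:password@warehouse.example.com/analytics")
facts.to_sql("fact_orders", engine, if_exists="append", index=False)
```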

A data lake is different: it is essentially a dumping ground for all your data. It's a centralized repository that allows you to store structured and unstructured data at any scale. The data lake was born from the strategy that you simply don't know what you might want to do with data in the future, and that the cost of storage is so small that storing it all "just in case" makes sense.

If a data warehouse is a bottle of mineral water from a specific source in the French Alps, a data lake is a mass volume of dirty water with runoff from rivers, streams and power plants.

[Diagram: CRM, RDBMS, and supply chain systems feed both an ETL process (and from there the data warehouse) and the data lake directly; the data warehouse, files/logs, and streaming sources also feed the data lake.]
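For the data lake, the mechanics are even simpler. The hedged sketch below just drops a raw event into an S3 bucket with boto3; the bucket name and key layout are made up, but this is essentially all a lake ingest step has to do.

```python
import json
import boto3

s3 = boto3.client("s3")

# A data lake is largely object storage plus a naming convention: store the
# raw event exactly as it arrived and decide what to do with it later.
event = {"user": "alice@example.com", "action": "login", "ip": "203.0.113.7"}

s3.put_object(
    Bucket="example-data-lake",                          # hypothetical bucket
    Key="raw/auth-events/2021/02/23/event-0001.json",    # partitioned by date
    Body=json.dumps(event).encode("utf-8"),
)
```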


In recent years the term data lakehouse has emerged. The name comes from the analogy that to get on and off the lake you go through the lakehouse. A lakehouse is essentially a control plane for the data lake, providing standard file formats for writing data (proprietary in Snowflake's case), schema support for data governance, direct access to the data itself (rather than fishing for it), and separation of storage and compute for scaling.
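As a rough illustration of those lakehouse properties, here is a sketch using the open-source Delta Lake format via the deltalake Python package (Snowflake's equivalent format is proprietary). The table location and columns are hypothetical, and AWS credentials are assumed to be available in the environment.

```python
import pandas as pd
from deltalake import DeltaTable, write_deltalake

customers = pd.DataFrame(
    {"customer_id": [101, 102], "email": ["a@example.com", "b@example.com"]}
)

# Write a standard, open table format (Parquet files plus a transaction log)
# straight into the lake -- storage is just objects in a bucket.
write_deltalake("s3://example-lakehouse/customers", customers)

# Read it back directly, with the schema enforced and versioned by the format,
# using whatever compute you choose -- storage and compute stay separate.
table = DeltaTable("s3://example-lakehouse/customers")
print(table.schema())
print(table.to_pandas())
```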


Lakehouses will undoubtedly become key security control planes in the future, controlling identity and access to data, but today I generally see end-to-end pipelines that have yet to be secured, valuable stashes of crown jewels scattered around companies in warehouses, and sparse security controls.

This is why I believe that data warehouses, and not data lakehouses, are where the security industry needs to focus its attention for practical purposes over the next 12 to 18 months.

Data lakehouses offer a longer-term opportunity for us to build security controls into data pipelines (the builder security mentality), whereas data warehouses offer an immediate opportunity to find and secure sensitive data and prevent breaches today.

Why is Snowflake such a big deal?

You have probably heard of Snowflake, if not for the technology itself then for the financial success of the company's IPO; it's currently worth around $75 billion. Snowflake started as a cloud data warehouse and has grown into a data lakehouse, providing near-infinite storage and a control plane on top of the data. It's a data platform.

It is important to keep an eye on Snowflake and other emerging data platforms because many people believe we are in the second wave of the move to the cloud, in which companies like Snowflake and Netlify build specialist services on top of cloud providers like AWS and GCP, achieving economies of scale beyond what the likes of AWS and Microsoft can deliver themselves. In this second wave of data platforms you don't even need to manage the cloud infrastructure to do your data engineering, and Snowflake has proven this to be true. It's a parasite on AWS, but one with a symbiotic relationship.

With the move from databases to data warehouses to data lakes has come the ability to store ever more data. With more data we store more crown jewels and create higher and higher value targets for hackers.

Why AWS S3 rules the roost

The excellent Last Week in AWS publishes a regular Leaky Bucket Wall of Shame, but you can just Google "S3 bucket breach". Most readers won't need to; it's all over the press almost every day. So why do people keep using S3? The answer is simple. To recite the AWS marketing: "Amazon S3 has a simple web services interface that you can use to store and retrieve any amount of data, at any time, from anywhere on the web".
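Before going any further, it's worth seeing just how little it takes to end up on that wall of shame. The sketch below uses boto3 to attach the classic public-read policy to a bucket; the bucket name is hypothetical, and accounts with S3 Block Public Access enabled will reject the call.

```python
import json
import boto3

s3 = boto3.client("s3")

# One policy statement is all it takes: anyone on the internet can now read
# every object in the bucket. This is the root cause of most "leaky bucket"
# breaches. (Bucket name is hypothetical; S3 Block Public Access, where
# enabled, will refuse this request.)
public_read_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowPublicRead",
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::example-data-lake/*",
        }
    ],
}

s3.put_bucket_policy(Bucket="example-data-lake", Policy=json.dumps(public_read_policy))
```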

And the marketing isn't wrong. Universal, unbounded data storage that is dirt cheap to use, high performance and ultimately flexible. Dump it in S3 and use it however you want. Make it available to anyone or anything with a simple click or a quick bucket policy (as above) and never worry about it again. Oh, and it's 99.999% reliable, although these days when the 0.001% kicks in, bad things happen. S3 buckets have become the ubiquitous technology behind data warehouses and data lakes because you simply don't need to worry about object storage anymore. In later parts of this series, I will talk about and show you how to use columnar file formats like Apache Parquet to store big data in buckets; a small taste is below.
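As a preview, here is a minimal sketch of writing a columnar Parquet file straight into a bucket with pyarrow. The bucket, key, and columns are hypothetical, and pyarrow is assumed to pick up AWS credentials from the environment when given an s3:// URI.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Columnar layout: each column is stored (and compressed) together, so query
# engines can scan only the columns they need.
customers = pa.table(
    {
        "customer_id": pa.array([101, 102, 103], type=pa.int64()),
        "country": pa.array(["DE", "US", "FR"]),
        "spend_usd": pa.array([120.50, 89.00, 300.25], type=pa.float64()),
    }
)

# Write the Parquet file directly to object storage.
pq.write_table(customers, "s3://example-data-lake/curated/customers/part-0001.parquet")
```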

The resurgence of SQL 

As a closing note, it's worth understanding that despite the mass aggregation of structured, semi-structured and unstructured data (things I will cover in later posts), there is an interesting trend: SQL is trendy again. Technologies like dbt have embraced people's familiarity with SQL, and it's now standard to ETL data in and out of a warehouse using SQL. SQL has become very relevant to data lakes, and this excellent article on some of the technical reasons why that is true saves me from dragging this post out any longer.
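To show what that looks like in practice, here is a hedged sketch of running plain SQL over files sitting in a lake using Amazon Athena via boto3. The database, table, and result bucket names are hypothetical.

```python
import boto3

athena = boto3.client("athena")

# Plain SQL over objects in the lake -- no database server to run or patch.
query = """
    SELECT country, COUNT(*) AS customers, SUM(spend_usd) AS total_spend
    FROM curated_customers
    GROUP BY country
    ORDER BY total_spend DESC
"""

response = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "data_lake"},          # hypothetical Glue database
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
print(response["QueryExecutionId"])  # poll get_query_execution for completion
```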

SQL injection on a data lake, anyone?
