Discover and Classify Data

Finding and Determining What Data To Classify - Part 3 of 5

Chief Corvus Officer
March 29, 2021

In this third part of a five-part series (Part One | Part Two | Part Four | Part Five), I describe how to find where the interesting data is stored. Like the other posts in the series, this one is unapologetically focused on AWS S3, the unbounded data storage behind data warehouses and data lakes. The rationale is simple: this is where you are almost guaranteed to find the crown jewels, and it’s the first port of call for attackers.

In the first section, What's In Your AWS Environment?, I explain the ways to find all the S3 buckets in your AWS accounts complete with IAM roles for the ‘assume role’ method we prefer at Open Raven.

In the second section, Now You Know What Buckets You Have, What’s Next?, I explain how to look inside the buckets themselves and the challenges of examining objects across all the buckets. As the number of objects can grow to billions and even trillions in many AWS infrastructures, it’s more complicated than it seems on the surface. Frankly, scaling to meet the challenge (at speed, at a reasonable cost, etc.) is something we are still (and will likely always be) working on. In this post, I cover how we approach determining the right files to analyze, how to deal with massive files, how Apache Parquet and columnar file formats work, and why this is where much of the most sensitive data is stored.

For this post I am joined by our Head of Engineering, Oliver Ferrigni and our SRE Manager, Chris Webber. They make the magic work in the product.

What's in your AWS Environment?

While this section describes finding AWS S3 buckets, the same principles apply to any AWS asset.

There are four ways you can get access to an AWS environment:

  1. Using an external identity provider (IDP) e.g., SAML / SSO
  2. Being ‘inside’ the AWS environment i.e., on top of AWS Lambda or EC2
  3. As an Identity & Access Management (IAM) user
  4. Using the root account - If you are using the root account for anything other than initial account setup, stop right now and go create yourself an IAM user and slap yourself in the face, hard.  

Ultimately, whichever way you come in, you have an identity. Once you have an identity, you can either make calls directly as yourself or you can assume a specific IAM role to use a different set of privileges. The concept behind this is a tried and tested security model of least privilege, where you constrain the assumed identity to the minimal set of privileges to do what it needs to accomplish and tightly control who is able to assume that role. You can grant access to this specific role across your entire AWS environment or solely to parts. 
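To make the assume-role step concrete, here is a minimal Python sketch. It assumes an STS client that behaves like Boto3's (`boto3.client("sts")`) and exposes `assume_role`; the helper name `assume_discovery_role` and the default session name are hypothetical and not part of the Open Raven product.

```python
from typing import Any, Dict, Optional


def assume_discovery_role(
    sts_client: Any,
    role_arn: str,
    session_name: str = "discovery-session",  # hypothetical session name
    external_id: Optional[str] = None,
) -> Dict[str, str]:
    """Assume an IAM role and return keyword arguments for new clients.

    sts_client is expected to behave like boto3.client("sts").
    """
    params = {"RoleArn": role_arn, "RoleSessionName": session_name}
    if external_id is not None:
        # Only sent when the role's trust policy requires an external ID.
        params["ExternalId"] = external_id

    credentials = sts_client.assume_role(**params)["Credentials"]

    # These temporary credentials carry only the privileges of the assumed role.
    return {
        "aws_access_key_id": credentials["AccessKeyId"],
        "aws_secret_access_key": credentials["SecretAccessKey"],
        "aws_session_token": credentials["SessionToken"],
    }
```

With Boto3 installed, the returned dictionary can be passed directly to a client constructor, for example `boto3.client("s3", **assume_discovery_role(boto3.client("sts"), role_arn))`.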

Note: AWS IAM is a complex beast and describing how to determine exactly who has access to what is more of a book than a blog post. We are building capabilities to answer these questions now inside the Open Raven platform and we will blog about them in due course.

For the IAM role we use to perform discovery, we use the ReadOnlyAccess AWS managed policy which grants read only access to everything in your environment. We use ReadOnlyAccess instead of ViewOnlyAccess because we want to be able to look at the contents of things. For example, both policies grant `s3:ListBucket` but ReadOnlyAccess also grants `s3:GetObject` which allows us to look at the content of the S3 object. 

Knowing exactly which privileges give access to which information is complex, and it is one of the main reasons why many systems are set up granting privileged access at will. Developers and cloud architects tell themselves that when they come up for oxygen they will whittle things back down to least privilege, and guess what? You have seen this movie before. It has a predictable ending.

At Open Raven we have built a technology called DMAP that enables our users to quickly determine what data services and applications are running on AWS EC2 instances. This helps users identify non-native data stores like MySQL or MongoDB running on AWS EC2. DMAP runs as a serverless Lambda function, and the AWS managed AWSLambdaVPCAccessExecutionRole policy is used to enable the Lambda functions to operate. Because we apply these permissions to an IAM Role, the role can also be used by Lambda directly.

Below is an example of the policy document we apply:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                ...
            ],
            "Resource": [
                "arn:aws:lambda:*:<AWS ACCOUNT ID>:function:dmap-*"
            ],
            "Effect": "Allow"
        },
        {
            "Action": "iam:PassRole",
            "Resource": [
                "arn:aws:iam::<AWS ACCOUNT ID>:role/openraven-cross-account-00g1khued7w30nmYU0h8"
            ],
            "Effect": "Allow"
        }
    ]
}

This is what allows us to create and run the Lambda functions themselves. 

Note to the cautious: AWS provisions a quota of Lambdas that you can run but does not constrain anyone inside the account from consuming them all. It's a nice DoS attack if you are inside of an account, can spin up a Lambda and want to stop other applications using Lambdas from running. Seriously, take note of this constraint or drop us a line for advice.
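One way to keep an eye on this quota is to read the account-level Lambda limits for a region. The sketch below assumes a client that behaves like `boto3.client("lambda")` and its `get_account_settings` call; the function name `lambda_concurrency_headroom` is hypothetical, and the numbers it reports are limits, not live usage.

```python
from typing import Any, Dict


def lambda_concurrency_headroom(lambda_client: Any) -> Dict[str, int]:
    """Summarize the account-level Lambda concurrency limits for one region.

    lambda_client is expected to behave like boto3.client("lambda").
    """
    limits = lambda_client.get_account_settings()["AccountLimit"]
    total = limits["ConcurrentExecutions"]
    unreserved = limits["UnreservedConcurrentExecutions"]
    return {
        "total": total,
        # Concurrency set aside for specific functions; others cannot consume it.
        "reserved": total - unreserved,
        # Shared pool that any function in the account and region can use up.
        "unreserved": unreserved,
    }
```

A large unreserved pool with no reservations for critical functions is the situation the note above warns about: any function in the account can draw from that shared pool.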

As we don’t have an identity in the customer’s AWS Account, we need to apply an Assume Role policy to an IAM Role so we can do the discovery work. 

This is the Assume Role policy we use:

{
  "Version": "2008-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::230888199284:role/orvn-00g1khued7w30nmYU0h8-cross-account"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "f48ecc21-0a24-4a3e-98f9-64d399929c65"
        }
      }
    },
    {
      "Effect": "Allow",
      "Principal": {
        "Service": ""
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

First, we allow `arn:aws:iam::230888199284:role/orvn-00g1khued7w30nmYU0h8-cross-account` to Assume Role. This is important because we are limiting access to a specific IAM Role. That IAM Role lives in Open Raven’s Account, identified by the account ID 230888199284. Each customer is assigned a unique IAM Role to ensure separation. We also separate each user into their own AWS VPC as an additional safeguard. Lastly, we use an ExternalId to protect against the confused deputy problem.

In addition to Open Raven using the role, we allow the Lambda functions we create to use this role, which can be seen by granting the Lambda service the ability to assume the role.
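As an illustration of the two statements described above, the following Python sketch assembles a trust policy of the same shape. The function name `build_trust_policy`, the placeholder ARN, and the placeholder external ID are hypothetical. The service principal `lambda.amazonaws.com` used in the example is an assumption on my part (it is the standard principal for the Lambda service), not a value taken from the policy shown above.

```python
import json
from typing import Any, Dict


def build_trust_policy(
    principal_role_arn: str, external_id: str, service_principal: str
) -> Dict[str, Any]:
    """Build an assume-role (trust) policy with two statements."""
    return {
        "Version": "2008-10-17",
        "Statement": [
            {
                # Only this one role may assume, and only with the agreed external ID.
                "Effect": "Allow",
                "Principal": {"AWS": principal_role_arn},
                "Action": "sts:AssumeRole",
                "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
            },
            {
                # Lets the named AWS service assume the role as well.
                "Effect": "Allow",
                "Principal": {"Service": service_principal},
                "Action": "sts:AssumeRole",
            },
        ],
    }


if __name__ == "__main__":
    policy = build_trust_policy(
        "arn:aws:iam::111122223333:role/example-cross-account",  # placeholder ARN
        "example-external-id",  # placeholder external ID
        "lambda.amazonaws.com",  # assumed service principal
    )
    print(json.dumps(policy, indent=2))
```

Generating the document from a function rather than by hand makes it harder to forget the `Condition` block, which is the part that guards against the confused deputy problem.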

AWS Orgs vs AWS Accounts

AWS supports Organizations, a feature that allows you to manage accounts centrally as a group, including their billing and security policies. AWS orgs are especially useful for the security practitioner. If you have access to an AWS environment at the org level, then any new accounts that are created are automatically added to the org. The result is that you are aware that new accounts exist, thereby avoiding the “shadow account” problem. If you choose, you can also automatically add the IAM role to any new accounts by policy so that they have the same coverage as other accounts in the org.

When you add an account you are able to query the account properties to determine if it is a master account.

final Organization org = client.describeOrganization().organization();

You can also determine the current account type to resolve if it is a member of an org and if so what the org is. 

final Organization org = client.describeOrganization().organization();
// The organization ID is "o-" followed by 10 to 32 lower-case letters or digits; the ARN embeds it.
final String orgArn = org.arn();
// Compare strings with equals(); == would compare object references.
if (myAccountId.equals(org.masterAccountId())) {
    // Then this account is the organization's management (master) account
}
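The same check can be sketched in Python. This assumes a client that behaves like `boto3.client("organizations")`, whose `describe_organization` call raises `AWSOrganizationsNotInUseException` when the account does not belong to an org. The function name `describe_account_role` and the role labels it returns are hypothetical.

```python
from typing import Any, Dict


def describe_account_role(org_client: Any, my_account_id: str) -> Dict[str, Any]:
    """Classify an account as the org's management account, a member, or standalone.

    org_client is expected to behave like boto3.client("organizations").
    """
    try:
        org = org_client.describe_organization()["Organization"]
    except org_client.exceptions.AWSOrganizationsNotInUseException:
        # The account is not part of any organization.
        return {"role": "standalone", "organization_id": None}

    is_management = my_account_id == org["MasterAccountId"]
    return {
        "role": "management" if is_management else "member",
        "organization_id": org["Id"],
    }
```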

At Open Raven we have strong guidance based on security and operational best practices to create the roles yourself and hand us back the ARN. 

Now that you have access to the AWS accounts, you can list the AWS S3 buckets in each account.

var client = S3Client.builder().credentialsProvider(credentialsProvider).region(Region.US_EAST_1).build();
var bucketNames = client.listBuckets().buckets().stream().map(Bucket::name).collect(Collectors.toList());
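Repeating that listing for every account gives you an inventory. Here is a minimal Python sketch, assuming one S3 client per account that behaves like `boto3.client("s3")` and was created with credentials for that account. The function name `list_buckets_by_account` is hypothetical, and the sketch does not handle paginated responses, which accounts with very large bucket counts may require.

```python
from typing import Any, Dict, List, Mapping


def list_buckets_by_account(s3_clients: Mapping[str, Any]) -> Dict[str, List[str]]:
    """Map each account ID to the sorted names of the buckets it owns.

    s3_clients maps an account ID to a client that behaves like
    boto3.client("s3") and holds credentials for that account.
    """
    inventory: Dict[str, List[str]] = {}
    for account_id, client in s3_clients.items():
        response = client.list_buckets()
        inventory[account_id] = sorted(b["Name"] for b in response["Buckets"])
    return inventory
```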

Now you know what buckets you have, what next?

OK, so you now have a list of the thousands and thousands of buckets (yes, literally hundreds of thousands in some environments) that your developers and cloud team have spun up over the years. What do you do next? Scan them all right away? Not quite; you need a solid game plan for a large-scale environment. The reality is that given the number of buckets, the number of files they contain, and the sheer amount of data in those files, you will need to prioritize in order to obtain fast results you can readily work with.

Above I described our use of AWS Lambda functions for DMAP that identifies non-native data stores on generic compute. We’ve also built our data scanning technology harnessing AWS Lambda so we can elastically scale up as much as AWS can handle. Even with a specialized architecture purpose built for data lake analysis, there aren’t enough Lambda functions available in most AWS org quotas to handle the load all at once. Even if the resources are available, the cost involved is unlikely to make sense. 

Instead, we recommend you build a profile of what you care about which typically includes thinking through the following:

  • Target buckets in accounts and regions of interest 
  • Target buckets with security configurations of interest 
  • Choose what type of data you want to scan for 
  • Target buckets with files of interest 

Remember AWS S3 buckets don’t belong to a VPC, but instead belong to an account. 
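A profile like the one outlined above can be thought of as a set of filters applied to the bucket inventory. The Python sketch below is one possible shape for it. The `BucketInfo` and `ScanProfile` types, their field names, and `select_buckets` are all hypothetical and do not describe how Open Raven's profiles are implemented.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class BucketInfo:
    name: str
    account_id: str
    region: str
    is_public: bool
    is_encrypted: bool


@dataclass(frozen=True)
class ScanProfile:
    # An empty set means "no restriction" for that dimension.
    account_ids: FrozenSet[str] = frozenset()
    regions: FrozenSet[str] = frozenset()
    only_public: bool = False
    only_unencrypted: bool = False

    def matches(self, bucket: BucketInfo) -> bool:
        """Return True when the bucket satisfies every active filter."""
        if self.account_ids and bucket.account_id not in self.account_ids:
            return False
        if self.regions and bucket.region not in self.regions:
            return False
        if self.only_public and not bucket.is_public:
            return False
        if self.only_unencrypted and bucket.is_encrypted:
            return False
        return True


def select_buckets(profile: ScanProfile, buckets: Iterable[BucketInfo]) -> List[BucketInfo]:
    """Keep only the buckets the profile targets, preserving input order."""
    return [bucket for bucket in buckets if profile.matches(bucket)]
```

Note that the account ID is a field on the bucket itself, which reflects the point above that buckets belong to an account rather than a VPC.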

We have just released a streamlined interface for building these profiles and there is a blog describing it here

When thinking about targeting buckets in accounts and regions of interest, you should consider regions that have specific regulatory requirements. AWS documents the available regions, and the API lets you resolve the account and region associated with each bucket.
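Resolving a bucket's region can be sketched as follows, assuming a client that behaves like `boto3.client("s3")` and its `get_bucket_location` call. The function name `bucket_region` is hypothetical.

```python
from typing import Any


def bucket_region(s3_client: Any, bucket_name: str) -> str:
    """Return the region name for a bucket.

    s3_client is expected to behave like boto3.client("s3").
    """
    response = s3_client.get_bucket_location(Bucket=bucket_name)
    location = response["LocationConstraint"]
    if not location:
        # S3 reports buckets in us-east-1 with a null location constraint.
        return "us-east-1"
    if location == "EU":
        # Legacy value that S3 still returns for some older eu-west-1 buckets.
        return "eu-west-1"
    return location
```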

When targeting buckets with security configurations of interest, you can pick from the obvious attributes such as buckets that are open to the internet, those that are acting as web servers, and those that are unencrypted. You can also filter by any other attribute that can apply to an S3 bucket. If you are interested in exploring what's available, you can use Boto3, the AWS SDK for Python, and look at S3 specifically here. This simple code snippet will show you the ACL for a specific bucket as an example.

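import boto3

s3 = boto3.resource('s3')
# 'bucket_name' is a placeholder for the bucket you want to inspect
bucket_acl = s3.BucketAcl('bucket_name')
# grants lists each grantee and the permission it holds
print(bucket_acl.grants)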
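Once you have the grants from an ACL, a small filter can flag the ones that apply to everyone. This is a sketch under the assumption that grants use the dictionary layout Boto3 returns (a `Grantee` with an optional `URI`, plus a `Permission`). The function name `public_grants` is hypothetical, and ACLs are only one of several ways a bucket can end up exposed, so treat an empty result as one signal rather than proof that a bucket is private.

```python
from typing import Any, Dict, Iterable, List

# Predefined S3 groups that make a grant effectively public.
PUBLIC_GROUP_URIS = {
    "http://acs.amazonaws.com/groups/global/AllUsers",
    "http://acs.amazonaws.com/groups/global/AuthenticatedUsers",
}


def public_grants(grants: Iterable[Dict[str, Any]]) -> List[str]:
    """Return the permissions that an ACL grants to public groups."""
    return [
        grant["Permission"]
        for grant in grants
        if grant.get("Grantee", {}).get("URI") in PUBLIC_GROUP_URIS
    ]
```

With Boto3 installed, `public_grants(s3.BucketAcl('bucket_name').grants)` would apply the filter to a real bucket.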

When I originally planned this series, I decided to make this post about both how to gain access to AWS and how to access the objects / files inside of AWS S3. File access is both hard and interesting because you need to understand why listing objects through AWS APIs is not sufficient, how Apache Parquet and columnar file formats work, and the practicalities of opening and scanning multi-terabyte files. Given the importance of this and how core it is to data classification, I am going to break it out into a separate post and publish that as the next blog in the series.

As a footnote, we are building a free, open-source CSPM (Cloud Security Posture Management) tool called Magpie that can both discover your AWS environment and check it for common security misconfigurations. If you want to join the fun, sign up for our Slack channel and either follow along or join the development community alongside our full-time developers working on the effort. Much of the discovery work I am talking about here is implemented in the Magpie project. Learn more at
