Finding & Eliminating Sensitive Data in Logs
While mistakes certainly happen, remembering to protect sensitive data in data stores is typically not that hard. We often have internal processes and checks throughout our pipeline that prevent such obvious exposures. We have audits and assessments from 3rd parties that can flag them. And so on. But what about sensitive data in log files? That’s another story entirely.
Look no further than social media giant Twitter, which wrote the following after an incident in which passwords were written to a log before being hashed:
“When you set a password for your Twitter account, we use technology that masks it so no one at the company can see it. We recently identified a bug that stored passwords unmasked in an internal log.”
How about Facebook? Nearly the identical problem:
“As part of a routine security review in January, we found that some user passwords were being stored in a readable format within our internal data storage systems,”
Pedro Canahuati, Facebook’s Vice President of Engineering, Security, and Privacy, wrote in a statement.
“Our login systems are designed to mask passwords using techniques that make them unreadable.”
The Facebook incident was then followed by a similar incident at GitHub. These companies have large, mature, and well-funded security organizations. Why is it so easy for even them to suffer exposure incidents through their log files?
How it happens
While there are many ways in which sensitive data (e.g., personal data, health data, credentials, etc.) can end up exposed in log files, there are a few root cause issues at the heart of the matter.
Rate of change - Modern software development practices result in incredibly dynamic environments. They are built using agile methodologies that lean towards weekly or biweekly sprints. They use a continuous integration / continuous deployment approach that results in regular updates in production. They’re composed of many microservices which are all changing according to the needs of whatever is being delivered during the current sprint. When it all works properly, the efficiency and speed at which you can operate is breathtaking. But it’s also very easy to make mistakes, such as failing to remove debug statements that place sensitive data in logs.
3rd party services - Few, if any, modern applications are built without the benefit of 3rd party services. It’s common to use them for everything from authentication to handling payments and user telemetry. If our own services and applications are changing rapidly, the same is also true for the services we use from others. This effectively serves as a force multiplier for the rate of change, driving up the probability of an all too easy mistake.
Wide variety of sources and types - In order to wrangle all of the different sources for centralized control and analysis, we lean on purpose-built solutions and pay a hefty sum for them. These services are only as good as the sources feeding them, and keeping that feed current while the sources constantly shift is a challenge. Even if we assume all of the logs are in the same place, we are left to sort through the various types of logs, many of which are some form of unstructured text. From protobuf to JSON, getting all of the logs together is simply the start. Making sense of it, and especially finding leaked data, is another task entirely and rarely at the top of anyone’s priority list.
There are a number of other factors that come into play, but the final one is simply this: given limited time, resources and attention, we focus on the things that seem to matter the most. In this case, we give our attention to making sure our data stores are properly configured and the sensitive data at the heart of our applications and infrastructure is secured. It’s rational, yet leaves us exposed to all too easy mistakes that leak the very same sensitive data through our logs.
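One way to reduce the blast radius of a forgotten debug statement is to scrub log messages before they are written. The following is a minimal sketch in Python, assuming the standard logging module is in use. The pattern list, the redact helper, and the RedactingFilter name are hypothetical and only illustrate the idea; the pattern catches simple key=value and key: value pairs, not quoted JSON fields.

```python
import logging
import re

# Hypothetical patterns for illustration; a real deployment needs a broader, tested set.
SENSITIVE_PATTERNS = [
    # key=value or key: value pairs for common secret-bearing field names
    re.compile(r"(?i)\b(password|passwd|token|secret|api_key)\b(\s*[=:]\s*)(\S+)"),
]


def redact(text: str) -> str:
    """Replace the value part of any matching key/value pair with a placeholder."""
    for pattern in SENSITIVE_PATTERNS:
        text = pattern.sub(r"\1\2[REDACTED]", text)
    return text


class RedactingFilter(logging.Filter):
    """Logging filter that scrubs the fully formatted message before it is emitted."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Merge msg and args first so secrets passed as arguments are also scrubbed.
        record.msg = redact(record.getMessage())
        record.args = ()
        return True  # never drop the record, only rewrite it


logger = logging.getLogger("app")
logger.addFilter(RedactingFilter())
```

A filter attached to a logger only sees records logged directly on that logger, so attaching the same filter to each handler is usually the broader safety net. Redaction of this kind complements scanning the logs themselves; it does not replace it, because it only catches the patterns someone thought to write down.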
Why it matters
Security incidents resulting from data leaked through logs are common enough to earn their own CWE (Common Weakness Enumeration) number under the name of CWE-532: Insertion of Sensitive Information into Log File. So what can happen if you suffer a log-based data leak?
Article 32 of GDPR covers “authentication and the control of logging” and has been used in the past to levy hefty fines on organizations that leak sensitive data through logs. In fact, the very first GDPR fine assessed in the Netherlands was EUR 460,000 for the Dutch Haga Hospital’s data exposure. While GDPR is wide reaching, the newly enacted CPRA in California promises to cover many of the remaining organizations that may not be concerned about GDPR. Industry-specific regulations, such as those for financial services and healthcare, also cover sensitive data that may find its way into logs.
The time spent responding to a data leak cannot be recovered. A couple of days is common, and more is not unusual depending on the severity of the incident. It is unpleasant, high-stress work that commonly involves a number of people across an organization, including legal, compliance and PR.
No one plans to have an incident. They typically happen at the worst possible moments and push work into the evenings and weekends. Other commitments are dropped as all attention is given to critical response tasks. Your reward for the clean up? Doing the work that was previously dropped at double pace to make up for lost time.
PR / Embarrassment
Reporting an exposure incident is embarrassing. Impact ranges from uncomfortable conversations with a handful of customers and partners to full-blown public apologies to the masses, such as the ones Facebook and Twitter penned in the stories from the introduction. The loss of credibility that follows is hard to measure, but it’s tangible.
How the Engineering Team at a Mobile App Company Responds to Leaked Data in Logs
A mobile app that generates roughly 250,000 logs per second was leaking sensitive data, such as passwords, usernames, authentication tokens and internal secrets, in log files. The data was in different formats (plain text, JSON, unstructured, etc.) and came from a multitude of sources, which made it difficult to triage and clean up when a leak was discovered. Mobile logging systems in particular are difficult to manage, as there are multiple phone clients (iOS, Android, Windows), and with each new release, developers change the structure of the log files. However, users do not always update their phones, so their devices continue to produce logs in old formats.
In one incident, the mobile application inadvertently began logging user passwords from one of its mobile clients. In order to respond to the incident, the team had to parse through all the mobile logs to identify which mobile app version the issue was coming from. Given the many versions and clients they were running, the clean-up task was extensive and tedious, hijacking resources from the engineering team and setting back other high priority work.
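The triage step described above, working out which app version a leak comes from when logs arrive in mixed formats, can be sketched roughly as follows. This is an illustration under stated assumptions, not the team's actual tooling: the field names (app_version, password, auth_token), the regular expressions, and the sample lines are all made up for the example, and only top-level JSON keys are inspected.

```python
import json
import re
from collections import Counter

# Hypothetical field names and patterns, chosen for illustration only.
SENSITIVE_KEYS = {"password", "passwd", "auth_token", "secret"}
TEXT_PATTERN = re.compile(r"(?i)\b(password|passwd|auth_token|secret)\s*[=:]\s*\S+")
VERSION_PATTERN = re.compile(r"\bapp_version=(\S+)")


def find_leak(line: str):
    """Return (app_version, leaked_field) if the line appears to leak a secret, else None."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        record = None

    # Structured JSON log line: look for sensitive top-level keys.
    if isinstance(record, dict):
        leaked = SENSITIVE_KEYS.intersection(key.lower() for key in record)
        if leaked:
            return str(record.get("app_version", "unknown")), sorted(leaked)[0]
        return None

    # Plain-text log line: fall back to a key=value pattern match.
    match = TEXT_PATTERN.search(line)
    if match:
        version = VERSION_PATTERN.search(line)
        return (version.group(1) if version else "unknown"), match.group(1).lower()
    return None


def leaks_by_version(lines):
    """Count suspected leaks per app version across an iterable of log lines."""
    counts = Counter()
    for line in lines:
        hit = find_leak(line)
        if hit:
            counts[hit[0]] += 1
    return counts


# Made-up sample lines in two formats.
sample = [
    "2021-03-01 12:00:01 INFO app_version=4.2.0 platform=android login ok user=alice",
    "2021-03-01 12:00:02 DEBUG app_version=4.2.1 platform=ios login password=hunter2",
    '{"app_version": "4.2.1", "platform": "ios", "event": "login", "password": "hunter2"}',
    '{"app_version": "4.1.9", "platform": "android", "event": "sync"}',
]
print(leaks_by_version(sample))
```

Grouping hits by version points the clean up at a specific release instead of the whole fleet. Every additional log format needs its own branch in a script like this, which is a large part of why doing the job by hand is so tedious.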
Finding and eliminating sensitive data in logs
You can automatically detect and prevent sensitive data exposure through logs with Open Raven. Here’s how it works:
Select the location where the logs are being stored. This can be one or many different data stores.
Select a data collection for developer secrets, personal data, health data, or other classification types to be included.
Configure where alerts should be sent automatically: via email, Slack, AWS EventBridge, or a webhook integration. Or enable Jira integration to create tickets with full violation and data details.
When sensitive data is found in your logs, detailed alerts make it easy to locate and resolve the problem before it becomes an incident.
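To make the alerting step concrete, here is a generic sketch of posting a finding to a webhook endpoint. It is not Open Raven's implementation or API; the build_alert and send_alert functions and their parameters are hypothetical. The payload uses a single "text" field, which is the simple message shape Slack incoming webhooks accept; other receivers will expect their own schema.

```python
import json
import urllib.request


def build_alert(source: str, data_class: str, count: int) -> dict:
    """Build a small JSON-serializable alert describing a finding (hypothetical shape)."""
    return {
        "text": (
            f"Sensitive data found in logs: {count} match(es) "
            f"of class '{data_class}' in {source}"
        )
    }


def send_alert(webhook_url: str, alert: dict, timeout: float = 5.0) -> int:
    """POST the alert as JSON to the given webhook URL and return the HTTP status code."""
    body = json.dumps(alert).encode("utf-8")
    request = urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Note that the alert carries the location, data class, and count of the finding but not the matched value itself, so the alert channel does not become a second copy of the leak.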