What is Data Control?

Soumyadeb Mitra

Founder and CEO of RudderStack

December 31, 2020

The concept of data control has become a big topic over the last few years, especially as it relates to customer data. It’s increasingly discussed by companies, often at the C-suite and board levels, and the issue is top of mind for consumers. Data control is such an important topic that some companies have built their entire value proposition on top of it. But what exactly does it mean to have control over your data?

According to Merriam-Webster, to have control means to exercise restraining or directing influence over, or to have power over. So, how does this apply to data? In this post, we argue that there are three aspects to having power or authority over data:

Data access aperture
Data security control
Data privacy control

To illustrate our point, let’s consider a concrete example. Imagine a company stores its customer data, specifically the clickstream data from its app, in four different locations:

Google Analytics
Snowflake
AWS S3
Their own data center

Below, we’ll illustrate how each of these storage locations provides different levels of access, security, and privacy controls.

Data access aperture

Data access aperture defines the various methods you can use to access and work with your data.

Consider the first three examples of data storage above: Google Analytics, Snowflake, and S3. In all of these cases, you store the data with a cloud provider, but the level of control you have over how you interact with the data varies significantly.

The data stored in Google Analytics is only accessible through the dashboards that Google provides. So, you are out of luck if you want something outside of the reporting they offer you. You won’t have access to something you might typically achieve with a few SQL commands on your warehouse (unless you want to pay Google a ridiculous amount of money for a dump of the raw data).

When you store the data in a Snowflake warehouse, you can interact with it via complex SQL. As a result, you can leverage the scalable computing power of a modern warehouse for advanced analysis. On the other hand, running a Spark job against that data is not a good fit (more accurately, it would be relatively inefficient and costly to achieve via Snowflake).

In this example, S3 provides the widest aperture of access. Not only can you connect S3 to BI tools like Tableau or Looker and build analyses in SQL, but you can also load the data into Spark or your applications.

This example also makes it clear that exposure of the data to other users and systems is, on some level, a function of access aperture: the wider the aperture, the higher the functionality around exposing the data.
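As a rough illustration of a wide access aperture, the sketch below takes the same raw events and reaches them through two different paths: a SQL query and plain application code. It uses Python's built-in sqlite3 module as a local stand-in for a SQL engine, and the event records and field names are hypothetical, not taken from any real pipeline.

```python
import json
import sqlite3
from collections import Counter

# Hypothetical raw clickstream events, as they might sit in object storage
# as newline-delimited JSON. Field names are illustrative assumptions.
RAW_EVENTS = """\
{"user_id": "u1", "event": "page_view", "path": "/pricing"}
{"user_id": "u1", "event": "signup", "path": "/signup"}
{"user_id": "u2", "event": "page_view", "path": "/pricing"}
{"user_id": "u3", "event": "page_view", "path": "/docs"}
"""

events = [json.loads(line) for line in RAW_EVENTS.splitlines()]

# Access path 1: load the raw events into a SQL engine and query them.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id TEXT, event TEXT, path TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (:user_id, :event, :path)", events
)
views_by_path = dict(
    conn.execute(
        "SELECT path, COUNT(*) FROM events "
        "WHERE event = 'page_view' GROUP BY path"
    ).fetchall()
)

# Access path 2: process the same raw events directly in application code.
events_per_user = Counter(e["user_id"] for e in events)
```

With only a fixed set of dashboards, neither of these paths would be available; with raw data in hand, both are a few lines of code.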

Data access aperture also determines how much control you have over data portability. Data portability simply means how easy (or difficult) it is to take your data elsewhere, say, to a different storage or SaaS system. In our example, S3 provides the highest portability support. You can easily move data from S3 to Google Cloud, for example; all you have to do is pay the network transfer cost. At the other end of the spectrum, Google Analytics does not provide direct access to the raw event stream data, so this data is not portable.

Data security control

Data security control means having visibility and control over all access to the data.

Security is a crucial aspect of data control, as evidenced by repeated headlines about data breaches in the past few years.

By storing the data with any cloud provider, whether it’s a SaaS application like Google Analytics or an infrastructure service like S3, you give up some security control. You’re limited to the access control policies supported by the tools housing your data. For example, Google Analytics only supports user-based access, while S3 supports both user-based and Identity and Access Management (IAM) based access. Furthermore, when you trust a cloud provider with your data, you’re also trusting them to implement their policies effectively.
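To give a sense of what IAM-based access looks like in practice, here is a minimal sketch that builds a read-only policy document in the standard AWS IAM JSON policy format. The bucket name and prefix are hypothetical placeholders, and a real policy would likely need more statements (for example, permission to list the bucket).

```python
import json

# Hypothetical bucket and key prefix; substitute your own.
BUCKET = "example-clickstream-bucket"
PREFIX = "events/"

# A policy that allows reading objects under one prefix and nothing else.
read_only_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ReadClickstreamObjects",
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": f"arn:aws:s3:::{BUCKET}/{PREFIX}*",
        }
    ],
}

# Serialize to the JSON text that would be attached to a user or role.
policy_json = json.dumps(read_only_policy, indent=2)
print(policy_json)
```

The point is the granularity: access can be scoped to a specific action on a specific set of objects, which a purely user-based model does not offer.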

On the other hand, when you store data in your own data center you can set up arbitrary security control policies. But remember, how effectively you can enforce these controls depends on the sophistication and bandwidth of your IT security teams.

It’s important to carefully consider what data you entrust to third-party SaaS vendors and to do your research to understand their security practices, especially when it comes to sensitive data and PII. The more comprehensive and critical the data is, the more stringent the security practices should be.

Data security control needs vary by industry and company. For example, a consumer gaming startup doesn’t collect much PII on its users and likely has a small user base, meaning it can remain secure without significant effort. On the other hand, a big bank or a healthcare company might not even be comfortable storing data with a cloud provider like AWS.

Data privacy control

The relationship between aperture and exposure necessitates the third aspect of data control: data privacy control. Specifically, this relates to what you use the data for. While there is a close relationship between data privacy control and data security control, there is a clear distinction. Security control manages who has access to the data. Privacy control considers what the data is used for and whether or not that usage meets the end user’s expectation of privacy and/or complies with legal standards such as GDPR or CCPA.

Let’s go back to our example from above. Google Analytics, Snowflake, and S3 are all cloud providers. This means someone inside those companies technically has access to your raw data, but what they can potentially do (or are already doing) with that data varies.

For example, at least some employees at Google can likely look at the same charts and reports you have set up in your Google Analytics account (the system automatically generates them from the clickstream data). More likely than not, Google uses that data to create a user profile and uses it for marketing purposes. The fact that you anonymize the data before sending it to Google doesn’t help much if the end user is logged into some Google property like Gmail. Google can still target the users with cookies. Cookie targeting is being addressed by many platforms, but it will continue in some form.

In the case of Snowflake and S3, their employees probably have some level of access to your raw data, but for them to make sense of clickstream data, there would need to be some reverse engineering to understand event semantics, etc. Furthermore, they don’t have any way of tying data across customers and creating a user profile the way Google can. Also, in S3, you can store the data encrypted, which would make it even harder for someone inside of AWS to reverse engineer: they would need to understand your app, get the keys, and then make sense of the data.

This spectrum highlights the challenge of balancing aperture and exposure. It’s clear that from a vendor standpoint, S3 provides far more data privacy control than Google Analytics. The practical challenge for every business is that the tooling teams across the organization need to do their jobs requires access to data but, as we’ve seen, the level of data control afforded by these tools varies significantly. Luckily, there are an increasing number of options for ‘fully owned’ analytics and data tooling.

Balancing data control requirements

When companies consider data control as a topic, they tend to focus on one or two of the aspects mentioned above, depending on their industry, stage, and even company culture.

At a bank, the security control aspect is paramount. At a fast-growing consumer startup with a sophisticated data science team, achieving more systems exposure is critical.

Data privacy control has become an exciting topic over the last few years. While heavily influenced by industry regulations, there is significant growth in the number of developers passionate about full protection of customer data, refusing to send any data to vendors like Google. This awareness also comes from company culture: several of our customers at RudderStack have strict requirements that customer data stay within their VPCs.

There is no prescription for data control that fits every company. Differing needs, beliefs, and regulations dictate placing differing levels of emphasis on each of the three aspects of data control. It’s the job of data leadership (and the data engineers who work with them) to navigate and manage this complexity.

We built RudderStack to help simplify data control

RudderStack’s foundational product design decisions focus on privacy and security to help companies take back control of their data. Our flexible, HIPAA-compliant tool provides data pipelines for data collection from every data source and data activation in every tool. Features like our data governance API and data transformations help you collect and activate data while meeting data management objectives, so you can confidently control your data across all three aspects of data control.