The Complete Customer Data Stack: Data Collection (Part 1)

Kostas Pardalis

Head of Developer Experience at RudderStack

February 22, 2021

The Importance of Categories

Even the best possible data stack is completely useless without data. For this reason, the first problem we always face when building data infrastructure is what data we are going to be collecting, from where, and how we should do it.

Of course, there are other things to keep in mind while figuring out the data we will be working with, for example, what kind of delivery semantics we need or how we will process the data later on.

In the end, the data we will be working with and the infrastructure we will be using are just different sides of the same coin. For this reason, we should always try to have a holistic view of both the data and the infrastructure.

But let’s start with the basics and build a practical taxonomy of the different data that can cover most use cases for most companies.

The data we will be working with can be categorized into a small set of categories, as we will see. Depending on the category, we need different infrastructure, and we can support different business objectives. This post will cover the first major category, event data, and how to collect it. In part two, we’ll cover relational data and note a few other commonly used sources of data.

Event Data

In this category, we are dealing mainly with clickstream data. We need the right infrastructure to capture, route, and deliver the data to various destinations. In most cases, this data represents some behavior, most commonly customer behavior.

Characteristics of event data:

  • Event data is rarely updated.
  • It usually arrives in high volume and at high velocity.
  • Relatively low dimensionality.
  • It shares many characteristics with time-series data.

Event data is not strictly immutable. It can change, but mainly by adding dimensions to the data or correcting something that went wrong during the capturing process. Such changes are rare, though, especially compared with the rows of an OLTP database.

This data also arrives in high volume and at high velocity. It’s not uncommon for even moderately large companies to handle billions of events per month, especially in the B2C space.

The dimensionality of event data is also low. Most of this data can be represented in a database using one table, and the number of attributes is in most cases in the tens.

Finally, time is an important dimension of this data; that’s why it shares many characteristics with time-series data. In fact, event data is multidimensional time-series data.


How to Collect Event Data

When it comes to event data, it’s all about streaming. You need infrastructure that can deal with streams of data in a reliable and scalable way.

You need reliability and high availability because it is easy to lose data while capturing it. Networks tend to break, especially when you are collecting data from mobile devices. The infrastructure must be built for reliability and high availability end to end, from the point where the data is captured to the point where it is delivered to its final destination.

The good thing with event data, though, is that a small data loss is not going to kill you; customer behavioral data is not P&L report entries.

Capture

Working with event data starts at the point where it is captured. This usually happens through an SDK, which can be incorporated into your web app, mobile app, or website. You then use the SDK to generate an event every time an action happens and to enrich that event with useful metadata about what triggered it.
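To make the capture step concrete, here is a minimal sketch of what an SDK might do when an action happens. The function name `build_event` and the field names (`event_id`, `properties`, `context`) are assumptions made for illustration, not the API of any particular SDK.

```python
import uuid
from datetime import datetime, timezone


def build_event(name, user_id, properties=None, context=None):
    """Return a dict describing one user action, enriched with metadata."""
    return {
        # A unique id lets downstream systems deduplicate retried events.
        "event_id": str(uuid.uuid4()),
        "event": name,
        "user_id": user_id,
        # Time is a key dimension of event data, so record it at capture.
        "timestamp": datetime.now(timezone.utc).isoformat(),
        # What the action was about (hypothetical, caller-defined fields).
        "properties": dict(properties or {}),
        # Where the action happened (hypothetical, caller-defined fields).
        "context": dict(context or {}),
    }


event = build_event(
    "Product Viewed",
    "user-123",
    properties={"product_id": "sku-42"},
    context={"platform": "web"},
)
```

The result is a flat, low-dimensional record with a timestamp, which matches the characteristics of event data described earlier.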

Deliver

After the SDK captures the event, it pushes the event into the data infrastructure. The SDK needs to handle any failures while respecting the host's resources; e.g., you don’t want the SDK to exhaust the CPU or memory of a mobile device. Keep in mind that guaranteeing zero data loss is impossible. For example, a mobile app may have events waiting to be pushed while the network is down, and the user may kill the app before they can be sent.
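The trade-off between handling failures and respecting the host's resources can be sketched with a bounded in-memory buffer. This is an illustration under stated assumptions, not how any specific SDK is implemented: `BufferedSender` and the `send_batch` callable are hypothetical names, and `send_batch` is assumed to raise an exception when delivery fails.

```python
from collections import deque


class BufferedSender:
    """Buffers events in memory and flushes them in batches.

    `send_batch` is a hypothetical callable supplied by the caller; it is
    assumed to raise an exception when delivery fails (e.g. network down).
    """

    def __init__(self, send_batch, max_buffer=1000, batch_size=20):
        self.send_batch = send_batch
        self.batch_size = batch_size
        # A bounded deque caps memory use: once full, appending a new
        # event silently drops the oldest one.
        self.buffer = deque(maxlen=max_buffer)

    def track(self, event):
        self.buffer.append(event)

    def flush(self):
        """Try to deliver buffered events; return how many were sent."""
        sent = 0
        while self.buffer:
            size = min(self.batch_size, len(self.buffer))
            batch = [self.buffer[i] for i in range(size)]
            try:
                self.send_batch(batch)
            except Exception:
                break  # keep the events and retry on the next flush
            # Remove events only after delivery succeeded.
            for _ in batch:
                self.buffer.popleft()
            sent += len(batch)
        return sent
```

The buffer accepts a small, bounded amount of data loss (the oldest events are dropped when it overflows) in exchange for a hard limit on memory. A production SDK would likely also add retry backoff and persist the buffer to disk, which this sketch omits.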

Queue

After the data is delivered to the data infrastructure, it’s usually received into a queuing system that can maintain the ordering of the events. This is important, especially if the events drive some behavior. This is why systems like Kafka have very well-defined ordering semantics; Kafka, for example, guarantees the order of records within a partition.
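One common way to get useful ordering at scale is to route every event with the same key (for example, the same user) to the same partition and preserve order within each partition. The toy in-memory class below sketches that idea only; `PartitionedQueue` is a hypothetical name, and the code does not use or reproduce the API of Kafka or any other queuing system.

```python
import zlib
from collections import defaultdict


class PartitionedQueue:
    """Toy append-only queue that preserves order per partition."""

    def __init__(self, num_partitions=4):
        self.num_partitions = num_partitions
        self.partitions = defaultdict(list)

    def partition_for(self, key):
        # crc32 is deterministic across processes, unlike Python's
        # built-in hash() for strings, so a key always maps to the
        # same partition.
        return zlib.crc32(key.encode("utf-8")) % self.num_partitions

    def append(self, key, event):
        """Append an event and return its (partition, offset)."""
        partition = self.partition_for(key)
        self.partitions[partition].append(event)
        return partition, len(self.partitions[partition]) - 1

    def read(self, partition, offset=0):
        """Return events of one partition, in arrival order, from offset."""
        return self.partitions[partition][offset:]


queue = PartitionedQueue(num_partitions=4)
for seq in range(3):
    queue.append("user-a", {"user": "user-a", "seq": seq})
    queue.append("user-b", {"user": "user-b", "seq": seq})
```

Events for `user-a` are always read back in the order they were appended, even though events from different users are interleaved and no ordering is promised across partitions.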

Process

After the event data is queued in such a system, it can be processed in a streaming fashion. For this, it’s ideal to have a queuing system that can sustain high throughput, as these applications are usually real-time.
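As a rough illustration of processing in a streaming fashion, the generator below counts events per fixed time window as they arrive, without loading the whole stream into memory. It is a sketch under simplifying assumptions: events are dicts with a hypothetical `ts` field (epoch seconds) and an `event` name, and they arrive in timestamp order.

```python
from collections import Counter


def tumbling_counts(events, window_seconds=60):
    """Yield (window_start, Counter of event names) per time window.

    Assumes `events` is an iterable ordered by the 'ts' field.
    """
    current_start = None
    counts = Counter()
    for event in events:
        # Align the timestamp to the start of its window.
        start = event["ts"] - (event["ts"] % window_seconds)
        if current_start is not None and start != current_start:
            # The window closed: emit its result and start a new one.
            yield current_start, counts
            counts = Counter()
        current_start = start
        counts[event["event"]] += 1
    if current_start is not None:
        yield current_start, counts  # emit the last, still-open window


stream = [
    {"ts": 0, "event": "click"},
    {"ts": 10, "event": "view"},
    {"ts": 59, "event": "click"},
    {"ts": 60, "event": "click"},
    {"ts": 130, "event": "view"},
]
windows = list(tumbling_counts(stream, window_seconds=60))
```

Real stream processors also have to deal with late and out-of-order events, which this sketch deliberately ignores.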

Store

Finally, the data can also be stored for batch processing or archiving.
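A simple way to picture the storage step is appending events to files partitioned by date, so batch jobs can later read one day at a time. The layout (`date=YYYY-MM-DD/events.jsonl`), the function name `archive_events`, and the `ts` field (epoch seconds) are all assumptions chosen for this sketch.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def archive_events(events, root):
    """Append events to <root>/date=YYYY-MM-DD/events.jsonl.

    Returns the sorted list of files that were written to.
    """
    written = set()
    for event in events:
        day = datetime.fromtimestamp(event["ts"], tz=timezone.utc)
        path = Path(root) / f"date={day:%Y-%m-%d}" / "events.jsonl"
        path.parent.mkdir(parents=True, exist_ok=True)
        # Append-only writes fit event data, which is rarely updated.
        with path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(event) + "\n")
        written.add(path)
    return sorted(written)
```

Calling `archive_events(batch, "/tmp/events")` would create one directory per day under that path. In practice, a columnar format and object storage are common choices for this layer, but the partition-by-time idea is the same.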

Part Two

Now that you've learned about collecting event data, check out Data Collection (Part 2), where we cover the other big, commonly used category of data, relational data, and briefly touch on two other common sources of data.
