A Big (Data) Mess Indeed...

On the importance of having a data strategy

Nati Berkover

Apr 28, 2023

What’s your company's data strategy? Does it have one?

Mmm..

Wait… what does it even mean to have a ‘data strategy’?

And what does it have to do with product management? Isn’t it something the engineering team should be worried about?

Fair questions. I’ll answer all of them.

In this post I’ll explain what I mean by ‘data strategy’, why it’s important in general and why you should care about it as a product manager.

Let’s roll…

What is data strategy

Amazon defines data strategy as:

“A long-term plan that defines the technology, processes, people, and rules required to manage an organization's information assets.” (reference here)

That’s a bit too comprehensive for what I’d like to focus on. I suggest a narrower approach that works well for small and medium companies and focuses on cloud infrastructure. I’d therefore define data strategy as your policies for storing and retrieving data generated directly by the usage of your product.

Hence, I’m excluding financial data such as invoices and other expenses, as well as IT systems information such as the organizational email, Slack and so forth.

Your product, if it’s already out there (and sometimes even before) - is constantly generating data. Examples may include:

Analytical events recorded by your product
User actions
Users’ data and personally identifiable information (PII)
Backend services actions
Algorithm decisions
Processing of raw data for model generation
Processing of raw data to transform it into aggregated data
Events generated by various sensors

And plenty of other types of data.

A data strategy in our context means that you have guidelines and/or policies in place for how to store, process and retain each type of data that your product produces over time.

Why is it important to have a data strategy in place?

I’m not sure whether you’ve read my post about cloud infrastructure costs. Whether you did or not - I recommend (re)visiting it here.

In that post I discuss what happens to your cloud costs if you don’t maintain your cloud infrastructure properly.

This is just one aspect.

You need to understand that without a proper data strategy - each developer may:

Create data sets as they see fit and duplicate data which already exists - which costs money, but may also cause data inconsistencies
Store the data in the wrong type of storage - which will make retrieving this data slow and cause spikes in costs
Forget to assign a proper retention policy to the data - which will cause it to be retained forever, even if unnecessary, OR the exact opposite - delete it too early, while it’s still needed

Additionally, if your architect and/or your DevOps team didn’t set up proper backups and a proper disaster recovery plan - then you may not be able to reconstruct data that was accidentally lost.

Hence, without a proper data strategy this is probably what you’re going to experience as a product manager:

1. Spikes in infrastructure costs which are not proportional to the business growth (and hence you’ll start losing money).
2. Features take much more time to get to production because your engineering team is struggling to gather all the relevant data points their code needs for producing meaningful and timely results.
3. Delivered features are underperforming in terms of execution time or memory consumption.
4. Data inconsistencies which result in data queries that return inaccurate or wrong results (internally, or even to your end users).
5. Loss of data (I don’t need to explain why this is bad, right?)

I’m not sure if cost reduction is part of your goals or not - so you may or may not care about #1. I am, however, certain that you care about #2, which may become a real velocity killer and lengthen your time to market.

If high runtime performance is also an important aspect of your product - then you probably care about #3 as well.

Last - #4 means you cannot trust your organization’s data. Your users may not trust the reports you send them, or won’t even bother to sign in to the reporting dashboard that you provide them with. Hence, you lose potential stickiness (aside from all the other problems).

Sadly, I see this too often in companies I mentor. There is no proper data strategy and the problems above surface one way or another, causing great havoc, loss of money and loss of competitive edge.

How can you help your engineering team avoid this?

Understanding data

Since we’re deep into the age of ‘big data’ - you no longer have the luxury of playing dumb when it comes to your product’s data and how to handle it.

Hence, the first step in solving the problems above is to understand the basics of cloud data infrastructure and the various data storage types.

Now, there are plenty of ways to view and analyze the world of data. If you simply type ‘data types’ in Google you will get results about variables in programming languages, or storage types such as flash drives, NAS servers, etc…

We don’t care about all of those in the context of this post… and potentially ever.

So instead of wasting time with Google or ChatGPT - I will try to get you up to speed with the following data layer abstractions, based on what I learned myself over several years of working with cloud infrastructure.

In short, I like to divide the world of big data into the following types of data:

Raw data - data which is stored unprocessed together with its metadata. For example: a user clicks on ‘produce a report’ on your dashboard. You keep this click event together with the values of the dashboard filters that were associated with this click + the timestamp of this click. Another example: a user visit to a website. You keep this raw event as an HTTP request, together with the metadata of this request (user-agent, IP address, etc.) + the timestamp of when you received it.
Enriched data - a combination of one or more raw data points which have been through some sort of transformation. For example, taking the ‘clicks’ data points from the previous example and adding to them the geo information from the matching HTTP request event and the visitor ID from the ‘session’ object. Why would I do that? Because it makes sense to someone in my business.
Aggregated data - data which holds accumulation (summaries) of other data points. For example - total visits to a specific domain per day. Another example - Total clicks on each menu item per day.
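
To make the three types more concrete, here is a minimal Python sketch of the ‘clicks’ example above. Everything in it is an assumption made for illustration: the class names, the field names and the two lookup dictionaries (`country_by_request`, `visitor_by_session`) are hypothetical and don’t come from any specific platform.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class RawClick:
    """Raw layer: the click as it happened, with its metadata and timestamp."""
    request_id: str
    session_id: str
    menu_item: str
    timestamp: datetime


@dataclass(frozen=True)
class EnrichedClick:
    """Enriched layer: a raw click combined with other raw data points."""
    menu_item: str
    timestamp: datetime
    country: str
    visitor_id: str


def enrich(click, country_by_request, visitor_by_session):
    """Add geo info from the HTTP request and the visitor ID from the session."""
    return EnrichedClick(
        menu_item=click.menu_item,
        timestamp=click.timestamp,
        country=country_by_request[click.request_id],
        visitor_id=visitor_by_session[click.session_id],
    )


def clicks_per_item_per_day(clicks):
    """Aggregated layer: total clicks on each menu item per day."""
    return Counter((c.timestamp.date(), c.menu_item) for c in clicks)


# Hypothetical sample data, made up for this sketch.
country_by_request = {"r1": "DE", "r2": "US", "r3": "DE"}
visitor_by_session = {"s1": "visitor-1", "s2": "visitor-2"}
raw_clicks = [
    RawClick("r1", "s1", "reports", datetime(2023, 4, 27, 9, 0)),
    RawClick("r2", "s2", "reports", datetime(2023, 4, 27, 9, 5)),
    RawClick("r3", "s1", "settings", datetime(2023, 4, 28, 10, 0)),
]

enriched_clicks = [
    enrich(c, country_by_request, visitor_by_session) for c in raw_clicks
]
daily_totals = clicks_per_item_per_day(enriched_clicks)
```

Note how each layer is derived from the one before it: the raw clicks are never modified, the enriched records only add context, and the aggregation only counts.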

Databricks calls these the ‘bronze’, ‘silver’ and ‘gold’ layers. You can read about it here.

Raw data characteristics and how to handle

Raw data, because it’s unprocessed, maintains a single source of truth of ‘what happened’ in reality, over time. You may decide to add/remove metadata attributes to each raw data point, but I always recommend maintaining the timestamp of when it took place (assuming it’s a real life event).

Raw data shouldn’t be retrieved often, at least not the old records. The raw data is used to create the enriched and aggregated data layers, and once those are created - it should mostly be left alone (again, the old records at least).

My recommendation is to keep raw events forever, since epoch (since you started collecting them). There are several reasons for that:

Storage is relatively cheap nowadays
If you keep it forever you can theoretically restore any lost dataset (enriched or aggregated) by merely ‘replaying’ the raw data in a proper manner.
Since most of the records shouldn’t be retrieved often - it shouldn’t incur a serious runtime overhead on your system.
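
Here is a rough sketch of what ‘replaying’ could look like, assuming the lost dataset is a ‘total visits per domain per day’ table and the raw visits are available as simple records. The record layout and the function name `replay_daily_visits` are hypothetical; a real replay would read from your raw storage and run your actual ETL code.

```python
from collections import Counter
from datetime import datetime

# Hypothetical raw layer: one record per visit, kept since day one.
raw_visits = [
    {"domain": "example.com", "timestamp": datetime(2023, 4, 27, 8, 0)},
    {"domain": "example.com", "timestamp": datetime(2023, 4, 27, 17, 30)},
    {"domain": "example.org", "timestamp": datetime(2023, 4, 28, 9, 0)},
]


def replay_daily_visits(raw_events):
    """Rebuild the 'visits per domain per day' table from the raw events."""
    totals = Counter()
    # Replay in the order the events happened, oldest first.
    for event in sorted(raw_events, key=lambda e: e["timestamp"]):
        totals[(event["domain"], event["timestamp"].date())] += 1
    return dict(totals)


# The aggregated table as it was originally built.
daily_visits = replay_daily_visits(raw_visits)

# Simulate losing the aggregated table, then restore it from raw data only.
daily_visits_after_loss = {}
restored = replay_daily_visits(raw_visits)
```

The point of the sketch is that `restored` is computed from `raw_visits` alone, so nothing from the lost table is needed to get it back.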

Just make sure you keep the raw data events in cheap storage such as S3 or something similar.

Of course you need to make sure the raw data tables are constantly being backed up.

Enriched data characteristics and how to handle

Enriched data is built by applying various ETLs (extract, transform, load - you should be familiar with this term) on raw data, and potentially combining several raw data points together.

The goal of enriching data is to serve a specific functionality of your product. As a product manager, you may not be aware of all the enriched tables as some of them may be used only internally by the engineering teams.

What you do need to understand is that if the specific functionality is no longer needed - then the enriched data set may not be needed either. Or in other words - the enriched data tables may be destroyed or reconstructed as the architecture and/or the product itself changes.

Thus, the need for enriched tables should be identified by the engineering teams based on your product requirements and/or the group architect decisions.

You will rarely be involved in those decisions directly as a product manager, and that’s OK.

Aggregated data and how to handle

As a product manager - the aggregated data is probably the layer that is of the most interest to you.

Aggregated data often translates directly to product analytics, and very often the tables will be constructed per your request by the product data analyst (or directly by the engineering team in the absence of one). The business team may also be interested in viewing analytics and business insights. Sometimes they have their own analysts and you won’t be bothered with it, but sometimes such requests will go through you.

When adding new aggregations you will need to define:

How often the data needs to be refreshed. The more ‘current’ you want it - the more cost and processing power it will use. Many aggregations are really fine with being refreshed once every 24 hours.
How far back would you like to be able to ‘travel in time’. E.g. - would you like to view trends older than a year? 6 months? One month? It all depends on the business insights you, your customers or other stakeholders are looking for. Naturally, the more time you want the data to be retained - the more it’s gonna cost you.
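
The two decisions above can be written down as a small spec per aggregation. The following Python sketch is only an illustration under my own assumptions: `AggregationSpec`, `prune` and `refreshes_per_day` are hypothetical names, and the numbers are made up.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class AggregationSpec:
    """What a product manager defines when asking for a new aggregation."""
    name: str
    refresh_every: timedelta  # how 'current' the data needs to be
    retention: timedelta      # how far back you can 'travel in time'


def refreshes_per_day(spec):
    """More refreshes per day means more processing power and cost."""
    return timedelta(days=1) / spec.refresh_every


def prune(rows_by_day, spec, today):
    """Keep only the daily rows that fall inside the retention window."""
    cutoff = today - spec.retention
    return {day: value for day, value in rows_by_day.items() if day >= cutoff}


# Hypothetical spec: refreshed once every 24 hours, kept for about 6 months.
spec = AggregationSpec(
    name="daily_clicks_per_menu_item",
    refresh_every=timedelta(hours=24),
    retention=timedelta(days=180),
)

rows = {
    date(2022, 9, 1): 120,   # older than the retention window
    date(2023, 1, 15): 95,
    date(2023, 4, 27): 143,
}
kept = prune(rows, spec, today=date(2023, 4, 28))
```

Writing the spec down this way makes the cost trade-off explicit: shrinking `refresh_every` or growing `retention` is a product decision, not just an engineering detail.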

Apply best practices

Once you get a better grasp of the data around your products, it’s time to encourage the team to embrace some good practices that will help them avoid the data pitfalls mentioned above.

Here are some of these practices:

Map your data. Ask the team to go over all tables and, for each one, note the following:
1. Used/not being used. If it’s not being used - don’t bother with the rest of the columns.
2. Type: raw, enriched, aggregation
3. Purpose: what data it holds and why
4. PII: True/False whether this table holds PII data - it will help you apply GDPR and other privacy practices much more easily, since you’ll know where the PII is hidden.
Remove all unused tables. To play it safe - simply rename the unused tables and wait a week. See what happens. If, after a week, nothing falls apart - delete them. They cost you money.
Make sure there is a single source of truth for each data entity. For example - make sure the team knows which table holds the customers and that it’s not duplicated. In any company which deals with ‘big data’ there are probably dozens, if not hundreds, of tables that were made over time - focus on the entities that make sense to your business - and check if there is an agreed single source of truth for them. Usually the pitfalls are around the most common entities - revenues, customers and end users. You won’t believe how often it is hard to agree on the single source of truth for those.
Make sure there is an owner for each table. An owner can be a specific person or a team. This person/team should be able to explain the rationale behind this table’s existence and troubleshoot problems in case they happen. They are also accountable for any data inconsistencies with such a table.
Make sure each table has a well-defined retention policy which is enforced. You won’t believe how many times tables hold data indefinitely while nobody needs more than a week back. Make sure there is a process which cleans the tables based on their retention policy.
Make sure the raw data and other key tables are always backed up. Most of the tables in your system can be re-created based on the raw data, so you might not need to pay for their backup. Run a disaster-recovery drill every few months: test a scenario where only the data which was backed up exists and see if the team manages to recreate the rest.
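
The data map and the checks above could be kept as a simple catalog that the team audits from time to time. Here is a minimal sketch, assuming one entry per table. The `TableEntry` fields, the `audit` function and the table names are all hypothetical, and treating ‘keep forever’ as an acceptable retention policy for raw tables is my own assumption based on the recommendation to keep raw events forever.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TableEntry:
    name: str
    in_use: bool
    layer: str                     # "raw", "enriched" or "aggregated"
    purpose: str
    has_pii: bool
    owner: Optional[str]           # a person or a team; None if nobody owns it
    retention_days: Optional[int]  # None means "keep forever"
    backed_up: bool


def audit(catalog):
    """Return a list of findings that break the practices listed above."""
    findings = []
    for table in catalog:
        if not table.in_use:
            # Unused tables need no further checks, only removal.
            findings.append(f"{table.name}: unused, candidate for removal")
            continue
        if table.owner is None:
            findings.append(f"{table.name}: no owner")
        if table.layer != "raw" and table.retention_days is None:
            findings.append(f"{table.name}: no retention policy")
        if table.layer == "raw" and not table.backed_up:
            findings.append(f"{table.name}: raw data is not backed up")
    return findings


# Hypothetical catalog, made up for this sketch.
catalog = [
    TableEntry("raw_clicks", True, "raw", "click events", False,
               "platform team", None, False),
    TableEntry("clicks_enriched", True, "enriched", "clicks with geo",
               True, None, 90, False),
    TableEntry("daily_clicks", True, "aggregated", "clicks per day",
               False, "analytics team", None, False),
    TableEntry("tmp_export_2021", False, "enriched", "", False,
               None, None, False),
]

findings = audit(catalog)
pii_tables = [t.name for t in catalog if t.in_use and t.has_pii]
```

With this sample catalog, `findings` flags one problem per table, and `pii_tables` tells you where to look first when a privacy request comes in.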

Now, most of the bullets here are not your direct responsibility. But from my experience, if you don’t ‘push’ the team to enforce them - they won’t happen. Hence, you should both ask the team/group leaders to make sure to apply the above AND give them the time to execute that. It is in your best interest. Believe me.

Make sure there is a data strategy in place

You need to push the team to come up with a data strategy. Essentially it means they need to publish guidelines and policies for all the engineering teams which include the following:

Access to relevant knowledge about the cloud data infrastructure the company is using. This is where team members who have knowledge gaps about this subject can bridge them.
Guidelines on when to add new tables and in which data storage (does it fit in the data lake? The data warehouse? Some other database?). Any engineer who needs to deal with a new data type will follow these guidelines to make sure the data is added properly and without duplications.
Guidelines for determining and enforcing a retention policy on each table. Sometimes this requires consulting with you.
Other guidelines which are specific to each and every storage type.

Again, it’s not your task to come up with such a plan. For most product managers this is too technical anyway. However, I do recommend that you require the engineering team to have such a strategy and to make sure it’s part of the onboarding of each engineer.

To summarize

Having a data strategy is important. Without it - your product and your execution plan will suffer. I hope that I managed to explain why.

Making sure there is a solid strategy in place is tricky, because it’s not for you to write it down and enforce it. This is one of the times when you need to help the engineering team even though it’s not obvious what it has to do with product management.

I’d also recommend getting the team a consultant who will help them with the process if you don’t believe there is enough knowledge within the team.

That is it for today!

If you have any ideas for posts or topics you want me to cover - PLEASE LET ME KNOW.

Last - If you found this post/series useful - feel free to ‘like’ it. If you think others can benefit from it - feel free to share it with them.

Thank you, and until next time :-)

Producteneurship
