Goodbye Data Swamp, Hello Data Lake


Data lakes have been around for eight years, and while the original concept was promising, many lakes have turned into data swamps. This has prompted criticism of the concept itself. However, as we demonstrate below, the fault lies not with the concept but with how the lakes are maintained and cared for. Our customers have found that a data lake can be the centre of their business intelligence, provided they take the steps described in this article.

What’s in the lake?

The concept of a data lake was introduced in 2010 as an alternative to more traditional data warehouses and data marts. The aim was to enable more uses of the data by storing everything arriving from disparate sources in its raw form.

A data lake is a massive and easily accessible centralised repository of both structured and unstructured data.

James Dixon, who coined the term, explained that a data lake is “a large body of water in a natural state. The contents of the data lake stream in from a source to fill the lake and various users of the lake can come to examine, dive in, or take samples.”

Data lakes have some common characteristics:

Data is stored in its natural format
The data is typically sub-transactional or non-transactional
There are some known questions to ask of the data
There are many unknown questions that will arise in the future
The data may be relevant to different business users

In its 2014 report, PwC said that data lakes could put an end to data silos and “can help to resolve the nagging problem of accessibility and data integration”.

THE BENEFITS ARE CLEAR

Companies enjoy several advantages when using data lakes.

Firstly, the storage is extremely cheap, so companies can afford to store everything and then some.

Secondly, it opens up new opportunities for data scientists. Data lakes operate on the principle of “schema on read” (as opposed to the “schema on write” of data warehouses), so structure is applied only when the data is queried and an end user can ask any question of it.

Thirdly, it is future-proof: because the data is kept in its raw form, it can serve questions that have not been asked yet as well as today’s.
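To make the second point concrete, here is a minimal Python sketch of applying a schema at read time. The event records, the field names and the `read_with_schema` helper are all invented for illustration; they are not taken from any particular product.

```python
import json

# Raw events stored exactly as they arrived; fields differ between sources.
# (All records here are made-up sample data.)
RAW_EVENTS = [
    '{"user": "u1", "type": "call", "duration_s": 120}',
    '{"user": "u2", "type": "web", "url": "https://example.com/news"}',
    '{"user": "u1", "type": "web", "url": "https://example.com/sport"}',
]


def read_with_schema(raw_lines, fields):
    """Apply a schema at read time: keep only the requested fields,
    using None where a record does not carry them."""
    for line in raw_lines:
        record = json.loads(line)
        yield {name: record.get(name) for name in fields}


# Two different questions, two different schemas, the same raw data.
call_view = list(read_with_schema(RAW_EVENTS, ["user", "duration_s"]))
web_view = list(read_with_schema(RAW_EVENTS, ["user", "url"]))
```

Nothing about the stored data had to change to support the second view, which is the property that a fixed warehouse schema gives up.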

An overarching benefit, however, is being able to capture a 360-degree view of a customer by using this data to create a human-like consumer profile. This is likely a goal of any company striving to provide competitive customer service and top-end personalisation. We call this a Customer First approach, and it is what we deliver to our clients.

A LAKE OR A SWAMP?

The most common criticism of data lakes is that a lake can become a swamp when too much data is dumped into it without any understanding of how to handle it.

Data swamps are best described by Sean Martin from Cambridge Semantics: “We see customers creating big data graveyards, dumping everything into HDFS and hoping to do something with it down the road. But then they just lose track of what’s there”.

Other problems with data lakes include data quality, security and access control. In some organisations, data can be put into a data lake by almost anyone. Additionally, unregulated access to data in a data lake means that data privacy may be compromised, exposing an organisation to high reputational and financial risks in a landscape of tightening regulatory control.

If you’ve decided to take the data lake route, data governance, configuration, and accessibility for data science are all important considerations.

OUR SOLUTION

At Intent HQ we ensure your data is enriched and accessible so that it becomes the main source of your customer intelligence. This gives you a near real-time 360-degree view of each customer while keeping you fully compliant with GDPR, the California Consumer Privacy Act (CCPA), and other legislative or policy requirements.

How do we take care of our data lake and make sure it’s easy to swim in? Here are six rules we employ:

1. WE ENRICH DATA AND MAKE IT ACCESSIBLE TO THE ENTIRE ORGANISATION

We make it easy to “enrich” the data in your lake with ML/AI techniques and data from other sources. Intent HQ has enrichments available for calculating interests, location mapping, churn prediction, and much more. You can also easily plug in your own enrichments.

Most importantly, though, access to these enrichments and their configuration is available across your organisation, bringing transparency and agility to the process of creating a human-like profile.
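As a rough illustration of what a pluggable enrichment could look like, the Python sketch below registers enrichment functions by name and applies them to a profile. The registry, the `enrichment` decorator, the `top_interest` function and the `category_visits` field are all hypothetical; this is not the Intent HQ API.

```python
# Hypothetical registry mapping an enrichment name to the function that computes it.
ENRICHMENTS = {}


def enrichment(name):
    """Decorator that registers a function as a named enrichment."""
    def register(func):
        ENRICHMENTS[name] = func
        return func
    return register


@enrichment("top_interest")
def top_interest(profile):
    # Assumed input field: a mapping of content category -> visit count.
    counts = profile.get("category_visits", {})
    return max(counts, key=counts.get) if counts else None


def enrich(profile, names):
    """Return a copy of the profile with the requested enrichments added."""
    enriched = dict(profile)
    for name in names:
        enriched[name] = ENRICHMENTS[name](profile)
    return enriched


profile = enrich({"category_visits": {"sport": 5, "news": 2}}, ["top_interest"])
```

Adding a new enrichment in this sketch means writing one decorated function; the code that applies enrichments does not change.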

2. WE ENABLE THE CREATION OF MULTIPLE PROFILES FROM THE SAME SOURCE OF TRUTH

Do you have a fragmented view of your users? Do you have different departments working on different user profiles? No problem. Our pipeline centralises all the information about a user, creating a single source of truth. This allows you to configure as many different profiles for different applications as you need without compromise:

They are all based on the same single source of truth
They are all automatically kept up-to-date
You can easily restrict access to different profiles based on department, role, etc.
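The three properties above can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the user records, the profile definitions and the `build_profile` helper are invented, and a real system would sit on top of proper storage and authentication.

```python
# Single source of truth: one record per user (all values invented for illustration).
SOURCE_OF_TRUTH = {
    "u1": {"interests": ["football", "travel"], "home_city": "Leeds", "churn_risk": 0.72},
    "u2": {"interests": ["cooking"], "home_city": "Bath", "churn_risk": 0.10},
}

# Hypothetical profile definitions: which fields each profile exposes, and to whom.
PROFILES = {
    "marketing": {"fields": ["interests", "home_city"], "departments": {"marketing"}},
    "retention": {"fields": ["churn_risk"], "departments": {"retention", "care"}},
}


def build_profile(profile_name, user_id, department):
    """Derive a profile from the shared record, enforcing department access."""
    spec = PROFILES[profile_name]
    if department not in spec["departments"]:
        raise PermissionError(f"{department} may not read the {profile_name} profile")
    record = SOURCE_OF_TRUTH[user_id]
    return {field: record[field] for field in spec["fields"]}


retention_view = build_profile("retention", "u1", "care")
marketing_view = build_profile("marketing", "u1", "marketing")
```

Because each profile is derived from the shared record when it is requested, an update to the source is reflected in every profile without a separate synchronisation step.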

3. WE ORGANISE DATA BY USER

We split up data by user while still leaving it unstructured in its original form. As a result, it keeps all the benefits of a data lake, and it also lets us a) scale linearly and b) keep the data optimally organised for data science.

This approach becomes particularly crucial under GDPR requirements. If you need to access, edit or delete any customer’s data, you can do it instantly with this approach. If data is not arranged by user, it may take hours if not days to collect all of a user’s data, with a high chance of not finding everything and therefore running the risk of non-compliance.
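The following Python sketch shows why a per-user layout makes access and erasure requests cheap. It is a toy in-memory model under our own assumptions; the `UserPartitionedStore` class and its method names are hypothetical and stand in for whatever storage layer is actually used.

```python
from collections import defaultdict


class UserPartitionedStore:
    """Toy in-memory store that keeps raw records grouped by user ID."""

    def __init__(self):
        self._by_user = defaultdict(list)

    def append(self, user_id, raw_record):
        # Records are stored untouched; only their location depends on the user.
        self._by_user[user_id].append(raw_record)

    def export_user(self, user_id):
        """Access request: everything held about one user, in one lookup."""
        return list(self._by_user.get(user_id, []))

    def erase_user(self, user_id):
        """Erasure request: drop the user's partition, return the record count."""
        return len(self._by_user.pop(user_id, []))


store = UserPartitionedStore()
store.append("u1", '{"type": "web", "url": "https://example.com"}')
store.append("u1", '{"type": "call", "duration_s": 42}')
store.append("u2", '{"type": "web", "url": "https://example.org"}')

exported = store.export_user("u1")
erased = store.erase_user("u1")
```

Both operations touch a single partition, so their cost does not grow with the number of other users in the lake.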

4. WE CONFIGURE INSTEAD OF CODING

Our experience has taught us that everybody in the organisation should have easy access to the data. That’s why we’ve developed a configurable pipeline that democratises access to the data. Anybody can write a data processing pipeline and produce a view of the data. Our platform empowers your organisation by:

Reducing the feedback loop: different departments can experiment with the data
Reducing the time to market: write a pipeline and have the data available in less than 90 minutes
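To show the general idea of configuring rather than coding, here is a minimal Python sketch in which a pipeline is plain data and a small interpreter runs it. The configuration format, the `filter` and `select` operations and the sample events are assumptions made for this example, not the format of any real product.

```python
# A pipeline described as data rather than code (hypothetical format).
PIPELINE_CONFIG = {
    "source": "events",
    "steps": [
        {"op": "filter", "field": "type", "equals": "web"},
        {"op": "select", "fields": ["user", "url"]},
    ],
}


def _filter(rows, field, equals):
    return [row for row in rows if row.get(field) == equals]


def _select(rows, fields):
    return [{name: row.get(name) for name in fields} for row in rows]


# The only code a platform team maintains: the set of supported operations.
OPERATIONS = {"filter": _filter, "select": _select}


def run_pipeline(config, sources):
    """Run each configured step in order over the named source."""
    rows = sources[config["source"]]
    for step in config["steps"]:
        params = {key: value for key, value in step.items() if key != "op"}
        rows = OPERATIONS[step["op"]](rows, **params)
    return rows


# Made-up sample data.
SOURCES = {
    "events": [
        {"user": "u1", "type": "web", "url": "https://example.com/a"},
        {"user": "u2", "type": "call", "duration_s": 30},
    ]
}
view = run_pipeline(PIPELINE_CONFIG, SOURCES)
```

A new view of the data is then a new configuration entry, which someone outside the engineering team can write and review.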

This is vital for companies such as telcos that need to industrialise quickly and respond to the market in a matter of minutes.

5. WE ENSURE ACCESSIBILITY FOR ML, DL, AND AI TECHNOLOGIES

With other architectures, building per-user aggregations or running time-based analysis for each user is difficult. With our technology, it’s easy:

There is a user-centric view of the data, always up-to-date, combined with an API that supports a rich set of transformations
The natural structure of the data encourages efficient feature extraction
Our tooling is designed to fit neatly within a data scientist’s workflow, not replace it. This lets them keep using the tools they already know, effectively and efficiently
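As an illustration of why a user-centric layout helps with feature extraction, the Python sketch below computes a few per-user features, including a simple time-of-day feature. The events, the feature names and the 18:00 cut-off for "evening" are all invented for the example.

```python
from datetime import datetime

# Events already grouped by user (made-up sample data).
EVENTS_BY_USER = {
    "u1": [
        {"ts": "2020-01-06T09:15:00", "type": "web"},
        {"ts": "2020-01-06T21:40:00", "type": "call"},
        {"ts": "2020-01-07T22:05:00", "type": "web"},
    ],
    "u2": [{"ts": "2020-01-06T12:00:00", "type": "web"}],
}


def extract_features(events):
    """Per-user aggregation: counts plus a simple time-of-day feature."""
    if not events:
        return {"event_count": 0, "web_share": 0.0, "evening_share": 0.0}
    hours = [datetime.fromisoformat(event["ts"]).hour for event in events]
    evening = sum(1 for hour in hours if hour >= 18)  # assumed cut-off: 18:00
    web = sum(1 for event in events if event["type"] == "web")
    return {
        "event_count": len(events),
        "web_share": web / len(events),
        "evening_share": evening / len(events),
    }


# One function call per user; no cross-user join or shuffle is required.
features = {user: extract_features(events) for user, events in EVENTS_BY_USER.items()}
```

The resulting dictionary of feature rows can be handed to whichever modelling library the data scientist already uses.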

The modular nature of our pipeline makes it easy for you to insert the tools, models, and languages you already know and use.

6. WE MAINTAIN PRIVACY BY DESIGN

Privacy is crucial, and we take measures to protect it:

Not just anyone can put data into the lake. Instead, what goes in, where, and when is agreed in advance
Only certain people have access to the raw data lake
Anyone accessing the data from the configurable pipeline sees a specific subset of it, such as a limited view of users and feeds
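The measures above can be sketched as an ingestion allow-list combined with role-scoped views. This Python example is a simplified model under our own assumptions: the feed names, the roles and the `ingest` and `scoped_view` helpers are hypothetical.

```python
# Hypothetical ingestion agreement: only these feeds may be written to the lake.
ALLOWED_FEEDS = {"web", "billing"}

# Hypothetical per-role scopes: which feeds and which users a role may see.
ROLE_SCOPES = {
    "analyst": {"feeds": {"web"}, "users": {"u1", "u2"}},
    "billing_ops": {"feeds": {"billing"}, "users": {"u1"}},
}


def ingest(lake, feed, record):
    """Reject any feed that has not been agreed in advance."""
    if feed not in ALLOWED_FEEDS:
        raise ValueError(f"feed {feed!r} has no ingestion agreement")
    lake.append({"feed": feed, **record})


def scoped_view(lake, role):
    """Return only the records a role is entitled to see."""
    scope = ROLE_SCOPES[role]
    return [
        record
        for record in lake
        if record["feed"] in scope["feeds"] and record["user"] in scope["users"]
    ]


lake = []
ingest(lake, "web", {"user": "u1", "url": "https://example.com"})
ingest(lake, "billing", {"user": "u2", "amount": 20})

analyst_view = scoped_view(lake, "analyst")
billing_view = scoped_view(lake, "billing_ops")
```

In this sketch the raw list `lake` is never handed to consumers directly; they only ever receive the output of `scoped_view`.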

To conclude, a data lake can be an unparalleled source of customer intelligence in your business. Our approach makes your data work for you and your customers, powering your way to becoming a Customer First organisation that understands and treats its consumers as unique individuals.
