Goodbye Data Swamp, Hello Data Lake
Data lakes have been around for eight years, and while the original concept was promising, many of the lakes have unfortunately turned into data swamps. This has prompted criticism of the concept itself. However, as we will demonstrate below, the fault lies not with the concept but with how the lakes are maintained and cared for. Our customers have found that data lakes can be the centre of their business intelligence when they take the steps we describe in this article.
WHAT’S IN THE LAKE?
The concept of a data lake was introduced in 2010 as an alternative to more traditional data warehouses and data marts. The mission of a data lake was to enable more uses of the data by storing all data that was coming from disparate sources in its raw form.
A data lake is a massive and easily accessible centralised repository of both structured and unstructured data.
James Dixon, who coined the term, explained that a data lake is “a large body of water in a natural state. The contents of the data lake stream in from a source to fill the lake and various users of the lake can come to examine, dive in, or take samples.”
Data lakes have some common characteristics:
- Data is stored in its natural format
- The data is typically sub-transactional or non-transactional
- There are some known questions to ask of the data
- There are many unknown questions that will arise in the future
- The data may be relevant to different business users
In its 2014 report, PWC said that data lakes could put an end to data silos and “can help to resolve the nagging problem of accessibility and data integration”.
THE BENEFITS ARE CLEAR
Companies enjoy several advantages when using data lakes.
Firstly, the storage is extremely cheap, which allows them to store everything and then some more.
Secondly, it opens up new opportunities for data scientists. Because data lakes operate on the principle of “schema on read” (as opposed to “schema on write” in data warehouses), the structure is applied only when the data is queried, so an end user can shape the data to fit whatever question they want to ask.
Thirdly, it is future-proof – the data lake efficiently deals with both present and future needs.
An overarching benefit, however, is being able to capture a 360-degree view of a customer by using this data to create a human-like consumer profile. This is likely a goal of any company striving to provide competitive customer service and top-end personalisation. We call this a Customer First approach, and this is what we deliver to our clients.
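To make the schema-on-read principle mentioned above concrete, here is a minimal sketch in Python. It is an illustration only: the sample events, the field names, and the `read_with_schema` helper are all made up for this example and are not part of any particular product.

```python
import json

# Raw events land in the lake untouched, one JSON document per line.
# These records are invented for illustration.
RAW_EVENTS = """\
{"user": "u1", "type": "page_view", "url": "/plans", "ts": "2018-03-01T10:00:00"}
{"user": "u2", "type": "call", "duration_s": 340, "ts": "2018-03-01T10:05:00"}
{"user": "u1", "type": "call", "duration_s": 65, "ts": "2018-03-02T09:30:00"}
"""


def read_with_schema(raw_lines, schema):
    """Apply a schema at read time.

    Keeps only the records that contain every field named in the schema,
    and casts each of those fields with the given type.
    """
    rows = []
    for line in raw_lines.splitlines():
        record = json.loads(line)
        if all(field in record for field in schema):
            rows.append({field: cast(record[field]) for field, cast in schema.items()})
    return rows


# Two different questions, two different schemas, the same raw data.
call_schema = {"user": str, "duration_s": int}
visit_schema = {"user": str, "url": str}

calls = read_with_schema(RAW_EVENTS, call_schema)
visits = read_with_schema(RAW_EVENTS, visit_schema)
print(calls)
print(visits)
```

Nothing about the stored data had to change to answer the second question. In a warehouse, by contrast, the table structure would have been fixed before the first record was loaded.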
A LAKE OR A SWAMP?
The most common criticism of data lakes is that a lake can become a swamp when too much data is dumped into it without any understanding of how to handle it.
Data swamps are best described by Sean Martin from Cambridge Semantics: “We see customers creating big data graveyards, dumping everything into HDFS and hoping to do something with it down the road. But then they just lose track of what’s there”.
Other problems with data lakes include data quality, security and access control. In some organisations, data can be put into a data lake by almost anyone. Additionally, unregulated access to data in a data lake means that data privacy may be compromised, exposing an organisation to high reputational and financial risks in a landscape of tightening regulatory control.
If you’ve decided to take the data lake route, your approach to data governance, configuration, and accessibility for data science are all important considerations.
At Intent HQ we ensure your data is enriched and accessible so that it becomes the main source of your customer intelligence, enabling you to have a near real-time 360-degree view of a customer while empowering your business and staying fully compliant with GDPR, the California Consumer Privacy Act (CCPA), which takes effect in 2020, or other legislative or policy requirements.
How do we take care of our data lake and make sure it’s easy to swim in it? Here are six rules we employ:
1. WE ENRICH DATA AND MAKE IT ACCESSIBLE TO THE ENTIRE ORGANISATION
We make it easy to “enrich” the data in your lake with ML/AI techniques and data from other sources. Intent HQ has enrichments available for calculating interests, location mapping, churn prediction, and much more. You can also easily plug in your own enrichments.
Most importantly, though, access to these enrichments and their configuration is available across your organisation, bringing transparency and agility to the process of creating a human-like profile.
2. WE ENABLE THE CREATION OF MULTIPLE PROFILES FROM THE SAME SOURCE OF TRUTH
Do you have a fragmented view of your users? Do you have different departments working on different user profiles? No problem. Our pipeline centralises all the information about a user, creating a single source of truth. This allows you to configure as many different profiles for different applications as you need without compromise:
- They are all based on the same single source of truth
- They are all automatically kept up-to-date
- You can easily restrict access to different profiles based on department, role, etc.
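The idea behind the three points above can be sketched in a few lines of Python. This is a simplified illustration under our own assumptions, not the actual Intent HQ implementation: the `SOURCE_OF_TRUTH` records, the `PROFILES` definitions, and the `build_profile` function are hypothetical names invented for this example.

```python
# Hypothetical single source of truth: one consolidated record per user.
SOURCE_OF_TRUTH = {
    "u1": {"interests": ["football", "travel"], "churn_risk": 0.72, "home_region": "London"},
    "u2": {"interests": ["cooking"], "churn_risk": 0.10, "home_region": "Leeds"},
}

# Hypothetical profile definitions: the fields each profile exposes
# and the departments allowed to read it.
PROFILES = {
    "marketing": {"fields": ["interests", "home_region"], "departments": {"marketing"}},
    "retention": {"fields": ["churn_risk"], "departments": {"retention", "customer_care"}},
}


def build_profile(profile_name, user_id, department):
    """Derive one profile for one user, enforcing department-level access."""
    definition = PROFILES[profile_name]
    if department not in definition["departments"]:
        raise PermissionError(f"{department} may not read the {profile_name} profile")
    record = SOURCE_OF_TRUTH[user_id]
    return {field: record[field] for field in definition["fields"]}


print(build_profile("marketing", "u1", "marketing"))
print(build_profile("retention", "u1", "customer_care"))
```

Because each profile is derived from the shared record when it is requested, an update to the source is reflected in every profile without any extra synchronisation step.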
3. WE ORGANISE DATA BY USER
We split up data by user while still leaving it unstructured in its original form. This keeps all the benefits of a data lake while also letting us a) scale linearly and b) keep the data optimally organised for data science.
This approach becomes particularly crucial under GDPR requirements. If you need to access, edit or delete any customer’s data, you can do it instantly with this approach. If data is not arranged by user, it may take hours, if not days, to collect all of a user’s data, with a high chance of missing some of it and therefore running the risk of non-compliance.
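A toy example shows why organising by user makes access and erasure requests cheap. The `UserPartitionedStore` class below is a hypothetical in-memory stand-in written for this article. A production system would partition files or tables by user, but the principle is the same.

```python
from collections import defaultdict


class UserPartitionedStore:
    """Toy in-memory store that keeps raw records grouped by user id."""

    def __init__(self):
        self._by_user = defaultdict(list)

    def append(self, user_id, raw_record):
        # Records are stored exactly as received; only the grouping key is imposed.
        self._by_user[user_id].append(raw_record)

    def access(self, user_id):
        """Return a copy of everything held about one user."""
        return list(self._by_user.get(user_id, []))

    def erase(self, user_id):
        """Delete everything held about one user; return the number of records removed."""
        return len(self._by_user.pop(user_id, []))


store = UserPartitionedStore()
store.append("u1", '{"type": "call", "duration_s": 65}')
store.append("u2", "free-text support note")
store.append("u1", '{"type": "page_view", "url": "/plans"}')

print(store.access("u1"))   # everything about u1, found with a single lookup
print(store.erase("u1"))    # number of records removed
print(store.access("u1"))   # nothing left for u1
```

Both operations touch a single partition instead of scanning every feed in the lake, which is what keeps them fast as the volume of data grows.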
4. WE CONFIGURE INSTEAD OF CODING
Our experience has taught us that everybody in the organisation should have easy access to the data. That’s why we’ve developed a configurable pipeline that democratises the access to the data. Anybody can write a data processing pipeline and produce a view of the data. Our platform empowers your organisation by:
- Reducing the feedback loop: different departments can experiment with the data
- Reducing the time to market: write a pipeline and have the data available in less than 90 minutes
This is vital for companies such as telcos, which need to industrialise quickly and respond to the market in a matter of minutes.
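As a rough sketch of what “configuring instead of coding” can look like, the example below describes a pipeline as plain data and runs it with a small generic interpreter. The step names, the `PIPELINE_CONFIG` format, and the `run_pipeline` function are assumptions made for illustration and do not reflect the real configuration format of our platform.

```python
# Hypothetical step library: each step is an ordinary function from records to records.
def keep_type(records, event_type):
    """Keep only the records of one event type."""
    return [r for r in records if r.get("type") == event_type]


def select_fields(records, fields):
    """Project each record onto the listed fields."""
    return [{f: r[f] for f in fields if f in r} for r in records]


STEPS = {"keep_type": keep_type, "select_fields": select_fields}

# The pipeline itself is data. A user edits this list, not the code above.
PIPELINE_CONFIG = [
    {"step": "keep_type", "event_type": "call"},
    {"step": "select_fields", "fields": ["user", "duration_s"]},
]


def run_pipeline(records, config):
    """Run each configured step in order, passing its parameters as keyword arguments."""
    for entry in config:
        params = {key: value for key, value in entry.items() if key != "step"}
        records = STEPS[entry["step"]](records, **params)
    return records


events = [
    {"user": "u1", "type": "page_view", "url": "/plans"},
    {"user": "u2", "type": "call", "duration_s": 340},
]
print(run_pipeline(events, PIPELINE_CONFIG))
```

A department that wants a different view of the data changes the configuration list, which shortens the feedback loop because no new code has to be written, reviewed, or deployed.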
5. WE ENSURE ACCESSIBILITY FOR ML, DL, AND AI TECHNOLOGIES
With other architectures, building per-user aggregations or running time-based analysis for each user is difficult. With our technology, it’s easy:
- There is a user-centric view of the data, always up-to-date, combined with an API that supports a rich set of transformations
- The natural structure of the data encourages efficient feature extraction
- Our tooling is designed to fit neatly within a data scientist’s workflow, not replace it, so they can keep using the tools they already know well
The modular nature of our pipeline makes it easy for you to insert the tools, models, and languages you already know and use.
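When data is already grouped by user, per-user aggregations and time-based features reduce to a simple loop over each user’s events. The sketch below uses only the Python standard library. The `USER_EVENTS` data and the feature names are invented for illustration and say nothing about the API of our platform.

```python
from datetime import datetime
from statistics import mean

# Hypothetical user-centric view: every user's events are already together.
USER_EVENTS = {
    "u1": [
        {"type": "call", "duration_s": 65, "ts": "2018-03-02T09:30:00"},
        {"type": "call", "duration_s": 120, "ts": "2018-03-05T18:00:00"},
    ],
    "u2": [
        {"type": "call", "duration_s": 340, "ts": "2018-03-01T10:05:00"},
    ],
}


def extract_features(events):
    """Compute a few per-user features from that user's own events."""
    calls = [e for e in events if e["type"] == "call"]
    timestamps = sorted(datetime.fromisoformat(e["ts"]) for e in events)
    span_days = (timestamps[-1] - timestamps[0]).days if timestamps else 0
    return {
        "n_calls": len(calls),
        "mean_call_s": mean(e["duration_s"] for e in calls) if calls else 0.0,
        "active_span_days": span_days,
    }


features = {user: extract_features(events) for user, events in USER_EVENTS.items()}
print(features)
```

No joins or cross-feed lookups are needed, and the resulting dictionary can be handed straight to whichever modelling library a data scientist already uses.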
6. WE MAINTAIN PRIVACY BY DESIGN
Privacy is crucial, and we take measures to protect it:
- Not everyone can put data into the lake. Instead, what goes in, where, and when is agreed in advance
- Only certain people have access to the raw data lake
- Anyone accessing the data from the configurable pipeline sees a specific subset of it, such as a limited view of users and feeds
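The controls listed above can be illustrated with a short Python sketch. Everything in it is hypothetical and deliberately simplified: the `INGESTION_AGREEMENTS` set, the `VIEW_POLICY` mapping, and the `GovernedLake` class are names invented for this example.

```python
# Hypothetical agreement on which source system may write which feed.
INGESTION_AGREEMENTS = {("billing_system", "invoices"), ("web_tracker", "page_views")}

# Hypothetical view policy: the feeds and users each role may see.
VIEW_POLICY = {
    "marketing_analyst": {"feeds": {"page_views"}, "users": {"u1"}},
}


class GovernedLake:
    """Toy lake that checks writes against agreements and reads against a policy."""

    def __init__(self):
        self._records = []

    def ingest(self, source, feed, user_id, payload):
        # Writes are rejected unless this source/feed pair was agreed in advance.
        if (source, feed) not in INGESTION_AGREEMENTS:
            raise PermissionError(f"{source} has no agreement to write to {feed}")
        self._records.append({"feed": feed, "user": user_id, "payload": payload})

    def view_for(self, role):
        # Readers never see the raw lake, only the subset their role allows.
        policy = VIEW_POLICY[role]
        return [
            r for r in self._records
            if r["feed"] in policy["feeds"] and r["user"] in policy["users"]
        ]


lake = GovernedLake()
lake.ingest("billing_system", "invoices", "u1", {"amount": 30})
lake.ingest("web_tracker", "page_views", "u1", {"url": "/plans"})
lake.ingest("web_tracker", "page_views", "u2", {"url": "/help"})
print(lake.view_for("marketing_analyst"))
```

The write check and the read filter live in one place, so tightening a policy does not require changes to the pipelines that consume the data.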
To conclude, a data lake can be an unparalleled source of customer intelligence in your business. Our approach makes your data work for you and your customers, powering your way to becoming a Customer First organisation that understands and treats its consumers as unique individuals.
- Bethke, Uli: Are Data Lakes Fake News?, 2017. Source: https://sonra.io/2017/08/08/are-data-lakes-fake-news/
- Campbell, Chris: Top Five Differences Between Data Lakes And Data Warehouses, 2015. Source: https://www.blue-granite.com/blog/bid/402596/top-five-differences-between-data-lakes-and-data-warehouses
- Woods, Dan: Why Data Lakes Are Evil, Forbes, 2016. Source: https://www.forbes.com/sites/danwoods/2016/08/26/why-data-lakes-are-evil/
- Fowler, Martin: Data Lakes, 2015. Source: https://www.martinfowler.com/bliki/DataLake.html
- Dixon, James: Pentaho, Hadoop, and Data Lakes, 2010. Source: https://jamesdixon.wordpress.com/2010/10/14/pentaho-hadoop-and-data-lakes/
- Gartner: Gartner Says Beware Of The Data Lake Fallacy, 2014. Source: https://www.gartner.com/newsroom/id/2809117
- PWC: The Enterprise Data Lake: Better Integration And Deeper Analytics, 2014. Source: https://www.pwc.com/us/en/technology-forecast/2014/cloud-computing/assets/pdf/pwc-technology-forecast-data-lakes.pdf