
How to Best Design a Data Lake Storage?

by Lynn Naing, Managing Consultant – Altis Sydney

As Data Lakes grow in popularity, the most common fundamental question we get from clients is ‘How to best design a Data Lake Storage?’. There is no silver bullet for a ‘best’ design, as different clients have different requirements and use cases. However, there is a good common foundational design that we start from. This design has five core zones and is technology agnostic.

Landing:

This is a transient folder where data lands, separated by Data Source. The purpose of this zone is to move data as quickly as possible from source to the Data Lake and to reduce contention on source systems. Valid data is moved across to the Raw Zone, whereas invalid data is moved to a Bad/Quarantine folder for manual intervention.
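The routing step described above can be sketched in a few lines. This is only an illustration under stated assumptions: the folder names (`raw`, `landing/_quarantine`), the function names and the validity rule (a file is treated as valid if it is readable and non-empty) are hypothetical and not part of the original design, which is technology agnostic.

```python
import shutil
from pathlib import Path


def is_valid(path: Path) -> bool:
    """Hypothetical validity rule: the file can be opened and is not empty."""
    try:
        with path.open("rb") as handle:
            return len(handle.read(1)) == 1
    except OSError:
        return False


def route_landed_file(path: Path, lake_root: Path, source: str) -> Path:
    """Move a landed file to the Raw zone, or to quarantine if it fails the check."""
    if is_valid(path):
        target_dir = lake_root / "raw" / source
    else:
        # Assumed location of the Bad/Quarantine folder inside Landing.
        target_dir = lake_root / "landing" / "_quarantine" / source
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / path.name
    shutil.move(str(path), str(target))
    return target
```

Because the file is moved rather than copied, the Landing folder stays transient: after routing, nothing is left behind for that file.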

Raw:

Valid data is moved to the Raw folder in its native format, ready for ingestion by subsequent processes. Data is categorised into Data Source>Year>Month>Day>Hour. Folder granularity can change depending on the frequency of data transfer and requirements. Categorising data this way also systematically archives it in native format.
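As a rough sketch of the Data Source>Year>Month>Day>Hour layout, the helper below builds a Raw path from an arrival timestamp. The function name `raw_path`, the `granularity` parameter and the zero-padded folder names are assumptions made for illustration; the design itself does not prescribe them.

```python
from datetime import datetime
from pathlib import PurePosixPath

_LEVELS = ["year", "month", "day", "hour"]


def raw_path(source: str, arrived_at: datetime, filename: str,
             granularity: str = "hour") -> PurePosixPath:
    """Build raw/<source>/<year>/<month>/<day>/<hour>/<filename>.

    `granularity` stops the folder hierarchy at the given level, for
    sources that deliver less frequently than hourly.
    """
    if granularity not in _LEVELS:
        raise ValueError(f"granularity must be one of {_LEVELS}")
    parts = {
        "year": f"{arrived_at.year:04d}",
        "month": f"{arrived_at.month:02d}",
        "day": f"{arrived_at.day:02d}",
        "hour": f"{arrived_at.hour:02d}",
    }
    depth = _LEVELS.index(granularity) + 1
    folders = [parts[level] for level in _LEVELS[:depth]]
    return PurePosixPath("raw", source, *folders, filename)
```

For example, a daily feed could call `raw_path("crm", arrived_at, "orders.csv", granularity="day")` and skip the hour folder entirely.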

Standardized:

This is a useful but optional zone. It standardises data into the format best suited to the Curated layer: for example, converting flat files to *.txt, photos to *.jpeg or videos to *.mkv files. There is also an option to perform some standard data cleansing here if the same processes recur across multiple files, such as removing special characters and fixing character encoding. The folder structure is kept the same as Raw, and files are copied and converted into the chosen formats.
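A minimal sketch of the flat-file case follows, assuming the source files arrive in a known legacy encoding (`cp1252` here is only an example) and that "special characters" means non-printable characters. Both function names are hypothetical. The path helper mirrors the Raw folder structure and only swaps the zone name and the file extension.

```python
from pathlib import PurePosixPath


def standardize_text(raw_bytes: bytes, source_encoding: str = "cp1252") -> bytes:
    """Decode from the assumed source encoding, clean, and re-encode as UTF-8."""
    text = raw_bytes.decode(source_encoding, errors="replace")
    text = text.replace("\r\n", "\n")  # normalise Windows line endings
    # Keep printable characters plus newlines and tabs; drop everything else.
    cleaned = "".join(ch for ch in text if ch.isprintable() or ch in "\n\t")
    return cleaned.encode("utf-8")


def standardized_path(raw_file: PurePosixPath, new_suffix: str = ".txt") -> PurePosixPath:
    """Map raw/<...>/<name>.<ext> to standardized/<...>/<name><new_suffix>."""
    parts = raw_file.parts
    if not parts or parts[0] != "raw":
        raise ValueError("expected a path inside the raw zone")
    return PurePosixPath("standardized", *parts[1:]).with_suffix(new_suffix)
```

The Raw copy is left untouched, so the native-format archive remains available if the cleansing rules later change.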

Curated:

In the Curated Zone, data is transformed, cleansed and ready for consumption. Data is categorised into Subject Area>Files. Depending on requirements and the tools selected, data can be partitioned at a specified level such as Subject Area>Files>Year>Month or Subject Area>Files>Region. This zone is used by most data users, including BI developers.
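To illustrate the partitioning option, the sketch below groups in-memory records into Subject Area>Files>partition folders. The function name `partition_records`, the record fields and the idea of representing records as dictionaries are all assumptions for the example; in practice the chosen storage or processing tool would usually handle partitioning itself.

```python
from collections import defaultdict
from pathlib import PurePosixPath
from typing import Any, Dict, Iterable, List, Sequence

Record = Dict[str, Any]


def partition_records(records: Iterable[Record], subject_area: str, dataset: str,
                      keys: Sequence[str]) -> Dict[PurePosixPath, List[Record]]:
    """Group records under curated/<subject_area>/<dataset>/<key values...>."""
    groups: Dict[PurePosixPath, List[Record]] = defaultdict(list)
    for record in records:
        partition = [str(record[key]) for key in keys]
        folder = PurePosixPath("curated", subject_area, dataset, *partition)
        groups[folder].append(record)
    return dict(groups)
```

Passing `keys=["year", "month"]` gives the Year>Month layout, while `keys=["region"]` gives the Region layout, so the same data can be laid out either way depending on how consumers query it.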

Sandbox:

This zone is mainly for Data Scientists and Data Champions who understand both the data and the organisation’s business well. Data may originate from the Curated Zone as well as the Standardized Zone. Data is organised by Project. Data Scientists can further separate data per model within a project, and results can be written back to the Sandbox Zone.

Adjustments to the design are encouraged, depending on an organisation’s data types (structured, semi-structured, unstructured), data usage, data availability and data security requirements. Likewise, the organisation’s use of the Data Lake for Data Warehouse, Applications, Master Data Management and other purposes will prompt further considerations and alterations to the design. To conclude, the five core zones are not one size fits all, but they provide a good foundational design for Data Lake Storage.


2 Responses

  1. Bad/Quarantine folder in Landing Zone? Are you recommending addressing data quality in this early layer? How can you address data quality issues while being agnostic?

    1. Hi Scott,

      This would depend on the overall architecture of the design and your organisation’s requirements.

      In this design, bad/quarantine data is handled early, in Landing, so that issues can be raised with the sources early on and only valid data is kept and categorised accordingly.

      Good point on being agnostic about data quality issues. This is more about data files being corrupted, such as issues associated with writes, reads and transfers.
