In 2018, an estimated 2.5 quintillion bytes or more of data were created each day.¹ This was good news for businesses for several reasons. Perhaps the most important is that the sheer quantity of data we produce has given rise to digital transformation and powerful analytics that drive detailed, focused decisions, business results and predictions.
But there’s a challenging aspect to the massive amount of data we produce: it must be stored in a location where it’s readily accessible.
The traditional storage method has been the data warehouse. For a data warehouse to function properly, data must be classified and formatted carefully before it is loaded into the warehouse, an approach known as “schema on write”.
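As a rough illustration of “schema on write”, the sketch below uses Python’s built-in sqlite3 module as a stand-in for a warehouse. The table name, column names and the write_record helper are hypothetical, chosen only for this example; the point is that a record that does not fit the predefined structure is rejected at the moment it is written.

```python
import sqlite3

# In-memory database standing in for a warehouse (illustrative only).
conn = sqlite3.connect(":memory:")

# The schema is fixed up front, before any data arrives.
conn.execute(
    """
    CREATE TABLE sales (
        order_id INTEGER PRIMARY KEY,
        customer TEXT NOT NULL,
        amount   REAL NOT NULL CHECK (amount >= 0)
    )
    """
)


def write_record(record: dict) -> bool:
    """Insert a record only if it fits the predefined schema."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO sales (order_id, customer, amount) "
                "VALUES (:order_id, :customer, :amount)",
                record,
            )
        return True
    except sqlite3.Error:
        # Missing fields and constraint violations fail at write time.
        return False


accepted = write_record({"order_id": 1, "customer": "Acme", "amount": 10.0})
rejected = write_record({"order_id": 2, "customer": None, "amount": 5.0})
```

Here the first record is stored and the second is refused because it has no customer, so only data that matches the agreed format ever reaches the table.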
Traditional Warehousing: Organizing Data for Efficient Retrieval
Data in a warehouse is organized hierarchically, much like a book that is catalogued in a library and placed on a specific shelf.
Continuing the library analogy, when a query is made for data from the warehouse, the request must be made in specific terms and carefully defined, just like searching for an author or book title amongst thousands on the shelves.
A data warehouse, then, is a storage repository of highly organized data files or folders from which data can be queried and extracted using precise retrieval commands.
Potential Problems with a Traditional Data Warehouse
It’s important to recognize that a data warehouse contains formatted data from many different business sources within an organization. This limits the ability to reorganize the warehouse to suit one business practice without affecting the others.
Another potential problem with a traditional data warehouse is that it cannot hold unstructured or semi-structured data. Because such information is too disparate to fit the warehouse’s format, data from social media, emails, word processing files, PDFs or images cannot be stored there, let alone analyzed.
Yet this type of data is critical to, for example, analyzing consumer patterns and trends and predicting possible shifts in buying practices.
That’s where the data lake takes center stage.
All Data is Welcome
A data lake can hold all types of data: structured, unstructured and semi-structured. The data doesn’t have to be filtered or sorted up front; that happens when the data is accessed, an approach known as “schema on read”.
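A minimal sketch of “schema on read” might look like the following. The raw_lake list, its sample records and the read_orders helper are all invented for illustration: everything is stored exactly as it arrives, and a structure is imposed only when a particular question is asked.

```python
import json

# Raw records land in the "lake" untouched, one string per record.
raw_lake = [
    '{"type": "order", "customer": "Acme", "amount": "19.99"}',
    '{"type": "tweet", "user": "@sam", "text": "Love the new product!"}',
    "free-form email body that is not JSON at all",
    '{"type": "order", "customer": "Globex", "amount": 5}',
]


def read_orders(lines):
    """Apply an 'order' schema at read time; skip anything that does not fit."""
    orders = []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # unstructured text stays in the lake, just not in this view
        if isinstance(record, dict) and record.get("type") == "order":
            orders.append(
                {
                    "customer": str(record["customer"]),
                    "amount": float(record["amount"]),  # types are fixed on read
                }
            )
    return orders


orders = read_orders(raw_lake)
total = sum(order["amount"] for order in orders)
```

The tweet and the email remain in storage for other analyses; a different reader function could apply a different schema to the same raw data.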
The cost of a data lake is greatly reduced by scalable, on-demand storage on a cloud-based platform like Microsoft Azure, which also eliminates costly infrastructure.
When a query demands enormous amounts of different types of data, a data lake stands ready to supply the answers. The data warehouse, by contrast, is slowed by ETL (Extract, Transform, Load) processes that must be defined, built and run before a business can begin to create the queries that get the answers it wants.
In particular, Azure Data Lake Storage Gen2 offers low-cost storage for your centralized repository and works efficiently with both flat and hierarchical namespace file structures.
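To make the ETL (Extract, Transform, Load) work mentioned in this section more concrete, here is a small sketch in plain Python. The sample CSV, the column names and the three function names are assumptions made for this example, not part of any Azure tool; it only shows the kind of pipeline that has to exist before a warehouse can be queried.

```python
import csv
import io

# Hypothetical raw export from a source system (illustrative data only).
RAW_CSV = """customer,amount,currency
 acme ,19.99,usd
globex,not-a-number,usd
Initech,5,USD
"""


def extract(text):
    """Extract: parse raw rows from the source export."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: clean values and drop rows whose amount is not numeric."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue
        clean.append(
            {
                "customer": row["customer"].strip().title(),
                "amount": amount,
                "currency": row["currency"].strip().upper(),
            }
        )
    return clean


def load(rows, table):
    """Load: append the cleaned rows to the warehouse table."""
    table.extend(rows)
    return table


warehouse_sales = load(transform(extract(RAW_CSV)), [])
```

Each of these steps must be designed and maintained for every source feeding the warehouse, which is the up-front effort a data lake postpones until the data is actually read.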
Putting the Cloud to Use
Placing a data lake in the cloud provides a number of advantages. Scaling is easy compared with adding on-site servers to store the huge amounts of data used in business decisions today. Paying only for what you use is far more efficient.
Beyond this, Microsoft’s Azure cloud offers a battery of highly efficient cloud-based analytics tools for working with a data lake. Simply load the raw data into cloud storage and call up the analytics tools as needed, without spinning up an entire cluster.
Another powerful tool is Azure Databricks, a unified analytics platform that links data science, engineering and business operations to provide even faster and more secure handling of data.
Losing the Battle to Unstructured Data
At the end of the day, a data warehouse is useful if formatted data is the only type to be used in decision making and the amount of data being stored isn’t excessive.
But given that we live in an on-demand economy that is generating more diverse data than ever before, a data lake makes sense. Today’s data is where the actual business value lies, especially when you consider that, according to International Data Corporation (IDC), 90% of the world’s available data has been generated in just the past three years and the majority of it is unstructured or semi-structured.²
Have more questions about data storage? Contact the data modernization experts at Hanu today. We’re happy to answer any questions you might have about data lakes and data warehouses. We will provide a free estimate if you want to move your traditional data warehouse to Azure and modernize it.
² https://www.idc.com/getdoc.jsp?containerId=prUS44417618