Using automated, intelligent asset discovery and a metadata-rich catalog to make analytic assets easy to find, govern and use to power AI
Five years ago, building big data architectures was one of the hottest topics on the enterprise IT agenda. The rise
of technologies such as Apache Hadoop and MapReduce, together with the success of big data-centric companies such as Google and Facebook, convinced many organizations that the time was right to take a deep dive into the previously unexplored depths of their data.
To perform that deep dive, it was necessary to build some kind of central store where all the data could reside. While most enterprises already had a central data warehouse to capture and store data from core systems, this was clearly not the right type of repository for many of the new data sets that needed to be analyzed. Data warehouses are built on relational databases, which require data to be organized into highly structured tables. Many of the new data sets that these enterprises wanted to explore—with their massive web scale and variety of structures—were simply not amenable to being restructured into this kind of rigid schema. A more flexible, versatile approach was needed.
The solution was the data lake—a general-purpose data storage environment that could store practically any type of data, and that would allow data scientists to apply the most appropriate analytics engines and tools to each data set in its original location. Typically, these data lakes were built using Apache Hadoop and the Hadoop Distributed File System (HDFS), combined with engines such as Apache Hive and Apache Spark.
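The key idea here is schema-on-read: unlike a warehouse, which forces data into a rigid schema at ingest, a data lake stores records as-is and imposes structure only when a query runs. A minimal sketch in plain Python (the records and field names are purely illustrative sample data, not a real data set):

```python
import json

# In a data lake, records of differing shapes sit side by side in raw form.
# Here the "lake" is just a list of raw JSON strings -- illustrative only.
raw_lake = [
    '{"user": "alice", "event": "click", "page": "/home"}',
    '{"user": "bob", "purchase": {"sku": "A1", "qty": 2}}',
    '{"sensor": 17, "temp_c": 21.5}',
]

def query(lake, predicate):
    """Parse each raw record at read time and keep those matching the predicate."""
    for raw in lake:
        record = json.loads(raw)  # structure is imposed here, not at ingest
        if predicate(record):
            yield record

# Find every record that mentions a user, regardless of its overall shape.
user_events = list(query(raw_lake, lambda r: "user" in r))
```

In a real deployment an engine such as Spark plays the role of `query`, inferring or applying a schema over raw HDFS files at read time; the point is that no record is rejected or reshaped when it lands in the lake.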
As these data lakes began to grow, a set of problems became apparent. While the technology was indeed physically capable of scaling to capture, store and analyze vast and varied collections of structured and unstructured data, too little attention was paid to the practicalities of how to embed these capabilities into business workflows. As a result, questions such as: “What data should we put in the data lake?”, “Who is going to use it?”, “How do we make it easy for them to find?”, and “How do we prevent data from being misused?” often went unanswered.
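Those unanswered questions are exactly what a metadata catalog is meant to address: each asset in the lake gets a record saying what it is, who owns it, and who may use it. A minimal sketch, with entirely hypothetical field names and sample entries:

```python
from dataclasses import dataclass, field

# A toy catalog entry capturing the questions above: what the asset is,
# who owns it, how to find it, and who may use it. Illustrative only.
@dataclass
class CatalogEntry:
    name: str
    owner: str                  # accountable team for the data set
    description: str            # makes the asset findable and understandable
    tags: list = field(default_factory=list)
    allowed_roles: list = field(default_factory=list)  # basic misuse prevention

def find(catalog, keyword):
    """Discover assets by matching a keyword against names, descriptions, and tags."""
    kw = keyword.lower()
    return [e for e in catalog
            if kw in e.name.lower()
            or kw in e.description.lower()
            or any(kw in t.lower() for t in e.tags)]

def can_read(entry, role):
    """Coarse-grained governance check: is this role allowed to use the asset?"""
    return role in entry.allowed_roles

catalog = [
    CatalogEntry("clickstream_2024", "web-team", "Raw site click events",
                 tags=["web", "events"], allowed_roles=["analyst"]),
    CatalogEntry("hr_salaries", "hr", "Employee compensation",
                 tags=["sensitive"], allowed_roles=["hr-admin"]),
]

hits = find(catalog, "events")
```

Even this toy version shows why metadata matters: without the `description` and `tags`, the clickstream data would be undiscoverable, and without `allowed_roles` there is nothing stopping an analyst from reading salary data.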
The result is that the vast majority of data lake projects have stagnated. Instead of providing a single, clear source of truth for data scientists and business analysts to work with, many lakes have become dumping grounds for abandoned data: data that nobody needs, understands or knows how to use, and that therefore provides no value to anyone.
Today’s businesses must find a way to extract value from their big data if they are to outperform the competition. Instead of giving up on the idea of data lakes, we need to find a way to make them work.
This white paper will attempt to make an honest appraisal of the problems data lakes face, and explore some new strategies that can help turn them from a stagnant backwater into the gleaming centerpiece of an organization’s big data analytics strategy.