Data Lake vs. Big Data for Industrial Applications

December 09, 2021 by Antonio Armenta

This article defines big data and data lakes, explains how they work together, and looks at their industrial applications.

Data lake and big data are two modern terms that are often misunderstood and misused. Because both imply large volumes of data, they are sometimes used interchangeably. However, a data lake and big data are different things, even though their definitions are not yet fully established.

 

Figure 1. Modern data can come from many sources and be of different types. Image used courtesy of Analytics Vidhya

 

Let’s first look at a brief historical context. In the late 2000s, with the explosive growth of social media platforms such as Facebook and Twitter, many data scientists began to realize the potential of such platforms for generating large amounts of valuable personal data. Consequently, new software was developed to facilitate data processing and analysis. One prominent example is Apache Hadoop, an open-source framework of tools for storing and processing big data volumes of information across clusters of computers.

In the next decade, the Internet of Things (IoT) entered the scene. This opened the door to millions more data sources that could provide insights into a person’s preferences and patterns, while also sending information about the product itself.

Simultaneously, machine learning was making important advances and finding more practical applications in the industrial landscape. This resulted in an increased need to handle large volumes of data in industry, particularly in automated processes.

Projections indicate that the overall amount of data available in the world will continue to expand at an accelerating rate in the coming years. For reference, in 2016, the world passed the milestone of 1 Zettabyte of annual internet traffic. One Zettabyte equals 1 trillion Gigabytes.
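The unit conversion quoted above can be checked with a few lines of Python. This is only a sketch: it assumes decimal (SI) units, where each step is a factor of 1,000, and the constant and function names are illustrative.

```python
# Decimal (SI) byte units. Binary units (GiB, TiB, ...) would give
# slightly different figures; SI units are assumed here.
GIGABYTE = 10**9
TERABYTE = 10**12
PETABYTE = 10**15
ZETTABYTE = 10**21


def convert(amount, from_unit, to_unit):
    """Convert an amount expressed in one unit into another unit."""
    return amount * from_unit / to_unit


# One zettabyte expressed in gigabytes: 10**21 / 10**9 = 10**12 (one trillion).
gigabytes_per_zettabyte = convert(1, ZETTABYTE, GIGABYTE)
print(f"{gigabytes_per_zettabyte:,.0f} GB")
```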

Annual internet traffic is expected to surpass 3 Zettabytes in 2021. These projections, along with the expanded capabilities of cloud computing, indicate that the value and uses of big data (and data lakes) are perhaps only just beginning.

 

What is Big Data?

Viewed purely in terms of volume, the definition of big data is a moving target. As the amount of data and the storage space available continue to grow, so does the benchmark for what is considered a large amount of information.

Today, a data repository of 100 Terabytes in size or more is generally considered to be in the range of big data. Large data repositories such as those from social media platforms may be in the range of several Petabytes.

Another way to define big data is as an amount of information that can’t be handled by traditional tools, such as SQL-based relational databases. For example, it is not uncommon today for a database to reach 1 Terabyte in size within a year. But because SQL applications keep becoming more powerful, a database of this magnitude can still be managed with them and is therefore not typically considered big data.
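As a rough illustration of the volume criterion, the sketch below compares a repository’s size against the 100 Terabyte benchmark mentioned above. The threshold is a parameter because the benchmark is a moving target, and the function name is hypothetical.

```python
TERABYTE = 10**12  # decimal (SI) units assumed
PETABYTE = 10**15

# Illustrative default taken from the 100 TB figure in the text;
# it is a rule of thumb, not a standard.
DEFAULT_THRESHOLD_BYTES = 100 * TERABYTE


def is_big_data_by_volume(size_bytes, threshold_bytes=DEFAULT_THRESHOLD_BYTES):
    """Return True if the repository meets the volume benchmark."""
    return size_bytes >= threshold_bytes


print(is_big_data_by_volume(1 * TERABYTE))   # a 1 TB SQL database
print(is_big_data_by_volume(3 * PETABYTE))   # a multi-petabyte repository
```

Volume alone is only one criterion; the other factors are covered in the next section.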

 

Big Data’s 4V Model

So far, we have looked at the definition of big data from the perspective of volume. There are three other important factors to consider: velocity, variety, and veracity. These, together with volume, form the 4V model.

 

Figure 2. The 4V model of big data: Volume, velocity, variety, and veracity. Image used courtesy of APSense

 

Variety refers to all the different types of data stored in a big data repository: text, images, sound, video, etc. It also refers to the fact that data can come from multiple sources.

Velocity is an important consideration in big data because the information is constantly streaming in. Velocity is concerned with the speed at which data is collected, generated, and distributed.

Veracity refers to the accuracy and quality of the data, which determine whether a data scientist can rely on it for analysis and for drawing conclusions.
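To make the four factors concrete, here is a minimal sketch that summarizes a batch of incoming records along each dimension. The record layout (the kind, size_bytes, and valid fields) and all names are assumptions made for illustration, and the share of valid records is only a simple stand-in for veracity.

```python
from dataclasses import dataclass


@dataclass
class FourVProfile:
    volume_bytes: int      # volume: total size of the batch
    velocity_per_s: float  # velocity: records received per second
    variety: set           # variety: distinct kinds of data seen
    veracity: float        # veracity: share of records that passed validation


def profile_batch(records, window_seconds):
    """Summarize a batch of records collected over window_seconds."""
    if not records or window_seconds <= 0:
        raise ValueError("need at least one record and a positive time window")
    valid_count = sum(1 for r in records if r["valid"])
    return FourVProfile(
        volume_bytes=sum(r["size_bytes"] for r in records),
        velocity_per_s=len(records) / window_seconds,
        variety={r["kind"] for r in records},
        veracity=valid_count / len(records),
    )


batch = [
    {"kind": "text", "size_bytes": 200, "valid": True},
    {"kind": "image", "size_bytes": 5000, "valid": True},
    {"kind": "text", "size_bytes": 150, "valid": False},
    {"kind": "text", "size_bytes": 250, "valid": True},
]
print(profile_batch(batch, window_seconds=2))
```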

Now that we understand big data, let’s review data lakes before diving into how to use these in a control system.

 

What is a Data Lake?

Data lakes are centralized repositories of large amounts of raw data, meaning information that may or may not prove valuable in the future and whose purpose is not yet fully known. Data lakes may store relational and non-relational databases, together with other types of files and entities.

Although the information in a data lake is not processed or organized, the lake itself follows a planned architecture that accounts for all of its inputs and outputs.
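A minimal sketch of this idea, assuming a local folder stands in for the lake: each payload is stored unchanged, while the folder layout and a small metadata file supply the architecture around it. The source=/date= folder convention, the .meta.json sidecar, and the function name are illustrative assumptions, not a standard.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def land_raw(lake_root, source, filename, payload, ingested_at=None):
    """Store payload bytes unchanged and write a metadata sidecar next to them."""
    ingested_at = ingested_at or datetime.now(timezone.utc)
    # The data stays raw; only its location is organized, by source and date.
    target_dir = (
        Path(lake_root) / "raw" / f"source={source}" / f"date={ingested_at:%Y-%m-%d}"
    )
    target_dir.mkdir(parents=True, exist_ok=True)

    data_path = target_dir / filename
    data_path.write_bytes(payload)

    # Minimal metadata so the file can be found and interpreted later.
    meta = {
        "source": source,
        "ingested_at": ingested_at.isoformat(),
        "size_bytes": len(payload),
    }
    (target_dir / (filename + ".meta.json")).write_text(json.dumps(meta, indent=2))
    return data_path
```

Calling land_raw("lake", "press_07", "log.csv", data) would place the file under lake/raw/source=press_07/date=<ingestion date>/, where "press_07" is a made-up machine name.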

 

Data Lake vs. Big Data

A data lake is an instance of a big data application. Data lakes follow the criteria described in the 4V model, with some added particularities. In terms of volume, data lakes are, on average, near the lower end of what is considered big data.

The information in a data lake has variety, with the condition that it is all unprocessed raw data. Velocity matters too: input and output speeds are as relevant as in any modern system. Finally, a well-designed data lake includes data quality evaluations, which address veracity.

 

Industrial Applications for Data

Advanced automation is driving a rapid increase in the amount of information handled on the factory floor. Thanks to this, manufacturing and other industrial processes are now entering the realm of big data, with several business activities now employing tools such as data lakes.

One prominent example is predictive maintenance. The ability to predict a mechanical or electrical failure is very valuable and can provide substantial savings in repair costs. Data lakes can compile information coming from log files, multiple sensors, and input devices, and that information can then be used to understand trends and predict issues.
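As a simplified illustration of spotting a trend in sensor data, the sketch below smooths a series of readings with a rolling mean and reports the first sample at which the average exceeds a limit. The readings, window size, and limit are made-up values, and real predictive maintenance typically relies on more sophisticated models.

```python
from collections import deque


def rolling_means(values, window):
    """Return the mean of each run of `window` consecutive values."""
    if window <= 0:
        raise ValueError("window must be positive")
    buffer = deque(maxlen=window)
    total = 0.0
    means = []
    for value in values:
        if len(buffer) == window:
            total -= buffer[0]  # this value is about to drop out of the window
        buffer.append(value)
        total += value
        if len(buffer) == window:
            means.append(total / window)
    return means


def first_alert_index(values, window, limit):
    """Index of the first sample whose rolling mean exceeds `limit`, else None."""
    for i, mean in enumerate(rolling_means(values, window)):
        if mean > limit:
            return i + window - 1  # index of the last sample in that window
    return None


# Hypothetical vibration readings from one machine, in arbitrary units.
vibration = [1, 1, 1, 5, 5, 5]
print(first_alert_index(vibration, window=3, limit=3.0))
```

Averaging over a window keeps a single noisy reading from raising an alert, at the cost of reacting a few samples later.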

Machine learning is another example. Here, algorithms learn from data so that equipment such as robots can adapt to changing external conditions. The information is captured in much the same way as for predictive maintenance, with the additional step that evaluations and changes to the process are automatically fed back to the system controller. Machine learning data can be stored in a structured data lake.

 

Figure 3. Machine learning has several strategies that each require large amounts of data. Image used courtesy of WordStream

 

To conclude, a data lake is an instance of a big data application, and the two concepts work together. By using both big data and data lakes, a control engineer can predict failures, create maintenance routines, advance the facility’s digital transformation, and much more.

What do you use big data and data lakes for in your job?