Technical Article

Data Flow Tutorial With Mage.ai | Part 1: The Challenge of Data

November 13, 2023 by Michael Levanduski

All of modern industry relies on data. In the first installment of a series showcasing the importance of industrial data movement, we explain the challenge of data flow and build a fault list.

I believe that too little significance is placed on one particular subject, to the detriment of many fields, including but not limited to control engineering-related industries.

The subject I am referring to is the flow of data within large corporations. There is quite a bit of unseen, invisible work that happens behind the front end of every MES/SCADA application or manufacturing operations report. This field of work is often performed by data engineers, architects, and analysts, quite out of mind for equipment engineers. To give a more visual example of the complexity, refer to the example system architecture provided by Amazon Web Services in Figure 1 below.

So as a control system engineer, where hardware and equipment are your daily bread and butter, why should you care?

 

Data management workflow diagram

Figure 1. AWS data flow example. Image used courtesy of AWS

 

Data Flow: The Problem Statement

The following example illustrates exactly why it can be very beneficial to care about data flow.

Let’s say you are a controls engineer responsible for the OEE of a 24/7 automated assembly machine. This machine produces product X, composed of 3 child components A, B, and C. Unfortunately, the company's SCADA system is in its infancy. The only data you have available are a few text or CSV files spit out by the local HMI on the machine after each production run.

The OEE of this particular machine has been disappointing, and leadership has entrusted you to make it better. This has been a major thorn in your side, though. Combining and aggregating the required raw data and transforming it into a usable summary has been exceedingly difficult.

How can you control the data of your machine to drive increased business productivity?

 

Simplified machine data flow diagram

Figure 2. Block diagram overview of the example challenge system. Image used courtesy of the author

 

Listing Faults: Generating a Flat File

Since you likely do not have ‘sandbox mode’ access to a multi-million dollar capital asset, we will simulate the flat file output of the hypothetical machine (a 'flat' file means a record of data stored in 2-dimensional rows). This can be accomplished in either Google Sheets or an Excel file, but for this example, I will use Google Sheets since a license is not required.

Within a single sheet column (A1:A10) we can define a fault dimension table filled with possible unique fault types for the system illustrated above in Figure 2.

 

Table for fault types

Table 1. Possible fault list

 

Just below this first table, we will now define another dimension table for dates in cells A12:A15. This will list the possible dates on which faults were recorded for a single 72-hour manufacturing run.

 

Table for fault dates

Table 2. Possible date list

 

Finally, we will build a randomly generated ‘fault’ dataset pulled from this hypothetical 72-hour manufacturing run. In cells C1 and D1, we will name headers ‘Date’ and ‘Fault’ respectively.

 

Table to fill with generated random faults

Table 3. Generated fault list headers

 

In cell C2, we will choose a random date, plucked from the date dimension table (Table 2) using the cell formula.

=INDEX($A$13:$A$15,RANDBETWEEN(1,3))

In cell D2, a random fault can be paired with the random date from the fault description dimension table (Table 1) using the cell formula.

=INDEX($A$2:$A$10,RANDBETWEEN(1,9))

The fill handle can be used to carry this formula downwards for any number of records. I chose a sample size of 499 randomly generated records for the purposes of this article. Altogether, the sheet should look similar to below (note: many more rows of data continue below the bottom of the image).

 

Generated fault record datasheet

Figure 3. A simulated fault records file in columns C and D. Image used courtesy of the author

 

Charts: The Easy, Limited Solution

The simplest solution in this scenario would be to build charts and analytics within Excel or Google Sheets. Excel is one of the most widely used data analytics tools out there. Building out a Pareto chart or even a pivot table of faults would not be that difficult with this given data array. One convenient example based on our randomized generation approach could look like the graph below.

 

Fault analysis chart

Figure 4. Example of adding analytics within the spreadsheet. Image used courtesy of the author

 

However, this solution quickly starts to fall apart when applied to a larger industrial scale seen with any real-world application. How often will you need to manually append new data from future manufacturing runs? How will you maintain data quality in a “master” Excel document? Over time will the Excel file performance begin to degrade as records saturate the sheet?

As you can see, analytics within Excel works great for ad-hoc analysis, but at scale, it has some serious flaws.

 

Data Management: The Difficult, Scalable Solution

Alas, we must now, with great reluctance, enter the world of data engineering. You need a way of automating batch data exports from the machine using long-term storage space to store this data. A dynamically updated report as live data is refreshed would be nice as well.

This is where a data pipeline tool like mage.ai provides value. The continuation of this series in the next article (Part 2) will explain the software solution and explore the initialization of a pipeline tool.