Technical Article

Data Flow Tutorial With Mage | Part 2: Initializing the Software

November 20, 2023 by Michael Levanduski

Data pipelines are software services that move data from source to storage, ideally without excessive programming complexity. In this article, we begin by initializing the Mage software.

What is a Data Pipeline?

Data pipelines refer to the movement of data from the original sources into some form of data storage destination.

The sources could be any variety of items, including but not limited to physical IoT devices (sensors), APIs, and flat files. The storage destinations are also quite varied. Commonly, the destinations are relational databases, NoSQL databases, and cloud object stores such as AWS S3 buckets.

As the data flows from source to destination, transformations and processing of the data may occur to make the format compatible with the destination platform. This could include filtering, imputing, or aggregating the data prior to entering the destination. Data pipelines serve as a critical feature of the modern technology world.
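The filtering, imputing, and aggregating steps mentioned above can be sketched in plain Python. This is an illustrative example only; the record layout, field names, and fill strategy are assumptions, not part of any specific pipeline tool.

```python
# Minimal sketch of common in-flight transformations:
# filter out bad records, impute missing values, then aggregate.
from statistics import mean

def transform(records):
    """Filter, impute, and aggregate raw sensor records."""
    # Filter: drop records that are missing a sensor id entirely.
    valid = [r for r in records if r.get("sensor_id") is not None]

    # Impute: replace missing temperature readings with the mean
    # of the readings that are present.
    present = [r["temp_c"] for r in valid if r["temp_c"] is not None]
    fill = mean(present) if present else 0.0
    for r in valid:
        if r["temp_c"] is None:
            r["temp_c"] = fill

    # Aggregate: average temperature per sensor id.
    grouped = {}
    for r in valid:
        grouped.setdefault(r["sensor_id"], []).append(r["temp_c"])
    return {sid: mean(temps) for sid, temps in grouped.items()}

raw = [
    {"sensor_id": "a", "temp_c": 20.0},
    {"sensor_id": "a", "temp_c": None},   # will be imputed
    {"sensor_id": "b", "temp_c": 30.0},
    {"sensor_id": None, "temp_c": 99.0},  # filtered out
]
print(transform(raw))
```

In a real pipeline, steps like these run between extraction and loading so the data arrives in a shape the destination can accept.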


Description of a data pipeline

Figure 1. Data pipeline visual. Image used courtesy of xenonstack


Introduction to Mage

Mage is software used to build data pipelines, and there have been many tools before it. One of the earliest was Microsoft SQL Server Integration Services (SSIS); more recent players include Apache Airflow and Apache NiFi. All of these tools work well, but each has quirks and drawbacks. SSIS is clunky, outdated, and struggles to integrate with newer technology such as cloud storage buckets. Apache Airflow is perhaps the most widely used in the industry, offering integration support for many technologies. Its workflows are defined in Python code, which makes it friendly to data technologists who prefer to write code instead of learning to click around an ever-changing UI landscape.

While Airflow is a great tool, there is a learning curve surrounding the Directed Acyclic Graphs (DAGs) that define the sequence of task operations. Airflow is not going anywhere, but competitors, such as Mage, are emerging. Mage is another approach to orchestrating data pipelines.

Continuing this article series, we will examine the use of this platform for our machine’s ‘fault data,’ which we created in the previous article.


Mage banner and slogan

Figure 2. Mage markets a user-friendly experience with no quirks. Image used courtesy of Mage


Initializing and Running Mage

A feature that I appreciate is Mage’s ability to run in a container from a Docker image (if you want more background on Docker containers, check out the previous series about IoT technology). This makes running the platform incredibly straightforward and quick to get started with.

Within the documentation, the Mage team provides the Docker Compose YAML code to get started. It also includes a Postgres relational database service to tie a database to the Mage pipeline for destination storage.

version: '3'
services:
  magic:
    image: mageai/mageai:latest
    container_name: magic
    depends_on:
      - postgres
    command: mage start magic
    env_file:
      - .env
    environment:
      ENV: dev
    ports:
      - 6789:6789
    volumes:
      - .:/home/src/
    restart: on-failure:5

  postgres:
    image: postgres:14
    restart: on-failure
    container_name: postgres-magic
    build:
      context: ./db
    env_file:
      - .env
    ports:
      - "${PG_HOST_PORT}:5432"
    volumes:
      - pgdatabase_data:/var/lib/postgresql/data

volumes:
  pgdatabase_data:



Notice that there are some environment variables defined within the YAML file. We can include a .env file with the associated values we would like to pass to the containers upon startup.
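A minimal .env file might look like the following. Only PG_HOST_PORT is referenced in the Compose file above; the POSTGRES_* names are the standard variables read by the official Postgres image, and the values shown are placeholder assumptions for local development.

```
# Illustrative .env for local development — values are placeholders
PG_HOST_PORT=5432
POSTGRES_DB=magedb
POSTGRES_USER=postgres
POSTGRES_PASSWORD=postgres
```

Keep this file out of version control, since it will typically hold credentials.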



Within a project directory, the files will appear as such.


Beginning project file directory

Figure 3. File directory before docker compose. Image used courtesy of the author


In a terminal window, running docker compose up -d will yield both the Mage and Postgres services on the specified ports of the local machine.
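The terminal session might look like the following. The second command is an optional sanity check that both containers are up; it is not required by the tutorial.

```shell
# Start both services in detached (background) mode
docker compose up -d

# Optional: confirm the magic and postgres-magic containers are running
docker compose ps
```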

When the specified volume is mounted, several new directories will be created within the host machine’s project folder. This allows files to be shared from the local machine to the container.


Complete project file directory

Figure 4. The complete project file directory. Image used courtesy of the author


Quick Overview of Pipelines

Navigating to localhost on port 6789 will bring up the Mage container UI. The options within include the features below.


Mage menu options

Figure 5. Mage menu options. Image used courtesy of the author


Since we intend to focus on data pipelines, let’s select the pipelines tab and explore this menu first. The container will already have a default ‘example_pipeline’ in place. Select the ‘New’ button to see the different types of pipelines offered.


New pipeline creation button

Figure 6. Creating a new data pipeline in Mage. Image used courtesy of the author


You’ll notice that there are three main types of common pipeline foundations offered to build on top of:

  • Standard (batch): Data is accumulated in a ‘batch’ at the source and sent or uploaded on a manual or scheduled time interval. It moves a bunch of data at once to the destination.
  • Data Integration: The synchronization of the data source system with another system.
  • Streaming: The integration or subscription to a publish/subscribe-based system such as an Azure Event Hub or Apache Kafka stream.
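The first option, batch movement, can be sketched in a few lines of Python: records accumulate at the source and move in chunks rather than one at a time. The function and names below are illustrative, not part of any pipeline tool’s API.

```python
# Hedged sketch of the "Standard (batch)" idea: group incoming
# records into fixed-size batches before sending them onward.
def batches(source, batch_size):
    """Yield fixed-size batches of records from an iterable source."""
    batch = []
    for record in source:
        batch.append(record)
        if len(batch) == batch_size:
            yield list(batch)
            batch.clear()
    if batch:  # flush the final partial batch
        yield list(batch)

for chunk in batches(range(5), 2):
    print(chunk)
```

Streaming pipelines invert this model: rather than accumulating and shipping chunks, each record is processed as it arrives from the publish/subscribe system.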


Pipeline input options

Figure 7. Options available for creating a new data pipeline in Mage. Image used courtesy of the author


Select the ‘Data integration’ option displayed above, and we can now take a short pause to examine the pipeline that appears.

In the next article, we will dive into the work involved in integrating the Google Sheet into Mage. The steps within the container are not incredibly difficult; however, navigating the Google Cloud UI may pose a greater challenge, so we will reserve that process for its own article. Please join me in the next installment (forthcoming) to learn more.