
Data Flow Tutorial With Mage.ai | Part 2: Initializing the Software

November 20, 2023 by Michael Levanduski

Data pipelines are software services that move data from a source to a storage destination, ideally without too much programming complexity. In this article, we begin by initializing the mage.ai software.

What is a Data Pipeline?

Data pipelines refer to the movement of data from the original sources into some form of data storage destination.

The sources could be any variety of items, including but not limited to physical IoT devices (sensors), APIs, and flat files. The storage destinations are also quite varied. Commonly, the destinations are relational databases, NoSQL databases, and cloud object stores such as AWS S3 buckets.

As the data flows from source to destination, transformations and processing of the data may occur to make the format compatible with the destination platform. This could include filtering, imputing, or aggregating the data prior to entering the destination. Data pipelines serve as a critical feature of the modern technology world.

 

Description of a data pipeline

Figure 1. Data pipeline visual. Image used courtesy of xenonstack

 

Introduction to Mage.ai

Mage.ai is a software tool used to build data pipelines. There have been many others before it. One of the earliest was Microsoft SQL Server Integration Services (SSIS). Other, more recent players include Apache Airflow and Apache NiFi. All of these tools work well, but each also has quirks and drawbacks. SSIS is clunky, outdated, and struggles to integrate with newer tech such as cloud storage buckets. Apache Airflow is perhaps the most widely used in the industry, offering integration support for many technologies. Workflows are defined using Python code, which makes it friendly to data technologists who prefer to code instead of learning how to click around an ever-changing UI landscape.

While Airflow is a great tool, it has a learning curve surrounding the Directed Acyclic Graphs (DAGs) that define the sequence of task operations. Airflow is not going anywhere, but competitors such as mage.ai are emerging, offering another approach to orchestrating data pipelines.

Continuing this article series, we will examine how to use this platform with our machine’s ‘fault data,’ which we created in the previous article.

 

Mage.ai banner and slogan

Figure 2. Mage.ai markets a user-friendly experience with no quirks. Image used courtesy of mage.ai

 

Initializing and Running Mage.ai

A feature that I appreciate is mage.ai’s ability to run in a container from a Docker image (if you want more background on Docker containers, check out the previous series about IoT technology). This makes running the platform incredibly straightforward and quick to get started with.
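
Before introducing the full setup, it is worth noting that the platform can be launched from the image alone. The one-liner below is a minimal sketch using the same image, port, mount path, and project name that appear in the Docker Compose file shown shortly; the Compose setup is what we will actually use, since it adds the database service.

# Minimal single-container run (no Postgres service yet). The image, port,
# volume mount, and project name "magic" match the compose file below.
docker run -it -p 6789:6789 -v "$(pwd)":/home/src mageai/mageai:latest mage start magic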

Within the documentation, the mage team provides Docker Compose YAML code to get started. It also includes a Postgres relational database service to tie a database to the mage pipeline for destination storage.

version: '3'
services:
  mage:
    image: mageai/mageai:latest
    container_name: magic
    depends_on:
      - postgres
    command: mage start magic
    env_file:
      - .env
    environment:
      ENV: dev
      POSTGRES_DB: ${POSTGRES_DB}
      POSTGRES_USER: ${POSTGRES_USER}
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
      POSTGRES_HOST: ${POSTGRES_HOST}
      PG_HOST_PORT: ${PG_HOST_PORT}
    ports:
      - 6789:6789
    volumes:
      - .:/home/src/
    restart: on-failure:5
  
  postgres:
    image: postgres:14
    restart: on-failure
    container_name: postgres-magic
    build:
      context: ./db
    env_file:
      - .env
    environment:
      POSTGRES_DB: ${POSTGRES_DB}
      POSTGRES_USER: ${POSTGRES_USER}
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
    ports:
      - "${PG_HOST_PORT}:5432"
    volumes:
      - pgdatabase_data:/var/lib/postgresql/data

volumes:
  pgdatabase_data:

 

Notice that there are environment variables referenced within the YAML file. We can include a .env file with the associated values we would like to pass to the containers upon startup; Docker Compose also uses this file to substitute the ${...} references.

POSTGRES_DB=machine
POSTGRES_USER=admin
POSTGRES_PASSWORD=password
POSTGRES_HOST=192.168.0.197
PG_HOST_PORT=5432
POSTGRES_SCHEMA=public
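
As an optional check that Docker Compose is picking up the values from the .env file, the configuration can be rendered with every variable reference substituted:

# Print the compose configuration with all ${...} references resolved from .env.
docker compose config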

 

Within a project directory, the files will appear as shown in Figure 3.

 

Beginning project file directory

Figure 3. File directory before docker compose. Image used courtesy of the author

 

In a terminal window, running docker compose up -d will start both the mage and postgres services on the specified ports of the local machine.
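
The commands below sketch that startup step along with a few optional checks; the container name, user, and database referenced here all come from the compose file and .env above.

# Start both services in the background.
docker compose up -d

# Confirm that both containers are running and the ports are published.
docker compose ps

# Optionally, follow the mage logs while the service initializes.
docker compose logs -f mage

# Optionally, confirm that the Postgres database accepts connections.
docker exec -it postgres-magic psql -U admin -d machine -c "SELECT 1;"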

When the specified volume is mounted, several new directories will be created within the host machine’s project folder. This allows files to be shared between the local machine and the container.
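
A quick way to see the bind mount in action is to list the mount path inside the running container; the container name and path below come from the compose file, and the listing should mirror the host project folder.

# The project folder is mounted at /home/src inside the "magic" container.
docker exec -it magic ls /home/src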

 

Complete project file directory

Figure 4. The complete project file directory. Image used courtesy of the author

 

Quick Overview of mage.ai Pipelines

Navigating to http://localhost:6789 in a browser will open the mage.ai UI. The options within mage.ai include the features shown below.
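
If the page does not load, an optional check from the terminal can confirm whether the web server is responding on the published port.

# Expect an HTTP status code (e.g., 200) once the mage UI has finished starting.
curl -sS -o /dev/null -w "%{http_code}\n" http://localhost:6789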

 

Mage.ai menu options

Figure 5. Mage.ai menu options. Image used courtesy of the author

 

Since we intend to focus on data pipelines, let’s select the pipelines tab and explore this menu first. The container will already have a default standard pipeline, ‘example_pipeline’, set up. Select the ‘New’ button to see the different types of pipelines offered.

 

New pipeline creation button

Figure 6. Creating a new data pipeline in mage.ai. Image used courtesy of the author

 

You’ll notice that there are three main types of pipeline foundations offered to build on top of:

  • Standard (batch): Data is accumulated in a ‘batch’ at the source and sent or uploaded manually or on a scheduled time interval, moving a set of records to the destination at once.
  • Data Integration: The synchronization of the data source system with another system.
  • Streaming: The integration or subscription to a publish/subscribe-based system such as an Azure Event Hub or Apache Kafka stream.

 

Pipeline input options

Figure 7. Options available for creating a new data pipeline in mage.ai. Image used courtesy of the author

 

Select the ‘Data integration’ option displayed above; we will pause here before working through the pipeline that appears.

In the next article, we will dive into the work involved in integrating the Google Sheet into mage.ai. The steps within the mage.ai container are not incredibly difficult; however, navigating the Google Cloud UI may pose a greater challenge, so we will reserve that process for its own article. Please join me in the next installment (forthcoming) to learn more.