Technical Article

Data Flow Tutorial With Mage.ai | Part 4: PostgreSQL Database

December 06, 2023 by Michael Levanduski

To move data from a local spreadsheet into a database that can feed visual dashboards, we now need to define the outbound integration to the PostgreSQL database through the mage.ai editor.

Check out the previous articles in this data pipeline series:


 

This article will pick up the data baton and cover the outbound loading of data into a database. Many companies use relational and non-relational databases as the destination endpoints for data within their data pipelines. Databases are great options for this use case since they offer persistent storage and can be integrated into business applications using APIs, database drivers, and middleware.

We will use a popular relational database called PostgreSQL in this article.

 

PostgreSQL

PostgreSQL (known informally as simply Postgres) began in 1986 as a research project at UC Berkeley sponsored by the Defense Advanced Research Projects Agency (DARPA). From those humble origins, it has grown into one of the most popular relational database management systems and remains highly relevant today.

Part of that popularity is due to its open-source nature. Other relational database systems, such as Microsoft SQL Server, are commercial products that companies must pay for. Being free, however, does not mean Postgres compromises on features, extensibility, or reliability. It offers many of the same features as paid database services and is arguably the better choice for companies whose technology stack sits outside the traditional Microsoft ecosystem.

With a strong community and decades of development behind it, Postgres stands toe to toe with other systems in terms of security and robustness.

 

PostgreSQL logo

Figure 1. Postgres logo. Image used courtesy of Kinsta

 

Postgres is highly portable and runs on multiple platforms, including Linux, so an official Postgres image is available on Docker Hub. In Part 2 of the series, the Mage documentation provided us with the YAML needed to provision a Postgres service. You should already have this service defined in the Docker Compose file (compose.yaml), but I’ve included it below just in case.

  postgres:
    image: postgres:14
    restart: on-failure
    container_name: postgres-magic
    build:
      context: ./db
    env_file:
      - .env
    environment:
      POSTGRES_DB: ${POSTGRES_DB}
      POSTGRES_USER: ${POSTGRES_USER}
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
    ports:
      - "${PG_HOST_PORT}:5432"
    volumes:
      - pgdatabase_data:/var/lib/postgresql/data

 

Within the environment variable (.env) file, set POSTGRES_HOST to the IP address of your local host machine. The PG_HOST_PORT and POSTGRES_SCHEMA values can be copied from below, as these are the defaults:

POSTGRES_DB=machine
POSTGRES_USER=admin
POSTGRES_PASSWORD=password
POSTGRES_HOST=192.168.0.197
PG_HOST_PORT=5432
POSTGRES_SCHEMA=public
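
If you want to sanity-check that the Postgres container is reachable with these credentials before wiring it into Mage, a quick script from the host will do. This is a minimal sketch, assuming the psycopg2-binary package is installed on the host machine; the fallback values are just the example settings shown above.

import os
import psycopg2

# Connect using the same values the containers receive from the .env file.
conn = psycopg2.connect(
    host=os.getenv("POSTGRES_HOST", "192.168.0.197"),
    port=int(os.getenv("PG_HOST_PORT", "5432")),
    dbname=os.getenv("POSTGRES_DB", "machine"),
    user=os.getenv("POSTGRES_USER", "admin"),
    password=os.getenv("POSTGRES_PASSWORD", "password"),
)

with conn.cursor() as cur:
    cur.execute("SELECT version();")  # simple round trip to prove connectivity
    print(cur.fetchone()[0])

conn.close()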

 

Mage Stream Configuration

Let’s continue progressing through the development of the data pipeline in the mage editor running on localhost:6789. There should be an option to select a stream below the test connection button.

 

Button for selecting the data stream

Figure 2. Stream selection option. Image used courtesy of the author

 

The stream corresponds to the sheet names within the faults Google Sheet. In my case, the sheet is named “Faults” within the Google Sheets editor.

 

Google Sheet tab names

Figure 3. Google Sheet tabs: ‘Faults’ and ‘Test’. Image used courtesy of the author

 

Therefore, upon selecting “View and select streams,” the following window will appear.

 

Choosing tabs to stream data

Figure 4. Selecting the proper data stream. Image used courtesy of the author

 

Please select the sheet that contains the fault data built in Part 1 of the series and click the blue “Confirm 1 streams” button. The Test sheet, in my case, was a sandbox sheet used for debugging. Keeping it around is a nice demonstration of the Google Sheets connector’s ability to selectively integrate only the sheets you choose.

A window like the one below should appear upon confirming the stream.

 

Creating a new data storage table

Figure 5. Creating a storage table from the data stream. Image used courtesy of the author

 

The “Destination table name” is what the table created in the Postgres database will be called when data is loaded. The usage section is fairly detailed; for simplicity, just understand that on each run of the pipeline we erase the existing contents and then upload the most recent data.
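
Mage handles this replace behavior for us, but conceptually each run boils down to the pattern sketched below. The table name faults and its columns are hypothetical placeholders, and the snippet assumes the same psycopg2 connection (conn) as the earlier connectivity check.

# Conceptual illustration only: on every run, the previous contents are wiped and replaced.
rows = [
    ("2023-12-01 08:15:00", "F-101", 3),  # hypothetical fault records
    ("2023-12-01 09:40:00", "F-207", 1),
]

with conn.cursor() as cur:
    cur.execute("TRUNCATE TABLE faults;")  # erase the previous load
    cur.executemany(
        "INSERT INTO faults (fault_time, fault_code, severity) VALUES (%s, %s, %s);",
        rows,
    )
conn.commit()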

Below these options there is a column-selection widget, which allows us to choose which columns from the fault sheet to store in the database. I’ve excluded the dimension tables and kept only the randomized “fact” table entries.

 

Data column selection options

Figure 6. Choosing individual columns from the data stream spreadsheet. Image used courtesy of the author

 

Clicking “Load sample data” displays a snapshot of the data as it appears in the Google Sheet in the right-hand window pane.

 

Instant data snapshot from the spreadsheet

Figure 7. Sample data from the Google Sheet source stream. Image used courtesy of the author

 

We’ll leave the settings section below unchanged.

 

Stream settings

Figure 8. Stream settings. Image used courtesy of the author

 

A summary of the selections is given at the end of the streams section.

 

Summary of completed data stream setup

Figure 9. Stream selection summary. Image used courtesy of the author

 


We will exclude one further setting, transformations, from this series for the sake of simplicity.

 

Mage Destination Configuration

Within the destination section, choose “PostgreSQL” from the drop-down list.

 

Choosing the destination for the data

Figure 10. Selecting a data destination. Image used courtesy of the author

 

A text block will appear below this selection and will even contain a link to documentation. Again, the mage.ai team’s approach to data pipelines is a fresh one: documentation embedded in the pipeline-building workflow encourages best practices and guides IT and OT professionals through the construction process. It is truly a great feature of the platform. Below it, we need to enter the connection data for the Postgres database container. The secrets were passed to the mage.ai container as environment variables from the .env file at startup.

 

Connection data required for Postgres connection

Figure 11. Postgres connection data passed from the .env file. Image used courtesy of the author

 

Clicking on the “Test connection” button should yield a successful connection.

 

Running the Pipeline

With the pipeline defined, we need to run it in order to actually transfer data. Runs are generally kicked off by triggers. A trigger is an event that prompts the pipeline to run, such as a scheduled daily time, an API call, or an event from a publish/subscribe architecture. To keep things simple, we will select the “Run@once” button to trigger the pipeline manually.
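
As a side note, Mage can also generate an API trigger if you ever want another system to start the pipeline instead of clicking a button in the UI. The sketch below is purely illustrative: the URL is a hypothetical placeholder that you would replace with the one shown on your trigger’s detail page, and the payload requirements depend on how the trigger is configured.

import requests

# Hypothetical placeholder URL; copy the real one from the API trigger's detail page in the Mage UI.
TRIGGER_URL = "http://localhost:6789/api/<your-api-trigger-path>"

response = requests.post(TRIGGER_URL, json={})  # payload depends on your trigger configuration
response.raise_for_status()
print(response.status_code)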

 

Button for running the data collection routine

Figure 12. Run @once button to run the pipeline. Image used courtesy of the author

 

This will yield a randomly named trigger that will execute the data pipeline. If all goes well, the completed status will appear.

 

Complete status on the data pipeline

Figure 13. Completed pipeline. Image used courtesy of the author

 

For more of a step-by-step playthrough, you can select the blue logs (cylinder) icon to view the progression of the pipeline run.

 

Log of previous data collection runs

Figure 14. Log report of previous data pipeline runs. Image used courtesy of the author

 

Where is the Visualization?

Our pipeline ran successfully! So now the question remains: how do we see the data in our Postgres database? Navigating to localhost:5432, we’re met with something strange.

 

Error message to be handled in Part 5 of the series

Figure 15. GUI error message… To be continued. Image used courtesy of the author

 

The page isn’t responding because the Postgres database container has no front-end UI; port 5432 speaks the PostgreSQL wire protocol rather than HTTP. The stored data could be managed from the command line with psql, but that’s difficult to visualize.
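
That said, if you just want to confirm the rows actually landed before the next article, a short query from the host works fine. This is a minimal sketch, reusing the connection settings from the .env file; faults is a hypothetical table name, so substitute whatever you entered as the destination table name.

import os
import psycopg2

conn = psycopg2.connect(
    host=os.getenv("POSTGRES_HOST", "192.168.0.197"),
    port=int(os.getenv("PG_HOST_PORT", "5432")),
    dbname=os.getenv("POSTGRES_DB", "machine"),
    user=os.getenv("POSTGRES_USER", "admin"),
    password=os.getenv("POSTGRES_PASSWORD", "password"),
)

with conn.cursor() as cur:
    # "faults" is hypothetical; use your destination table name and schema.
    cur.execute("SELECT COUNT(*) FROM public.faults;")
    print(f"Rows loaded: {cur.fetchone()[0]}")

conn.close()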

In the next article (forthcoming), we will cover a visual editor-based tool for Postgres and deliver on the data dashboard goal our management team set back in Part 1.