Data Flow Tutorial With Mage.ai | Part 4: PostgreSQL Database
To move data from a local spreadsheet into a database that can feed visual dashboards, we must now define the outbound integration to the PostgreSQL database in the mage.ai editor.
Check out the previous articles in this data pipeline series:
This article will pick up the data baton and cover the outbound loading of data into a database. Many companies use relational and non-relational databases as the destination endpoints for data within their data pipelines. Databases are great options for this use case since they offer persistent storage and can be integrated into business applications using APIs, database drivers, and middleware.
We will use a popular relational database called PostgreSQL in this article.
PostgreSQL
PostgreSQL (known informally as simply Postgres) began as the POSTGRES project at UC Berkeley in 1986, sponsored in part by the Defense Advanced Research Projects Agency (DARPA). From these humble origins, it has grown into a popular relational database management system and is still very relevant in today’s world.
Part of this popularity is due to its open-source nature. Other relational database systems, such as Microsoft SQL Server, are products that companies must pay for. Being free, however, does not mean Postgres compromises on features, extensibility, or reliability. It offers many of the same features as paid database services and is arguably the better choice for companies whose technology stack sits outside the traditional Microsoft ecosystem.
With a strong community and decades of development, Postgres can stand toe-to-toe with other systems in terms of security and robustness.
Figure 1. Postgres logo. Image used courtesy of kinsta
The extensibility of Postgres allows the system to run across multiple platforms, including Linux, and an official Postgres image exists on Docker Hub. In Part 2 of the series, the Mage documentation provided the YAML code required to provision a Postgres service. You should already have this service defined in your Docker Compose file, but I’ve included it below just in case.
```yaml
postgres:
  image: postgres:14
  restart: on-failure
  container_name: postgres-magic
  build:
    context: ./db
  env_file:
    - .env
  environment:
    POSTGRES_DB: ${POSTGRES_DB}
    POSTGRES_USER: ${POSTGRES_USER}
    POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
  ports:
    - "${PG_HOST_PORT}:5432"
  volumes:
    - pgdatabase_data:/var/lib/postgresql/data
```
Within the environment variable file (.env), set the POSTGRES_HOST variable to the IP address of your local host machine. The PG_HOST_PORT and POSTGRES_SCHEMA values should be copied from below, as these are the default settings:
```
POSTGRES_DB=machine
POSTGRES_USER=admin
POSTGRES_PASSWORD=password
POSTGRES_HOST=192.168.0.197
PG_HOST_PORT=5432
POSTGRES_SCHEMA=public
```
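As a quick sanity check, these values can be assembled into a standard Postgres connection URL. The sketch below is a hypothetical helper, not part of mage.ai; its fallback defaults simply mirror the values above.

```python
import os

def postgres_url(env=os.environ):
    """Build a libpq-style connection URL from the .env variables,
    falling back to the article's defaults when a variable is unset."""
    return "postgresql://{user}:{pw}@{host}:{port}/{db}".format(
        user=env.get("POSTGRES_USER", "admin"),
        pw=env.get("POSTGRES_PASSWORD", "password"),
        host=env.get("POSTGRES_HOST", "192.168.0.197"),
        port=env.get("PG_HOST_PORT", "5432"),
        db=env.get("POSTGRES_DB", "machine"),
    )

# Passing an empty mapping uses the defaults from the article's .env file.
print(postgres_url({}))
```

Any tool that speaks libpq (psql, SQLAlchemy, and so on) accepts a URL in this shape, which makes it a convenient single value to pass around.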
Mage Stream Configuration
Let’s continue developing the data pipeline in the mage editor running on localhost:6789. There should be an option to select a stream below the “Test connection” button.
Figure 2. Stream selection option. Image used courtesy of the author
The stream corresponds to the sheet names within the faults Google Sheet. In my case, the sheet is named “Faults” within the Google Sheets editor.
Figure 3. Google Sheet tabs: ‘Faults’ and ‘Test’. Image used courtesy of the author
Therefore, upon selecting “View and select streams,” the following window will appear.
Figure 4. Selecting the proper data stream. Image used courtesy of the author
Please select the sheet that contains the fault data built in Part 1 of the series and click the blue “Confirm 1 streams” button. The Test sheet, in my case, was a sandbox sheet used for debugging. Leaving it unselected is a nice demonstration of the Google Sheets connector’s ability to integrate sheets selectively.
A window like the one below should appear upon confirming the stream.
Figure 5. Creating a storage table from the data stream. Image used courtesy of the author
The “Destination table name” is the name of the table that will be created within the Postgres database when data is loaded. The usage section is fairly detailed; for simplicity, understand that each run of the pipeline erases the table’s contents and then uploads the most recent data.
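The erase-then-upload behavior described above can be sketched in a few lines of Python. This is only an illustration of the pattern, not Mage’s actual implementation; it uses an in-memory SQLite database as a stand-in for Postgres, and the table and column names are made up for the example.

```python
import sqlite3

def overwrite_table(conn, table, columns, rows):
    """Drop and recreate `table`, then load the latest rows.
    Each pipeline run replaces the previous contents entirely."""
    cur = conn.cursor()
    cur.execute("DROP TABLE IF EXISTS {}".format(table))
    cur.execute("CREATE TABLE {} ({})".format(table, ", ".join(columns)))
    placeholders = ", ".join("?" for _ in columns)
    cur.executemany(
        "INSERT INTO {} VALUES ({})".format(table, placeholders), rows
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
# First "run" loads two rows.
overwrite_table(conn, "faults", ["machine_id", "fault_code"],
                [(1, "E101"), (2, "E202")])
# A second "run" wipes the table and loads only the newest data.
overwrite_table(conn, "faults", ["machine_id", "fault_code"],
                [(3, "E303")])
print(conn.execute("SELECT COUNT(*) FROM faults").fetchone()[0])  # 1 row remains
```

The key point is that the destination always reflects the latest snapshot of the source sheet; no history accumulates between runs.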
Below these options is a feature-selection widget. It allows us to choose which columns of the fault sheet to store in the database. I’ve excluded the dimension tables and kept only the randomized “fact” table entries.
Figure 6. Choosing individual columns from the data stream spreadsheet. Image used courtesy of the author
Clicking on “Load sample data” will yield a snapshot of the data as it appears in the Google Sheet, shown in the right-hand window pane.
Figure 7. Sample data from the Google Sheet source stream. Image used courtesy of the author
We’ll leave the settings section below unchanged.
Figure 8. Stream settings. Image used courtesy of the author
A summary of the selections is given at the end of the streams section.
Figure 9. Stream selection summary. Image used courtesy of the author
We will exclude one further setting, transformations, from this series for the sake of simplicity.
Mage Destination Configuration
Within the destination section, choose “PostgreSQL” from the drop-down list.
Figure 10. Selecting a data destination. Image used courtesy of the author
A text block will appear below this selection, and it even contains a link to documentation. Again, the mage.ai team’s approach to data pipelines is novel: documentation embedded in the pipeline-building process encourages best practices and guides IT and OT professionals through construction. It is truly a great feature of the platform. Next, we need to enter the connection data for the Postgres database container below. These secrets were passed to the mage.ai container as environment variables from the .env file at startup.
Figure 11. Postgres connection data passed from the .env file. Image used courtesy of the author
Clicking on the “Test connection” button should yield a successful connection.
Running the Pipeline
With the pipeline defined, we need to run it in order to transfer data. Runs are generally initiated through triggers: events that prompt the pipeline to run, such as a daily schedule, an API call, or an event from a publish/subscribe architecture. To keep things simple, we will select the “Run@once” button to trigger the pipeline manually.
Figure 12. Run @once button to run the pipeline. Image used courtesy of the author
This will yield a randomly named trigger that will execute the data pipeline. If all goes well, the completed status will appear.
Figure 13. Completed pipeline. Image used courtesy of the author
For more of a step-by-step playthrough, you can select the blue logs (cylinder) icon to view the progression of the pipeline run.
Figure 14. Log report of previous data pipeline runs. Image used courtesy of the author
Where is the Visualization?
Our pipeline ran successfully! So now the question remains: how do we see the data in our Postgres database? Navigating to localhost:5432 in a browser, we’re met with something strange.
Figure 15. GUI error message… To be continued. Image used courtesy of the author
The reason the page isn’t responding is that the Postgres database container does not have a front-end visual UI. Managing the stored data is possible from the command line (for example, with the psql client), but that’s difficult to visualize.
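In the meantime, the loaded rows can still be inspected programmatically. Below is a minimal sketch using the Python DB-API; the table name “fault_table” and the psycopg2 connection details are illustrative assumptions, not values taken from the pipeline.

```python
def fetch_rows(conn, table, limit=5):
    """Return up to `limit` rows from `table` via any DB-API connection."""
    cur = conn.cursor()
    cur.execute("SELECT * FROM {} LIMIT {}".format(table, int(limit)))
    rows = cur.fetchall()
    cur.close()
    return rows

# Example usage against the article's container (assumes psycopg2 is
# installed: pip install psycopg2-binary):
#   import psycopg2
#   conn = psycopg2.connect(dbname="machine", user="admin",
#                           password="password", host="192.168.0.197",
#                           port=5432)
#   print(fetch_rows(conn, "fault_table"))
```

Because the function only relies on the DB-API cursor interface, the same code works against Postgres, SQLite, or any other compliant driver, which is handy for quick checks.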
In the next article (forthcoming), we will cover a visual editor-based tool for Postgres, as well as deliver on our original data dashboard goal that our management team required back in Part 1.