Data Flow Tutorial With Mage.ai | Part 3: Connecting to Google Sheets
In this article, we will explore the inbound integration to the mage.ai pipeline, connecting to the spreadsheet data source using the Google Sheets API through a JSON credentials file.
Data professionals often divide data pipelines and integration projects into inbound and outbound components. This helps encapsulate or manage the complexity involved in the field with discernable pieces of the project that can be better planned for. Part 3 of this data pipeline article series will dive into the topic of inbound integrations.
In our case, the inbound integration to the mage.ai pipeline is Google Sheets. As you’ll soon learn, integrations can be difficult, leveraging complex authentication models and requiring specific configurations relative to the applications involved.
Mage.ai Source Block
Picking up from where we left off in Part 2, we should have a fresh new data pipeline on the screen within the mage.ai container bound to localhost:6789.
In the center of the screen is the editor window for the pipeline. The starting point of the pipeline is the inbound integration, the source block. Select ‘Google Sheets’ as a source in the drop-down menu titled ‘Select source.’
Figure 1. Pipeline editor frame within mage.ai. Image used courtesy of the author
It is helpful to note the up-front documentation link, which is unusual for this tool space. This is designed to encourage best practices and efficiency-boosting methods to speed up deployment, a win-win for both the developers and the clients.
To proceed, there are two keys whose values we will need to populate two fields.
Where is the Spreadsheet_id?
We’ll start with the easier of the two keys, spreadsheet_id. This is simply a unique identifier for the fault data Google Sheet we created in Part 1. It is found at the trailing end of the URL for the Google Sheet:
The value will be some large, seemingly random character string, and we simply copy and paste it into the spreadsheet_id field in the mage pipeline editor. Note: if your Google Sheets URL contains a ‘/edit’ detail at the end, exclude that part and only copy the long string between the / characters.
Credentials.json and Connected Apps
The latter of the two keys, path_to_credentials_json_file, is more complex. It relates to the concept of connected apps. If you are a Google user, you’ve certainly used your Google account credentials to log in to Gmail. This authentication is between a person, represented by your own user account, and an application, the Google email service.
However, our scenario is a bit different. The mage.ai pipeline is not a person like you; it is an application. Therefore, the question arises of how to authorize the mage.ai pipeline to access services provided by the Google Sheet. The solution is connected app authentication.
Figure 2. Connected app visual. Image used courtesy of Slack
In Google’s corner of the digital universe (which likely holds some substantial composition), the mage.ai pipeline we wish to integrate requires something called a service account to authenticate. These accounts do not have passwords and are not linked to your Google Workspace.
The process of authentication is complex, but to summarize, we can simply say that instead of passwords, private/public RSA key pairs are used. We supply the private key inside a JSON file to our mage.ai application, which we see in the so-called path_to_credentials_json_file configuration field in the pipeline.
The associated public key is stored in Google Cloud, and the pairing of these two applications authenticates the mage.ai pipeline. I’ve glossed over some important details for the sake of brevity, so please find a great article here explaining the Google service account authentication process in depth.
Where is the Credentials.json file?
The process to obtain this file is not very complex, but there are quite a few clicks required within Google’s console. Once again, for the sake of article scope, I will provide a rough overview below, while a more detailed walkthrough is easily accessible online. Please keep in mind the cloud console UI is a moving target and is updated frequently. The steps and screenshots provided are accurate at the time of this article’s publishing (November 2023).
The overall goal of this process is to create a service account (email address) linked to the API which can communicate with mage.ai. Then, we will need to grant this email address access to the Google Sheet with our fault data.
- Navigate to https://console.developers.google.com
- Click on ‘Select a project,’ where you will need to create a new project in the top right corner:
- Create a new project name and it will appear similar to below (I have named this new project simply ‘mage’):
- Navigate to the credentials subsection of the ‘APIs and Services’ pane:
- Click 'Manage Service Accounts':
- Create a service account, which will be attached to an email address of the format ‘gsheets(or your chosen name below)@(unique project ID).iam.gserviceaccount.com’:
- Generate a key for the service account by clicking the ‘keys’ tab and the ‘Add Key’ button, being sure to select the JSON type:
Remember that service accounts are not related to your own personal Google Drive workspace. Therefore, there is one additional authorization that we must enable within our own user account, including activating the Google Sheets API and then sharing the document.
- Within the project, search for “Google Drive” to enable the Google Drive API:
- Within the project, search for “Google Sheet” to enable the Google Sheets API:
- Lastly, back in the fault document itself, ‘Share’ the sheet with the service account email generated from the Google console by clicking the share button. Add the service account email.
Passing the Credentials.json File
First and foremost, do not share this file with anyone or save it in a public space. When you generated the JSON credentials file, it should have saved in the ‘downloads’ folder of the host machine. Locate it and move it into the project directory that is shared with the docker container via a host-mounted volume.
I’ve chosen the data_loaders directory since the credentials are used within the context of that space, but you may choose anywhere in the mage_data/magic directory you wish. I’ve renamed the file to obscure the leading characters of the private_key_id, as well as for general better readability.
Figure 3. Complete structure for local mage directory. Image used courtesy of the author
The configuration file will now appear in the mage container.
Figure 4. Mage.ai docker container. Image used courtesy of the author
Pass the absolute file path to the path_to_credentials_json_file.
Figure 5. Inbound data source configuration. Image used courtesy of the author
To test whether the integration is successful, select the ‘Test Connection’ box at the end of the source block.
Figure 6. Connection test option. Image used courtesy of the author
If the connection is successful, a green text notification should appear next to the ‘Test connection’ button.
Figure 7. Successful connection test! Image used courtesy of the author
In the right-hand visual tree of the pipeline, the source block should also have a green checkbox associated:
Figure 8. Green check indicating successful connection. Image used courtesy of the author
Next Steps: Outbound Data
With the inbound sorted, the next article (forthcoming) will progress to the outbound integration and the Postgresql database container.