AWS: Using Trifacta (known as Cloud Dataprep in GCP) in AWS, Part 1: Setting up
I came across Trifacta while helping a customer move to AWS. The customer had several data pipelines feeding GCP BigQuery (their data warehouse) and wanted to keep a low-code approach to data wrangling, which had been implemented in GCP Cloud Dataprep.
From https://www.trifacta.com:
Trifacta is the only open and interactive cloud platform for data engineers and analysts to collaboratively profile, prepare, and pipeline data for analytics and machine learning.
Image is courtesy of https://www.trifacta.com
Trifacta lets you profile, prepare, and pipeline your data on any cloud data warehouse or cloud data lake / lakehouse.
Trifacta's success apparently comes from its “no-code” approach: it made the product successful enough that GCP integrated it into its cloud under the name “Cloud Dataprep”.
But it is not only GCP. As the diagram above shows, Trifacta connects to the major clouds and even to on-premises data.
Note: It is important to understand that Trifacta takes a batch approach; it is not a live streaming data pipeline.
Getting started
If you have used Cloud Dataprep in GCP, you might have noticed a popup on first use saying that Trifacta wants to access your account. So we start by navigating to https://www.trifacta.com
Click “Start Free”, fill in your information, and select AWS :)
You will land on exactly the same console view as in GCP, but without the side navigation panel. Make sure you confirm the email Trifacta sends as a follow-up to your trial sign-up.
There are many connectors available, and we will look at them later, but right out of the box you can see two: S3 and Redshift.
Configuring AWS
To configure access to your AWS account, click “Configure your AWS Account”.
Or, if you closed it, navigate to the Admin Console by clicking your user avatar at the bottom left.
You will see a new side menu:
Click Workspace settings first to enable S3 access.
Now click AWS settings and provide either an IAM role or an access key.
Note: Trifacta recommends using an IAM role.
The steps for configuring Trifacta with an IAM role are explained in detail if you select “Use a cross-account role (IAM role)” and click “Use our step by step configuration->”.
Here is part of the screen you see once you select “Step by step configuration”:
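To make the cross-account role idea more concrete, here is a minimal Python sketch of the two JSON documents such a role typically needs: a trust policy that lets another AWS account assume the role, and a permissions policy that grants access to one S3 bucket. This follows the general AWS cross-account pattern and is not copied from Trifacta's guide. The account ID, external ID, and bucket name are placeholders I made up for illustration, and the list of S3 actions is my assumption; take the real values and the exact permissions from the step-by-step screen.

```python
import json


def build_trust_policy(trusted_account_id: str, external_id: str) -> dict:
    """Trust policy that lets another AWS account assume this role.

    The external ID condition is the usual AWS safeguard for third-party access.
    Both arguments are placeholders here; use the values shown in the console.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{trusted_account_id}:root"},
                "Action": "sts:AssumeRole",
                "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
            }
        ],
    }


def build_s3_access_policy(bucket_name: str) -> dict:
    """Permissions policy for reading and writing a single bucket.

    The action list is an assumption for illustration, not Trifacta's official list.
    """
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Bucket-level actions apply to the bucket ARN itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
                "Resource": bucket_arn,
            },
            {
                # Object-level actions apply to the keys inside the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": f"{bucket_arn}/*",
            },
        ],
    }


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    print(json.dumps(build_trust_policy("111122223333", "example-external-id"), indent=2))
    print(json.dumps(build_s3_access_policy("my-wrangling-bucket"), indent=2))
```

You could paste the printed JSON into the IAM console when creating the role, and then give the resulting role ARN to Trifacta on the AWS settings page.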
Summary:
- Trifacta is a no-code approach to data engineering.
- It works in batch mode.
- One use case is migrating between clouds while reusing existing data wrangling recipes.
Stay tuned for Part 2.