Transform to Standardize Zone

Planning the Data Transformation Steps

It’s good practice to plan the steps for transforming your data. Based on the information captured during the data exploration stage, we can define the following transformation steps:

Read raw data from S3.
Perform data transformation: Set appropriate data types.
Save processed dataset to S3 in a query optimized format.
Run crawlers to create tables.
Query transformed data in Athena.

Create Raw to Standardize Glue job

Go to the AWS Glue console.
In the left navigation panel, click ETL jobs.
On the AWS Glue Studio page, click Visual ETL.

taxi_zone_lookup

Adding Taxi Zone Lookup data from Amazon S3
- Click on the Source icon, choose S3.
- In the Data source – S3 bucket node, specify the following information:
  - S3 URL: s3://{RAW_BUCKET}/nyc-taxi/taxi_zone_lookup/
Modify data types
- Click on the Transform icon, choose Change Schema.
- Set the appropriate data type for each column.
Save transformed data to Amazon S3
- Click on the Target icon, choose Amazon S3.
- Specify the following information:
  - Format: Parquet
  - Compression Type: Snappy
  - S3 Target Location: s3://{Standardize_BUCKET}/taxi_zone_lookup/
Set job details
- Specify the IAM role.
Run job
- Click Run.
Check output
- Go to the Standardize bucket in the S3 console and confirm that the Parquet files were written.
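
The data type changes for taxi_zone_lookup can also be written down as mapping tuples. The sketch below uses the (source name, source type, target name, target type) shape that the ApplyMapping transform takes in Glue job scripts. The column names and target types are assumptions for illustration only, since this page does not list them; verify them against what you found during data exploration.

```python
from typing import Dict, List, Tuple

# One mapping entry: (source_name, source_type, target_name, target_type).
Mapping = Tuple[str, str, str, str]

# ASSUMPTION: hypothetical columns and target types for taxi_zone_lookup.
# Replace these with the columns and types found during data exploration.
TAXI_ZONE_LOOKUP_TYPES: Dict[str, str] = {
    "locationid": "int",
    "borough": "string",
    "zone": "string",
    "service_zone": "string",
}


def build_mappings(target_types: Dict[str, str], source_type: str = "string") -> List[Mapping]:
    """Build mapping tuples, assuming every raw column arrives as
    `source_type` and keeps its original name."""
    return [(name, source_type, name, target) for name, target in target_types.items()]


mappings = build_mappings(TAXI_ZONE_LOOKUP_TYPES)
for mapping in mappings:
    print(mapping)
```

Keeping the target types in one dictionary makes it easy to review the schema before running the job.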

yellow_tripdata

Adding Yellow Trips data from Amazon S3
- Click on the Source icon, choose S3.
- In the Data source – S3 bucket node, specify the following information:
  - S3 URL: s3://{RAW_BUCKET}/nyc-taxi/yellow_tripdata/
Modify data types
- Click on the Transform icon, choose Change Schema.
- Set the appropriate data type for each column.
Save transformed data to Amazon S3
- Click on the Target icon, choose Amazon S3.
- Specify the following information:
  - Format: Parquet
  - Compression Type: Snappy
  - S3 Target Location: s3://{Standardize_BUCKET}/yellow_tripdata/
Set job details
- Specify the IAM role.
Run job
- Click Run.
Check output
- Go to the Standardize bucket in the S3 console and confirm that the Parquet files were written.
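
The Run job and Check output steps can also be scripted. The following is a minimal sketch, assuming you pass in boto3 Glue and S3 clients. The job name, bucket, and prefix in the usage comment are placeholders, not values defined by this workshop.

```python
import time
from typing import Any, List

# States after which a Glue job run no longer changes.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR", "EXPIRED"}


def run_job_and_wait(glue: Any, job_name: str, poll_seconds: float = 30.0) -> str:
    """Start a Glue job run, poll until it reaches a terminal state,
    and return that state. `glue` is expected to be a boto3 Glue client."""
    run_id = glue.start_job_run(JobName=job_name)["JobRunId"]
    while True:
        run = glue.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]
        if run["JobRunState"] in TERMINAL_STATES:
            return run["JobRunState"]
        time.sleep(poll_seconds)


def list_parquet_keys(s3: Any, bucket: str, prefix: str) -> List[str]:
    """Return object keys under `prefix` that end in .parquet.
    Reads only the first page of results (up to 1,000 keys)."""
    response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
    return [
        obj["Key"]
        for obj in response.get("Contents", [])
        if obj["Key"].endswith(".parquet")
    ]


# Usage (requires boto3 and AWS credentials; all names are placeholders):
#   import boto3
#   state = run_job_and_wait(boto3.client("glue"), "my-raw-to-standardize-job")
#   keys = list_parquet_keys(boto3.client("s3"), "my-standardize-bucket", "yellow_tripdata/")
```

Passing the clients in as arguments keeps the functions easy to test with stand-in objects before pointing them at a real account.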

Run Crawlers to create Tables

Go to the AWS Glue Console.
In the left navigation menu, click Crawlers.
On the Crawlers page, select your crawler, and then click Run crawler.
In the left navigation menu, click Tables.
On the Tables page, click a table name to review the table metadata and schema information.
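
The crawler steps above can be sketched with boto3 as well. This example assumes a boto3 Glue client is passed in. The crawler and database names in the usage comment are placeholders for whatever your environment uses.

```python
import time
from typing import Any, Dict, List, Tuple


def run_crawler_and_wait(glue: Any, crawler_name: str, poll_seconds: float = 15.0) -> None:
    """Start a crawler and block until it returns to the READY state.
    `glue` is expected to be a boto3 Glue client."""
    glue.start_crawler(Name=crawler_name)
    while glue.get_crawler(Name=crawler_name)["Crawler"]["State"] != "READY":
        time.sleep(poll_seconds)


def table_schemas(glue: Any, database: str) -> Dict[str, List[Tuple[str, str]]]:
    """Map each table name in `database` to its (column name, column type)
    pairs. Reads only the first page of tables."""
    tables = glue.get_tables(DatabaseName=database)["TableList"]
    return {
        table["Name"]: [
            (column["Name"], column["Type"])
            for column in table["StorageDescriptor"]["Columns"]
        ]
        for table in tables
    }


# Usage (requires boto3 and AWS credentials; names are placeholders):
#   import boto3
#   glue = boto3.client("glue")
#   run_crawler_and_wait(glue, "my-standardize-crawler")
#   print(table_schemas(glue, "my_standardize_db"))
```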

Query transformed data in Athena

Using Amazon Athena for the first time

Amazon Athena automatically stores query results and metadata information for each query that runs in a query result location that you can specify in Amazon S3. If necessary, you can access the files in this location to work with them. You can also download query result files directly from the Athena console.

Go to the Athena console and click Get Started.
Choose Edit Settings, click Browse S3, and select a bucket as the value for the Location of query result - optional field.
In the top menu, click Editor to return to the Query editor page.
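
If you prefer to script the query result location, one option is to set it on an Athena workgroup. This is a sketch of an alternative approach, not a replay of what the console's Edit Settings button does. It assumes a boto3 Athena client and the default workgroup name `primary`; the bucket and prefix are placeholders.

```python
from typing import Any


def result_location(bucket: str, prefix: str = "") -> str:
    """Build an s3:// URL with a trailing slash for Athena query results."""
    cleaned = prefix.strip("/")
    return f"s3://{bucket}/{cleaned}/" if cleaned else f"s3://{bucket}/"


def set_query_result_location(athena: Any, output_location: str, workgroup: str = "primary") -> None:
    """Set the query result location on a workgroup.
    `athena` is expected to be a boto3 Athena client.
    ASSUMPTION: the workgroup is named "primary" unless overridden."""
    athena.update_work_group(
        WorkGroup=workgroup,
        ConfigurationUpdates={
            "ResultConfigurationUpdates": {"OutputLocation": output_location}
        },
    )


# Usage (requires boto3 and AWS credentials; names are placeholders):
#   import boto3
#   location = result_location("my-athena-results-bucket", "query-results")
#   set_query_result_location(boto3.client("athena"), location)
```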

Query Standardize data

Choose the database.
Choose the table.
Choose Preview Table.
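
Previewing a table amounts to running a small SELECT with a LIMIT. The sketch below does the same through boto3, assuming an Athena client is passed in. The database, table, and output location are supplied by the caller, and the values in the usage comment are placeholders.

```python
import time
from typing import Any, List


def preview_table(
    athena: Any,
    database: str,
    table: str,
    output_location: str,
    limit: int = 10,
    poll_seconds: float = 1.0,
) -> List[List[str]]:
    """Run SELECT * ... LIMIT n and return the rows as lists of strings.
    For SELECT queries the first returned row holds the column names.
    `athena` is expected to be a boto3 Athena client."""
    query = f'SELECT * FROM "{database}"."{table}" LIMIT {int(limit)}'
    query_id = athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )["QueryExecutionId"]

    # Poll until the query finishes one way or another.
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]
        if status["State"] in {"SUCCEEDED", "FAILED", "CANCELLED"}:
            break
        time.sleep(poll_seconds)

    if status["State"] != "SUCCEEDED":
        reason = status.get("StateChangeReason", "")
        raise RuntimeError(f"Query {status['State']}: {reason}")

    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    # NULL cells have no VarCharValue key, so fall back to an empty string.
    return [[cell.get("VarCharValue", "") for cell in row["Data"]] for row in rows]


# Usage (requires boto3 and AWS credentials; names are placeholders):
#   import boto3
#   rows = preview_table(boto3.client("athena"), "my_standardize_db",
#                        "yellow_tripdata", "s3://my-athena-results-bucket/")
```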