Transform to Standardize Zone

Planning the Data Transformation Steps

It’s a good practice to plan the steps to transform your data. Based on the information we captured during data exploration stage, we can come up with the following transformation step:

  1. Read raw data from S3.
  2. Perform data transformation: Set appropriate data types.
  3. Save processed dataset to S3 in a query optimized format.
  4. Run crawlers to create tables
  5. Query transformed data in Athena

Create Raw to Standardize Glue job

  1. Go to Glue console.

  2. In the left navigation panel, click ETL jobs.

  3. On the AWS Glue Studio page, click Visual ETL.

    Image

taxi_zone_lookup

  1. Adding Yellow Trips data from Amazon S3

    • Click on the Source icon, choose S3.
    • In the Data source – S3 bucket node, to specify the following information:
      • S3 URL: S3://{RAW_BUCKET}/nyc-taxi/taxi_zone_lookup/

    Image

  2. Modify data types

    • Click on the Transform icon, choose Change Schema.
    • Change data type

    Image

  3. Save transformed data to Amazon S3

    • Click on the Target icon, choose Amazon S3.
    • Specify the following information
      • Format – Parquet
      • Compression Type - Snappy
      • S3 Target Location S3://{Standardize_BUCKET}/taxi_zone_lookup/

    Image

  4. Set job detail

    • Specify Iam role

      Image

  5. Run job

    • Click Run.

      Image

  6. Check output

    • Go to standardize bucket in S3 console.

      Image

yellow_tripdata

  1. Adding Yellow Trips data from Amazon S3

    • Click on the Source icon, choose S3.
    • In the Data source – S3 bucket node, to specify the following information:
      • S3 URL: S3://{RAW_BUCKET}/nyc-taxi/yellow_tripdata/

    Image

  2. Modify data types

    • Click on the Transform icon, choose Change Schema.
    • Change data type

    Image

  3. Save transformed data to Amazon S3

    • Click on the Target icon, choose Amazon S3.
    • Specify the following information
      • Format – Parquet
      • Compression Type - Snappy
      • S3 Target Location S3://{Standardize_BUCKET}/yellow_tripdata/

    Image

  4. Set job detail

    • Specify Iam role

      Image

  5. Run job

    • Click Run.

      Image

  6. Check output

    • Go to standardize bucket in S3 console.

      Image

Run Crawlers to create Tables

  1. Go to the AWS Glue Console.

  2. In the left navigation menu, click Crawlers.

  3. On the Crawlers page, select your crawler, and then click Run crawler.

    Image

  4. In the left navigation menu, click Tables.

    Image

  5. On the Tables page, click on table name to review the table metadata and schema information.

    Image

    Image

    Image

Query transformed data in Athena

Using Amazon Athena for the first time

Amazon Athena automatically stores query results and metadata information for each query that runs in a query result location that you can specify in Amazon S3. If necessary, you can access the files in this location to work with them. You can also download query result files directly from the Athena console.

  1. Go to Athena console. Click Get Started

    Image

  2. Choose Edit Settings, click on Browse S3 and select bucket as the value for the Location of query result - optional field.

    Image

    Image

  3. Go to the top menu, click on Editor to return back to the Query editor page.

Query Standardize data

  1. Choose database

  2. Choose table

  3. Choose preview table

    Image

    Image