The goal of this project was to create a robust ETL pipeline for Green Taxi trip data.
The data needed to be extracted from multiple zipped CSV files hosted online, transformed to ensure data quality and consistency, and then loaded into two destinations: a PostgreSQL database for easy querying and analysis, and Google Cloud Storage in Parquet format for efficient storage and access.
Ensuring the pipeline runs reliably and automatically on a daily schedule was also a critical requirement.
Data Consistency: Ensuring the data across multiple files was consistent in terms of schema and data types.
Data Quality: Removing rows with invalid data (e.g., zero passenger count or trip distance) and validating the integrity of the data through assertions.
Efficient Storage: Storing the data in a format that supports efficient querying and reduced storage costs.
Data Extraction: Load Green Taxi trip data for October, November, and December 2020 from CSV files using Pandas (illustrative sketches of each step follow this list).
Data Transformation: Remove rows where 'passenger_count' or 'trip_distance' is zero, create an 'lpep_pickup_date' column, and rename columns to snake_case.
Validation: Ensure the 'vendor_id' column contains only valid values, 'passenger_count' is greater than 0, and 'trip_distance' is greater than 0.
Data Loading: Write the transformed data to a PostgreSQL table named 'green_taxi' in the 'mage' schema.
Efficient Storage: Save the data to Google Cloud Storage as Parquet files, partitioned by 'lpep_pickup_date'.
Scheduling: Schedule the pipeline to run daily at 5 AM UTC using Mage AI.
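The sketches below illustrate each step; URLs, bucket names, connection strings, and helper names are placeholders rather than the project's actual values. Extraction first: a minimal pandas loader that reads the three gzip-compressed monthly CSVs with a pinned schema so all files line up on the same data types.

```python
import pandas as pd

# Placeholder URL template; the actual online source used by the project differs.
BASE_URL = "https://example.com/green_tripdata_{month}.csv.gz"

# Pin dtypes up front so the three monthly files share one consistent schema.
taxi_dtypes = {
    "VendorID": "Int64",
    "passenger_count": "Int64",
    "trip_distance": "float64",
    "RatecodeID": "Int64",
    "store_and_fwd_flag": "str",
    "PULocationID": "Int64",
    "DOLocationID": "Int64",
    "payment_type": "Int64",
    "fare_amount": "float64",
    "tip_amount": "float64",
    "total_amount": "float64",
}
parse_dates = ["lpep_pickup_datetime", "lpep_dropoff_datetime"]


def load_data() -> pd.DataFrame:
    """Read the October-December 2020 files and concatenate them into one DataFrame."""
    frames = [
        pd.read_csv(
            BASE_URL.format(month=f"2020-{m:02d}"),
            compression="gzip",
            dtype=taxi_dtypes,
            parse_dates=parse_dates,
        )
        for m in (10, 11, 12)
    ]
    return pd.concat(frames, ignore_index=True)
```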
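Transformation, sketched under the same pandas assumption; the camel_to_snake helper is illustrative, not necessarily how the project implements the rename.

```python
import re

import pandas as pd


def camel_to_snake(name: str) -> str:
    """Convert e.g. 'VendorID' to 'vendor_id' and 'PULocationID' to 'pu_location_id'."""
    s1 = re.sub(r"(.)([A-Z][a-z]+)", r"\1_\2", name)
    return re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", s1).lower()


def transform(df: pd.DataFrame) -> pd.DataFrame:
    # Drop rows where either passenger_count or trip_distance is zero.
    df = df[(df["passenger_count"] > 0) & (df["trip_distance"] > 0)].copy()

    # Derive the partition column from the pickup timestamp.
    df["lpep_pickup_date"] = df["lpep_pickup_datetime"].dt.date

    # Rename Camel/Pascal-case columns to snake_case.
    df.columns = [camel_to_snake(col) for col in df.columns]
    return df
```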
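Validation can then be expressed as plain assertions over the transformed frame (in Mage these typically live in an @test-decorated function); treating 1 and 2 as the only valid vendor codes is an assumption based on the public Green Taxi data dictionary.

```python
import pandas as pd


def test_output(df: pd.DataFrame) -> None:
    # vendor_id should contain only the known vendor codes.
    assert df["vendor_id"].isin([1, 2]).all(), "vendor_id contains unexpected values"
    # No rows with zero passengers or zero distance should survive the transform.
    assert (df["passenger_count"] > 0).all(), "found rows with zero passenger_count"
    assert (df["trip_distance"] > 0).all(), "found rows with zero trip_distance"
```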
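For the PostgreSQL load, Mage would normally handle this through a data exporter block and its connection configuration; the plain pandas/SQLAlchemy sketch below shows the equivalent write, with a placeholder connection string.

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string; real credentials belong in configuration, not code.
engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/taxi")


def export_to_postgres(df: pd.DataFrame) -> None:
    # Write the cleaned data to mage.green_taxi, replacing the table on each run.
    df.to_sql(
        name="green_taxi",
        con=engine,
        schema="mage",       # the 'mage' schema must already exist in the database
        if_exists="replace",
        index=False,
        chunksize=10_000,
    )
```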
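Finally, the partitioned Parquet export to Google Cloud Storage, sketched with PyArrow; the bucket name and service-account key path are placeholders.

```python
import os

import pandas as pd
import pyarrow as pa
import pyarrow.fs
import pyarrow.parquet as pq

# Placeholder credentials and bucket; in practice these come from the project's GCP setup.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/service-account-key.json"
BUCKET = "my-green-taxi-bucket"


def export_to_gcs(df: pd.DataFrame) -> None:
    table = pa.Table.from_pandas(df)
    gcs = pyarrow.fs.GcsFileSystem()
    # Writes one directory per pickup date, e.g. .../lpep_pickup_date=2020-10-01/
    pq.write_to_dataset(
        table,
        root_path=f"{BUCKET}/green_taxi",
        partition_cols=["lpep_pickup_date"],
        filesystem=gcs,
    )
```

Partitioning by 'lpep_pickup_date' keeps each day's trips in its own directory, so downstream queries that filter on date only need to scan the relevant files.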