Uber Data Engineering ETL Project

🚕 Uber Data Analytics Pipeline

End-to-End Data Engineering on Google Cloud Platform

ETL Orchestration | BigQuery Analytics | Real-Time Reporting

Project Overview

Built an end-to-end data engineering pipeline to analyze Uber-like trip data on Google Cloud Platform. The workflow uses Mage.ai to orchestrate ETL in Python/SQL, stages raw files in Cloud Storage, runs pipeline compute on a GCP Compute Instance, loads curated datasets into BigQuery, and powers analytics-ready reporting in Looker Studio.

The project models the dataset for efficient analytics (fact/dimension style), includes reusable SQL queries for insights, and is based on the NYC TLC trip record dataset (yellow/green taxi trips with pickup/dropoff, fares, distances, payment type, passenger count, etc.).

📊 Dataset: NYC Taxi & Limousine Commission (TLC) Trip Records — millions of taxi trips with rich attributes including temporal, spatial, and transactional data.

Pipeline Architecture

📁 Raw Data (NYC TLC CSV) → ☁️ Cloud Storage (Data Lake) → ⚙️ Mage.ai ETL (Orchestration) → 🗄️ BigQuery (Data Warehouse) → 📊 Looker Studio (Visualization)

🖥️ Infrastructure

Compute Engine Instance: Hosts Mage.ai orchestration server, executes Python/SQL transformations, and manages pipeline scheduling.

Key Features & Capabilities

🔄 Automated ETL Pipeline

Mage.ai orchestrates Extract-Transform-Load workflows with scheduled runs, error handling, and data validation checkpoints.

🏗️ Dimensional Data Modeling

Star schema design with fact tables (trips) and dimension tables (datetime, location, payment, rate) for optimized analytics queries.

☁️ Scalable Cloud Storage

GCP Cloud Storage acts as a data lake for raw CSV files, with versioning, lifecycle policies, and cost-effective archival.

🗄️ BigQuery Analytics

Serverless data warehouse enabling SQL queries over millions of rows in seconds, with petabyte-scale capacity.

📈 Interactive Dashboards

Looker Studio dashboards with real-time KPIs, trend charts, geographic heatmaps, and drill-down capabilities for stakeholders.

🔍 Reusable SQL Queries

Pre-built SQL templates for common business questions: revenue analysis, peak hours, popular routes, payment trends, and more.

Data Model Architecture

📊 Fact Table: Trip Records

Core Metrics:

  • Trip distance
  • Fare amount
  • Tip amount
  • Tolls amount
  • Total amount
  • Passenger count
  • Trip duration
  • Extra charges

🗂️ Dimension Tables

📅 DateTime

Pickup/dropoff times, hour, day, month, year

📍 Location

Pickup/dropoff zones, boroughs, coordinates

💳 Payment Type

Cash, credit card, no charge, dispute

💰 Rate Code

Standard, JFK, Newark, Nassau, negotiated
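
As a sketch of how one of these dimensions could be derived, the snippet below builds a datetime-dimension row from a raw pickup timestamp. The function name, timestamp format, and attribute set are illustrative assumptions, not the pipeline's actual code:

```python
from datetime import datetime

def build_datetime_dim(pickup_ts: str) -> dict:
    """Derive datetime-dimension attributes from a raw pickup timestamp string.

    Assumes timestamps arrive as "YYYY-MM-DD HH:MM:SS", the common TLC format.
    """
    ts = datetime.strptime(pickup_ts, "%Y-%m-%d %H:%M:%S")
    return {
        "datetime": ts,
        "hour": ts.hour,               # 0-23, useful for peak-hour analysis
        "day": ts.day,
        "month": ts.month,
        "year": ts.year,
        "weekday": ts.strftime("%A"),  # e.g. "Monday", for weekday/weekend splits
    }

row = build_datetime_dim("2023-03-15 17:45:00")
print(row["hour"], row["weekday"])  # 17 Wednesday
```

The same pattern applies to the location, payment, and rate-code dimensions: extract the attribute from the raw record once, store it in a small dimension table, and reference it from the fact table by key.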

✅ Benefits of Star Schema

  • Query Performance: Optimized joins reduce query execution time
  • Business Clarity: Intuitive structure for analysts and stakeholders
  • Aggregation Speed: Fast rollups across time, location, and payment dimensions
  • Scalability: Easily add new dimensions without restructuring fact table

ETL Pipeline Workflow

1️⃣ Extract

  • Download NYC TLC trip data (Parquet/CSV format)
  • Upload raw files to GCP Cloud Storage bucket
  • Trigger Mage.ai pipeline via Cloud Storage event or schedule

2️⃣ Transform

  • Data Cleaning: Remove nulls, duplicates, outliers (negative fares, impossible distances)
  • Data Enrichment: Calculate trip duration, speed, revenue per mile
  • Dimensional Splits: Extract datetime, location, payment dimensions from fact table
  • Type Casting: Convert timestamps, floats, integers to proper BigQuery types
  • Validation: Assert data quality rules (fare > 0, passenger count < 10, etc.)
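
The cleaning, enrichment, and validation steps above can be sketched in plain Python. Field names, the duplicate key, and the thresholds here are illustrative assumptions, not the pipeline's actual rules:

```python
def clean_trips(raw_trips: list[dict]) -> list[dict]:
    """Drop invalid/duplicate records and enrich valid ones.

    Timestamps are assumed to be epoch seconds for simplicity.
    """
    cleaned, seen = [], set()
    for trip in raw_trips:
        key = (trip["pickup_ts"], trip["dropoff_ts"], trip["fare_amount"])
        if key in seen:                              # duplicate removal
            continue
        if trip["fare_amount"] <= 0:                 # outlier: negative/zero fares
            continue
        if not (0 < trip["passenger_count"] < 10):   # validation rule
            continue
        duration_h = (trip["dropoff_ts"] - trip["pickup_ts"]) / 3600
        if duration_h <= 0:                          # impossible trip duration
            continue
        seen.add(key)
        trip["duration_hours"] = duration_h
        trip["avg_speed_mph"] = trip["trip_distance"] / duration_h  # enrichment
        cleaned.append(trip)
    return cleaned
```

In the real pipeline these rules run as Mage.ai transformer blocks, but the logic is the same: filter, then derive, then assert.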

3️⃣ Load

  • Load fact table and dimension tables into BigQuery datasets
  • Create partitioned tables (by pickup date) for query performance
  • Build materialized views for common aggregations
  • Schedule incremental loads for new data (append vs. overwrite strategy)
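
The append-style incremental load above can be sketched as a merge keyed on a trip ID (the key column and row shapes are assumptions for illustration; in the actual pipeline this is a BigQuery load job with append disposition):

```python
def incremental_append(existing: list[dict], incoming: list[dict],
                       key: str = "trip_id") -> list[dict]:
    """Append only unseen rows, so daily reruns never duplicate trips."""
    seen = {row[key] for row in existing}
    new_rows = [row for row in incoming if row[key] not in seen]
    return existing + new_rows

warehouse = [{"trip_id": 1}, {"trip_id": 2}]
daily_batch = [{"trip_id": 2}, {"trip_id": 3}]  # trip 2 already loaded
warehouse = incremental_append(warehouse, daily_batch)
print([r["trip_id"] for r in warehouse])  # [1, 2, 3]
```

The append strategy keeps load jobs cheap for daily deltas, while overwrite is simpler when a whole partition must be restated.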

Business Insights & Analytics

💰 Revenue Analysis

  • Total revenue by time period (daily, weekly, monthly)
  • Average fare by pickup location and hour
  • Tip percentage trends by payment type
  • Revenue per mile by rate code

🕐 Demand Patterns

  • Peak hours for trip requests
  • Weekday vs. weekend trip volume
  • Seasonal demand fluctuations
  • Hourly utilization rates

🗺️ Geographic Insights

  • Most popular pickup/dropoff locations
  • Average trip distance by borough
  • Heatmap of high-demand zones
  • Cross-borough travel patterns

👥 Passenger Behavior

  • Average passenger count per trip
  • Solo vs. group travel trends
  • Payment method preferences
  • Tipping behavior by passenger count

Technology Stack

🐍 Python
🗄️ SQL
🪄 Mage.ai
📊 BigQuery
☁️ Cloud Storage
🖥️ Compute Engine
📈 Looker Studio
🔧 GCP Ecosystem

Key Achievements & Outcomes

10M+
Trip Records Processed

<3s
Query Response Time

100%
Pipeline Automation

24/7
Real-Time Analytics

📋 Technical Highlights:

  • Star schema data modeling for optimal query performance
  • Partitioned BigQuery tables reducing scan costs by 90%+
  • Automated data quality validation in ETL pipeline
  • Version-controlled SQL queries and Python scripts in GitHub
  • Incremental load strategy for daily data updates
  • Cost-optimized cloud resource allocation

Sample Analytics Queries

-- Top 10 most profitable pickup locations
SELECT f.pickup_location_id, l.location_name,
  SUM(f.total_amount) AS total_revenue,
  COUNT(*) AS trip_count
FROM fact_trips AS f
JOIN dim_location AS l ON f.pickup_location_id = l.location_id
GROUP BY f.pickup_location_id, l.location_name
ORDER BY total_revenue DESC
LIMIT 10;

-- Average fare and tip by payment type
SELECT p.payment_type_name,
  AVG(f.fare_amount) AS avg_fare,
  AVG(f.tip_amount) AS avg_tip,
  AVG(SAFE_DIVIDE(f.tip_amount, f.fare_amount) * 100) AS avg_tip_percentage
FROM fact_trips AS f
JOIN dim_payment_type AS p ON f.payment_type_id = p.payment_type_id
GROUP BY p.payment_type_name;

📂 View Project on GitHub

Complete source code, SQL queries, Mage.ai pipelines, and documentation are available on GitHub.

View on GitHub →

Need a Data Engineering Solution?

I build scalable ETL pipelines, data warehouses, and analytics platforms on GCP, AWS, and Azure.

Let’s Build Your Data Pipeline
