🚕 Uber Data Analytics Pipeline
End-to-End Data Engineering on Google Cloud Platform
ETL Orchestration | BigQuery Analytics | Real-Time Reporting
Project Overview
Built an end-to-end data engineering pipeline to analyze Uber-like trip data on Google Cloud Platform. The workflow uses Mage.ai to orchestrate ETL in Python/SQL, stages raw files in Cloud Storage, runs pipeline compute on a Compute Engine instance, loads curated datasets into BigQuery, and powers analytics-ready reporting in Looker Studio.
The project models the dataset for efficient analytics (fact/dimension style), includes reusable SQL queries for insights, and is based on the NYC TLC trip record dataset (yellow/green taxi trips with pickup/dropoff, fares, distances, payment type, passenger count, etc.).
📊 Dataset: NYC Taxi & Limousine Commission (TLC) Trip Records — millions of taxi trips with rich attributes including temporal, spatial, and transactional data.
Pipeline Architecture
🖥️ Infrastructure
Compute Engine Instance: Hosts Mage.ai orchestration server, executes Python/SQL transformations, and manages pipeline scheduling.
Key Features & Capabilities
Automated ETL Pipeline
Mage.ai orchestrates Extract-Transform-Load workflows with scheduled runs, error handling, and data validation checkpoints.
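A transformation step in this kind of pipeline can be sketched as a Mage-style transformer block. This is a minimal illustration, not the project's actual code: the `try/except` fallback is only there so the sketch runs outside a Mage environment, and the field names are assumptions.

```python
# Sketch of a Mage.ai-style transformer block. The fallback no-op
# decorator lets this run locally when Mage is not installed.
try:
    from mage_ai.data_preparation.decorators import transformer
except ImportError:
    def transformer(fn):  # stand-in decorator outside Mage
        return fn

@transformer
def clean_trips(rows, *args, **kwargs):
    """Drop records that fail basic data-quality checks
    (illustrative rules: positive fare, plausible passenger count)."""
    return [
        r for r in rows
        if r.get("fare_amount", 0) > 0
        and 0 < r.get("passenger_count", 0) < 10
    ]
```

Inside Mage, a block like this slots between a data loader and a data exporter, and the scheduler handles retries and run history.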
Dimensional Data Modeling
Star schema design with fact tables (trips) and dimension tables (datetime, location, payment, rate) for optimized analytics queries.
Scalable Cloud Storage
GCP Cloud Storage acts as data lake for raw CSV files with versioning, lifecycle policies, and cost-effective archival.
BigQuery Analytics
Serverless data warehouse enabling SQL queries on millions of rows with interactive response times and petabyte-scale capacity.
Interactive Dashboards
Looker Studio dashboards with real-time KPIs, trend charts, geographic heatmaps, and drill-down capabilities for stakeholders.
Reusable SQL Queries
Pre-built SQL templates for common business questions: revenue analysis, peak hours, popular routes, payment trends, and more.
Data Model Architecture
📊 Fact Table: Trip Records
Core Metrics:
- Trip distance
- Fare amount
- Tip amount
- Tolls amount
- Total amount
- Passenger count
- Trip duration
- Extra charges
🗂️ Dimension Tables
📅 DateTime
Pickup/dropoff times, hour, day, month, year
📍 Location
Pickup/dropoff zones, boroughs, coordinates
💳 Payment Type
Cash, credit card, no charge, dispute
💰 Rate Code
Standard, JFK, Newark, Nassau, negotiated
✅ Benefits of Star Schema
- Query Performance: Optimized joins reduce query execution time
- Business Clarity: Intuitive structure for analysts and stakeholders
- Aggregation Speed: Fast rollups across time, location, and payment dimensions
- Scalability: Easily add new dimensions without restructuring fact table
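The dimensional split behind the star schema can be sketched in a few lines: distinct attribute values get surrogate keys, and the fact table keeps only the key. Column names below are illustrative, not the pipeline's exact schema; the datetime, location, and rate dimensions are built the same way.

```python
def build_payment_dim(trips):
    """Assign a surrogate key to each distinct payment type
    (hypothetical column names for illustration)."""
    dim = {}
    for t in trips:
        dim.setdefault(t["payment_type"], len(dim) + 1)
    return dim

def split_fact(trips, payment_dim):
    """Replace the raw payment string in each trip record
    with its surrogate key, producing fact-table rows."""
    return [
        {**{k: v for k, v in t.items() if k != "payment_type"},
         "payment_type_id": payment_dim[t["payment_type"]]}
        for t in trips
    ]
```

Keeping low-cardinality attributes in small dimension tables is what makes the star-schema joins cheap at query time.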
ETL Pipeline Workflow
1️⃣ Extract
- Download NYC TLC trip data (Parquet/CSV format)
- Upload raw files to GCP Cloud Storage bucket
- Trigger Mage.ai pipeline via Cloud Storage event or schedule
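The monthly TLC files follow a predictable naming scheme, so the extract step can build download URLs programmatically. The CloudFront host and filename pattern below are an assumption based on the TLC's current public hosting; verify against the nyc.gov trip-record page before relying on it.

```python
def tlc_trip_url(service: str, year: int, month: int) -> str:
    """Build the monthly TLC trip-record Parquet URL.
    Assumption: the CloudFront URL pattern the TLC currently
    publishes (e.g. service = 'yellow' or 'green')."""
    return (
        "https://d37ci6vzurychx.cloudfront.net/trip-data/"
        f"{service}_tripdata_{year:04d}-{month:02d}.parquet"
    )
```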
2️⃣ Transform
- Data Cleaning: Remove nulls, duplicates, outliers (negative fares, impossible distances)
- Data Enrichment: Calculate trip duration, speed, revenue per mile
- Dimensional Splits: Extract datetime, location, payment dimensions from fact table
- Type Casting: Convert timestamps, floats, integers to proper BigQuery types
- Validation: Assert data quality rules (fare > 0, passenger count < 10, etc.)
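The enrichment step above (duration, speed, revenue per mile) can be sketched as a pure function. Field names are illustrative, not the exact pipeline schema, and the guards mirror the validation rules listed above.

```python
from datetime import datetime

def enrich_trip(trip: dict) -> dict:
    """Add derived metrics used downstream. Guards avoid division
    by zero for degenerate records that slip past cleaning."""
    pickup = datetime.fromisoformat(trip["pickup_datetime"])
    dropoff = datetime.fromisoformat(trip["dropoff_datetime"])
    duration_min = (dropoff - pickup).total_seconds() / 60
    distance = trip["trip_distance"]
    return {
        **trip,
        "duration_min": duration_min,
        "avg_speed_mph": distance / (duration_min / 60) if duration_min > 0 else 0.0,
        "revenue_per_mile": trip["total_amount"] / distance if distance > 0 else 0.0,
    }
```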
3️⃣ Load
- Load fact table and dimension tables into BigQuery datasets
- Create partitioned tables (by pickup date) for query performance
- Build materialized views for common aggregations
- Schedule incremental loads for new data (append vs. overwrite strategy)
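The append-vs-overwrite decision in the incremental load reduces to filtering on a watermark: only rows newer than the latest date already in the warehouse are appended. A minimal sketch, assuming a `pickup_date` field; the real load then goes through a BigQuery load job against the date-partitioned tables.

```python
from datetime import date

def rows_to_append(new_rows, last_loaded: date):
    """Incremental-load filter: keep only rows past the warehouse
    watermark so the daily job appends instead of reloading history."""
    return [r for r in new_rows if r["pickup_date"] > last_loaded]
```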
Business Insights & Analytics
💰 Revenue Analysis
- Total revenue by time period (daily, weekly, monthly)
- Average fare by pickup location and hour
- Tip percentage trends by payment type
- Revenue per mile by rate code
🕐 Demand Patterns
- Peak hours for trip requests
- Weekday vs. weekend trip volume
- Seasonal demand fluctuations
- Hourly utilization rates
🗺️ Geographic Insights
- Most popular pickup/dropoff locations
- Average trip distance by borough
- Heatmap of high-demand zones
- Cross-borough travel patterns
👥 Passenger Behavior
- Average passenger count per trip
- Solo vs. group travel trends
- Payment method preferences
- Tipping behavior by passenger count
Key Achievements & Outcomes
📋 Technical Highlights:
- Star schema data modeling for optimal query performance
- Partitioned BigQuery tables reducing scan costs by 90%+
- Automated data quality validation in ETL pipeline
- Version-controlled SQL queries and Python scripts in GitHub
- Incremental load strategy for daily data updates
- Cost-optimized cloud resource allocation
Sample Analytics Queries
-- Top 10 pickup locations by revenue
SELECT
  pickup_location_id,
  location_name,
  SUM(total_amount) AS total_revenue,
  COUNT(*) AS trip_count
FROM fact_trips
JOIN dim_location
  ON fact_trips.pickup_location_id = dim_location.location_id
GROUP BY pickup_location_id, location_name
ORDER BY total_revenue DESC
LIMIT 10;
-- Fare and tipping behavior by payment type
SELECT
  payment_type_name,
  AVG(fare_amount) AS avg_fare,
  AVG(tip_amount) AS avg_tip,
  AVG(SAFE_DIVIDE(tip_amount, fare_amount) * 100) AS avg_tip_percentage
FROM fact_trips
JOIN dim_payment_type
  ON fact_trips.payment_type_id = dim_payment_type.payment_type_id
GROUP BY payment_type_name;
View Project on GitHub
Complete source code, SQL queries, Mage.ai pipelines, and documentation available on GitHub.
Need a Data Engineering Solution?
I build scalable ETL pipelines, data warehouses, and analytics platforms on GCP, AWS, and Azure.
