This project, developed in a learning context, is a back-end platform that calculates taxi fares in real time based on the selected comfort level. It uses Kafka for streaming data, ElasticSearch for indexing and monitoring, and Google BigQuery for revenue analysis and clustering. The platform computes fares from the distance between drivers and customers and groups the results by comfort level and geographic cluster.
- Introduction
- Data Model
- TravelProcessor Development
- Data Transformation
- Indexing & Monitoring
- Data Warehouse and BigQuery
- Revenue Calculation by Cluster and Comfort Level
- Data Visualization
This project aims to provide an accurate fare calculation for taxi trips by estimating the distance between drivers and customers. Fare determination takes into account the selected level of comfort for each trip.
Incoming data is processed in real time via Kafka, using Python's KafkaProducer. This data includes customer requests, driver availability, and location information.
Note: Data, including location coordinates (longitude, latitude), is randomly generated, which may result in inconsistent map locations.
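As a rough illustration of the producer side, the sketch below generates random customer requests and sends them to Kafka as JSON. It assumes the kafka-python package; the topic name `travel-requests`, the broker address, the comfort labels, and the field names are hypothetical and not taken from the project.

```python
import json
import random

# Hypothetical comfort labels; the project's actual levels may differ.
COMFORT_LEVELS = ["low", "medium", "high"]


def make_request(rng: random.Random) -> dict:
    """Build one randomly generated customer request (illustrative fields)."""
    return {
        "customer_id": rng.randint(1, 10_000),
        "comfort": rng.choice(COMFORT_LEVELS),
        "lon": round(rng.uniform(-180.0, 180.0), 6),
        "lat": round(rng.uniform(-90.0, 90.0), 6),
    }


def serialize(event: dict) -> bytes:
    """Encode an event as UTF-8 JSON, the form sent to Kafka."""
    return json.dumps(event).encode("utf-8")


def send_requests(n: int, topic: str = "travel-requests",
                  servers: str = "localhost:9092") -> None:
    """Send n random requests; topic and broker address are assumptions."""
    from kafka import KafkaProducer  # kafka-python, imported lazily

    producer = KafkaProducer(bootstrap_servers=servers,
                             value_serializer=serialize)
    rng = random.Random()
    for _ in range(n):
        producer.send(topic, make_request(rng))
    producer.flush()
    producer.close()
```

Because the coordinates are drawn uniformly over the whole globe, many generated points fall in oceans or far apart, which is the inconsistency the note above refers to.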
The TravelProcessor module is designed and tested based on a predefined architecture. Unit tests ensure the reliability and accuracy of distance calculations and fare estimations.
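The distance and fare logic could look roughly like the sketch below. It is not the project's TravelProcessor code: it assumes the haversine great-circle formula for distance, and the base fare and per-kilometre rates per comfort level are hypothetical values chosen only for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0

# Hypothetical tariff: a flat base fare plus a per-km rate per comfort level.
BASE_FARE = 2.0
RATE_PER_KM = {"low": 1.0, "medium": 1.5, "high": 2.5}


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def fare(distance_km: float, comfort: str) -> float:
    """Fare for a trip, rounded to two decimals (assumed pricing rule)."""
    return round(BASE_FARE + RATE_PER_KM[comfort] * distance_km, 2)
```

A unit test in this style would check known cases, for example that the distance from a point to itself is zero and that two points on the equator 180 degrees apart are half the Earth's circumference apart.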
Custom transformations are applied to the incoming data, with examples of transformations defined in TravelProcessor using JoltTransformJSON.
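A shift-style Jolt operation moves values from input paths to output paths. The plain-Python sketch below mimics that idea to show what such a transformation does to a record; it is not Jolt itself and not the project's specification, and all field names in the mapping are hypothetical.

```python
def get_path(doc: dict, path: str):
    """Read a value at a dotted path such as 'client.position.lon'."""
    cur = doc
    for key in path.split("."):
        cur = cur[key]
    return cur


def set_path(doc: dict, path: str, value) -> None:
    """Write a value at a dotted path, creating nested dicts as needed."""
    keys = path.split(".")
    cur = doc
    for key in keys[:-1]:
        cur = cur.setdefault(key, {})
    cur[keys[-1]] = value


# Hypothetical mapping: input path -> output path.
SHIFT = {
    "client.id": "customer_id",
    "client.position.lon": "location.lon",
    "client.position.lat": "location.lat",
    "comfort": "comfort_level",
}


def shift(record: dict, mapping: dict) -> dict:
    """Copy each mapped input path to its output path; skip missing inputs."""
    out = {}
    for src, dst in mapping.items():
        try:
            value = get_path(record, src)
        except (KeyError, TypeError):
            continue  # input path absent in this record
        set_path(out, dst, value)
    return out
```

With this mapping, a nested input record is flattened and renamed into the shape expected downstream, and fields that are absent from a record are simply left out of the result.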
The platform uses ElasticSearch for:
- Indexing & Mapping: Data indexing and mapping are defined to optimize search and retrieval of fare and distance calculations.
- Performance Monitoring: Real-time monitoring of data streams and performance analysis.
- Data Visualization: Kibana provides a detailed and real-time visualization of processed data.
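An index mapping for the processed trips might resemble the sketch below. The field names and the index name `travels` are hypothetical; `geo_point`, `keyword`, `float`, and `date` are standard ElasticSearch field types. The `create_index` helper assumes the 8.x Python client, where `indices.create` accepts a `mappings` argument.

```python
# Hypothetical mapping for processed trips; field names are assumptions.
TRAVEL_MAPPING = {
    "mappings": {
        "properties": {
            "customer_location": {"type": "geo_point"},
            "driver_location": {"type": "geo_point"},
            "comfort": {"type": "keyword"},
            "distance_km": {"type": "float"},
            "fare": {"type": "float"},
            "timestamp": {"type": "date"},
        }
    }
}


def to_document(event: dict) -> dict:
    """Reshape a flat event so coordinates match the geo_point object form."""
    return {
        "customer_location": {"lat": event["customer_lat"],
                              "lon": event["customer_lon"]},
        "driver_location": {"lat": event["driver_lat"],
                            "lon": event["driver_lon"]},
        "comfort": event["comfort"],
        "distance_km": event["distance_km"],
        "fare": event["fare"],
        "timestamp": event["timestamp"],
    }


def create_index(name: str = "travels",
                 url: str = "http://localhost:9200") -> None:
    """Create the index if missing (assumes the elasticsearch 8.x client)."""
    from elasticsearch import Elasticsearch  # imported lazily

    es = Elasticsearch(url)
    if not es.indices.exists(index=name):
        es.indices.create(index=name, mappings=TRAVEL_MAPPING["mappings"])
```

Declaring locations as `geo_point` is what allows Kibana to plot them on a map, while `keyword` on the comfort level keeps it usable for exact-match filters and aggregations.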
All records (10,000+) are consolidated and stored, along with timestamps, in .parquet format in a data warehouse using Google BigQuery.
Revenue calculations are performed using a K-Means clustering model developed with BigQuery ML:
- Clustering: Eight clusters are identified based on geographical coordinates.
- Revenue Analysis: Revenue per cluster and comfort level is calculated to help understand profitability across different areas and service types.
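The two steps above can be sketched as BigQuery ML statements issued from Python. The dataset, table, model, and column names are hypothetical; `model_type = 'kmeans'`, `num_clusters`, and `ML.PREDICT` (which returns a `centroid_id` column for K-Means models) are standard BigQuery ML features. The small `revenue_by_cluster` helper performs the same aggregation locally on already-labelled rows, which can be handy for checking results.

```python
from collections import defaultdict

DATASET = "taxi"  # hypothetical dataset name

# Train a K-Means model with eight clusters on the trip coordinates.
CREATE_MODEL_SQL = f"""
CREATE OR REPLACE MODEL `{DATASET}.location_clusters`
OPTIONS (model_type = 'kmeans', num_clusters = 8) AS
SELECT longitude, latitude
FROM `{DATASET}.travels`
"""

# Assign each trip to a cluster, then sum fares per cluster and comfort level.
REVENUE_SQL = f"""
SELECT centroid_id AS cluster, comfort, ROUND(SUM(fare), 2) AS revenue
FROM ML.PREDICT(
  MODEL `{DATASET}.location_clusters`,
  (SELECT longitude, latitude, comfort, fare FROM `{DATASET}.travels`)
)
GROUP BY cluster, comfort
ORDER BY cluster, comfort
"""


def run_in_bigquery() -> list:
    """Run both statements (assumes google-cloud-bigquery and credentials)."""
    from google.cloud import bigquery  # imported lazily

    client = bigquery.Client()
    client.query(CREATE_MODEL_SQL).result()  # wait for training to finish
    return [dict(row.items()) for row in client.query(REVENUE_SQL).result()]


def revenue_by_cluster(rows: list) -> dict:
    """Local equivalent of the aggregation: total fare per (cluster, comfort)."""
    totals = defaultdict(float)
    for row in rows:
        totals[(row["cluster"], row["comfort"])] += row["fare"]
    return {key: round(value, 2) for key, value in totals.items()}
```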
The results from the clustering and revenue calculations are visualized in Looker Studio, providing an insightful view of the model outcomes and revenue distributions.