Data Preprocessing

Overview

To build recommendation models on a solid foundation, we designed a repeatable pipeline that ingests, cleans, and inspects our raw data before any modeling begins. This ensures consistent, high-quality data for all recommendation algorithms.

Preprocessing Pipeline

All transformation logic is encapsulated in data/userspipeline.py, which runs these deterministic steps:

1. Data Loading

The pipeline begins by loading the raw data files.
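
A minimal sketch of this step, assuming pandas and placeholder file paths; the actual file names live in data/userspipeline.py and are not reproduced here:

```python
import pandas as pd

# Placeholder paths; the real raw-file locations are defined in data/userspipeline.py.
RAW_REVIEWS_PATH = "data/raw/reviews.csv"
RAW_PLACES_PATH = "data/raw/places.csv"

def load_raw_data():
    """Load the raw review and place tables into DataFrames."""
    reviews = pd.read_csv(RAW_REVIEWS_PATH)
    places = pd.read_csv(RAW_PLACES_PATH)
    return reviews, places
```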

2. Join & Filter

3. Primary Category Extraction
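
One plausible reading of this step, sketched below: each place carries several category labels and the first is kept as the primary one. The column name and delimiter are assumptions:

```python
import pandas as pd

def extract_primary_category(joined: pd.DataFrame) -> pd.DataFrame:
    """Reduce a multi-valued category field to a single primary category.

    Assumes a 'category' column holding values like "restaurant|cafe|bar".
    """
    joined = joined.copy()
    joined["primary_category"] = (
        joined["category"].astype(str).str.split("|").str[0].str.strip()
    )
    return joined
```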

4. Intermediate Persistence
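
A minimal sketch of this step, assuming the joined, category-tagged table is checkpointed to disk before aggregation; the output path is a placeholder:

```python
from pathlib import Path

import pandas as pd

def persist_intermediate(df: pd.DataFrame,
                         path: str = "data/interim/joined_reviews.csv") -> None:
    """Write the intermediate table so later steps and ad-hoc inspection
    can start from a stable checkpoint."""
    Path(path).parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(path, index=False)
```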

5. Per-User, Per-Category Aggregation
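
A sketch of the aggregation, reusing the column names assumed in the earlier sketches; the aggregated statistics shown (review count, mean rating) are illustrative:

```python
import pandas as pd

def aggregate_user_categories(df: pd.DataFrame) -> pd.DataFrame:
    """Collapse individual reviews into one row per (user, primary category)."""
    return (
        df.groupby(["user_id", "primary_category"])
          .agg(n_reviews=("rating", "size"), mean_rating=("rating", "mean"))
          .reset_index()
    )
```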

6. Active-Category Filtering
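
A sketch of this filter under the assumption that an "active" category is one with enough distinct users to support modeling; the threshold value is a placeholder:

```python
import pandas as pd

def filter_active_categories(user_cat: pd.DataFrame, min_users: int = 50) -> pd.DataFrame:
    """Keep only categories with at least `min_users` distinct users."""
    users_per_cat = user_cat.groupby("primary_category")["user_id"].nunique()
    active = users_per_cat[users_per_cat >= min_users].index
    return user_cat[user_cat["primary_category"].isin(active)]
```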

7. Category Normalization
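
This sketch reads "normalization" as scaling each user's category counts into proportions that sum to one; that interpretation, and the column names carried over from the sketches above, are assumptions:

```python
import pandas as pd

def normalize_categories(user_cat: pd.DataFrame) -> pd.DataFrame:
    """Turn each user's per-category review counts into per-user proportions."""
    user_cat = user_cat.copy()
    totals = user_cat.groupby("user_id")["n_reviews"].transform("sum")
    user_cat["category_share"] = user_cat["n_reviews"] / totals
    return user_cat
```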

Madrid Data Processing

For Madrid-specific data:

  1. Collection: We gathered place data for Madrid using the Google Places API
  2. Cleaning: Removed duplicates and places with missing critical information
  3. Categorization: Mapped Google place types to our normalized category system
  4. Embedding: Generated text embeddings for places to support our Madrid Embedding Recommender
  5. Storage: Saved the processed data to combined_places.csv and the embeddings to madrid_place_embeddings.npz
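
A hedged sketch of the embedding and storage steps above; only the two output file names come from the list, while the embedding model, text column, and id column are assumptions:

```python
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer  # model choice is an assumption

def embed_and_save(places: pd.DataFrame,
                   out_csv: str = "combined_places.csv",
                   out_npz: str = "madrid_place_embeddings.npz") -> None:
    """Embed a text description of each Madrid place and persist both artifacts."""
    model = SentenceTransformer("all-MiniLM-L6-v2")
    texts = places["description"].fillna("").tolist()
    embeddings = model.encode(texts, show_progress_bar=False)

    places.to_csv(out_csv, index=False)
    np.savez(out_npz,
             place_ids=places["place_id"].to_numpy(),
             embeddings=np.asarray(embeddings))
```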

Data Validation

Throughout the pipeline, we implement validation checks.
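
A few illustrative checks of the kind this implies, sketched with plain assertions; the specific conditions are assumptions rather than the pipeline's actual rules:

```python
import pandas as pd

def validate(df: pd.DataFrame) -> None:
    """Run basic sanity checks before the data is handed to model training."""
    assert not df.empty, "pipeline produced an empty table"
    assert df["user_id"].notna().all(), "missing user ids"
    assert df["primary_category"].notna().all(), "missing categories"
    assert not df.duplicated(subset=["user_id", "primary_category"]).any(), \
        "duplicate (user, category) rows"
```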

By automating these steps in a single, transparent pipeline, we guarantee that every dataset entering model training has been consistently cleaned and comprehensively characterized—laying a solid foundation for all recommendation algorithms.