Data Preprocessing
Overview
To build recommendation models on a solid foundation, we designed a repeatable pipeline that ingests, cleans, and inspects our raw data before any modeling begins. This ensures consistent, high-quality data for all recommendation algorithms.
Preprocessing Pipeline
All transformation logic is encapsulated in `data/userspipeline.py`, which runs these deterministic steps:
1. Data Loading
The pipeline begins by loading the raw data files:
- User-place ratings from `rating-California.csv`
- Active user list from `filtered_users.csv`
- Place metadata from `meta-California.json`
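A minimal loading sketch in pandas; the read options are assumptions (e.g., the metadata file is read as newline-delimited JSON), so adjust them to the actual file layout:

```python
import pandas as pd

def load_raw(data_dir: str = "data") -> tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    """Load the three raw inputs from `data_dir`."""
    ratings = pd.read_csv(f"{data_dir}/rating-California.csv")
    active_users = pd.read_csv(f"{data_dir}/filtered_users.csv")
    # Google-style metadata dumps are often newline-delimited JSON;
    # drop `lines=True` if the file is a single JSON array instead.
    places = pd.read_json(f"{data_dir}/meta-California.json", lines=True)
    return ratings, active_users, places
```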
2. Join & Filter
- Inner-join raw ratings with the active-user list
- Discard interactions from less-active accounts (fewer than 20 ratings)
- Flatten the nested JSON metadata so that every attribute becomes a regular DataFrame column
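Sketched with pandas, assuming a shared `user_id` key and a `rating` column; the function names and threshold parameter are ours:

```python
import pandas as pd

def join_and_filter(ratings: pd.DataFrame, active_users: pd.DataFrame,
                    min_ratings: int = 20) -> pd.DataFrame:
    """Inner-join on the active-user list, then drop low-activity accounts."""
    joined = ratings.merge(active_users[["user_id"]], on="user_id", how="inner")
    # Accounts with fewer than `min_ratings` interactions are discarded.
    sizes = joined.groupby("user_id")["user_id"].transform("size")
    return joined[sizes >= min_ratings]

def flatten_metadata(place_records: list[dict]) -> pd.DataFrame:
    """Flatten nested JSON metadata so every attribute becomes a column."""
    return pd.json_normalize(place_records)
```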
3. Primary Category Extraction
- Parse each place's `types` list (e.g., `["art_gallery", "museum", "tourist_attraction"]`)
- Extract the first element into a new column, `category1`
- This primary tag drives our downstream grouping and preference modeling
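Assuming `types` arrives as a Python list per row (as it does after `json_normalize`, which leaves list values intact), the extraction is a one-liner:

```python
import pandas as pd

def extract_primary_category(places: pd.DataFrame) -> pd.DataFrame:
    """Copy the first entry of each place's `types` list into `category1`."""
    places = places.copy()
    places["category1"] = places["types"].str[0]  # .str[0] indexes into list cells
    return places
```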
4. Intermediate Persistence
- Save the merged DataFrame (columns: `user_id`, `place_id`, `rating`, `category1`) as `users_ratings_categories.csv`
- This intermediate file provides transparency and enables reuse in different model pipelines
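A sketch of the write-out, keeping only the four modeling columns listed above (the function name is ours):

```python
import pandas as pd

def persist_intermediate(merged: pd.DataFrame,
                         path: str = "users_ratings_categories.csv") -> None:
    """Write the four modeling columns to the intermediate CSV."""
    merged[["user_id", "place_id", "rating", "category1"]].to_csv(path, index=False)
```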
5. Per-User, Per-Category Aggregation
- Group by (`user_id`, `category1`) to compute each user's average rating per category
- Export these summaries to `average_user_ratings_per_category.csv`
- Save count information to `user_counts_per_category.csv`
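One way to express this step with a single `groupby`; the output column names (`avg_rating`, `n_ratings`) are our choice, not confirmed by the pipeline:

```python
import pandas as pd

def aggregate_per_user_category(merged: pd.DataFrame) -> None:
    """Average rating and rating count per (user_id, category1) pair."""
    grouped = merged.groupby(["user_id", "category1"])["rating"]
    grouped.mean().reset_index(name="avg_rating").to_csv(
        "average_user_ratings_per_category.csv", index=False)
    grouped.size().reset_index(name="n_ratings").to_csv(
        "user_counts_per_category.csv", index=False)
```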
6. Active-Category Filtering
- Identify users who have rated more than 20 distinct primary categories
- Write their IDs to `filtered_users_over_20_categories.csv`
- This ensures we model users with diverse experiences across category types
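A sketch of the diversity filter; the strict `>` comparison follows the "more than 20" wording above:

```python
import pandas as pd

def filter_diverse_users(merged: pd.DataFrame, min_categories: int = 20) -> pd.DataFrame:
    """Keep users who rated more than `min_categories` distinct primary categories."""
    n_cats = merged.groupby("user_id")["category1"].nunique()
    keep = n_cats[n_cats > min_categories].index
    keep.to_series(name="user_id").to_csv(
        "filtered_users_over_20_categories.csv", index=False)
    return merged[merged["user_id"].isin(keep)]
```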
7. Category Normalization
- Apply the lookup in `refined_category_mapping_from_csv.py` to map raw codes to human-friendly buckets
- Example: "art_gallery" → "Art & Culture"
- Example: "restaurant" → "Food & Drink"
- Save the final cleaned dataset as `final_users_over_20_categories.csv`
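The real lookup lives in `refined_category_mapping_from_csv.py`; the two entries below are just the examples above, and in this sketch unmapped codes pass through unchanged:

```python
import pandas as pd

# Illustrative subset; the full mapping comes from refined_category_mapping_from_csv.py.
CATEGORY_MAP = {
    "art_gallery": "Art & Culture",
    "restaurant": "Food & Drink",
}

def normalize_categories(merged: pd.DataFrame) -> pd.DataFrame:
    """Map raw category codes to human-friendly buckets and save the result."""
    merged = merged.copy()
    merged["category1"] = merged["category1"].map(CATEGORY_MAP).fillna(merged["category1"])
    merged.to_csv("final_users_over_20_categories.csv", index=False)
    return merged
```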
Madrid Data Processing
For Madrid-specific data:
- Collection: We gathered place data for Madrid using the Google Places API
- Cleaning: Removed duplicates and places with missing critical information
- Categorization: Mapped Google place types to our normalized category system
- Embedding: Generated text embeddings for places to support our Madrid Embedding Recommender
- Storage: Saved the processed data to `combined_places.csv` and the embeddings to `madrid_place_embeddings.npz`
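A hedged sketch of the embedding step using `sentence-transformers`; the model name and the `name`/`description`/`place_id` columns are assumptions, not the project's confirmed choices:

```python
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer

places = pd.read_csv("combined_places.csv")

# Build one text per place; which fields to concatenate is a design choice.
texts = (places["name"].fillna("") + ". " + places["description"].fillna("")).tolist()

model = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence-embedding model works
embeddings = model.encode(texts, show_progress_bar=True)

# Store ids alongside vectors so rows stay aligned at load time.
np.savez("madrid_place_embeddings.npz",
         place_id=places["place_id"].to_numpy(),
         embeddings=np.asarray(embeddings))
```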
Data Validation
Throughout the pipeline, we implement validation checks:
- Schema Validation: Ensure all required columns are present
- Value Range Checks: Validate that ratings fall within the expected 1-5 scale
- Null Detection: Identify and handle missing values
- Integrity Checks: Verify that foreign keys exist in the related tables
- Geocoding Validation: Ensure coordinates fall within expected bounds
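The checks above can be phrased as cheap assertions run between pipeline steps; the exact rules, column names for coordinates, and the rough California bounding box below are illustrative assumptions:

```python
import pandas as pd

def validate(merged: pd.DataFrame, places: pd.DataFrame) -> None:
    """Fail fast if a frame violates a pipeline invariant."""
    # Schema validation: all required columns present.
    required = {"user_id", "place_id", "rating", "category1"}
    assert required <= set(merged.columns), "schema: missing columns"
    # Value range check: ratings on the expected 1-5 scale.
    assert merged["rating"].between(1, 5).all(), "range: rating outside 1-5"
    # Null detection in required columns.
    assert not merged[list(required)].isna().any().any(), "nulls in required columns"
    # Integrity check: every rated place must exist in the metadata table.
    assert merged["place_id"].isin(places["place_id"]).all(), "integrity: unknown place_id"
    # Geocoding validation: hypothetical lat/lng columns, rough California bounds.
    if {"lat", "lng"} <= set(places.columns):
        assert places["lat"].between(32.0, 42.1).all(), "geo: latitude out of bounds"
        assert places["lng"].between(-124.6, -114.0).all(), "geo: longitude out of bounds"
```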
By automating these steps in a single, transparent pipeline, we guarantee that every dataset entering model training has been consistently cleaned and comprehensively characterized—laying a solid foundation for all recommendation algorithms.