MLOps
Overview
Planwise employs MLOps (Machine Learning Operations) practices to streamline the development, deployment, and maintenance of our recommendation models. This document outlines our approach to model versioning, training, evaluation, and deployment.
Model Lifecycle Management
Model Development
Our model development process follows these steps:
- Data Preparation: Process and clean data from various sources
- Experimentation: Test different algorithms and hyperparameters
- Training: Train models on prepared datasets
- Evaluation: Assess model performance using metrics
- Validation: Validate models against business requirements
Model Storage
We version and store all model artifacts in a standardized way:
- Directory Structure:
models/
├── autoencoder/
│   ├── v1.0.0/
│   │   ├── model.h5
│   │   ├── scaler.save
│   │   └── metadata.json
│   └── v1.1.0/
│       └── ...
├── svd/
│   └── ...
├── transfer/
│   └── ...
└── embeddings/
    └── ...
- Metadata File Example:
{
  "model_name": "autoencoder_recommender",
  "version": "1.0.0",
  "created_at": "2023-05-15T10:30:00Z",
  "framework": "tensorflow",
  "framework_version": "2.10.0",
  "python_version": "3.10.4",
  "input_shape": [29],
  "performance": {
    "validation_loss": 0.142,
    "test_loss": 0.157
  },
  "hyperparameters": {
    "hidden_layers": [16, 8, 16],
    "learning_rate": 0.001,
    "dropout_rate": 0.2,
    "epochs": 100
  },
  "training_dataset": "final_users_over_20_categories.csv",
  "description": "Denoising autoencoder with 3 hidden layers"
}
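For reference, the snippet below sketches how a stored version can be loaded back together with its scaler and metadata. The load_model_version helper is illustrative rather than part of the codebase; it only assumes the directory layout shown above.

import json

import joblib
from tensorflow import keras


def load_model_version(model_type, version):
    """Load a versioned model, its scaler, and its metadata from the registry layout above."""
    base_path = f"models/{model_type}/v{version}"
    model = keras.models.load_model(f"{base_path}/model.h5")
    scaler = joblib.load(f"{base_path}/scaler.save")
    with open(f"{base_path}/metadata.json") as f:
        metadata = json.load(f)
    return model, scaler, metadata


# Example: inspect the recorded performance of version 1.0.0
model, scaler, metadata = load_model_version("autoencoder", "1.0.0")
print(metadata["performance"])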
Model Training Pipeline
Our automated training pipeline consists of:
- Data Extraction:
  - Fetch data from the database and data files
  - Apply data cleaning and preprocessing
- Feature Engineering:
  - Transform raw data into model-ready features
  - Apply normalization and encoding
- Model Training:
  - Train models with optimized hyperparameters
  - Log training metrics to the monitoring system
- Evaluation:
  - Calculate performance metrics
  - Generate evaluation reports
- Model Registration:
  - Register models in the model registry
  - Store model artifacts with metadata
Training Job Example
import json
import os
import platform
from datetime import datetime

import joblib
import tensorflow as tf

# load_data, preprocess_data, build_autoencoder, get_callbacks, evaluate_model
# and get_next_version are project helper functions defined elsewhere.


def train_autoencoder_model(config):
    """Train autoencoder model with the given configuration."""
    # Load and prepare data; preprocess_data returns the fitted scaler
    # so it can be saved alongside the model
    data = load_data(config['data_path'])
    X_train, X_val, scaler = preprocess_data(data)

    # Build model
    model = build_autoencoder(
        input_dim=X_train.shape[1],
        hidden_layers=config['hidden_layers'],
        dropout_rate=config['dropout_rate']
    )

    # Train model
    history = model.fit(
        X_train, X_train,  # Autoencoder reconstructs inputs
        epochs=config['epochs'],
        batch_size=config['batch_size'],
        validation_data=(X_val, X_val),
        callbacks=get_callbacks(config)
    )

    # Evaluate model
    metrics = evaluate_model(model, X_val)

    # Save model and artifacts
    version = get_next_version('autoencoder')
    save_path = f"models/autoencoder/v{version}"
    os.makedirs(save_path, exist_ok=True)
    model.save(f"{save_path}/model.h5")
    joblib.dump(scaler, f"{save_path}/scaler.save")

    # Save metadata
    metadata = {
        "model_name": "autoencoder_recommender",
        "version": version,
        "created_at": datetime.now().isoformat(),
        "framework": "tensorflow",
        "framework_version": tf.__version__,
        "python_version": platform.python_version(),
        "input_shape": [X_train.shape[1]],
        "performance": metrics,
        "hyperparameters": config,
        "training_dataset": os.path.basename(config['data_path']),
        "description": config['description']
    }
    with open(f"{save_path}/metadata.json", 'w') as f:
        json.dump(metadata, f, indent=2)

    return model, metrics, save_path
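A training run can then be launched with a configuration dictionary such as the one below. The specific values, including the data path and batch size, are illustrative only and mirror the metadata example above rather than prescribing defaults.

config = {
    "data_path": "data/final_users_over_20_categories.csv",
    "hidden_layers": [16, 8, 16],
    "learning_rate": 0.001,
    "dropout_rate": 0.2,
    "epochs": 100,
    "batch_size": 32,
    "description": "Denoising autoencoder with 3 hidden layers"
}

model, metrics, save_path = train_autoencoder_model(config)
print(f"Saved model version to {save_path} with metrics {metrics}")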
Model Evaluation
We evaluate models using multiple techniques:
Offline Evaluation
- Metrics: RMSE, MAE, precision, recall, F1-score
- Cross-Validation: k-fold cross-validation for robust performance estimation (see the sketch after this list)
- Benchmark Comparison: Compare against baseline and previous versions
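To make the cross-validation step concrete, the sketch below estimates RMSE and MAE for a reconstruction model across k folds. The build_fn argument stands in for a model factory such as build_autoencoder, and the fold count, epoch, and batch-size defaults are assumptions for illustration.

import numpy as np
from sklearn.model_selection import KFold


def cross_validate_reconstruction(build_fn, X, n_splits=5, epochs=50, batch_size=32):
    """Estimate reconstruction RMSE and MAE with k-fold cross-validation (X is a NumPy array)."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    rmse_scores, mae_scores = [], []
    for train_idx, val_idx in kf.split(X):
        X_train, X_val = X[train_idx], X[val_idx]
        model = build_fn(input_dim=X.shape[1])  # fresh, untrained model for every fold
        model.fit(X_train, X_train, epochs=epochs, batch_size=batch_size, verbose=0)
        errors = model.predict(X_val, verbose=0) - X_val
        rmse_scores.append(np.sqrt(np.mean(errors ** 2)))
        mae_scores.append(np.mean(np.abs(errors)))
    return {
        "rmse_mean": float(np.mean(rmse_scores)),
        "rmse_std": float(np.std(rmse_scores)),
        "mae_mean": float(np.mean(mae_scores)),
        "mae_std": float(np.std(mae_scores)),
    }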
Online Evaluation
- A/B Testing: Test new models against current production models (an assignment sketch follows this list)
- User Feedback: Collect explicit and implicit user feedback
- Business Metrics: Track recommendation clicks, conversion rates, etc.
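For A/B testing, each user should be assigned to a model variant consistently across requests. One common approach, sketched below, is deterministic hash-based bucketing; the function name, experiment label, and 10% treatment share are hypothetical.

import hashlib


def assign_variant(user_id, experiment, treatment_share=0.1):
    """Deterministically bucket a user into the control or treatment model for an experiment."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # map the hash onto [0, 1]
    return "treatment" if bucket < treatment_share else "control"


# Example: route roughly 10% of users to the candidate model
variant = assign_variant("user_123", "autoencoder_candidate_experiment")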
Model Deployment
Our model deployment process ensures smooth transitions and minimal downtime:
Deployment Options
- Direct Model Loading:
  - Load models directly in the API service
  - Best for smaller models with lower latency requirements
- Containerized Model Serving:
  - Package models in Docker containers
  - Deploy as separate microservices
  - Use TensorFlow Serving or similar tools (a client sketch follows this list)
- Serverless Inference:
  - Deploy models to serverless platforms
  - Ideal for variable load patterns
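As a sketch of the containerized option, the API can call a TensorFlow Serving container over its REST API. The host and model names below are assumptions, and TensorFlow Serving expects the model exported in SavedModel format under numbered version directories rather than the .h5 file shown earlier.

import requests

# TensorFlow Serving exposes one REST endpoint per model:
#   POST http://<host>:8501/v1/models/<model_name>:predict
TF_SERVING_URL = "http://model-server:8501/v1/models/autoencoder:predict"


def predict_remote(feature_vector):
    """Send a single scaled feature vector to the serving container and return its prediction."""
    response = requests.post(TF_SERVING_URL, json={"instances": [feature_vector]}, timeout=2)
    response.raise_for_status()
    return response.json()["predictions"][0]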
Deployment Process
- Model Selection: Choose the model version to deploy
- Canary Deployment: Roll out to a small percentage of traffic
- Monitoring: Monitor performance metrics and errors
- Progressive Rollout: Gradually increase traffic to the new model (a simple rollout policy is sketched after this list)
- Rollback Plan: Maintain the ability to revert to the previous version
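A progressive rollout can be driven by a simple policy: advance the canary through predefined traffic stages while its error rate stays close to the baseline, and roll back otherwise. The stage percentages and the 10% tolerance below are illustrative assumptions, not fixed values from our pipeline.

ROLLOUT_STAGES = [0.05, 0.25, 0.50, 1.00]  # share of traffic sent to the new model


def next_rollout_step(current_share, canary_error_rate, baseline_error_rate, tolerance=0.10):
    """Decide whether to advance, hold, or roll back the canary model."""
    if canary_error_rate > baseline_error_rate * (1 + tolerance):
        return 0.0, "rollback"  # revert all traffic to the previous version
    for stage in ROLLOUT_STAGES:
        if stage > current_share:
            return stage, "advance"  # promote the canary to the next traffic stage
    return current_share, "complete"  # the new model already serves all traffic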
Model Monitoring
Continuous monitoring is essential for maintaining model quality:
Key Monitoring Areas
- Model Performance: Track prediction quality metrics
- Prediction Drift: Detect shifts in prediction patterns
- Data Drift: Monitor changes in input data distribution (a drift check is sketched after this list)
- Resource Usage: Track memory, CPU, and latency
- Business Metrics: Monitor engagement with recommendations
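One way to check for data drift is a per-feature two-sample Kolmogorov-Smirnov test of recent inputs against a training-time reference sample, as sketched below; the function name and the 0.01 p-value threshold are illustrative.

import numpy as np
from scipy.stats import ks_2samp


def detect_data_drift(reference, current, p_threshold=0.01):
    """Flag features whose recent distribution differs from the training-time reference (both NumPy arrays)."""
    drifted = []
    for i in range(reference.shape[1]):
        statistic, p_value = ks_2samp(reference[:, i], current[:, i])
        if p_value < p_threshold:
            drifted.append({"feature_index": i, "ks_statistic": float(statistic), "p_value": float(p_value)})
    return drifted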
Monitoring Dashboard
Our monitoring dashboard displays the following (an instrumentation sketch follows the list):
- Real-time performance metrics
- Request/response logs
- Error rates and types
- Resource utilization
- Data distribution visualizations
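These views are fed by metrics exported from the inference service. A minimal instrumentation sketch with the Prometheus Python client follows; the metric names, port, and the recommend inference call are assumptions for illustration.

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("recommendation_requests_total", "Recommendation requests served", ["model_version"])
ERRORS = Counter("recommendation_errors_total", "Failed recommendation requests", ["model_version"])
LATENCY = Histogram("recommendation_latency_seconds", "Time spent producing recommendations")


def serve_recommendation(user_features, model_version="1.0.0"):
    """Score a request while recording count, latency, and error metrics."""
    REQUESTS.labels(model_version=model_version).inc()
    with LATENCY.time():
        try:
            return recommend(user_features)  # stands in for the service's inference call
        except Exception:
            ERRORS.labels(model_version=model_version).inc()
            raise


# Expose the metrics on port 9100 for Prometheus to scrape and Grafana to visualize
start_http_server(9100)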
Continuous Improvement
We maintain a feedback loop for model improvements:
- Data Collection: Continuously gather new data
- Hypothesis Testing: Test ideas for model improvements
- Periodic Retraining: Update models with new data
- Algorithm Updates: Evaluate and implement new algorithms
- Feature Engineering: Develop new features to improve performance
Tools and Infrastructure
Our MLOps infrastructure uses:
- Version Control: Git for code, DVC for model artifacts
- CI/CD: GitHub Actions for automated pipelines
- Model Registry: Custom model storage with versioning
- Monitoring: Prometheus and Grafana
- Experiment Tracking: MLflow for experiment management (see the sketch after this list)
- Containerization: Docker for reproducible environments
- Infrastructure as Code: Terraform for infrastructure management
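As an example of experiment tracking, a training run can be wrapped in an MLflow run that logs parameters, metrics, and the saved artifacts. The experiment and run names are illustrative, and config and train_autoencoder_model refer to the training example above.

import mlflow

mlflow.set_experiment("autoencoder_recommender")

with mlflow.start_run(run_name="autoencoder_candidate"):
    mlflow.log_params({"hidden_layers": config["hidden_layers"],
                       "learning_rate": config["learning_rate"],
                       "dropout_rate": config["dropout_rate"]})
    model, metrics, save_path = train_autoencoder_model(config)
    mlflow.log_metrics({k: v for k, v in metrics.items() if isinstance(v, (int, float))})
    mlflow.log_artifacts(save_path)  # attach model.h5, scaler.save, and metadata.json to the run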
Best Practices
Our MLOps approach follows these best practices:
- Reproducibility: All experiments and deployments are reproducible
- Versioning: Code, data, and models are versioned
- Automation: Training and deployment processes are automated
- Testing: Models undergo rigorous testing before deployment
- Documentation: Models are well-documented with metadata
- Monitoring: Continuous monitoring for model health
- Security: Model artifacts and data are securely managed