Data Science Workflow: From Raw Data to a Deployed ML Model
By Pavan Sharma — AI Agent Developer & Full Stack Engineer
The Workflow Nobody Talks About
Data science tutorials usually show you the glamorous part — fitting a model, plotting a learning curve, getting 94% accuracy. What they skip is everything around that: how you actually go from raw, messy data to a working system that runs in production.
Here's the workflow I follow on every project.
Phase 1: Data Understanding (EDA)
Before writing a single line of ML code, spend time understanding your data:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('data.csv')

# Basic profiling
df.info()  # prints its report directly and returns None, so no print() needed
print(df.describe())
print(df.isnull().sum())

# Distribution of target variable
df['target'].value_counts().plot(kind='bar')
plt.show()
Key questions to answer in EDA:
- What is the shape and type of each feature?
- What is the distribution of the target variable? Is it imbalanced?
- Are there missing values? What pattern do they follow?
- Are there outliers? Are they real or data errors?
- Which features correlate with the target?
I use a correlation heatmap and Seaborn pairplots to answer the last two questions visually. Plotly Dash is useful when I need interactive EDA for stakeholders.
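When a plot is overkill, the same two questions can be checked numerically. The sketch below is a minimal example on synthetic data: the column names (income, age) and the planted outlier are assumptions for illustration, and the 1.5 × IQR rule is one common convention for flagging outliers, not the only one.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for data.csv; column names are hypothetical.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "income": rng.normal(50_000, 10_000, 200),
    "age": rng.integers(18, 70, 200).astype(float),
})
df.loc[0, "income"] = 500_000  # plant one obvious outlier
df["target"] = (df["age"] > 40).astype(int)

# Correlation of every feature with the target, strongest first.
target_corr = (
    df.corr()["target"]
    .drop("target")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(target_corr)

# Flag values more than 1.5 * IQR outside the quartiles, per column.
features = df.drop(columns="target")
q1, q3 = features.quantile(0.25), features.quantile(0.75)
iqr = q3 - q1
outlier_mask = (features < q1 - 1.5 * iqr) | (features > q3 + 1.5 * iqr)
print(outlier_mask.sum())
```

A flagged row is only a candidate. Whether it is a real extreme value or a data error still needs a look at the record itself.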
Phase 2: Feature Engineering
This is where domain knowledge creates the most value. Raw features are rarely the best input to a model. Transformations that commonly help:
- Log transforms for right-skewed numerical features
- Binning continuous variables when the relationship is non-linear
- Interaction features for pairs of features that are more informative together
- Date decomposition (year, month, day-of-week, hour) from timestamp columns
- Target encoding for high-cardinality categoricals
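Here is a rough sketch of the first four transformations in pandas. The frame and every column name in it (price, quantity, age, created_at) are made up for illustration, and the bin edges are arbitrary. Target encoding is left out on purpose: it has to be fitted on training folds only, otherwise the target leaks into the features.

```python
import numpy as np
import pandas as pd

# Toy frame; every column name here is hypothetical.
df = pd.DataFrame({
    "price": [10.0, 25.0, 400.0, 12_000.0],
    "quantity": [1, 3, 2, 5],
    "age": [22, 37, 51, 68],
    "created_at": pd.to_datetime(
        ["2024-01-05 09:30", "2024-03-17 14:00", "2024-07-01 23:15", "2024-12-25 06:45"]
    ),
})

# Log transform: log1p compresses the long right tail and tolerates zeros.
df["log_price"] = np.log1p(df["price"])

# Binning: age bands let a linear model pick up a non-linear effect.
df["age_band"] = pd.cut(
    df["age"], bins=[0, 30, 50, 120], labels=["young", "mid", "senior"]
)

# Interaction: price * quantity can say more than either column alone.
df["order_value"] = df["price"] * df["quantity"]

# Date decomposition from the timestamp column.
df["year"] = df["created_at"].dt.year
df["month"] = df["created_at"].dt.month
df["day_of_week"] = df["created_at"].dt.dayofweek  # Monday = 0
df["hour"] = df["created_at"].dt.hour

print(df.drop(columns="created_at"))
```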
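# Numeric preprocessing: fill missing values with the median, then standardize
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

numeric_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
])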
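Always build transformations inside a Pipeline so they apply consistently to training and inference data: the medians and scales are learned once from the training data and reused unchanged at prediction time.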
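To show what "inside a Pipeline" can look like end to end, here is a minimal sketch that routes numeric and categorical columns through a ColumnTransformer and ends in a model. The toy columns (age, income, city), the one-hot encoding, and the logistic regression are all assumptions for illustration, not a prescription.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy training data; column names and values are hypothetical.
X = pd.DataFrame({
    "age": [25, 32, np.nan, 51, 46, 38],
    "income": [30_000, 42_000, 58_000, np.nan, 75_000, 51_000],
    "city": ["Pune", "Delhi", "Pune", "Delhi", "Delhi", "Pune"],
})
y = np.array([0, 0, 1, 1, 1, 0])

numeric_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# Send each column group through its own preprocessing.
preprocess = ColumnTransformer([
    ("num", numeric_pipe, ["age", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])

clf = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])
clf.fit(X, y)

# Inference reuses the medians, scales and categories learned during fit.
# A city never seen in training is encoded as all zeros instead of failing.
new_rows = pd.DataFrame({"age": [np.nan], "income": [60_000], "city": ["Mumbai"]})
print(clf.predict(new_rows))
```

Because preprocessing and model live in one object, serializing `clf` captures both, so the serving code cannot drift from the training code.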
Phase 3: Model Selection and Training
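I start with simple baselines: a logistic regression or a decision tree. This gives me a reference score and tells me whether the problem is hard (on a balanced binary target, a baseline of 51% is barely better than a coin flip) or easy (a baseline of 89%). On an imbalanced target, compare against the majority-class rate instead, since always predicting the majority class can already look accurate.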
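One way to make "hard" and "easy" concrete is to score a majority-class predictor next to the simple baseline. The sketch below uses synthetic data from make_classification with an assumed 80/20 class split. DummyClassifier is a real scikit-learn estimator, but using it as the floor is my suggestion rather than a required step.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic, imbalanced binary problem (roughly 80% / 20%).
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=4,
    weights=[0.8, 0.2], random_state=0,
)

# Floor: always predict the most frequent class.
floor = cross_val_score(
    DummyClassifier(strategy="most_frequent"), X, y, cv=5, scoring="accuracy"
).mean()

# Simple baseline: logistic regression with default regularization.
baseline = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=5, scoring="accuracy"
).mean()

print(f"Majority-class floor: {floor:.3f}")
print(f"Logistic regression:  {baseline:.3f}")
```

The gap between the two numbers is the useful signal. A baseline that barely clears the floor suggests the features carry little information yet.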
Then I try:
- Random Forest: robust, interpretable feature importances, hard to overfit catastrophically
- Gradient Boosting (XGBoost or LightGBM): usually best performance on tabular data
- Neural network: only if the above two underperform or if the data is image/text/sequence
Use cross-validation to compare models, never a single train-test split, because one split can flatter or punish a model by chance:
from sklearn.model_selection import cross_val_score

# `model` is any scikit-learn estimator; X and y are the feature matrix and target
scores = cross_val_score(model, X, y, cv=5, scoring='f1_macro')
print(f"CV F1: {scores.mean():.3f} ± {scores.std():.3f}")
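To compare several candidates fairly, it helps to score them all on the same folds. This is a sketch on synthetic data. scikit-learn's GradientBoostingClassifier stands in for XGBoost or LightGBM here only to avoid an extra dependency, and the fold count and seeds are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(
    n_samples=500, n_features=12, n_informative=5, random_state=0
)

# Fixed, shuffled folds so every model sees exactly the same splits.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

results = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="f1_macro")
    results[name] = (scores.mean(), scores.std())
    print(f"{name:>20}: CV F1 {scores.mean():.3f} ± {scores.std():.3f}")
```

If two models differ by less than their fold-to-fold standard deviation, I would treat them as tied and pick the simpler one.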
Phase 4: Evaluation Beyond Accuracy
Accuracy alone is misleading, especially on imbalanced datasets. Always report:
- Confusion matrix: to see where the model is failing
- Precision/Recall/F1: especially for imbalanced targets
- ROC AUC: for ranking quality
- Calibration plot: if the model outputs probabilities used for decisions
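All four reports are available in scikit-learn. The sketch below computes them on a held-out split of synthetic, imbalanced data. The model, the 90/10 class split, and the five calibration bins are assumptions for illustration, and the calibration data is printed as numbers rather than plotted.

```python
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=4,
    weights=[0.9, 0.1], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]  # probability of the positive class

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Per-class precision, recall and F1.
print(classification_report(y_test, y_pred, digits=3))

# Ranking quality, computed from probabilities rather than hard labels.
auc = roc_auc_score(y_test, y_prob)
print(f"ROC AUC: {auc:.3f}")

# Calibration: observed positive rate versus mean predicted probability per bin.
frac_pos, mean_pred = calibration_curve(y_test, y_prob, n_bins=5)
for predicted, observed in zip(mean_pred, frac_pos):
    print(f"predicted {predicted:.2f} -> observed {observed:.2f}")
```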
Phase 5: Deployment with FastAPI
Once the model is validated, I serialize it with joblib and serve it via a FastAPI endpoint:
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

# Load the serialized model once at startup, not on every request
model = joblib.load('model.pkl')
app = FastAPI()

class PredictRequest(BaseModel):
    features: list[float]

@app.post("/predict")
def predict(req: PredictRequest):
    prediction = model.predict([req.features])
    # Highest class probability, reported as the confidence
    probability = model.predict_proba([req.features]).max()
    return {"prediction": int(prediction[0]), "confidence": float(probability)}
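The endpoint expects a model.pkl file to exist already. Here is a minimal sketch of how that file might be produced and checked before shipping. The four-feature synthetic dataset, the scaler plus logistic regression pipeline, and the temporary directory are all assumptions; a real project would dump its own validated pipeline.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for the validated model: a small pipeline on synthetic data.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression()),
]).fit(X, y)

# Serialize preprocessing and model together as a single artifact.
model_path = os.path.join(tempfile.mkdtemp(), "model.pkl")
joblib.dump(pipeline, model_path)

# Reload and run the same two calls the endpoint makes.
restored = joblib.load(model_path)
features = [0.1, -1.2, 0.7, 2.0]
prediction = restored.predict([features])
probability = restored.predict_proba([features]).max()
print({"prediction": int(prediction[0]), "confidence": float(probability)})
```

A joblib file is a pickle underneath, so it should be loaded with the same scikit-learn version that wrote it, and only from a source you trust.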
Containerize with Docker for reproducibility:
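FROM python:3.11-slim
# Keep application files in a dedicated directory instead of the image root
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY model.pkl app.py ./
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]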
The Lesson
Eighty percent of the time in data science is spent on phases 1 and 2. Model training is the easy part. Good EDA and thoughtful feature engineering will outperform hyperparameter tuning on a poorly understood dataset every single time.