Data Science Workflow: From Raw Data to a Deployed ML Model

Data Science · Python · Machine Learning · Scikit-learn

By Pavan Sharma — AI Agent Developer & Full Stack Engineer

The Workflow Nobody Talks About

Data science tutorials usually show you the glamorous part — fitting a model, plotting a learning curve, getting 94% accuracy. What they skip is everything around that: how you actually go from raw, messy data to a working system that runs in production.

Here's the workflow I follow on every project.

Phase 1: Data Understanding (EDA)

Before writing a single line of ML code, spend time understanding your data:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('data.csv')

# Basic profiling
print(df.info())
print(df.describe())
print(df.isnull().sum())

# Distribution of target variable
df['target'].value_counts().plot(kind='bar')

Key questions to answer in EDA:

▸What is the shape and type of each feature?
▸What is the distribution of the target variable? Is it imbalanced?
▸Are there missing values? What pattern do they follow?
▸Are there outliers? Are they real or data errors?
▸Which features correlate with the target?

I use a correlation heatmap and Seaborn pairplots to answer the last two questions visually. Plotly Dash is useful when I need interactive EDA for stakeholders.
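The same two questions can also be checked numerically before any plotting. The sketch below is a minimal example on a small synthetic frame: the column names (age, income, target) and the data are made up for illustration, and the 1.5 × IQR rule is just one common convention for flagging outliers, not the only reasonable one.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real DataFrame; column names are hypothetical.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.normal(40, 10, n),
    "income": rng.lognormal(10, 0.5, n),
})
# The target depends on age only, so age should rank first below.
df["target"] = (df["age"] > 40).astype(int)
df.loc[0, "income"] = 1e7  # inject one obvious outlier


def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


outlier_mask = iqr_outliers(df["income"])
print(f"income outliers flagged: {outlier_mask.sum()}")

# Absolute Pearson correlation of each feature with the target, strongest first.
target_corr = (
    df.corr()["target"]
    .drop("target")
    .abs()
    .sort_values(ascending=False)
)
print(target_corr)
```

The matrix returned by df.corr() is also what I would pass to sns.heatmap for the visual version. A flagged row still needs a human decision: the rule only says a value is unusual, not whether it is a data error.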

Phase 2: Feature Engineering

This is where domain knowledge creates the most value. Raw features are rarely the best input to a model. Transformations that commonly help:

▸Log transforms for right-skewed numerical features
▸Binning continuous variables when the relationship is non-linear
▸Interaction features for pairs of features that are more informative together
▸Date decomposition (year, month, day-of-week, hour) from timestamp columns
▸Target encoding for high-cardinality categoricals
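Most of the transformations listed above are one-liners in pandas. Here is a minimal sketch on a toy frame; every column name and bin edge is a hypothetical choice for illustration, not a recommendation for any particular dataset.

```python
import numpy as np
import pandas as pd

# Toy frame; all column names and values are made up for illustration.
df = pd.DataFrame({
    "price": [10.0, 200.0, 3500.0, 42.0],
    "quantity": [1, 3, 2, 10],
    "age": [23, 37, 58, 71],
    "created_at": pd.to_datetime([
        "2024-01-05 09:30", "2024-03-17 22:10",
        "2024-07-01 00:05", "2024-12-24 13:45",
    ]),
})

# Log transform for a right-skewed feature; log1p is defined at zero.
df["log_price"] = np.log1p(df["price"])

# Binning a continuous variable into ordered bands (left-closed intervals).
df["age_band"] = pd.cut(
    df["age"],
    bins=[0, 30, 50, 65, 120],
    labels=["<30", "30-49", "50-64", "65+"],
    right=False,
)

# Interaction feature: two columns that are more informative together.
df["order_value"] = df["price"] * df["quantity"]

# Date decomposition from a timestamp column.
ts = df["created_at"].dt
df["year"] = ts.year
df["month"] = ts.month
df["day_of_week"] = ts.dayofweek  # Monday = 0
df["hour"] = ts.hour

print(df.drop(columns="created_at"))
```

Target encoding is left out on purpose: it uses the target to build a feature, so it has to be fitted on training folds only, which makes it a better fit for a pipeline step than for an ad hoc column assignment.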

from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Fill missing numeric values with the median, then standardize
numeric_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

Always build transformations inside a Pipeline so they apply consistently to training and inference data.
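To make that concrete, here is a sketch of how a numeric pipeline like the one above might be combined with a categorical one and a model into a single estimator. The column names, the toy data, and the choice of LogisticRegression are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy training data; column names and values are hypothetical.
train = pd.DataFrame({
    "age": [25, 32, np.nan, 51, 46, 29],
    "income": [30_000, 52_000, 61_000, np.nan, 88_000, 41_000],
    "city": ["A", "B", "A", "C", "B", "A"],
    "target": [0, 0, 1, 1, 1, 0],
})
numeric_cols = ["age", "income"]
categorical_cols = ["city"]

numeric_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
categorical_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    # Categories unseen during training are encoded as all zeros.
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("num", numeric_pipe, numeric_cols),
    ("cat", categorical_pipe, categorical_cols),
])

# Preprocessing and model travel together as one object.
clf = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])
clf.fit(train[numeric_cols + categorical_cols], train["target"])

# An inference-time row with a missing value and an unseen category.
new_rows = pd.DataFrame({"age": [np.nan], "income": [45_000], "city": ["Z"]})
pred = clf.predict(new_rows)
print(pred)
```

The medians and scaling statistics are learned during fit and reused at predict time, so the new row is imputed with the training median rather than anything computed from inference data.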

Phase 3: Model Selection and Training

I start with simple baselines: a logistic regression or a decision tree. This gives me a reference score and shows whether the problem is hard (on a balanced binary target, a baseline near 51% is barely better than guessing) or easy (the baseline already reaches 89%).

Then I try:

▸Random Forest — robust, interpretable feature importances, hard to overfit catastrophically
▸Gradient Boosting (XGBoost or LightGBM) — usually best performance on tabular data
▸Neural network — only if the above two underperform or if the data is image/text/sequence

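Estimate performance with cross-validation rather than a single train-test split, whose score depends heavily on which rows happen to land in the test set: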

from sklearn.model_selection import cross_val_score

# model: any scikit-learn estimator or Pipeline; X, y: features and target
scores = cross_val_score(model, X, y, cv=5, scoring='f1_macro')
print(f"CV F1: {scores.mean():.3f} ± {scores.std():.3f}")
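Putting the baseline and the candidate models through the same cross-validation loop might look like the sketch below. The data is synthetic (make_classification), and scikit-learn's GradientBoostingClassifier is swapped in for XGBoost or LightGBM purely to keep the example dependency-free. A DummyClassifier is added as an extra floor that any real model should beat.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, mildly imbalanced data standing in for a real dataset.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5,
    weights=[0.7, 0.3], random_state=0,
)

candidates = {
    "dummy (most frequent)": DummyClassifier(strategy="most_frequent"),
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "gradient boosting": GradientBoostingClassifier(random_state=0),
}

# The same stratified folds for every model keep the comparison fair.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

results = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="f1_macro")
    results[name] = scores.mean()
    print(f"{name:<24} F1: {scores.mean():.3f} ± {scores.std():.3f}")
```

The dummy model may trigger an undefined-metric warning because it never predicts the minority class. That is expected, and it is exactly why its macro F1 is a useful floor.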

Phase 4: Evaluation Beyond Accuracy

Accuracy alone is misleading, especially on imbalanced datasets. Always report:

▸Confusion matrix: to see where the model is failing
▸Precision/Recall/F1: especially for imbalanced targets
▸ROC AUC: for ranking quality
▸Calibration plot: if the model outputs probabilities used for decisions
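All four reports are available in scikit-learn. The sketch below computes them from out-of-fold predictions on synthetic data; the dataset, the LogisticRegression model, the 0.5 threshold, and the five calibration bins are all placeholder choices for illustration.

```python
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import cross_val_predict

# Synthetic imbalanced data standing in for a real dataset.
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5,
    weights=[0.8, 0.2], random_state=0,
)

# Out-of-fold probabilities: each row is scored by a model that never saw it.
y_prob = cross_val_predict(
    LogisticRegression(max_iter=1000), X, y, cv=5, method="predict_proba"
)[:, 1]
y_pred = (y_prob >= 0.5).astype(int)

# Confusion matrix: rows are true classes, columns are predicted classes.
cm = confusion_matrix(y, y_pred)
print(cm)

# Per-class precision, recall, and F1.
print(classification_report(y, y_pred, digits=3))

# ROC AUC is computed from probabilities, not hard labels.
auc = roc_auc_score(y, y_prob)
print(f"ROC AUC: {auc:.3f}")

# Calibration: observed positive rate per bin of predicted probability.
frac_pos, mean_pred = calibration_curve(y, y_prob, n_bins=5)
for predicted, observed in zip(mean_pred, frac_pos):
    print(f"predicted {predicted:.2f} -> observed {observed:.2f}")
```

For a well-calibrated model the predicted and observed columns track each other closely. Plotting one against the other gives the calibration plot.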

Phase 5: Deployment with FastAPI

Once the model is validated, I serialize it with joblib and serve it via a FastAPI endpoint:

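import joblib
from fastapi import FastAPI
from pydantic import BaseModel

# Load the serialized model once at startup, not on every request
model = joblib.load('model.pkl')
app = FastAPI()

class PredictRequest(BaseModel):
    # Feature values in the same order the model was trained on
    features: list[float]

@app.post("/predict")
def predict(req: PredictRequest):
    prediction = model.predict([req.features])
    # Highest class probability, reported as the confidence
    probability = model.predict_proba([req.features]).max()
    # int() assumes the model was trained on integer class labels
    return {"prediction": int(prediction[0]), "confidence": float(probability)}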
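The endpoint expects a model.pkl file to exist already. A minimal sketch of the serialization step that produces it, assuming a small scikit-learn pipeline trained on synthetic data and written to a temporary directory (both are placeholders for the real model and path):

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with four features, standing in for the real training set.
X, y = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=1,
    random_state=0,
)

# Serialize the whole pipeline so preprocessing ships with the model.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
]).fit(X, y)

# A temporary directory is used here; a real project would pick a fixed path.
model_path = os.path.join(tempfile.mkdtemp(), "model.pkl")
joblib.dump(pipeline, model_path)

# Round-trip check: the restored object should predict identically.
restored = joblib.load(model_path)
sample = X[:1]
same = np.array_equal(pipeline.predict(sample), restored.predict(sample))
print(f"saved to {model_path}, round-trip predictions match: {same}")
```

Two caveats: the file should be loaded with the same library versions it was saved with, and joblib files are pickles, so they should only ever be loaded from a trusted source.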

Containerize with Docker for reproducibility:

FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY model.pkl app.py ./
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]

The Lesson

Roughly eighty percent of the time on a data science project goes to phases 1 and 2. Model training is the easy part. Good EDA and thoughtful feature engineering will usually beat hyperparameter tuning on a poorly understood dataset.
