Decrypted Data
Data Science Workflow: From Raw Data to a Deployed ML Model

Data Science, Python, Machine Learning, Scikit-learn

By Pavan Sharma — AI Agent Developer & Full Stack Engineer

The Workflow Nobody Talks About

Data science tutorials usually show you the glamorous part — fitting a model, plotting a learning curve, getting 94% accuracy. What they skip is everything around that: how you actually go from raw, messy data to a working system that runs in production.

Here's the workflow I follow on every project.

Phase 1: Data Understanding (EDA)

Before writing a single line of ML code, spend time understanding your data:

    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns

    df = pd.read_csv('data.csv')

    # Basic profiling
    df.info()  # prints its own summary and returns None, so no print() needed
    print(df.describe())
    print(df.isnull().sum())

    # Distribution of target variable
    df['target'].value_counts().plot(kind='bar')
    plt.show()

Key questions to answer in EDA:

  • What is the shape and type of each feature?
  • What is the distribution of the target variable? Is it imbalanced?
  • Are there missing values? What pattern do they follow?
  • Are there outliers? Are they real or data errors?
  • Which features correlate with the target?

I use a correlation heatmap and Seaborn pairplots to answer the last two questions visually. Plotly Dash is useful when I need interactive EDA for stakeholders.
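As a numeric companion to those plots, the last two questions can also be checked directly in pandas. This is a minimal sketch on synthetic data: the column names (`age`, `income`), the helper `iqr_outlier_mask`, and the 1.5 × IQR rule are illustrative assumptions, not something the workflow above prescribes.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for data.csv; the column names are hypothetical.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.normal(40, 10, n),
    "income": rng.lognormal(10, 1, n),  # deliberately right-skewed
})
df["target"] = (df["age"] + rng.normal(0, 5, n) > 40).astype(int)

# Correlation of each feature with the target, strongest (by magnitude) first.
corr_with_target = (
    df.corr()["target"]
    .drop("target")
    .sort_values(key=np.abs, ascending=False)
)
print(corr_with_target)


def iqr_outlier_mask(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values further than k * IQR outside the quartiles."""
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)


# Count flagged rows per feature; whether they are real or errors
# still needs a human look at the rows themselves.
outlier_counts = {
    col: int(iqr_outlier_mask(df[col]).sum()) for col in ["age", "income"]
}
print(outlier_counts)
```

A flagged value is only a candidate for inspection; the rule cannot tell a data-entry error from a genuinely extreme observation.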

Phase 2: Feature Engineering

This is where domain knowledge creates the most value. Raw features are rarely the best input to a model. Transformations that commonly help:

  • Log transforms for right-skewed numerical features
  • Binning continuous variables when the relationship is non-linear
  • Interaction features for pairs of features that are more informative together
  • Date decomposition (year, month, day-of-week, hour) from timestamp columns
  • Target encoding for high-cardinality categoricals

    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    # Numeric columns: fill gaps with the median, then standardize
    numeric_pipe = Pipeline([
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler()),
    ])
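Several of the transformations listed above can be sketched in a few lines of pandas. The toy frame and its column names (`price`, `quantity`, `ordered_at`) are hypothetical and exist only to make the example self-contained. Target encoding is left out because it must be fitted on training folds only to avoid leakage.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
df = pd.DataFrame({
    "price": [10.0, 120.0, 15000.0],
    "quantity": [1, 3, 2],
    "ordered_at": pd.to_datetime(
        ["2024-01-05 09:30", "2024-03-17 18:00", "2024-07-01 23:15"]
    ),
})

# Log transform for a right-skewed feature; log1p is safe at zero.
df["log_price"] = np.log1p(df["price"])

# Quantile binning into three equally populated buckets.
df["price_bin"] = pd.qcut(df["price"], q=3, labels=["low", "mid", "high"])

# Interaction feature: two columns that are more informative together.
df["revenue"] = df["price"] * df["quantity"]

# Date decomposition from a timestamp column.
ts = df["ordered_at"].dt
df["year"] = ts.year
df["month"] = ts.month
df["day_of_week"] = ts.dayofweek  # Monday = 0
df["hour"] = ts.hour

print(df.drop(columns="ordered_at"))
```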

Always build transformations inside a Pipeline so they apply consistently to training and inference data.
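One way to follow that advice is to combine per-type preprocessing with the estimator in a single object, so that `fit` learns the imputation and scaling statistics from training rows only and `predict` reuses them unchanged. The sketch below assumes a mixed numeric/categorical table. The data is synthetic, and the column names and the choice of `LogisticRegression` are placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data with hypothetical column names.
rng = np.random.default_rng(0)
n = 100
X = pd.DataFrame({
    "age": rng.normal(40, 10, n),
    "income": rng.lognormal(10, 1, n),
    "city": rng.choice(["A", "B", "C"], n),
})
X.loc[X.index[::10], "age"] = np.nan  # inject some missing values
y = (X["income"] > X["income"].median()).astype(int)

numeric_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

preprocess = ColumnTransformer([
    ("num", numeric_pipe, ["age", "income"]),
    # Categories unseen during fit are encoded as all zeros at inference.
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])

clf = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])

# Statistics (medians, means, category lists) come from the first 80 rows only.
clf.fit(X.iloc[:80], y.iloc[:80])
preds = clf.predict(X.iloc[80:])
print(preds)
```

Because the whole chain is one estimator, it can be passed to cross-validation or serialized as a unit, and the preprocessing can never drift apart from the model.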

Phase 3: Model Selection and Training

I start with simple baselines, such as a logistic regression or a decision tree. This gives me a baseline score and tells me whether the problem is hard (say, a baseline of 51% accuracy on a balanced binary target) or easy (a baseline of 89%).
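A baseline number is easier to read next to a trivial floor. The sketch below adds scikit-learn's `DummyClassifier`, which always predicts the majority class, as that floor. This is my addition for illustration and not part of the workflow described above, and the dataset is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic, roughly balanced binary problem (assumption for illustration).
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0
)

floor = DummyClassifier(strategy="most_frequent")  # ignores the features
baseline = LogisticRegression(max_iter=1000)

floor_acc = cross_val_score(floor, X, y, cv=5, scoring="accuracy").mean()
baseline_acc = cross_val_score(baseline, X, y, cv=5, scoring="accuracy").mean()

print(f"majority-class floor: {floor_acc:.3f}")
print(f"logistic regression:  {baseline_acc:.3f}")
```

If the simple model barely clears the floor, the features carry little linear signal, and that is worth knowing before reaching for heavier models.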

Then I try:

  1. Random Forest — robust, interpretable feature importances, hard to overfit catastrophically
  2. Gradient Boosting (XGBoost or LightGBM) — usually best performance on tabular data
  3. Neural network — only if the above two underperform or if the data is image/text/sequence

Use cross-validation, never a single train-test split:

    from sklearn.model_selection import cross_val_score

    # model is any scikit-learn estimator; X and y are the features and target
    scores = cross_val_score(model, X, y, cv=5, scoring='f1_macro')
    print(f"CV F1: {scores.mean():.3f} ± {scores.std():.3f}")

Phase 4: Evaluation Beyond Accuracy

Accuracy alone is misleading, especially on imbalanced datasets. Always report:

  • Confusion matrix: to see where the model is failing
  • Precision/Recall/F1: especially for imbalanced targets
  • ROC AUC: for ranking quality
  • Calibration plot: if the model outputs probabilities used for decisions
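The four reports above can be computed from out-of-fold predictions, which keeps the evaluation consistent with the cross-validation advice. The following is a minimal sketch on a synthetic 90/10 imbalanced dataset. The model, the 0.5 decision threshold, and the five quantile bins are illustrative choices.

```python
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import cross_val_predict

# Synthetic imbalanced problem: roughly 90% negatives, 10% positives.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=0
)

# Out-of-fold probability of the positive class for every row.
proba = cross_val_predict(
    LogisticRegression(max_iter=1000), X, y, cv=5, method="predict_proba"
)[:, 1]
y_pred = (proba >= 0.5).astype(int)

cm = confusion_matrix(y, y_pred)                  # rows: true, columns: predicted
report = classification_report(y, y_pred, digits=3)
auc = roc_auc_score(y, proba)                     # uses scores, not hard labels

# Points for a calibration plot: observed positive rate vs mean predicted probability.
prob_true, prob_pred = calibration_curve(y, proba, n_bins=5, strategy="quantile")

print(cm)
print(report)
print(f"ROC AUC: {auc:.3f}")
for observed, predicted in zip(prob_true, prob_pred):
    print(f"predicted {predicted:.2f} -> observed {observed:.2f}")
```

Plotting `prob_pred` against `prob_true` gives the calibration curve; points close to the diagonal indicate probabilities that can be taken at face value.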

Phase 5: Deployment with FastAPI

Once the model is validated, I serialize it with joblib and serve it via a FastAPI endpoint:

    import joblib
    from fastapi import FastAPI
    from pydantic import BaseModel

    # Load the serialized model once at startup, not on every request
    model = joblib.load('model.pkl')
    app = FastAPI()

    class PredictRequest(BaseModel):
        features: list[float]

    @app.post("/predict")
    def predict(req: PredictRequest):
        prediction = model.predict([req.features])
        # Confidence is the highest class probability for this input
        probability = model.predict_proba([req.features]).max()
        return {"prediction": int(prediction[0]), "confidence": float(probability)}
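The endpoint assumes a `model.pkl` already exists. Below is a sketch of the serialization step that would produce it, plus a round-trip check. The training data is synthetic, the four-feature input is hypothetical, and a temporary directory is used so the example does not overwrite a real file.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic training data with four features (assumption for illustration).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# Serialize the full pipeline so preprocessing ships with the model.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
]).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "model.pkl"
    joblib.dump(pipeline, model_path)
    restored = joblib.load(model_path)

# The restored object should reproduce the original predictions exactly.
round_trip_ok = np.array_equal(restored.predict(X), pipeline.predict(X))

# Same computation the /predict handler performs for one request.
features = [0.1, -1.2, 0.5, 2.0]
prediction = int(restored.predict([features])[0])
confidence = float(restored.predict_proba([features]).max())
print(round_trip_ok, prediction, round(confidence, 3))
```

Pickle-based files are tied to the library versions that wrote them, so pinning scikit-learn in `requirements.txt` to the training version is a sensible precaution.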

Containerize with Docker for reproducibility:

    FROM python:3.11-slim
    WORKDIR /app
    COPY requirements.txt .
    RUN pip install --no-cache-dir -r requirements.txt
    COPY model.pkl app.py ./
    CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]

The Lesson

Eighty percent of the time in data science goes into phases 1 and 2. Model training is the easy part. Good EDA and thoughtful feature engineering will outperform hyperparameter tuning on a poorly understood dataset every single time.