From Statistics to Deep Learning: The Modern Modeling Landscape

See BONUS.md for advanced topics:

Fun fact: The word "model" comes from the Latin "modulus" meaning "measure" or "standard." In data science, we're literally creating standards - mathematical representations that measure and predict patterns in our data. But unlike Zoolander, we can turn left AND right!

xkcd 1838: Machine Learning

"I'm sorry, I can't do that. I'm a machine learning model, not a magic wand."

Quick Reference

| Tool | When to Use | Key Features | Best For |
|---|---|---|---|
| statsmodels | Need p-values, confidence intervals, hypothesis testing | Statistical inference, model diagnostics | Understanding relationships, research |
| scikit-learn | Tabular data, need predictions | Consistent API, many algorithms | General ML tasks, preprocessing |
| XGBoost | Tabular data, need best performance | Gradient boosting, feature importance | Competitions, production tabular data |
| TensorFlow/Keras or PyTorch | Images, text, audio, large datasets | High-level API, production-ready | Computer vision, NLP, deployment |

The Modeling Ecosystem: A Brief Tour

Reality check: There are more Python modeling libraries than there are ways to overfit a model. But don't worry - we'll focus on the essential tools that actually matter for daily data science work, from the bread-and-butter statistical methods to the cutting-edge deep learning frameworks.

The Python modeling landscape has evolved dramatically. From simple linear regression to complex neural networks, each tool has its place. Understanding when to use what is half the battle - the other half is actually getting your model to work (which, let's be honest, is usually the harder part).

The Modeling Spectrum:

STATISTICAL MODELING          TRADITIONAL ML             DEEP LEARNING
┌─────────────────────┐      ┌──────────────────┐      ┌──────────────┐
│   statsmodels       │      │  scikit-learn    │      │ TensorFlow   │
│   (inference)       │      │  (predictions)   │      │ PyTorch      │
│                     │      │                  │      │              │
│ • Linear models     │      │ • Random Forest  │      │ • Neural     │
│ • GLMs              │      │ • SVM            │      │   networks   │
│ • Time series       │      │ • XGBoost        │      │ • CNNs       │
│                     │      │                  │      │ • RNNs       │
└─────────────────────┘      └──────────────────┘      └──────────────┘
     ↑                            ↑                          ↑
  "Why?"                      "What?"                    "How?"

Model Complexity vs Interpretability Trade-off:

As models get more powerful, they often become harder to interpret. Choose based on what you need: understanding (interpretability) or performance (accuracy).

Key Decision Points:

Pro tip: Start simple. A well-tuned linear regression often beats a poorly tuned neural network. Remember: "But why male models?" - because sometimes the simplest model is the right model!

Model Selection Decision Tree:

flowchart TD
    A[What's your problem?] --> B{Need statistical<br/>inference?}
    B -->|Yes| C[statsmodels]
    B -->|No| D{What type of data?}
    D -->|Tabular/Structured| E{How much data?}
    D -->|Images/Text/Audio| F[Deep Learning<br/>TensorFlow/PyTorch]
    E -->|Small dataset| G[scikit-learn<br/>Random Forest]
    E -->|Large dataset| H[XGBoost]
    C --> I[Linear Regression<br/>GLMs<br/>Time Series]
    G --> J[Random Forest<br/>Linear Models]
    H --> K[XGBoost<br/>LightGBM<br/>CatBoost]
    F --> L[Neural Networks<br/>CNNs/RNNs]
    
    style C fill:#e1f5ff
    style G fill:#fff4e1
    style H fill:#ffe1f5
    style F fill:#e1ffe1
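
The flowchart can also be read as a small function. The sketch below is one way to write it down, not a definitive rule: `suggest_tool` and its `large_threshold` parameter are hypothetical names invented for this illustration, and the 100,000-row default is an arbitrary assumption, since the chart only distinguishes "small" from "large" datasets.

```python
def suggest_tool(needs_inference: bool, data_type: str, n_rows: int,
                 large_threshold: int = 100_000) -> str:
    """Hypothetical helper mirroring the flowchart; not part of any library."""
    if needs_inference:
        return "statsmodels"
    if data_type in {"images", "text", "audio"}:
        return "deep learning (TensorFlow/PyTorch)"
    if data_type != "tabular":
        raise ValueError(f"unknown data_type: {data_type!r}")
    # The flowchart only says "small" vs "large"; this cutoff is an assumption.
    if n_rows >= large_threshold:
        return "XGBoost"
    return "scikit-learn (Random Forest)"


print(suggest_tool(True, "tabular", 500))
print(suggest_tool(False, "images", 50_000))
print(suggest_tool(False, "tabular", 2_000))
print(suggest_tool(False, "tabular", 5_000_000))
```

Treat the output as a starting point for experimentation rather than a verdict; in practice you would try more than one branch and compare.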

"But why models?" "Seriously? I just told you that a moment ago."

xkcd 882: Significantly

"We found a statistically significant correlation between the data and our hypothesis. (p < 0.05)"

The Foundation: Statistical Modeling

Think of statistical modeling as the foundation of your modeling house - you can build fancy additions on top, but you need to understand the basics first.

Statistical modeling focuses on understanding relationships and making inferences about populations. Unlike machine learning (which prioritizes prediction), statistical models help you understand why things happen, not just what will happen.

Introduction to statsmodels

statsmodels is Python's comprehensive statistical modeling library. It provides tools for statistical inference, hypothesis testing, and model diagnostics - the bread and butter of statistical analysis.

When to use statsmodels:

pandas compatibility: Most statsmodels functions work directly with pandas DataFrames. You can pass DataFrames to model constructors, and results are often returned as pandas objects (Series, DataFrames).

Linear Regression

Linear regression is the workhorse of statistical modeling. It models the relationship between a dependent variable and one or more independent variables using a linear equation.

Think of linear regression as the Derek Zoolander of modeling - it's simple, it's reliable, and it can turn left (or right, or any direction really).

Linear Regression: The Blue Steel of Modeling

Linear regression finds the best-fitting line through your data. It's like finding the perfect pose - simple, elegant, and it works every time (well, most of the time).

y = β₀ + β₁x₁ + β₂x₂ + ... + ε

Where:
- y = dependent variable (what you're predicting)
- β₀ = intercept (the predicted y when every x is 0)
- β₁, β₂, ... = coefficients (how much each x affects y)
- ε = error term (the stuff we can't explain)

Visual Example: Simple Linear Regression

y (target)
  ↑
  |          ●
  |       ●   ●
  |     ●  ●
  |  ●   ●
  |●  ●
  |_____________→ x (feature)
  
Best-fit line: y = 2.0 + 1.5x

The line minimizes the sum of squared vertical distances (residuals) between the data points and the line. That's what "least squares" means!
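
To make "least squares" concrete, here is a minimal NumPy sketch. It assumes synthetic data generated around the line y = 2.0 + 1.5x (the noise level and sample size are made up), fits the coefficients with `np.linalg.lstsq`, and then checks that nudging the fitted slope makes the sum of squared residuals worse.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 + 1.5 * x + rng.normal(scale=1.0, size=50)  # true line plus noise

# Design matrix with a column of ones for the intercept
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)


def sse(b):
    """Sum of squared residuals for coefficients b = [intercept, slope]."""
    residuals = y - X @ b
    return float(residuals @ residuals)


print("fitted intercept, slope:", beta)
print("SSE at the fit:        ", sse(beta))
print("SSE with slope nudged: ", sse(beta + np.array([0.0, 0.1])))
```

Any other choice of intercept and slope gives a larger sum of squared residuals on this data, which is exactly the property the fitted line is chosen for.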

"I can turn left, I can turn right, I can even turn... statistically significant!"

Example:

import statsmodels.api as sm
import statsmodels.formula.api as smf
import pandas as pd
import numpy as np

# Create sample data where y actually depends on x1 and x2
np.random.seed(42)
x1 = np.random.randn(100)
x2 = np.random.randn(100)
df = pd.DataFrame({
    'x1': x1,
    'x2': x2,
    'y': 2 + 3 * x1 + 0.5 * x2 + np.random.randn(100)  # true coefficients: 2, 3, 0.5
})

# Formula API (R-like, works with DataFrames)
model = smf.ols('y ~ x1 + x2', data=df)
results = model.fit()
print(results.summary())

# Access coefficients
print(results.params)  # Intercept, x1, x2 coefficients
print(results.pvalues)  # Statistical significance

The summary() method provides comprehensive output including R-squared, p-values, confidence intervals, and model diagnostics - all the statistical information you need for inference.
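
If you want the inference numbers as data rather than printed text, the fitted results object exposes them directly. The sketch below assumes synthetic data with made-up true coefficients (2, 3, 0.5) and collects coefficients, 95% confidence intervals, and p-values into one DataFrame; the column names `lower` and `upper` are labels chosen here.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
df = pd.DataFrame({"x1": x1, "x2": x2,
                   "y": 2 + 3 * x1 + 0.5 * x2 + rng.normal(size=100)})

results = smf.ols("y ~ x1 + x2", data=df).fit()

# 95% confidence intervals, one row per coefficient
ci = results.conf_int(alpha=0.05)
ci.columns = ["lower", "upper"]

table = pd.DataFrame({"coef": results.params,
                      "lower": ci["lower"],
                      "upper": ci["upper"],
                      "p_value": results.pvalues})
print(table.round(3))
print(f"R-squared: {results.rsquared:.3f}")
```

An interval that excludes zero lines up with a small p-value for that coefficient, which is the kind of statement prediction-focused libraries do not give you out of the box.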

xkcd 539: Boyfriend

Other Statistical Methods

statsmodels provides many other statistical modeling tools beyond linear regression:

Generalized Linear Models (GLMs):

Time Series Models:

When Statistical Methods Beat ML:

Remember: Statistical models answer "why?" Machine learning models answer "what?" Both are valuable, but for different questions.

xkcd 1725: Correlation

"Correlation doesn't imply causation, but it does waggle its eyebrows suggestively and gesture furtively while mouthing 'look over there'."

LIVE DEMO

"Traditional" Machine Learning

Think of scikit-learn as the Swiss Army knife of machine learning - it has a tool for almost everything, it's reliable, and it's been around long enough that everyone knows how to use it.

Machine learning focuses on prediction rather than inference. While statistical models help you understand relationships, ML models help you make accurate predictions on new data.

Introduction to scikit-learn

scikit-learn is Python's standard machine learning library. It provides a consistent API across its estimators: fit to learn from data, predict to generate predictions, and transform to rescale or reshape features. This consistency makes it easy to try different algorithms and build complex pipelines.

The scikit-learn API Pattern:

# 1. Create model
model = SomeModel()

# 2. Fit on training data
model.fit(X_train, y_train)

# 3. Make predictions
predictions = model.predict(X_test)
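
Because every estimator follows the same pattern, swapping algorithms is mostly a matter of changing one line. The sketch below assumes a small synthetic classification problem and three arbitrarily chosen classifiers; the point is the shared `fit` / `score` calls, not the particular models.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

models = {
    "logistic regression": LogisticRegression(),
    "decision tree": DecisionTreeClassifier(random_state=42),
    "k-nearest neighbors": KNeighborsClassifier(),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)                 # same call for every model
    scores[name] = model.score(X_test, y_test)  # accuracy for classifiers
    print(f"{name}: {scores[name]:.3f}")
```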

Train/Test Split Visualization:

Original Dataset (1000 samples)
├── Training Set (800 samples, 80%)
│   └── Used to train the model
└── Test Set (200 samples, 20%)
    └── Used to evaluate model performance
        (Never seen during training!)

The golden rule: Never evaluate on data the model has seen during training. That's like giving a student the answers before the test and then being surprised they got 100%.

Why scikit-learn is the ML standard:

pandas compatibility: scikit-learn functions accept pandas DataFrames and Series directly. However, some operations (like fit_transform) may return NumPy arrays, so you may need to convert back to DataFrames if you want to preserve column names.
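
Here is a minimal sketch of that round trip, assuming a tiny made-up DataFrame with `age` and `income` columns. It shows that `fit_transform` hands back a NumPy array by default and one simple way to restore the column names.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up data for illustration
df = pd.DataFrame({"age": [25, 32, 47, 51],
                   "income": [40_000, 52_000, 81_000, 95_000]})

scaler = StandardScaler()
scaled = scaler.fit_transform(df)
print(type(scaled))  # a NumPy array by default, so column names are gone

# Wrap the array back into a DataFrame, reusing the original labels
scaled_df = pd.DataFrame(scaled, columns=df.columns, index=df.index)
print(scaled_df.round(2))
```

scikit-learn 1.2 and later also offer `set_output(transform="pandas")` on transformers, which should make this wrapping step unnecessary; check that your installed version supports it before relying on it.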

Linear Regression

Linear regression in scikit-learn is optimized for prediction rather than inference. It's faster and simpler than statsmodels but doesn't provide p-values or detailed diagnostics.

statsmodels vs scikit-learn Linear Regression:

| Feature | statsmodels | scikit-learn |
|---|---|---|
| Purpose | Statistical inference | Prediction |
| P-values | ✅ Yes | ❌ No |
| Confidence intervals | ✅ Yes | ❌ No |
| Model diagnostics | ✅ Comprehensive | ❌ Basic |
| Speed | Slower | Faster |
| Use when | Need to understand relationships | Need predictions |

Think of it this way: statsmodels answers "why?" while scikit-learn answers "what?"

Regularization: Ridge and Lasso add penalty terms to prevent overfitting. Ridge (L2) shrinks coefficients, Lasso (L1) can zero out coefficients (feature selection).

Regularization Comparison:

| Method | Penalty Type | Effect on Coefficients | Use When |
|---|---|---|---|
| Linear Regression | None | No shrinkage | Simple problems, no overfitting |
| Ridge (L2) | Sum of squared coefficients | Shrinks all coefficients | Many features, multicollinearity |
| Lasso (L1) | Sum of absolute coefficient values | Can zero out coefficients | Feature selection needed |
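
The difference between the penalties is easiest to see in the fitted coefficients. This sketch assumes synthetic data where only the first two of five features matter, and the `alpha` values are illustrative choices rather than recommendations.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))
# Only the first two features matter; the other three are pure noise
y = 2 + 3 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=100)

models = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=10.0),   # illustrative penalty strength
    "lasso": Lasso(alpha=0.5),    # illustrative penalty strength
}
coefs = {name: m.fit(X, y).coef_ for name, m in models.items()}
for name, c in coefs.items():
    print(f"{name:>6}: {np.round(c, 3)}")
```

Ridge should pull every coefficient toward zero without eliminating any, while Lasso should set the weakest ones to exactly zero. In real work you would pick `alpha` by cross-validation (for example with `RidgeCV` or `LassoCV`).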

Example:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np

# Create sample data
np.random.seed(42)
X = np.random.randn(100, 3)
y = 2 + 3 * X[:, 0] + 0.5 * X[:, 1] + np.random.randn(100)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Fit model
model = LinearRegression()
model.fit(X_train, y_train)

# Predictions and evaluation
predictions = model.predict(X_test)
score = model.score(X_test, y_test)  # R²
print(f"R² score: {score:.3f}")

Random Forest

Random Forest is an ensemble method that combines multiple decision trees. It's robust, handles non-linear relationships well, and provides feature importance scores.

Random Forest is like having a committee of decision trees vote on the answer. It's democracy in action - except the trees are actually smart and the voting actually works.

How Random Forest Works:

Training Data
    ↓
Create 100 Decision Trees (each trained on a random sample of rows and features)
    ↓
Tree 1: Predicts Class A
Tree 2: Predicts Class B
Tree 3: Predicts Class A
...
Tree 100: Predicts Class A
    ↓
Final Prediction: Class A (majority vote)

Each tree votes, and the most popular answer wins. It's like asking 100 people for directions - the majority is usually right!
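
You can inspect the committee directly, since a fitted forest keeps its individual trees in `estimators_`. The sketch below assumes synthetic binary data and tallies each tree's vote by hand. One caveat: scikit-learn's forest averages the trees' predicted probabilities rather than counting hard votes, so the hand-counted majority can occasionally differ from `forest.predict` on close calls.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Ask every tree for its own prediction on a few new points
X_new = rng.normal(size=(5, 4))
votes = np.array([tree.predict(X_new) for tree in forest.estimators_])

share_class_1 = votes.mean(axis=0)            # fraction of trees voting for class 1
majority = (share_class_1 > 0.5).astype(int)  # hand-counted majority vote

print("share of trees voting 1:", share_class_1)
print("majority vote:          ", majority)
print("forest.predict:         ", forest.predict(X_new))
```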

Why Random Forest?

Example:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np

# Create sample data
np.random.seed(42)
X = np.random.randn(200, 4)
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # Binary classification

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Fit model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Predictions and feature importance
predictions = model.predict(X_test)
importance = model.feature_importances_
print(f"Feature importance: {importance}")
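
A bare array of importances is hard to read once you have more than a few features. As a small sketch, assuming hypothetical column names `x0` to `x3`, you can pair the scores with feature names in a pandas Series and sort them.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
feature_names = ["x0", "x1", "x2", "x3"]  # hypothetical names for illustration
X = pd.DataFrame(rng.normal(size=(200, 4)), columns=feature_names)
y = (X["x0"] + X["x1"] > 0).astype(int)   # only x0 and x1 drive the label

model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

importance = pd.Series(model.feature_importances_, index=feature_names)
importance = importance.sort_values(ascending=False)
print(importance.round(3))
print(f"sum of importances: {importance.sum():.3f}")
```

The scores are normalized to sum to 1, so they describe relative importance within this model, not effect sizes you can compare across models.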

Other scikit-learn Methods

scikit-learn provides many other algorithms:

Classification:

Regression:

Unsupervised Learning:

Model Selection:

Pro tip: Start with Random Forest for most problems. It's like the "blue steel" of machine learning - reliable, effective, and works in most situations.

"Did you ever think that maybe there's more to life than being really, really, ridiculously good at machine learning?"

Really, really, really, ridiculously...

The scikit-learn Workflow:

flowchart LR
    A[Raw Data] --> B[Preprocessing]
    B --> C[Train/Test Split]
    C --> D[Fit Model]
    D --> E[Make Predictions]
    E --> F[Evaluate]
    F -->|Good enough?| G[Deploy]
    F -->|Not good enough?| H[Tune Hyperparameters]
    H --> D
    
    style D fill:#e1f5ff
    style E fill:#fff4e1
    style F fill:#ffe1f5
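
One possible translation of this workflow into code is sketched below, assuming synthetic data and an arbitrary grid of `C` values. It differs from the chart in one detail: the data is split first and the preprocessing lives inside a `Pipeline`, so the scaler is fitted only on training folds and nothing from the test set leaks into tuning.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Split first so the test set stays untouched during tuning
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

pipeline = Pipeline([
    ("scale", StandardScaler()),     # preprocessing
    ("clf", LogisticRegression()),   # model
])

# Tune hyperparameters with 5-fold cross-validation on the training set only
param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_train, y_train)

print("best C:", search.best_params_["clf__C"])
print(f"cross-validated accuracy: {search.best_score_:.3f}")
print(f"held-out test accuracy:   {search.score(X_test, y_test):.3f}")
```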

"I'm not an ambi-turner. I can't turn left. I can't turn right. But I CAN fit, predict, and score!"

The Secret Weapon: Gradient Boosting

Gradient boosting is like the Magnum of machine learning - it's the secret weapon that wins competitions and makes you look like a modeling genius.

Gradient boosting has dominated machine learning competitions (Kaggle, etc.) for years. It's particularly powerful for tabular data - the kind of structured data you work with in pandas DataFrames.

Why Gradient Boosting?

Performance on Tabular Data:

When to Choose Over Deep Learning:

Real-World Dominance:

Fun fact: XGBoost stands for "Extreme Gradient Boosting" - and it lives up to the name. It's so good that it's basically cheating (but legal cheating, which is the best kind).

Gradient Boosting: The Magnum of Machine Learning

Gradient boosting builds models sequentially, each one correcting the mistakes of the previous ones.

Model 1: Makes predictions (with errors)
Model 2: Predicts the errors of Model 1
Model 3: Predicts the errors that remain after Models 1 and 2
...
Final: Combine all models (like a modeling ensemble)

Gradient Boosting Step-by-Step:

| Step | What Happens | Example |
|---|---|---|
| 1 | Initial model makes predictions | Predicts: [5.0, 3.0, 7.0] |
| 2 | Calculate errors (residuals) | Actual: [5.5, 3.2, 6.8], Errors: [0.5, 0.2, -0.2] |
| 3 | New model predicts the errors | Predicts errors: [0.4, 0.3, -0.1] |
| 4 | Add error predictions to original | New predictions: [5.4, 3.3, 6.9] |
| 5 | Repeat until errors are minimized | Continue for N rounds |

Each new model focuses on what the previous model got wrong. It's like having a tutor who only helps with your mistakes!
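
The loop is short enough to write by hand. The following is a bare-bones sketch for squared-error regression, assuming synthetic sine-wave data and small scikit-learn trees as the weak learners. It is not how XGBoost is implemented internally. The `learning_rate` that scales each correction is an addition to the steps above, and its value here is illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.3  # shrinks each correction; an illustrative choice
n_rounds = 20

# Step 1: the initial model simply predicts the mean
prediction = np.full_like(y, y.mean())
mse_history = [np.mean((y - prediction) ** 2)]

for _ in range(n_rounds):
    residuals = y - prediction                                   # Step 2: current errors
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)  # Step 3: model the errors
    prediction += learning_rate * tree.predict(X)                # Step 4: add the correction
    mse_history.append(np.mean((y - prediction) ** 2))           # Step 5: repeat

print(f"MSE before boosting: {mse_history[0]:.4f}")
print(f"MSE after {n_rounds} rounds: {mse_history[-1]:.4f}")
```

The training error falls with every round, which is also why boosting needs a brake such as a validation set: left alone, it will keep fitting noise.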

"What is this? A model for ants? It needs to be at least... three times more accurate!"

xkcd 2400: Machine Learning

"Our machine learning model has achieved 99.9% accuracy on the training data!" "Great! How does it do on new data?" "Oh, we haven't tested that yet."

XGBoost Basics

XGBoost (Extreme Gradient Boosting) is one of the most widely used gradient boosting libraries. It's fast, accurate, and handles missing values and sparse inputs without extra preprocessing.

Key Hyperparameters:

Hyperparameter Effects:

| Hyperparameter | Too Low | Too High | Sweet Spot |
|---|---|---|---|
| n_estimators | Underfitting | Overfitting | 50-200 |
| max_depth | Can't learn complex patterns | Overfitting | 3-6 |
| learning_rate | Slow convergence | Unstable training | 0.01-0.3 |
| subsample | Less robust | More variance | 0.8-1.0 |

Finding the right hyperparameters is like tuning a car - too conservative and you're slow, too aggressive and you crash. The sweet spot is somewhere in between.

Early Stopping: Early stopping monitors a validation metric during training and halts once that metric has not improved for a set number of rounds. This limits overfitting because you keep the best model from an earlier round instead of continuing to fit noise in the training data.
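
The stopping rule itself is simple bookkeeping. Here is a library-free sketch; `early_stopping_round` is a hypothetical helper written for illustration, and the validation losses are made up. Libraries such as XGBoost implement their own version of this logic internally.

```python
def early_stopping_round(val_losses, patience):
    """Return (best_round, stop_round) for a list of per-round validation losses.

    Hypothetical helper for illustration only.
    """
    best_loss = float("inf")
    best_round = 0
    for round_idx, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_round = loss, round_idx
        elif round_idx - best_round >= patience:
            return best_round, round_idx  # no improvement for `patience` rounds
    return best_round, len(val_losses) - 1  # ran out of rounds without stopping


# Made-up validation losses: improve until round 4, then drift upward
losses = [0.60, 0.45, 0.38, 0.35, 0.34, 0.36, 0.35, 0.37, 0.39, 0.41]
best, stopped = early_stopping_round(losses, patience=3)
print(f"best round: {best}, training stopped at round: {stopped}")
# prints "best round: 4, training stopped at round: 7"
```

With a patience of 3, training runs three rounds past the best one before giving up, and the model from round 4 is the one worth keeping.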

Example:

import xgboost as xgb
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np

# Create sample data
np.random.seed(42)
X = np.random.randn(200, 5)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

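# Split data: hold out a test set, then carve a validation set out of the training data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25)

# Fit XGBoost model
model = xgb.XGBClassifier(
    n_estimators=100,
    max_depth=3,
    learning_rate=0.1,
    early_stopping_rounds=10  # constructor argument in recent XGBoost releases
)
# Early stopping watches the validation set, so the test set stays unseen
model.fit(X_train, y_train,
          eval_set=[(X_val, y_val)],
          verbose=False)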

# Predictions and feature importance
predictions = model.predict(X_test)
importance = model.feature_importances_
print(f"Feature importance: {importance}")

Feature importance is returned as an array showing the relative importance of each feature. Higher values indicate more important features for making predictions.

The Boosting Ecosystem

Beyond XGBoost, there are other powerful gradient boosting libraries:

LightGBM:

CatBoost:

Pro tip: Start with XGBoost. If you need speed, try LightGBM. If you have lots of categories, try CatBoost. But remember: they're all really, really good. It's like choosing between blue steel, magnum, and le tigre - they're all amazing, just slightly different.

The Boosting Family Tree:

Gradient Boosting
├── XGBoost (Extreme - the competition winner)
├── LightGBM (Fast - the speed demon)
└── CatBoost (Categorical - the category king)

"It's all about family. And by family, I mean gradient boosting."

Fast & Furious Family

LIVE DEMO

Deep Learning: The Modern Frontier

Deep learning is like the "Derelicte" of modeling - it's cutting-edge, it's flashy, and everyone wants to use it even when they probably shouldn't.

Deep learning uses neural networks with multiple layers to learn complex patterns. It excels at unstructured data: images, text, audio, sequences.

Why Deep Learning?

When Neural Networks Excel:

The Deep Learning vs Traditional ML Decision:

When NOT to Use Deep Learning:

Overfitting Visualization:

Good Fit:                    Overfitting:
Training Loss: 0.2          Training Loss: 0.05
Test Loss: 0.22             Test Loss: 0.35
                            ↑ Big gap = overfitting!

The model learned patterns    The model memorized training
that generalize well.        data but can't generalize.

Overfitting is like memorizing answers to practice problems but failing the actual test. The model performs great on training data but poorly on new data.
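
The same gap is easy to reproduce on a small scale. To keep the sketch fast it uses decision trees from scikit-learn instead of a neural network, on assumed synthetic sine-wave data: a shallow tree stands in for the well-behaved model and an unlimited-depth tree for the one that memorizes.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=300)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

results = {}
for name, depth in [("shallow (max_depth=3)", 3), ("unlimited depth", None)]:
    model = DecisionTreeRegressor(max_depth=depth, random_state=42)
    model.fit(X_train, y_train)
    train_loss = mean_squared_error(y_train, model.predict(X_train))
    test_loss = mean_squared_error(y_test, model.predict(X_test))
    results[name] = (train_loss, test_loss)
    print(f"{name:>22}: train={train_loss:.3f}  "
          f"test={test_loss:.3f}  gap={test_loss - train_loss:.3f}")
```

The unlimited tree drives its training loss to essentially zero by memorizing the noise, while its test loss stays well above that. The same diagnostic applies to neural networks: compare training and validation loss, not training loss alone.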

Remember: Deep learning is powerful, but it's not always the answer. Sometimes a simple model is the right model.

When to Use Deep Learning: A Decision Framework

flowchart TD
    A[Your Problem] --> B{Data Type?}
    B -->|Images| C[Use Deep Learning<br/>CNNs]
    B -->|Text| D[Use Deep Learning<br/>RNNs/Transformers]
    B -->|Audio| E[Use Deep Learning<br/>RNNs]
    B -->|Tabular| F{How much data?}
    F -->|Millions of rows| G{Consider Deep Learning}
    F -->|Thousands of rows| H[Use XGBoost<br/>or Random Forest]
    G -->|Complex patterns| I[Maybe Deep Learning]
    G -->|Simple patterns| H
    C --> J[Neural Networks]
    D --> J
    E --> J
    I --> J
    J --> K[Train for days<br/>Hope it works]
    
    style J fill:#e1ffe1
    style H fill:#fff4e1
    style K fill:#ffe1f5

"But why deep learning models?" "Seriously? I just told you that a moment ago."

xkcd 2169: Predictive Models

"Our model is 99% accurate!" "On what?" "On the data we trained it on." "And on new data?" "We're still working on that part."

TensorFlow/Keras: The High-Level Approach

TensorFlow is Google's deep learning framework. Keras (now integrated into TensorFlow) provides a high-level, user-friendly API for building neural networks.

Why TensorFlow/Keras?

Basic Workflow:

  1. Build model - Define architecture (layers)
  2. Compile model - Specify optimizer, loss function, metrics
  3. Train model - Fit on training data
  4. Evaluate model - Check performance on test data
  5. Make predictions - Use trained model

During training, you'll see loss decrease and accuracy (or other metrics) improve with each epoch. Monitor both training and validation metrics to detect overfitting.

Neural Network Architecture (Simple Example):

Input Layer (10 features)
    ↓
Hidden Layer 1 (64 neurons, ReLU)
    ↓
Hidden Layer 2 (32 neurons, ReLU)
    ↓
Output Layer (1 neuron, Sigmoid)

What Each Layer Does:

| Layer | Purpose | Example |
|---|---|---|
| Input | Receives raw features | 10 numeric features |
| Hidden 1 | Learns complex patterns | 64 neurons find non-linear relationships |
| Hidden 2 | Refines patterns | 32 neurons combine learned features |
| Output | Makes final prediction | 1 neuron outputs probability (0-1) |
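
Under the hood, each layer is a matrix multiplication followed by an activation function. The NumPy sketch below runs a forward pass through this 10-64-32-1 architecture using random, untrained weights (the initialization scale is an arbitrary assumption), purely to show the shapes and the arithmetic.

```python
import numpy as np

rng = np.random.default_rng(42)
layer_sizes = [10, 64, 32, 1]  # input, hidden 1, hidden 2, output

# Randomly initialized (untrained) weights and zero biases for each layer
weights = [rng.normal(scale=0.1, size=(n_in, n_out))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]


def relu(z):
    return np.maximum(z, 0.0)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def forward(X):
    """Forward pass: ReLU on the hidden layers, sigmoid on the output."""
    h1 = relu(X @ weights[0] + biases[0])        # shape (n, 64)
    h2 = relu(h1 @ weights[1] + biases[1])       # shape (n, 32)
    return sigmoid(h2 @ weights[2] + biases[2])  # shape (n, 1), values in (0, 1)


X = rng.normal(size=(5, 10))
probs = forward(X)
n_params = sum(w.size + b.size for w, b in zip(weights, biases))
print(probs.shape)  # (5, 1)
print(n_params)     # 10*64+64 + 64*32+32 + 32*1+1 = 2817
```

Training is the process of adjusting those 2,817 numbers so the outputs match the labels; frameworks like Keras handle that part for you.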

"I'm not an ambi-turner. I can't turn left. I can't turn right. But I CAN backpropagate!"

Example:

import tensorflow as tf
from tensorflow import keras
import numpy as np

# Create sample data
np.random.seed(42)
X_train = np.random.randn(1000, 10)
y_train = (X_train.sum(axis=1) > 0).astype(int)
X_test = np.random.randn(200, 10)
y_test = (X_test.sum(axis=1) > 0).astype(int)

# Build model
model = keras.Sequential([
    keras.Input(shape=(10,)),  # declare the input shape explicitly
    keras.layers.Dense(64, activation='relu'),
    keras.layers.Dense(32, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid')
])

# Compile model
model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
)

# Train model
model.fit(X_train, y_train, epochs=10, batch_size=32, verbose=0)

# Evaluate
loss, accuracy = model.evaluate(X_test, y_test, verbose=0)
print(f"Accuracy: {accuracy:.3f}")

PyTorch: The Research Standard

PyTorch is a deep learning framework originally developed at Facebook (now Meta) and now governed by the PyTorch Foundation. It's popular in research because of its Pythonic, flexible design and dynamic computation graphs.

PyTorch vs TensorFlow Philosophy:

Note: We're keeping PyTorch brief here since TensorFlow/Keras is more beginner-friendly. But PyTorch is excellent for research and when you need more control.
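
For a taste of that extra control, here is a minimal PyTorch sketch of the same 10-64-32-1 network on assumed synthetic data. The main visible difference from Keras is that you write the training loop yourself; the epoch count, learning rate, and full-batch updates are illustrative simplifications.

```python
import torch
from torch import nn

torch.manual_seed(42)
X_train = torch.randn(1000, 10)
y_train = (X_train.sum(dim=1) > 0).float().unsqueeze(1)  # shape (1000, 1)
X_test = torch.randn(200, 10)
y_test = (X_test.sum(dim=1) > 0).float().unsqueeze(1)

model = nn.Sequential(
    nn.Linear(10, 64), nn.ReLU(),
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, 1), nn.Sigmoid(),
)
loss_fn = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Explicit training loop (full-batch here for brevity)
for epoch in range(50):
    optimizer.zero_grad()                    # clear gradients from the previous step
    loss = loss_fn(model(X_train), y_train)  # forward pass and loss
    loss.backward()                          # backpropagation
    optimizer.step()                         # update the weights

# Evaluate on held-out data without tracking gradients
with torch.no_grad():
    predicted = (model(X_test) > 0.5).float()
    accuracy = (predicted == y_test).float().mean().item()
print(f"final training loss: {loss.item():.3f}, test accuracy: {accuracy:.3f}")
```

Every step that `model.fit` hides in Keras is spelled out here, which is more typing but makes it straightforward to customize any part of the loop.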

Modern Frameworks

Beyond TensorFlow and PyTorch, there are cutting-edge research frameworks:

JAX:

Other Research Frameworks:

The Deep Learning Ecosystem:

Deep Learning Frameworks
├── TensorFlow/Keras (Production - the reliable one)
├── PyTorch (Research - the flexible one)
└── JAX (Cutting-edge - the experimental one)

"What is this? A learning rate for ants? It needs to be at least... three times smaller!"

Model Performance Comparison (Humorous):

| Model Type | Training Time | Accuracy | Interpretability | When to Use |
|---|---|---|---|---|
| Linear Regression | ⚡ Very Fast | 📊 Good | ✅ High | Always start here |
| Random Forest | ⚡⚡ Fast | 📊📊 Very Good | ✅✅ Medium | Most problems |
| XGBoost | ⚡⚡ Fast | 📊📊📊 Excellent | ✅ Medium | Tabular data |
| Deep Learning | 🐌 Slow | 📊📊📊📊 Excellent* | ❌ Low | Images/text/audio |

Note: Training time varies significantly with dataset size. XGBoost is often faster than Random Forest on large datasets, but both are much faster than deep learning for tabular data. Deep learning accuracy is excellent only if you have enough data and time to tune it properly. Otherwise, it's just an expensive way to overfit.

"I'm pretty sure there's a lot more to modeling than being really, really, ridiculously good at deep learning." "But it helps!"

LIVE DEMO