Data Visualization: From Exploration to Communication

See BONUS.md for advanced topics.

Fun fact: The word "visualization" comes from the Latin "visus" meaning "sight." In data science, we're literally making data visible - turning numbers into stories that our eyes can understand and our brains can process.

xkcd 1945: Scientific Paper Graph Quality

"The data clearly shows that our hypothesis is correct, assuming we ignore all the data that doesn't support our hypothesis."

Edward Tufte's Principles of Data Visualization

Good visualization is like good writing - it should be clear, honest, and serve the reader (or viewer) first.

"Above all else, show the data." - Edward Tufte

Edward Tufte, a pioneer of information design, set out principles that still guide effective data visualization.

Essential Reading:

1. Data-Ink Ratio: Maximize the Data-Ink

The data-ink ratio is the proportion of ink (or pixels) used to present actual data compared to the total ink used in the entire display.

Data-Ink Ratio = Data-Ink / Total Ink Used

Tufte's Goal: Maximize this ratio by eliminating non-data ink (chartjunk).

Key Practices:

Data-Ink Ratio Comparison

Left: Low data-ink ratio with excessive decoration. Right: High data-ink ratio focusing on the data.
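
One way to raise the data-ink ratio in practice is sketched below with matplotlib. The category labels and values are made up for illustration, and the specific choices (gray bars, removed spines, direct labels) are just one reasonable reading of the principle, not a prescribed recipe.

```python
import matplotlib.pyplot as plt

# Hypothetical category totals, used only for illustration.
labels = ["A", "B", "C"]
values = [3, 7, 2]

fig, ax = plt.subplots(figsize=(5, 3))
bars = ax.bar(labels, values, color="0.4")  # one neutral gray, no decoration

# Remove non-data ink: the box around the plot and the y-axis ticks.
for side in ("top", "right", "left"):
    ax.spines[side].set_visible(False)
ax.set_yticks([])

# Label each bar directly so the reader does not need gridlines.
bar_labels = ax.bar_label(bars)
```

Call plt.show() or fig.savefig(...) to render the result. Direct labels replace the y-axis entirely here, which works for a handful of bars but may be too crowded for dense charts.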

2. Chartjunk: Eliminate Visual Noise

Chartjunk includes any visual elements that do not convey information:

3. Lie Factor: Maintain Visual Integrity

The lie factor measures how much a visualization distorts the data:

Lie Factor = (Size of effect shown in graphic) / (Size of effect in data)

Ideal Lie Factor: Close to 1.0 (no distortion)
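
The formula can be worked through numerically. The sketch below assumes "size of effect" means relative change between two values, and the numbers (a y-axis starting at 95, data rising from 100 to 110) are hypothetical.

```python
def relative_change(first: float, second: float) -> float:
    """Relative change from first to second, as a fraction of first."""
    return (second - first) / first


def lie_factor(graphic_first, graphic_second, data_first, data_second):
    """Size of effect shown in the graphic divided by size of effect in the data."""
    return relative_change(graphic_first, graphic_second) / relative_change(
        data_first, data_second
    )


# Hypothetical example: the data rise from 100 to 110 (a 10% increase),
# but the y-axis starts at 95, so the bars are drawn 5 and 15 units tall
# (a 200% increase in bar height).
factor = lie_factor(graphic_first=5, graphic_second=15,
                    data_first=100, data_second=110)
print(f"Lie factor: {factor:.1f}")
```

A factor near 20 means the picture shows a change roughly twenty times larger than the one in the data. Drawing the same bars from a zero baseline gives heights of 100 and 110 and a lie factor of 1.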

Common distortions to avoid:

4. Small Multiples: Show Comparisons

Use small, repeated charts with the same scale to enable easy comparison across categories or time.

Small Multiples Example

Small multiples enable quick visual comparison across multiple dimensions while maintaining consistent scales.
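
A minimal matplotlib sketch of small multiples follows. The region names and values are invented, and the key assumption is that sharex/sharey is how the consistent scale is enforced.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical monthly values for three regions, for illustration only.
rng = np.random.default_rng(0)
months = np.arange(1, 13)
regions = {name: rng.normal(loc, 5, size=12).cumsum()
           for name, loc in [("North", 2), ("South", 0), ("West", -1)]}

# sharex/sharey force every panel onto the same scale.
fig, axes = plt.subplots(1, len(regions), figsize=(9, 3),
                         sharex=True, sharey=True)
for ax, (name, series) in zip(axes, regions.items()):
    ax.plot(months, series)
    ax.set_title(name)

fig.supxlabel("Month")
fig.supylabel("Value")
fig.tight_layout()
```

Without the shared axes, each panel would autoscale to its own data and the panels could no longer be compared by eye.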

5. High-Resolution Data Graphics

Show as much detail as the data allows - don't oversimplify or aggregate unnecessarily.

Before/After Examples: Applying Tufte's Principles

Example 1: Bar Chart Redesign

Bar Chart Comparison

Before (left): Excessive colors, patterns, and heavy gridlines distract from the data. After (right): Clean design with direct labeling maximizes data-ink ratio.

Example 2: Line Chart with Truncated Axis (Lie Factor)

Lie Factor Example

Before (left): Truncated y-axis creates a high lie factor, exaggerating modest growth. After (right): Honest scale starting at zero shows true magnitude of change.
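
The before/after contrast can be reproduced with a few lines of matplotlib. The revenue figures below are hypothetical, and the axis limits are chosen only to make the effect obvious.

```python
import matplotlib.pyplot as plt

# Hypothetical yearly revenue with modest growth, for illustration only.
years = [2020, 2021, 2022, 2023]
revenue = [100, 102, 104, 106]

fig, (ax_truncated, ax_honest) = plt.subplots(1, 2, figsize=(9, 3))

ax_truncated.plot(years, revenue, marker="o")
ax_truncated.set_ylim(99, 107)   # truncated axis makes a 6% rise look steep
ax_truncated.set_title("Truncated y-axis")

ax_honest.plot(years, revenue, marker="o")
ax_honest.set_ylim(bottom=0)     # zero baseline shows the true magnitude
ax_honest.set_title("Zero-based y-axis")

for ax in (ax_truncated, ax_honest):
    ax.set_xticks(years)

fig.tight_layout()
```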

Color Palette Best Practices

Different data types require different color strategies:

Color Palette Guide
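
As a rough sketch, seaborn can generate one palette per data type. The split into qualitative, sequential, and diverging palettes is the conventional one and is assumed here; the palette names are examples, not the only sensible choices.

```python
import seaborn as sns

# Qualitative: distinct hues for unordered categories.
qualitative = sns.color_palette("colorblind", n_colors=4)

# Sequential: a single ordered ramp for numeric or ranked values.
sequential = sns.color_palette("viridis", n_colors=5)

# Diverging: two hues meeting at a neutral midpoint, for values around a center.
diverging = sns.color_palette("coolwarm", n_colors=7)

for name, palette in [("qualitative", qualitative),
                      ("sequential", sequential),
                      ("diverging", diverging)]:
    print(name, palette.as_hex())
```

In a notebook, sns.palplot(palette) draws the swatches so the three strategies can be compared visually.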

Color Selection Guidelines:

Additional Resources:

The Right Chart for the Job

Chart Selection Guide:

Chart Selection Guide

Different chart types are optimized for different data relationships and questions. Choose the right chart for your message.

The Visualization Ecosystem

Reality check: There are more Python visualization libraries than there are ways to mess up a bar chart. But don't worry - we'll focus on the essential tools that actually matter for daily data science work.

Python's visualization landscape has evolved dramatically. While matplotlib remains the foundation, modern tools like seaborn, altair, and plotnine offer more intuitive interfaces for common tasks.

Visual Guide - Python Visualization Stack:

FOUNDATION LAYER
┌─────────────────────────────────────┐
│           matplotlib                │  ← Low-level, highly customizable
│     (The foundation of everything)  │
└─────────────────────────────────────┘
                    ↑
                    │
            PANDAS LAYER
┌─────────────────────────────────────┐
│         DataFrame.plot()            │  ← Quick exploration, built on matplotlib
│     (DataFrame/Series plotting)     │
└─────────────────────────────────────┘
                    ↑
                    │
            STATISTICAL LAYER
┌─────────────────────────────────────┐
│           seaborn                   │  ← Statistical plots, beautiful defaults
│     (Built on matplotlib)           │
└─────────────────────────────────────┘
                    ↑
                    │
            MODERN LAYER
┌─────────────────────────────────────┐
│    altair (vega-lite)               │  ← Grammar of graphics, interactive
│    plotnine (ggplot2)               │  ← R's ggplot2 in Python
└─────────────────────────────────────┘

Choosing the Right Tool

When to use what:

Pro tip: Start with pandas for exploration, seaborn for analysis, matplotlib for customization, and modern tools for interactive/sharing needs.

matplotlib: Foundation Layer

Think of matplotlib as the foundation of your visualization house - you can build anything on it, but you need to understand the plumbing before you can install the fancy fixtures.

matplotlib is the bedrock of Python visualization. While it can be verbose, understanding its core concepts gives you the power to create any visualization you can imagine.

Figures and Subplots

Every matplotlib plot lives within a Figure object, which can contain multiple subplots (individual plot areas).

Reference:

Example:

import matplotlib.pyplot as plt
import numpy as np

# Create a figure with 2x2 subplots
fig, axes = plt.subplots(2, 2, figsize=(10, 8))

# Plot on each subplot
axes[0, 0].plot([1, 2, 3, 4], [1, 4, 2, 3])
axes[0, 0].set_title('Line Plot')

axes[0, 1].hist(np.random.normal(0, 1, 1000), bins=30)
axes[0, 1].set_title('Histogram')

axes[1, 0].scatter(np.random.randn(100), np.random.randn(100))
axes[1, 0].set_title('Scatter Plot')

axes[1, 1].bar(['A', 'B', 'C'], [3, 7, 2])
axes[1, 1].set_title('Bar Chart')

plt.tight_layout()
plt.show()

Matplotlib Subplots Example

Creating multiple subplots in a single figure allows for easy comparison across different visualization types.

Customizing Plots

matplotlib's power comes from its extensive customization options.

Reference:

Example:

# Create a customized plot
fig, ax = plt.subplots(figsize=(8, 6))

# Generate sample data
x = np.linspace(0, 10, 100)
y1 = np.sin(x)
y2 = np.cos(x)

# Plot with customization
ax.plot(x, y1, label='sin(x)', color='blue', linewidth=2)
ax.plot(x, y2, label='cos(x)', color='red', linewidth=2, linestyle='--')

# Customize appearance
ax.set_title('Trigonometric Functions')
ax.set_xlabel('X values')
ax.set_ylabel('Y values')
ax.grid(True, alpha=0.3)
ax.legend()

plt.tight_layout()
plt.show()

Matplotlib Customization Example

Customization allows you to create publication-quality plots with precise control over every visual element.
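
For publication output, the figure usually has to be exported at a fixed resolution. The sketch below writes a PNG to an in-memory buffer so it runs without touching the disk; the 300 dpi setting is a common print default, assumed here rather than required.

```python
import io

import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)
fig, ax = plt.subplots(figsize=(4, 3))
ax.plot(x, np.sin(x))
ax.set_xlabel("X values")
ax.set_ylabel("sin(x)")

# Render to an in-memory buffer; pass a filename such as "figure.png"
# (hypothetical) to write to disk instead.
buffer = io.BytesIO()
fig.savefig(buffer, format="png", dpi=300, bbox_inches="tight")
png_bytes = buffer.getvalue()
print(f"{len(png_bytes)} bytes written")
```

Vector formats such as format="svg" or format="pdf" work the same way and scale without loss.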

Colors, Markers, and Line Styles

matplotlib offers extensive control over visual elements.

Reference:

Colors:

Line Styles:

Markers:

Example:

# Demonstrate different styles
fig, ax = plt.subplots(figsize=(10, 6))

x = np.linspace(0, 10, 20)

# Different line styles and markers
ax.plot(x, x, 'o-', label='circles', color='blue', markersize=8)
ax.plot(x, x**0.5, 's--', label='squares', color='red', markersize=6)
ax.plot(x, np.log(x+1), '^-.', label='triangles', color='green', markersize=8)
ax.plot(x, np.sin(x), '*:', label='stars', color='purple', markersize=10)

ax.set_title('Different Line Styles and Markers')
ax.legend()
ax.grid(True, alpha=0.3)
plt.show()

Matplotlib Colors and Styles

matplotlib provides extensive options for colors, markers, and line styles to create visually distinct data series.
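
The short format strings used in the example above, such as 's--', combine a marker code and a line style code. The sketch below shows that a format string and the explicit keyword arguments produce the same styling; the data values are arbitrary.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
x = [0, 1, 2, 3]

# Format string: marker 's' (square) plus line style '--' (dashed).
(short_line,) = ax.plot(x, [0, 1, 4, 9], "s--", color="red")

# The same styling spelled out with keyword arguments.
(long_line,) = ax.plot(x, [0, 1, 2, 3], marker="s", linestyle="--", color="red")

print(short_line.get_marker(), short_line.get_linestyle())
print(long_line.get_marker(), long_line.get_linestyle())
```

Format strings are quicker to type during exploration, while keyword arguments tend to be easier to read in code that other people maintain.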

xkcd 833: Convincing

"And if you don't label your axes, I'm leaving you." - The importance of proper chart labeling, illustrated.

LIVE DEMO!

pandas: Quick Data Exploration

Think of pandas plotting as your data exploration Swiss Army knife - not the most specialized tool, but incredibly useful for getting a quick sense of your data.

pandas provides convenient plotting methods that build on matplotlib, perfect for quick data exploration.

Reference:

Example:

import pandas as pd
import numpy as np

# Create sample data
np.random.seed(42)
df = pd.DataFrame({
    'A': np.random.randn(100),
    'B': np.random.randn(100),
    'C': np.random.randn(100)
})

# Quick exploration with pandas
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Line plot
df.plot(ax=axes[0, 0], title='Line Plot')

# Histogram
df.plot(kind='hist', ax=axes[0, 1], alpha=0.7, title='Histogram')

# Scatter plot
df.plot(kind='scatter', x='A', y='B', ax=axes[1, 0], title='Scatter Plot')

# Box plot
df.plot(kind='box', ax=axes[1, 1], title='Box Plot')

plt.tight_layout()
plt.show()

Pandas Plotting Examples

pandas plotting methods provide quick, convenient visualization for data exploration with minimal code.
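
Because pandas plotting builds on matplotlib, the object returned by DataFrame.plot() is an ordinary matplotlib Axes. The sketch below uses made-up random-walk data and assumes the default matplotlib plotting backend.

```python
import numpy as np
import pandas as pd
from matplotlib.axes import Axes

rng = np.random.default_rng(42)
df = pd.DataFrame({"A": rng.normal(size=50).cumsum(),
                   "B": rng.normal(size=50).cumsum()})

ax = df.plot(title="Quick look")      # pandas draws the lines...
print(isinstance(ax, Axes))           # ...and hands back a matplotlib Axes

# From here on, ordinary matplotlib calls apply.
ax.set_xlabel("Observation")
ax.set_ylabel("Cumulative value")
ax.axhline(0, color="gray", linewidth=1)
```

This is what makes the "explore with pandas, polish with matplotlib" workflow practical: the quick plot does not need to be redrawn from scratch.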

DataFrame Plotting Options

Reference:

Example:

# Sales data example
sales_data = pd.DataFrame({
    'Month': ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun'],
    'Product_A': [100, 120, 110, 130, 140, 135],
    'Product_B': [80, 90, 95, 105, 110, 115],
    'Product_C': [60, 70, 75, 80, 85, 90]
})

# Set Month as index for better plotting
sales_data = sales_data.set_index('Month')

# Create subplots for each product
sales_data.plot(subplots=True, figsize=(10, 8), 
                title='Sales by Product Over Time',
                grid=True, legend=True)
plt.tight_layout()
plt.show()

seaborn: Statistical Graphics

seaborn is like having a data visualization expert sitting next to you, automatically choosing the right colors, styles, and statistical methods to make your plots look professional and informative.

seaborn builds on matplotlib to provide beautiful statistical visualizations with minimal code. It's the go-to choice for most data analysis tasks.

Reference:

Example:

import seaborn as sns

# Set seaborn style
sns.set_style('whitegrid')
tips = sns.load_dataset('tips')

# Create multiple plots
fig, axes = plt.subplots(2, 2, figsize=(12, 8))

# Scatter plot
sns.scatterplot(data=tips, x='total_bill', y='tip', 
                hue='time', ax=axes[0, 0])
axes[0, 0].set_title('Total Bill vs Tip')

# Box plot
sns.boxplot(data=tips, x='day', y='tip', ax=axes[0, 1])
axes[0, 1].set_title('Tip by Day')

# Histogram
sns.histplot(data=tips, x='total_bill', hue='time',
             alpha=0.7, ax=axes[1, 0])
axes[1, 0].set_title('Bill Distribution')

# Hide the unused fourth panel
axes[1, 1].set_visible(False)

plt.tight_layout()
plt.show()

Seaborn Statistical Plots

seaborn excels at creating beautiful statistical visualizations with automatic styling and color choices.

Advanced seaborn Features

Reference:

Example:

# Advanced seaborn visualizations

# Axes-level functions draw into an Axes you provide
fig, axes = plt.subplots(1, 2, figsize=(15, 6))

# Violin plot (shows distribution shape)
sns.violinplot(data=tips, x='day', y='tip', ax=axes[0])
axes[0].set_title('Tip Distribution by Day (Violin Plot)')

# Strip plot (shows individual points)
sns.stripplot(data=tips, x='day', y='tip', hue='time', ax=axes[1])
axes[1].set_title('Individual Tips by Day and Time')

fig.tight_layout()

# Figure-level functions create their own figure, so they take no ax argument
# Pair plot (shows all pairwise relationships); a subset keeps it readable
sample_data = tips.sample(50, random_state=42)
sns.pairplot(sample_data, hue='time', height=3)

# Joint plot (hexbin + marginal histograms)
sns.jointplot(data=tips, x='total_bill', y='tip', kind='hex')

plt.show()

Density Plots and Distribution Visualization

Density plots show the shape of your data distribution - they're like histograms but smoother, revealing patterns that might be hidden in discrete bins.

Density plots, also called KDE (kernel density estimate) plots, provide a smooth representation of a data distribution.

Reference:

Example:

# Create sample data with different distributions
np.random.seed(42)
normal_data = np.random.normal(0, 1, 1000)
bimodal_data = np.concatenate([
    np.random.normal(-2, 0.5, 500),
    np.random.normal(2, 0.5, 500)
])

# Density plots
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# pandas density plot
pd.Series(normal_data).plot.density(ax=axes[0, 0], title='Normal Distribution')
axes[0, 0].grid(True, alpha=0.3)

# seaborn density plot (seaborn functions take no title argument; set it on the Axes)
sns.kdeplot(x=normal_data, ax=axes[0, 1])
axes[0, 1].set_title('Normal Distribution (seaborn)')
axes[0, 1].grid(True, alpha=0.3)

# Bimodal distribution
sns.kdeplot(x=bimodal_data, ax=axes[1, 0])
axes[1, 0].set_title('Bimodal Distribution')
axes[1, 0].grid(True, alpha=0.3)

# Combined histogram and density
sns.histplot(x=normal_data, kde=True, ax=axes[1, 1])
axes[1, 1].set_title('Histogram + Density')
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
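
The smoothing behind a density plot can be sketched by hand with NumPy. This is a simplified illustration of Gaussian kernel density estimation, not the exact routine pandas or seaborn call; the bandwidth uses a Scott-style rule of thumb, which is an assumption made for this example.

```python
import numpy as np

rng = np.random.default_rng(42)
samples = rng.normal(0, 1, 1000)

# Bandwidth from a Scott-style rule of thumb: h = sigma * n^(-1/5).
n = samples.size
h = samples.std(ddof=1) * n ** (-1 / 5)

# Evaluate the estimate on a grid wide enough to cover both tails.
grid = np.linspace(-6, 6, 601)

# Place one Gaussian bump of width h on every sample and average them.
z = (grid[:, None] - samples[None, :]) / h
density = np.exp(-0.5 * z**2).sum(axis=1) / (n * h * np.sqrt(2 * np.pi))

# A density should integrate to roughly 1 (simple Riemann sum).
area = density.sum() * (grid[1] - grid[0])
print(f"bandwidth={h:.3f}, area under curve={area:.3f}")
```

A smaller h follows the data more closely and produces a bumpier curve, while a larger h smooths more and can hide features such as the two modes in the bimodal example.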

LIVE DEMO!

Modern Visualization Libraries

The Python visualization ecosystem is constantly evolving. While matplotlib and seaborn are the workhorses, modern libraries offer exciting new approaches.

vega-altair: Grammar of Graphics with Vega-Lite

altair takes a declarative approach: you describe how data columns map to visual properties rather than issuing drawing commands. It implements the grammar of graphics through Vega-Lite, is designed for interactive web-based visualizations, and supports multiple output formats.

Chart Creation and Mark Types

altair uses a simple pattern: create a chart, specify the mark type, and encode your data.

Reference:

Example:

import altair as alt
import pandas as pd

# Sample data
data = pd.DataFrame({
    'x': [1, 2, 3, 4, 5],
    'y': [2, 4, 1, 5, 3],
    'category': ['A', 'B', 'A', 'C', 'B']
})

# Scatter plot - shows relationships between variables
scatter = alt.Chart(data).mark_circle().encode(x='x', y='y')
scatter.show()

# Bar chart - compares values across categories  
bar = alt.Chart(data).mark_bar().encode(x='category', y='y')
bar.show()

# Line chart - shows trends over ordered data
line = alt.Chart(data).mark_line().encode(x='x', y='y')
line.show()

# Combined view using altair's concatenation
combined = alt.hconcat(scatter, bar, line)
combined.show()

Altair Basic Charts

Combined altair charts: scatter plot (left), bar chart (middle), line plot (right)

Data Encoding

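The .encode() method maps data columns to visual channels such as x, y, color, and size. A shorthand suffix declares each field's data type: :Q (quantitative), :N (nominal), :O (ordinal), or :T (temporal).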

Reference:

Example:

# Enhanced scatter plot with encoding
chart = alt.Chart(data).mark_circle().encode(
    x='x:Q',                    # Quantitative x-axis
    y='y:Q',                    # Quantitative y-axis
    color='category:N',         # Color by category
    size='y:Q',                 # Size by y-value
    tooltip=['x', 'y', 'category']  # Hover info
)

chart.show()

Altair Encoded Chart

Enhanced scatter plot with color encoding by category and size encoding by y-value

Interactive Features

altair provides built-in interactivity through the .interactive() method, which enables pan and zoom. Tooltips come from the tooltip encoding, and selections have to be defined separately.

Reference:

Example:

# Interactive scatter plot
interactive_chart = alt.Chart(data).mark_circle().encode(
    x='x:Q',
    y='y:Q', 
    color='category:N',
    tooltip=['x', 'y', 'category']
).interactive()

interactive_chart.show()

Advanced altair Features

Faceting and Layering

altair supports faceting (small multiples) and layering multiple mark types.

Reference:

Example:

# Faceted chart
faceted = alt.Chart(data).mark_circle().encode(
    x='x:Q',
    y='y:Q',
    color='category:N'
).facet('category:N', columns=2)

# Layered chart
base = alt.Chart(data).encode(x='x:Q', y='y:Q')
layered = alt.layer(
    base.mark_circle(color='lightblue'),
    base.mark_line(color='red').transform_regression('x', 'y')
)

Statistical Transformations

altair includes built-in statistical transformations.

Reference:

Example:

# Chart with regression line
regression = alt.Chart(data).mark_circle().encode(
    x='x:Q',
    y='y:Q'
) + alt.Chart(data).mark_line(color='red').transform_regression(
    'x', 'y'
).encode(x='x:Q', y='y:Q')
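
As a sanity check on what the regression line represents, the same straight-line fit can be computed directly with NumPy. This assumes transform_regression is left at its default linear method, and it reuses the five sample points from the altair examples.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 1, 5, 3])

# Degree-1 least-squares fit: y ≈ slope * x + intercept.
slope, intercept = np.polyfit(x, y, deg=1)
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

With only five scattered points the fitted slope is shallow (about 0.3), so the red line in the chart should look nearly flat.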

Export Formats

altair supports multiple output formats for different use cases.

Reference:

Example:

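# Export to different formats
# PNG/SVG export needs an extra rendering package (vl-convert-python in recent altair versions)
chart.save('scatter.png')      # Static bitmap
chart.save('scatter.svg')      # Static vector
interactive_chart.save('interactive.html')  # Interactive HTML (keeps pan/zoom and tooltips)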

Other Modern Tools: plotnine, Bokeh, and Plotly

plotnine: ggplot2 for Python

plotnine brings R's ggplot2 syntax to Python, perfect for those familiar with R.

Key Features:

Reference:

Example:

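from plotnine import ggplot, aes, geom_point, theme_minimal

# Simple scatter plot with ggplot2 syntax (tips is the DataFrame loaded in the seaborn section)
(ggplot(tips, aes(x='total_bill', y='tip', color='time'))
 + geom_point()
 + theme_minimal())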

Bokeh: Interactive Web Visualizations

Bokeh creates interactive web-based visualizations with rich interactivity.

Key Features:

Reference:

Plotly: Interactive Dashboards

Plotly excels at creating interactive dashboards and web applications.

Key Features:

Reference:

Example:

import plotly.express as px

# Simple interactive scatter plot
fig = px.scatter(tips, x='total_bill', y='tip', color='time',
                 title="Interactive Scatter Plot")
fig.show()

Tool Selection Guide

When to use each tool:

| Tool | Best For | Learning Curve | Interactivity | Output Formats | Grammar |
|------|----------|----------------|---------------|----------------|---------|
| matplotlib | Custom plots, publication quality | High | None | PNG/SVG/PDF | Imperative |
| seaborn | Statistical plots, beautiful defaults | Low | None | PNG/SVG/PDF | Imperative |
| pandas | Quick exploration, basic charts | Very Low | None | PNG/SVG/PDF | Imperative |
| altair | Interactive plots, grammar of graphics | Medium | Built-in | PNG/SVG/HTML/JSON | Declarative |
| plotnine | R users, layered approach | Medium | None | PNG/SVG/PDF | Declarative |
| bokeh | Interactive web visualizations | High | High | HTML/JS | Imperative |
| plotly | Dashboards, web applications | Medium | High | HTML/JS | Declarative |

xkcd 1138: Heatmap

"Every single map of the United States looks the same because it's just a population density map." - A reminder that your visualization should show meaningful patterns, not just expected distributions.

LIVE DEMO!