Skip to content

07: Data Cleaning

See Bonus for advanced topics:

  • Advanced matplotlib customization and publication-quality plots
  • Interactive visualization with Bokeh and Plotly
  • Statistical visualization with seaborn advanced features
  • Custom color palettes and themes
  • Animation and dynamic plots

Fun fact: The word “visualization” comes from the Latin “visus” meaning “sight.” In data science, we’re literally making data visible - turning numbers into stories that our eyes can understand and our brains can process.

  • matplotlib fundamentals (figures, subplots, customization)
  • Statistical visualizations with seaborn
  • pandas plotting for quick data exploration
  • Visualization principles and best practices
  • Tufte’s principles for effective data visualization
  • Modern visualization libraries (altair, plotnine)

xkcd 1945: Scientific Paper Graph Quality

“The data clearly shows that our hypothesis is correct, assuming we ignore all the data that doesn’t support our hypothesis.”

Edward Tufte’s Principles of Data Visualization

Section titled “Edward Tufte’s Principles of Data Visualization”

Good visualization is like good writing - it should be clear, honest, and serve the reader (or viewer) first.

“Above all else, show the data.” - Edward Tufte

Edward Tufte, the pioneer of information design, established fundamental principles that remain essential for effective data visualization.

Essential Reading:

1. Data-Ink Ratio: Maximize the Data-Ink

The data-ink ratio is the proportion of ink (or pixels) used to present actual data compared to the total ink used in the entire display.

Data-Ink Ratio = Data-Ink / Total Ink Used

Tufte’s Goal: Maximize this ratio by eliminating non-data ink (chartjunk).

Key Practices:

  • Remove unnecessary gridlines (or make them subtle)
  • Eliminate decorative elements that don’t convey information
  • Use direct labeling instead of legends when possible
  • Avoid 3D effects and shadows that distort perception
  • Remove redundant labels and tick marks

Data-Ink Ratio Comparison

Left: Low data-ink ratio with excessive decoration. Right: High data-ink ratio focusing on the data.

2. Chartjunk: Eliminate Visual Noise

Chartjunk includes any visual elements that do not convey information:

  • Unnecessary 3D effects
  • Heavy grid lines
  • Decorative fills and patterns
  • Excessive colors
  • Redundant labels

3. Lie Factor: Maintain Visual Integrity

The lie factor measures how much a visualization distorts the data:

Lie Factor = (Size of effect shown in graphic) / (Size of effect in data)

Ideal Lie Factor: Close to 1.0 (no distortion)

Common distortions to avoid:

  • Truncated y-axes that exaggerate differences
  • 3D perspective that distorts area/volume comparisons
  • Inconsistent scales
  • Cherry-picked time ranges

4. Small Multiples: Show Comparisons

Use small, repeated charts with the same scale to enable easy comparison across categories or time.

Small Multiples Example

Small multiples enable quick visual comparison across multiple dimensions while maintaining consistent scales.

5. High-Resolution Data Graphics

Show as much detail as the data allows - don’t oversimplify or aggregate unnecessarily.

Before/After Examples: Applying Tufte’s Principles

Section titled “Before/After Examples: Applying Tufte’s Principles”

Bar Chart Comparison

Before (left): Excessive colors, patterns, and heavy gridlines distract from the data. After (right): Clean design with direct labeling maximizes data-ink ratio.

Example 2: Line Chart with Truncated Axis (Lie Factor)

Section titled “Example 2: Line Chart with Truncated Axis (Lie Factor)”

Lie Factor Example

Before (left): Truncated y-axis creates a high lie factor, exaggerating modest growth. After (right): Honest scale starting at zero shows true magnitude of change.

Different data types require different color strategies:

Color Palette Guide

Color Selection Guidelines:

  • Sequential: Use for ordered data (temperature, age, income) - single hue gradient
  • Diverging: Use for data with meaningful zero/midpoint (profit/loss, correlation) - two contrasting hues
  • Qualitative: Use for categories with no inherent order - distinct, unrelated colors
  • Accessibility: Always test for colorblind accessibility using tools like ColorBrewer

Additional Resources:

Chart Selection Guide:

  • Line charts: Time series, trends over time
  • Bar charts: Categories, comparisons
  • Scatter plots: Relationships between two variables
  • Histograms: Distribution of single variable
  • Box plots: Distribution with outliers
  • Heatmaps: Patterns in 2D data
  • Pie charts: Parts of a whole (use sparingly!)

Chart Selection Guide

Different chart types are optimized for different data relationships and questions. Choose the right chart for your message.

Reality check: There are more Python visualization libraries than there are ways to mess up a bar chart. But don’t worry - we’ll focus on the essential tools that actually matter for daily data science work.

Python’s visualization landscape has evolved dramatically. While matplotlib remains the foundation, modern tools like seaborn, altair, and plotnine offer more intuitive interfaces for common tasks.

Visual Guide - Python Visualization Stack:

FOUNDATION LAYER
┌─────────────────────────────────────┐
│ matplotlib │ ← Low-level, highly customizable
│ (The foundation of everything) │
└─────────────────────────────────────┘
PANDAS LAYER
┌─────────────────────────────────────┐
│ pandas.plot() │ ← Quick exploration, built on matplotlib
│ (DataFrame/Series plotting) │
└─────────────────────────────────────┘
STATISTICAL LAYER
┌─────────────────────────────────────┐
│ seaborn │ ← Statistical plots, beautiful defaults
│ (Built on matplotlib) │
└─────────────────────────────────────┘
MODERN LAYER
┌─────────────────────────────────────┐
│ altair (vega-lite) │ ← Grammar of graphics, interactive
│ plotnine (ggplot2) │ ← R's ggplot2 in Python
└─────────────────────────────────────┘

When to use what:

  • pandas.plot() - Quick exploration, basic charts
  • matplotlib - Custom plots, publication quality, fine control
  • seaborn - Statistical plots, beautiful defaults, relationship analysis
  • altair - Interactive plots, grammar of graphics, web-ready
  • plotnine - If you know ggplot2, consistent API

Pro tip: Start with pandas for exploration, seaborn for analysis, matplotlib for customization, and modern tools for interactive/sharing needs.

Think of matplotlib as the foundation of your visualization house - you can build anything on it, but you need to understand the plumbing before you can install the fancy fixtures.

matplotlib is the bedrock of Python visualization. While it can be verbose, understanding its core concepts gives you the power to create any visualization you can imagine.

Every matplotlib plot lives within a Figure object, which can contain multiple subplots (individual plot areas).

Reference:

  • plt.figure(figsize=(width, height)) - Create a new figure
  • fig.add_subplot(rows, cols, position) - Add subplot to figure
  • plt.subplots(rows, cols) - Create figure with multiple subplots
  • fig.savefig('filename.png', dpi=300) - Save figure to file
  • plt.show() - Display the plot

Example:

import matplotlib.pyplot as plt
import numpy as np
# Create a figure with 2x2 subplots
fig, axes = plt.subplots(2, 2, figsize=(10, 8))
# Plot on each subplot
axes[0, 0].plot([1, 2, 3, 4], [1, 4, 2, 3])
axes[0, 0].set_title('Line Plot')
axes[0, 1].hist(np.random.normal(0, 1, 1000), bins=30)
axes[0, 1].set_title('Histogram')
axes[1, 0].scatter(np.random.randn(100), np.random.randn(100))
axes[1, 0].set_title('Scatter Plot')
axes[1, 1].bar(['A', 'B', 'C'], [3, 7, 2])
axes[1, 1].set_title('Bar Chart')
plt.tight_layout()
plt.show()

Matplotlib Subplots Example

Creating multiple subplots in a single figure allows for easy comparison across different visualization types.

matplotlib’s power comes from its extensive customization options.

Reference:

  • ax.set_title('Title') - Set plot title
  • ax.set_xlabel('X Label') - Set x-axis label
  • ax.set_ylabel('Y Label') - Set y-axis label
  • ax.set_xlim(min, max) - Set x-axis limits
  • ax.set_ylim(min, max) - Set y-axis limits
  • ax.grid(True) - Add grid lines
  • ax.legend() - Add legend
  • ax.set_style('seaborn') - Change plot style

Example:

# Create a customized plot
fig, ax = plt.subplots(figsize=(8, 6))
# Generate sample data
x = np.linspace(0, 10, 100)
y1 = np.sin(x)
y2 = np.cos(x)
# Plot with customization
ax.plot(x, y1, label='sin(x)', color='blue', linewidth=2)
ax.plot(x, y2, label='cos(x)', color='red', linewidth=2, linestyle='--')
# Customize appearance
ax.set_title('Trigonometric Functions')
ax.set_xlabel('X values')
ax.set_ylabel('Y values')
ax.grid(True, alpha=0.3)
ax.legend()
plt.tight_layout()
plt.show()

Matplotlib Customization Example

Customization allows you to create publication-quality plots with precise control over every visual element.

matplotlib offers extensive control over visual elements.

Reference:

Colors:

  • Named colors: 'red', 'blue', 'green'
  • Hex colors: '#FF5733', '#2E8B57'
  • RGB tuples: (0.1, 0.2, 0.5)

Line Styles:

  • '-' solid, '--' dashed, '-.' dash-dot, ':' dotted

Markers:

  • 'o' circle, 's' square, '^' triangle, '*' star

Example:

# Demonstrate different styles
fig, ax = plt.subplots(figsize=(10, 6))
x = np.linspace(0, 10, 20)
# Different line styles and markers
ax.plot(x, x, 'o-', label='circles', color='blue', markersize=8)
ax.plot(x, x**0.5, 's--', label='squares', color='red', markersize=6)
ax.plot(x, np.log(x+1), '^-.', label='triangles', color='green', markersize=8)
ax.plot(x, np.sin(x), '*:', label='stars', color='purple', markersize=10)
ax.set_title('Different Line Styles and Markers')
ax.legend()
ax.grid(True, alpha=0.3)
plt.show()

Matplotlib Colors and Styles

matplotlib provides extensive options for colors, markers, and line styles to create visually distinct data series.

xkcd 833: Convincing

“And if you don’t label your axes, I’m leaving you.” - The importance of proper chart labeling, illustrated.

Think of pandas plotting as your data exploration Swiss Army knife - not the most specialized tool, but incredibly useful for getting a quick sense of your data.

pandas provides convenient plotting methods that build on matplotlib, perfect for quick data exploration.

Reference:

  • df.plot() - Line plot (default)
  • df.plot(kind='bar') - Bar chart
  • df.plot(kind='hist') - Histogram
  • df.plot(kind='scatter', x='col1', y='col2') - Scatter plot
  • df.plot(kind='box') - Box plot
  • df.plot(kind='pie') - Pie chart

Example:

import pandas as pd
import numpy as np
# Create sample data
np.random.seed(42)
df = pd.DataFrame({
'A': np.random.randn(100),
'B': np.random.randn(100),
'C': np.random.randn(100)
})
# Quick exploration with pandas
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
# Line plot
df.plot(ax=axes[0, 0], title='Line Plot')
# Histogram
df.plot(kind='hist', ax=axes[0, 1], alpha=0.7, title='Histogram')
# Scatter plot
df.plot(kind='scatter', x='A', y='B', ax=axes[1, 0], title='Scatter Plot')
# Box plot
df.plot(kind='box', ax=axes[1, 1], title='Box Plot')
plt.tight_layout()
plt.show()

Pandas Plotting Examples

pandas plotting methods provide quick, convenient visualization for data exploration with minimal code.

Reference:

  • subplots=True - Create separate subplots for each column
  • figsize=(width, height) - Set figure size
  • title='Title' - Set plot title
  • xlabel='X Label' - Set x-axis label
  • ylabel='Y Label' - Set y-axis label
  • legend=True - Show legend
  • grid=True - Add grid lines

Example:

# Sales data example
sales_data = pd.DataFrame({
'Month': ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun'],
'Product_A': [100, 120, 110, 130, 140, 135],
'Product_B': [80, 90, 95, 105, 110, 115],
'Product_C': [60, 70, 75, 80, 85, 90]
})
# Set Month as index for better plotting
sales_data.set_index('Month', inplace=True)
# Create subplots for each product
sales_data.plot(subplots=True, figsize=(10, 8),
title='Sales by Product Over Time',
grid=True, legend=True)
plt.tight_layout()
plt.show()

seaborn is like having a data visualization expert sitting next to you, automatically choosing the right colors, styles, and statistical methods to make your plots look professional and informative.

seaborn builds on matplotlib to provide beautiful statistical visualizations with minimal code. It’s the go-to choice for most data analysis tasks.

Reference:

  • sns.set_style('whitegrid') - Set plot style
  • sns.set_palette('husl') - Set color palette
  • sns.scatterplot(x='col1', y='col2', data=df) - Scatter plot
  • sns.lineplot(x='col1', y='col2', data=df) - Line plot
  • sns.histplot(data=df, x='col') - Histogram
  • sns.boxplot(data=df, x='col1', y='col2') - Box plot
  • sns.heatmap(data=df) - Heatmap

Example:

import seaborn as sns
# Set seaborn style
sns.set_style('whitegrid')
tips = sns.load_dataset('tips')
# Create multiple plots
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
# Scatter plot
sns.scatterplot(data=tips, x='total_bill', y='tip',
hue='time', ax=axes[0, 0])
axes[0, 0].set_title('Total Bill vs Tip')
# Box plot
sns.boxplot(data=tips, x='day', y='tip', ax=axes[0, 1])
axes[0, 1].set_title('Tip by Day')
# Histogram
sns.histplot(data=tips, x='total_bill', hue='time',
alpha=0.7, ax=axes[1, 0])
axes[1, 0].set_title('Bill Distribution')
plt.tight_layout()
plt.show()

Seaborn Statistical Plots

seaborn excels at creating beautiful statistical visualizations with automatic styling and color choices.

Reference:

  • sns.pairplot(df) - Pairwise relationships
  • sns.jointplot(x='col1', y='col2', data=df) - Joint distribution
  • sns.violinplot(data=df, x='col1', y='col2') - Violin plot
  • sns.stripplot(data=df, x='col1', y='col2') - Strip plot
  • sns.catplot(kind='box', data=df, x='col1', y='col2') - Categorical plot

Example:

# Advanced seaborn visualizations
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
# Pair plot (shows all pairwise relationships)
# Note: This creates its own figure, so we'll use a subset
sample_data = tips.sample(50)
sns.pairplot(sample_data, hue='time', height=3)
# Joint plot (scatter + histograms)
sns.jointplot(data=tips, x='total_bill', y='tip', kind='hex')
# Violin plot (shows distribution shape)
sns.violinplot(data=tips, x='day', y='tip', ax=axes[0, 0])
axes[0, 0].set_title('Tip Distribution by Day (Violin Plot)')
# Strip plot (shows individual points)
sns.stripplot(data=tips, x='day', y='tip', hue='time', ax=axes[0, 1])
axes[0, 1].set_title('Individual Tips by Day and Time')
plt.tight_layout()
plt.show()

Density Plots and Distribution Visualization

Section titled “Density Plots and Distribution Visualization”

Density plots show the shape of your data distribution - they’re like histograms but smoother, revealing patterns that might be hidden in discrete bins.

Density plots (also called KDE - Kernel Density Estimation) provide a smooth representation of data distribution.

Reference:

  • df.plot.density() - Create density plot
  • sns.histplot(data=df, x='col', kde=True) - Histogram with density overlay
  • sns.kdeplot(data=df, x='col') - Pure density plot
  • sns.distplot(data=df, x='col') - Combined histogram and density

Example:

# Create sample data with different distributions
np.random.seed(42)
normal_data = np.random.normal(0, 1, 1000)
bimodal_data = np.concatenate([
np.random.normal(-2, 0.5, 500),
np.random.normal(2, 0.5, 500)
])
# Density plots
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
# pandas density plot
pd.Series(normal_data).plot.density(ax=axes[0, 0], title='Normal Distribution')
axes[0, 0].grid(True, alpha=0.3)
# seaborn density plot
sns.kdeplot(data=normal_data, ax=axes[0, 1], title='Normal Distribution (seaborn)')
axes[0, 1].grid(True, alpha=0.3)
# Bimodal distribution
sns.kdeplot(data=bimodal_data, ax=axes[1, 0], title='Bimodal Distribution')
axes[1, 0].grid(True, alpha=0.3)
# Combined histogram and density
sns.histplot(data=normal_data, kde=True, ax=axes[1, 1], title='Histogram + Density')
axes[1, 1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

The Python visualization ecosystem is constantly evolving. While matplotlib and seaborn are the workhorses, modern libraries offer exciting new approaches.

vega-altair: Grammar of Graphics with Vega-Lite

Section titled “vega-altair: Grammar of Graphics with Vega-Lite”

altair uses a declarative approach where you describe the data mapping rather than specifying drawing commands. It implements the grammar of graphics through Vega-Lite.

altair implements the Vega-Lite grammar of graphics, providing a declarative approach to creating statistical visualizations. It’s designed for interactive web-based visualizations and supports multiple output formats.

altair uses a simple pattern: create a chart, specify the mark type, and encode your data.

Reference:

  • alt.Chart(data) - Create chart from data
  • alt.Chart(data).mark_circle() - Scatter plot
  • alt.Chart(data).mark_bar() - Bar chart
  • alt.Chart(data).mark_line() - Line plot
  • alt.Chart(data).mark_area() - Area chart
  • alt.Chart(data).mark_rect() - Heatmap/rectangles
  • alt.Chart(data).mark_point() - Point plot

Example:

import altair as alt
import pandas as pd
# Sample data
data = pd.DataFrame({
'x': [1, 2, 3, 4, 5],
'y': [2, 4, 1, 5, 3],
'category': ['A', 'B', 'A', 'C', 'B']
})
# Scatter plot - shows relationships between variables
scatter = alt.Chart(data).mark_circle().encode(x='x', y='y')
scatter.show()
# Bar chart - compares values across categories
bar = alt.Chart(data).mark_bar().encode(x='category', y='y')
bar.show()
# Line chart - shows trends over ordered data
line = alt.Chart(data).mark_line().encode(x='x', y='y')
line.show()
# Combined view using altair's concatenation
combined = alt.hconcat(scatter, bar, line)
combined.show()

Altair Basic Charts

Combined altair charts: scatter plot (left), bar chart (middle), line plot (right)

The .encode() method maps data columns to visual properties using type annotations.

Reference:

  • x='column:Q' - Quantitative (continuous) data
  • y='column:O' - Ordinal (discrete) data
  • color='column:N' - Nominal (categorical) data
  • size='column:Q' - Size encoding
  • shape='column:N' - Shape encoding
  • tooltip=['col1', 'col2'] - Hover information

Example:

# Enhanced scatter plot with encoding
chart = alt.Chart(data).mark_circle().encode(
x='x:Q', # Quantitative x-axis
y='y:Q', # Quantitative y-axis
color='category:N', # Color by category
size='y:Q', # Size by y-value
tooltip=['x', 'y', 'category'] # Hover info
)
chart.show()

Altair Encoded Chart

Enhanced scatter plot with color encoding by category and size encoding by y-value

altair provides built-in interactivity through the .interactive() method, enabling zoom, pan, and selection.

Reference:

  • .interactive() - Enable zoom/pan
  • .add_selection() - Add selection tools
  • .transform_filter() - Filter data dynamically
  • alt.selection_interval() - Rectangle selection
  • alt.selection_single() - Point selection

Example:

# Interactive scatter plot
interactive_chart = alt.Chart(data).mark_circle().encode(
x='x:Q',
y='y:Q',
color='category:N',
tooltip=['x', 'y', 'category']
).interactive()
interactive_chart.show()

altair supports faceting (small multiples) and layering multiple mark types.

Reference:

  • .facet('column:N') - Create small multiples
  • alt.layer() - Combine multiple mark types
  • .properties(width=300, height=200) - Set chart dimensions

Example:

# Faceted chart
faceted = alt.Chart(data).mark_circle().encode(
x='x:Q',
y='y:Q',
color='category:N'
).facet('category:N', columns=2)
# Layered chart
base = alt.Chart(data).encode(x='x:Q', y='y:Q')
layered = alt.layer(
base.mark_circle(color='lightblue'),
base.mark_line(color='red').transform_regression('x', 'y')
)

altair includes built-in statistical transformations.

Reference:

  • .transform_regression('x', 'y') - Add regression line
  • .transform_aggregate() - Group and aggregate data
  • .transform_filter() - Filter data based on selections

Example:

# Chart with regression line
regression = alt.Chart(data).mark_circle().encode(
x='x:Q',
y='y:Q'
) + alt.Chart(data).mark_line(color='red').transform_regression(
'x', 'y'
).encode(x='x:Q', y='y:Q')

altair supports multiple output formats for different use cases.

Reference:

  • chart.save('plot.png') - Static bitmap (PNG)
  • chart.save('plot.svg') - Static vector (SVG)
  • chart.save('plot.html') - Interactive HTML
  • chart.save('plot.json') - Vega-Lite JSON specification

Example:

# Export to different formats
chart.save('scatter.png') # Static bitmap
chart.save('scatter.svg') # Static vector
chart.save('interactive.html') # Interactive HTML

Other Modern Tools: plotnine, Bokeh, and Plotly

Section titled “Other Modern Tools: plotnine, Bokeh, and Plotly”

plotnine brings R’s ggplot2 syntax to Python, perfect for those familiar with R.

Key Features:

  • Grammar of graphics approach
  • Layered plotting syntax
  • Familiar to R users
  • Statistical transformations

Reference:

  • ggplot(data, aes(x='col1', y='col2')) - Base plot
  • + geom_point() - Add scatter points
  • + geom_smooth() - Add trend line
  • + facet_wrap('~column') - Create facets
  • + theme_minimal() - Apply themes

Example:

# Simple scatter plot with ggplot2 syntax
(ggplot(tips, aes(x='total_bill', y='tip', color='time'))
+ geom_point()
+ theme_minimal())

Bokeh creates interactive web-based visualizations with rich interactivity.

Key Features:

  • High-performance interactive plots
  • Web-based output
  • Custom JavaScript callbacks
  • Server applications

Reference:

  • figure() - Create plot figure
  • .circle(), .line(), .bar() - Add glyphs
  • HoverTool() - Add hover information
  • output_notebook() - Display in Jupyter

Plotly excels at creating interactive dashboards and web applications.

Key Features:

  • Easy-to-use API
  • Rich interactivity
  • Dashboard capabilities
  • Multiple chart types

Reference:

  • px.scatter(), px.line(), px.bar() - Express functions
  • go.Figure() - Graph objects
  • make_subplots() - Multiple plots
  • .show() - Display plot

Example:

import plotly.express as px
# Simple interactive scatter plot
fig = px.scatter(tips, x='total_bill', y='tip', color='time',
title="Interactive Scatter Plot")
fig.show()

When to use each tool:

  • matplotlib: Custom plots, publication quality, fine control
  • seaborn: Statistical plots, beautiful defaults, relationship analysis
  • pandas: Quick exploration, basic charts
  • altair: Interactive plots, grammar of graphics, web-ready
  • plotnine: R users, layered approach, statistical plots
  • Bokeh: High-performance web visualizations, custom interactions
  • Plotly: Dashboards, web applications, easy interactivity
ToolBest ForLearning CurveInteractivityOutput FormatsGrammar
matplotlibCustom plots, publication qualityHighNonePNG/SVG/PDFImperative
seabornStatistical plots, beautiful defaultsLowNonePNG/SVG/PDFImperative
pandasQuick exploration, basic chartsVery LowNonePNG/SVG/PDFImperative
altairInteractive plots, grammar of graphicsMediumBuilt-inPNG/SVG/HTML/JSONDeclarative
plotnineR users, layered approachMediumNonePNG/SVG/PDFDeclarative
bokehInteractive web visualizationsHighHighHTML/JSImperative
plotlyDashboards, web applicationsMediumHighHTML/JSDeclarative

xkcd 1138: Heatmap

“Every single map of the United States looks the same because it’s just a population density map.” - A reminder that your visualization should show meaningful patterns, not just expected distributions.