04: Jupyter Notebooks and Pandas
See Bonus for advanced topics:
- Alignment and broadcasting tricks
- Function application patterns and method chaining notes
- Ranking strategies and working with duplicate index labels
- Extended I/O and performance tips (Excel, JSON, chunked reads)
Jupyter Notebooks: Interactive Data Analysis
In Lectures 1-3, you wrote Python scripts (.py files) that run top-to-bottom. Jupyter notebooks (.ipynb files) let you run code in any order, see results immediately, and mix code with documentation - perfect for data exploration and analysis. Think of .py files for production code and automation, and .ipynb files for interactive analysis and storytelling with data.
Jupyter notebooks provide an interactive environment for data analysis, combining code execution with rich output display. They’re essential for exploratory data analysis, prototyping, and sharing results with stakeholders.
Jupyter Notebook Interface
Jupyter notebooks organize work into cells that can contain code or markdown. This structure enables iterative analysis and clear documentation of the analytical process.
Reference:
- Code cells: Execute Python code and display output
- Markdown cells: Write documentation and explanations
- Cell execution: Shift+Enter (run and advance), Ctrl+Enter (run and stay)
- Cell management: A (add above), B (add below), DD (delete cell)
- Magic commands: %matplotlib inline (display plots), %timeit (time execution)
- Kernel: The Python interpreter that executes code cells
Example:
```python
# Cell 1: Import libraries
import pandas as pd
import matplotlib.pyplot as plt

# Cell 2: Load data
df = pd.read_csv('data.csv')
print(f"Data shape: {df.shape}")

# Cell 3: Quick visualization
df.head().plot(kind='bar')
plt.title('Sample Data')
plt.show()
```

display() vs print()
Think of print() as the reliable Honda Civic - works everywhere, gets the job done, but nothing fancy. display() is the sports car - looks amazing, handles beautifully, but only in the right environment (Jupyter).
When working with DataFrames and Series in Jupyter, you have two options for viewing output. print() works in any Python environment (scripts, notebooks, REPL) and shows plain text. display() is Jupyter-specific and renders rich HTML tables with formatting, making data much easier to read.
Reference:
- print(df) - Plain text output, works everywhere (scripts and notebooks)
- display(df) - Rich HTML table output, Jupyter notebooks only
- Use display() for DataFrames/Series in notebooks for better readability
- Use print() for simple values, strings, or when writing .py scripts
- display() will fail in regular Python scripts (.py files run from the terminal)
Example:
```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35]
})

# Plain text - works everywhere, but harder to read
print(df)
#       Name  Age
# 0    Alice   25
# 1      Bob   30
# 2  Charlie   35

# Rich formatting - Jupyter only, beautiful tables
display(df)
# [Renders as a formatted HTML table with borders, alternating row colors, etc.]

# For simple values, print() is fine
print(f"Total rows: {len(df)}")  # Total rows: 3
```

Pro tip: In Jupyter, if a DataFrame is the last line in a cell, it automatically displays without calling display() - but being explicit is clearer!
Jupyter Magic Commands
Magic commands are like cheat codes for Jupyter - they give you special powers that normal Python doesn’t have. Think of them as the “Konami code” of data science, except instead of getting 30 extra lives, you get inline plots and package installation!
Magic commands provide special functionality for notebook environments. They start with % and extend Jupyter’s capabilities for data analysis.
Reference:
- %matplotlib inline - Display plots within notebook cells
- %pwd - Print current working directory
- %ls - List directory contents
- %pip install package_name - Install Python packages
- %pip list - List installed packages
- %pip show package_name - Show package information
Example:
```python
# Display plots inline
%matplotlib inline
import matplotlib.pyplot as plt
plt.plot([1, 2, 3, 4])
plt.show()

# Install packages
%pip install pandas numpy matplotlib
```
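%timeit was listed in the interface reference earlier but never demonstrated. A quick sketch - the first two comment lines show notebook usage, and the standard-library timeit module is the plain-Python equivalent (the statement being timed is just an arbitrary example):

```python
# In a notebook cell, %timeit repeatedly runs one statement and reports the average:
#   %timeit sum(range(1000))
# Outside Jupyter, the standard-library timeit module does the same job:
import timeit

elapsed = timeit.timeit('sum(range(1000))', number=1000)
print(f"1000 runs took {elapsed:.4f} seconds")
```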
```python
# Check your location
%pwd
%ls
```

Jupyter Notebooks in VS Code
VS Code provides excellent Jupyter notebook support with integrated terminal, git management, and debugging capabilities. Understanding how to work with notebooks in VS Code is essential for modern data science workflows.
Reference:
- Create notebook: Ctrl+Shift+P → “Jupyter: Create New Jupyter Notebook”
- Open notebook: Ctrl+O → select .ipynb file
- Run cell: Shift+Enter (run and advance), Ctrl+Enter (run and stay)
- Add cell: Click the + button above cells, or right-click → “Insert Cell Above/Below”
- Delete cell: Right-click cell → “Delete Cell”, or select the cell and press the Delete key
- Save: Ctrl+S (auto-saves frequently)
Note: Keyboard shortcuts like A and DD only work in command mode (when cell is selected but not editing). For beginners, using the GUI buttons and right-click menu is more reliable.
Example:
```python
# VS Code automatically detects .ipynb files
# Just open any .ipynb file and start coding
import pandas as pd

df = pd.read_csv('data.csv')
display(df.head())
```

Kernel Management Basics
The kernel is the Python interpreter running your code. Sometimes it gets stuck or needs a fresh start.
Reference:
- Restart Kernel: Clears all variables from memory, fresh start
- Run All: Executes all cells from top to bottom
- Restart & Run All: Combines both - useful for testing if code works from scratch
- Common issues: Variable conflicts, memory issues, stuck computations
Example:
When to restart your kernel:
- Code behaves unexpectedly
- Variables seem to have wrong values
- “It worked before but now it doesn’t”
- Before submitting assignments (test it runs from top to bottom!)
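The first two symptoms in the list above usually come from running cells out of order or more than once. A minimal sketch of the trap, with plain Python standing in for two notebook cells and made-up numbers:

```python
# Imagine this line lives in its own notebook cell:
prices = [19.99, 5.49, 3.00]

# ...and this "cleaning" step lives in another cell.
# Every accidental re-run of just this cell silently drops one more row -
# no error, just quietly wrong data. Restart & Run All catches this.
prices = prices[:-1]

print(len(prices))  # 2 after one clean run; 1 after an accidental second run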
Notebook Outputs and Git: The Memory Problem
Warning: Jupyter notebooks are like that one friend who screenshots everything you text them. They save both your code AND all the outputs (results, data, plots) in the same file.
This means if you accidentally print your password, patient data, or that embarrassing test result, it’s now permanently saved in your notebook file. It’s like having a photographic memory of your most awkward moments.
Before committing to git (the “digital hygiene” moment):
- Clear all outputs - Click the “Clear All Outputs” button in VS Code
- Check for sensitive data - Make sure no personal information, passwords, or confidential data is visible
- Save the notebook - The outputs will be removed from the file
Example:
```python
# This output contains sensitive data and will be saved in the notebook
df = pd.read_csv('patient_data.csv')
display(df.head())  # Shows patient names, IDs, medical info
# Oops! Now everyone can see John Doe's blood pressure on GitHub
```

After running this code, the patient data will be visible in your notebook file. Always clear outputs before sharing or committing to git.
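Outputs live in the .ipynb file itself, which is just JSON. A sketch of what “Clear All Outputs” actually does to that JSON (the notebook dict below is a minimal hand-built stand-in, not a real file; with nbconvert installed, `jupyter nbconvert --clear-output --inplace notebook.ipynb` does this from the command line):

```python
import json

# A minimal notebook: one code cell whose output was saved to disk
nb = {
    "cells": [
        {
            "cell_type": "code",
            "source": ["display(df.head())"],
            "outputs": [{"output_type": "execute_result",
                         "data": {"text/plain": "...sensitive rows..."}}],
            "execution_count": 1,
        }
    ]
}

# Clearing outputs = emptying each code cell's "outputs" list
for cell in nb["cells"]:
    if cell["cell_type"] == "code":
        cell["outputs"] = []
        cell["execution_count"] = None

# The serialized file no longer contains the sensitive text
print("sensitive" in json.dumps(nb))  # False
```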
LIVE DEMO!
(Demo 1: Jupyter Basics - interface, cells, magic commands)
Introduction to Pandas
Pandas provides powerful data structures and tools for working with structured data. It’s built on NumPy but adds labeled axes and missing data handling, making it essential for data analysis workflows.
Fun fact: Pandas got its name from “Panel Data” - the economics term for time-series data. The cute bear logo? That’s just a happy accident that makes data science more approachable! 🐼
Pandas Data Structures
Think of pandas data structures like Russian nesting dolls - Series fit inside DataFrames, which can contain other DataFrames, which can contain… well, you get the idea. It’s data structures all the way down!
Reference:
- pd.Series(data, index=None, name=None) — create a labeled vector
- pd.DataFrame(data, index=None, columns=None) — create a table with labeled axes
- .index, .columns, .shape, .dtypes — inspect structure metadata
- .info(), .describe() — quick structure and summary diagnostics
Series
A Series is a one-dimensional labeled array that can hold any data type. It’s like a column in a spreadsheet with an index that labels each value, enabling powerful data manipulation and analysis.
Reference:
- pd.Series(data, index=None, name=None) - Create Series
- series.index - Access index labels
- series.values - Get values as NumPy array
- series.name - Get/set Series name
- series.dtype - Get data type
- series.size - Number of elements
- series.head(n=5) - First n elements
- series.tail(n=5) - Last n elements
- series.describe() - Summary statistics
- series.value_counts() - Value frequencies
Example:
```python
# Create Series
ages = pd.Series([25, 30, 35, 40], index=['Alice', 'Bob', 'Charlie', 'Diana'])
display(ages)         # Alice: 25, Bob: 30, Charlie: 35, Diana: 40
display(ages.index)   # ['Alice', 'Bob', 'Charlie', 'Diana']
display(ages.values)  # [25 30 35 40]

# Series operations
display(ages.mean())      # 32.5
display(ages.describe())  # count, mean, std, min, 25%, 50%, 75%, max
```

DataFrame
A DataFrame is a two-dimensional labeled data structure with columns of potentially different types. DataFrames combine multiple Series so you can operate column-wise or row-wise with shared labels. Because columns can hold different dtypes, keep an eye on schema when merging disparate sources.
Pro tip: DataFrames are like Excel spreadsheets, but with superpowers. They can handle millions of rows without breaking a sweat, and they never ask you to “save as” or complain about circular references.
Reference:
- pd.DataFrame(data, index=None, columns=None) - Create DataFrame
- df.index - Access row index
- df.columns - Access column names
- df.values - Get values as NumPy array
- df.shape - (rows, columns) tuple
- df.dtypes - Data types per column
- df.info() - Detailed information
- df.describe() - Summary statistics
- df.head(n=5) - First n rows
- df.tail(n=5) - Last n rows
- df.sample(n=5) - Random n rows
Example:
```python
# Create DataFrame
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Salary': [50000, 60000, 70000]
})

display(df.shape)       # (3, 3)
display(df.dtypes)      # Name: object, Age: int64, Salary: int64
display(df.describe())  # Summary statistics for numeric columns
```

Selecting Columns from a DataFrame
Thankfully, we don’t have to use the whole DataFrame at all times. We can select subsets of columns to work with instead.
Think of column selection like picking your team for dodgeball - sometimes you want just your star player (single column), sometimes you want your entire A-team (multiple columns), and sometimes you want everyone except that one person who always gets you out (column exclusion).
Reference:
- df['column_name'] - Select single column (returns Series)
- df[['col1', 'col2']] - Select multiple columns (returns DataFrame)
- df.column_name - Dot notation for single column (if name has no spaces/special chars)
- df.select_dtypes(include=['number']) - Select by data type
Example:
```python
# Create sample DataFrame
employees = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Age': [25, 30, 35, 28],
    'Salary': [50000, 60000, 70000, 55000],
    'Department': ['Engineering', 'Sales', 'Engineering', 'Marketing'],
    'Years_Experience': [2, 5, 8, 3]
})

# Single column selection (returns Series)
names = employees['Name']
display(type(names))  # <class 'pandas.core.series.Series'>
display(names)

# Single column selection (returns DataFrame)
names = employees[['Name']]
display(type(names))  # <class 'pandas.core.frame.DataFrame'>
display(names)

# Multiple column selection (returns DataFrame)
basic_info = employees[['Name', 'Age']]
display(type(basic_info))  # <class 'pandas.core.frame.DataFrame'>
display(basic_info)

# Dot notation (careful with column names!)
ages = employees.Age  # Works if column name is a valid Python identifier
display(ages)

# Select numeric columns only
numeric_data = employees.select_dtypes(include=['number'])
display(numeric_data.columns)  # ['Age', 'Salary', 'Years_Experience']
```

“This is confusing!” Tips:
- Use df['column'] for single columns when you want a Series
- Use df[['column']] for single columns when you want a DataFrame
- Bracket notation df['column'] is safer than dot notation df.column
- Multiple column selection always returns a DataFrame, even if you select just one column
Data Selection and Indexing
Data selection in pandas uses label-based and position-based indexing. Understanding these methods is crucial for data manipulation and analysis.
Warning: Indexing in pandas is like a choose-your-own-adventure book - there are multiple ways to get to the same destination, and sometimes you end up in a completely different story than you intended.
.loc vs .iloc
The key difference: .loc uses LABELS, .iloc uses POSITIONS (like list indexing).
Think of it this way: .loc is like asking “Give me the data for employee ‘Alice’” (using names/labels), while .iloc is like saying “Give me the data from the 2nd row” (using positions like 0, 1, 2…).
Critical Differences:
- What they use:
  - .loc[row_label, col_label] — uses index labels and column names
  - .iloc[row_position, col_position] — uses integer positions (0, 1, 2…)
- Slicing behavior:
  - .loc[1:3] — includes BOTH endpoints (1, 2, AND 3)
  - .iloc[1:3] — excludes the end (1, 2, but NOT 3)
- When to use:
  - Use .loc when you know the names/labels
  - Use .iloc when you know the positions
Reference:
- df.loc[row_labels, column_labels] — label-based selection
- df.iloc[row_positions, column_positions] — position-based selection
- df.query("expression") — filter with readable expressions
- df[df['column'] > value] — boolean masking
- df.isin(sequence) / df.between(left, right) — membership utilities
Example:
```python
# Create DataFrame with a custom index to show the difference clearly
employees = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Age': [25, 30, 35, 28],
    'Salary': [50000, 60000, 70000, 55000]
}, index=['emp001', 'emp002', 'emp003', 'emp004'])

display(employees)
#            Name  Age  Salary
# emp001    Alice   25   50000
# emp002      Bob   30   60000
# emp003  Charlie   35   70000
# emp004    Diana   28   55000

# .loc uses LABELS (index names and column names)
display(employees.loc['emp002', 'Name'])          # 'Bob' - using row label 'emp002'
display(employees.loc['emp001':'emp003', 'Age'])  # Ages for emp001, emp002, AND emp003 (inclusive!)

# .iloc uses POSITIONS (like list indexing: 0, 1, 2, 3...)
display(employees.iloc[1, 0])    # 'Bob' - position 1 (2nd row), position 0 (1st column)
display(employees.iloc[0:3, 1])  # Ages for positions 0, 1, 2 (NOT including position 3!)

# Common mistake: mixing them up!
# employees.loc[1, 'Name']     # ERROR! No row with label '1'
# employees.iloc['emp002', 0]  # ERROR! Can't use string labels with iloc

# Boolean indexing (works with either)
adults = employees[employees['Age'] >= 30]                 # Bob, Charlie (Diana is 28)
high_earners = employees.loc[employees['Salary'] > 60000]  # Charlie
```

Memory Trick:
- .loc = “L”abels (names, strings, custom indices)
- .iloc = “i”nteger “L”ocations (0, 1, 2, 3… like list positions)
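The reference above also lists df.query and between, which don’t appear in the example. A quick sketch, reusing the same made-up employees table:

```python
import pandas as pd

employees = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Age': [25, 30, 35, 28],
    'Salary': [50000, 60000, 70000, 55000],
})

# query(): readable string expressions instead of bracket gymnastics
mid_career = employees.query('Age >= 30 and Age <= 35')

# between(): inclusive range check on a Series (both endpoints included)
in_band = employees[employees['Salary'].between(55000, 65000)]

print(mid_career['Name'].tolist())  # ['Bob', 'Charlie']
print(in_band['Name'].tolist())     # ['Bob', 'Diana']
```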
Adding Columns to DataFrames
Derived columns capture new features and align automatically with existing indexes. Choose between direct assignment for quick mutations and .assign() when you need a non-mutating pipeline step.
Reference:
- df['column_name'] = expression — insert or overwrite a column
- df.assign(name=lambda d: ...) — add columns while returning a new DataFrame
- df.insert(loc, column, value) — control column ordering
- df.eval("new = ...") — expression syntax for simple arithmetic
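Two of the entries above, .insert() and .eval(), don’t show up in the example below, so here is a brief sketch with invented data:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Avery', 'Bianca'], 'Salary': [120000, 95000]})

# insert(): place a new column at a specific position (here: first)
df.insert(0, 'EmployeeID', [101, 102])

# eval(): string-expression arithmetic; returns a new DataFrame by default
df = df.eval('Monthly = Salary / 12')

print(df.columns.tolist())  # ['EmployeeID', 'Name', 'Salary', 'Monthly']
```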
Example:
```python
salaries = pd.DataFrame({
    'Name': ['Avery', 'Bianca', 'Cheng'],
    'Salary': [120000, 95000, 88000],
    'Department': ['Engineering', 'Sales', 'People Ops']
})

salaries['HourlyRate'] = salaries['Salary'] / 2080

augmented = salaries.assign(
    Bonus=lambda d: d['Salary'] * 0.05,
    TotalComp=lambda d: d['Salary'] + d['Salary'] * 0.05
)

display(augmented[['Name', 'HourlyRate', 'TotalComp']])
```

Handling Missing Data
Missing data decisions begin when files are read and continue throughout transformations. Detect gaps, decide whether to fill or drop, and capture read-time options so messy inputs become reproducible.
Reference:
- series.isnull(), series.notnull() — null diagnostics
- df.fillna(value) — replace missing values; df.ffill() / df.bfill() — forward/backward fill (the older fillna(method='ffill') spelling is deprecated)
- df.dropna(subset=..., how='any' | 'all') — remove incomplete rows
- df.isnull().sum() — column-level null counts
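The fill strategies above are easiest to see side by side. A sketch with made-up sensor readings (None becomes NaN in a float Series):

```python
import pandas as pd

readings = pd.Series([1.0, None, None, 4.0])

# Forward-fill carries the last seen value ahead;
# back-fill pulls the next observed value backward.
print(readings.ffill().tolist())  # [1.0, 1.0, 1.0, 4.0]
print(readings.bfill().tolist())  # [1.0, 4.0, 4.0, 4.0]
```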
Example:
```python
survey = pd.read_csv(
    'employee_survey.csv',
    na_values=['NA', 'NULL', '?'],
    keep_default_na=True,
    skiprows=[1],
    usecols=['name', 'role', 'bonus', 'start_date']
)

survey['bonus'] = survey['bonus'].fillna(0)
clean = survey.dropna(subset=['start_date'])
display(survey.isnull().sum())
```

Data Type Conversion
Converting data types is like trying to convince your data to identify as something else. Sometimes it works smoothly (string “42” → int 42), sometimes you need therapy (error handling), and sometimes it just refuses and throws a ValueError. Just remember: you can’t force a square peg into a round hole, but pandas will try its best!
Converting data to the correct types is essential for proper analysis. This includes converting strings to numbers, dates, and other appropriate types.
Reference:
- df.astype('int64') - Convert to integer
- df.astype('float64') - Convert to float
- df.astype('string') - Convert to string
- pd.to_datetime(df['date_column']) - Convert to datetime
- pd.to_numeric(df['column'], errors='coerce') - Convert to numeric, errors become NaN
Example:
```python
# Convert data types
df = pd.DataFrame({'A': ['1', '2', '3'], 'B': [4.5, 5.5, 6.5]})
df['A'] = df['A'].astype('int64')  # Convert string to integer
df['B'] = df['B'].astype('int64')  # Convert float to integer (truncates the decimals)
display(df.dtypes)  # A: int64, B: int64
```
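pd.to_datetime appears in the reference but not in the examples; a minimal sketch with invented dates:

```python
import pandas as pd

# Parse ISO date strings into proper datetime64 values,
# which unlocks the .dt accessor for components like month/year
events = pd.DataFrame({'when': ['2024-01-15', '2024-02-01', '2024-03-10']})
events['when'] = pd.to_datetime(events['when'])

print(events['when'].dt.month.tolist())  # [1, 2, 3]
```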
```python
# Handle conversion errors (a fresh Series - the df above only has 3 rows)
c = pd.Series(['1', '2', 'invalid', '4'])
c = pd.to_numeric(c, errors='coerce')  # Invalid becomes NaN
display(c)  # [1.0, 2.0, NaN, 4.0]
```

LIVE DEMO!
(Demo 2: Pandas DataFrames - selection, filtering, groupby, operations)
Essential Pandas Operations
Sorting Data
Sorting organizes your data by values or index, making it easier to find patterns and outliers. This is one of the most common operations in data analysis.
Reference:
- df.sort_values('column') - Sort by column values
- df.sort_values(['col1', 'col2']) - Sort by multiple columns
- ascending=False - Sort in descending order
- df.sort_index() - Sort by index
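Multi-column sorting from the list above also lets each column have its own direction via a list for ascending. A quick sketch with invented staff data:

```python
import pandas as pd

staff = pd.DataFrame({
    'Department': ['Sales', 'Eng', 'Sales', 'Eng'],
    'Age': [30, 25, 22, 35],
})

# Department A→Z first, then oldest-first within each department
ordered = staff.sort_values(['Department', 'Age'], ascending=[True, False])
print(ordered['Age'].tolist())  # [35, 25, 30, 22]
```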
Example:
```python
# Sort by age
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 20]})
df.sort_values('Age')                   # Sort by age (youngest first) - returns a sorted copy
df.sort_values('Age', ascending=False)  # Sort by age (oldest first)
```

Finding Unique Values
Exploring unique values helps you understand your data and identify categories. This is essential for data exploration and cleaning.
Reference:
- series.unique() - Get unique values
- series.nunique() - Count unique values
- series.value_counts() - Count how often each value appears
- series.isin(['A', 'B']) - Check if values are in a list
Example:
```python
# Find unique values
categories = pd.Series(['A', 'B', 'A', 'C', 'B'])
display(categories.unique())        # ['A' 'B' 'C']
display(categories.value_counts())  # A: 2, B: 2, C: 1

# Filter by membership
display(categories.isin(['A', 'B']))  # [True, True, True, False, True]
```

GroupBy
GroupBy enables split-apply-combine analytics: split data into groups, apply aggregations or filters, then combine results into aligned outputs. It replaces manual loops with expressive transformations.
Reference:
- df.groupby('col')[target].agg(['mean', 'count']) — summarize groups
- df.groupby(['col1', 'col2']).sum(numeric_only=True) — multi-key aggregation
- grouped.filter(lambda g: ...) — keep groups matching criteria
- grouped.transform(func) — broadcast aggregated values back to rows
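The last two entries, filter and transform, are the least obvious, so here is a hedged sketch with invented department data:

```python
import pandas as pd

comp = pd.DataFrame({
    'Department': ['Eng', 'Eng', 'Sales', 'Sales', 'People Ops'],
    'Salary': [120000, 115000, 95000, 98000, 88000],
})

# transform(): broadcast each group's mean back onto every row,
# so each row can be compared against its own department
comp['DeptMean'] = comp.groupby('Department')['Salary'].transform('mean')
comp['VsDeptMean'] = comp['Salary'] - comp['DeptMean']

# filter(): keep only departments with at least two employees
multi_person = comp.groupby('Department').filter(lambda g: len(g) >= 2)

print(comp['DeptMean'].tolist())  # [117500.0, 117500.0, 96500.0, 96500.0, 88000.0]
print(sorted(multi_person['Department'].unique()))  # ['Eng', 'Sales']
```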
Example:
```python
comp = pd.DataFrame({
    'Department': ['Eng', 'Eng', 'Sales', 'Sales', 'People Ops'],
    'Salary': [120000, 115000, 95000, 98000, 88000]
})

dept_summary = comp.groupby('Department')['Salary'].agg(['count', 'mean'])
display(dept_summary)
```

Data Loading and Storage
Pro tip: If you’re ever stuck with a weird file format, remember: “There’s a pandas function for that!” (Usually pd.read_[format]() - pandas is surprisingly comprehensive at reading data from just about anywhere)
Reading and Writing CSV Files
CSV files are the most common format for data analysis. Pandas makes it easy to read CSV files with sensible defaults.
Fun fact: CSV stands for “Comma-Separated Values,” but in reality, it’s more like “Comma-Separated Values (unless someone used semicolons, or tabs, or pipes, or any other delimiter they felt like using that day).”
Reference:
Reading
- pd.read_csv('filename.csv') - Read CSV file
- pd.read_csv('filename.csv', sep=';') - Custom separator
- pd.read_csv('filename.csv', header=0) - Specify header row
- pd.read_csv('filename.csv', index_col=0) - Use first column as index
- pd.read_csv(path, sep=',', header=0, index_col=None) — all of the above together (the defaults)
Writing
- df.to_csv('asdf.csv') - no frills
- df.to_csv('tab_separated.tsv', sep='\t') - custom separator (here, tab-separated)
- df.to_csv(path, index=False, na_rep='') — write cleaned results
- df.to_csv(path, columns=[...]) — export selected columns
Example:
```python
# Basic CSV reading
df = pd.read_csv('data.csv')
display(df.head())

# Custom options
df = pd.read_csv('data.csv', sep=';', index_col=0)
display(df.head())
```

Reading and Writing Other Formats
Reference:
Excel
- pd.read_excel(path, sheet_name=0, usecols=None) — ingest worksheets
- df.to_excel(path, sheet_name='Summary', index=False) — share spreadsheets
JSON
- pd.read_json(path_or_buf, orient='records') — parse structured payloads
- df.to_json(path_or_buf, orient='records', indent=2) — export API-friendly data
Example:
```python
# Excel
sales = pd.read_excel('quarterly_sales.xlsx', sheet_name='Q2')
sales.to_excel('quarterly_sales_clean.xlsx', sheet_name='Q2', index=False)

# JSON
payload = pd.read_json('inventory_payload.json')
payload.to_json('inventory_payload_export.json', orient='records', indent=2)
```

Note: Database access and SQL will be covered later in the course.
Data Exploration and Summary Statistics
Section titled “Data Exploration and Summary Statistics”Summary Statistics
Summary statistics provide a quick overview of your data’s distribution and characteristics. They’re essential for understanding data quality and identifying patterns.
Remember: Correlation does not imply causation! (But it’s still useful for understanding patterns in your data)
Reference:
- df.describe() - Summary statistics for numeric columns
- df.info() - Data types and memory usage
- df.shape - (rows, columns) tuple
- df.count() - Non-null values per column
- df.nunique() - Unique values per column
- df.memory_usage() - Memory usage per column
- df.isnull().sum() - Missing values per column
Example:
```python
# Summary statistics
df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [2, 4, 6, 8, 10]})
display(df.describe())      # count, mean, std, min, 25%, 50%, 75%, max
df.info()                   # Data types and memory usage (prints directly; returns None, so skip display())
display(df.isnull().sum())  # Missing values per column
```

Data Quality Assessment
Data quality assessment identifies issues like missing values, duplicates, and outliers. This is crucial for ensuring reliable analysis results.
Reference:
- df.isnull() - Boolean DataFrame: True for missing values
- df.notnull() - Boolean DataFrame: True for non-missing values
- df.duplicated() - Boolean Series: True for duplicate rows
- df.drop_duplicates() - Remove duplicate rows
- df.nunique() - Count unique values per column
- df.value_counts() - Value frequencies
- df.describe() - Summary statistics
Example:
```python
# Check for missing values
display(df.isnull().sum())  # Missing values per column

# Check for duplicates
display(df.duplicated().sum())   # Number of duplicate rows
df_clean = df.drop_duplicates()  # Remove duplicates

# Check data types
display(df.dtypes)  # Data types per column
df.info()           # Detailed information (prints directly)
```

LIVE DEMO!
(Demo 3: Data I/O - CSV, Excel, JSON, real-world cleaning workflow)