Bonus: Advanced Pandas
Bonus Content: Advanced Pandas Operations
This material builds on the lecture essentials in README.md. Revisit the lecture for Series/DataFrame basics, column creation, groupby introductions, and the core CSV workflow before tackling these extensions.
Data Alignment and Broadcasting
Section titled “Data Alignment and Broadcasting”See the lecture for baseline Series/DataFrame comparisons. This section deepens alignment control for mismatched labels, explicit reindexing, and DataFrame↔Series broadcasting so multi-source arithmetic remains predictable.
Reference:
- Automatic index alignment during arithmetic operations
- Series align on their index; DataFrames align on both axes when mixed
df.add(series, axis='columns', fill_value=0)/df.sub(series, axis='index', fill_value=0)/df.mul(...)/df.div(...)— combine while filling gaps.reindex()and.align(join='inner' | 'outer')— enforce explicit label sets before combining
Series align on their index, while DataFrames track both axes. When you mix them, pandas broadcasts along matching labels and introduces NaN wherever labels do not overlap, so make the intended axis explicit when you call arithmetic methods.
Brief Example:
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])s2 = pd.Series([4, 5, 6], index=['b', 'c', 'd'])print(s1 + s2) # a: NaN, b: 6, c: 8, d: NaNprint(s1.add(s2, fill_value=0)) # a: 1, b: 6, c: 8, d: 6
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})row = pd.Series([10, 20], index=['A', 'B'])print(df.sub(row, axis='columns')) # Broadcast Series across DataFrame columns
left, right = df.align(df.iloc[:2], join='outer', axis=0)print(left)print(right)
metrics = pd.DataFrame({ 'Salary': [120000, 95000, 88000], 'Bonus': [6000, 4750, 4400]}, index=['Avery', 'Bianca', 'Cheng'])
averages = metrics.mean()targets = pd.Series({'Salary': 110000, 'Bonus': 5000, 'Equity': 2000})
print(metrics.sub(averages, axis='columns')) # Broadcast Series down rowsprint(metrics.add(targets, fill_value=0)) # Fill missing labels before combiningFunction Application and Method Chaining
Section titled “Function Application and Method Chaining”The lecture covers vectorized operations; reach for the tools below when you need custom logic or pipeline readability. Combine apply/map with chaining helpers to keep transformations compact and transparent.
Reference:
df.apply(func)— column-wise by default; addaxis='columns'for row-wise logicseries.map(func)— element-level transformations with optional dict/Series mappingdf.applymap(func)— element-wise DataFrame transform (use sparingly for performance)- Chain helpers:
.assign(),.pipe(),.rename()to build fluent pipelines
Brief Example:
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})print(df.apply(lambda col: col.max() - col.min()))print(df.apply(lambda row: row.sum(), axis='columns'))print(df.map(lambda x: f"${x:.2f}"))
summary = ( df.assign(total=lambda d: d.sum(axis=1)) .pipe(lambda d: d / d['total'].max()))print(summary)Ranking Strategies
Section titled “Ranking Strategies”Go beyond simple sorting by assigning ranks, controlling tie behavior, and ranking across rows or columns. Pair these techniques with the lecture’s descriptions of sorting and unique values when you need ordered analytics.
Reference:
series.rank()— mean rank for ties (default)method='first' | 'min' | 'max' | 'dense'— tie handling strategiesascending=False— reverse rankingdf.rank(axis='columns')— rank across columns within each row
Brief Example:
s = pd.Series([7, -5, 7, 4, 2, 0, 4])print(s.rank()) # Mean rank for tiesprint(s.rank(method='first')) # First occurrence gets the better rankprint(s.rank(ascending=False)) # Reverse order rankingHandling Duplicate Index Labels
Section titled “Handling Duplicate Index Labels”The lecture covers duplicate detection at the column level. This section focuses on index semantics when labels repeat, plus tactics for normalizing or exploiting duplicates in time-series and log pipelines.
Reference:
index.is_unique— quick sanity check- Label-based selection returns Series/DataFrame when duplicates exist
duplicated()anddrop_duplicates()also operate on indexesgroupby(level=0)or.reset_index()can normalize duplicates
Brief Example:
import numpy as nps = pd.Series([1, 2, 3, 4, 5], index=['a', 'a', 'b', 'b', 'c'])print(s.index.is_unique) # Falseprint(s['a']) # Series with two valuesprint(s['c']) # Scalar
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'a', 'b', 'b', 'c'])print(df.loc['b']) # DataFrame with the duplicate rowsExtended I/O and Performance
Section titled “Extended I/O and Performance”Revisit the lecture for core CSV ingestion/export. Use this section when you need alternate formats, iterative processing, or performance tuning. Each technique notes the scenarios where it adds value. See the lecture for baseline syntax before layering these extensions.
Excel Integration
Section titled “Excel Integration”Ideal for business spreadsheets or multi-sheet workbooks.
# Read entire workbookdf = pd.read_excel('data.xlsx')print(df.head())
# Target a specific sheetdf_sales = pd.read_excel('data.xlsx', sheet_name='Sales')print(df_sales.head())
# Write results back outdf_sales.to_excel('sales_summary.xlsx', sheet_name='Summary', index=False)Use when you need Excel-native formatting or your stakeholders expect .xlsx outputs.
JSON and Semi-Structured Data
Section titled “JSON and Semi-Structured Data”Designed for API payloads or nested records.
df = pd.read_json('data.json')df.to_json('output.json', orient='records', indent=2)Switch the orient parameter ('records', 'columns', 'table', etc.) based on the consumer. JSON is great for web services and lightweight integrations.
SQL Databases
Section titled “SQL Databases”Ideal when data already resides in transactional stores. Requires a SQLAlchemy engine or DB-API connection.
# import sqlalchemy as sqla# engine = sqla.create_engine('sqlite:///mydb.sqlite')# query = "SELECT name, total, date FROM sales WHERE date >= '2024-01-01'"# df = pd.read_sql(query, engine)Once records are in a DataFrame, downstream cleaning and analysis mirrors the lecture workflow.
Reading Large Files in Chunks
Section titled “Reading Large Files in Chunks”Break massive files into bite-sized pieces without exhausting RAM.
chunk_iter = pd.read_csv('huge_file.csv', chunksize=10000)results = []
for chunk in chunk_iter: processed = chunk[chunk['value'] > 0].groupby('category').sum() results.append(processed)
final = pd.concat(results, axis=0).groupby(level=0).sum()Use chunking when files exceed memory or when you only need aggregated results. See lecture fundamentals for basic read_csv; layer this pattern when datasets push RAM limits.
Advanced CSV Options
Section titled “Advanced CSV Options”Tame messy inputs with custom NA markers, delimiters, or sampling.
df = pd.read_csv( 'data.csv', na_values=['NA', 'NULL', 'missing', '?'], keep_default_na=True)
preview = pd.read_csv('large_file.csv', nrows=100)print(preview.head())Start with the lecture’s clean CSV example, then layer these options as you encounter real-world quirks.