Bonus Content: Advanced Pandas Operations

This material builds on the lecture essentials in README.md. Revisit the lecture for Series/DataFrame basics, column creation, groupby introductions, and the core CSV workflow before tackling these extensions.


See the lecture for baseline Series/DataFrame comparisons. This section deepens alignment control for mismatched labels, explicit reindexing, and DataFrame↔Series broadcasting so multi-source arithmetic remains predictable.

Reference:

  • Automatic index alignment during arithmetic operations
  • Series align on their index; DataFrames align on both axes when mixed
  • df.add(other, fill_value=0) / df.sub(...) / df.mul(...) / df.div(...) — combine two DataFrames (or two Series) while filling gaps; with a Series operand, pass axis='index' or axis='columns' to choose the broadcast axis (fill_value is not supported there, so reindex first)
  • .reindex() and .align(join='inner' | 'outer') — enforce explicit label sets before combining

Series align on their index, while DataFrames track both axes. When you mix them, pandas broadcasts along matching labels and introduces NaN wherever labels do not overlap, so make the intended axis explicit when you call arithmetic methods.

Brief Example:

import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([4, 5, 6], index=['b', 'c', 'd'])
print(s1 + s2)                   # a: NaN, b: 6, c: 8, d: NaN
print(s1.add(s2, fill_value=0))  # a: 1, b: 6, c: 8, d: 6
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
row = pd.Series([10, 20], index=['A', 'B'])
print(df.sub(row, axis='columns'))  # Broadcast Series across DataFrame columns
left, right = df.align(df.iloc[:2], join='outer', axis=0)
print(left)
print(right)
metrics = pd.DataFrame({
    'Salary': [120000, 95000, 88000],
    'Bonus': [6000, 4750, 4400]
}, index=['Avery', 'Bianca', 'Cheng'])
averages = metrics.mean()
targets = pd.Series({'Salary': 110000, 'Bonus': 5000, 'Equity': 2000})
print(metrics.sub(averages, axis='columns'))  # Broadcast Series down rows
# fill_value is not supported for DataFrame-Series arithmetic, so reindex first
all_cols = metrics.columns.union(targets.index)
print(metrics.reindex(columns=all_cols, fill_value=0).add(targets))  # Fill missing labels before combining
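
The .reindex() route from the Reference list, shown with the same s1 and s2 — a minimal sketch where the shared label set is chosen by hand:

calendar = ['a', 'b', 'c', 'd']  # Explicit label set covering both Series
print(s1.reindex(calendar, fill_value=0) + s2.reindex(calendar, fill_value=0))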

The lecture covers vectorized operations; reach for the tools below when you need custom logic or pipeline readability. Combine apply/map with chaining helpers to keep transformations compact and transparent.

Reference:

  • df.apply(func) — column-wise by default; add axis='columns' for row-wise logic
  • series.map(func) — element-level transformations with optional dict/Series mapping
  • df.map(func) — element-wise DataFrame transform, renamed from applymap in pandas 2.1 (use sparingly for performance)
  • Chain helpers: .assign(), .pipe(), .rename() to build fluent pipelines

Brief Example:

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
print(df.apply(lambda col: col.max() - col.min()))      # Column-wise by default
print(df.apply(lambda row: row.sum(), axis='columns'))  # Row-wise
print(df.map(lambda x: f"${x:.2f}"))                    # Element-wise (pandas >= 2.1)
summary = (
    df.assign(total=lambda d: d.sum(axis=1))
      .pipe(lambda d: d / d['total'].max())
)
print(summary)
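
The Reference list also mentions dict/Series mapping with series.map; a minimal sketch (the code-to-label table is invented for illustration):

status = pd.Series(['a', 'p', 'a', 'x'])
labels = {'a': 'active', 'p': 'pending'}  # Hypothetical lookup table
print(status.map(labels))                 # Unmatched codes ('x') become NaN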

Go beyond simple sorting by assigning ranks, controlling tie behavior, and ranking across rows or columns. Pair these techniques with the lecture’s descriptions of sorting and unique values when you need ordered analytics.

Reference:

  • series.rank() — mean rank for ties (default)
  • method='first' | 'min' | 'max' | 'dense' — tie handling strategies
  • ascending=False — reverse ranking
  • df.rank(axis='columns') — rank across columns within each row

Brief Example:

s = pd.Series([7, -5, 7, 4, 2, 0, 4])
print(s.rank()) # Mean rank for ties
print(s.rank(method='first')) # First occurrence gets the better rank
print(s.rank(ascending=False)) # Reverse order ranking
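
The row-wise variant from the Reference list, plus the dense tie strategy, sketched on small inputs:

df = pd.DataFrame({'A': [7, 1], 'B': [2, 5], 'C': [4, 3]})
print(df.rank(axis='columns'))  # Rank A/B/C within each row
print(s.rank(method='dense'))   # Ties share a rank and no ranks are skipped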

The lecture covers duplicate detection at the column level. This section focuses on index semantics when labels repeat, plus tactics for normalizing or exploiting duplicates in time-series and log pipelines.

Reference:

  • index.is_unique — quick sanity check
  • Label-based selection returns Series/DataFrame when duplicates exist
  • duplicated() and drop_duplicates() also operate on indexes
  • groupby(level=0) or .reset_index() can normalize duplicates

Brief Example:

import numpy as np
s = pd.Series([1, 2, 3, 4, 5], index=['a', 'a', 'b', 'b', 'c'])
print(s.index.is_unique) # False
print(s['a']) # Series with two values
print(s['c']) # Scalar
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'a', 'b', 'b', 'c'])
print(df.loc['b']) # DataFrame with the duplicate rows
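
Two of the normalization tactics from the Reference list, as a minimal sketch on the same s:

print(s.groupby(level=0).mean())             # Collapse duplicate labels by aggregating
print(s[~s.index.duplicated(keep='first')])  # Keep only the first row per label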

Revisit the lecture for core CSV ingestion/export and the baseline syntax. Use this section when you need alternate formats, iterative processing, or performance tuning; each technique notes the scenarios where it adds value.

Excel files are ideal for business spreadsheets or multi-sheet workbooks.

# Read the first sheet (the default)
df = pd.read_excel('data.xlsx')
print(df.head())
# Target a specific sheet
df_sales = pd.read_excel('data.xlsx', sheet_name='Sales')
print(df_sales.head())
# Write results back out
df_sales.to_excel('sales_summary.xlsx', sheet_name='Summary', index=False)

Use when you need Excel-native formatting or your stakeholders expect .xlsx outputs.
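
If you genuinely need every sheet at once, sheet_name=None returns a dict of DataFrames keyed by sheet name (reusing the file from the example above):

sheets = pd.read_excel('data.xlsx', sheet_name=None)
for name, frame in sheets.items():
    print(name, frame.shape)  # One DataFrame per sheet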

JSON is designed for API payloads or nested records.

df = pd.read_json('data.json')
df.to_json('output.json', orient='records', indent=2)

Switch the orient parameter ('records', 'columns', 'table', etc.) based on the consumer. JSON is great for web services and lightweight integrations.
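
A minimal sketch of how orient changes the round trip, using an in-memory buffer so nothing touches disk (the toy frame is invented for illustration):

from io import StringIO
tiny = pd.DataFrame({'name': ['Ada', 'Grace'], 'score': [95, 98]})
payload = tiny.to_json(orient='records')
print(payload)  # [{"name":"Ada","score":95},{"name":"Grace","score":98}]
print(pd.read_json(StringIO(payload), orient='records'))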

SQL readers are ideal when data already resides in transactional stores; they require a SQLAlchemy engine or DB-API connection.

# import sqlalchemy as sqla
# engine = sqla.create_engine('sqlite:///mydb.sqlite')
# query = "SELECT name, total, date FROM sales WHERE date >= '2024-01-01'"
# df = pd.read_sql(query, engine)

Once records are in a DataFrame, downstream cleaning and analysis mirror the lecture workflow.
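
For a self-contained dry run of the same pattern, the standard-library sqlite3 module also satisfies the DB-API requirement (the table and rows below are invented for illustration):

import sqlite3
conn = sqlite3.connect(':memory:')  # Throwaway in-memory database
pd.DataFrame({'name': ['Avery', 'Bianca'], 'total': [1200, 950]}).to_sql('sales', conn, index=False)
print(pd.read_sql('SELECT name, total FROM sales', conn))
conn.close()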

Break massive files into bite-sized pieces without exhausting RAM.

chunk_iter = pd.read_csv('huge_file.csv', chunksize=10000)
results = []
for chunk in chunk_iter:
    processed = chunk[chunk['value'] > 0].groupby('category').sum()
    results.append(processed)
final = pd.concat(results, axis=0).groupby(level=0).sum()

Use chunking when files exceed memory or when you only need aggregated results; see the lecture for the baseline read_csv workflow before layering this pattern.

Tame messy inputs with custom NA markers, delimiters, or sampling.

df = pd.read_csv(
    'data.csv',
    na_values=['NA', 'NULL', 'missing', '?'],
    keep_default_na=True
)
preview = pd.read_csv('large_file.csv', nrows=100)
print(preview.head())

Start with the lecture’s clean CSV example, then layer these options as you encounter real-world quirks.
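
The delimiter side of that advice, sketched against a hypothetical semicolon-separated export:

df_eu = pd.read_csv('data_eu.csv', sep=';', decimal=',')  # Assumed European-style file
print(df_eu.head())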