git gud with Version Control

VS Code Basics (GUI-first)

We'll start in the editor so Git makes visual sense later. No JSON needed—just the VS Code interface.

Palette Cleanse: Command Palette & Quick Open

Themes and Schemes: Make it Py‑pretty

Meet the Main Bars

Core Panes You’ll Use

Settings (GUI) you’ll toggle today

Recommended Extensions (install via View → Extensions)

Break(points) the Ice: 5‑minute hands‑on

  1. Change the Color Theme (Preferences: Color Theme)
  2. Install “Python” and “Markdown All in One”
  3. Turn on “Format on Save” in Settings (GUI)
  4. Open a .py file → add a breakpoint (click gutter) → Run → Start Debugging
  5. Open a .md file → right‑click → Open Preview to the Side
  6. Make a small edit → View → Source Control → stage, commit (GUI)

Git Version Control

xkcd 1597: Git

Don't worry - we're taking a different approach than that xkcd suggests!

Why Version Control Matters

The Problem Without Version Control

Picture this: You're working on a data analysis. You create these files:

Sound familiar? Now imagine collaborating with teammates doing the same thing. Chaos!

The Git Solution

Git tracks every change, letting you see what changed, restore versions, work in parallel, collaborate, and avoid losing work. Infinite undo plus collaboration.

Git Concepts - The Mental Model

Repository (Repo)

Your project folder that Git tracks. Contains your files plus a hidden .git folder with all the version history.

Think: "This entire folder is under Git management."

Commit

A saved snapshot of your project at a specific point in time. Like saving a game - you can always come back to this exact state.

Think: "I'm saving my progress with a description of what I accomplished."

Remote

The version of your repository stored on GitHub (or similar service). Your local computer has a copy, GitHub has a copy, your teammates have copies.

Think: "The shared version everyone can access."

Branch

A parallel timeline for your project. The main branch contains your official version, feature branches contain experimental work.

Think: "I'm trying something new without risking the working version."

We'll focus on the main branch today - branches come later!

Reference:

Git Branches

Essential Git Commands

Basic Git commands let you control what changes are committed using a three-stage workflow: working directory, staging area, repository.
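The three stages can be pictured with a toy model in plain Python. This is only a sketch of the idea, not how Git actually stores data: the names `working_dir`, `staging_area`, `repository`, `add`, and `commit` are hypothetical and exist purely for illustration.

```python
# Toy model of Git's three stages (illustration only, not Git's real storage).
working_dir = {"analysis.py": "print('v1')", "notes.md": "draft"}
staging_area = {}   # what the next commit will contain
repository = []     # list of commits, oldest first

def add(filename):
    """Copy the current version of one file into the staging area."""
    staging_area[filename] = working_dir[filename]

def commit(message):
    """Save a snapshot of everything staged, together with a message."""
    repository.append({"message": message, "files": dict(staging_area)})

add("analysis.py")                          # like: git add analysis.py
commit("Add analysis script")               # like: git commit -m "..."

working_dir["analysis.py"] = "print('v2')"  # edited, but not staged yet

# The commit holds only what was staged: analysis.py, still at v1
print(repository[-1]["files"])
```

The point of the model is that editing a file changes only the working directory. Nothing reaches the repository until it has been staged and committed.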

Reference:

Essential:

Helpful but less essential:

Brief Example:

# Local repository workflow
git init                      # Start new repository
git add analysis.py           # Stage file
git commit -m "Add analysis script"  # Create commit
git branch feature-analysis  # Create branch
git checkout feature-analysis # Switch to branch

# Remote repository workflow
git clone https://github.com/user/repo.git  # Clone existing repo
git push origin main          # Push changes
git pull origin main          # Pull updates

Git Clone

Good vs. Bad Commit Messages

# Good commit message
git commit -m "Add data validation to analysis script

- Validate input file exists before processing
- Check data format matches expected schema
- Add error handling for malformed data

Fixes issue #123"

# Bad commit message
git commit -m "minor changes"

xkcd 1296: Git Commit

VS Code Git Integration

Setting Up Git in VS Code

Reference:

  1. Install VS Code (if not already done)
  2. Open VS Code → View → Source Control (or Ctrl+Shift+G)
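  3. If Git has no user name and email configured yet, your first commit will fail with a message asking you to set them. Run git config --global user.name "Your Name" and git config --global user.email "you@example.com" once in a terminal.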

VS Code's Source Control panel makes version control accessible without memorizing command-line syntax. This integration streamlines daily staging, committing, and managing changes.

Reference:

VS Code Git Workflow:

1. Edit files (e.g., analysis.py)
2. Ctrl+Shift+G → Open Source Control panel
3. Click + next to changed files to stage
4. Type commit message: "Add data validation to analysis script"
5. Ctrl+Enter to commit
6. Click sync button to push to GitHub

Git Workflow: Branching and Merging

Branching lets you develop a feature in isolation and merge it into main when it is ready, which enables parallel development and safe experimentation.

Reference:

Branching Workflow:

# Create feature branch
git checkout -b feature/data-analysis
# Make changes, commit
git add .
git commit -m "Add data analysis functionality"
git push origin feature/data-analysis

# Switch back to main and merge
git checkout main
git merge feature/data-analysis
git push origin main

# Clean up feature branch
git branch -d feature/data-analysis

Merge Conflict Resolution: When Git cannot automatically merge changes, it creates merge conflicts that must be resolved manually:

  1. Open conflicted files in VS Code
  2. Choose which changes to keep
  3. Remove conflict markers (<<<<<<<, =======, >>>>>>>)
  4. Stage resolved files: git add [file]
  5. Complete merge: git commit
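Before staging a resolved file, it can help to confirm that no conflict markers were left behind. The following is a minimal sketch, assuming markers always appear at the start of a line. The function name `find_conflict_markers` and the sample text are hypothetical, and a line of `=======` used for some other purpose (for example a Markdown heading underline) would be reported as well.

```python
CONFLICT_MARKERS = ("<<<<<<<", "=======", ">>>>>>>")

def find_conflict_markers(text):
    """Return (line_number, line) pairs for lines that start with a conflict marker."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        if line.startswith(CONFLICT_MARKERS):   # startswith accepts a tuple
            hits.append((number, line))
    return hits

# Hypothetical file content in the middle of a merge conflict
conflicted = """\
threshold = 10
<<<<<<< HEAD
scale = 2
=======
scale = 3
>>>>>>> feature/data-analysis
print(threshold * scale)
"""

for number, line in find_conflict_markers(conflicted):
    print(f"line {number}: {line}")
```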

GitHub Web Interface

GitHub's web interface lets you manage repositories, collaborate with others, and organize projects from the browser.

Reference:

Gitignore Files: A .gitignore file specifies which files and directories Git should ignore when tracking changes. This is crucial for data science projects to avoid committing sensitive data, large datasets, or generated files.

Reference:

Brief Example:

# Python cache files
__pycache__/
*.pyc

# Data and secrets
data/raw/*.csv
.env
*.key

# IDE files
.vscode/
.idea/

# Ignore processed outputs, except one important results file
data/processed/*.csv
!data/processed/important_results.csv

This prevents accidentally committing sensitive information, large files, or generated content while preserving important project files.

Brief Example:

Create repository: github.com → "+" → "New repository" → Name, description, add README → Create.

Add files: "Add file" → "Create new file" → Name, add code, commit message → Commit.

Markdown Documentation

Markdown is a lightweight markup language for formatted text, essential for documentation and project communication. Files are human-readable and render beautifully on GitHub.

Reference:

Brief Example:

# Data Analysis Report

## Overview
Analyzes study time vs. performance.

## Key Findings
- More hours → higher grades
- Regular habits help

## Code Example
```python
import pandas as pd
df = pd.read_csv('study_data.csv')
print(f"Correlation: {df['hours'].corr(df['grade']):.2f}")
```

Raw data


# Python Fundamentals (McKinney Ch2+3)

Python emphasizes readability, which suits data analysis. Everything is an object, so values of every type can be inspected and passed around in the same way. This section focuses on practical data manipulation.

![Python Import](media/python_import.webp)

## Language Semantics and Object Model


Python uses indentation for code structure, creating clean code. Every value is an object with type information, enabling dynamic behavior.

**Reference:**
- Indentation defines code blocks (4 spaces recommended)
- `#` for comments
- `type(object)` - Get object type
- `isinstance(object, type)` - Type checking
- `id(object)` - Get object identity
- `dir(object)` - List object attributes

**Brief Example:**
```python
# Indentation matters
x = 5
if x > 0:
    print("Positive")
    y = x * 2

print(type(42))        # <class 'int'>
print(isinstance("hello", str))  # True
```

Object Introspection and Dynamic Type Checking

Object introspection examines objects at runtime—their type, attributes, and methods. Valuable for unknown datasets and flexible code.
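As a small illustration of introspection, the sketch below inspects an object as if its type were unknown. The helper name `summarize` and the sample dictionary are hypothetical. Only the built-ins `type`, `dir`, `hasattr`, `getattr`, and `callable` are used.

```python
def summarize(obj):
    """Return the type name and the public attribute names of any object."""
    public = [name for name in dir(obj) if not name.startswith("_")]
    return type(obj).__name__, public

mystery = {"name": "Alice", "grade": 85}   # pretend we do not know what this is

type_name, attributes = summarize(mystery)
print(type_name)                            # dict
print(attributes[:3])                       # first few public attributes
print(hasattr(mystery, "keys"))             # True: it has a .keys attribute
print(callable(getattr(mystery, "keys")))   # True: and that attribute is a method
```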

Python uses duck typing: "If it walks like a duck and quacks like a duck, then it must be a duck." If an object supports the needed methods, you can use it—regardless of its actual type.

Duck Typing

This means functions work with any object that behaves as expected, not just those of a specific type.

Reference:

# Type conversion: the same value as an int, then as a str
big_number = 12345
print(f"As a number: {big_number} (type: {type(big_number).__name__})")

# Convert to string so we can iterate through the digits
number_as_string = str(big_number)
print(f"As a string: '{number_as_string}' (type: {type(number_as_string).__name__})")

# Duck typing: the for loop accepts anything iterable, and a str qualifies
digit_sum = 0
for digit_char in number_as_string:  # A str yields one character at a time
    digit_sum += int(digit_char)     # Convert each character back to int

print(f"Sum of digits: {digit_sum}")
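Duck typing is easiest to see with a function that never checks the type of its argument. The following is a minimal sketch: the class `Cohort` and the function `describe_size` are hypothetical names, and the only requirement on the argument is that `len()` works on it.

```python
class Cohort:
    """Hypothetical class: not a list, but it supports len()."""
    def __init__(self, students):
        self._students = list(students)

    def __len__(self):
        return len(self._students)

def describe_size(container):
    """Works with any object that supports len(), whatever its type."""
    return f"{type(container).__name__} holds {len(container)} item(s)"

print(describe_size([85, 92, 78]))               # list holds 3 item(s)
print(describe_size("hello"))                    # str holds 5 item(s)
print(describe_size(Cohort(["Alice", "Bob"])))   # Cohort holds 2 item(s)

try:
    describe_size(42)          # an int has no length
except TypeError as error:
    print(f"Not duck-like enough: {error}")
```

The function works for the custom class without any changes, because it depends on behavior (`len`) and not on a specific type.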

Scalar Types and Operations

Scalar types represent single values. Python provides rich support for numeric operations, string manipulation, and boolean logic.

Reference:

Brief Example:

# Numeric operations
count = 150
average = 87.3
population = 1.4e9  # Scientific notation

# String operations
name = "Alice Johnson"
clean_name = "  Bob Smith  ".strip()

# Boolean logic
has_data = True
analysis_ready = has_data and count > 0
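A few more scalar operations that come up often in analysis code are sketched below. The variable names and values are made up for illustration.

```python
import math

total_points = 7
students = 2

print(total_points / students)    # 3.5  true division always gives a float
print(total_points // students)   # 3    floor division
print(total_points % students)    # 1    remainder
print(total_points ** 2)          # 49   exponent

# Converting between scalar types
print(int("42") + 1)              # 43
print(float("87.3"))              # 87.3
print(str(150) + " rows")         # 150 rows

# Floats are approximations, so compare them with a tolerance
print(0.1 + 0.2 == 0.3)               # False
print(math.isclose(0.1 + 0.2, 0.3))   # True

# Empty or zero values count as False in a condition
print(bool(0), bool(""), bool("0"))   # False False True
```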

String Operations for Data Cleaning

String operations are fundamental for data cleaning. Python provides built-in methods for transforming, cleaning, and validating text data.

Reference:

Brief Example:

# Data cleaning operations
messy_data = "  Alice Johnson  "
clean_name = messy_data.strip().title()

# Text processing
filename = "data_2023_report.csv"
if filename.endswith(".csv"):
    print("Processing CSV file")

# Data validation
user_input = "123abc"
if user_input.isalnum():
    print("Input contains only letters and numbers")
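The same string methods can be chained to clean a whole column of values. The sketch below assumes a small made-up list of names and ages. The helper `tidy_name` is a hypothetical name.

```python
raw_names = ["  alice JOHNSON ", "BOB   smith", "charlie_brown  "]

def tidy_name(raw):
    """Normalize whitespace, underscores, and capitalization in one name."""
    text = raw.replace("_", " ")   # underscores become spaces
    words = text.split()           # split() with no argument drops extra whitespace
    return " ".join(words).title()

cleaned = [tidy_name(name) for name in raw_names]
print(cleaned)      # ['Alice Johnson', 'Bob Smith', 'Charlie Brown']

# Validate before converting: keep only entries made entirely of digits
raw_ages = ["20", "19", "n/a"]
valid_ages = [int(age) for age in raw_ages if age.isdigit()]
print(valid_ages)   # [20, 19]
```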

Print Statements and Output Formatting

Print statements communicate results and debug code. Understanding formatting options enables clear output.

Reference:

Brief Example:

# Basic printing
print("Analysis complete")
print("Value:", 42)

# F-string formatting (preferred)
name = "Alice"
score = 87.3456
print(f"Student: {name}")
print(f"Score: {score:.1f}%")

# Debugging with print
data = [1, 2, 3, 4, 5]
print(f"Debug: data = {data}, length = {len(data)}")
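F-strings also accept a format specification after the colon. The sketch below shows a few common ones. The values are made up, and the `=` shorthand on the last f-string line assumes Python 3.8 or newer.

```python
name = "Alice"
score = 87.3456
population = 1400000000
pass_rate = 0.876

print(f"{score:.2f}")        # 87.35            two decimal places
print(f"{population:,}")     # 1,400,000,000    thousands separators
print(f"{pass_rate:.1%}")    # 87.6%            fraction shown as a percentage
print(f"{name:<10}|")        # Alice     |      left-aligned in 10 characters
print(f"{name:>10}|")        #      Alice|      right-aligned in 10 characters
print(f"{score=:.1f}")       # score=87.3       name and value, handy for debugging

# print() itself takes a custom separator
print("a", "b", "c", sep=", ")   # a, b, c
```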

Basic File I/O Operations

File I/O operations are essential for data science. Python provides simple tools for reading and writing files.

Reference:

Brief Example:

# Reading from a file
with open('data.txt', 'r') as file:
    content = file.read()
    print(f"File content: {content}")

# Writing to a file
results = ["Alice: 95", "Bob: 87", "Charlie: 92"]
with open('grades.txt', 'w') as file:
    for result in results:
        file.write(f"{result}\n")

# Appending to a file
with open('log.txt', 'a') as file:
    file.write("2023-12-01: Analysis completed\n")

# Print to file examples
score = 87.3456
with open('results.txt', 'w') as file:
    print("Analysis Results", file=file)
    print(f"Average score: {score:.1f}", file=file)

# One-liner file output (convenient, but the file is never explicitly closed;
# prefer the with-statement form above in real code)
print("Debug info", file=open('debug.log', 'a'))

# Multiple outputs to same file
data = [1, 2, 3, 4, 5]
with open('report.txt', 'w') as report:
    print("Data Science Report", file=report)
    print("=" * 20, file=report)
    print(f"Total samples: {len(data)}", file=report)
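Writing a file and reading it back line by line can be tried safely in a temporary folder. This is a sketch under a few assumptions: the file name `grades.txt` and the "name: score" line format are made up, and `tempfile` and `pathlib` are used only so the example does not touch real files.

```python
import tempfile
from pathlib import Path

# Work in a temporary folder that is deleted automatically afterwards
with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "grades.txt"

    # Write one "name: score" record per line
    with open(path, "w", encoding="utf-8") as file:
        file.write("Alice: 95\nBob: 87\nCharlie: 92\n")

    # Read it back line by line instead of loading everything at once
    scores = {}
    with open(path, "r", encoding="utf-8") as file:
        for line in file:
            name, value = line.strip().split(": ")
            scores[name] = int(value)

print(scores)   # {'Alice': 95, 'Bob': 87, 'Charlie': 92}
```

Iterating over the file object reads one line at a time, which keeps memory use low for large files.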

Control Flow Structures

Control flow determines execution order through conditionals and loops. Python's syntax emphasizes readability and iteration.

Reference:

Brief Example:

# Conditional logic
score = 85
if score >= 90:
    grade = "A"
elif score >= 80:
    grade = "B"
else:
    grade = "C"

# Iteration
for i in range(5):
    print(f"Count: {i}")

# List iteration
grades = [85, 92, 78, 96]
for g in grades:  # use a new name so the letter grade above is not overwritten
    if g >= 90:
        print(f"Excellent: {g}")
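Loops can also be cut short or skip items. The sketch below uses made-up sensor readings in which 999 is assumed to mark the end of the data.

```python
readings = [12, -1, 15, 0, 18, 999, 21]   # hypothetical sensor values

valid = []
for value in readings:
    if value == 999:      # sentinel meaning "end of data"
        break             # leave the loop entirely
    if value <= 0:
        continue          # skip this value and move on to the next one
    valid.append(value)

print(valid)              # [12, 15, 18]

# while repeats until its condition becomes False
countdown = 3
while countdown > 0:
    print(f"T-minus {countdown}")
    countdown -= 1
```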

Data Structures: Lists and Tuples

Lists provide mutable sequences for data. Tuples offer immutable sequences useful for fixed records.

Reference:

Brief Example:

# Lists - mutable sequences
grades = [85, 92, 78, 96, 88]
grades.append(90)
grades.insert(1, 87)
total = sum(grades)

# Tuples - immutable sequences
coordinates = (40.7128, -74.0060)
name, age, gpa = ("Alice", 22, 3.8)  # Unpacking
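Indexing, slicing, and sorting work on any list, and the difference between mutable lists and immutable tuples shows up as soon as you try to assign to an element. The following is a short sketch with made-up values.

```python
grades = [85, 92, 78, 96, 88]

print(grades[0])          # 85   first item
print(grades[-1])         # 88   last item
print(grades[1:3])        # [92, 78]   items at index 1 and 2
print(sorted(grades))     # [78, 85, 88, 92, 96]   a new sorted list
print(grades)             # [85, 92, 78, 96, 88]   original is unchanged

grades.sort(reverse=True) # sorts the list in place
print(grades)             # [96, 92, 88, 85, 78]

# Tuples cannot be changed after creation
coordinates = (40.7128, -74.0060)
try:
    coordinates[0] = 0.0
except TypeError as error:
    print(f"Tuples are immutable: {error}")
```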

Data Structures: Dictionaries and Sets

Dictionaries provide key-value storage for structured data. Sets offer unique collections with mathematical operations.

Reference:

Brief Example:

# Dictionaries - key-value storage
student = {"name": "Alice", "grade": 85, "major": "Data Science"}
print(student["name"])  # "Alice"
print(student.get("gpa", 0.0))  # Safe access

# Sets - unique collections
math_students = {"Alice", "Bob", "Charlie"}
cs_students = {"Alice", "Diana", "Eve"}
both_subjects = math_students & cs_students  # Intersection
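A common use of dictionaries is counting, and sets support more than intersection. The sketch below uses made-up majors and reuses the student names from the example above.

```python
majors = ["Math", "CS", "Math", "Biology", "CS", "Math"]

# Count occurrences with a dictionary
counts = {}
for major in majors:
    counts[major] = counts.get(major, 0) + 1
print(counts)                      # {'Math': 3, 'CS': 2, 'Biology': 1}

# Loop over key-value pairs
for major, count in counts.items():
    print(f"{major}: {count}")

# Sets remove duplicates and support set algebra
math_students = {"Alice", "Bob", "Charlie"}
cs_students = {"Alice", "Diana", "Eve"}
print(sorted(math_students | cs_students))   # union: all five names
print(sorted(math_students - cs_students))   # ['Bob', 'Charlie']
print("Alice" in cs_students)                # True
```

Sets are unordered, so `sorted()` is used here to get a predictable printing order.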

List Comprehensions and Sequence Functions

List comprehensions provide concise syntax for creating lists through transformation and filtering. Sequence functions offer efficient operations.

Reference:

Brief Example:

# List comprehensions
grades = [85, 92, 78, 96, 88]
passing_grades = [g for g in grades if g >= 80]

# Sequence functions
for index, grade in enumerate(grades):
    print(f"Student {index + 1}: {grade}")

names = ["Alice", "Bob", "Charlie"]
scores = [85, 92, 78]
for name, score in zip(names, scores):
    print(f"{name}: {score}")
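Comprehensions can transform as well as filter, and the same syntax builds dictionaries. The sketch below reuses the names and scores from the example above. The 5-point curve and the 85-point cutoff are made up.

```python
names = ["Alice", "Bob", "Charlie"]
scores = [85, 92, 78]

# Transform every item
curved = [s + 5 for s in scores]
print(curved)                       # [90, 97, 83]

# Transform and filter together
honor_roll = [n for n, s in zip(names, scores) if s >= 85]
print(honor_roll)                   # ['Alice', 'Bob']

# Dict comprehension: same idea, builds a dictionary
by_name = {n: s for n, s in zip(names, scores)}
print(by_name)                      # {'Alice': 85, 'Bob': 92, 'Charlie': 78}

# sorted() with a key function, highest score first
ranking = sorted(by_name, key=by_name.get, reverse=True)
print(ranking)                      # ['Bob', 'Alice', 'Charlie']
```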

Functions

Functions organize code into reusable units with clear interfaces. They enable reuse, testing, and modular design.

Reference:

Brief Example:

# Function definition
def calculate_average(grades):
    """Calculate the average of a list of grades."""
    if not grades:
        return 0
    return sum(grades) / len(grades)

# Function usage
grades = [85, 92, 78, 96, 88]
average = calculate_average(grades)
print(f"Average grade: {average:.1f}")

# __main__ guard for script execution

if __name__ == "__main__":
    # This code runs only when the script is executed directly
    grades = [85, 92, 78, 96, 88]
    average = calculate_average(grades)
    print(f"Average grade: {average:.1f}")
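Functions can also take default arguments and return more than one value. The following is a minimal sketch: the function `summarize_grades`, its parameters `passing` and `decimals`, and the sample grades are hypothetical.

```python
def summarize_grades(grades, passing=60, decimals=1):
    """Return (average, number_passing) for a list of grades.

    `passing` and `decimals` have defaults, so callers may omit them.
    """
    if not grades:
        return 0.0, 0
    average = round(sum(grades) / len(grades), decimals)
    number_passing = sum(1 for g in grades if g >= passing)
    return average, number_passing   # packed into a tuple

grades = [85, 92, 58, 96, 88]

average, passed = summarize_grades(grades)               # use the defaults
print(average, passed)                                   # 83.8 4

average, passed = summarize_grades(grades, passing=90)   # keyword argument
print(average, passed)                                   # 83.8 2
```

Returning a tuple and unpacking it at the call site keeps related results together without needing a custom class.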

Command Line Mastery (review)

Essential Navigation Commands

Navigation commands orient you within the file system.

Reference:

Brief Example:

pwd                    # /Users/username/Documents
ls -la                 # Show all files with details
cd projects/data_science
pwd                    # /Users/username/Documents/projects/data_science

File and Directory Operations

File operations create and organize project structures.

Reference:

Brief Example:

mkdir -p data/raw data/processed scripts
touch scripts/analysis.py
cp data/raw/dataset.csv data/processed/

Text Processing and Search

Text processing commands explore and manipulate text data.

Reference:

Brief Example:

head -20 data.csv              # Preview first 20 lines
grep "error" logfile.txt       # Find error messages

Visual Directory Structure

The tree command shows directory structure hierarchically.

Reference:

Brief Example:

tree                    # Show full directory structure
tree -L 2              # Show only 2 levels deep
tree -d                # Show only directories

History Navigation and Shortcuts

Shortcuts and history navigation improve command line efficiency.

Reference:

Brief Example:

# Use up arrow to recall previous commands
# Use Tab to complete: cd pro<Tab> → cd projects/
# Use Ctrl+R to search: Ctrl+R then type "git" to find git commands

Shell Scripting Fundamentals

Shell scripting automates tasks and creates reusable command sequences.

Reference:

Brief Example:

#!/bin/bash
# setup.sh: create project structure
echo "Setting up project..."
mkdir -p src data output
echo "Directories created"

# Create files using here-documents
cat > data/sample.csv << 'EOF'
name,age,grade
Alice,20,85
Bob,19,92
EOF

echo "Setup complete!"

# From the terminal (not inside the script): make it executable, then run it
# chmod +x setup.sh
# ./setup.sh

Command Chaining and Redirection

Command chaining creates data processing pipelines. Redirection controls input and output.

Reference:

Brief Example:

grep "error" logfile.txt | wc -l    # Count error lines
ls *.csv | head -5 > filelist.txt   # Save first 5 CSV files to list