Skip to content

Python Fundamentals for Data Science

Python for DS is primarily an interface to existing libraries (pandas, sklearn, torch), not general-purpose programming. Treat it as a platform for running pre-built tools.

Environment

Jupyter Notebook / Google Colab: primary workspace. Interactive, cell-by-cell execution.

Colab tips: - Switch to English interface (better for googling errors) - Shift+Enter: run cell, move to next - Ctrl+M, B: insert cell below - Ctrl+M, D: delete cell - Files are session-only - re-download after restart

Core Types

# Numbers
x = 42
y = 3.14
z = x + y  # 45.14

# Strings
name = 'data science'
f"My field is {name}"  # f-string formatting

# Booleans
flag = True  # also: False
# 0, None, [], '' are falsy; everything else truthy

# Lists (ordered, mutable)
nums = [1, 2, 3]
nums.append(4)     # [1, 2, 3, 4]
nums[0]            # 1 (indexing from 0)
nums[-1]           # 4 (last element)

# Tuples (ordered, immutable)
point = (3, 4)     # can't modify

# Dictionaries
d = {'name': 'Alice', 'age': 30}
d['name']          # 'Alice'
d['city'] = 'NYC'  # add key

Slicing

Works on lists and strings: [start:stop:step], stop is exclusive.

nums = [1, 2, 3, 4, 5]
nums[:2]    # [1, 2]
nums[1:]    # [2, 3, 4, 5]
nums[-1]    # 5
nums[::2]   # [1, 3, 5]
nums[::-1]  # [5, 4, 3, 2, 1]

Control Flow

# Conditionals
if x > 10:
    print('big')
elif x > 5:
    print('medium')
else:
    print('small')

# Loops
for item in my_list:
    print(item)

for i, item in enumerate(my_list):
    print(i, item)

for i in range(10):      # 0 to 9
    print(i)

# While
while condition:
    do_something()

List Comprehensions

Replace 3-line loops with 1 line:

# Transform
doubled = [x * 2 for x in nums]

# Filter
big = [x for x in nums if x > 3]

# Transform + filter
result = [x * 2 for x in nums if x > 3]

# Conditional expression
labels = ['pos' if x > 0 else 'neg' for x in values]

Functions

def calculate_metric(actual, predicted):
    """Calculate MAE between actual and predicted values."""
    errors = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(errors) / len(errors)

# Lambda (anonymous functions)
df['col'].apply(lambda x: x * 2)

Dictionaries

# Merge
merged = {**dict_a, **dict_b}  # right overrides left

# Comprehension
squares = {x: x**2 for x in range(10)}

# Iteration
for key, value in d.items():
    print(key, value)

String Formatting

f"{value:.2f}"     # 2 decimal places
f"{name:>10}"      # right-align, width 10
'...'.join(['a', 'b', 'c'])  # 'a...b...c'

Error Handling

try:
    result = risky_operation()
except ValueError as e:
    print(f"Value error: {e}")
except Exception as e:
    print(f"Unexpected: {e}")

Naming Conventions

  • Variables: snake_case (not camelCase)
  • Classes: CamelCase
  • Files: no spaces, no Cyrillic: my_analysis.ipynb
  • Meaningful names: salary_range not x

Libraries Import Pattern

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

Gotchas

  • Indentation matters (4 spaces standard) - mixing tabs and spaces causes errors
  • = is assignment, == is comparison
  • Lists are mutable: b = a creates reference, not copy. Use b = a.copy()
  • Integer division: 7 / 2 = 3.5, 7 // 2 = 3
  • Never use print() for pandas DataFrames in Jupyter - destroys formatting

See Also