Data Analysis Basics (Pandas and Matplotlib)¶
Pandas provides tabular data structures (DataFrame, Series) for data manipulation and analysis. Matplotlib and Seaborn handle visualization. Together with NumPy, they form the foundation of Python's data science ecosystem.
Key Facts¶
- DataFrame is a 2D labeled table (like a spreadsheet); Series is a 1D labeled array (column)
- Pandas is for data that fits in memory; PySpark for distributed large-scale data
pd.concat()for stacking DataFrames;pd.merge()for SQL-style joins- Matplotlib has two APIs: pyplot (MATLAB-style) and OOP (fig, ax - more control)
- Seaborn wraps Matplotlib with better defaults and statistical plotting
- Miller's rule: plots with >8 categories become unreadable - use a table instead
Patterns¶
Pandas Core Operations¶
import pandas as pd
# Create
df = pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [25, 30]})
# Select
df['name'] # column -> Series
df[['name', 'age']] # multiple columns -> DataFrame
df.loc[0] # row by label
df.iloc[0] # row by position
df[df['age'] > 25] # boolean filter
# Aggregate
df.groupby('dept').mean()
df.describe() # summary statistics
# Clean
df.fillna(0) # replace NaN
df.dropna() # remove NaN rows
df.drop_duplicates() # remove duplicate rows
df.astype({'age': int}) # type conversion
Concat and Merge¶
# Vertical stack
pd.concat([df1, df2])
pd.concat([df1, df2], join='inner') # common columns only
# SQL-style join
pd.merge(employees, departments, on='dept_id')
pd.merge(left, right, left_on='id', right_on='emp_id', how='left')
Matplotlib¶
import matplotlib.pyplot as plt
import numpy as np
# OOP style (recommended)
fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(x, y, label='data')
ax.set_title('Title')
ax.set_xlabel('X')
ax.legend()
plt.savefig('plot.png', dpi=150, bbox_inches='tight')
# Multiple subplots
fig, axes = plt.subplots(2, 2, figsize=(10, 8))
axes[0, 0].plot(x, y)
axes[0, 1].scatter(x, y)
axes[1, 0].bar(categories, values)
axes[1, 1].hist(data, bins=30)
Common Plot Types¶
plt.plot(x, y) # line
plt.scatter(x, y, c=colors) # scatter
plt.bar(categories, values) # bar
plt.hist(data, bins=30) # histogram
plt.contourf(X, Y, Z, cmap='RdGy') # contour
Seaborn¶
import seaborn as sns
sns.pairplot(df) # all variables against each other
sns.heatmap(df.corr(), annot=True) # correlation matrix
sns.boxplot(x='category', y='value', data=df)
NumPy Basics¶
import numpy as np
arr = np.array([1, 2, 3])
zeros = np.zeros((3, 4)) # 3x4 matrix of zeros
rand = np.random.random((2, 3)) # 2x3 random matrix
np.random.seed(42) # reproducibility
Visualization Best Practices¶
- Trends over time -> line chart
- Comparisons -> bar chart
- Distributions -> histogram or box plot
- Relationships -> scatter plot
- Proportions -> pie chart (use sparingly)
- 3D on 2D -> contour or heatmap
- Max ~8 categories per chart; beyond that, use a table
Gotchas¶
- Pandas
concatwith repeatedappendhas quadratic complexity - collect all DataFrames, then concat once pd.merge()auto-detects join column from common names; specifyon=to be explicit- Matplotlib pyplot state API is convenient but hard to manage for complex figures - prefer OOP style
- For large datasets, Matplotlib is slow - use
datashaderor Plotly/Bokeh with WebGL
See Also¶
- data structures - Python built-in collections
- file io - reading CSV, JSON data files
- [[data-science/index]] - advanced data science topics