Python Data Analysis Basics
Master data manipulation, computation, and visualization with pandas, numpy, and matplotlib!
1. pandas: Data Manipulation
pandas is used for structured data operations (think Excel in Python).
Installation
pip install pandas numpy matplotlib
Key Concepts
- DataFrame: A 2D table (rows and columns).
- Series: A one-dimensional labeled array, such as a single column of a DataFrame.
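To make the two concepts concrete, here is a minimal sketch. The variable names and values are illustrative assumptions. Selecting one column with single brackets gives a Series, while selecting a list of columns gives a DataFrame.

```python
import pandas as pd

# Illustrative data; names and values are assumptions for this sketch
df_demo = pd.DataFrame({"Month": ["Jan", "Feb"], "Sales": [2000, 3000]})

sales = df_demo["Sales"]      # single brackets -> Series (one column)
subset = df_demo[["Sales"]]   # double brackets -> DataFrame (2D table)

print(type(sales).__name__)   # Series
print(type(subset).__name__)  # DataFrame
print(df_demo.shape)          # (2, 2): 2 rows, 2 columns
```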
Example: Sales Data Analysis
import pandas as pd
# Create DataFrame
data = {
"Month": ["Jan", "Feb", "Mar"],
"Sales": [2000, 3000, 2500],
"Expenses": [1500, 1700, 1600]
}
df = pd.DataFrame(data)
# Basic Operations
print(df.head()) # First 5 rows
print(df.describe()) # Summary stats
# Add a Profit column
df["Profit"] = df["Sales"] - df["Expenses"]
# Filter data
high_sales = df[df["Sales"] > 2500]
# Group by Month and sum each group
# (each month appears once here, so the sums equal the original rows)
monthly_summary = df.groupby("Month").sum()
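In the example above each month appears only once, so grouping by Month leaves one row per group. Grouping becomes more informative when a key repeats. The sketch below assumes a hypothetical "Quarter" column, which is not part of the original data, and sums Sales within each quarter.

```python
import pandas as pd

# Hypothetical data: the "Quarter" column and the Apr-Jun values are
# assumptions added for illustration
sales_df = pd.DataFrame({
    "Quarter": ["Q1", "Q1", "Q1", "Q2", "Q2", "Q2"],
    "Month": ["Jan", "Feb", "Mar", "Apr", "May", "Jun"],
    "Sales": [2000, 3000, 2500, 2800, 3200, 3100],
})

# Select the numeric column before aggregating so the text "Month"
# column is not included in the sum
quarterly_sales = sales_df.groupby("Quarter")["Sales"].sum()
print(quarterly_sales)  # Q1 -> 7500, Q2 -> 9100
```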
2. numpy: Numerical Computing
numpy handles arrays and mathematical operations efficiently.
Key Features
- Arrays: Faster than Python lists for numerical tasks.
- Broadcasting: Perform element-wise operations on arrays of different but compatible shapes.
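As a small sketch of broadcasting (the price and discount values are made up for illustration), a 1D array of shape (3,) can be combined with a 2D array of shape (2, 3) because their trailing dimensions match. numpy applies the 1D array to every row.

```python
import numpy as np

# Prices for 3 products over 2 days: shape (2, 3)
prices = np.array([[10, 20, 30],
                   [12, 18, 33]])

# One discount factor per product: shape (3,)
discount = np.array([0.9, 1.0, 0.5])

# The (3,) array is applied to each row of the (2, 3) array
discounted = prices * discount
print(discounted.shape)  # (2, 3)

# A scalar broadcasts against any shape
print(prices + 1)
```

Shapes that do not line up, such as (2, 3) with (2,), raise a ValueError instead of being stretched.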
Example: Array Operations
import numpy as np
# Create arrays
prices = np.array([10, 20, 30])
quantities = np.array([5, 3, 2])
# Vectorized operations
revenue = prices * quantities # [50, 60, 60]
# Universal functions
print(np.sqrt(revenue)) # Square roots: [7.07, 7.75, 7.75]
# Reshaping
matrix = np.array([[1, 2], [3, 4], [5, 6]])  # shape (3, 2)
reshaped = matrix.reshape(2, 3)  # [[1, 2, 3], [4, 5, 6]]
3. matplotlib: Data Visualization
matplotlib creates static, interactive, or animated visualizations.
Basic Plotting
import matplotlib.pyplot as plt
# Line plot
plt.plot(df["Month"], df["Sales"], marker="o", label="Sales")
plt.plot(df["Month"], df["Expenses"], marker="s", label="Expenses")
# Customize
plt.title("Monthly Sales vs Expenses")
plt.xlabel("Month")
plt.ylabel("Amount ($)")
plt.legend()
plt.grid(True)
plt.show()
Bar Chart
plt.bar(df["Month"], df["Profit"], color="green")
plt.title("Monthly Profit")
plt.show()
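The pyplot calls above draw on an implicit "current" figure. As an alternative sketch, matplotlib also provides an explicit figure/axes interface through plt.subplots(), which makes it easier to place several charts side by side. The lists below copy the values from the Sales example so the block runs on its own.

```python
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar"]
sales = [2000, 3000, 2500]
expenses = [1500, 1700, 1600]
profit = [s - e for s, e in zip(sales, expenses)]  # [500, 1300, 900]

# One figure with two side-by-side axes
fig, (ax_line, ax_bar) = plt.subplots(1, 2, figsize=(10, 4))

ax_line.plot(months, sales, marker="o", label="Sales")
ax_line.plot(months, expenses, marker="s", label="Expenses")
ax_line.set_title("Monthly Sales vs Expenses")
ax_line.set_xlabel("Month")
ax_line.set_ylabel("Amount ($)")
ax_line.legend()

ax_bar.bar(months, profit, color="green")
ax_bar.set_title("Monthly Profit")

fig.tight_layout()
# Call plt.show() to display the figure, or fig.savefig(...) to save it
```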
Real-World Project: Sales Insights
Combine all three libraries to analyze and visualize sales data:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# 1. Load data from CSV (assumes the file has "Month" and "Sales" columns)
df = pd.read_csv("sales_data.csv")
# 2. Clean data
df = df.dropna()  # Remove rows with missing values
# 3. Calculate metrics with numpy
average_sales = np.mean(df["Sales"])
# 4. Plot trends
plt.figure(figsize=(10, 5))
plt.scatter(df["Month"], df["Sales"], color="blue", label="Sales")
plt.axhline(average_sales, color="red", linestyle="--", label="Average Sales")
plt.title("Sales Trends")
plt.xlabel("Month")
plt.ylabel("Sales")
plt.legend()
plt.show()
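The project code expects a file named sales_data.csv with Month and Sales columns. To try the load, clean, and average steps without creating a file, one option is to read from an in-memory string. The CSV contents below are hypothetical, including the deliberately missing value.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical CSV contents; the Mar row has a missing Sales value
csv_text = """Month,Sales
Jan,2000
Feb,3000
Mar,
Apr,2500
"""

# pd.read_csv accepts a file-like object as well as a file path
df_sales = pd.read_csv(io.StringIO(csv_text))

df_sales = df_sales.dropna()                # drops the Mar row
average_sales = np.mean(df_sales["Sales"])  # mean of 2000, 3000, 2500
print(len(df_sales), average_sales)         # 3 2500.0
```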
Key Comparisons
Library | Purpose | Key Features |
---|---|---|
pandas | Data manipulation | DataFrames, filtering, grouping |
numpy | Numerical operations | Arrays, vectorization, math ops |
matplotlib | Visualization | Line/bar charts, customization |
Common Mistakes
- ❌ Not handling NaN values: Use df.dropna() or df.fillna().
- ❌ Using loops instead of vectorization: Leverage numpy/pandas operations.
- ❌ Overcomplicating plots: Start simple, then customize.
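The first two mistakes can be illustrated with a short sketch. The data is hypothetical, and filling with the column mean is just one possible strategy, not a general recommendation.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing Sales value
df_nan = pd.DataFrame({
    "Sales": [2000.0, np.nan, 2500.0],
    "Expenses": [1500.0, 1700.0, 1600.0],
})

# Option 1: drop rows that contain NaN
dropped = df_nan.dropna()

# Option 2: fill NaN in Sales with the column mean (NaN is skipped by mean())
filled = df_nan.fillna({"Sales": df_nan["Sales"].mean()})

# Loop version: works, but is slow on large data
profit_loop = []
for _, row in filled.iterrows():
    profit_loop.append(row["Sales"] - row["Expenses"])

# Vectorized version: one expression over whole columns
profit_vec = filled["Sales"] - filled["Expenses"]

print(profit_vec.tolist())  # [500.0, 550.0, 900.0]
```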
Best Practices
- Inspect data first: Use df.head(), df.info(), and df.describe().
- Use vectorized operations: Avoid Python loops with numpy/pandas.
- Label plots clearly: Always add titles, axis labels, and legends.
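A quick inspection pass might look like the sketch below. The DataFrame is a stand-in with assumed values, and the isna() check is an extra step beyond the three calls named in the list.

```python
import pandas as pd

# Stand-in data; names and values are assumptions for this sketch
df_check = pd.DataFrame({
    "Month": ["Jan", "Feb", "Mar"],
    "Sales": [2000, 3000, 2500],
})

print(df_check.head())        # first rows (up to 5)
df_check.info()               # column names, non-null counts, dtypes
print(df_check.describe())    # count, mean, std, min, quartiles, max
print(df_check.isna().sum())  # number of missing values per column
```

Note that df.info() prints its report directly and returns None, so it does not need to be wrapped in print().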
Practice Problem
Create a DataFrame with student grades.
Calculate average grades per subject using numpy.
Plot a bar chart showing subject averages.
Key Takeaways
- ✅ pandas: Clean, filter, and analyze tabular data.
- ✅ numpy: Perform fast numerical computations.
- ✅ matplotlib: Visualize trends and patterns.
What’s Next?
Learn Seaborn for advanced visualizations or scikit-learn for machine learning!