Python Data Analysis Basics

Master data manipulation, computation, and visualization with pandas, numpy, and matplotlib!

1. pandas: Data Manipulation

pandas is used for working with structured (tabular) data; think of it as Excel in Python.

Installation


pip install pandas numpy matplotlib  

            

Key Concepts

  • DataFrame: A 2D table (rows and columns).
  • Series: A one-dimensional labeled array; each column of a DataFrame is a Series.
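
To make the two concepts concrete, here is a minimal sketch (the small table is illustrative) showing that selecting one column gives a Series, while the table itself is a DataFrame:

```python
import pandas as pd

# A small table: 3 rows, 2 columns
df = pd.DataFrame({"Month": ["Jan", "Feb", "Mar"], "Sales": [2000, 3000, 2500]})

sales = df["Sales"]      # single brackets select one column as a Series
subset = df[["Sales"]]   # double brackets keep a (one-column) DataFrame

print(type(df).__name__)      # DataFrame
print(type(sales).__name__)   # Series
print(df.shape)               # (3, 2)
print(sales.shape)            # (3,)
```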

Example: Sales Data Analysis


import pandas as pd  

# Create DataFrame  
data = {  
    "Month": ["Jan", "Feb", "Mar"],  
    "Sales": [2000, 3000, 2500],  
    "Expenses": [1500, 1700, 1600]  
}  
df = pd.DataFrame(data)  

# Basic Operations  
print(df.head())       # First 5 rows by default (all 3 rows here)
print(df.describe())   # Summary stats  

# Add a Profit column  
df["Profit"] = df["Sales"] - df["Expenses"]  

# Filter data  
high_sales = df[df["Sales"] > 2500]  

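# Group by Month and sum each group
# (each month appears once here, so the totals equal the original rows;
#  sort=False keeps calendar order instead of sorting alphabetically)
monthly_summary = df.groupby("Month", sort=False).sum()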
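
In the example above every month appears only once, so grouping has little to aggregate. As a rough sketch, assuming a hypothetical table with several rows per month (the "Region" column and all values are made up for illustration), grouping looks like this:

```python
import pandas as pd

# Hypothetical data: two regions reporting sales for each month
records = pd.DataFrame({
    "Region": ["North", "South", "North", "South"],
    "Month": ["Jan", "Jan", "Feb", "Feb"],
    "Sales": [1200, 800, 1700, 1300],
})

# Total sales per month; sort=False keeps the order of first appearance
by_month = records.groupby("Month", sort=False)["Sales"].sum()
print(by_month)   # Jan 2000, Feb 3000

# Several aggregations at once, per region
by_region = records.groupby("Region")["Sales"].agg(["sum", "mean"])
print(by_region)  # North: sum 2900, mean 1450; South: sum 2100, mean 1050
```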

            

2. numpy: Numerical Computing

numpy handles arrays and mathematical operations efficiently.

Key Features

  • Arrays: Faster than Python lists for numerical tasks.
  • Broadcasting: Perform element-wise operations on arrays of different but compatible shapes, without copying data by hand.
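
Broadcasting is easier to see than to describe. The sketch below uses made-up prices and discount factors; the variable names are illustrative:

```python
import numpy as np

# Prices for 3 products in 2 stores: shape (2, 3)
prices = np.array([[10.0, 20.0, 30.0],
                   [12.0, 22.0, 32.0]])

# One discount factor per product: shape (3,)
discounts = np.array([0.5, 0.75, 1.0])

# The (3,) array is stretched across both rows of the (2, 3) array
discounted = prices * discounts
print(discounted)   # [[ 5.  15.  30. ] [ 6.  16.5 32. ]]

# One fee per store as a column: shape (2, 1), stretched across columns
store_fee = np.array([[1.0], [2.0]])
with_fee = prices + store_fee
print(with_fee)     # [[11. 21. 31.] [14. 24. 34.]]
```

Shapes are compatible when, compared from the last dimension backwards, each pair of sizes is equal or one of them is 1.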

Example: Array Operations


import numpy as np  

# Create arrays  
prices = np.array([10, 20, 30])  
quantities = np.array([5, 3, 2])  

# Vectorized operations  
revenue = prices * quantities  # [50, 60, 60]  

# Universal functions  
print(np.sqrt(revenue))        # Square roots, approximately [7.07, 7.75, 7.75]

# Reshaping
matrix = np.array([[1, 2], [3, 4], [5, 6]])  # shape (3, 2)
reshaped = matrix.reshape(2, 3)              # [[1, 2, 3], [4, 5, 6]]

            

3. matplotlib: Data Visualization

matplotlib creates static, interactive, or animated visualizations.

Basic Plotting


import matplotlib.pyplot as plt  

# Line plot (reuses df from the pandas example above)
plt.plot(df["Month"], df["Sales"], marker="o", label="Sales")  
plt.plot(df["Month"], df["Expenses"], marker="s", label="Expenses")  

# Customize  
plt.title("Monthly Sales vs Expenses")  
plt.xlabel("Month")  
plt.ylabel("Amount ($)")  
plt.legend()  
plt.grid(True)  
plt.show()  

            

Bar Chart


plt.bar(df["Month"], df["Profit"], color="green")  
plt.title("Monthly Profit")  
plt.show()  

            

Real-World Project: Sales Insights

Combine all three libraries to analyze and visualize sales data:


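import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# 1. Load data from CSV (expects "Month" and "Sales" columns)
df = pd.read_csv("sales_data.csv")

# 2. Clean data
df = df.dropna()  # Remove rows with missing values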

# 3. Calculate metrics with numpy  
average_sales = np.mean(df["Sales"])  

# 4. Plot trends  
plt.figure(figsize=(10, 5))  
plt.scatter(df["Month"], df["Sales"], color="blue", label="Sales")  
plt.axhline(average_sales, color="red", linestyle="--", label="Average Sales")  
plt.title("Sales Trends")
plt.xlabel("Month")
plt.ylabel("Sales ($)")
plt.legend()  
plt.show()  
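
The project above reads sales_data.csv, which is not included here. As a minimal sketch, assuming a file with "Month" and "Sales" columns, the load, clean, and measure steps can be tried on an in-memory CSV instead. The sample values are made up for illustration:

```python
import io

import numpy as np
import pandas as pd

# Stand-in for sales_data.csv (hypothetical contents; Feb has no Sales value)
csv_text = """Month,Sales
Jan,2000
Feb,
Mar,2500
Apr,3000
"""

df = pd.read_csv(io.StringIO(csv_text))
rows_before = len(df)               # 4 rows, one with a missing Sales value

df = df.dropna(subset=["Sales"])    # drop only rows where Sales is missing
average_sales = np.mean(df["Sales"])

# Months that beat the average
above_average = df[df["Sales"] > average_sales]

print(rows_before, len(df))         # 4 3
print(average_sales)                # 2500.0
print(above_average["Month"].tolist())   # ['Apr']
```

Passing subset= to dropna() limits the check to the columns that matter, so rows are not discarded because of gaps in unrelated columns.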

            

Key Comparisons

Library      Purpose                Key Features
pandas       Data manipulation      DataFrames, filtering, grouping
numpy        Numerical operations   Arrays, vectorization, math ops
matplotlib   Visualization          Line/bar charts, customization

Common Mistakes

  • ❌ Not handling NaN values: Use df.dropna() to remove them or df.fillna(value) to replace them.
  • ❌ Using loops instead of vectorization: Leverage numpy/pandas operations.
  • ❌ Overcomplicating plots: Start simple, then customize.
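
The first two mistakes can be illustrated with a short sketch. The data is made up, and filling with the column mean is only one possible choice:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Sales": [2000.0, np.nan, 2500.0],
    "Expenses": [1500.0, 1700.0, 1600.0],
})

# Option 1: drop rows that contain NaN
dropped = df.dropna()
print(len(dropped))   # 2

# Option 2: fill NaN, here with the column mean (mean() skips NaN by default)
filled = df.fillna({"Sales": df["Sales"].mean()})
print(filled["Sales"].tolist())   # [2000.0, 2250.0, 2500.0]

# Slow: a Python loop over rows
profit_loop = []
for _, row in filled.iterrows():
    profit_loop.append(row["Sales"] - row["Expenses"])

# Fast and shorter: one vectorized expression over whole columns
profit_vec = filled["Sales"] - filled["Expenses"]
print(profit_vec.tolist())        # [500.0, 550.0, 900.0]
```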

Best Practices

  • Inspect data first: Use df.head(), df.info(), and df.describe().
  • Use vectorized operations: Avoid Python loops with numpy/pandas.
  • Label plots clearly: Always add titles, axis labels, and legends.
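
A first look at a new dataset might go like the sketch below. The data is illustrative; isna().sum() is an extra check not listed above that counts missing values per column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Month": ["Jan", "Feb", "Mar"],
    "Sales": [2000.0, np.nan, 2500.0],
})

print(df.head())       # first rows: do the columns look right?
df.info()              # column dtypes and non-null counts (prints directly)
print(df.describe())   # summary statistics for numeric columns

# Count missing values per column before deciding how to handle them
missing_per_column = df.isna().sum()
print(missing_per_column)   # Month 0, Sales 1
```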

Practice Problem

Create a DataFrame with student grades.

Calculate average grades per subject using numpy.

Plot a bar chart showing subject averages.
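
One possible approach is sketched below; try the problem yourself first. The student names, grades, and the output file name subject_averages.png are all made up, and the Agg backend is an assumption so the sketch runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend; remove to use plt.show()
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# 1. DataFrame with student grades (illustrative values)
grades = pd.DataFrame({
    "Student": ["Ana", "Ben", "Cal"],
    "Math": [90, 80, 70],
    "Science": [85, 75, 95],
    "History": [60, 70, 80],
})

# 2. Average per subject with numpy (axis=0 averages down each column)
subjects = ["Math", "Science", "History"]
averages = np.mean(grades[subjects].to_numpy(), axis=0)
print(averages)   # [80. 85. 70.]

# 3. Bar chart of subject averages, labeled as recommended above
fig, ax = plt.subplots()
ax.bar(subjects, averages, color="green")
ax.set_title("Average Grade per Subject")
ax.set_xlabel("Subject")
ax.set_ylabel("Average grade")
fig.savefig("subject_averages.png")  # hypothetical output file name
```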

Key Takeaways

  • ✅ pandas: Clean, filter, and analyze tabular data.
  • ✅ numpy: Perform fast numerical computations.
  • ✅ matplotlib: Visualize trends and patterns.

What’s Next?

Learn Seaborn for advanced visualizations or scikit-learn for machine learning!
