Python Web Scraping: Extract Data Like a Pro
Master tools like requests, BeautifulSoup, and Scrapy to gather data from websites!
1. requests: Fetch Web Pages
The requests library fetches HTML content from URLs.
Basic Usage
```python
import requests

url = "https://books.toscrape.com/"
response = requests.get(url)

if response.status_code == 200:
    print("Success!")
    html_content = response.text
else:
    print(f"Error: {response.status_code}")
```
Key Features:
- Handle headers, cookies, and sessions.
- Support for POST requests (both shown in the sketch below).
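A minimal sketch of these features, using a hypothetical login endpoint; the URL and form fields are placeholders:

```python
import requests

# A session reuses the underlying connection and persists cookies
session = requests.Session()
session.headers.update({"User-Agent": "my-scraper/1.0"})

# Hypothetical login endpoint and credentials, for illustration only
response = session.post(
    "https://example.com/login",
    data={"username": "alice", "password": "secret"},
)
print(response.status_code)
```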
2. BeautifulSoup: Parse HTML
BeautifulSoup parses HTML/XML and extracts data using tags, classes, or IDs.
Installation
```
pip install beautifulsoup4
```
Example: Scrape Book Titles
```python
from bs4 import BeautifulSoup

# Parse the HTML fetched with requests
soup = BeautifulSoup(html_content, "html.parser")

# Extract all book titles
books = soup.select("article.product_pod h3 a")
for book in books:
    print(book["title"])
```
Output:
```
A Light in the Attic
Tipping the Velvet
Soumission
...
```
Common Methods:
| Method | Purpose |
|---|---|
| `soup.find("div", class_="...")` | Find the first matching element. |
| `soup.find_all("a")` | Find all matching elements. |
| `soup.select(".price_color")` | Find elements with a CSS selector. |
3. Scrapy: Advanced Scraping Framework
Scrapy is a powerful framework for large-scale scraping projects.
Installation
```
pip install scrapy
```
Create a Scrapy Project
Generate Project:
```
scrapy startproject book_scraper
cd book_scraper
```
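For reference, `startproject` generates a layout like this (the exact file list may vary slightly by Scrapy version):

```
book_scraper/
    scrapy.cfg
    book_scraper/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
```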
Create a Spider:
```python
# book_scraper/spiders/books.py
import scrapy

class BookSpider(scrapy.Spider):
    name = "books"
    start_urls = ["https://books.toscrape.com/"]

    def parse(self, response):
        for book in response.css("article.product_pod"):
            yield {
                "title": book.css("h3 a::attr(title)").get(),
                "price": book.css(".price_color::text").get(),
            }

        # Follow pagination
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```
Run the Spider:
```
scrapy crawl books -O books.json
```
Output: a books.json file with all scraped data.
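The file holds one object per book; the values shown here are illustrative:

```
[
  {"title": "A Light in the Attic", "price": "£51.77"},
  {"title": "Tipping the Velvet", "price": "£53.74"},
  ...
]
```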
4. Real-World Projects
Project 1: Price Comparison Tool
- Scrape prices of a product from Amazon, eBay, and Walmart.
- Compare prices using pandas (see the sketch after this list).
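A minimal sketch of the comparison step, assuming the prices have already been scraped into a dict; the numbers are made up:

```python
import pandas as pd

# Hypothetical prices collected by the scrapers (illustrative values)
prices = {
    "store": ["Amazon", "eBay", "Walmart"],
    "price": [19.99, 17.49, 18.25],
}

df = pd.DataFrame(prices)
cheapest = df.loc[df["price"].idxmin()]
print(df.sort_values("price"))
print(f"Cheapest: {cheapest['store']} at ${cheapest['price']:.2f}")
```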
Project 2: News Aggregator
- Extract headlines from CNN, BBC, and NYT.
- Save to a CSV file (see the sketch after this list).
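A rough sketch of the extract-and-save step; the URL and CSS selector are placeholders, since every news site needs its own selector:

```python
import csv

import requests
from bs4 import BeautifulSoup

# Placeholder URL and selector -- adapt to the target site's markup
response = requests.get("https://example.com/news")
soup = BeautifulSoup(response.text, "html.parser")
headlines = [h.get_text(strip=True) for h in soup.select("h2.headline")]

with open("headlines.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["headline"])
    for headline in headlines:
        writer.writerow([headline])
```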
5. Common Challenges & Solutions
| Challenge | Solution |
|---|---|
| Dynamic content | Use `selenium` (browser automation). |
| Anti-scraping mechanisms | Rotate user agents, use proxies, add delays. |
| CAPTCHAs | Avoid sites with CAPTCHAs or use paid solving services. |
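A minimal selenium sketch for a JavaScript-rendered page; it assumes Chrome is installed (Selenium 4 can fetch a matching driver automatically):

```python
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://books.toscrape.com/")

# page_source holds the DOM after JavaScript has run
soup = BeautifulSoup(driver.page_source, "html.parser")
print(soup.title.get_text())

driver.quit()
```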
Best Practices
- Respect robots.txt: check https://example.com/robots.txt before scraping.
- Limit request rate: use `time.sleep(2)` between requests.
- Cache responses: avoid re-scraping the same page.
- Handle errors: retry failed requests (see the sketch after this list).
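A small helper in the spirit of these practices; the attempt count and delay are arbitrary choices:

```python
import time

import requests

def fetch_with_retry(url, attempts=3, delay=2):
    """Fetch a URL, retrying on network errors or non-200 responses."""
    for attempt in range(1, attempts + 1):
        try:
            response = requests.get(url, timeout=10)
            if response.status_code == 200:
                return response
        except requests.RequestException as exc:
            print(f"Attempt {attempt} failed: {exc}")
        time.sleep(delay)  # be polite between retries
    return None
```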
Comparison of Tools
| Tool | Use Case | Learning Curve |
|---|---|---|
| `requests` | Simple HTTP requests. | Low |
| `BeautifulSoup` | Small-scale HTML parsing. | Moderate |
| `Scrapy` | Large-scale, structured scraping. | High |
Ethical & Legal Note
- Always check a website’s terms of service before scraping.
- Never scrape personal or sensitive data.
Key Takeaways
- ✅ `requests`: Fetch web content.
- ✅ `BeautifulSoup`: Parse and extract data from HTML.
- ✅ `Scrapy`: Build scalable scrapers with built-in pipelines.
What’s Next?
Learn APIs to access data more efficiently (e.g., Twitter API, Reddit API).
Tags: python