
Python Web Scraping: Extract Data Like a Pro

Master tools like requests, BeautifulSoup, and Scrapy to gather data from websites!

1. requests: Fetch Web Pages

The requests library fetches HTML content from URLs.

Basic Usage


import requests  

url = "https://books.toscrape.com/"  
response = requests.get(url)  

if response.status_code == 200:  
    print("Success!")  
    html_content = response.text  
else:  
    print(f"Error: {response.status_code}")  


Key Features:

  • Handles custom headers, cookies, and persistent sessions.
  • Supports POST requests and other HTTP methods (see the sketch below).
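A minimal sketch of those features, using the public httpbin.org echo service; the URL, header value, and form payload here are illustrative assumptions, not part of the original example.

import requests

# A Session reuses the connection and keeps cookies across requests
session = requests.Session()
session.headers.update({"User-Agent": "my-scraper/0.1"})  # custom header

# GET with query parameters
resp = session.get("https://httpbin.org/get", params={"q": "books"})
print(resp.status_code)

# POST a form payload; httpbin echoes it back as JSON
resp = session.post("https://httpbin.org/post", data={"title": "A Light in the Attic"})
print(resp.json()["form"])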

2. BeautifulSoup: Parse HTML

BeautifulSoup parses HTML/XML and extracts data using tags, classes, or IDs.

Installation


pip install beautifulsoup4  


Example: Scrape Book Titles


from bs4 import BeautifulSoup  

# Parse HTML  
soup = BeautifulSoup(html_content, "html.parser")  

# Extract all book titles  
books = soup.select("article.product_pod h3 a")  
for book in books:  
    print(book["title"])  


Output:


A Light in the Attic  
Tipping the Velvet  
Soumission  
...  


Common Methods (examples below):

  • soup.find("div", class_="..."): find the first matching element.
  • soup.find_all("a"): find all matching elements.
  • soup.select(".price_color"): select elements with a CSS selector.
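A short sketch of all three methods on the same parsed page, assuming the soup object created in the example above:

# First element with the "price_color" class (a price paragraph on this site)
first_price = soup.find("p", class_="price_color")
print(first_price.text)

# Every link on the page
links = soup.find_all("a")
print(len(links))

# All prices via a CSS selector
prices = [p.text for p in soup.select(".price_color")]
print(prices[:3])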

3. Scrapy: Advanced Scraping Framework

Scrapy is a powerful framework for large-scale scraping projects.

Installation


pip install scrapy  


Create a Scrapy Project

Generate Project:


scrapy startproject book_scraper  
cd book_scraper  


Create a Spider:


# book_scraper/spiders/books.py  
import scrapy  

class BookSpider(scrapy.Spider):  
    name = "books"  
    start_urls = ["https://books.toscrape.com/"]  

    def parse(self, response):  
        for book in response.css("article.product_pod"):  
            yield {  
                "title": book.css("h3 a::attr(title)").get(),  
                "price": book.css(".price_color::text").get(),  
            }  
        # Follow pagination  
        next_page = response.css("li.next a::attr(href)").get()  
        if next_page:  
            yield response.follow(next_page, callback=self.parse)  


Run the Spider:


scrapy crawl books -O books.json  


Output: A books.json file with all scraped data.
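A truncated sketch of what the exported file might contain; the exact titles and prices depend on the site at crawl time:

[
  {"title": "A Light in the Attic", "price": "£51.77"},
  {"title": "Tipping the Velvet", "price": "£53.74"},
  ...
]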

4. Real-World Projects

Project 1: Price Comparison Tool

  • Scrape prices of a product from Amazon, eBay, and Walmart.
  • Compare the prices using pandas (see the sketch after this list).
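A minimal sketch of the comparison step, assuming the prices have already been scraped; the product prices below are made-up placeholders:

import pandas as pd

# Hypothetical prices gathered by the scrapers
prices = {"Amazon": 24.99, "eBay": 22.50, "Walmart": 23.75}

df = pd.DataFrame(list(prices.items()), columns=["store", "price"])
cheapest = df.loc[df["price"].idxmin()]
print(df.sort_values("price"))
print(f"Cheapest: {cheapest['store']} at ${cheapest['price']:.2f}")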

Project 2: News Aggregator

  • Extract headlines from CNN, BBC, and NYT.
  • Save them to a CSV file (see the sketch below).
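A sketch of the saving step with Python's built-in csv module, assuming the headlines have already been collected; the rows below are placeholders:

import csv

# Hypothetical headlines collected by a scraper
headlines = [
    ("BBC", "Example headline one"),
    ("CNN", "Example headline two"),
]

with open("headlines.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["source", "headline"])
    writer.writerows(headlines)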

5. Common Challenges & Solutions

  • Dynamic content: render the page with selenium (browser automation); see the sketch below.
  • Anti-scraping mechanisms: rotate user agents, use proxies, and add delays between requests.
  • CAPTCHAs: avoid sites that rely on them, or use a paid solving service.
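A minimal selenium sketch for JavaScript-rendered pages. It reuses the demo site from above purely as a stand-in URL (that site does not actually require a browser) and assumes a local Chrome installation:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # recent Selenium versions fetch a matching driver automatically
try:
    driver.get("https://books.toscrape.com/")
    # Elements are queried only after the browser has rendered the page
    titles = driver.find_elements(By.CSS_SELECTOR, "article.product_pod h3 a")
    for t in titles[:5]:
        print(t.get_attribute("title"))
finally:
    driver.quit()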

Best Practices

  • Respect robots.txt: Check https://example.com/robots.txt.
  • Limit Request Rate: Use time.sleep(2) between requests.
  • Cache Responses: Avoid re-scraping the same page.
  • Handle Errors: Retry failed requests (a combined sketch of these practices follows).
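A combined sketch of the robots.txt check, rate limiting, and retries; fetch is a hypothetical helper, and the retry count and delay are arbitrary choices:

import time
import requests
from urllib import robotparser

robots = robotparser.RobotFileParser("https://books.toscrape.com/robots.txt")
robots.read()

def fetch(url, retries=3, delay=2):
    """Politely fetch a URL: obey robots.txt, wait between attempts, retry on failure."""
    if not robots.can_fetch("*", url):
        raise PermissionError(f"robots.txt disallows {url}")
    for attempt in range(retries):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            return response.text
        except requests.RequestException:
            time.sleep(delay)  # back off before retrying
    raise RuntimeError(f"Failed to fetch {url} after {retries} attempts")

html = fetch("https://books.toscrape.com/")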

Comparison of Tools

  • requests: simple HTTP requests (low learning curve).
  • BeautifulSoup: small-scale HTML parsing (moderate learning curve).
  • Scrapy: large-scale, structured scraping (high learning curve).

Ethical & Legal Note

  • Always check a website’s terms of service before scraping.
  • Never scrape personal or sensitive data.

Key Takeaways

  • requests: Fetch web content.
  • BeautifulSoup: Parse and extract data from HTML.
  • Scrapy: Build scalable scrapers with built-in pipelines.

What’s Next?

Learn APIs to access data more efficiently (e.g., Twitter API, Reddit API).
