Python Snippets

Asynchronous Web Scraper with aiohttp and BeautifulSoup

import aiohttp
import asyncio
from bs4 import BeautifulSoup

async def fetch_url(session, url):
    # Fetch one URL; return its HTML on success, or None on any failure
    try:
        async with session.get(url) as response:
            if response.status == 200:
                return await response.text()
            return None
    except Exception as e:
        print(f"Error fetching {url}: {e}")
        return None

async def scrape_urls(urls):
    # Fetch all URLs concurrently, then parse each page's <title>
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_url(session, url) for url in urls]
        html_pages = await asyncio.gather(*tasks)
        
        results = []
        for url, html in zip(urls, html_pages):
            if html:
                soup = BeautifulSoup(html, 'html.parser')
                title = soup.title.string if soup.title else "No title"
                results.append((url, title))
        return results

async def main():
    urls = [
        'https://python.org',
        'https://github.com',
        'https://stackoverflow.com',
        'https://pypi.org'
    ]
    scraped_data = await scrape_urls(urls)
    for url, title in scraped_data:
        print(f"{url}: {title}")

if __name__ == "__main__":
    asyncio.run(main())

Explanation

This code snippet demonstrates a modern, asynchronous web scraper that efficiently fetches and processes multiple web pages concurrently. Here’s why it’s useful:

  1. Performance: Uses asyncio and aiohttp to issue HTTP requests concurrently, which is significantly faster than fetching the same pages sequentially (see the sketch after this list)
  2. Robustness: Includes error handling for failed requests
  3. Practical: Extracts page titles, but can be easily modified to scrape other data
  4. Modern: Uses Python’s async/await syntax for clean asynchronous code
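
To make the performance point concrete, here is a minimal, self-contained sketch that uses asyncio.sleep() to stand in for network latency (no real requests are made); the timings are illustrative only:

import asyncio
import time

async def fake_fetch(url, delay=1.0):
    # Stand-in for a network request: wait, then return a label
    await asyncio.sleep(delay)
    return f"fetched {url}"

async def demo():
    urls = ['a', 'b', 'c']

    start = time.perf_counter()
    for url in urls:                      # one after another
        await fake_fetch(url)
    print(f"sequential: {time.perf_counter() - start:.1f}s")  # ~3.0s

    start = time.perf_counter()
    await asyncio.gather(*(fake_fetch(u) for u in urls))      # overlapped waits
    print(f"concurrent: {time.perf_counter() - start:.1f}s")  # ~1.0s

if __name__ == "__main__":
    asyncio.run(demo())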

How It Works

  1. fetch_url() - An async function that:
    • Makes a GET request to a URL using aiohttp
    • Returns the HTML content if successful (status 200)
    • Gracefully handles errors
  2. scrape_urls() - The main scraping function that:
    • Creates a ClientSession for making HTTP requests
    • Schedules one fetch task per URL and runs them all concurrently with asyncio.gather()
    • Uses BeautifulSoup to parse each page and extract its title (see the parsing sketch after this list)
    • Returns a list of (url, title) tuples
  3. main() - Demonstrates usage with example URLs
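
The parsing step can also be tried in isolation. The sketch below runs the same title extraction on a hard-coded HTML string (made up for illustration), so it needs no network access:

from bs4 import BeautifulSoup

# Made-up markup standing in for a downloaded page
html = "<html><head><title>Example Page</title></head><body><p>Hi</p></body></html>"

soup = BeautifulSoup(html, 'html.parser')
# Same expression scrape_urls() uses, with a fallback when <title> is missing
title = soup.title.string if soup.title else "No title"
print(title)  # Example Page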

How to Run

  1. Install dependencies:
    pip install aiohttp beautifulsoup4
    
  2. Save the code to a file (e.g., async_scraper.py)

  3. Run it:
    python async_scraper.py
    

Customization

The parsing logic is deliberately small: to scrape something other than titles, replace the title extraction inside scrape_urls() with whatever you need (links, headings, metadata). This pattern is particularly useful for data collection tasks where you need to process multiple web pages efficiently.
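
As one hypothetical example, the helper below (the name extract_links is made up) pulls anchor hrefs instead of the page title; inside scrape_urls() you would call it in place of the title expression:

from bs4 import BeautifulSoup

def extract_links(html, limit=5):
    # Hypothetical replacement for the title extraction: collect hrefs instead
    soup = BeautifulSoup(html, 'html.parser')
    return [a['href'] for a in soup.find_all('a', href=True)][:limit]

# Inside scrape_urls(), swap the title line for something like:
#     results.append((url, extract_links(html)))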