Python Snippets

Asynchronous Web Scraper with aiohttp and BeautifulSoup

import aiohttp
import asyncio
from bs4 import BeautifulSoup

async def fetch_url(session, url):
    # Fetch one URL; return its HTML on success, or None on any failure
    try:
        async with session.get(url) as response:
            if response.status == 200:
                return await response.text()
            return None
    except Exception as e:
        print(f"Error fetching {url}: {e}")
        return None

async def scrape_urls(urls):
    # Fetch all URLs concurrently, then parse each page's <title>
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_url(session, url) for url in urls]
        html_pages = await asyncio.gather(*tasks)
        
        results = []
        for url, html in zip(urls, html_pages):
            if html:
                soup = BeautifulSoup(html, 'html.parser')
                title = soup.title.string if soup.title else "No title"
                results.append((url, title))
        return results

async def main():
    urls = [
        'https://python.org',
        'https://github.com',
        'https://stackoverflow.com',
        'https://pypi.org'
    ]
    scraped_data = await scrape_urls(urls)
    for url, title in scraped_data:
        print(f"{url}: {title}")

if __name__ == "__main__":
    asyncio.run(main())

Explanation

This code snippet demonstrates a modern, asynchronous web scraper that efficiently fetches and processes multiple web pages concurrently. Here’s why it’s useful:

  1. Performance: Uses asyncio and aiohttp to issue HTTP requests concurrently, which is significantly faster than fetching the same pages sequentially (see the sketch after this list)
  2. Robustness: Includes error handling for failed requests
  3. Practical: Extracts page titles, but can be easily modified to scrape other data
  4. Modern: Uses Python’s async/await syntax for clean asynchronous code
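
To make the performance point concrete, here is a minimal, self-contained sketch that uses asyncio.sleep() to stand in for network latency (no real requests are made); the timings are illustrative only:

import asyncio
import time

async def fake_fetch(url, delay=1.0):
    # Stand-in for a network request: wait, then return a label
    await asyncio.sleep(delay)
    return f"fetched {url}"

async def demo():
    urls = ['a', 'b', 'c']

    start = time.perf_counter()
    for url in urls:                      # one after another
        await fake_fetch(url)
    print(f"sequential: {time.perf_counter() - start:.1f}s")  # ~3.0s

    start = time.perf_counter()
    await asyncio.gather(*(fake_fetch(u) for u in urls))      # overlapped waits
    print(f"concurrent: {time.perf_counter() - start:.1f}s")  # ~1.0s

if __name__ == "__main__":
    asyncio.run(demo())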

How It Works

  1. fetch_url() - An async function that:
    • Makes a GET request to a URL using aiohttp
    • Returns the HTML content if successful (status 200)
    • Gracefully handles errors
  2. scrape_urls() - The main scraping function that:
    • Creates a ClientSession for making HTTP requests
    • Schedules one fetch task per URL and runs them all concurrently with asyncio.gather()
    • Uses BeautifulSoup to parse each page and extract its title (see the parsing sketch after this list)
    • Returns a list of (url, title) tuples
  3. main() - Demonstrates usage with example URLs
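
The parsing step can also be tried in isolation. The sketch below runs the same title extraction on a hard-coded HTML string (made up for illustration), so it needs no network access:

from bs4 import BeautifulSoup

# Made-up markup standing in for a downloaded page
html = "<html><head><title>Example Page</title></head><body><p>Hi</p></body></html>"

soup = BeautifulSoup(html, 'html.parser')
# Same expression scrape_urls() uses, with a fallback when <title> is missing
title = soup.title.string if soup.title else "No title"
print(title)  # Example Page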

How to Run

  1. Install dependencies:
    pip install aiohttp beautifulsoup4
    
  2. Save the code to a file (e.g., async_scraper.py)

  3. Run it:
    python async_scraper.py
    

Customization

The parsing logic is deliberately small: to scrape something other than titles, replace the title extraction inside scrape_urls() with whatever you need (links, headings, metadata). This pattern is particularly useful for data collection tasks where you need to process multiple web pages efficiently.
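
As one hypothetical example, the helper below (the name extract_links is made up) pulls anchor hrefs instead of the page title; inside scrape_urls() you would call it in place of the title expression:

from bs4 import BeautifulSoup

def extract_links(html, limit=5):
    # Hypothetical replacement for the title extraction: collect hrefs instead
    soup = BeautifulSoup(html, 'html.parser')
    return [a['href'] for a in soup.find_all('a', href=True)][:limit]

# Inside scrape_urls(), swap the title line for something like:
#     results.append((url, extract_links(html)))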