Python Snippets

Parallel File Processing with ThreadPoolExecutor

This snippet demonstrates how to process multiple files concurrently using Python’s concurrent.futures.ThreadPoolExecutor. Threads suit I/O-bound work such as reading or writing many files. For CPU-bound processing, the global interpreter lock (GIL) in CPython limits the speedup from threads, and ProcessPoolExecutor is usually the better fit.

import os
from concurrent.futures import ThreadPoolExecutor

def process_file(file_path):
    """Example function to process a single file."""
    try:
        with open(file_path, 'r') as f:
            content = f.read()
        # Example processing: count lines. splitlines() avoids the extra
        # empty entry that split('\n') produces after a trailing newline.
        line_count = len(content.splitlines())
        print(f"Processed {file_path}: {line_count} lines")
        return line_count
    except Exception as e:
        print(f"Error processing {file_path}: {e}")
        return None

def process_files_parallel(directory, max_workers=4):
    """Process all files in a directory in parallel."""
    files = [
        os.path.join(directory, f) 
        for f in os.listdir(directory) 
        if os.path.isfile(os.path.join(directory, f))
    ]
    
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        results = list(executor.map(process_file, files))
    
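    # Filter out None results (files that failed to process)
    successful_results = [r for r in results if r is not None]
    failed_count = len(results) - len(successful_results)
    print(f"\nFiles processed successfully: {len(successful_results)} (failed: {failed_count})")
    print(f"Total lines counted: {sum(successful_results)}")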

if __name__ == "__main__":
    directory = "./sample_files"  # Change to your target directory
    process_files_parallel(directory, max_workers=8)

Explanation

Why This Is Useful

Key Features

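  1. ThreadPoolExecutor: Manages a pool of worker threads that run process_file concurrently; executor.map returns results in the same order as the input files.
  2. Dynamic File Discovery: Lists the regular files directly inside a directory (subdirectories are not traversed).
  3. Result Aggregation: Collects the per-file line counts, skips files that failed, and prints the totals.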
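
The result aggregation above can also keep each count tied to its file. The following is a minimal sketch, not part of the original snippet: it swaps executor.map for executor.submit with as_completed, and the names count_lines and count_lines_by_file are hypothetical helpers introduced only for illustration. It assumes the files are UTF-8 text and replaces undecodable bytes rather than failing.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path


def count_lines(path):
    """Return the number of lines in a text file (hypothetical helper)."""
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        return sum(1 for _ in f)


def count_lines_by_file(paths, max_workers=4):
    """Return (counts, errors): two dicts keyed by path (hypothetical helper)."""
    counts, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Remember which future belongs to which path.
        future_to_path = {executor.submit(count_lines, p): p for p in paths}
        # as_completed yields each future as soon as its task finishes.
        for future in as_completed(future_to_path):
            path = future_to_path[future]
            try:
                counts[path] = future.result()
            except OSError as exc:
                # Missing or unreadable files are recorded, not printed.
                errors[path] = exc
    return counts, errors


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        first = Path(tmp) / "first.txt"
        first.write_text("alpha\nbeta\n", encoding="utf-8")
        missing = Path(tmp) / "missing.txt"  # deliberately not created
        counts, errors = count_lines_by_file([first, missing])
        print("counts:", counts)
        print("errors:", errors)
```

Because as_completed yields futures in completion order rather than submission order, the results are stored in dictionaries keyed by path instead of relying on list position.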

How to Run

  1. Save the code to a file (e.g., parallel_processor.py).
  2. Create a directory (e.g., ./sample_files) and populate it with text files.
  3. Adjust max_workers to suit your workload (the function defaults to 4; the example call passes 8).
  4. Run with:
    python parallel_processor.py
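
If no text files are at hand for step 2, a few can be generated first. This is a minimal sketch under stated assumptions: make_sample_files, the sample_N.txt file names, and the file contents are all hypothetical and exist only to give the processor something to read.

```python
from pathlib import Path


def make_sample_files(directory="./sample_files", count=5, lines_per_file=3):
    """Create `count` small text files and return their paths (hypothetical helper)."""
    target = Path(directory)
    target.mkdir(parents=True, exist_ok=True)
    paths = []
    for i in range(count):
        path = target / f"sample_{i}.txt"
        lines = [f"line {n} of file {i}" for n in range(lines_per_file)]
        # End each file with a newline, as most text editors do.
        path.write_text("\n".join(lines) + "\n", encoding="utf-8")
        paths.append(path)
    return paths


if __name__ == "__main__":
    created = make_sample_files()
    print(f"Created {len(created)} files in ./sample_files")
```

Running this once before parallel_processor.py gives the script a populated ./sample_files directory to scan.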
    

Use Cases