Crawlee-Python is a robust and versatile web scraping and browser automation library designed to help developers build reliable and efficient scrapers. As an open-source project, Crawlee-Python offers a comprehensive set of tools and features that simplify the process of crawling websites, extracting data, and managing the complexities of web scraping at scale.
Key Features of Crawlee-Python
Crawlee-Python stands out from other web scraping libraries due to its unique combination of features:
Unified Interface: Crawlee provides a consistent interface for both HTTP and headless browser crawling, allowing developers to switch between methods easily (see the short sketch after this list).
Automatic Parallel Crawling: The library optimizes resource utilization by automatically scaling crawling operations based on available system resources.
Type Hints: Written in Python with full type hint coverage, Crawlee enhances developer experience through improved IDE autocompletion and early bug detection.
Automatic Retries: Built-in mechanisms for handling errors and retrying requests when blocked by websites.
Proxy Rotation and Session Management: Integrated tools for managing proxies and sessions to improve scraping reliability.
Configurable Request Routing: Easily direct URLs to appropriate handlers for efficient data extraction.
Persistent Queue: A robust queue system for managing URLs to be crawled.
Pluggable Storage: Flexible options for storing both tabular data and files.
Headless Browser Support: Out-of-the-box support for headless browser crawling using Playwright.
Asyncio-Based: Built on Python's standard asyncio library for efficient asynchronous operations.
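To give a quick taste of that unified interface before we get to installation, here is a minimal sketch of an HTTP-based crawler built with BeautifulSoupCrawler; note how the router, handler, and push_data calls have the same shape as the Playwright examples later in this guide (the target URL is just a placeholder).

import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main():
    crawler = BeautifulSoupCrawler(max_requests_per_crawl=10)

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext):
        # context.soup is the parsed BeautifulSoup document for the fetched page
        await context.push_data({
            'url': context.request.url,
            'title': context.soup.title.string if context.soup.title else None,
        })

    await crawler.run(['https://example.com'])


if __name__ == '__main__':
    asyncio.run(main())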
Getting Started with Crawlee-Python
Let's walk through the process of setting up and using Crawlee-Python for your web scraping projects.
Installation
To get started with Crawlee-Python, you'll need Python 3.9 or higher. The easiest way to install Crawlee is using pip:
pip install crawlee
For additional features, you can install optional extras:
pip install 'crawlee[beautifulsoup]' # For BeautifulSoupCrawler
pip install 'crawlee[playwright]' # For PlaywrightCrawler
If you're using the PlaywrightCrawler, don't forget to install the Playwright browsers as well:
playwright install
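If you only need a single browser, you can limit the download to it; this is a standard Playwright CLI option rather than anything Crawlee-specific:
playwright install chromium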
Creating Your First Crawler
Let's create a simple crawler using the PlaywrightCrawler to scrape web page titles and content.
import asyncio

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext


async def main():
    crawler = PlaywrightCrawler(
        max_requests_per_crawl=50,  # Limit the crawl to 50 pages
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext):
        data = {
            'url': context.page.url,
            'title': await context.page.title(),
            'content': (await context.page.content())[:1000],  # First 1000 characters
        }
        await context.push_data(data)

    await crawler.run(['https://example.com'])

    # Export data to CSV
    await crawler.export_data('./result.csv')


if __name__ == '__main__':
    asyncio.run(main())
This script does the following:
- Creates a PlaywrightCrawler instance with a limit of 50 pages.
- Defines a request handler that extracts the page URL, title, and first 1000 characters of content.
- Runs the crawler starting from 'https://example.com'.
- Exports the collected data to a CSV file.
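Besides exporting to a file, you can read the stored items back in the same script. A minimal sketch of the last lines of main(), assuming your Crawlee version exposes the crawler.get_data() helper:

    await crawler.run(['https://example.com'])

    # Read the collected items back instead of (or in addition to) exporting them
    data = await crawler.get_data()
    for item in data.items:
        print(item['url'], item['title'])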
Advanced Usage: Crawling Multiple URLs
Let's expand our crawler to handle multiple starting URLs and implement more advanced features.
import asyncio

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext


async def main():
    crawler = PlaywrightCrawler(
        max_requests_per_crawl=100,
        headless=True,
        browser_type='chromium',
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext):
        page = context.page
        url = page.url

        # Extract title and main content
        title = await page.title()
        content = await page.evaluate('() => document.body.innerText')

        # Extract all links on the page
        links = await page.evaluate('() => Array.from(document.links).map(link => link.href)')

        # Store the data
        await context.push_data({
            'url': url,
            'title': title,
            'content': content[:500],  # First 500 characters
            'links_found': links,
        })

        # Enqueue found links for crawling
        await context.enqueue_links()

    # Start URLs
    start_urls = [
        'https://example.com',
        'https://another-example.com',
        'https://third-example.com',
    ]
    await crawler.run(start_urls)

    # Export data to JSON
    await crawler.export_data('./result.json')


if __name__ == '__main__':
    asyncio.run(main())
This enhanced script demonstrates:
- Configuring the crawler with more options (headless mode, browser type).
- Extracting multiple data points from each page (title, content, links).
- Using context.enqueue_links() to automatically add discovered links to the crawl queue (see the sketch after this list for restricting which links are followed).
- Starting the crawler with multiple URLs.
- Exporting data in JSON format.
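If you only want to follow a subset of the links on each page, enqueue_links also accepts a CSS selector. A short sketch of the call inside the request handler; the a.article-link selector is a placeholder for whatever markup your target site uses:

        # Only follow links rendered with the (placeholder) article-link class
        await context.enqueue_links(selector='a.article-link')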
Implementing Custom Logic: Filtering and Processing
Let's add some custom logic to filter pages and process data before storing:
import asyncio
import re

from crawlee import Glob
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext


async def main():
    crawler = PlaywrightCrawler(
        max_requests_per_crawl=200,
        headless=True,
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext):
        page = context.page
        url = page.url

        # Only process pages with 'blog' in the URL
        if 'blog' not in url:
            return

        title = await page.title()
        content = await page.evaluate('() => document.body.innerText')

        # Extract a date if one is present in the page source
        date_match = re.search(r'\d{4}-\d{2}-\d{2}', await page.content())
        date = date_match.group(0) if date_match else 'Unknown'

        # Process content: collapse whitespace and truncate
        processed_content = ' '.join(content.split())[:1000]

        await context.push_data({
            'url': url,
            'title': title,
            'date': date,
            'content_preview': processed_content,
        })

        # Only enqueue same-domain links whose URL matches the blog patterns
        await context.enqueue_links(
            include=[Glob('**/blog/**'), Glob('**/blog*')],
            strategy='same-domain',
        )

    start_url = 'https://example-blog.com'
    await crawler.run([start_url])

    # Export data to CSV
    await crawler.export_data('./blog_data.csv')


if __name__ == '__main__':
    asyncio.run(main())
This script showcases:
- URL filtering to only process blog pages.
- Custom data extraction (like finding dates in the content; see the sketch after this list for normalizing them).
- Content processing (removing extra whitespace and truncating).
- Selective link enqueueing using glob patterns and domain strategy.
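The regular expression above only captures the raw YYYY-MM-DD string. If you want a validated value in your dataset, you can normalize it with the standard library before calling push_data; a stdlib-only sketch (the helper name is ours, not part of Crawlee):

from datetime import datetime


def normalize_date(raw: str) -> str:
    """Return an ISO date string, or 'Unknown' if the matched text is not a real date."""
    try:
        return datetime.strptime(raw, '%Y-%m-%d').date().isoformat()
    except ValueError:
        return 'Unknown'


# normalize_date('2024-02-30') -> 'Unknown'; normalize_date('2024-07-05') -> '2024-07-05'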
Handling Different Page Types
In real-world scenarios, you might need to handle different types of pages differently. Here's how you can use Crawlee's router together with request labels to achieve this:
import asyncio

from crawlee import Glob, Request
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext


async def main():
    crawler = PlaywrightCrawler(
        max_requests_per_crawl=300,
        headless=True,
    )

    # Crawlee routes requests by label: product and category URLs are tagged
    # with a label when they are enqueued, and each label gets its own handler.
    @crawler.router.handler('PRODUCT')
    async def product_handler(context: PlaywrightCrawlingContext):
        page = context.page
        product_name = await page.eval_on_selector('.product-name', 'el => el.textContent')
        price = await page.eval_on_selector('.price', 'el => el.textContent')
        description = await page.eval_on_selector('.description', 'el => el.textContent')
        await context.push_data({
            'type': 'product',
            'url': page.url,
            'name': product_name,
            'price': price,
            'description': description,
        })

    @crawler.router.handler('CATEGORY')
    async def category_handler(context: PlaywrightCrawlingContext):
        page = context.page
        category_name = await page.eval_on_selector('.category-name', 'el => el.textContent')
        product_links = await page.evaluate(
            '() => Array.from(document.querySelectorAll(".product-link")).map(a => a.href)'
        )
        await context.push_data({
            'type': 'category',
            'url': page.url,
            'name': category_name,
            'product_count': len(product_links),
        })

        # Enqueue the product links found on this category page
        await context.add_requests([
            Request.from_url(link, label='PRODUCT') for link in product_links
        ])

    @crawler.router.default_handler
    async def default_handler(context: PlaywrightCrawlingContext):
        page = context.page
        title = await page.title()
        await context.push_data({
            'type': 'other',
            'url': page.url,
            'title': title,
        })

        # Enqueue product and category links with the matching labels,
        # then everything else (handled by this default handler)
        await context.enqueue_links(include=[Glob('**/product/**')], label='PRODUCT')
        await context.enqueue_links(include=[Glob('**/category/**')], label='CATEGORY')
        await context.enqueue_links(exclude=[Glob('**/product/**'), Glob('**/category/**')])

    start_url = 'https://example-store.com'
    await crawler.run([start_url])

    # Export data to JSON
    await crawler.export_data('./store_data.json')


if __name__ == '__main__':
    asyncio.run(main())
This example demonstrates:
- Using request labels with the router to handle different page types (products, categories, and others); the sketch after this list also shows pre-labelling the start requests.
- Extracting specific data for each page type.
- Enqueueing links selectively based on the page type.
- Handling unknown page types with a default handler.
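Because routing is driven by request labels, you can also start the crawl directly on known category or product URLs by labelling the start requests yourself (Request is already imported in the script above). A short sketch of an alternative crawler.run() call; the URLs are placeholders:

    # Pre-labelled start requests go straight to the matching handlers
    await crawler.run([
        Request.from_url('https://example-store.com/category/shoes', label='CATEGORY'),
        Request.from_url('https://example-store.com/product/blue-sneakers', label='PRODUCT'),
    ])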
Best Practices and Tips
Respect Robots.txt: Always check and respect the website's robots.txt file to ensure ethical scraping.
Use Delays: Implement reasonable delays between requests to avoid overwhelming the target server.
Handle Errors Gracefully: Implement try-except blocks to handle unexpected errors and ensure your crawler doesn't crash on a single failure.
Monitor Your Crawler: Use Crawlee's logging features to monitor your crawler's progress and identify any issues.
Optimize Storage: For large-scale crawls, consider using Crawlee's storage options to efficiently manage the data you collect.
Stay Updated: Keep your Crawlee installation up to date to benefit from the latest features and bug fixes.
Use Proxy Rotation: For extensive crawling, consider implementing proxy rotation to avoid IP bans (see the sketch after this list).
Implement User-Agents: Rotate user-agents to make your requests appear more natural.
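For the proxy rotation tip above, Crawlee ships a ProxyConfiguration helper that cycles through the proxy URLs you provide and can be passed straight to a crawler; a minimal sketch with placeholder proxy addresses:

from crawlee.playwright_crawler import PlaywrightCrawler
from crawlee.proxy_configuration import ProxyConfiguration

# Rotate between two (placeholder) proxy servers for all requests
proxy_configuration = ProxyConfiguration(
    proxy_urls=[
        'http://proxy-1.example.com:8000',
        'http://proxy-2.example.com:8000',
    ],
)

crawler = PlaywrightCrawler(
    proxy_configuration=proxy_configuration,
    headless=True,
)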
Conclusion
Crawlee-Python offers a powerful and flexible solution for web scraping and browser automation tasks. Its rich feature set, including support for both HTTP and headless browser crawling, automatic parallel processing, and robust error handling, makes it an excellent choice for both simple and complex scraping projects.
By following the examples and best practices outlined in this guide, you can leverage Crawlee-Python to build efficient, scalable, and maintainable web scrapers. Whether you're extracting data for analysis, monitoring websites, or building a search engine, Crawlee-Python provides the tools you need to accomplish your goals effectively.
Remember to always scrape responsibly, respecting website terms of service and ethical guidelines. Happy scraping!