Crawlee-Python is a robust and versatile web scraping and browser automation library designed to help developers build reliable and efficient scrapers. As an open-source project, Crawlee-Python offers a comprehensive set of tools and features that simplify the process of crawling websites, extracting data, and managing the complexities of web scraping at scale.
Key Features of Crawlee-Python
Crawlee-Python stands out from other web scraping libraries due to its unique combination of features:
Unified Interface: Crawlee provides a consistent interface for both HTTP and headless browser crawling, allowing developers to switch between methods easily (see the short sketch after this list).
Automatic Parallel Crawling: The library optimizes resource utilization by automatically scaling crawling operations based on available system resources.
Type Hints: Written in Python with full type hint coverage, Crawlee enhances developer experience through improved IDE autocompletion and early bug detection.
Automatic Retries: Built-in mechanisms for handling errors and retrying requests when blocked by websites.
Proxy Rotation and Session Management: Integrated tools for managing proxies and sessions to improve scraping reliability.
Configurable Request Routing: Easily direct URLs to appropriate handlers for efficient data extraction.
Persistent Queue: A robust queue system for managing URLs to be crawled.
Pluggable Storage: Flexible options for storing both tabular data and files.
Headless Browser Support: Out-of-the-box support for headless browser crawling using Playwright.
Asyncio-Based: Built on Python's standard asyncio library for efficient asynchronous operations.
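To give a quick taste of that unified interface before we get to installation, here is a minimal sketch of an HTTP-based crawler built with BeautifulSoupCrawler; note how the router, handler, and push_data calls have the same shape as the Playwright examples later in this guide (the target URL is just a placeholder).

import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main():
    crawler = BeautifulSoupCrawler(max_requests_per_crawl=10)

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext):
        # context.soup is the parsed BeautifulSoup document for the fetched page
        await context.push_data({
            'url': context.request.url,
            'title': context.soup.title.string if context.soup.title else None,
        })

    await crawler.run(['https://example.com'])


if __name__ == '__main__':
    asyncio.run(main())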
Getting Started with Crawlee-Python
Let's walk through the process of setting up and using Crawlee-Python for your web scraping projects.
Installation
To get started with Crawlee-Python, you'll need Python 3.9 or higher. The easiest way to install Crawlee is using pip:
pip install crawlee
For additional features, you can install optional extras:
pip install 'crawlee[beautifulsoup]' # For BeautifulSoupCrawler
pip install 'crawlee[playwright]' # For PlaywrightCrawler
If you're using the PlaywrightCrawler, don't forget to install the Playwright browsers as well:
playwright install
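If you only need a single browser, you can limit the download to it; this is a standard Playwright CLI option rather than anything Crawlee-specific:
playwright install chromium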
Creating Your First Crawler
Let's create a simple crawler using the PlaywrightCrawler to scrape web page titles and content.
import asyncio

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext


async def main():
    crawler = PlaywrightCrawler(
        max_requests_per_crawl=50,  # Limit the crawl to 50 pages
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext):
        data = {
            'url': context.page.url,
            'title': await context.page.title(),
            'content': (await context.page.content())[:1000],  # First 1000 characters
        }
        await context.push_data(data)

    await crawler.run(['https://example.com'])

    # Export data to CSV
    await crawler.export_data('./result.csv')


if __name__ == '__main__':
    asyncio.run(main())
This script does the following:
- Creates a PlaywrightCrawler instance with a limit of 50 pages.
- Defines a request handler that extracts the page URL, title, and first 1000 characters of content.
- Runs the crawler starting from 'https://example.com'.
- Exports the collected data to a CSV file.
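Besides exporting to a file, you can read the stored items back in the same script. A minimal sketch of the last lines of main(), assuming your Crawlee version exposes the crawler.get_data() helper:

    await crawler.run(['https://example.com'])

    # Read the collected items back instead of (or in addition to) exporting them
    data = await crawler.get_data()
    for item in data.items:
        print(item['url'], item['title'])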
Advanced Usage: Crawling Multiple URLs
Let's expand our crawler to handle multiple starting URLs and implement more advanced features.
import asyncio

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext


async def main():
    crawler = PlaywrightCrawler(
        max_requests_per_crawl=100,
        headless=True,
        browser_type='chromium',
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext):
        page = context.page
        url = page.url

        # Extract title and main content
        title = await page.title()
        content = await page.evaluate('() => document.body.innerText')

        # Extract all links on the page
        links = await page.evaluate('() => Array.from(document.links).map(link => link.href)')

        # Store the data
        await context.push_data({
            'url': url,
            'title': title,
            'content': content[:500],  # First 500 characters
            'links_found': links,
        })

        # Enqueue found links for crawling
        await context.enqueue_links()

    # Start URLs
    start_urls = [
        'https://example.com',
        'https://another-example.com',
        'https://third-example.com',
    ]
    await crawler.run(start_urls)

    # Export data to JSON
    await crawler.export_data('./result.json')


if __name__ == '__main__':
    asyncio.run(main())
This enhanced script demonstrates:
- Configuring the crawler with more options (headless mode, browser type).
- Extracting multiple data points from each page (title, content, links).
- Using context.enqueue_links() to automatically add discovered links to the crawl queue (see the sketch after this list for restricting which links are followed).
- Starting the crawler with multiple URLs.
- Exporting data in JSON format.
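If you only want to follow a subset of the links on each page, enqueue_links also accepts a CSS selector. A short sketch of the call inside the request handler; the a.article-link selector is a placeholder for whatever markup your target site uses:

        # Only follow links rendered with the (placeholder) article-link class
        await context.enqueue_links(selector='a.article-link')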
Implementing Custom Logic: Filtering and Processing
Let's add some custom logic to filter pages and process data before storing:
import asyncio
import re

from crawlee import Glob
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext


async def main():
    crawler = PlaywrightCrawler(
        max_requests_per_crawl=200,
        headless=True,
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext):
        page = context.page
        url = page.url

        # Only process pages with 'blog' in the URL
        if 'blog' not in url:
            return

        title = await page.title()
        content = await page.evaluate('() => document.body.innerText')

        # Extract a date if one is present in the page source
        date_match = re.search(r'\d{4}-\d{2}-\d{2}', await page.content())
        date = date_match.group(0) if date_match else 'Unknown'

        # Process content: collapse whitespace and truncate
        processed_content = ' '.join(content.split())[:1000]

        await context.push_data({
            'url': url,
            'title': title,
            'date': date,
            'content_preview': processed_content,
        })

        # Only enqueue same-domain links whose URL matches the blog patterns
        await context.enqueue_links(
            include=[Glob('**/blog/**'), Glob('**/blog*')],
            strategy='same-domain',
        )

    start_url = 'https://example-blog.com'
    await crawler.run([start_url])

    # Export data to CSV
    await crawler.export_data('./blog_data.csv')


if __name__ == '__main__':
    asyncio.run(main())
This script showcases:
- URL filtering to only process blog pages.
- Custom data extraction (like finding dates in the content; see the sketch after this list for normalizing them).
- Content processing (removing extra whitespace and truncating).
- Selective link enqueueing using glob patterns and domain strategy.
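The regular expression above only captures the raw YYYY-MM-DD string. If you want a validated value in your dataset, you can normalize it with the standard library before calling push_data; a stdlib-only sketch (the helper name is ours, not part of Crawlee):

from datetime import datetime


def normalize_date(raw: str) -> str:
    """Return an ISO date string, or 'Unknown' if the matched text is not a real date."""
    try:
        return datetime.strptime(raw, '%Y-%m-%d').date().isoformat()
    except ValueError:
        return 'Unknown'


# normalize_date('2024-02-30') -> 'Unknown'; normalize_date('2024-07-05') -> '2024-07-05'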
Handling Different Page Types
In real-world scenarios, you might need to handle different types of pages differently. Here's how you can use Crawlee's router together with request labels to achieve this:
import asyncio

from crawlee import Glob, Request
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext


async def main():
    crawler = PlaywrightCrawler(
        max_requests_per_crawl=300,
        headless=True,
    )

    # Crawlee routes requests by label: product and category URLs are tagged
    # with a label when they are enqueued, and each label gets its own handler.
    @crawler.router.handler('PRODUCT')
    async def product_handler(context: PlaywrightCrawlingContext):
        page = context.page
        product_name = await page.eval_on_selector('.product-name', 'el => el.textContent')
        price = await page.eval_on_selector('.price', 'el => el.textContent')
        description = await page.eval_on_selector('.description', 'el => el.textContent')
        await context.push_data({
            'type': 'product',
            'url': page.url,
            'name': product_name,
            'price': price,
            'description': description,
        })

    @crawler.router.handler('CATEGORY')
    async def category_handler(context: PlaywrightCrawlingContext):
        page = context.page
        category_name = await page.eval_on_selector('.category-name', 'el => el.textContent')
        product_links = await page.evaluate(
            '() => Array.from(document.querySelectorAll(".product-link")).map(a => a.href)'
        )
        await context.push_data({
            'type': 'category',
            'url': page.url,
            'name': category_name,
            'product_count': len(product_links),
        })

        # Enqueue the product links found on this category page
        await context.add_requests([
            Request.from_url(link, label='PRODUCT') for link in product_links
        ])

    @crawler.router.default_handler
    async def default_handler(context: PlaywrightCrawlingContext):
        page = context.page
        title = await page.title()
        await context.push_data({
            'type': 'other',
            'url': page.url,
            'title': title,
        })

        # Enqueue product and category links with the matching labels,
        # then everything else (handled by this default handler)
        await context.enqueue_links(include=[Glob('**/product/**')], label='PRODUCT')
        await context.enqueue_links(include=[Glob('**/category/**')], label='CATEGORY')
        await context.enqueue_links(exclude=[Glob('**/product/**'), Glob('**/category/**')])

    start_url = 'https://example-store.com'
    await crawler.run([start_url])

    # Export data to JSON
    await crawler.export_data('./store_data.json')


if __name__ == '__main__':
    asyncio.run(main())
This example demonstrates:
- Using request labels with the router to handle different page types (products, categories, and others); the sketch after this list also shows pre-labelling the start requests.
- Extracting specific data for each page type.
- Enqueueing links selectively based on the page type.
- Handling unknown page types with a default handler.
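Because routing is driven by request labels, you can also start the crawl directly on known category or product URLs by labelling the start requests yourself (Request is already imported in the script above). A short sketch of an alternative crawler.run() call; the URLs are placeholders:

    # Pre-labelled start requests go straight to the matching handlers
    await crawler.run([
        Request.from_url('https://example-store.com/category/shoes', label='CATEGORY'),
        Request.from_url('https://example-store.com/product/blue-sneakers', label='PRODUCT'),
    ])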
Best Practices and Tips
Respect Robots.txt: Always check and respect the website's robots.txt file to ensure ethical scraping.
Use Delays: Implement reasonable delays between requests to avoid overwhelming the target server.
Handle Errors Gracefully: Implement try-except blocks to handle unexpected errors and ensure your crawler doesn't crash on a single failure.
Monitor Your Crawler: Use Crawlee's logging features to monitor your crawler's progress and identify any issues.
Optimize Storage: For large-scale crawls, consider using Crawlee's storage options to efficiently manage the data you collect.
Stay Updated: Keep your Crawlee installation up to date to benefit from the latest features and bug fixes.
Use Proxy Rotation: For extensive crawling, consider implementing proxy rotation to avoid IP bans (see the sketch after this list).
Implement User-Agents: Rotate user-agents to make your requests appear more natural.
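For the proxy rotation tip above, Crawlee ships a ProxyConfiguration helper that cycles through the proxy URLs you provide and can be passed straight to a crawler; a minimal sketch with placeholder proxy addresses:

from crawlee.playwright_crawler import PlaywrightCrawler
from crawlee.proxy_configuration import ProxyConfiguration

# Rotate between two (placeholder) proxy servers for all requests
proxy_configuration = ProxyConfiguration(
    proxy_urls=[
        'http://proxy-1.example.com:8000',
        'http://proxy-2.example.com:8000',
    ],
)

crawler = PlaywrightCrawler(
    proxy_configuration=proxy_configuration,
    headless=True,
)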
Conclusion
Crawlee-Python offers a powerful and flexible solution for web scraping and browser automation tasks. Its rich feature set, including support for both HTTP and headless browser crawling, automatic parallel processing, and robust error handling, makes it an excellent choice for both simple and complex scraping projects.
By following the examples and best practices outlined in this guide, you can leverage Crawlee-Python to build efficient, scalable, and maintainable web scrapers. Whether you're extracting data for analysis, monitoring websites, or building a search engine, Crawlee-Python provides the tools you need to accomplish your goals effectively.
Remember to always scrape responsibly, respecting website terms of service and ethical guidelines. Happy scraping!