Crawl4AI: A Guide to Open-Source Web Crawling and Data Extraction for LLMs

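Introduction

If you need to turn web pages into clean text for RAG, agents, or data pipelines, Crawl4AI is an open-source crawling and extraction framework worth a look. It provides an asynchronous crawler, Markdown generation, structured extraction, and a CLI, so getting from a fetched page to usable data takes only a few lines of code.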

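What Crawl4AI Can Do

Asynchronous crawling: fast fetching built on AsyncWebCrawler.
Configurable browser and crawl behavior: fine-tune both through BrowserConfig and CrawlerRunConfig.
HTML → Markdown: generates Markdown automatically, with support for content filtering and pruning.
Structured extraction: pull out structured data with CSS/XPath selectors or with an LLM.
CLI: fetch and export quickly from the command line with crwl.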

Installation

1) Basic installation (recommended)

pip install crawl4ai
playwright install
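
After running the two commands above, it can help to confirm that the packages are visible from the Python environment you plan to use. The following is a minimal sketch using only the standard library; `check_install` is a hypothetical helper name, not part of Crawl4AI. It only checks that the packages and the `crwl` entry point can be found, and does not verify that the Playwright browser binaries were downloaded.

```python
import importlib.util
import shutil


def check_install() -> dict:
    """Report which parts of the setup are visible from this environment."""
    return {
        # find_spec returns None when a top-level package is not installed.
        "crawl4ai": importlib.util.find_spec("crawl4ai") is not None,
        "playwright": importlib.util.find_spec("playwright") is not None,
        # shutil.which returns None when the command is not on PATH.
        "crwl": shutil.which("crwl") is not None,
    }


for name, found in check_install().items():
    print(f"{name}: {'found' if found else 'missing'}")
```

If any entry prints "missing", the most likely cause is that pip installed into a different interpreter or virtual environment than the one running the script.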

2) Optional extras

# Advanced clustering (PyTorch)
pip install "crawl4ai[torch]"

# Transformers / Hugging Face support
pip install "crawl4ai[transformer]"

# Everything
pip install "crawl4ai[all]"

If you installed the torch, transformer, or all extras, you can optionally download the models once:

crawl4ai-download-models

Quick start: crawl a page and print Markdown

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    # The context manager starts the browser and closes it on exit.
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com")
        # Print the first 300 characters of the generated Markdown.
        print(result.markdown[:300])

if __name__ == "__main__":
    asyncio.run(main())

Basic configuration: BrowserConfig / CrawlerRunConfig

import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def main():
    # Browser-level settings: run without a visible window.
    browser_conf = BrowserConfig(headless=True)
    # Per-crawl settings: skip the cache and always fetch fresh content.
    run_conf = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)

    async with AsyncWebCrawler(config=browser_conf) as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=run_conf,
        )
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())

Tip: to enable caching, change CacheMode.BYPASS to CacheMode.ENABLED.
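
The example above prints result.markdown without checking whether the crawl succeeded. Below is a minimal sketch of a guard, assuming the result object exposes `success` and `error_message` attributes alongside `url` and `markdown` (check the documentation for your installed version). `describe_result` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins so the sketch runs without a browser or network access.

```python
from types import SimpleNamespace


def describe_result(result) -> str:
    """Return a one-line summary of a crawl result (duck-typed)."""
    if not getattr(result, "success", False):
        reason = getattr(result, "error_message", None) or "unknown error"
        return f"FAILED {result.url}: {reason}"
    # str() also covers result types that wrap the Markdown in an object.
    text = str(result.markdown or "")
    return f"OK {result.url}: {len(text)} characters of Markdown"


# Stand-in results; in real use, pass the object returned by crawler.arun().
ok = SimpleNamespace(
    success=True,
    url="https://example.com",
    markdown="# Example Domain",
    error_message=None,
)
bad = SimpleNamespace(
    success=False,
    url="https://example.com/missing",
    markdown=None,
    error_message="HTTP 404",
)

print(describe_result(ok))
print(describe_result(bad))
```

Logging a summary like this for every URL makes it easier to spot failed pages before their empty output reaches a downstream pipeline.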

Markdown generation and content pruning

Combining DefaultMarkdownGenerator with PruningContentFilter yields cleaner body text:

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

# Prune low-value blocks using a fixed score threshold.
md_generator = DefaultMarkdownGenerator(
    content_filter=PruningContentFilter(threshold=0.4, threshold_type="fixed")
)

config = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,
    markdown_generator=md_generator,
)

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://news.ycombinator.com", config=config)
        # raw_markdown is the full conversion; fit_markdown is the pruned version.
        print("Raw:", len(result.markdown.raw_markdown))
        print("Fit:", len(result.markdown.fit_markdown))

if __name__ == "__main__":
    asyncio.run(main())
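
When tuning the pruning threshold, it is convenient to express the two lengths printed above as a ratio. The following is a minimal sketch; `pruning_report` is a hypothetical helper that takes two plain strings, and the sample strings are made up for illustration.

```python
def pruning_report(raw_markdown: str, fit_markdown: str) -> dict:
    """Compare the raw and pruned Markdown by character count."""
    raw_len = len(raw_markdown)
    fit_len = len(fit_markdown)
    # Guard against an empty page to avoid dividing by zero.
    kept = fit_len / raw_len if raw_len else 0.0
    return {
        "raw_chars": raw_len,
        "fit_chars": fit_len,
        "kept_ratio": round(kept, 2),
    }


# Made-up sample text standing in for raw_markdown and fit_markdown.
raw = "Nav | Login | Sign up\n\n# Title\n\nBody text.\n\nFooter links"
fit = "# Title\n\nBody text."
print(pruning_report(raw, fit))
```

A kept ratio close to 1 suggests the filter removed almost nothing, while a ratio near 0 may mean the threshold is too aggressive for that page.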

Structured extraction: JsonCssExtractionStrategy

This works well for pages with a regular structure, such as listing pages and product pages:

import asyncio
import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai import JsonCssExtractionStrategy

# baseSelector matches each repeated item; fields are read relative to it.
schema = {
    "name": "Example Items",
    "baseSelector": "div.item",
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"},
    ],
}

raw_html = "<div class='item'><h2>Item 1</h2><a href='https://example.com/item1'>Link 1</a></div>"

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            # The raw:// prefix passes inline HTML instead of fetching a URL.
            url="raw://" + raw_html,
            config=CrawlerRunConfig(
                cache_mode=CacheMode.BYPASS,
                extraction_strategy=JsonCssExtractionStrategy(schema),
            ),
        )
        # extracted_content is a JSON string; parse it before use.
        data = json.loads(result.extracted_content)
        print(data)

if __name__ == "__main__":
    asyncio.run(main())
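
On real listing pages, extracted links are often relative and items may repeat. Below is a minimal post-processing sketch, assuming extracted_content parses to a JSON array of objects with the `title` and `link` fields from the schema above. `normalize_items` is a hypothetical helper, and the sample data is made up.

```python
import json
from urllib.parse import urljoin


def normalize_items(extracted_content: str, base_url: str) -> list:
    """Parse extracted JSON, make links absolute, and drop duplicates."""
    items = json.loads(extracted_content)
    seen = set()
    cleaned = []
    for item in items:
        link = item.get("link")
        if not link:
            continue  # skip items with no link
        absolute = urljoin(base_url, link)
        if absolute in seen:
            continue  # skip repeated links
        seen.add(absolute)
        title = (item.get("title") or "").strip()
        cleaned.append({"title": title, "link": absolute})
    return cleaned


# Made-up sample standing in for result.extracted_content.
sample = json.dumps([
    {"title": " Item 1 ", "link": "/item1"},
    {"title": "Item 1 again", "link": "https://example.com/item1"},
    {"title": "No link"},
])
print(normalize_items(sample, "https://example.com/list"))
```

Keeping this step separate from the schema means the selectors stay simple, and the cleanup rules can change without re-crawling.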

CLI usage (crwl)

# Basic crawl
crwl https://example.com

# Markdown output
crwl https://example.com -o markdown

# JSON output, verbose, with the cache bypassed
crwl https://example.com -o json -v --bypass-cache

# Use configuration files
crwl https://example.com -B browser.yml -C crawler.yml

# Question-based extraction
crwl https://example.com -q "What is the main point of this article?"

# More examples
crwl --example
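
If you want to call the CLI from a script rather than a shell, the commands above can be assembled programmatically. The following is a minimal sketch that uses only the flags shown above (-o, -v, --bypass-cache). `build_crwl_command` and `run_crwl` are hypothetical helpers, and the assumption that crwl writes its result to standard output should be checked against your installed version.

```python
import shutil
import subprocess


def build_crwl_command(url, output=None, bypass_cache=False, verbose=False):
    """Build the argument list for a crwl invocation."""
    cmd = ["crwl", url]
    if output:
        cmd += ["-o", output]
    if verbose:
        cmd.append("-v")
    if bypass_cache:
        cmd.append("--bypass-cache")
    return cmd


def run_crwl(url, **options):
    """Run crwl and return its standard output (assumed to hold the result)."""
    if shutil.which("crwl") is None:
        raise RuntimeError("crwl not found on PATH; install crawl4ai first")
    completed = subprocess.run(
        build_crwl_command(url, **options),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return completed.stdout


# Only builds and prints the command; nothing is executed here.
print(build_crwl_command("https://example.com", output="json",
                         bypass_cache=True, verbose=True))
```

Passing the arguments as a list rather than a single string avoids shell quoting problems with URLs that contain characters such as & or ?.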

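Where does it fit?

Building RAG knowledge bases and automating documentation crawls
Monitoring news, announcements, and blog updates
Bulk extraction of structured data from listing pages (products, job postings, papers, and so on)
Combining with an LLM to understand complex page content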

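Usage tips

Respect each site's robots.txt and terms of service, and keep your request rate reasonable.
For dynamic pages, add a suitable delay or enable full rendering.
Prefer CSS/XPath for structured data; fall back to LLM extraction only for complex pages.
Enable caching for repeated crawl jobs to save cost.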
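
The first tip can be sketched with Python's standard urllib.robotparser module, independent of Crawl4AI. In this minimal example the robots.txt text, the `my-crawler` user agent, and the helper names are all made up for illustration; in practice you would load the rules from the target site before crawling.

```python
from urllib.robotparser import RobotFileParser


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_delay(parser: RobotFileParser, agent: str, default: float = 1.0) -> float:
    """Use the site's Crawl-delay if it declares one, else a default pause."""
    delay = parser.crawl_delay(agent)
    return float(delay) if delay is not None else default


def allowed_urls(parser: RobotFileParser, agent: str, urls: list) -> list:
    """Keep only the URLs that the rules allow this agent to fetch."""
    return [url for url in urls if parser.can_fetch(agent, url)]


# Made-up rules for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

AGENT = "my-crawler"
rules = load_rules(ROBOTS_TXT)
candidates = [
    "https://example.com/blog/post",
    "https://example.com/private/report",
]
print("allowed:", allowed_urls(rules, AGENT, candidates))
print("pause between requests (seconds):", polite_delay(rules, AGENT))
```

In a crawl loop, you would filter the URL list first and then sleep for the returned delay between requests to the same host.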

References

Official documentation: https://docs.crawl4ai.com/
Installation guide: https://docs.crawl4ai.com/basic/installation/
Quick start: https://docs.crawl4ai.com/core/quickstart/
CLI guide: https://docs.crawl4ai.com/core/cli/
GitHub: https://github.com/unclecode/crawl4ai