
Designing a Web Crawler System

Question

How do you design an efficient web crawler system in Python? What is Scrapy's core architecture?

Answer

Scrapy Architecture

Scrapy is driven by an Engine that coordinates four main components: the Scheduler (a request queue with built-in dedup), the Downloader (fetches pages), Spiders (parse responses and yield items or follow-up requests), and Item Pipelines (clean and persist items). Downloader middlewares and spider middlewares hook into the request/response flow on either side of the Engine.

Scrapy Spider Example

spiders/product_spider.py
import scrapy
from items import ProductItem


class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://example.com/products?page=1"]

    # Per-spider settings override
    custom_settings = {
        "CONCURRENT_REQUESTS": 16,
        "DOWNLOAD_DELAY": 0.5,
        "RETRY_TIMES": 3,
    }

    def parse(self, response):
        # Parse the product listing page
        for card in response.css("div.product-card"):
            item = ProductItem()
            item["name"] = card.css("h3::text").get()
            item["price"] = card.css("span.price::text").get()
            detail_url = card.css("a::attr(href)").get()
            # Follow the link to the detail page
            yield response.follow(detail_url, self.parse_detail, meta={"item": item})

        # Pagination
        next_page = response.css("a.next-page::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)

    def parse_detail(self, response):
        item = response.meta["item"]
        item["description"] = response.css("div.detail::text").get()
        yield item

Asynchronous Crawler (aiohttp)

async_crawler.py
import asyncio
import aiohttp
from bs4 import BeautifulSoup
from urllib.parse import urljoin


class AsyncCrawler:
    def __init__(self, max_concurrent: int = 10):
        self.semaphore = asyncio.Semaphore(max_concurrent)
        self.visited: set[str] = set()
        self.results: list[dict] = []

    async def fetch(self, session: aiohttp.ClientSession, url: str) -> str | None:
        async with self.semaphore:
            try:
                async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as resp:
                    if resp.status == 200:
                        return await resp.text()
                    return None
            except (aiohttp.ClientError, asyncio.TimeoutError):
                return None

    async def crawl(self, start_url: str, max_pages: int = 100):
        async with aiohttp.ClientSession() as session:
            queue: asyncio.Queue[str] = asyncio.Queue()
            await queue.put(start_url)

            while not queue.empty() and len(self.visited) < max_pages:
                url = await queue.get()
                if url in self.visited:
                    continue
                self.visited.add(url)

                html = await self.fetch(session, url)
                if html:
                    self.parse(html, url, queue)

    def parse(self, html: str, base_url: str, queue: asyncio.Queue):
        # Parse with BeautifulSoup/lxml and enqueue newly discovered links
        soup = BeautifulSoup(html, "lxml")
        for link in soup.find_all("a", href=True):
            abs_url = urljoin(base_url, link["href"])
            if abs_url not in self.visited:
                queue.put_nowait(abs_url)

Countering Anti-Scraping Measures

middlewares.py
import random


class RandomUserAgentMiddleware:
    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ...",
    ]

    def process_request(self, request, spider):
        # Rotate the User-Agent on every request
        request.headers["User-Agent"] = random.choice(self.USER_AGENTS)


class ProxyMiddleware:
    def process_request(self, request, spider):
        proxy = get_proxy_from_pool()  # fetch from the proxy pool (implementation not shown)
        request.meta["proxy"] = f"http://{proxy}"
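For the middlewares above to take effect they must be registered in settings.py. A sketch of that registration (the priority numbers are illustrative, and Scrapy's default User-Agent middleware is disabled so the custom one wins):

```python
# settings.py -- register the custom downloader middlewares
# (priority numbers are illustrative; lower runs closer to the engine)
DOWNLOADER_MIDDLEWARES = {
    "middlewares.RandomUserAgentMiddleware": 400,
    "middlewares.ProxyMiddleware": 410,
    # Disable Scrapy's built-in User-Agent middleware
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
}
```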

Common Interview Questions

Q1: How do you handle JavaScript-rendered pages?

Answer

  1. Splash: a lightweight browser-rendering service, integrated via Scrapy-Splash
  2. Playwright/Selenium: drive a headless browser to execute the JS
  3. Reverse-engineer the API: inspect the page's XHR/Fetch requests and call the backend API directly (preferred)
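Option 3 often reduces to a plain JSON request once the endpoint is found in the browser's Network panel. A sketch with a hypothetical endpoint and payload shape (the URL, query parameters, and field names here are all assumptions for illustration):

```python
import json
from urllib.parse import urlencode

# Hypothetical XHR endpoint discovered via the browser's Network panel
API = "https://example.com/api/products"


def build_api_url(page: int, page_size: int = 20) -> str:
    # Reproduce the query string the front-end sends
    return f"{API}?{urlencode({'page': page, 'size': page_size})}"


def parse_api_response(body: str) -> list[dict]:
    # The JSON payload is already structured -- no HTML parsing needed
    data = json.loads(body)
    return [{"name": p["name"], "price": p["price"]} for p in data["items"]]


# Example with a captured sample payload
sample = '{"items": [{"name": "Widget", "price": "9.99"}]}'
print(build_api_url(2))            # https://example.com/api/products?page=2&size=20
print(parse_api_response(sample))  # [{'name': 'Widget', 'price': '9.99'}]
```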

Q2: What deduplication strategies do crawlers use?

Answer

  • URL dedup: Bloom filter (small memory footprint, tolerates a small false-positive rate)
  • Content dedup: SimHash / MinHash to measure page similarity
  • Scrapy's built-in RFPDupeFilter uses request fingerprints (hashed URLs)
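The Bloom filter idea above can be sketched in a few lines of pure Python: k salted hashes map each URL to k bit positions, so membership tests never give false negatives and only rarely give false positives.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter sketch for URL dedup: small memory,
    no false negatives, tunably small false-positive rate."""

    def __init__(self, size_bits: int = 1 << 20, num_hashes: int = 5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, url: str):
        # Derive k bit positions by salting the hash input
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{url}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, url: str) -> None:
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(url))


bf = BloomFilter()
bf.add("https://example.com/a")
print("https://example.com/a" in bf)  # True
```

A production crawler would size the filter from the expected URL count and target error rate, or use a library such as pybloom-live.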

Q3: How do you implement a distributed crawler?

Answer

  • Scrapy-Redis: use Redis as a shared scheduling queue consumed by multiple workers
  • URL sharding: hash by domain to assign URLs to different workers
  • Shared dedup: a Bloom filter stored in Redis
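Domain-hash sharding can be sketched with the standard library: every URL of a given domain lands on the same worker, which keeps per-domain politeness and connection reuse in one place.

```python
import hashlib
from urllib.parse import urlparse


def shard_for_url(url: str, num_workers: int) -> int:
    """Assign a URL to a worker by hashing its domain, so all URLs of
    one domain are owned by the same worker."""
    domain = urlparse(url).netloc
    digest = hashlib.sha1(domain.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_workers


# Both a.com URLs map to the same worker
same = shard_for_url("https://a.com/page/1", 4) == shard_for_url("https://a.com/page/2", 4)
print(same)  # True
```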

Q4: How do you avoid getting your IP banned?

Answer

  1. Lower the request rate (DOWNLOAD_DELAY)
  2. Randomize the User-Agent
  3. Rotate IPs through a proxy pool
  4. Maintain a cookie pool
  5. Respect robots.txt
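Point 5 needs no third-party code: the standard library's urllib.robotparser can check every URL before it is fetched. A minimal sketch with an inline (hypothetical) robots.txt:

```python
from urllib.robotparser import RobotFileParser

# Parse a hypothetical robots.txt and check URLs before fetching them
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("MyCrawler", "https://example.com/private/page"))  # False
print(rp.can_fetch("MyCrawler", "https://example.com/products"))      # True
```

In a real crawler you would call rp.set_url(".../robots.txt") and rp.read() once per domain, and cache the parser alongside the per-domain request queue.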

Related Links