Designing a Web Crawler System

Question

How would you design a high-performance web crawler in Rust?

Answer

Architecture

Core Implementation

use reqwest::Client;
use scraper::{Html, Selector};
use std::collections::HashSet;
use std::sync::Arc;
use tokio::sync::{mpsc, Mutex, Semaphore};

pub struct Crawler {
    client: Client,
    visited: Arc<Mutex<HashSet<String>>>,
    concurrency: Arc<Semaphore>,
}

impl Crawler {
    pub fn new(max_concurrent: usize) -> Self {
        Self {
            client: Client::builder()
                .timeout(std::time::Duration::from_secs(10))
                .user_agent("RustCrawler/1.0")
                .build()
                .unwrap(),
            visited: Arc::new(Mutex::new(HashSet::new())),
            concurrency: Arc::new(Semaphore::new(max_concurrent)),
        }
    }

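    /// Fetch a single page, treating non-2xx responses as errors.
    async fn fetch(&self, url: &str) -> Result<String, reqwest::Error> {
        // Hold a permit for the duration of the request to cap concurrency.
        let _permit = self.concurrency.acquire().await.expect("semaphore closed");
        let resp = self.client.get(url).send().await?.error_for_status()?;
        resp.text().await
    }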

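    /// Extract absolute http(s) links from an HTML document.
    fn extract_links(html: &str, base_url: &str) -> Vec<String> {
        let document = Html::parse_document(html);
        let selector = Selector::parse("a[href]").unwrap();
        // Without a valid base URL, relative links cannot be resolved.
        let Ok(base) = reqwest::Url::parse(base_url) else {
            return Vec::new();
        };

        document.select(&selector)
            .filter_map(|el| el.value().attr("href"))
            // `join` resolves absolute, root-relative, and relative hrefs against the page URL.
            // Appending the href to the full page URL would produce wrong paths.
            .filter_map(|href| base.join(href).ok())
            .filter(|url| matches!(url.scheme(), "http" | "https"))
            .map(|url| url.to_string())
            .collect()
    }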

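    /// BFS crawl from `seed_urls`, stopping after `max_pages` successful fetches.
    /// Note: this loop fetches one page at a time; the semaphore only takes
    /// effect once fetches are spawned as concurrent tasks.
    pub async fn crawl(&self, seed_urls: Vec<String>, max_pages: usize) {
        let (tx, mut rx) = mpsc::channel::<String>(1000);

        // Enqueue the seed URLs. try_send never blocks; seeds beyond the queue capacity are dropped.
        for url in seed_urls {
            let _ = tx.try_send(url);
        }

        let mut count = 0;
        // try_recv returns Err once the queue is empty, which ends the crawl.
        // recv().await would hang here: `tx` is still alive, so the channel never closes.
        while let Ok(url) = rx.try_recv() {
            if count >= max_pages { break; }

            // Deduplicate: `insert` returns false if the URL was already visited.
            if !self.visited.lock().await.insert(url.clone()) {
                continue;
            }

            // Fetch the page
            match self.fetch(&url).await {
                Ok(html) => {
                    count += 1;
                    println!("[{}] Crawled: {}", count, url);

                    // Queue newly discovered links; they are dropped when the queue is full.
                    for link in Self::extract_links(&html, &url) {
                        let _ = tx.try_send(link);
                    }
                }
                Err(e) => eprintln!("Error crawling {}: {}", url, e),
            }
        }
    }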
}
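
The `visited` set compares raw URL strings, so `https://example.com/a` and `HTTPS://Example.com/a#top` would be fetched twice. A minimal sketch of a normalization step follows, using only the standard library. The helper name `normalize_url` is hypothetical and not part of the original code, and the sketch assumes absolute http(s) URLs without userinfo; a full URL parser handles more cases.

```rust
/// Normalize an absolute http(s) URL into a dedup key (hypothetical helper).
/// Returns None for other schemes or for strings that are not absolute URLs.
pub fn normalize_url(raw: &str) -> Option<String> {
    // Drop the fragment: it is never sent to the server.
    let without_fragment = raw.split('#').next().unwrap_or(raw).trim();

    // Split "scheme://rest"; only http and https are accepted.
    let (scheme, rest) = without_fragment.split_once("://")?;
    let scheme = scheme.to_ascii_lowercase();
    if scheme != "http" && scheme != "https" {
        return None;
    }

    // The authority (host and optional port) ends at the first '/' or '?'.
    let split_at = rest
        .find(|c: char| c == '/' || c == '?')
        .unwrap_or(rest.len());
    let (host, path) = rest.split_at(split_at);
    if host.is_empty() {
        return None;
    }
    // Host names are case-insensitive; paths and queries are not.
    let host = host.to_ascii_lowercase();

    // An empty path and "/" name the same resource.
    let path = if path.is_empty() {
        "/".to_string()
    } else if path.starts_with('?') {
        format!("/{}", path)
    } else {
        path.to_string()
    };

    Some(format!("{}://{}{}", scheme, host, path))
}
```

In `crawl`, the normalized string would be the value inserted into `visited`, while the original URL is still the one passed to `fetch`.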

Key Design Decisions

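Decision	Choice	Rationale
Concurrency control	`Semaphore`	Caps the number of in-flight requests
URL deduplication	`HashSet` / Bloom filter	`HashSet` at small scale, Bloom filter at large scale
HTML parsing	`scraper`	CSS selectors, similar to jQuery
HTTP client	`reqwest`	Async, connection pooling, automatic redirects
Scheduling	BFS (queue)	Breadth-first order visits pages level by level
Polite crawling	Delay + robots.txt	Respects each site's rules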
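
The core implementation does not yet include the delay listed under polite crawling. One possible shape for it is a per-host throttle that tells the caller how long to wait before the next request. This is a sketch under assumptions: the type `HostThrottle` and its method `reserve` are hypothetical names, the delay is a fixed minimum per host, and robots.txt handling is not covered.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Tracks the earliest time each host may be requested again (hypothetical type).
pub struct HostThrottle {
    min_delay: Duration,
    next_allowed: HashMap<String, Instant>,
}

impl HostThrottle {
    pub fn new(min_delay: Duration) -> Self {
        Self { min_delay, next_allowed: HashMap::new() }
    }

    /// Returns how long the caller should wait before requesting `host`
    /// and reserves the following slot for that host.
    pub fn reserve(&mut self, host: &str, now: Instant) -> Duration {
        // The request runs at the reserved slot if it lies in the future, otherwise right away.
        let slot = match self.next_allowed.get(host) {
            Some(&t) if t > now => t,
            _ => now,
        };
        // The next request to this host must come at least `min_delay` after this one.
        self.next_allowed.insert(host.to_string(), slot + self.min_delay);
        slot.duration_since(now)
    }
}
```

In an async crawler, the returned duration would typically be passed to `tokio::time::sleep` before calling `fetch`. Passing `now` as a parameter keeps the logic deterministic and easy to test.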

Common Interview Questions

Q1: How do you deduplicate URLs in a large-scale crawler?

Answer:

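Approach	Memory	False positives	Suitable scale
HashSet	High	None	Millions of URLs
Bloom filter	Very low	Yes (tunable)	Hundreds of millions of URLs
Redis Set	External storage	None	Distributed crawlers
RocksDB	Disk	None	Very large scale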

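For a Bloom filter, the bloomfilter crate is one option. Memory depends on the target false-positive rate: with an optimally sized filter, 1 billion URLs need about 1.2 GB at a 1% rate and about 2.4 GB at 0.01%.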
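
The memory figure follows from the standard sizing formulas m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions, for n items at false-positive rate p. The sketch below computes both and shows a minimal hand-rolled filter to illustrate the mechanism. It is not the API of the bloomfilter crate; the names `bloom_params`, `BloomFilter`, and `check_and_insert` are hypothetical, and the double-hashing scheme built on `DefaultHasher` is an assumption chosen for brevity.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Optimal bit count and hash count for `n` items (n > 0) at false-positive rate `p`.
pub fn bloom_params(n: u64, p: f64) -> (u64, u32) {
    let ln2 = std::f64::consts::LN_2;
    let m = (-(n as f64) * p.ln() / (ln2 * ln2)).ceil();
    let k = ((m / n as f64) * ln2).round().max(1.0);
    (m as u64, k as u32)
}

pub struct BloomFilter {
    bits: Vec<u64>,
    num_bits: u64,
    num_hashes: u32,
}

impl BloomFilter {
    pub fn new(expected_items: u64, fp_rate: f64) -> Self {
        let (num_bits, num_hashes) = bloom_params(expected_items, fp_rate);
        let words = ((num_bits + 63) / 64) as usize;
        Self { bits: vec![0; words], num_bits, num_hashes }
    }

    /// Two base hashes; probe i uses h1 + i * h2 (double hashing).
    fn base_hashes(item: &str) -> (u64, u64) {
        let mut a = DefaultHasher::new();
        item.hash(&mut a);
        let mut b = DefaultHasher::new();
        (item, 0x9e37_79b9_u32).hash(&mut b);
        // Force h2 to be odd so it is never zero.
        (a.finish(), b.finish() | 1)
    }

    /// Inserts `item`; returns true if it was possibly present already.
    /// A false return means the item was definitely new.
    pub fn check_and_insert(&mut self, item: &str) -> bool {
        let (h1, h2) = Self::base_hashes(item);
        let mut seen = true;
        for i in 0..self.num_hashes as u64 {
            let bit = h1.wrapping_add(i.wrapping_mul(h2)) % self.num_bits;
            let (word, mask) = ((bit / 64) as usize, 1u64 << (bit % 64));
            if self.bits[word] & mask == 0 {
                seen = false;
                self.bits[word] |= mask;
            }
        }
        seen
    }
}
```

For 1 billion URLs, `bloom_params` gives roughly 9.6 billion bits (about 1.2 GB) with 7 hash functions at p = 0.01, and roughly 19.2 billion bits (about 2.4 GB) with 13 hash functions at p = 0.0001. A false positive here means a URL is skipped as already visited, so the rate is a trade-off between memory and crawl coverage.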
