跳到主要内容

设计搜索引擎

问题

如何用 Rust 设计一个全文搜索引擎?

答案

倒排索引架构

简化的倒排索引实现

use std::collections::HashMap;

/// 倒排索引
pub struct InvertedIndex {
/// term → (doc_id, term_frequency)
index: HashMap<String, Vec<(usize, u32)>>,
/// doc_id → 原始文档
documents: Vec<String>,
}

impl InvertedIndex {
pub fn new() -> Self {
Self {
index: HashMap::new(),
documents: Vec::new(),
}
}

/// 索引一个文档
pub fn add_document(&mut self, content: &str) {
let doc_id = self.documents.len();
self.documents.push(content.to_string());

// 分词并统计词频
let mut term_freq: HashMap<String, u32> = HashMap::new();
for word in content.split_whitespace() {
let normalized = word.to_lowercase();
*term_freq.entry(normalized).or_default() += 1;
}

// 更新倒排索引
for (term, freq) in term_freq {
self.index.entry(term)
.or_default()
.push((doc_id, freq));
}
}

/// 搜索:返回匹配的文档 ID(按相关性排序)
pub fn search(&self, query: &str) -> Vec<(usize, f64)> {
let terms: Vec<String> = query.split_whitespace()
.map(|w| w.to_lowercase())
.collect();

let mut scores: HashMap<usize, f64> = HashMap::new();
let total_docs = self.documents.len() as f64;

for term in &terms {
if let Some(postings) = self.index.get(term) {
// IDF = log(N / df)
let idf = (total_docs / postings.len() as f64).ln();

for &(doc_id, tf) in postings {
// TF-IDF 评分
let tf_score = 1.0 + (tf as f64).ln();
*scores.entry(doc_id).or_default() += tf_score * idf;
}
}
}

let mut results: Vec<(usize, f64)> = scores.into_iter().collect();
results.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
results
}
}

Tantivy:Rust 版 Lucene

Tantivy 是 Rust 实现的全文搜索引擎库,相当于 Java 的 Lucene:

use tantivy::schema::*;
use tantivy::{Index, doc};

// 定义 Schema
let mut schema_builder = Schema::builder();
let title = schema_builder.add_text_field("title", TEXT | STORED);
let body = schema_builder.add_text_field("body", TEXT);
let schema = schema_builder.build();

// 创建索引
let index = Index::create_in_ram(schema);
let mut writer = index.writer(50_000_000)?; // 50MB 缓冲

writer.add_document(doc!(
title => "Rust Programming",
body => "Rust is a systems programming language"
))?;
writer.commit()?;

// 搜索
let reader = index.reader()?;
let searcher = reader.searcher();
let query_parser = tantivy::query::QueryParser::for_index(&index, vec![title, body]);
let query = query_parser.parse_query("rust programming")?;
let top_docs = searcher.search(&query, &tantivy::collector::TopDocs::with_limit(10))?;
维度ElasticsearchTantivy
语言Java (Lucene)Rust
部署独立服务嵌入式库
性能更高(无 GC)
分布式内置需要 Quickwit

常见面试问题

Q1: 倒排索引的空间优化?

答案

  1. 前缀压缩:相邻 term 共享前缀(FST: Finite State Transducer)
  2. 差值编码:Posting List 存差值而非绝对值 [1, 5, 8][1, 4, 3]
  3. VByte 编码:变长字节编码,小数字用更少字节
  4. Roaring Bitmap:高效的位图存储 Doc ID 集合

Tantivy 内置了这些优化。

相关链接