Document Score · quickwit-oss/tantivy#703

Repository metrics

Stars: 8,354
PR merge metrics: average merge time 12d 14h; 20 merged PRs in 30d

Description

I have been asked about the scoring algorithm that tantivy uses and realised that neither I nor the documentation has a canonical description of it apart from:

The larger the number, the more relevant the document to the search

https://docs.rs/tantivy/0.10.3/tantivy/type.Score.html

I think it would be great to add more information and walk through an example query on an index, to show why a query returns results in a given order and how a user might debug specific queries.
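As a starting point for such a walkthrough, here is a minimal, std-only sketch that ranks a toy corpus for a single term. It is not tantivy code: the `bm25` and `rank` functions, the whitespace tokenizer, and the constants `K1 = 1.2` and `B = 0.75` are assumptions chosen for illustration (they are commonly used BM25 defaults), and tantivy's actual implementation should be checked against its source.

```rust
/// Textbook BM25 score of one term in one document.
/// K1 and B are commonly used defaults, assumed here for illustration.
fn bm25(term_freq: f32, doc_len: f32, avg_doc_len: f32, doc_count: f32, doc_freq: f32) -> f32 {
    const K1: f32 = 1.2;
    const B: f32 = 0.75;
    // Rare terms get a larger inverse document frequency.
    let idf = (1.0 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5)).ln();
    // Documents longer than average are penalised through this norm.
    let norm = K1 * (1.0 - B + B * doc_len / avg_doc_len);
    idf * term_freq * (K1 + 1.0) / (term_freq + norm)
}

/// Rank a toy corpus for a single query term.
/// Returns (document index, score) pairs for matching documents, best first.
fn rank(docs: &[&str], term: &str) -> Vec<(usize, f32)> {
    // Hypothetical tokenizer: split on whitespace only.
    let tokenized: Vec<Vec<&str>> = docs
        .iter()
        .map(|d| d.split_whitespace().collect())
        .collect();
    let doc_count = docs.len() as f32;
    let avg_len = tokenized.iter().map(|t| t.len()).sum::<usize>() as f32 / doc_count;
    // Number of documents that contain the term at least once.
    let doc_freq = tokenized
        .iter()
        .filter(|toks| toks.iter().any(|tok| *tok == term))
        .count() as f32;

    let mut scored: Vec<(usize, f32)> = tokenized
        .iter()
        .enumerate()
        .filter_map(|(i, toks)| {
            let tf = toks.iter().filter(|tok| **tok == term).count() as f32;
            if tf > 0.0 {
                Some((i, bm25(tf, toks.len() as f32, avg_len, doc_count, doc_freq)))
            } else {
                None
            }
        })
        .collect();
    // Highest score first.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored
}

fn main() {
    let docs = [
        "obama speaks",
        "obama obama president visits the city today",
        "the weather is nice",
    ];
    for (doc, score) in rank(&docs, "obama") {
        println!("doc {doc} score {score:.4}");
    }
}
```

In this toy corpus the short document ranks above the longer one even though the longer one mentions the term twice, because the length normalisation outweighs the extra occurrence. This is the kind of ordering a user is likely to ask about.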

Who do we expect to read this?

People building a full-text search engine are interested in efficiently storing and ranking documents against queries. The score of each document is arguably THE most important data type that we return to users in every query. I expect most users of tantivy will want to read about the Score type at one point or another.

Two types of users:

Someone knowledgeable about building search engines who wants to confirm the validity of tantivy's scoring algorithm. They expect to see tf-idf, BM25 and other known
Someone with little background in document scoring, for whom tantivy might be the first experience of building a search application. They want answers to specific questions and some further reading material.

Questions these users want to answer:

Why are search results in this order? What is this score field? Why is it a float?
How does each subquery in the full query (e.g. q: "title:president AND (body:Obama OR body:barack) AND year:<2008") contribute to the final score of a document?
I want to boost a specific document, or I expected it to appear higher up in the results for a given query. How do I do that?
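One way to reason about the second and third questions is a small model of how clause scores combine. The sketch below is hypothetical and does not use tantivy's types: the `Clause` enum, the rule that matching clause scores are summed, and the constant score for the range filter are all assumptions made for illustration. The real behaviour should be confirmed with `Query::explain`.

```rust
/// Hypothetical model of a boolean query tree, for illustration only.
enum Clause {
    /// A leaf with an already computed score (for example a term's BM25 score),
    /// or `None` if the document does not match it.
    Leaf(Option<f32>),
    /// Every child must match; the children's scores are summed.
    And(Vec<Clause>),
    /// At least one child must match; the matching children's scores are summed.
    Or(Vec<Clause>),
    /// Multiplies the child's score by a constant factor.
    Boost(Box<Clause>, f32),
}

/// Score of one document for a clause, or `None` if the document does not match.
fn score(clause: &Clause) -> Option<f32> {
    match clause {
        Clause::Leaf(s) => *s,
        Clause::And(children) => {
            let mut total = 0.0;
            for child in children {
                // A single non-matching child rejects the document.
                total += score(child)?;
            }
            Some(total)
        }
        Clause::Or(children) => {
            let matched: Vec<f32> = children.iter().filter_map(score).collect();
            if matched.is_empty() {
                None
            } else {
                Some(matched.iter().sum::<f32>())
            }
        }
        Clause::Boost(inner, factor) => score(inner).map(|s| s * factor),
    }
}

/// Builds title:president AND (body:Obama OR body:barack) AND year:<2008
/// from per-leaf scores, with an optional boost on the title clause.
fn example_query(title: Option<f32>, obama: Option<f32>, barack: Option<f32>,
                 year: Option<f32>, title_boost: f32) -> Clause {
    Clause::And(vec![
        Clause::Boost(Box::new(Clause::Leaf(title)), title_boost),
        Clause::Or(vec![Clause::Leaf(obama), Clause::Leaf(barack)]),
        Clause::Leaf(year),
    ])
}

fn main() {
    // Made-up leaf scores: the document matches "president" and "Obama" but not "barack".
    // The range filter is assumed to contribute a constant 1.0.
    let plain = example_query(Some(1.5), Some(2.0), None, Some(1.0), 1.0);
    let boosted = example_query(Some(1.5), Some(2.0), None, Some(1.0), 2.0);
    println!("plain:   {:?}", score(&plain));
    println!("boosted: {:?}", score(&boosted));
}
```

Under these assumptions, boosting a clause raises the score of every document that matches it, so it changes the relative order rather than pinning one document to the top.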

Provide further reading material

Give links to the tf-idf and BM25 Wikipedia pages and to the Query::explain method.
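For reference, the textbook Okapi BM25 formula that the further reading describes is given below. The symbols are the usual ones and are not taken from tantivy's documentation: f(t, d) is the frequency of term t in document d, |d| is the document length, avgdl is the average document length, N is the number of documents, and n(t) is the number of documents containing t. The parameters k_1 and b are free; values around k_1 = 1.2 and b = 0.75 are commonly used, and the exact variant and constants in tantivy should be verified against its source.

```latex
\mathrm{score}(q, d) = \sum_{t \in q} \mathrm{idf}(t) \cdot
  \frac{f(t, d)\,(k_1 + 1)}
       {f(t, d) + k_1 \left(1 - b + b\,\frac{|d|}{\mathrm{avgdl}}\right)}

\mathrm{idf}(t) = \ln\!\left(1 + \frac{N - n(t) + 0.5}{n(t) + 0.5}\right)
```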

If you do this ticket, you will learn:

The full life cycle of a tantivy query, from the query to the score per document
The tantivy helper methods for debugging such queries
How to write concise yet informative documentation for power users and amateurs at the same time

Contributor guide

Research direction: Explain how tantivy computes document scores using BM25, how to use Query::explain, and provide a step by step example with links to further reading.
Tech stack: Rust
Domain: backend
Issue type: Documentation
Difficulty: 2
Estimated time: 1-3 hours
Activity status: Active
Clarity: Clear
Prerequisites: Rust
Newbie friendliness: 75