quickwit-oss/tantivy

Document Score

Open

#703 opened on Nov 14, 2019

Labels: documentation, good first issue

Description

I have been asked about the scoring algorithm that tantivy uses and realised that neither I nor the documentation has a canonical description of it, apart from:

The larger the number, the more relevant the document to the search

https://docs.rs/tantivy/0.10.3/tantivy/type.Score.html

It would be great to add more information and run through an example query on an index, to show why a query returns its results in a particular order and how a user might debug specific queries.

Who do we expect to read this?

People building a full-text search engine are interested in efficiently storing and ranking documents against queries. The score of each document is arguably THE most important data type that we return to users in every query. I expect most users of tantivy will want to read about the Score type at one point or another.

Two types of users:

  1. Someone knowledgeable about building search engines who wants to confirm the validity of tantivy's scoring algorithm. They expect to see tf-idf, BM25 and other known algorithms.
  2. Someone for whom tantivy might be their first experience building a search application, with little background in document scoring. They want answers to specific questions and some further reading material.

Questions these users want to answer:

  • Why are search results in this order? What is this score field? Why is it a float?
  • How does each subquery in the full query (e.g. q: "title:president AND (body:Obama OR body:barack) AND year:<2008") contribute to the final score of a document?
  • I want to boost a specific document, or I expected it to rank higher in the results for a given query. How do I do that?
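To make the last two questions concrete, here is a minimal sketch of per-term scoring and score combination. It assumes textbook Okapi BM25 with the common defaults k1 = 1.2 and b = 0.75, and it assumes that an OR query sums the scores of the clauses that matched. The function names, the index statistics and the modelling of a boost as a plain multiplier are all illustrative assumptions, not a description of tantivy's internals.

```rust
/// Textbook Okapi BM25 score of one term in one document (illustrative only).
fn bm25_term_score(
    term_freq: f32,      // occurrences of the term in this document's field
    doc_len: f32,        // number of tokens in this document's field
    avg_doc_len: f32,    // average field length across the index
    num_docs: f32,       // total number of documents in the index
    docs_with_term: f32, // number of documents containing the term
) -> f32 {
    const K1: f32 = 1.2; // assumed default
    const B: f32 = 0.75; // assumed default
    // Rare terms get a larger inverse document frequency.
    let idf = (1.0 + (num_docs - docs_with_term + 0.5) / (docs_with_term + 0.5)).ln();
    // Term frequency saturates, and longer-than-average documents are penalised.
    let norm = K1 * (1.0 - B + B * doc_len / avg_doc_len);
    idf * term_freq * (K1 + 1.0) / (term_freq + norm)
}

/// Hypothetical helper: an OR query sums the scores of the clauses that matched.
/// `None` means the clause did not match the document.
fn or_score(clause_scores: &[Option<f32>]) -> f32 {
    clause_scores.iter().flatten().sum()
}

fn main() {
    // Made-up index statistics: 1,000 documents, average body length 100 tokens.
    let obama = bm25_term_score(3.0, 100.0, 100.0, 1000.0, 10.0);
    let barack = bm25_term_score(1.0, 100.0, 100.0, 1000.0, 5.0);

    // doc_a matches both body terms, doc_b matches only "obama".
    let doc_a = or_score(&[Some(obama), Some(barack)]);
    let doc_b = or_score(&[Some(obama), None]);
    println!("doc_a = {:.3}, doc_b = {:.3}", doc_a, doc_b);

    // A boost is modelled here as a plain multiplier on one clause's score.
    let doc_b_boosted = or_score(&[Some(3.0 * obama), None]);
    println!("doc_b with body:obama boosted x3 = {:.3}", doc_b_boosted);
}
```

The score is a float because it is a product and sum of real-valued statistics such as the logarithmic idf. Only the relative order of scores within one query is meaningful.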

Suggested style of documentation

Prose: A detailed high-level explanation of document scoring: how each query is scored and how the scores of different sub-queries are combined.

Code: A doc-test (it doesn't need to assert or test anything) that walks through an example of debugging an unexpectedly low-ranking document, using Query::explain and showing how the example query can be rewritten.
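As a rough idea of what such a walkthrough could show, the sketch below models a score breakdown as a tree and picks out the clause that contributes least. The `Explain` struct, its methods and all of the numbers are hypothetical stand-ins invented for this illustration. The output of tantivy's own Query::explain may be shaped differently, so treat this as the shape of the debugging step rather than the real API.

```rust
/// Hypothetical stand-in for a score explanation: a value plus the parts it came from.
struct Explain {
    description: String,
    value: f32,
    details: Vec<Explain>,
}

impl Explain {
    fn leaf(description: &str, value: f32) -> Self {
        Explain { description: description.to_string(), value, details: Vec::new() }
    }

    /// A node whose value is the sum of its children's values.
    fn sum(description: &str, details: Vec<Explain>) -> Self {
        let value: f32 = details.iter().map(|d| d.value).sum();
        Explain { description: description.to_string(), value, details }
    }

    /// The direct child that contributes least to this node's score.
    fn weakest(&self) -> Option<&Explain> {
        self.details
            .iter()
            .min_by(|a, b| a.value.partial_cmp(&b.value).unwrap())
    }

    /// Render the tree, indenting each level by two spaces.
    fn render(&self, depth: usize, out: &mut String) {
        out.push_str(&format!("{}{:.2} {}\n", "  ".repeat(depth), self.value, self.description));
        for d in &self.details {
            d.render(depth + 1, out);
        }
    }
}

/// Made-up numbers for one document that ranks lower than expected.
fn low_ranking_doc() -> Explain {
    Explain::sum(
        "sum of matching clauses",
        vec![
            Explain::leaf("title:president", 2.0),
            Explain::leaf("body:obama", 0.25),
            Explain::leaf("body:barack (no match)", 0.0),
        ],
    )
}

fn main() {
    let tree = low_ranking_doc();
    let mut out = String::new();
    tree.render(0, &mut out);
    print!("{}", out);
    if let Some(w) = tree.weakest() {
        println!("weakest clause: {}", w.description);
    }
}
```

Reading the breakdown this way points at the clause to rewrite: here the body clauses add almost nothing, so the document ranks on its title match alone.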

Provide further reading material

Link to the Wikipedia pages for tf-idf and BM25, and to the Query::explain method.

If you do this ticket, you will learn:

  • The full life-cycle of a tantivy query from query to score per document
  • tantivy helper methods for debugging such queries
  • writing concise, yet informative documentation for power-users and amateurs at the same time
