apache/lucene

Create a simple "real world" regexp benchmark [LUCENE-9986]

Open

#11,025 opened on Jun 2, 2021

Labels: good first issue, legacy-jira-priority:Major, type:enhancement

Description

For issues like #11022, where we are struggling to decide which low-level optimizations to make to our (complicated!) determinize method, it would really help to have a large, real-world corpus of regexps for evaluating the performance of our automata operations, such as the CPU and heap required to parse a regexp and determinize the resulting automaton.

Does anyone know of such an existing, hopefully compatibly licensed, corpus?

Probably we would add these benchmarks to luceneutil.
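
As a rough illustration of the shape such a benchmark might take, here is a minimal sketch of a timing harness over a list of regexps. Everything in it is an assumption made for illustration: the class name `RegexpBench`, the three sample regexps (not a real-world corpus), and the use of `java.util.regex.Pattern.compile` as a stand-in for the operation under test. A real luceneutil benchmark would instead plug in Lucene's regexp parsing and determinize step, and would also need to measure heap, which this sketch does not attempt.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;
import java.util.regex.Pattern;

/** Hypothetical harness: times one operation over every regexp in a corpus. */
final class RegexpBench {

    /** Runs op on regexp iters times and returns the best wall-clock time in nanoseconds. */
    static long bestNanos(String regexp, Consumer<String> op, int iters) {
        long best = Long.MAX_VALUE;
        for (int i = 0; i < iters; i++) {
            long start = System.nanoTime();
            op.accept(regexp);
            best = Math.min(best, System.nanoTime() - start);
        }
        return best;
    }

    /** Returns regexp -> best time, preserving corpus order. */
    static Map<String, Long> run(List<String> corpus, Consumer<String> op, int iters) {
        Map<String, Long> results = new LinkedHashMap<>();
        for (String regexp : corpus) {
            results.put(regexp, bestNanos(regexp, op, iters));
        }
        return results;
    }

    public static void main(String[] args) {
        // Placeholder corpus; the issue asks for a large, real-world one instead.
        List<String> corpus = List.of("(foo|bar)*baz", "[a-z]+@[a-z]+\\.com", "[0-9]{3}-[0-9]{4}");

        // Stand-in operation (assumption): the JDK regex compiler, NOT Lucene's automata code.
        Consumer<String> op = Pattern::compile;

        for (Map.Entry<String, Long> e : run(corpus, op, 20).entrySet()) {
            System.out.println(e.getValue() + " ns\t" + e.getKey());
        }
    }
}
```

Taking the best of several iterations is only one way to reduce JIT warm-up and GC noise. A more careful version would likely use a dedicated benchmarking framework and report per-regexp distributions rather than a single number.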


Migrated from LUCENE-9986 by Michael McCandless (@mikemccand)
