apache/pinot

IngestionJobSpec: includeFileNamePattern with Regex does not work as documented

Open

#10611 opened on Apr 14, 2023

Labels: batch import, documentation, good first issue

Description

The docs describe includeFileNamePattern and excludeFileNamePattern as follows:

Only Files matching this pattern will be included from inputDirURI. Both glob and regex patterns are supported. Examples: Use 'glob:*.avro' or 'regex:^.*\.(avro)$' to include all avro files one level deep in the inputDirURI. Alternatively, use 'glob:**/*.avro' to include all the avro files in inputDirURI as well as its subdirectories - bear in mind that, with this approach, the pattern needs to match the absolute path. You can use Glob tool or Regex Tool to test out your patterns.

A few issues here:

1️⃣ The documented example regex:^.*\.(avro)$ does not actually work. Running a job with this pattern produces an error like this:

Caused by: groovy.lang.GroovyRuntimeException: Failed to parse template script (your template may contain an error or be trying to use expressions not currently supported): startup failed:
SimpleTemplateScript1.groovy: 1: illegal string body character after dollar sign;
   solution: either escape a literal dollar sign "\$5" or bracket the value expression "${5}" @ line 1, column 10.
   out.print("""
            ^

1 error

I'm assuming this is because of the templating that was introduced in #5341 (also not documented). Job specs appear to have special handling for both $, which needs to be escaped as \$, and backslashes, which are automatically escaped to \\.

2️⃣ Related to the above, it's not clear how someone would write a single backslash character in their regex. For example, I think the regex .*\.parquet$ is impossible to use, because there is no apparent way to produce a single backslash: \ turns into \\ and \\ stays as \\. This can be worked around by using a character class and writing .*[.]parquet$, but it feels wrong.
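
To check that the character-class workaround really is equivalent, here is a minimal sketch that compares the two spellings with java.util.regex (assuming that is the engine in use, see point 3). The class name and the sample file names are hypothetical and only for illustration.

```java
import java.util.regex.Pattern;

class DotEscapeDemo {
    // In Java source the backslash must be doubled, so "\\." is the regex \.
    static final Pattern BACKSLASH = Pattern.compile(".*\\.parquet$");
    // Character-class spelling from the workaround; it needs no backslash at all.
    static final Pattern CHAR_CLASS = Pattern.compile(".*[.]parquet$");

    // True when both spellings agree on the given file name.
    static boolean sameResult(String name) {
        return BACKSLASH.matcher(name).matches() == CHAR_CLASS.matcher(name).matches();
    }

    public static void main(String[] args) {
        // Hypothetical file names: one real match, one near miss, one unrelated.
        String[] samples = {"part-00000.c000.gz.parquet", "part-00000_parquet", "notes.txt"};
        for (String s : samples) {
            System.out.println(s
                + " -> backslash=" + BACKSLASH.matcher(s).matches()
                + ", charClass=" + CHAR_CLASS.matcher(s).matches());
        }
    }
}
```

Both patterns require a literal dot before "parquet", so the workaround loses nothing in matching power; the only cost is readability.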

3️⃣ What flavor of regex is actually being used here? regextester.com, which the documentation links to, only supports PCRE and JavaScript regex. However, I suspect this is really Java regex, which has different syntax. Given that the code uses PathMatcher, it's Java regex. Pinot should link to a regex tester that is accurate for that flavor.
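
Until the docs link to a suitable tester, patterns can be tried locally against the JDK's own PathMatcher. The sketch below is a small harness, assuming the default file system's matcher behaves like the one Pinot obtains; the class name, helper, and file names are hypothetical. It uses character-class intersection (`&&`), a java.util.regex feature that testers for other flavors may interpret differently, to show which engine is behind the `regex:` prefix.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

class PathMatcherFlavorDemo {
    // Builds a matcher from a "glob:..." or "regex:..." string and tests one path.
    static boolean matches(String syntaxAndPattern, String path) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher(syntaxAndPattern);
        return matcher.matches(Paths.get(path));
    }

    public static void main(String[] args) {
        // Plain regex, written with a character class instead of a backslash.
        System.out.println(matches("regex:^.*[.](avro)$", "data.avro"));
        // Class intersection: lowercase letters except 'd' (java.util.regex syntax).
        System.out.println(matches("regex:[a-z&&[^d]]+[.]avro", "events.avro"));
        System.out.println(matches("regex:[a-z&&[^d]]+[.]avro", "data.avro"));
        // Glob: '*' stays within one path segment, '**' crosses directories.
        System.out.println(matches("glob:*.avro", "data.avro"));
        System.out.println(matches("glob:*.avro", "dir/data.avro"));
        System.out.println(matches("glob:**/*.avro", "dir/data.avro"));
    }
}
```

Note that the regex must match the whole path string, not just a part of it, because the matcher applies the pattern to the full path.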

4️⃣ Can you provide some examples of the absolute path I should be matching against? I've submitted an ingestion job spec that has includeFileNamePattern: regex:^s3://redactedCompanyName/metrics_rollup_dev/redactedTableName/v/4/ds=(2023-03-02)/.*[.]parquet$

I have an s3 file with the following name at the path: s3://redactedCompanyName/metrics_rollup_dev/redactedTableName/v/4/ds=2023-03-02/part-00000-d60ed2b8-30cd-4e7c-82e0-309f854991f5.c000.gz.parquet

According to regex101.com, this is a match using Java 8 syntax: https://regex101.com/r/9ZKOhm/1

It's unclear to me what I'm doing wrong that's causing this pattern to not match.
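
One possible explanation, not confirmed against Pinot's code: a PathMatcher tests a java.nio.file.Path, not a raw URI string, and on Unix-like systems converting a string to a Path collapses repeated slashes. The sketch below uses shortened, hypothetical stand-ins for the redacted pattern and file name, and it assumes the full URI string is turned into a Path before matching, which may not be what Pinot does.

```java
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.regex.Pattern;

class UriVersusPathDemo {
    // Hypothetical, shortened versions of the pattern and file name in this issue.
    static final String REGEX = "^s3://bucket/table/ds=(2023-03-02)/.*[.]parquet$";
    static final String FILE = "s3://bucket/table/ds=2023-03-02/part-00000.gz.parquet";

    // What an online tester does: match the regex against the raw string.
    static boolean matchesAsString() {
        return Pattern.compile(REGEX).matcher(FILE).matches();
    }

    // Assumption: the URI string is converted to a Path first. On Unix-like
    // systems "s3://bucket" becomes "s3:/bucket". On Windows, Paths.get is
    // expected to reject this string because of the ':' character.
    static boolean matchesAsPath() {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("regex:" + REGEX);
        Path path = Paths.get(FILE);
        return matcher.matches(path);
    }

    public static void main(String[] args) {
        System.out.println("as string: " + matchesAsString());
        System.out.println("path form: " + Paths.get(FILE));
        System.out.println("as path:   " + matchesAsPath());
    }
}
```

If something like this is happening, a pattern anchored on `^s3://` can never match. Printing the exact string the job compares against would settle the question.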
