IngestionJobSpec: includeFileNamePattern with Regex does not work as documented
#10611 opened on Apr 14, 2023
Description
The docs describe includeFileNamePattern and excludeFileNamePattern as follows:
Only Files matching this pattern will be included from inputDirURI. Both glob and regex patterns are supported. Examples: Use 'glob:.avro'or 'regex:^..(avro)$' to include all avro files one level deep in the inputDirURI. Alternatively, use 'glob:*/.avro' to include all the avro files in inputDirURI as well as its subdirectories - bear in mind that, with this approach, the pattern needs to match the absolute path. You can use Glob tool or Regex Tool to test out your patterns.
A few issues here:
1️⃣ The example regex:^..(avro)$ does not actually work. Running a job with this pattern produces an error like this:
Caused by: groovy.lang.GroovyRuntimeException: Failed to parse template script (your template may contain an error or be trying to use expressions not currently supported): startup failed:
SimpleTemplateScript1.groovy: 1: illegal string body character after dollar sign;
solution: either escape a literal dollar sign "\$5" or bracket the value expression "${5}" @ line 1, column 10.
out.print("""
^
1 error
I'm assuming this is because of the templating introduced in #5341 (also not documented). Job specs appear to have special handling for both $, which needs to be escaped as \$, and backslashes, which are automatically escaped to \\.
2️⃣ Related to the above, it's not clear how someone would write a single backslash character in their regex. For example, I think the regex .*\.parquet$ is impossible to use, because there is no apparent way to produce a single backslash: \ turns into \\ and \\ stays as \\.
This issue can be worked around by using character classes and writing .*[.]parquet$, but it feels wrong.
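To confirm that the character-class workaround is equivalent, here is a minimal sketch using plain java.util.regex, outside of Pinot and its templating. The class name DotEscapeCheck and the sample file names are hypothetical, chosen only for illustration.

```java
import java.util.regex.Pattern;

class DotEscapeCheck {
    // Returns true when the whole input matches the regex (java.util.regex semantics).
    static boolean fullMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        // The regex .*\.parquet$ written as a Java string literal (backslash doubled for Java only).
        String escaped = ".*\\.parquet$";
        // Backslash-free spelling of the same regex.
        String charClass = ".*[.]parquet$";

        String[] samples = {"part-00000.gz.parquet", "part-00000_parquet", "parquet"};
        for (String s : samples) {
            System.out.println(s + " -> escaped=" + fullMatch(escaped, s)
                    + ", charClass=" + fullMatch(charClass, s));
        }
    }
}
```

Both spellings accept and reject the same inputs here, so the workaround loses nothing functionally. It only sidesteps the question of how the job spec treats backslashes.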
3️⃣ What flavor of regex is actually being used here? regextester.com, linked in the documentation, only supports PCRE and JavaScript regex. However, given that the code uses PathMatcher, this is really Java regex, which has different syntax. Pinot should link to a regex tester that is accurate for Java.
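A quick way to see the Java flavor in action is to hand PathMatcher a construct that only java.util.regex understands. This is a minimal sketch against the JDK's default file system, not Pinot's code. The class name RegexFlavorCheck and the file names are hypothetical, and the path strings assume a Unix-like separator.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

class RegexFlavorCheck {
    // Builds a PathMatcher from a "syntax:pattern" string and tests it against a path.
    static boolean pathMatches(String syntaxAndPattern, String path) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher(syntaxAndPattern);
        return matcher.matches(Paths.get(path));
    }

    public static void main(String[] args) {
        // \p{javaLowerCase} is specific to java.util.regex.Pattern.
        String javaOnly = "regex:\\p{javaLowerCase}+[.]avro";
        System.out.println(pathMatches(javaOnly, "data.avro"));
        System.out.println(pathMatches(javaOnly, "DATA.avro"));

        // The regex has to match the whole path string, not just the file name.
        System.out.println(pathMatches("regex:[^/]*[.]avro", "dir/data.avro"));
        System.out.println(pathMatches("regex:.*[.]avro", "dir/data.avro"));
    }
}
```

A tester that targets PCRE or JavaScript would reject or misread the first pattern, which is why a Java-aware tester is the safer link for the docs.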
4️⃣ Can you provide some examples of the absolute path I should be matching against? I've submitted an ingestion job spec that has includeFileNamePattern: regex:^s3://redactedCompanyName/metrics_rollup_dev/redactedTableName/v/4/ds=(2023-03-02)/.*[.]parquet$
I have an S3 file at the following path:
s3://redactedCompanyName/metrics_rollup_dev/redactedTableName/v/4/ds=2023-03-02/part-00000-d60ed2b8-30cd-4e7c-82e0-309f854991f5.c000.gz.parquet
According to regex101.com, this is a match using Java 8 syntax: https://regex101.com/r/9ZKOhm/1
It's unclear to me what I'm doing wrong that's causing this pattern to not match.
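One thing worth checking is which string the pattern is actually compared against. The sketch below tries the same regex against two candidates: the full URI string, and only the path component returned by java.net.URI.getPath(). It is an assumption, not something confirmed from Pinot's source, that the matcher might see only the path component. The class name S3PatternCheck and the constant PATH_ONLY_REGEX are hypothetical.

```java
import java.net.URI;
import java.util.regex.Pattern;

class S3PatternCheck {
    static final String FILE =
            "s3://redactedCompanyName/metrics_rollup_dev/redactedTableName/v/4/ds=2023-03-02/"
            + "part-00000-d60ed2b8-30cd-4e7c-82e0-309f854991f5.c000.gz.parquet";

    // The regex from the job spec, without the "regex:" prefix.
    static final String REGEX =
            "^s3://redactedCompanyName/metrics_rollup_dev/redactedTableName/v/4/ds=(2023-03-02)/.*[.]parquet$";

    // Hypothetical variant that omits the scheme and bucket.
    static final String PATH_ONLY_REGEX =
            "^/metrics_rollup_dev/redactedTableName/v/4/ds=(2023-03-02)/.*[.]parquet$";

    // Full-string match with java.util.regex semantics.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    // Path component of a URI: everything after the scheme and authority (bucket).
    static String pathOnly(String uri) {
        return URI.create(uri).getPath();
    }

    public static void main(String[] args) {
        System.out.println("full URI vs REGEX:      " + fullMatch(REGEX, FILE));
        System.out.println("path component:         " + pathOnly(FILE));
        System.out.println("path vs REGEX:          " + fullMatch(REGEX, pathOnly(FILE)));
        System.out.println("path vs PATH_ONLY_REGEX: " + fullMatch(PATH_ONLY_REGEX, pathOnly(FILE)));
    }
}
```

If the matcher only ever sees the path component, a pattern anchored on ^s3:// can never match, even though it matches the full URI in a regex tester. Separately, on a Unix-like default file system, Paths.get collapses the // after the scheme into a single /, which would also defeat a pattern written against the full URI. Either way, the docs should state which string the pattern is applied to.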