apache/pinot

IngestionJobSpec: includeFileNamePattern with Regex does not work as documented

Open

#10,611 建立於 2023年4月14日

在 GitHub 查看
 (1 留言) (0 反應) (0 負責人)Java (4,937 star) (1,234 fork)batch import
documentationgood first issue

描述

According to the docs, includeFileNamePattern and excludeFileNamePattern are documented like:

Only Files matching this pattern will be included from inputDirURI. Both glob and regex patterns are supported. Examples: Use 'glob:.avro'or 'regex:^..(avro)$' to include all avro files one level deep in the inputDirURI. Alternatively, use 'glob:*/.avro' to include all the avro files in inputDirURI as well as its subdirectories - bear in mind that, with this approach, the pattern needs to match the absolute path. You can use Glob tool or Regex Tool to test out your patterns.

A few issues here:

1️⃣ The example of regex:^..(avro)$ does not actually work. When running a job with this pattern, you'll get an error like this

Caused by: groovy.lang.GroovyRuntimeException: Failed to parse template script (your template may contain an error or be trying to use expressions not currently supported): startup failed:
SimpleTemplateScript1.groovy: 1: illegal string body character after dollar sign;
   solution: either escape a literal dollar sign "\$5" or bracket the value expression "${5}" @ line 1, column 10.
   out.print("""
            ^

1 error

I'm assuming this because of the templating that was introduced in #5341 (also not documented) , but job spec's appear to have special handling for both $, which needs to be escaped: \$, and backslashes which are automatically escaped to \\

2️⃣ Related to the above, it's not clear how someone would write a single backslash character in their regex. For example, I think this is an impossible regex to use .*\.parquet$ because it's not clear how to get the single backslash character. \ turns into \\ and \\ stays as \\. This issue can be worked around by using character classes and writing .*[.]parquet$, but it feels wrong.

3️⃣ What flavor of regex is actually being used here? regextester.com linked in the documentation only supports PCRE and Javascript regex. However, I suspect this really java regex, which has different syntax. Given the code uses PathMatcher, it's java regex. Pinot should link to a regex tester that will be accurate

4️⃣ Can you provide some examples of the absolute path I should be matching to? I've submitted an ingestion job spec that has includeFileNamePattern: regex:^s3://redactedCompanyName/metrics_rollup_dev/redactedTableName/v/4/ds=(2023-03-02)/.*[.]parquet$

I have an s3 file with the following name at the path: s3://redactedCompanyName/metrics_rollup_dev/redactedTableName/v/4/ds=2023-03-02/part-00000-d60ed2b8-30cd-4e7c-82e0-309f854991f5.c000.gz.parquet

According to regex101.com, this is a match using Java8 syntax: https://regex101.com/r/9ZKOhm/1

It's unclear to me what I'm doing wrong that's causing this pattern to not match.

貢獻者指南