COLLECT_SET should apply to tables as well as streams
#9140 opened on May 20, 2022
Description
Currently, COLLECT_SET is only available for windowed stream aggregations, whereas COLLECT_LIST is available for both table and stream aggregations.
The reason for this discrepancy is that table and stream aggregations are computed differently at runtime. Both define an "adder" for adding a record to the aggregation, but table aggregations must also define a "subtractor" for removing one. Stream aggregations never remove a record from a window, since historical stream events are immutable, whereas tables support deleting and modifying existing records. Stream records can therefore never "leave" an aggregation, whereas table records can.
The punchline for COLLECT_SET is that if I have an input table like:
key | color | number
a | red | 17
b | red | 17
c | red | 18
and I run a query like:
CREATE ... AS SELECT color, COLLECT_SET(number) AS numbers FROM input GROUP BY color;
then I should get the result:
color | numbers
red | {17, 18}
At this point, if we delete either record a or record b from the input table, the set should remain {17, 18}, because the other record still contributes 17. Only if we delete both a and b should the set become {18}. In other words, we can't handle the deletion of a (for example) by simply removing the corresponding element from the set.
One possible solution is to maintain a counter for each set element (internally) and decrement the counter when removing from the set.
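To make the counter idea concrete, here is a minimal, self-contained Java sketch. It is not ksqlDB's actual UDAF code: the class name `CountedSet` and its methods `add`, `remove`, and `values` are hypothetical. They are meant only to mirror the "adder" and "subtractor" roles described above, assuming the runtime calls the subtractor once for each deleted or replaced row.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical reference-counted set; a sketch, not ksqlDB's implementation.
final class CountedSet<T> {
    // Maps each distinct value to the number of live input rows contributing it.
    private final Map<T, Integer> counts = new LinkedHashMap<>();

    // "Adder": called when a row enters the aggregation.
    void add(T value) {
        counts.merge(value, 1, Integer::sum);
    }

    // "Subtractor": called when a row is deleted or its old value is replaced.
    void remove(T value) {
        // Decrement the count; returning null drops the entry once no row contributes it.
        counts.computeIfPresent(value, (k, n) -> n > 1 ? n - 1 : null);
    }

    // The set that would be exposed in the query result (a copy, in insertion order).
    Set<T> values() {
        return new LinkedHashSet<>(counts.keySet());
    }

    public static void main(String[] args) {
        CountedSet<Integer> red = new CountedSet<>();
        red.add(17); // record a
        red.add(17); // record b
        red.add(18); // record c
        System.out.println(red.values()); // [17, 18]

        red.remove(17); // delete record a: b still contributes 17
        System.out.println(red.values()); // [17, 18]

        red.remove(17); // delete record b: nothing contributes 17 any more
        System.out.println(red.values()); // [18]
    }
}
```

One consequence of this design is that the intermediate aggregate has to store the counts, not just the set, so its serialized form would differ from the array that COLLECT_SET returns to the user.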