Writing broken records or texts to a file to store them or retry later · embulk/embulk#27

(6 comments) (0 reactions) (0 assignees) Java (1,711 stars) (204 forks) batch import

help wanted, new feature

Description

The data includes many broken records. A bulk import can skip them, but we want to load them later as an exceptional case. To do that, we want those error records written to other files or databases.

The difficulty in API design is that the format of the records can differ depending on the plugin type.

Encoder, Decoder, some Parser plugins
- These plugins read through a buffer and can't recognize "records". When they detect broken data, they skip the entire file.
Line-based parser plugins
- Some parser plugins are based on lines (e.g. csv). They can skip a line and continue parsing from the next line.
Formatter and Output plugins
- Formatter plugins read records. They can skip a record whose schema is fixed by the previous plugins.
Filter plugins (#26)
- Filter plugins read records. They can skip a record whose schema is fixed by the previous plugins.
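To make the line-based case above concrete, here is a minimal sketch of a parser that skips a broken line, keeps it for later, and continues with the next line. It is an illustration only: `LineSkippingSketch`, `Result`, and `parse` are hypothetical names rather than Embulk's parser API, and "broken" is assumed to mean simply "wrong number of columns".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch; this is not Embulk's parser plugin API. */
public class LineSkippingSketch {
    /** Parsed rows plus the raw lines that were skipped. */
    static final class Result {
        final List<String[]> rows = new ArrayList<>();
        final List<String> brokenLines = new ArrayList<>();
    }

    /**
     * Splits each line on commas. A line with the wrong column count is
     * treated as broken (an assumption made for this sketch).
     */
    static Result parse(List<String> lines, int expectedColumns) {
        Result result = new Result();
        for (String line : lines) {
            // limit -1 keeps trailing empty columns
            String[] columns = line.split(",", -1);
            if (columns.length == expectedColumns) {
                result.rows.add(columns);
            } else {
                // Skip the line, but keep the raw text so it can be stored
                // and retried later; parsing continues with the next line.
                result.brokenLines.add(line);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Result result = parse(Arrays.asList("1,alice", "broken", "2,bob"), 2);
        System.out.println(result.rows.size() + " rows, "
                + result.brokenLines.size() + " broken");
    }
}
```

A buffer-based plugin cannot do this, because it has no line boundary to resume from. That is why the encoder and decoder case falls back to skipping the whole file.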

So, depending on the plugin, the error output needs to store three kinds of data:

a) a file from a certain position
b) a line
c) a record with various schemas
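One possible way to model these three kinds of data behind a single error-output interface is sketched below. Every name here (`ErrorData`, `FileFromPosition`, `BrokenLine`, `BrokenRecord`, `InMemoryErrorSink`) is hypothetical and none of them exists in Embulk. The sketch only shows that the three shapes can share one write path, not how the final API should look.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; none of these types exist in Embulk. */
public class ErrorOutputSketch {
    /** Common supertype for the three kinds of error data. */
    interface ErrorData {
        String describe();
    }

    /** a) The rest of a file from a byte offset (encoder, decoder, buffer-based parsers). */
    static final class FileFromPosition implements ErrorData {
        final String fileName;
        final long offset;
        FileFromPosition(String fileName, long offset) {
            this.fileName = fileName;
            this.offset = offset;
        }
        @Override public String describe() {
            return "file " + fileName + " from offset " + offset;
        }
    }

    /** b) A single raw line (line-based parsers such as csv). */
    static final class BrokenLine implements ErrorData {
        final long lineNumber;
        final String text;
        BrokenLine(long lineNumber, String text) {
            this.lineNumber = lineNumber;
            this.text = text;
        }
        @Override public String describe() {
            return "line " + lineNumber + ": " + text;
        }
    }

    /** c) A record whose schema was fixed by upstream plugins (filter, formatter, output). */
    static final class BrokenRecord implements ErrorData {
        final Map<String, Object> values;
        BrokenRecord(Map<String, Object> values) {
            // Copy so later changes by the caller do not alter the stored record.
            this.values = new LinkedHashMap<>(values);
        }
        @Override public String describe() {
            return "record with columns " + values.keySet();
        }
    }

    /** Hypothetical sink that collects errors in memory instead of a file or database. */
    static final class InMemoryErrorSink {
        private final List<ErrorData> errors = new ArrayList<>();
        void write(ErrorData error) {
            errors.add(error);
        }
        List<ErrorData> errors() {
            return Collections.unmodifiableList(errors);
        }
    }

    public static void main(String[] args) {
        InMemoryErrorSink sink = new InMemoryErrorSink();
        sink.write(new FileFromPosition("input.csv.gz", 4096L));
        sink.write(new BrokenLine(17L, "1,alice,,extra"));
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("id", 1L);
        row.put("name", "alice");
        sink.write(new BrokenRecord(row));
        for (ErrorData error : sink.errors()) {
            System.out.println(error.describe());
        }
    }
}
```

A real design would also have to decide how each kind is serialized for a later retry, which this sketch leaves open.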

Contributor guide

Tech stack: java
Domain: backend, data
Issue type: feature
Difficulty: 4
Estimated time: over 1 week
Activity status: stale
Clarity: needs investigation
Prerequisites: understanding of Embulk plugin types, familiarity with bulk data loading
Beginner friendliness: 15
Research direction: The issue discusses handling broken records in Embulk by storing them for a later retry. Different plugin types (encoder, parser, formatter) differ in whether they can skip a single record, a line, or only a whole file. The challenge is designing an API that can output error data in three forms: a file from a certain position, a line, or a record with a varying schema. Research should explore existing patterns in similar tools (e.g., Logstash) and propose a solution that accommodates each plugin type's constraints.