embulk/embulk

Writing broken records or texts to a file to store them or retry later

Open

#27 opened on January 29, 2015

(6 comments) (0 reactions) (0 assignees) Java (1,711 stars) (204 forks) batch import
Labels: help wanted, new feature

Description

Data often includes a lot of broken records. A bulk import can skip them, but we want to load them later as an exceptional case. To do that, we want those error records written to other files or databases.

The difficulty in terms of API design is that the format of the records can differ depending on the plugin type.

  • Encoder, Decoder, some Parser plugins
    • These plugins read through a buffer, so they can't recognize "records". When they detect broken data, they skip the entire file.
  • Line-based parser plugins
    • Some parser plugins are based on lines (e.g. csv). They can skip a line and continue parsing from the next line.
  • Formatter and Output plugins
    • Formatter plugins read records. They can skip a record whose schema is fixed by the previous plugins.
  • Filter plugins (#26)
    • Filter plugins read records. They can skip a record whose schema is fixed by the previous plugins.
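The line-based case above might look roughly like the following. This is only a sketch and not Embulk's parser API: `LineSkippingParser`, `Result`, and the two-integer line format are all made up for illustration. It shows a parser that sets a broken line aside and continues from the next line.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not Embulk's API: a line-based parser that sets
// broken lines aside instead of aborting the whole file.
public class LineSkippingParser {
    public static final class Result {
        public final List<int[]> records = new ArrayList<>();
        public final List<String> brokenLines = new ArrayList<>();
    }

    // Parses lines of the assumed form "<int>,<int>"; anything else is broken.
    public static Result parse(List<String> lines) {
        Result result = new Result();
        for (String line : lines) {
            String[] cols = line.split(",", -1);
            try {
                if (cols.length != 2) {
                    throw new NumberFormatException("expected 2 columns");
                }
                result.records.add(new int[] {
                    Integer.parseInt(cols[0].trim()),
                    Integer.parseInt(cols[1].trim())
                });
            } catch (NumberFormatException e) {
                // Keep the raw line so it can be stored and retried later.
                result.brokenLines.add(line);
            }
        }
        return result;
    }
}
```

Keeping the raw line rather than a partially parsed record means the retry does not depend on how far parsing got before it failed.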

So, depending on the plugin, the error output needs to store 3 kinds of data:

a) file from a certain position
b) line
c) record with various schema
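One possible way to model these three kinds of data is a common interface with one implementation per kind. This is a sketch for discussion only: `ErrorData`, `BrokenData`, `BrokenFile`, `BrokenLine`, and `BrokenRecord` are hypothetical names, not existing Embulk types, and a record is assumed to be a simple column-name-to-value map.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, not Embulk's API: one type per kind of error data.
public class ErrorData {
    public interface BrokenData {
        String describe();
    }

    // a) a file that could not be read from a certain byte position onward
    public static final class BrokenFile implements BrokenData {
        public final String path;
        public final long offset;
        public BrokenFile(String path, long offset) {
            this.path = path;
            this.offset = offset;
        }
        public String describe() {
            return "file " + path + " from offset " + offset;
        }
    }

    // b) a single line that a line-based parser could not parse
    public static final class BrokenLine implements BrokenData {
        public final long lineNumber;
        public final String text;
        public BrokenLine(long lineNumber, String text) {
            this.lineNumber = lineNumber;
            this.text = text;
        }
        public String describe() {
            return "line " + lineNumber + ": " + text;
        }
    }

    // c) a record whose schema depends on the upstream plugins
    public static final class BrokenRecord implements BrokenData {
        public final Map<String, Object> values;
        public BrokenRecord(Map<String, Object> values) {
            // Defensive copy that preserves column order.
            this.values = Collections.unmodifiableMap(new LinkedHashMap<>(values));
        }
        public String describe() {
            return "record " + values;
        }
    }
}
```

An error output could then accept any `BrokenData` and decide per kind how to persist it, for example copying the file tail for a), appending text for b), and serializing columns for c).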
