ipfs/kubo

feature: gzip multi-member dependent chunker / importer, warc, tar

Open

#3604 opened on Jan 17, 2017

Labels: batch import, help wanted, kind/enhancement

Description

Version information:

go-ipfs version: 0.4.4

Type: Feature, Enhancement

Priority: P4

Area: Tools, Importer

Description:

As WARC files illustrate, gzip files support multiple members, so a large file can be stitched together from smaller ones by mere concatenation.
This makes it possible to compress the metadata and each record separately, concatenate them into a single file, and then do partial fetches and decompression, including via HTTP Range requests.
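
To make the concatenation property concrete, here is a minimal sketch using only Go's standard `compress/gzip` package. The helper names `gzipMember` and `gunzip` are made up for this illustration and are not part of go-ipfs. It compresses two pieces separately, joins the bytes, and then decompresses both the whole file and a byte range that covers only the second member, which is roughly what an HTTP Range request for one record would return.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// gzipMember compresses data into one standalone gzip member.
// (Hypothetical helper for this sketch.)
func gzipMember(data []byte) ([]byte, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(data); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// gunzip decompresses every member found in data. Go's gzip.Reader
// reads concatenated members as one stream by default.
func gunzip(data []byte) ([]byte, error) {
	zr, err := gzip.NewReader(bytes.NewReader(data))
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	return io.ReadAll(zr)
}

func main() {
	meta, err := gzipMember([]byte("meta\n"))
	if err != nil {
		panic(err)
	}
	record, err := gzipMember([]byte("record\n"))
	if err != nil {
		panic(err)
	}

	// Plain byte concatenation of two members is itself a valid gzip file.
	whole := append(append([]byte{}, meta...), record...)

	all, err := gunzip(whole)
	if err != nil {
		panic(err)
	}
	fmt.Printf("whole file:    %q\n", all)

	// A byte range starting at a member boundary decompresses on its own.
	part, err := gunzip(whole[len(meta):])
	if err != nil {
		panic(err)
	}
	fmt.Printf("second member: %q\n", part)
}
```

The partial read only works because the range starts exactly at a member boundary, which is why the boundaries matter for chunking.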

By having the static chunker also split at gzip member boundaries, one can construct .tar.gz files, .tar files of .gz files, and all sorts of derived data sets without duplicating data.

There are two ways to approach this:

a) The chunker works as usual but additionally splits a block at each member boundary. The result maps 1:1 onto the usual chunking, except that each block containing a member boundary is replaced by two smaller blocks split at that boundary.

b) The chunker works as usual, but when it encounters a gzip member boundary it cuts the current block short and starts the new member in its own 256k data block. This shifts all later block boundaries and hence duplicates data, so it is probably not the way to do it.
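
Approach a) could be sketched roughly as follows. This is an illustration only, not the go-ipfs chunker interface: `memberEnds`, `cutPoints`, and `mustGzip` are hypothetical names, and the sketch works on an in-memory byte slice instead of a stream. It relies on the standard library's `gzip.Reader.Multistream(false)` to stop at the end of each member, and on `bytes.Reader` implementing `io.ByteReader` so the decompressor does not read past the member.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
	"sort"
)

// memberEnds returns the byte offset just past each gzip member in data.
func memberEnds(data []byte) ([]int, error) {
	br := bytes.NewReader(data)
	zr, err := gzip.NewReader(br)
	if err != nil {
		return nil, err
	}
	defer zr.Close()

	var ends []int
	for {
		zr.Multistream(false) // stop at the end of the current member
		if _, err := io.Copy(io.Discard, zr); err != nil {
			return nil, err
		}
		ends = append(ends, len(data)-br.Len())
		// Reset reads the next member header, or reports io.EOF at the end.
		if err := zr.Reset(br); err == io.EOF {
			break
		} else if err != nil {
			return nil, err
		}
	}
	return ends, nil
}

// cutPoints merges the fixed-size grid with the member boundaries
// (approach a). Each returned offset is the end of one block.
func cutPoints(total, chunkSize int, ends []int) []int {
	seen := map[int]bool{}
	var cuts []int
	add := func(off int) {
		if off > 0 && off <= total && !seen[off] {
			seen[off] = true
			cuts = append(cuts, off)
		}
	}
	if chunkSize > 0 {
		for off := chunkSize; off < total; off += chunkSize {
			add(off)
		}
	}
	for _, e := range ends {
		add(e)
	}
	add(total)
	sort.Ints(cuts)
	return cuts
}

// mustGzip compresses s into a single gzip member (helper for the demo).
func mustGzip(s string) []byte {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write([]byte(s)); err != nil {
		panic(err)
	}
	if err := zw.Close(); err != nil {
		panic(err)
	}
	return buf.Bytes()
}

func main() {
	data := append(mustGzip("first record"), mustGzip("second record")...)
	ends, err := memberEnds(data)
	if err != nil {
		panic(err)
	}
	// A tiny chunk size stands in for the 256k default, for readability.
	fmt.Println("member ends:", ends)
	fmt.Println("block ends: ", cutPoints(len(data), 16, ends))
}
```

Because the fixed-size grid stays anchored at offset 0, blocks that contain no member boundary are unchanged from the usual chunking. A real importer would have to find boundaries while streaming, which this sketch does not attempt.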

This should work for all gzip files, tar files, and more.

Related: https://tools.ietf.org/html/rfc1952
