feature: gzip multi-member dependent chunker / importer, warc, tar
#3604 opened on Jan 17, 2017
Version information:
go-ipfs version: 0.4.4
Type: Feature, Enhancement
Priority: P4
Area: Tools, Importer
Description:
As WARC files demonstrate, gzip files support multiple members, so a large file can be stitched together from smaller ones by mere concatenation.
This makes it possible to compress the metadata and each record separately, concatenate them into a single file, and then fetch and decompress individual parts, including via HTTP Range requests.
If the static chunker also split at gzip member boundaries, one could construct .tar.gz files, .tar files of .gz files, and all sorts of derived data sets without duplicating data.
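The multi-member property described above can be illustrated with a minimal Python sketch. It is not go-ipfs code; the record contents and the `offsets` index are made up for illustration, and the byte slice merely stands in for what an HTTP Range request would return.

```python
import gzip

# Compress three records independently; each call yields one complete gzip member.
records = [b"metadata\n", b"record one\n", b"record two\n"]
members = [gzip.compress(r) for r in records]

# Concatenating members gives a valid multi-member gzip file (RFC 1952).
combined = b"".join(members)
whole = gzip.decompress(combined)

# Byte range (start, end) of each member, as an external index might store it.
offsets = []
pos = 0
for m in members:
    offsets.append((pos, pos + len(m)))
    pos += len(m)

# "Partial fetch": take only the second member's byte range and decompress it alone.
start, end = offsets[1]
partial = gzip.decompress(combined[start:end])
```

Here `whole` contains all three records in order, while `partial` contains only the second record, decompressed without touching the other members.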
There are two ways to approach this:
a) the chunker works as usual, but additionally splits a block at each member boundary
(the result matches the usual chunking 1:1, except that each block containing a member boundary is replaced by two blocks split at that boundary)
b) the chunker works as usual, but when it encounters a gzip member boundary it cuts the current block short and starts the new member in its own 256k data block
(this shifts all later block boundaries and hence duplicates data; probably not the way to do it)
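To make the difference between the two approaches concrete, here is a rough Python sketch that only computes cut points (block end offsets). The function names `cuts_a` and `cuts_b` are hypothetical, the member boundaries are assumed to be already known, and the tiny block size is chosen just to keep the numbers readable. It is not the go-ipfs chunker.

```python
def cuts_fixed(total, size):
    """Cut points of the usual fixed-size chunker."""
    return list(range(size, total, size)) + [total]


def cuts_a(total, size, boundaries):
    """Approach a): keep the fixed grid and add a cut at every member boundary."""
    extra = {b for b in boundaries if 0 < b < total}
    return sorted(set(cuts_fixed(total, size)) | extra)


def cuts_b(total, size, boundaries):
    """Approach b): restart the fixed grid at every member boundary."""
    cuts = []
    start = 0
    for end in sorted({b for b in boundaries if 0 < b < total} | {total}):
        cuts.extend(range(start + size, end, size))  # full blocks inside this member
        cuts.append(end)                             # short block ending the member
        start = end
    return cuts


# A 25-byte file, 10-byte blocks, one member boundary at offset 13.
plain = cuts_fixed(25, 10)
a = cuts_a(25, 10, [13])
b = cuts_b(25, 10, [13])
```

In this example `plain` is `[10, 20, 25]`, `a` is `[10, 13, 20, 25]`, and `b` is `[10, 13, 23, 25]`. Approach a) keeps every cut of the plain chunking and only adds one at the boundary, whereas approach b) moves the later cut from 20 to 23, which is the shift mentioned above.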
This should work for all gzip files, tar files, and more.
Related: https://tools.ietf.org/html/rfc1952