Change to tarball generation?

Thu Jun 21 19:39:44 UTC 2012

2012/5/23 Michael Pyne <mpyne at kde.org>:
> As an example, try:
>
> $ tar cf kdefoo-x.y.z.tar kdefoo-x.y.z/
> $ pixz kdefoo-x.y.z.tar
> # resulting in kdefoo-x.y.z.tar.xz
>
> Because pixz is parallelized it works on whole blocks of data at a time and as
> far as I can tell makes no special provision for the last bits of compressed
> data being smaller than the block size.
>
> With a normal tar file the decompressed data you get is:
>
> 0--------------------------------*  (where * is end of data and end of file)
>
> With a pixz-encoded tar file the decompressed data you get is:
>
> 0--------------------------------*x$  (* is end of data, $ is end of file)
>
> When you run a command like "tar xfJ kdefoo-x.y.z.tar.xz" everything will
> still work fine: tar knows exactly where the data should really end and will
> stop decompressing when it needs to.
>
> When you run a pipeline like "xz --decompress kdefoo-x.y.z.tar.xz | tar xf -"
> though, there's no way to tell xz to stop decompressing early. It tries to
> write all the decompressed data to the pipe. tar still knows exactly where to
> stop, and does so at the '*', not the '$', and closes its input (a pipe!)
> early.
>
> When xz tries to write the 'x$' (garbled data) of the decompressed output, it
> writes to a now-broken pipe, which kills xz with SIGPIPE.
>
> Scripts trying to drive automated extraction of that data using a pipeline
> just see that an error occurred, and will therefore abort. This has affected a
> couple of source-based distributions, but it is also annoying for those
> extracting manually, who have to figure out that their tarball actually
> extracted correctly.
>
> So the problem is only with parallelizing compressors that take advantage of
> the allowance to write garbled data past the end of a file and still have the
> decompressor "figure it out". It seems pretty implausible to me that a
> parallelizing compressor would always do this; perhaps it only occurs when
> the compressor is run with tar (e.g. tar cJf) instead of as a separate step?

The "garbled data" has nothing to do with parallelization. pixz stands
for "parallel and indexed xz". Apart from compressing in parallel, it
stores a custom-formatted index at the end of the tarball, apparently to
allow random access. That index is presumably the extra data that shows
up after the end of the archive.
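
For anyone who wants to see the mechanism in isolation, here is a
minimal sketch in Python. It is only an illustration under stated
assumptions: the 4096-byte trailer is a hypothetical stand-in for
whatever pixz appends, the function names are made up for this example,
and Python's tarfile module is not GNU tar, so details may differ. It
shows two things: a tar reader stops at the end-of-archive marker and
leaves trailing bytes unread, and a write to a pipe whose reader has
gone away fails. Python ignores SIGPIPE and reports the failure as
BrokenPipeError; a program with the default signal disposition is
killed instead.

```python
import io
import os
import tarfile


def build_tar_with_trailer(trailer: bytes) -> bytes:
    """Return an uncompressed tar archive followed by extra bytes."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        payload = b"hello\n"
        # Hypothetical member name, borrowed from the example in the thread.
        info = tarfile.TarInfo(name="kdefoo-x.y.z/README")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    # The trailer is an assumed stand-in for data appended after the archive.
    return buf.getvalue() + trailer


def unread_after_listing(data: bytes) -> int:
    """List the archive and report how many bytes were never consumed."""
    stream = io.BytesIO(data)
    with tarfile.open(fileobj=stream, mode="r:") as tar:
        tar.getnames()  # walks the headers until the end-of-archive marker
        consumed = stream.tell()
    return len(data) - consumed


def write_to_closed_pipe(chunk: bytes) -> bool:
    """Return True if writing after the reader has gone fails with EPIPE."""
    read_end, write_end = os.pipe()
    os.close(read_end)  # the reader stops early, like tar in the pipeline
    try:
        os.write(write_end, chunk)
    except BrokenPipeError:
        # Python ignores SIGPIPE, so the error surfaces as an exception.
        return True
    finally:
        os.close(write_end)
    return False


if __name__ == "__main__":
    data = build_tar_with_trailer(b"x" * 4096)
    print("bytes left unread:", unread_after_listing(data))
    print("write after reader closed failed:", write_to_closed_pipe(b"x"))
```

The reader never touches the trailer, which is why extraction itself
succeeds while the process on the writing side of the pipe still
reports a failure.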

I also noticed that pixz produces larger output than standard xz,
even without counting the extra index data. See:
http://article.gmane.org/gmane.comp.kde.releases/5555

Please do not use pixz for KDE tarballs again...

-- 
Nicolás