[Beowulf] OT: recoverable optical media archive format?

David Mathog mathog at caltech.edu
Thu Jun 10 12:20:39 PDT 2010


Jesse Becker and others suggested:

>     http://users.softlab.ntua.gr/~ttsiod/rsbep.html

I tried it and it works, mostly, but definitely has some warts.

To start with I gave it a negative control - a file so badly corrupted
it should NOT have been able to recover it.

% ssh remotePC 'dd if=/dev/sda1 bs=8192' >img.orig
% cat img.orig      | bzip2 >img.bz2.orig
% cat img.bz2.orig  | rsbep > img.bz2.rsbep
% cat img.bz2.rsbep | pockmark -maxgap 100000 -maxrun 10000 >img.bz2.rsbep.pox
% cat img.bz2.rsbep.pox | rsbep -d -v >img.bz2.restored
rsbep: number of corrected failures   : 9725096
rsbep: number of uncorrectable blocks : 0

img.orig is an image of a Windows XP partition with all empty space
filled with 0x0 bytes.  It was compressed with bzip2, run through rsbep
(the one from the link above), and then corrupted with pockmark.
Pockmark is my own little concoction.  When used as shown it stamps runs
of 0x0 bytes over the input: each run starts 1 to MAXGAP bytes after the
previous one and is 1 to MAXRUN bytes long, with the gap and run length
chosen at random from those ranges for each new gap/run.
This should corrupt around 10% of the file, which I assumed would render
it unrecoverable.  Notice in the file sizes below that the overall size
did not change when the file was run through pockmark.  rsbep did not
report any errors it couldn't correct.  However, the restored file is
not the same size as the original.
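The pockmark source is not shown here, so the following is only a minimal
Python sketch of the behaviour described above, not the actual tool.  The
function names, the seed argument, and the choice to measure each gap from
the end of the previous run are all assumptions made for illustration.  It
also shows where the "around 10%" estimate comes from: the mean run length
divided by the mean gap plus the mean run.

```python
import random

def pockmark(data: bytes, maxgap: int, maxrun: int, seed=None) -> bytes:
    """Hypothetical stand-in for pockmark: zero out random runs.

    The output has the same length as the input.
    """
    rng = random.Random(seed)
    out = bytearray(data)
    pos = 0
    while True:
        # Assumption: the gap is counted from the end of the previous run.
        pos += rng.randint(1, maxgap)
        if pos >= len(out):
            break
        run = min(rng.randint(1, maxrun), len(out) - pos)
        out[pos:pos + run] = bytes(run)   # stamp 0x0 bytes over the run
        pos += run
    return bytes(out)

def expected_fraction(maxgap: int, maxrun: int) -> float:
    """Expected fraction of bytes overwritten under the assumptions above."""
    mean_gap = (1 + maxgap) / 2
    mean_run = (1 + maxrun) / 2
    return mean_run / (mean_gap + mean_run)

if __name__ == "__main__":
    print(f"{expected_fraction(100000, 10000):.3f}")    # about 0.091
    print(f"{expected_fraction(1000000, 10000):.4f}")   # about 0.0099
```

Under these assumptions -maxgap 100000 -maxrun 10000 overwrites roughly 9%
of the bytes, and raising maxgap to 1000000 drops that to roughly 1%.  Bytes
that were already 0x0 are overwritten but not changed, so the number of
bytes actually altered can be somewhat lower.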

 4056976560 2010-06-08 17:51 img.bz2.restored
 4639143600 2010-06-08 16:19 img.bz2.rsbep.pox
 4639143600 2010-06-08 16:13 img.bz2.rsbep
 4056879025 2010-06-08 14:40 img.bz2.orig
20974431744 2010-06-07 15:23 img.orig

% bunzip2 -tvv img.bz2.restored
  img.bz2.restored: 
    [1: huff+mtf data integrity (CRC) error in data

So at the very least rsbep sometimes says it has recovered a file when
it has not.  I didn't really expect it to rescue this particular input,
but it should have reported the blocks it could not correct.  I reran it
with a less damaged file like this:


% cat img.bz2.rsbep | pockmark -maxgap 1000000 -maxrun 10000 >img.bz2.rsbep.pox2
% cat img.bz2.rsbep.pox2 | rsbep -d -v >img.bz2.restored2
rsbep: number of corrected failures   : 46025036
rsbep: number of uncorrectable blocks : 0
% bunzip2 img.bz2.restored2
bunzip2: Can't guess original name for img.bz2.restored2 -- using img.bz2.restored2.out
bunzip2: img.bz2.restored2: trailing garbage after EOF ignored
% md5sum img.bz2.restored2.out img.orig
7fbaec7143c3a17a31295a803641aa3c  img.bz2.restored2.out
7fbaec7143c3a17a31295a803641aa3c  img.orig

This time it was able to recover the corrupted file, but again it
created an output file of a different size.  Is this always the case?
It seems to be, at least for a file of the size used here:

% cat img.bz2.orig | rsbep | rsbep -d > nopox.bz2

nopox.bz2 is also 4056976560 bytes.  In every test the decoded output is
97535 bytes larger than the original, which may bear some relation to
the z=ERR_BURST_LEN parameter, as:

 97535 / 765 = 127.496732

which is suspiciously close to 255/2.  Or that could just be a coincidence.
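The file sizes listed earlier allow one more check, sketched below in
Python.  It assumes, without having read the rsbep source, that the encoder
uses 255-byte code words carrying 223 data bytes each; the variable names
are illustrative only.

```python
# Sizes in bytes, copied from the listing above.
orig = 4056879025       # img.bz2.orig
encoded = 4639143600    # img.bz2.rsbep
restored = 4056976560   # img.bz2.restored and nopox.bz2

extra = restored - orig
print(extra)            # 97535
print(extra / 765)      # 127.4967...

# Assumption: 255-byte code words with 223 data bytes each.
enc_words, enc_rem = divmod(encoded, 255)
dec_words, dec_rem = divmod(restored, 223)
print(enc_words, enc_rem)   # 18192720 0
print(dec_words, dec_rem)   # 18192720 0
print(orig % 223)           # non-zero: the original is not a whole number of 223-byte units
```

The encoded and decoded sizes are exact multiples of 255 and 223 with the
same quotient, while the original size is not a multiple of 223.  That is
consistent with the input being padded up to whole blocks on encoding and
the padding not being stripped on decoding, but it does not prove it.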

In any case, bunzip2 was able to handle the crud on the end, but this
would have been a problem for other binary files.
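One possible workaround, which was not part of the tests above, is to
record the length of the file before encoding and trim the decoded output
back to that length.  A minimal Python sketch follows; the function name
trim_to_length is hypothetical and nothing here is an rsbep feature.

```python
import os

def trim_to_length(path: str, length: int) -> None:
    """Cut a decoded file back to the length recorded before encoding."""
    size = os.path.getsize(path)
    if size < length:
        # Trimming can only remove trailing padding, never restore data.
        raise ValueError(f"{path} is {size} bytes, shorter than {length}")
    os.truncate(path, length)

# Example use, with the length saved alongside the archive:
#   length = os.path.getsize("img.bz2.orig")     # before encoding
#   trim_to_length("img.bz2.restored2", length)  # after decoding
```

Trimming only fixes the length.  A checksum of the original, stored with
the archive, is still needed to confirm that the contents were recovered.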

The other thing that is frankly bizarre is the number of "corrected"
failures for the 2nd case vs. the first.  The 2nd should have 10X
fewer bad bytes than the first, but the rsbep status messages
report 4.73X MORE.  However, the count for the 2nd is almost exactly
1% of the encoded file, as it should be.  All of this suggests that
rsbep does not correctly handle files which are "too" corrupted.  It
reports the wrong number of corrected failures and thinks that it has
corrected everything when it has not.  Worse, even when it does work,
the output file was never (in any of the test cases) the same size as
the input file.
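The ratios quoted above can be rechecked from the reported counts and the
encoded file size.  This Python sketch assumes each "corrected failure" is
one byte, which the rsbep messages do not state.

```python
encoded_size = 4639143600   # img.bz2.rsbep, in bytes

# "number of corrected failures" reported by rsbep -d -v in each run
corrected_run1 = 9725096    # -maxgap 100000, expected about 10% damage
corrected_run2 = 46025036   # -maxgap 1000000, expected about 1% damage

frac1 = corrected_run1 / encoded_size
frac2 = corrected_run2 / encoded_size
ratio = corrected_run2 / corrected_run1

print(f"run 1: {frac1:.2%} of the encoded file")
print(f"run 2: {frac2:.2%} of the encoded file")
print(f"run 2 / run 1 = {ratio:.2f}")
```

The second run comes out at about 0.99% of the encoded file, in line with
the damage applied.  The first comes out at about 0.21%, far below the
roughly 10% that was applied, which is what points to the counter being
wrong for the heavily damaged file.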

I think this program has potential but it needs a bit of work to sand
the rough edges off.  I will have a look at it, but won't have a chance
to do so for a couple of weeks.

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech



