Saturday, February 24, 2007

Bad jpegs

Over the last few months, when we browse photos on babybaer, which mounts a volume from chef over NFS, we occasionally see odd effects that look like errors in the jpeg encoding: as if a small piece of data were missing, causing a color shift in part of the image and a slight offset of everything after the error. Like this:



chef has been logging an error "nfsd send error 55" for the longest time, with no apparent ill effects. Is this maybe related and we only noticed it now? Are things getting worse?

A Google search turned up no answer as to where the error on chef comes from in the first place. Error 55 is ENOBUFS, i.e. no buffer space available, but I couldn't find any hints *why* that might be happening.
All I could find are complaints from other people who run Linux clients against OpenBSD servers about that error filling up the logs. It also appears to happen on NetBSD, which would make sense given how much code NetBSD and OpenBSD share. Increasing NMBCLUSTERS might help a bit, but it is no definitive solution. Turning off CBQ might help, but I don't run CBQ (or any firewalling) on chef anymore. Someone suggested a bad interaction with cheapo RealTek cards (hey, look at that, chef is using a RealTek card). I switched to another interface (dc0); let's see if it makes a difference. Of course, I have no way of knowing for sure any time soon, since the error only happens when browsing hundreds of jpegs over NFS.

Update:
I took a look at the timestamps of the nfsd error 55 messages and noticed they don't line up with when we see bad images. Today there was only one error logged this morning, but Patricia got a couple bad images this afternoon. Lovely. This doesn't help.

Update:
At this point I'm not even sure how to quantify the problem. Let's see if this is an issue with Linux NFS talking to OpenBSD, or something else. Let's generate an md5sum list of a large body of data (my jpegs), once straight from disk on chef and once via NFS on babybaer.
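A minimal sketch of the comparison step, assuming both lists end up in the md5sum layout (hash, blanks, path) and use the same relative paths. The function name `compare_lists` and the list file names are hypothetical, not something from the original setup.

```shell
# compare_lists LIST_A LIST_B
# Print every path whose checksum differs between the two lists,
# plus paths that appear in only one of them.
compare_lists() {
    awk '
        {
            # Split each line at the first run of blanks so that
            # paths containing spaces survive intact.
            hash = $1
            path = $0
            sub(/^ *[^ ]+ +/, "", path)
        }
        NR == FNR { want[path] = hash; next }   # first file: remember hashes
        {
            seen[path] = 1
            if (!(path in want))
                print "only in second list: " path
            else if (want[path] != hash)
                print "MISMATCH: " path
        }
        END {
            for (p in want)
                if (!(p in seen))
                    print "only in first list: " p
        }
    ' "$1" "$2"
}
```

Usage would be along the lines of `compare_lists list-from-disk list-via-nfs`. No output means the two lists agree.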

Update:
This is really easy in Linux:
find . | grep -i 'jpg$' | xargs -I D md5sum "D" > md5-linux

And really weird in OpenBSD (mostly because the output of md5 is so convoluted):
find . | grep -i 'jpg$' | xargs -I D md5 "D" | awk -F '[)(=]' '{print $4 $3 $2}' > md5-openbsd

awk seems to be doing some fancy buffering (?) when piping output to another program (like sed) or just a file. I just don't get any output whatsoever. Hmmm.
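For what it's worth, awk normally block-buffers its output when stdout is a pipe or a file rather than a terminal, so nothing shows up until the buffer fills or awk exits. The following is a sketch only, assuming the md5 output looks like `MD5 (path) = hash` and that the local awk supports `fflush()` (gawk, mawk and the BSD awk do). The function name `bsd_to_md5sum` is made up for illustration.

```shell
# bsd_to_md5sum
# Read lines of the form "MD5 (path) = hash" on stdin and print them
# in md5sum layout: "hash  path". Flushes after every line so output
# shows up immediately even when piped to another program or a file.
bsd_to_md5sum() {
    awk '
        /^MD5 \(.*\) = [0-9a-f]+$/ {
            hash = $NF                        # last field is the digest
            name = $0
            sub(/^MD5 \(/, "", name)          # strip the leading "MD5 ("
            sub(/\) = [0-9a-f]+$/, "", name)  # strip the trailing ") = hash"
            print hash "  " name
            fflush()
        }'
}
```

It would slot in where the awk call sits above, e.g. `... | xargs -I D md5 "D" | bsd_to_md5sum > md5-openbsd`. Stripping the prefix and suffix instead of splitting on parentheses keeps file names that themselves contain `(`, `)` or `=` intact.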

...

Oh great, while I was experimenting with that, chef decided to say goodbye and hung itself. No reaction on the console. It just hangs. That sucks.

Ctrl-Alt-Esc doesn't work. But Shift-PgUp/PgDn does. Surprise. The box is not on the network, not echoing characters. *sigh* Looks like another power cycle. Bah.

Update:
Rebooted chef last night. I reproduced the jpeg problem again this afternoon on two images, so it's not a RealTek NIC driver issue. There are no logs on chef indicating an error, nor on babybaer. It looks like random data corruption, but that doesn't make sense. ... Wait. There is a SMART raw read error on chef, followed by ecc_hardware corrected. Could this be related? Rebooting babybaer to rule out any impact of the local file system cache.

Update:
Nope. The root of the problem is on chef. I reproduced bad jpeg copies locally by simply cp'ing a directory of jpegs and running md5sum on the files. One recent directory (02-25) seems to be particularly susceptible to this problem. Of the 8 files I copied in several attempts, up to 7 had different md5sums. I hexdumped a few files, and the differences always start on 4 kByte page boundaries (at 4 kByte, 12 kByte).
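The page-boundary check can be done without reading hexdumps by eye. A rough sketch using `cmp -l`, which lists the 1-based position of every differing byte; the 4096-byte page size is an assumption, and the function name `diff_pages` is hypothetical.

```shell
# diff_pages FILE_A FILE_B [PAGE_SIZE]
# For every page that contains differing bytes, print the page number,
# the offset of the first difference, how far into the page that is,
# and how many bytes differ in that page.
diff_pages() {
    cmp -l "$1" "$2" | awk -v page="${3:-4096}" '
        { off = $1 - 1; p = int(off / page) }        # cmp counts from 1
        !(p in n) { order[++k] = p; first[p] = off } # first hit in this page
        { n[p]++ }
        END {
            for (i = 1; i <= k; i++) {
                p = order[i]
                printf "page %d starts differing at offset %d (%d bytes into the page), %d byte(s) differ\n", p, first[p], first[p] % page, n[p]
            }
        }'
}
```

Running `diff_pages original.jpg copy.jpg` on a good and a bad copy should then show "0 bytes into the page" for every damaged page if the corruption really is page-aligned.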
That seems to indicate either a file system problem (but the filesystem checks clean), or a RAIDframe volume problem, maybe due to issues with the underlying disk sectors (possible), a memory problem (but then I would expect to see more random failures), or a CPU problem (ditto).
I'm still uncomfortable with the raw read error/ecc corrected counters reported by smartd.
