Data stored in DRAM sometimes flips bits. A study of Google servers
found that each gigabyte of DRAM flips a bit every few hours on average:
https://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf
Servers typically protect against DRAM errors using "SECDED ECC DRAM",
which is slightly more expensive than conventional DRAM. "SECDED" means
"single error correction, double error detection". Each 64-bit chunk of
data is stored in 72 physical bits; the extra 8 bits are used for
checksums that allow any single bit flip to be automatically corrected
and any double bit flip to be detected. Operating systems report
"correctable errors" ("CE") and "uncorrectable errors" ("UE") to warn
system administrators to replace failing DRAM.
However, typical desktops, laptops, and smartphones use conventional
DRAM, with no protection against bit flips. Usually a bit flip doesn't
matter (would you notice a pixel occasionally changing in a movie?), but
sometimes it does matter.
libsecded is a software library aimed at programmers who would like to
reduce the risk of data corruption. The library stores an array as a
slightly longer array, using extra bits to automatically correct single
errors and detect double errors. The API is straightforward (with
lengths represented as long long, not size_t):
   secded_encode(x,&xlen): encode bytes x[0...xlen-1] as bytes
   x[0...newxlen-1], changing xlen to newxlen, which is at most xlen+64

   secded_decode(x,&xlen): change x[...] and xlen back to what they
   were, while correcting any single error
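For illustration, here is a minimal usage sketch in C. The header name
secded.h, the unsigned char pointer type, and the printed messages are
assumptions, not part of the documented API above; the allocation simply
reflects the "at most xlen+64" bound:

   #include <stdio.h>
   #include <stdlib.h>
   #include <string.h>
   #include "secded.h" /* assumed header name; check the distribution */

   int main(void)
   {
     const char *msg = "hello, world";
     long long xlen = strlen(msg);

     /* allocate room for the encoded form: at most xlen+64 bytes */
     unsigned char *x = malloc(xlen + 64);
     if (!x) return 1;
     memcpy(x, msg, xlen);

     secded_encode(x, &xlen);       /* xlen grows to the encoded length */

     /* ... data sits in DRAM; a bit might flip here ... */

     unsigned int result = secded_decode(x, &xlen); /* xlen shrinks back */
     if (result == 0)
       printf("no errors\n");
     else if (result <= 255)
       printf("corrected error (code %u)\n", result);
     else
       printf("uncorrectable error (code %u)\n", result);

     printf("decoded: %.*s\n", (int) xlen, (const char *) x);
     free(x);
     return 0;
   }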
Notes follow on optional aspects of the API.
There's a secded_clean(x,xlen) that you can use to correct errors in
encoded data while leaving the data in encoded form. This is essentially
the same as secded_decode(x,&xlen) followed by secded_encode(x,&xlen),
but simpler and faster. (Periodically sweeping through encoded data to
correct errors is typically called "scrubbing" for SECDED ECC DRAM.)
Both secded_decode() and secded_clean() return an unsigned int, namely 0
for no errors, 1 through 255 for correctable errors, 256 through 65535
for uncorrectable errors. This lets you measure how often problems occur
and decide what action to take.
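As a sketch of how these return codes might drive a periodic scrub (the
function name is hypothetical; the pointer type and header name are
assumed as above):

   #include "secded.h" /* assumed header name */

   /* Hypothetical scrub pass over data kept in encoded form;
      x[0...xlen-1] is assumed to have come from secded_encode(). */
   unsigned int scrub(unsigned char *x, long long xlen)
   {
     unsigned int result = secded_clean(x, xlen);
     if (result == 0) {
       /* 0: no errors found */
     } else if (result <= 255) {
       /* 1...255: a correctable error was fixed in place; worth logging */
     } else {
       /* 256...65535: uncorrectable; restore from a backup copy */
     }
     return result;
   }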
Some values of xlen are never produced by secded_encode(). For example,
xlen=120 turns into newxlen=128, xlen=121 turns into newxlen=130, and
nothing turns into newxlen=129. In case you're worried about what
happens with invalid lengths such as 129:
* Invalid lengths are accepted by decode(), essentially as if they
were truncated down to the next valid length, so they won't crash
your program.
* If you're worried about data from an untrustworthy source, you
shouldn't think that rejecting invalid lengths is a solution. The
source will still be able to produce arbitrary results from decode().
* If you're trying to recognize repeated events, you shouldn't be
merely checking for repeated _encodings_ of those events. Even
outside the error-correction context, it's usually wrong to assume
that only one string can decode to a particular piece of data.
* If for some reason you want to recognize invalid lengths, one way is
to see that decode() followed by encode() _doesn't_ produce the
original length. You can pass a 0 pointer to decode() and encode()
to just do length computations without touching the array. You can
also more directly recognize the invalid lengths (1, 2, 3, 5, 9,
17, 33, etc.) as nonzero xlen such that (xlen-1)&(xlen-2) == 0.
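A sketch of both checks from the last item (the helper names are
hypothetical; the bit test and the 0-pointer length-only calls are as
described above):

   #include "secded.h" /* assumed header name */

   /* Direct test: nonzero xlen is an invalid encoded length exactly when
      (xlen-1)&(xlen-2) == 0, i.e., xlen is 1, 2, 3, 5, 9, 17, 33, ... */
   int invalid_length_direct(long long xlen)
   {
     return xlen != 0 && ((xlen - 1) & (xlen - 2)) == 0;
   }

   /* Round-trip test: with a 0 pointer, decode() and encode() only
      compute lengths; an invalid length doesn't come back unchanged. */
   int invalid_length_roundtrip(long long xlen)
   {
     long long len = xlen;
     secded_decode(0, &len);
     secded_encode(0, &len);
     return len != xlen;
   }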
The software is designed to work for array lengths as large as an
exabyte (2^60 bytes). However, obviously bugs for large array sizes
could have slipped past tests for smaller array sizes, and being able to
correct just one error is insufficient for large arrays. Large arrays
should be split into smaller arrays with separate error correction. For
even higher reliability, SECDED should be replaced with multiple-error
correction.
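As a sketch of that splitting suggestion (the chunk size and function
name are hypothetical; the prototypes are assumed as above):

   #include <string.h>
   #include "secded.h" /* assumed header name */

   #define CHUNK 4096 /* hypothetical chunk size; tune per application */

   /* Encode a large buffer as independently protected chunks, so a
      single bit flip in each chunk remains correctable. 'out' must have
      room for roughly (inlen/CHUNK + 1) * (CHUNK + 64) bytes. */
   void encode_chunks(const unsigned char *in, long long inlen,
                      unsigned char *out, long long *outlen)
   {
     long long total = 0;
     while (inlen > 0) {
       long long piece = inlen < CHUNK ? inlen : CHUNK;
       long long enclen = piece;
       memcpy(out + total, in, piece);
       secded_encode(out + total, &enclen); /* per-chunk check bits */
       in += piece;
       inlen -= piece;
       total += enclen;
     }
     *outlen = total;
   }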