SECDED uses a "distance-4 error-correcting code". The typical choice of code is an "extended Hamming code". libsecded encodes an n-byte array using an extended Hamming code on the bottom bit of each byte, in parallel an extended Hamming code on the next bit of each byte, etc.

This parallel handling of bits makes the libsecded encoding slightly less space-efficient than applying an extended Hamming code to all bits together. For example, 256 bytes are encoded as 266 bytes, where they could instead be encoded as 258 bytes. However, this space difference is minor. Parallel handling of bits has the advantage of avoiding annoying data indexing inside the software.

Internally, encode() and decode() and clean() are built from the following subroutines:

* expand() inserts bytes with value 0 into the array at the positions that extended Hamming codes reserve for parity (positions 0, 1, 2, 4, 8, 16, etc.), which simplifies encoding and decoding.

* shrink() removes the extra positions.

* parity() computes parity bytes for an expanded array, storing them in a separate 64-byte array. Encoding could avoid this 64-byte array by storing results directly into x (and computing the overall parity after storing the other parity bytes would skip a logarithmic number of xors in fill()), but a separate parity() function is useful because it is shared with decoding. A smaller array ("bits+1" bytes) would suffice, but 64 bytes are a minor cost.

* fill() puts parity bytes at the right positions in the expanded array.

* correct() uses parity bytes to correct and detect errors in an expanded array.

The details avoid the variable array indexing that would appear in a conventional Hamming decoder; this is intended to assist in some types of automated verification of software correctness, although this verification has not happened yet. The Python software follows the same structure, and test2.py includes tests of Python against C not just for encode() and decode() and clean() but also for the subroutines.

On a Haswell with gcc 7.5, the software runs at

* 0.82 cycles/byte for encoding a megabyte,

* 1.15 cycles/byte for decoding a megabyte, and

* 1.02 cycles/byte for cleaning a megabyte.

These speeds seem unlikely to be a problem in applications, but higher speeds are possible with more work. The following paragraphs explain where the time is going and what can be sped up.

expand() and shrink() involve a logarithmic number of calls to memmove(), with the total number of bytes moved being linear in the array size. fill() is a logarithmic number of byte copies and xors. The main issues are the byte operations in parity() and correct().

The main job of parity() is to compute a logarithmic number of sums of halves of the array where indices have specific bits set---

   p[0] = x[1]^x[3]^x[5]^x[7]^x[9]^x[11]^x[13]^x[15]^...
   p[1] = x[2]^x[3]^x[6]^x[7]^x[10]^x[11]^x[14]^x[15]^...
   p[2] = x[4]^x[5]^x[6]^x[7]^x[12]^x[13]^x[14]^x[15]^...
   ...
   p[bits-1] = ...

---along with

   p[bits] = x[0]^x[1]^x[2]^x[3]^x[4]^x[5]^x[6]^x[7]^...

This involves Theta(n log n) operations as written (and as implemented straightforwardly in parity.py), but is easily improved to Theta(n) by common-subexpression elimination. As an example of such elimination, Theta(n) operations suffice to produce the 32 values

   x[0]^x[32]^x[64]^x[96]^...
   x[1]^x[33]^x[65]^x[97]^...
   ...
   x[31]^x[63]^x[95]^x[127]^...

and then the values p[0],p[1],p[2],p[3],p[4] and p[bits] are all easily extracted with just a few more xors.
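To make this elimination concrete, here is a minimal Python sketch (hypothetical code, in the spirit of the Python software but not taken from parity.py), assuming x is the expanded byte array and bits is the number of index bits:

   def parity_low_bits_sketch(x, bits):
       # Theta(n) pass: acc[j] = x[j]^x[j+32]^x[j+64]^... for j = 0..31.
       acc = [0] * 32
       for i, xi in enumerate(x):
           acc[i % 32] ^= xi
       # A few more xors extract p[0],...,p[4] and p[bits];
       # p[5],...,p[bits-1] come from other groupings.
       p = [0] * (bits + 1)
       for j in range(32):
           for k in range(5):
               if (j >> k) & 1:
                   p[k] ^= acc[j]
           p[bits] ^= acc[j]
       return p

Each byte xor here handles all 8 bit positions at once, matching the bitwise parallelism described above.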
As another example, Theta(n) operations suffice to produce the sums

   x[0]^x[1]^...^x[1023]
   x[1024]^x[1025]^...^x[2047]
   ...

which in turn determine the values p[10],p[11],...,p[bits-1].

All of this is easily vectorizable using, e.g., 256-bit vectors. Even 32-bit integers suffice to make the computations run much more quickly than working separately with each 8-bit byte. Care is required on CPUs requiring _aligned_ memory access: the input array is not necessarily aligned. It is probably worthwhile to move unaligned input arrays to aligned positions, either in the caller or within parity().

The current parity.c is designed to be portable C software, but, subject to that constraint, organizes the data flow so that it's reasonable to imagine a compiler figuring out the desired vectorization. The assembly produced by gcc has some vectorization, although it can be improved.

For error correction, here's the usual bitwise extended-Hamming story. If there's no error then the parity bits p[0],p[1],...,p[bits-1],p[bits] are all 0. If there's a single error then p[0],p[1],...,p[bits-1] show the index of the error, while p[bits] is 1. If there's a double error then the parity bits p[0],p[1],...,p[bits-1] are nonzero (the xor of the two error indices), while p[bits] is 0.

The obvious logarithmic-time way to implement correct() would be to extract the error index for the bottom bit of each byte, extract the error index for the next bit of each byte, etc. However, as noted above, correct() avoids variable array indexing. The indexing is instead simulated with logical operations on each byte of the array. The data flow here is essentially the transpose of the parity() data flow, and correct() is again written in a way that's meant to help the compiler figure out the necessary vectorization.

There's also a logarithmic cost to compute the sec and ded flags returned to the caller (as sec+256*ded). This could be streamlined as follows: skip the "borrow" computation in the software, set "sec" to "overall", and set "ded" to "anyp&~overall". However, this would allow some _triple_ errors that would otherwise be caught (and reported via "ded") to pose as single errors: namely, errors creating out-of-bounds indices for array lengths that require truncation of the Hamming code.

The choice of parallelizing across bits of 8-bit words can obviously be generalized to parallelizing across bits of b-bit words for any b. Different values of b are interoperable in the sense that interleaving (b/8) 8-bit-encoded arrays would produce a b-bit-encoded array. Changing b from 8 to larger powers of 2 would simplify vectorized software in various ways. However, there would be some extra complications for handling bytes at the end if the input is not guaranteed to have an appropriate size, and some extra complications for alignment if the input is not guaranteed to have whatever alignment the CPU needs.
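For illustration, here is a minimal Python sketch of the error-correction rule described above, written in the "obvious" indexed style that correct() deliberately avoids (hypothetical code, not taken from libsecded); p holds the parity bits p[0],...,p[bits] for one bit position:

   def classify_sketch(p, bits):
       # Reassemble the candidate error index from the syndrome bits.
       index = sum(p[j] << j for j in range(bits))
       if p[bits] == 1:
           # Overall parity flipped: single error at this index (correctable).
           return ("single", index)
       if index != 0:
           # Overall parity intact but syndrome nonzero: double error (detected only).
           return ("double", None)
       # All parity bits zero: no error.
       return ("none", None)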