After reading your comments, that sounds more reasonable. I just wasn't sure whether you were intending to encode megabytes of data like this.
I'd recommend, along the lines of Oliver's suggestion, that you increase your data density by borrowing a page from Bacon's cipher. Prison gangs often use it to hide messages in missives written in two different script styles, usually upper vs. lowercase characters or print vs. cursive characters, e.g.
Hey mOM, WHAT's FOR diNNeR TODAY? = ABBBA AAAAA BAAAB BAABA AAAAA
= P A S T A
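For illustration, here's a rough Python sketch of decoding that case channel (the function name and mapping are mine; uppercase is treated as Bacon's "A" form and lowercase as "B", which is the convention the example above follows, using his classic 24-letter alphabet):

```python
# Sketch: decode a Bacon-style hidden message from letter case.
# Uppercase letters map to "A", lowercase to "B"; non-letters are ignored.
ALPHABET = "ABCDEFGHIKLMNOPQRSTUWXYZ"  # Bacon's 24 letters: I/J and U/V share a code

def bacon_decode(cover_text: str) -> str:
    bits = ["A" if c.isupper() else "B" for c in cover_text if c.isalpha()]
    groups = ["".join(bits[i:i + 5]) for i in range(0, len(bits) - len(bits) % 5, 5)]
    return "".join(ALPHABET[int(g.replace("A", "0").replace("B", "1"), 2)] for g in groups)

print(bacon_decode("Hey mOM, WHAT's FOR diNNeR TODAY?"))  # -> PASTA
```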
However, since your goal isn't steganography, you would simply use this to expand your glyph set. Doing this, you could have up to 114 glyphs just using print & cursive alphanumeric characters (26 letters x 2 cases x 2 scripts + 10 digits), or 12996 code points (114^2) using dual-character encoding.
However, any glyph count greater than 15 and less than 256 is essentially the same for a straight cipher of binary data (you still need 2 characters to represent each byte, i.e. a density of 4 bits per character), so you can put the extra 98 glyphs / 12740 code points toward error detection/correction.
Ways to do this include:
- Choose a set of the 256 easiest-to-read/write character combos. If any other combo occurs, you know it's a copying error.
- Use two versions of the end character as a parity bit.
- Create ~50 different sets of 256 code points each (the 12996 dual-character code points hold 12996 / 256 ≈ 50 complete byte alphabets). Which set each byte is written with then becomes a side channel for error-correction data (see the sketch after this list).
E.g. {set 1}{set 1} means the next 3 nibbles equal 0x000, {set 1}{set 2} equals 0x001, etc.
Two consecutive bytes' set choices give 50 x 50 = 2500 combinations, covering 2500 of the 4096 possible 1.5-byte values. Similarly, with just 16 sets, each pair of encoded bytes can spell out a complete copy of the byte that follows it, giving a full redundant copy of every other byte without increasing your encoded data length.
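To make the set-choice side channel concrete, here's a rough Python sketch assuming 16 interchangeable 256-code-point sets built from a stand-in pool of 64 ASCII glyphs (ASCII has no cursive forms, so the glyph pool, function names, and pairing rule are my own illustrations, not part of your scheme):

```python
# Sketch of the "set choice as a side channel" idea, using 16 interchangeable
# 256-code-point alphabets built from a 64-glyph pool (16 * 256 = 4096 = 64 * 64
# two-character combos). The pairing rule below is just one way to spend the bits.
import string

GLYPHS = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"  # 64 glyphs

def encode_byte(value: int, alphabet: int) -> str:
    """Write one byte as a 2-character code point from the chosen alphabet (0-15)."""
    code_point = alphabet * 256 + value              # 0..4095
    return GLYPHS[code_point // 64] + GLYPHS[code_point % 64]

def decode_pair(pair: str):
    """Recover (byte value, alphabet index) from a 2-character code point."""
    code_point = GLYPHS.index(pair[0]) * 64 + GLYPHS.index(pair[1])
    return code_point % 256, code_point // 256

def encode(data: bytes) -> str:
    out = []
    for i, b in enumerate(data):
        # Side channel: the alphabet choices of bytes 2k and 2k+1 spell out
        # the high and low nibble of the byte that follows the pair (if any).
        follower = data[(i // 2) * 2 + 2] if (i // 2) * 2 + 2 < len(data) else 0
        alphabet = (follower >> 4) if i % 2 == 0 else (follower & 0x0F)
        out.append(encode_byte(b, alphabet))
    return " ".join(out)

encoded = encode(b"\x12\x34\xAB\xCD\xEF")
decoded = [decode_pair(p) for p in encoded.split()]
print(bytes(v for v, _ in decoded).hex())          # 1234abcdef -- the data itself
print(hex(decoded[0][1] << 4 | decoded[1][1]))     # 0xab -- redundant copy of byte 2
```

The same machinery works for the 50-set variant; there you have roughly log2(50) ≈ 5.6 bits of side channel per byte and spend them on check values instead of full copies.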
Alternatively, you could use the extra glyphs for additional compression:
- Implement variable-width encoding by using the 98 spare glyphs as single-character code points for the most common byte values. This would reduce the average encoded content size by about 20% (see the sketch after this list).
- Implement something similar to run-length encoding by using different glyph sets or glyph set combinations to represent repeating nibbles/bytes. E.g.
Ab = aba; aB = abab; AB = ababab...
- Use the extra glyphs or code points to represent "words" and "phrases" that repeat in your data. Pre-compressed data will likely have high entropy, though, so I don't know how effective this would be.
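Here's a minimal sketch of the variable-width idea, with one caveat: once the 16 nibble glyphs are set aside, plain ASCII only has 78 spare printable characters rather than the 98 you'd get with cursive forms, so the numbers fall slightly short, but the mechanics are identical.

```python
# Sketch of variable-width encoding: keep 16 "nibble" glyphs for the two-character
# fallback, and spend the leftover glyphs as one-character shortcuts for the most
# frequent byte values. Shortcut glyphs never collide with the nibble glyphs, so
# decoding stays unambiguous.
from collections import Counter
import string

NIBBLE = "0123456789ABCDEF"
SPARE = [c for c in string.digits + string.ascii_letters + string.punctuation
         if c not in NIBBLE]                       # 78 leftover ASCII glyphs

def build_codebook(data: bytes) -> dict:
    """Map the most common byte values to single glyphs, the rest to hex pairs."""
    book = {value: NIBBLE[value >> 4] + NIBBLE[value & 0x0F] for value in range(256)}
    for glyph, (value, _) in zip(SPARE, Counter(data).most_common(len(SPARE))):
        book[value] = glyph
    return book

data = bytes(range(256)) * 4                       # stand-in payload
book = build_codebook(data)
encoded = "".join(book[b] for b in data)
print(len(encoded) / len(data))                    # chars per byte: ~1.7 here vs. 2.0
```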
To further reduce copying errors, I would display the encoded content in a grid and have it copied onto graph paper. If you can use custom stationery with alternating column/row colors, or a chessboard-style checkered grid with lettered columns & numbered rows for quick look-ups, that would further increase copying accuracy.
You can also combine an alternating grid layout with alternating character styles as an easy form of error detection. E.g. if odd columns are always capitalized and the transcriber finds themselves writing lowercase letters in an odd column, they know they've made an error and can backtrack to see where it happened.
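As a rough illustration of that layout, the sketch below prints hex-encoded content with lettered columns, numbered rows, and the odd-columns-capitalized convention (hex digits read the same in either case, so the styling costs nothing; the formatting choices are mine):

```python
# Sketch of the lettered-grid layout with an alternating-style check column:
# odd columns (A, C, E, ...) uppercase, even columns lowercase.
import string

def to_grid(payload: bytes, cols: int = 16) -> str:
    text = payload.hex()
    header = "    " + " ".join(string.ascii_uppercase[:cols])
    lines = [header]
    for row, start in enumerate(range(0, len(text), cols), start=1):
        chunk = text[start:start + cols].ljust(cols)
        styled = [c.upper() if col % 2 == 0 else c.lower()   # columns A, C, E... uppercase
                  for col, c in enumerate(chunk)]
        lines.append(f"{row:>3} " + " ".join(styled))
    return "\n".join(lines)

print(to_grid(bytes(range(48))))
```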
Though if your main priority is accuracy, I would use a binary encoding + Hamming code. Using a (12, 8) shortened Hamming code on standard graph paper, you might only fit 187 bytes per sheet, encoding just 124 bytes of data, but it could be transcribed very quickly (a slash for 1, nothing for 0) and provides single error correction. Tacking on an extra parity bit (13, 8) gives you SECDED (single error correction, double error detection). Using a standard Hamming code like (15, 11) or (31, 26), you get even better efficiency: 137 and 156 bytes of data per sheet, respectively. Even higher code rates can be achieved, depending on how accurate you think your transcriber can be.
A binary encoding would also be easier to read (aloud) and OCR/OMR.
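Here's a minimal sketch of that (13, 8) SECDED scheme, assuming the usual bit layout with parity bits at positions 1, 2, 4, and 8 and an overall parity bit appended (the helper names are mine):

```python
# Sketch of (13, 8) SECDED: a (12, 8) shortened Hamming code with parity bits
# at positions 1, 2, 4, 8, plus one overall parity bit at position 13.
DATA_POS = [3, 5, 6, 7, 9, 10, 11, 12]       # non-power-of-two positions carry data
PARITY_POS = [1, 2, 4, 8]

def encode(byte):
    word = [0] * 14                           # index 0 unused; 1..13 are code bits
    for i, pos in enumerate(DATA_POS):
        word[pos] = (byte >> i) & 1
    for p in PARITY_POS:                      # each parity bit covers positions sharing its bit
        word[p] = sum(word[pos] for pos in range(1, 13) if pos & p and pos != p) % 2
    word[13] = sum(word[1:13]) % 2            # overall parity -> double-error detection
    return word[1:]                           # 13 bits to write out (slash = 1, blank = 0)

def decode(bits):
    word = [0] + list(bits)
    syndrome = sum(p for p in PARITY_POS
                   if sum(word[pos] for pos in range(1, 13) if pos & p) % 2)
    overall_ok = sum(word[1:14]) % 2 == 0
    if syndrome and overall_ok:
        return None                           # two errors: detected but not correctable
    if 0 < syndrome <= 12:
        word[syndrome] ^= 1                   # single error: the syndrome is its position
    return sum(word[pos] << i for i, pos in enumerate(DATA_POS))

bits = encode(0xA7)
bits[5] ^= 1                                  # simulate one transcription slip
print(hex(decode(bits)))                      # 0xa7
```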