This document describes the encoding algorithms referenced by the BGFA format. It covers integer encoding, string encoding, and specialized encodings used in block payloads.
- String Encoding Overview
- Integer Encoding Algorithms
- String Encoding Algorithms
- CIGAR 4-byte Strategy Code Details
- Run-Length Bits Encoding
A set of strings is embedded within a single "superstring" constructed by a heuristic that tries to
minimize the overall length (e.g., by overlapping common suffixes/prefixes). The metadata stores the start and end position of each string within the superstring. To recover string i, the decoder reads bytes from start[i] to end[i] (exclusive, following Python slice conventions).
Note: Strings are never concatenated directly; all string sets are encoded via this superstring approach.
The superstring is then encoded using the string encoding strategy.
When encoding a list of strings (the strings type), the encoding strategy is specified by a 2-byte code (see Strategy Code Format Summary in the main spec):
- First byte: integer encoding strategy for the position metadata
- Second byte: string encoding strategy for the superstring blob
The start and end positions are 0-based, the final position is excluded from the substring, following Python slice conventions.
Decoding a set of strings does not need to know the heuristic used to compute the superstring.
The payload of the encoded strings consists of:
-
Metadata: A list of numbers, encoded according to the first-byte strategy (varint, fixed16, etc). This list contains first all start positions, then all end positions of the strings.
The first-byte strategy determines how this list of numbers is encoded. Start and end positions are encoded as two separate contiguous lists: first all start positions, then all end positions. Both lists use the same integer encoding strategy.
-
Blob: The superstring is encoded according to the second-byte strategy.
Layout:
[start_0][start_1]...[start_n-1][end_0][end_1]...[end_n-1][blob]
All start positions are encoded first (as a contiguous list), followed by all end positions (as a contiguous list). Both lists use the same integer encoding strategy specified in the first byte of the compression code.
The compressed_*_len fields in block headers represent the total number of bytes for the encoded payload of that field, including both the integer metadata list and the compressed blob.
The uncompressed_len field is the sum of the lengths of the original strings (before any encoding).
A list of signed integers is encoded by two parts:
- the signs of all integers
- the absolute values of all integers (unsigned ints). The encoding strategy applies to this part only.
The signs are considered a sequence of bits, where + is encoded as 0 and - is encoded with 1. The i-th bit in the sign bitstream corresponds to the sign of the i-th integer in the list. Zero values are treated as positive (sign bit 0). The encoding of a sequence of bits is described at Run-Length Bits Encoding.
The encodings of unsigned ints are the following.
Variable-length integer encoding optimized for small values.
Format:
- Each byte: 7 data bits + 1 continuation bit (1 = more bytes follow, 0 = last byte)
- Little-endian ordering (least significant group first)
Each integer is encoded as a fixed 16-bit little-endian value.
Elias gamma encoding represents a number n using two parts:
- Unary representation of floor(log2(n)) + 1
- Binary representation of n - 2^floor(log2(n))
Format:
- For value n:
- Write floor(log2(n)) + 1 in unary (that many 1 bits, followed by 0)
- Write n - 2^floor(log2(n)) in binary (floor(log2(n)) bits)
Example: n = 5
- floor(log2(5)) + 1 = 3
- Unary: 1110
- n - 2^2 = 5 - 4 = 1 = 01 (2 bits)
- Complete: 111001
Elias omega starts with 0 and builds up, using the previous code length.
Format:
- For n = 1: output "0"
- For n > 1:
- Write n in binary
- Remove leading 1
- Recursively encode length of remaining bits using omega
- Append original bits
Golomb encoding uses a parameter b (default b=128).
Format:
- For value n:
- quotient q = n // b, remainder r = n % b
- Write q in unary (q ones, then zero)
- Write r in binary using ceil(log2(b)) bits
Default parameter: b = 128
Rice coding is Golomb with b as a power of 2: b = 2^k.
Format:
- Parameter k (0-31) stored as the first byte of the integer payload (before the encoded values)
- For value n:
- quotient q = n >> k
- remainder r = n & (2^k - 1)
- Write q in unary
- Write r in binary using k bits
Example: For k=3 and sequence [5, 12, 7]:
- Parameter byte: 0x03
- Encoded values follow the parameter byte
StreamVByte encodes multiple varints in parallel using SIMD-like packing.
Format:
- Control bytes indicate which varints use 1, 2, 3, or 4 bytes
- Data bytes contain the varints packed together
Variable Byte encoding uses 7 bits per byte for data, with high bit as continuation flag.
Format:
- Each byte: 7 data bits + 1 continuation bit (1 = more bytes follow, 0 = last byte)
- Little-endian ordering (least significant byte first)
Each integer is encoded as a fixed 32-bit little-endian value.
Each integer is encoded as a fixed 64-bit little-endian value.
The following algorithms operate on the superstring blob.
The superstring is stored as raw bytes with no transformation. The string metadata (start/end positions from the integer encoding) defines how to slice individual strings from the superstring.
The superstring is compressed using Zstandard compression.
The superstring is compressed using gzip compression.
The superstring is compressed using LZMA compression.
The Huffman encoding of a string consists of:
| Name | Description | Type |
|---|---|---|
codebook_len |
Length of the encoded codebook in bytes | uint16 |
codebook |
16 little-endian uint16 bit-lengths | bytes |
huffman_encoded |
The encoded string | bits |
Codebook format: The codebook is exactly 32 bytes containing 16 little-endian uint16 values representing the bit-length for each nibble (0x0 through 0xF). Even entries with zero bit-length have a placeholder value of 0 in the codebook.
codebook = struct.unpack('<16H', codebook_bytes)
# codebook[0] = bit-length for nibble 0x0
# codebook[1] = bit-length for nibble 0x1
# ...
# codebook[15] = bit-length for nibble 0xFReconstruction: The decoder MUST reconstruct the Huffman codes from the 16 bit-lengths using canonical Huffman code assignment:
-
Collect all (symbol, bit-length) pairs where bit-length > 0
-
Sort primarily by bit-length (ascending) and secondarily by symbol value (ascending)
-
Assign codes sequentially using the following algorithm:
code = 0 prev_len = 0 huffman_codes = {} for symbol, bit_len in sorted_pairs: code = (code + 1) << (bit_len - prev_len) if prev_len > 0 else 0 huffman_codes[symbol] = (code, bit_len) prev_len = bit_len
-
Zero bit-length symbols are not in the alphabet and will not appear in the encoded data
Example: Given bit-lengths: {0x41: 2, 0x42: 3, 0x43: 3} (symbols 'A', 'B', 'C'):
- Sorted by (bit-length, symbol):
[(0x41, 2), (0x42, 3), (0x43, 3)] - Code assignment:
- 'A' (0x41): bit-length=2, code=0b00 (first, shortest)
- 'B' (0x42): bit-length=3, code=(0b00+1)<<1 = 0b010
- 'C' (0x43): bit-length=3, code=(0b010+1)<<0 = 0b011
Encoding process:
- Convert each input byte to two nibbles (high nibble, low nibble)
- Look up each nibble's code in the reconstructed codebook
- Concatenate all codes into a bitstream
- Pad to byte boundary with zeros
Decoding:
The byte-length of the huffman_encoded data is the total compressed_len of the field minus:
- bytes consumed by the string metadata
- 2-byte
codebook_len - 32 bytes for the codebook
The decoder MUST decode exactly 2 * L symbols, where L is the number of characters in the string being reconstructed.
-
Symbol Extraction: Each byte of the target string (the superstring) is treated as two 4-bit symbols:
- Symbol A: High nibble (bits 4-7)
- Symbol B: Low nibble (bits 0-3)
-
Encoding Order: For every byte, Symbol A is encoded first, followed immediately by Symbol B.
-
Alphabet Size: The codebook always contains 16 entries (representing nibbles 0x0 through 0xF). A bit-length of 0 indicates the nibble is not present.
Format:
uint32: frequency table size (number of symbol-frequency pairs), little-endianbytes: frequency table (symbol:count pairs)bytes: arithmetic encoded data
The frequency table contains ONLY symbols with non-zero frequency. Each entry is a pair of:
symbol: 1 byte (ASCII value)frequency:uint32little-endian (count of symbol occurrences)
The frequency table size field indicates the number of pairs in the table. Total frequency table size = frequency_table_size * 5 bytes (1 byte symbol + 4 bytes frequency per entry).
Burrows-Wheeler Transform + Huffman coding provides excellent compression for repetitive sequences like DNA. The pipeline is:
- Apply Burrows-Wheeler Transform in fixed 64KB blocks
- Apply Move-to-Front transform
- Encode with Huffman coding
The block size is fixed at 65536 bytes (64KB).
Format:
uint32: number of BWT blocks- For each block:
uint32: primary indexuint32: block sizebytes: BWT-transformed data (MTF-encoded, then Huffman-encoded)
2-bit DNA encoding provides optimal compression for DNA/RNA sequences by encoding each nucleotide in 2 bits instead of 8 bits (75% size reduction). This is the most impactful encoding for pangenome data where sequences typically comprise 70-80% of file content.
Nucleotide Mapping:
- A (or a) = 00
- C (or c) = 01
- G (or g) = 10
- T (or t) = 11
- U (or u) = 11 (RNA uracil treated as thymine)
Bit packing: Nucleotides are packed MSB-first. The first nucleotide is stored in bits 7-6 of the first byte, the second in bits 5-4, the third in bits 3-2, the fourth in bits 1-0. Subsequent nucleotides continue in subsequent bytes.
Example: Sequence "ACGT" (4 nucleotides):
- A=00, C=01, G=10, T=11
- Packed:
0b00011011= 0x1B (1 byte)
Example: Sequence "ACGTA" (5 nucleotides):
- A=00, C=01, G=10, T=11, A=00
- Packed:
0b000110110b00000000= 0x1B 0x00 (2 bytes, last 6 bits are padding)
Note: Unused bits in the final byte MUST be set to 0 when writing and MUST be ignored when reading.
Format:
- 1 byte: flags
- bit 0:
has_exceptions(1 if exception table present, 0 otherwise) - bits 1-7: reserved (set to 0)
- bit 0:
packed_bases: 4 nucleotides per byte (2 bits each), padded with 0s if needed- If the
has_exceptionflag is set:varint: exception countvarintlist: exception positions (0-based indices in original sequence, absolute positions)bytes: exception characters (one byte per exception, in ASCII)
Exception Handling: Ambiguity codes (N, R, Y, K, M, S, W, B, D, H, V, -) and unknown characters are stored in the exception table with their original ASCII values, allowing perfect reconstruction while maintaining compression on standard ACGT sequences.
Expected Compression: 75% reduction on pure DNA/RNA sequences, slightly less with ambiguity codes.
Primary Use Case: Segment sequences, which are typically the largest data component in BGFA files.
Run-Length Encoding efficiently compresses sequences with repeated characters (homopolymers in DNA, repeated operations in other contexts). The implementation uses a hybrid mode that switches between raw and RLE encoding to prevent expansion on non-repetitive data.
Format:
varint: run count (number of runs)- For each run:
- 1 byte: mode (0x00=raw, 0x01=RLE)
varint: run data length- run data:
- If raw mode: raw bytes
- If RLE mode: sequence of
[char: 1 byte][count: varint]pairs
Algorithm:
- Minimum run length: 3 characters (shorter runs use raw encoding)
- Automatically switches between raw and RLE modes within a string to prevent expansion on non-repetitive data
- Run counts encoded as varint for efficiency
Mode selection: A run of 3 or more identical consecutive characters uses RLE mode. Shorter sequences use raw mode.
Expected Compression: 30-50% reduction on sequences with homopolymers or repetitive patterns.
Primary Use Cases:
- DNA sequences with homopolymer runs (AAAAAAA, GGGGGG, TTTTTT)
- Can be combined with 2-bit DNA encoding for additional compression
- General string data with repeated characters
Dictionary encoding is optimized for repetitive string data by replacing repeated strings with short references to a dictionary.
Format (Concatenation mode):
uint32: dictionary size (number of unique strings), little-endianuintslist: dictionary entry offsets (N+1 values, where N = dictionary size)- Offsets are cumulative from the start of the dictionary blob
- The (N+1)-th offset equals the total blob length
- Offsets are encoded using the integer encoding strategy specified in the first byte of the 2-byte compression code
0xHH0A
bytes: concatenated dictionary entries (each entry is a unique string, no terminators)uintslist: indices into dictionary for each input string, encoded using the same integer strategy as offsets
Maximum dictionary size: 65536 entries (configurable)
Expected Compression: 60-90% reduction on highly repetitive data (e.g., sample IDs repeated across thousands of walks).
Primary Use Cases:
- Sample identifiers in walk blocks
- Segment names with common prefixes
- Path names with structural patterns
- Any string list with high repetition
CIGAR strings represent sequence alignments with alternating numbers and operation letters. This encoding exploits the structure of CIGAR strings to achieve better compression than general-purpose methods.
CIGAR Operations:
M = 0x0 (match/mismatch)
I = 0x1 (insertion)
D = 0x2 (deletion)
N = 0x3 (skipped region)
S = 0x4 (soft clipping)
H = 0x5 (hard clipping)
P = 0x6 (padding)
= = 0x7 (sequence match)
X = 0x8 (sequence mismatch)
Format:
varint: number of operationspacked_ops: 2 operations per byte (4 bits each)- High nibble: operation n
- Low nibble: operation n+1
- If odd number of operations, low nibble of last byte = 0xF (padding marker)
varintlist: operation lengths
Special case: The CIGAR value * (indicating no alignment) is encoded as a single byte 0xFF. This allows efficient representation of unaligned segments.
Example: CIGAR string "10M2I5D" is encoded as:
num_ops: 3 (varint)packed_ops: 0x01, 0x2F (operations M=0, I=1, D=2 with padding 0xF)lengths: 10, 2, 5 (varints)
Expected Compression: 40-60% reduction compared to ASCII CIGAR strings.
Primary Use Cases:
- Link overlaps (L lines in GFA)
- Path CIGAR strings (P lines in GFA)
- Any alignment representation using CIGAR format
For CIGAR strings, 4-byte strategy codes are used. The format and decomposition strategies are defined in the Strategy Code Format Summary.
The following decomposition strategies are available:
| Format | Bytes | Strategy |
|---|---|---|
0x00?????? |
[00, ??, ??, ??] |
none (identity) |
0x01?????? |
[01, RR, II, SS] |
numOperations + lengths + operations |
0x02?????? |
[02, 00, II, SS] |
string (treat as plain string) |
For identity (0x00), the CIGAR strings are stored as raw ASCII CIGAR strings separated by newlines (0x0A), with a final newline terminator. The ?? bytes are ignored on read and SHOULD be 0x00 when writing.
Format: 0x01RRIISS
Bytes: [01, RR, II, SS]
Three components are encoded:
- The list of the number of operations in each CIGAR string (
uints) - encoded with strategyII - The list of lengths of the operations (
uints) - encoded with strategyRR - The operations as a packed string - encoded with strategy
SS
Byte assignments:
- First byte (
0x01): Decomposition strategy (numOperations+lengths+operations) - Second byte (
RR): Integer encoding strategy for operation lengths (e.g.,0x01for varint,0x03for LZMA) - Third byte (
II): Integer encoding strategy for operation counts (e.g.,0x01for varint,0x02for fixed16) - Fourth byte (
SS): String encoding strategy for packed operations (e.g.,0x00for none,0x04for Huffman)
Example: Strategy code 0x01020304 = Bytes [0x01, 0x02, 0x03, 0x04] means:
- First byte (
0x01): Use numOperations+lengths+operations decomposition - Second byte (
0x02): Use fixed16 to encode the list of operation counts - Third byte (
0x03): Use LZMA to encode the operation lengths - Fourth byte (
0x04): Use Huffman to encode the packed operations string
Format: 0x0200IISS
Bytes: [02, 00, II, SS]
The CIGAR strings are compressed as a single block, separated by newlines, using the method specified by II and SS.
Byte assignments:
- First byte (
0x02): Decomposition strategy (string) - Second byte (
0x00): Reserved (MUST be0x00) - Third byte (
II): Integer encoding strategy (MUST be0x00for string decomposition) - Fourth byte (
SS): String encoding strategy (e.g.,0x03for LZMA)
Example: Strategy code 0x02000003 = Bytes [0x02, 0x00, 0x00, 0x03] means:
- First byte (
0x02): Use string decomposition - Second byte (
0x00): Reserved - Third byte (
0x00): No integer encoding (string mode) - Fourth byte (
0x03): Use LZMA to compress the newline-separated CIGAR strings
A sequence of bits is run-length encoded, specialized to bits. The encoding is a list of lengths [L(1), L(2), ... , L(k)] where L(1) is the length of the leading zeroes in the sequence. L(1) = 0 if the first bit of the sequence is 1. L(2) is the length of the ones in the sequence immediately after the first L(1) bits, minus 1. L(3) is the length of the zeroes in the sequence immediately after the first L(1) + L(2) + 1 bits, minus 1.
For example the sequence of bits 00011010110101001 is encoded as [3, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0], since the run-lengths are [3, 2, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1].
Since those numbers are positive, they are always encoded as varint.