Fix incorrect snprintf format specifier and undefined bsearch comparator by aaranmcguire · Pull Request #377 · WizardMac/ReadStat

aaranmcguire · 2026-05-05T18:57:03Z

PR 6: Fix incorrect `snprintf` format specifier and undefined comparator behaviour

Summary

Two unrelated but both definitively-wrong code patterns:

%*s should be %.*s — in the SAV long-variable-names parser,
snprintf uses %*s (minimum-width specifier) where %.*s (precision
specifier, which truncates) is intended. Combined with a missing bounds
check on a memcpy target, a malformed SAV file can overflow a 64-byte
stack buffer.
Unsigned subtraction returned as int in dta_compare_strls — the
bsearch comparator for Stata strL references returns
key->o - target->o where both are uint64_t. When the true difference
exceeds INT_MAX, the truncated int has the wrong sign, causing bsearch
to navigate in the wrong direction and potentially return garbage pointers.

Bug 1: `%s` vs `%.s` + `memcpy` overflow (`spss/readstat_sav_parse.c`)

What `%s` and `%.s` actually do

snprintf(dst, dst_size, "%*s", width, src) — * specifies minimum
field width (padding). If src is longer than width, the full src
is copied (up to dst_size). This is NOT truncation.
snprintf(dst, dst_size, "%.*s", prec, src) — .* specifies precision
(maximum characters from src). This IS truncation.

The code intends to copy at most str_len characters of temp_val into
info->longname. %*s does not achieve this — it copies all of temp_val.
%.*s is the correct specifier.

`memcpy` overflow into `temp_val[64]`

The same function also does:

/* str_len is a file-controlled value from the Ragel parser */
memcpy(temp_val, str_start, str_len);
temp_val[str_len] = '\0';

temp_val is declared char temp_val[64+1]. If the Ragel grammar allows
str_len > 64 (which it does — the grammar constrains key characters but
not value length), memcpy overflows temp_val onto the stack.

Vulnerable code

/* spss/readstat_sav_parse.c — two identical blocks */

/* Block 1 (transition actions): */
memcpy(temp_val, str_start, str_len);   /* ← overflows if str_len > 64 */
temp_val[str_len] = '\0';
...
snprintf(info->longname, sizeof(info->longname), "%*s", (int)str_len, temp_val);
/*                                                  ↑ wrong: pads, doesn't truncate */

/* Block 2 (EOF actions): identical */

Reproduction

# Craft a SAV file with a type-7, subtype-13 (long variable names) record
# where the value side of a "shortname=longname" pair is longer than 64 bytes.
# e.g.: MYVAR=AAAA...A (65 A's)
#
# str_len = 65, sizeof(temp_val) = 65 (64 + NUL slot)
# memcpy(temp_val, ..., 65) writes 65 bytes into a 65-byte buffer,
# then temp_val[65] = '\0' writes one byte past the end.

With AddressSanitizer on a crafted SAV:

ERROR: AddressSanitizer: stack-buffer-overflow
WRITE of size 1 at temp_val[65]

Fix

+if (str_len > sizeof(temp_val) - 1) {
+    retval = READSTAT_ERROR_PARSE;
+    goto cleanup;
+}
 memcpy(temp_val, str_start, str_len);
 temp_val[str_len] = '\0';
 ...
-snprintf(info->longname, sizeof(info->longname), "%*s",  (int)str_len, temp_val);
+snprintf(info->longname, sizeof(info->longname), "%.*s", (int)str_len, temp_val);

Note: This file is generated by Ragel from readstat_sav_parse.rl. The
fix should be applied to the .rl source and the generated .c re-generated,
or applied directly to the .c if the .rl source is not being maintained.

Bug 2: Unsigned subtraction overflow in `dta_compare_strls`

`dta_strl_t` field types

typedef struct dta_strl_s {
    uint16_t  v;    /* variable index, max 65535 */
    uint64_t  o;    /* observation index, max 2^64-1 */
    ...
} dta_strl_t;

Vulnerable code

static int dta_compare_strls(const void *elem1, const void *elem2) {
    const dta_strl_t *key    = (const dta_strl_t *)elem1;
    const dta_strl_t *target = *(const dta_strl_t **)elem2;
    if (key->o == target->o)
        return key->v - target->v;   /* uint16 − uint16: safe, max diff ±65535 */
    return key->o - target->o;       /* uint64 − uint64: UNSAFE */
}

key->o - target->o is uint64_t subtraction. The result is always
non-negative (wraps on underflow). When cast to int:

If key->o > target->o and key->o - target->o > INT_MAX (e.g. o
values differ by 0x80000001), the cast truncates to a negative int.
bsearch sees a negative return and goes left — the opposite direction
from correct — potentially returning NULL or a wrong element.

What goes wrong

dta_compare_strls is used as the comparator in bsearch over the strL
lookup table. A wrong comparator makes bsearch navigate incorrectly, so:

It may return NULL when the entry exists → strL data silently replaced
with empty string.
It may return a wrong entry → a different strL value substituted for the
one at the queried (v, o) coordinates — silent data corruption.

The v subtraction is safe because uint16_t differences fit in int.
Only o has the problem.

Reproduction

A Stata DTA file (v117+) with at least two strL entries whose observation
indices (o) differ by more than INT_MAX (2^31−1):

* In Stata 14+, a dataset with a string variable and many observations
* naturally produces strL (v, o) pairs where o can reach millions.
* Any pair with o1 - o2 > 2^31 will trigger the wrong comparator result.

The bug is deterministic and format-dependent. A DTA file with
o values {0x100000001, 0x000000001} will cause bsearch to go the
wrong direction on the first comparison.

Fix

Use explicit three-way comparison that is safe for any unsigned type:

 static int dta_compare_strls(const void *elem1, const void *elem2) {
     const dta_strl_t *key    = (const dta_strl_t *)elem1;
     const dta_strl_t *target = *(const dta_strl_t **)elem2;
-    if (key->o == target->o)
-        return key->v - target->v;
-    return key->o - target->o;
+    if (key->o != target->o)
+        return (key->o > target->o) - (key->o < target->o);
+    return (key->v > target->v) - (key->v < target->v);
 }

(a > b) - (a < b) evaluates to +1, 0, or −1 for any comparable type,
without overflow.

1. spss/readstat_sav_parse.c: %*s should be %.*s snprintf uses %*s (minimum field width — pads with spaces) where %.*s (precision — truncates to N chars) is intended. This means if temp_val is longer than str_len, the full string is copied rather than a str_len-character prefix. The same function also does memcpy(temp_val, str_start, str_len) where temp_val is char[65]. If str_len > 64 the memcpy overflows the stack buffer. Added a bounds check before the memcpy. Note: this file is generated by Ragel from readstat_sav_parse.rl. The fix is applied to both the generated .c and should be applied to the .rl source as well. 2. stata/readstat_dta_read.c: dta_compare_strls returns uint64 diff dta_compare_strls returns key->o - target->o where both are uint64_t. When the difference exceeds INT_MAX, the truncated int has the wrong sign, causing bsearch to navigate the wrong direction — potentially returning NULL (strL silently replaced with empty string) or a wrong entry (silent data corruption). Replaced with safe three-way comparison: (a>b)-(a<b) evaluates to +1, 0, or -1 for any comparable type without overflow.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix incorrect snprintf format specifier and undefined bsearch comparator#377

Fix incorrect snprintf format specifier and undefined bsearch comparator#377
aaranmcguire wants to merge 1 commit intoWizardMac:devfrom
aaranmcguire:fix/snprintf-format-and-strl-comparator

aaranmcguire commented May 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

aaranmcguire commented May 5, 2026

PR 6: Fix incorrect snprintf format specifier and undefined comparator behaviour

Summary

Bug 1: %*s vs %.*s + memcpy overflow (spss/readstat_sav_parse.c)

What %*s and %.*s actually do

memcpy overflow into temp_val[64]

Vulnerable code

Reproduction

Fix

Bug 2: Unsigned subtraction overflow in dta_compare_strls

dta_strl_t field types

Vulnerable code

What goes wrong

Reproduction

Fix

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

PR 6: Fix incorrect `snprintf` format specifier and undefined comparator behaviour

Bug 1: `%s` vs `%.s` + `memcpy` overflow (`spss/readstat_sav_parse.c`)

What `%s` and `%.s` actually do

`memcpy` overflow into `temp_val[64]`

Bug 2: Unsigned subtraction overflow in `dta_compare_strls`

`dta_strl_t` field types