Skip to content

Fix incorrect snprintf format specifier and undefined bsearch comparator#377

Open
aaranmcguire wants to merge 1 commit intoWizardMac:devfrom
aaranmcguire:fix/snprintf-format-and-strl-comparator
Open

Fix incorrect snprintf format specifier and undefined bsearch comparator#377
aaranmcguire wants to merge 1 commit intoWizardMac:devfrom
aaranmcguire:fix/snprintf-format-and-strl-comparator

Conversation

@aaranmcguire
Copy link
Copy Markdown

PR 6: Fix incorrect snprintf format specifier and undefined comparator behaviour

Summary

Two unrelated but both definitively-wrong code patterns:

  1. %*s should be %.*s — in the SAV long-variable-names parser,
    snprintf uses %*s (minimum-width specifier) where %.*s (precision
    specifier, which truncates) is intended. Combined with a missing bounds
    check on a memcpy target, a malformed SAV file can overflow a 64-byte
    stack buffer.

  2. Unsigned subtraction returned as int in dta_compare_strls — the
    bsearch comparator for Stata strL references returns
    key->o - target->o where both are uint64_t. When the true difference
    exceeds INT_MAX, the truncated int has the wrong sign, causing bsearch
    to navigate in the wrong direction and potentially return garbage pointers.


Bug 1: %*s vs %.*s + memcpy overflow (spss/readstat_sav_parse.c)

What %*s and %.*s actually do

  • snprintf(dst, dst_size, "%*s", width, src)* specifies minimum
    field width
    (padding). If src is longer than width, the full src
    is copied (up to dst_size). This is NOT truncation.
  • snprintf(dst, dst_size, "%.*s", prec, src).* specifies precision
    (maximum characters from src). This IS truncation.

The code intends to copy at most str_len characters of temp_val into
info->longname. %*s does not achieve this — it copies all of temp_val.
%.*s is the correct specifier.

memcpy overflow into temp_val[64]

The same function also does:

/* str_len is a file-controlled value from the Ragel parser */
memcpy(temp_val, str_start, str_len);
temp_val[str_len] = '\0';

temp_val is declared char temp_val[64+1]. If the Ragel grammar allows
str_len > 64 (which it does — the grammar constrains key characters but
not value length), memcpy overflows temp_val onto the stack.

Vulnerable code

/* spss/readstat_sav_parse.c — two identical blocks */

/* Block 1 (transition actions): */
memcpy(temp_val, str_start, str_len);   /* ← overflows if str_len > 64 */
temp_val[str_len] = '\0';
...
snprintf(info->longname, sizeof(info->longname), "%*s", (int)str_len, temp_val);
/*                                                  ↑ wrong: pads, doesn't truncate */

/* Block 2 (EOF actions): identical */

Reproduction

# Craft a SAV file with a type-7, subtype-13 (long variable names) record
# where the value side of a "shortname=longname" pair is longer than 64 bytes.
# e.g.: MYVAR=AAAA...A (65 A's)
#
# str_len = 65, sizeof(temp_val) = 65 (64 + NUL slot)
# memcpy(temp_val, ..., 65) writes 65 bytes into a 65-byte buffer,
# then temp_val[65] = '\0' writes one byte past the end.

With AddressSanitizer on a crafted SAV:

ERROR: AddressSanitizer: stack-buffer-overflow
WRITE of size 1 at temp_val[65]

Fix

+if (str_len > sizeof(temp_val) - 1) {
+    retval = READSTAT_ERROR_PARSE;
+    goto cleanup;
+}
 memcpy(temp_val, str_start, str_len);
 temp_val[str_len] = '\0';
 ...
-snprintf(info->longname, sizeof(info->longname), "%*s",  (int)str_len, temp_val);
+snprintf(info->longname, sizeof(info->longname), "%.*s", (int)str_len, temp_val);

Note: This file is generated by Ragel from readstat_sav_parse.rl. The
fix should be applied to the .rl source and the generated .c re-generated,
or applied directly to the .c if the .rl source is not being maintained.


Bug 2: Unsigned subtraction overflow in dta_compare_strls

dta_strl_t field types

typedef struct dta_strl_s {
    uint16_t  v;    /* variable index, max 65535 */
    uint64_t  o;    /* observation index, max 2^64-1 */
    ...
} dta_strl_t;

Vulnerable code

static int dta_compare_strls(const void *elem1, const void *elem2) {
    const dta_strl_t *key    = (const dta_strl_t *)elem1;
    const dta_strl_t *target = *(const dta_strl_t **)elem2;
    if (key->o == target->o)
        return key->v - target->v;   /* uint16 − uint16: safe, max diff ±65535 */
    return key->o - target->o;       /* uint64 − uint64: UNSAFE */
}

key->o - target->o is uint64_t subtraction. The result is always
non-negative (wraps on underflow). When cast to int:

  • If key->o > target->o and key->o - target->o > INT_MAX (e.g. o
    values differ by 0x80000001), the cast truncates to a negative int.
  • bsearch sees a negative return and goes left — the opposite direction
    from correct — potentially returning NULL or a wrong element.

What goes wrong

dta_compare_strls is used as the comparator in bsearch over the strL
lookup table. A wrong comparator makes bsearch navigate incorrectly, so:

  1. It may return NULL when the entry exists → strL data silently replaced
    with empty string.
  2. It may return a wrong entry → a different strL value substituted for the
    one at the queried (v, o) coordinates — silent data corruption.

The v subtraction is safe because uint16_t differences fit in int.
Only o has the problem.

Reproduction

A Stata DTA file (v117+) with at least two strL entries whose observation
indices (o) differ by more than INT_MAX (2^31−1):

* In Stata 14+, a dataset with a string variable and many observations
* naturally produces strL (v, o) pairs where o can reach millions.
* Any pair with o1 - o2 > 2^31 will trigger the wrong comparator result.

The bug is deterministic and format-dependent. A DTA file with
o values {0x100000001, 0x000000001} will cause bsearch to go the
wrong direction on the first comparison.

Fix

Use explicit three-way comparison that is safe for any unsigned type:

 static int dta_compare_strls(const void *elem1, const void *elem2) {
     const dta_strl_t *key    = (const dta_strl_t *)elem1;
     const dta_strl_t *target = *(const dta_strl_t **)elem2;
-    if (key->o == target->o)
-        return key->v - target->v;
-    return key->o - target->o;
+    if (key->o != target->o)
+        return (key->o > target->o) - (key->o < target->o);
+    return (key->v > target->v) - (key->v < target->v);
 }

(a > b) - (a < b) evaluates to +1, 0, or −1 for any comparable type,
without overflow.

1. spss/readstat_sav_parse.c: %*s should be %.*s

   snprintf uses %*s (minimum field width — pads with spaces) where
   %.*s (precision — truncates to N chars) is intended. This means
   if temp_val is longer than str_len, the full string is copied
   rather than a str_len-character prefix.

   The same function also does memcpy(temp_val, str_start, str_len)
   where temp_val is char[65]. If str_len > 64 the memcpy overflows
   the stack buffer. Added a bounds check before the memcpy.

   Note: this file is generated by Ragel from readstat_sav_parse.rl.
   The fix is applied to both the generated .c and should be applied
   to the .rl source as well.

2. stata/readstat_dta_read.c: dta_compare_strls returns uint64 diff

   dta_compare_strls returns key->o - target->o where both are
   uint64_t. When the difference exceeds INT_MAX, the truncated int
   has the wrong sign, causing bsearch to navigate the wrong direction
   — potentially returning NULL (strL silently replaced with empty
   string) or a wrong entry (silent data corruption).

   Replaced with safe three-way comparison: (a>b)-(a<b) evaluates to
   +1, 0, or -1 for any comparable type without overflow.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant