Skip to content

fix: replace atoi() with strict integer parsing in SamParser (#22)#37

Open
lukedev45 wants to merge 2 commits intocompiler-research:developfrom
lukedev45:fix/strict-integer-parsing-samparser
Open

fix: replace atoi() with strict integer parsing in SamParser (#22)#37
lukedev45 wants to merge 2 commits intocompiler-research:developfrom
lukedev45:fix/strict-integer-parsing-samparser

Conversation

@lukedev45
Copy link
Copy Markdown

Summary

  • Replace all 5 unsafe atoi() calls in SamParser::ParseLine with a StrictAtoi helper that uses std::stoi with full validation
  • Corrupted SAM records with non-numeric values in integer fields (FLAG, POS, MAPQ, PNEXT, TLEN) are now rejected instead of silently written with zeroed-out values
  • Add unit test (SamParserRejectsNonNumericFields) verifying that invalid records are skipped while valid records in the same file are still processed

Problem

atoi() returns 0 for non-numeric input without signaling an error. This caused silent data corruption: malformed SAM records were written to ROOT files with integer fields set to 0, with no warning to the user.

Solution

A static StrictAtoi function wraps std::stoi and:

  • Validates that the entire token is consumed (rejects trailing characters like "123abc")
  • Catches std::invalid_argument for non-numeric strings
  • Catches std::out_of_range for values exceeding int limits
  • Prints a diagnostic to stderr identifying the field and invalid value
  • Returns false so ParseLine skips the record

Test plan

  • New test creates a SAM file with non-numeric FLAG and POS values plus one valid record
  • Asserts only the valid record is written to the output RNTuple (1 entry, not 3)
  • cmake --build build && cd build && ctest --output-on-failure passes all tests

Closes #22

@AdityaPandeyCN
Copy link
Copy Markdown

Thanks for this, Can you explore if this will be better (https://en.cppreference.com/w/cpp/string/byte/strtol.html).

@lukedev45
Copy link
Copy Markdown
Author

No problem I will take a look and see

@lukedev45
Copy link
Copy Markdown
Author

The short answer is yes.
std::stoi requires a std::string, so the PR constructs std::string(token) on every call. That's a heap allocation per call, 5 per SAM record. SAM files routinely have hundreds of millions of records which is potentially billions of unnecessary allocations in a hot parsing loop. strtol works directly with the char* tokens that strtok already produces. Plus strtol reports errors via errno and the endptr output parameter, which costs nothing.

I'll work on implementing it now.

Copy link
Copy Markdown

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

Comment thread src/ramcore/SamParser.cxx
@@ -1,9 +1,33 @@
#include "ramcore/SamParser.h"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: 'ramcore/SamParser.h' file not found [clang-diagnostic-error]

#include "ramcore/SamParser.h"
         ^

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lukedev45 please take a look here and fix the conflicts.

@codecov-commenter
Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 71.87500% with 9 lines in your changes missing coverage. Please review.
✅ Project coverage is 56.06%. Comparing base (3a3873a) to head (d8c0701).
⚠️ Report is 9 commits behind head on develop.

Files with missing lines Patch % Lines
src/ramcore/SamParser.cxx 58.82% 2 Missing and 5 partials ⚠️
test/ramcoretests.cxx 86.66% 1 Missing and 1 partial ⚠️

❌ Your patch status has failed because the patch coverage (71.87%) is below the target coverage (85.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff             @@
##           develop      #37      +/-   ##
===========================================
+ Coverage    55.85%   56.06%   +0.20%     
===========================================
  Files           16       16              
  Lines         1425     1452      +27     
  Branches       752      771      +19     
===========================================
+ Hits           796      814      +18     
- Misses         521      524       +3     
- Partials       108      114       +6     
Flag Coverage Δ
unittests 56.06% <71.87%> (+0.20%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
test/ramcoretests.cxx 75.67% <86.66%> (+1.71%) ⬆️
src/ramcore/SamParser.cxx 73.91% <58.82%> (-6.79%) ⬇️
Files with missing lines Coverage Δ
test/ramcoretests.cxx 75.67% <86.66%> (+1.71%) ⬆️
src/ramcore/SamParser.cxx 73.91% <58.82%> (-6.79%) ⬇️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@vgvassilev
Copy link
Copy Markdown

This needs rebasing and tests.

…r-research#22)

atoi() silently returns 0 for non-numeric strings, causing corrupted
SAM records to be written with zeroed-out integer fields. Replace all
five atoi() calls with a StrictAtoi wrapper using std::stoi that
validates input and rejects records with non-numeric FLAG, POS, MAPQ,
PNEXT, or TLEN values.
strtol works directly with the char* tokens from strtok, avoiding the heap allocation from std::string construction and exception overhead on every call. This is more efficient for a hot parsing loop and fits the C-style pattern already used throughout SamParser.
@lukedev45 lukedev45 force-pushed the fix/strict-integer-parsing-samparser branch from d8c0701 to 1993695 Compare March 25, 2026 19:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug: SamParser silently corrupts data when encountering non-numeric values in integer fields

5 participants