Skip to content

Conversation

@vlakoff
Copy link
Contributor

@vlakoff vlakoff commented Nov 11, 2025

Refactor Encoder::chooseMode to simplify and optimize the checks for numeric and alphanumeric content.

ctype_digit() can check the entire string at once, so we can check it upfront:

if (ctype_digit($content)) {
    return Mode::NUMERIC();
}

This allows us to remove the logic related to $hasNumeric, streamlining the rest of the function.

@vlakoff
Copy link
Contributor Author

vlakoff commented Nov 11, 2025

The refactor introduced a regression: an empty string ('') was incorrectly encoded as Mode::ALPHANUMERIC() instead of Mode::BYTE().

While technically valid—since all modes can represent an empty string—Mode::BYTE() better reflects general-purpose encoding and avoids unnecessary constraints tied to the alphanumeric character set.

The most appropriate fix is to add an explicit check at the beginning:

if ('' === $content) {
    return Mode::BYTE();
}

This restores the correct behavior and clarifies the intent—helping prevent future regressions.

edit: PR has been updated with the correction.

@DASPRiD
Copy link
Member

DASPRiD commented Nov 13, 2025

Very nice implementaiton, thanks! Though at the moment we still ahve failing unit tests in main, which prevents me from being able to evaluate any PRs. That needs to be fixed first.

@vlakoff vlakoff changed the title Simplify encoding mode detection logic refactor: simplify encoding mode detection logic Nov 14, 2025
@DASPRiD
Copy link
Member

DASPRiD commented Nov 15, 2025

Please rebase against main to get the tests working.

Refactor to simplify checks for numeric and alphanumeric content.
Fixes a regression introduced in the previous refactor where an empty string ('') was incorrectly encoded as Mode::ALPHANUMERIC() instead of Mode::BYTE().

While technically valid, Mode::BYTE() is a more appropriate choice—it reflects general-purpose encoding and avoids the limitations of the alphanumeric character set.

This change also improves clarity and helps prevent similar regressions in the future.

Similarly, it ensures correct behavior for Encoder::chooseMode('', 'SHIFT-JIS').
getAlphanumericCode():
- rename parameter from $code to $byte for clarity and consistency with docblock
- simplify using null coalescing (also reduces lookups)

isOnlyDoubleByteKanji():
- update docblock to note empty string returns true (which is important to consider in the caller)
@vlakoff
Copy link
Contributor Author

vlakoff commented Nov 16, 2025

Rebased. Please let me know what you would like to see improved (including commit messages) before considering this for merging.

@vlakoff
Copy link
Contributor Author

vlakoff commented Nov 16, 2025

A faster version of isOnlyAlphanumeric(), which takes advantage of strspn():

private static function isOnlyAlphanumeric(string $content) : bool
{
    $allowed = '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ $%*+-./:';

    return strspn($content, $allowed) === strlen($content);
}

It is much, much faster (hundreds or even thousands of times) and highly scalable, as it appears to run in close to O(1).

Edit: committed — would’ve been a shame not to use it.

Implemented a technique that takes advantage of strspn() to optimize isOnlyAlphanumeric().

This approach is much, much faster (hundreds or even thousands of times) and highly scalable,
appearing to run in close to O(1).
@codecov
Copy link

codecov bot commented Nov 16, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 70.51%. Comparing base (bd2370d) to head (d6d5428).
⚠️ Report is 3 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main     #201      +/-   ##
============================================
- Coverage     70.58%   70.51%   -0.08%     
+ Complexity      995      993       -2     
============================================
  Files            49       49              
  Lines          3182     3174       -8     
============================================
- Hits           2246     2238       -8     
  Misses          936      936              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

*/
private static function isOnlyAlphanumeric(string $content) : bool
{
$allowed = '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ $%*+-./:';
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of assigning this variable on every call, would it make sense to have this as a private const on the class?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. It looks cleaner this way.

@DASPRiD DASPRiD merged commit 741e712 into Bacon:main Nov 16, 2025
9 checks passed
@vlakoff vlakoff deleted the encoder-chooseMode branch November 16, 2025 18:29
@vlakoff
Copy link
Contributor Author

vlakoff commented Nov 16, 2025

Aside: The previous “iterate and analyze byte‑by‑byte” approach would have made sense if we were implementing a “multiple modes” algorithm (see #111). However, we are still far from that. The algorithms are already known and can be generated in PHP, but they remain complex and not straightforward to put in place.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants