Skip to content

Detect valid UTF-8 as UTF-8#572

Draft
filip-hejsek wants to merge 2 commits into
TypesettingTools:masterfrom
filip-hejsek:utf8_detect
Draft

Detect valid UTF-8 as UTF-8#572
filip-hejsek wants to merge 2 commits into
TypesettingTools:masterfrom
filip-hejsek:utf8_detect

Conversation

@filip-hejsek
Copy link
Copy Markdown
Contributor

@filip-hejsek filip-hejsek commented Mar 3, 2026

This PR modifies Aegisub's charset detection function so that if the input is valid UTF-8, it will always be detected as UTF-8 instead of using uchardet's heuristics (which sometimes fail to correctly detect UTF-8).

ICU charset conversion API is used for UTF-8 validation. Because this API is designed for conversion, not validation, a dummy output buffer has to be provided.

A charset detection test suite has been added. For testing purposes, an additional "detect reason" output parameter has been added to the charset detection function, which allows distinguishing between e.g. detecting utf-8 from byte order mark, UTF-8 validation or uchardet heuristics.

Remaining issues:

  • The added tests need to know if libaegisub was compiled with or without uchardet, but the WITH_UCHARDET macro is not available when compiling tests.

Fixes #370

@filip-hejsek filip-hejsek force-pushed the utf8_detect branch 8 times, most recently from 6861c91 to 9276cc8 Compare March 24, 2026 17:59
@filip-hejsek
Copy link
Copy Markdown
Contributor Author

I've added some basic tests. However, the tests need to know if Aegisub was compiled with uchardet, but the WITH_UCHARDET macro is not available when compiling the tests.

@arch1t3cht
Copy link
Copy Markdown
Member

Thanks! And sorry for the delay; I ended up postponing absolutely everything else until I finished #594 since I probably would have never gotten it done otherwise.

The first commit looks good to me and fixes the wrongly detected files I've been sent in the past. For the second commit (thanks for adding tests!):

  • I like the added TEST_DATA_DIR logic in principle (though maybe it'd be nice to prefix the env variable with AEGISUB_) but as it is it's inconsistent with the way test files are handled for the other test variants. The other tests run tests/setup.{sh,bat} to copy the test data to the build directory so that they can be found via a relative path from the test binary's CWD.

    Now, these dedicated setup scripts are probably not actually needed any more. They used to set up some extra files with different permissions, but that was removed in 5da48ec . So in principle they could just be removed entirely and be replaced with the AEGISUB_TEST_DATA_DIR logic everywhere.

    If you agree I can do this myself sometime soon unless you want to do it yourself.

  • The configuration macros not being available to the tests is unfortunate, yeah. I think the simplest solution might be to move the logic in meson.build that's generating aegisub_defines outside of the if/else branch and pass those flags to the tests, as in the attached patch: testflags.patch

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

SubRip encoding not detected as UTF-8 when the first character is "🖥️"

2 participants