Skip to content

Add ADR for regex engine use in C++#61

Open
Ozaq wants to merge 1 commit into
mainfrom
adr/regex-engine
Open

Add ADR for regex engine use in C++#61
Ozaq wants to merge 1 commit into
mainfrom
adr/regex-engine

Conversation

@Ozaq
Copy link
Copy Markdown
Member

@Ozaq Ozaq commented May 5, 2026

Description

Contributor Declaration

By opening this pull request, I affirm the following:

  • All authors agree to the Contributor License Agreement.
  • The code follows the project's coding standards.
  • I have performed self-review and added comments where needed.
  • I have added or updated tests to verify that my changes are effective and functional.
  • I have run all existing tests and confirmed they pass.

@Ozaq Ozaq requested review from danovaro, mcocdawc and tlmquintino May 5, 2026 08:46
Copy link
Copy Markdown
Contributor

@pgeier pgeier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had a look. Nothing much to mention. The important is really that a migration is not straightforward as every used regex has to be tested at runtime.

Otherwise I'm much in favour to provide just a set of general functions like match and let projects use/wrap PCRE2 directly for special use-cases and optimizations (e.g. JIT Compilation)

Copy link
Copy Markdown
Contributor

@mcocdawc mcocdawc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice overview.

A few high-level comments:

Would be great to have an issue in ECKIT that links to this PR. Such that

The current implementation will be marked as deprecated and removed in EcKit 3.0

is actually known there. :-D


It would be great to write a few sentences about the fact that the Regex.h actually is the intended RAII wrapper, such that most C++ code in the ECMWF stack does not have to do the RAII wrapping themselves. (Personal discussion with @Ozaq )
The manual wrapping is only for projects that do not depend on eckit.


Oh, and because of the numbering, it should be merged after #60

Copy link
Copy Markdown
Contributor

@simondsmart simondsmart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The decision quoted here says to implement an API compatible version of Regex.

There are some arguments to Regex - notably shell and extended. Are these arguments meaningful, and supportable with the new pattern. Do we have a feel for where/how often these arguments are used through the code base, or what might break?

My understanding is that one reason for not changing std::regex to something better is that to extract the performance improvements would require an API change. Is the API of PCRE2 appropriate to be wrapped cleanly by eckit::Regex?


### EcKit Migration Path

EcKit will provide a new API-compatible implementation for `Regex.h`. Since *PCRE2's* Perl-compatible syntax differs from POSIX ERE, every caller must verify that its patterns still match correctly.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not something that can be straightforwardly done. Regexes are supplied by, and used by, users - not developers - in many of the cases.

Can we see a documentary summary of the differences?

@Ozaq
Copy link
Copy Markdown
Member Author

Ozaq commented May 7, 2026

The decision quoted here says to implement an API compatible version of Regex.

There are some arguments to Regex - notably shell and extended. Are these arguments meaningful, and supportable with the new pattern. Do we have a feel for where/how often these arguments are used through the code base, or what might break?

Good catch, I did not take the flags into account. However a cursory check revealed no callsite where shell or extended flags where used. In any case I will update the migration guide.

My understanding is that one reason for not changing std::regex to something better is that to extract the performance improvements would require an API change. Is the API of PCRE2 appropriate to be wrapped cleanly by eckit::Regex?

So std::regex is essentially "unfixable" because there are few design mistakes, as acknowledged by the standards committee as well as implementation issues. A central issue is that std::regex is locale aware and not unicode aware. This means matching behavior can change on setting a different locale. If I understand it correctly performance improvements are also hampered by some original choices and since the ABI may not be broken the implementation is essentially set in stone.

This 2004 report lists 65 issues with std::regex.

Then implementers shipped slow implementations, gcc predominantly, and now given the outlook that none of the mess can really be fixed no efforts are made to address performance issues that could be fixed.

For us the general flow of Create -> Compile -> Match stays the same.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants