Skip to content

Regex support for missing predefined character classes and negated character classes#1474

Open
lmasroca wants to merge 37 commits intomasterfrom
external-pr-lmasroca
Open

Regex support for missing predefined character classes and negated character classes#1474
lmasroca wants to merge 37 commits intomasterfrom
external-pr-lmasroca

Conversation

@lmasroca
Copy link
Copy Markdown
Collaborator

Added Java/JS support for negated character classes
Added Java support for missing predefined character classes escapes (\v, \H, etc.) and POSIX character classes (\p{Lower}, etc.)

lmasroca and others added 24 commits October 22, 2025 17:12
…as CI currently fails otherwise"

This reverts commit 56a7462.
@lmasroca lmasroca requested a review from jgaleotti March 17, 2026 16:44
@lmasroca lmasroca requested a review from jgaleotti March 19, 2026 14:51
@lmasroca
Copy link
Copy Markdown
Collaborator Author

Looking into this failing check, the test generates a SqlCompositeGene which contains a ChoiceGene which contains an ArrayGene which starts empty but its template contains SqlForeignKeyGene. This generates no problems as long as the array is empty, but as soon as the array gets a random element everything stops being globally valid (in this case the new element, SqlForeignKeyGene is not globally valid as none of its parents in the chain are an SqlAction). The test filters globally invalid genes before randomizing but not after.

This happens on GeneRandomizedTest during verifyCopyValueFrom. We could add a globally valid check after randomization the same way it is done with printable checks, this makes the test pass. However, I am not sure if there should be checks in place to make this chain invalid even before randomizing instead.

    private fun verifyCopyValueFrom(
        mutable: List<Gene>,
        rand: Randomness
    ) {
        val printable = mutable.filter { it.isGloballyValid() && it.isPrintable() }

        printable.forEach { root ->

            val x = root.copy()
            val sx = x.getValueAsRawString()

            val y = x.copy().apply { randomize(rand, true) }
            if(!y.isPrintable()){ // we could add a globally valid check here the same way we filter before
                return@forEach
            }
            val sy = y.getValueAsRawString()

@jgaleotti
Copy link
Copy Markdown
Collaborator

CI is passing. I think this is ready for Andrea's review.

@jgaleotti jgaleotti requested a review from arcuri82 March 24, 2026 13:44
- aDouble
- aFloat
- aBoolean
- aNullableString
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why this change?
@Pgarrett you wrote this file, isn't it? what is your opinion on this change?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I did, it seemed one of the fields was not being generated in the test and thus causing the test to fail in the reflective assertions

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No concern on my side

| HexEscapeSequence
| UnicodeEscapeSequence
| OctalEscapeSequence
| 'p' BRACE_open PosixCharacterClassLabel BRACE_close
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add comment that this is specific to Java... possible also add link where \p is explained

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

\p is allowed in javascript also but only when certain flags are enabled. There is no flag support for Java/JS yet.

//| IdentityEscape
;

fragment PosixCharacterClassLabel
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

provide reference link from where those values were taken from. recall the comments at the top of this file

fragment CharacterClassEscape
//one of d D s S w W
: [dDsSwW]
//one of d D s S w W v V h H
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as v V h H are non-standard, and specific to Java, add a comment to explain them here

// this limits the character class complements to 0xffff instead of allowing up to 0x10ffff, but values over
// 0xffff are not permitted on Char as they need 2 Chars to be represented; to allow this, we would need to
// use String or Int in every possible step as methods which return a single Char cannot return these characters
if(negated) internalRanges.add(CharacterRange(Character.MIN_VALUE,Character.MAX_VALUE))
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

always use {} for blocks, even if single statement inside

companion object {
private val log: Logger = LoggerFactory.getLogger(Randomness::class.java)

private fun stringToListOfCharPairs(s: String) : List<CharacterRange> {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rather stringToListOfCharacterRanges ?

private val digitSet = listOf(CharacterRange('0', '9'))
private val asciiLetterSet = listOf(CharacterRange('a', 'z'), CharacterRange('A', 'Z'))
private val wordSet = listOf(CharacterRange('_', '_')) + asciiLetterSet + digitSet
private val spaceSet = stringToListOfCharPairs(" \t\r\n\u000C\u000b")
Copy link
Copy Markdown
Collaborator

@arcuri82 arcuri82 Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

explain in a comment what \u000C\u000b are...
also notice that, in some cases, we explicitely skipped some special characters because escaping in the generating test files is bit buggy (need fixing... but never had time)

private val verticalSpaceSet = stringToListOfCharPairs("\n\u000B\u000C\r\u0085\u2028\u2029")
private val punctuationSet = stringToListOfCharPairs("""!"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~""")

private val digitCharClass = CharacterRangeRxGene(false, digitSet)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should not have references to a gene (ie CharacterRangeRxGene) here in this package, as it can have architectural side-effects in the long run... whatever logic needs to be re-used, should be refactored and extracted outside of gene

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before these changes there were no references to genes here, but character classes values were sampled from here. The possible values came from strings defined on this file, which provided some basic functionality but missed several characters from predefined character classes. Trying to get these predefined character classes working with these missing characters I implemented them by changing code here. I see now how this could be an issue.

So to fix this, the predefined character classes definitions should be in CharacterClassEscapeRxGene, then with each randomize instead of using Randomness.nextX() we just call CharacterRangeRxGene.randomize(...) by making each CharacterClassEscapeRxGene hold the CharacterRangeRxGene which represents it? Then restore the previous method of getting Randomness.nextWordChar(), etc. (regarding previous comment about skipping some special characters as escaping in the generating test files)

return set[random.nextInt(set.length)]
}

fun nextFromCharClass(cc: CharacterRangeRxGene) : Char{
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe you could introduce a new MultiCharacterRange and move out some of business logic of CharacterRangeRxGene into it (which then would use MultiCharacterRange internally)

cc.randomize(this, false)
val k = cc.value
// is it necessary to log this?
log.trace("nextFromCharClass(): {}", k)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes... we have tests for determinism that check those trace logs. eg look at runAndCheckDeterminism()

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants