regex2dfa

Convert regular expressions to minimized DFAs in AT&T FST format.

Pure Python. Zero dependencies. Results cached in memory.

Installation

pip install regex2dfa

Or install from source:

git clone https://github.com/kpdyer/regex2dfa.git
cd regex2dfa
pip install -e .

Quick Start

from regex2dfa import regex2dfa

dfa = regex2dfa("(a|b)+")
print(dfa)

Output:

Caching

Results are automatically cached in memory. Repeated calls with the same regex return instantly:

from regex2dfa import regex2dfa, cache_info, clear_cache

# First call computes the DFA
regex2dfa("(a|b)+")

# Second call returns from cache (instant)
regex2dfa("(a|b)+")

# Check cache statistics
print(cache_info())
# CacheInfo(hits=1, misses=1, maxsize=1024, currsize=1)

# Clear cache if needed
clear_cache()

Object-Oriented Interface

from regex2dfa import Regex2DFA

converter = Regex2DFA("(a|b)+")

# Access intermediate representations
print(f"NFA states: {len(converter.nfa.states)}")
print(f"DFA states: {len(converter.dfa.states)}")
print(f"Minimized DFA states: {len(converter.minimized_dfa.states)}")

# Get AT&T format
print(converter.to_att())

Low-Level API

from regex2dfa import parse_regex, build_nfa, nfa_to_dfa, minimize_dfa, format_att

# Step-by-step pipeline
postfix = parse_regex("(a|b)+")
nfa = build_nfa(postfix)
dfa = nfa_to_dfa(nfa)
min_dfa = minimize_dfa(dfa)
output = format_att(min_dfa)

AT&T FST Format

The output uses tab-separated fields:

Transitions: src dst input output
Final states: state_id

Labels are ASCII byte values (97 = 'a', 98 = 'b').

Supported Regex Syntax

Feature	Syntax	Description
Literal	`a`	Matches character
Any char	`.`	Matches any byte (0-255)
Char class	`[abc]`	Matches any listed char
Negated class	`[^abc]`	Matches any char not listed
Range	`[a-z]`	Matches range
Zero or more	`*`	Kleene star
One or more	`+`	Kleene plus
Optional	`?`	Zero or one
Alternation	`\|`	Either side
Grouping	`(...)`	Group expressions
Anchors	`^` `$`	Start/end (stripped, implicit)
Hex escape	`\x00`	Byte by hex value
Any byte	`\C`	Any byte (0-255)
Digits	`\d`	`[0-9]`
Word chars	`\w`	`[a-zA-Z0-9_]`
Whitespace	`\s`	Space, tab, newline, carriage return

How It Works

Parser — Tokenizes regex and converts to postfix notation (shunting-yard algorithm)
Thompson's Construction — Builds NFA from postfix tokens
Subset Construction — Converts NFA to DFA (powerset method)
Hopcroft's Algorithm — Minimizes DFA (partition refinement)

Project Structure

regex2dfa/
├── __init__.py     # Public API
├── core.py         # Main regex2dfa() function with caching
├── parser.py       # Regex tokenizer + postfix conversion
├── nfa.py          # NFA data structures
├── thompson.py     # Thompson's construction (Regex → NFA)
├── dfa.py          # DFA data structures
├── subset.py       # Subset construction (NFA → DFA)
├── hopcroft.py     # DFA minimization
└── formatter.py    # AT&T FST output

Tests

pytest tests/ -v

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 105 Commits
.github/workflows		.github/workflows
regex2dfa		regex2dfa
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
README_PYPI.md		README_PYPI.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

regex2dfa

Installation

Quick Start

Caching

Object-Oriented Interface

Low-Level API

AT&T FST Format

Supported Regex Syntax

How It Works

Project Structure

Tests

License

About

Uh oh!

Releases 6

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

regex2dfa

Installation

Quick Start

Caching

Object-Oriented Interface

Low-Level API

AT&T FST Format

Supported Regex Syntax

How It Works

Project Structure

Tests

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 6

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages