TextParser is a high-performance, extensible text parsing library written in C. It uses regular expressions to define language grammars and generates a hierarchical Abstract Syntax Tree (AST) for parsed documents.
The project currently provides robust support for CFML (ColdFusion Markup Language) and JSON, and its flexible architecture allows new language definitions to be added easily.
- High Performance: Written in optimized C for fast parsing of large codebases.
- Small Footprint: The library is designed to be small and easy to integrate into other projects.
- Minimal Dependencies: The only external dependency is the PCRE2 library, used for regex matching.
- Regex-Based Grammars: Define language syntax using flexible regular expressions.
- Hierarchical AST: Generates a structured tree of tokens (textparser_token_item) representing the code structure.
- Syntax Highlighting Support: Tokens track metadata like color, background, and flags, making it suitable for building syntax highlighters and editors.
- Extensibility: Language definitions are decoupled from the core parsing logic and written in JSON. They can be loaded at compile time (via a generated header file) or at runtime (by loading a JSON file).
- Python Tooling: Includes Python scripts for: prototyping and validation of the core algorithm, generation of C header files, and other parser verification tools.
- src/: Core C library implementation (textparser.c, adv_regex.c).
- include/: Public header files (textparser.h).
- cli/: Command-line tool mainly for testing and demonstrating the library.
- definitions/: Language definitions (e.g., CFML, JSON).
- python/: Python bindings, prototypes, and validation tools (validate_cfml.py).
- tests/: Unit and integration tests.
- ccat/: Utilities for text processing (e.g., color cat).
- CMake (version 3.15 or higher)
- Ninja build system
- A C compiler (GCC/Clang)
You can use the provided build script for a quick start:
./build.sh
Alternatively, you can build using standard CMake commands:
cmake -B build -G Ninja
cmake --build build
Artifacts (libraries and executables) will be output to the bin/ directory.
textparser is available on the Arch User Repository (AUR). You can install it using an AUR helper like yay:
yay -S textparser
Or view the package details at https://aur.archlinux.org/packages/textparser.
The textparser CLI tool can be used to parse files and visualize the resulting token tree.
bin/textparser path/to/file.cfm
To use TextParser in your C project, include textparser.h and link against libtextparser.
Basic Example:
#include <textparser.h>
#include <stdio.h>
// Assume 'my_lang_definition' is defined elsewhere
extern const textparser_language_definition my_lang_definition;
int main(void) {
    textparser_defer(handle); // Auto-cleanup

    // Open a file
    int err = textparser_openfile("example.txt", 0, &handle);
    if (err) {
        fprintf(stderr, "Failed to open file\n");
        return 1;
    }

    // Parse using the language definition
    err = textparser_parse(handle, &my_lang_definition);
    if (err) {
        fprintf(stderr, "Parse error\n");
        return 1;
    }

    // Iterate through tokens
    for (textparser_token_item *item = textparser_get_first_token(handle); item != NULL; item = item->next) {
        // ... process item ...
    }

    return 0;
}
TextParser uses a JSON-based format to define language grammars. This allows for defining complex syntax rules using regular expressions and hierarchical token structures.
Here is a simplified example of what a JSON definition might look like (based on definitions/json_definition.json):
{
  "name": "json",
  "version": 1.0,
  "startTokens": ["Object", "Array"],
  "tokens": {
    "Object": {
      "type": "StartStop",
      "startRegex": "{",
      "endRegex": "}",
      "textColor": "0xffd700",
      "nestedTokens": ["Key", "String", "Number", "ValueSeparator"]
    },
    "String": {
      "type": "StartStop",
      "startRegex": "\"",
      "endRegex": "\"",
      "textColor": "0xce9178",
      "nestedTokens": ["StringEscape"]
    },
    "Number": {
      "type": "SimpleToken",
      "startRegex": "-?\\d+(?:\\.\\d+)?",
      "textColor": "0xb5cea8"
    }
  }
}
The python/ directory contains tools for verifying the parser's correctness, particularly for CFML.
validate_cfml.py: A robust validation script that compares the AST generated by this project against reference parsers (e.g., a Java-based CFML parser) to ensure high fidelity and correctness.
python3 python/validate_cfml.py /path/to/cfml/files
See the LICENSE file for details.