TextParser is a high-performance C library that parses text (currently CFML and JSON) into Abstract Syntax Trees using regex grammars. It is designed for building syntax highlighters, language servers, and other code-related tools.

bokic/textparser

TextParser

TextParser is a high-performance, extensible text parsing library written in C. It uses regular expressions to define language grammars and generates a hierarchical Abstract Syntax Tree (AST) for parsed documents.

The project currently supports CFML (ColdFusion Markup Language) and JSON, and its flexible architecture allows new language definitions to be added easily.

Features

  • High Performance: Written in optimized C for fast parsing of large codebases.
  • Small Footprint: The library is designed to be small and easy to integrate into other projects.
  • Minimal Dependencies: The only dependency is the PCRE2 library, used for regex matching.
  • Regex-Based Grammars: Define language syntax using flexible regular expressions.
  • Hierarchical AST: Generates a structured tree of tokens (textparser_token_item) representing the code structure.
  • Syntax Highlighting Support: Tokens track metadata like color, background, and flags, making it suitable for building syntax highlighters and editors.
  • Extensibility: Language definitions are decoupled from the core parsing logic. They are written in JSON and can be loaded at compile time (via a generated header file) or at runtime (by loading a JSON file).
  • Python Tooling: Includes Python scripts for prototyping and validating the core algorithm, generating C header files, and verifying the parser.
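
As an illustration of the compile-time path described under Extensibility, the sketch below embeds a JSON definition in a C header as a string constant. It is not the project's actual generator, which may well emit a struct initializer instead; the function names `c_string_literal` and `definition_to_header`, the include guard, and the emitted symbol name are all assumptions made for this example.

```python
import json


def c_string_literal(text: str) -> str:
    """Escape text for use inside a C string literal.

    json.dumps (with its default ensure_ascii=True) only emits printable
    ASCII, so backslashes and double quotes are the only characters that
    need escaping here.
    """
    return text.replace("\\", "\\\\").replace('"', '\\"')


def definition_to_header(definition: dict, symbol: str) -> str:
    """Render a language definition as a C header (hypothetical layout)."""
    compact = json.dumps(definition, separators=(",", ":"), sort_keys=True)
    guard = symbol.upper() + "_H"
    return (
        f"#ifndef {guard}\n"
        f"#define {guard}\n\n"
        f'static const char {symbol}[] = "{c_string_literal(compact)}";\n\n'
        f"#endif /* {guard} */\n"
    )


if __name__ == "__main__":
    demo = {"name": "json", "startTokens": ["Object", "Array"]}
    print(definition_to_header(demo, "json_definition_text"))
```

Embedding the definition as text keeps the generator trivial, at the cost of parsing the JSON once at startup; a generator that emits C structs directly would avoid that step.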

Project Structure

  • src/: Core C library implementation (textparser.c, adv_regex.c).
  • include/: Public header files (textparser.h).
  • cli/: Command-line tool mainly for testing and demonstrating the library.
  • definitions/: Language definitions (e.g., CFML, JSON).
  • python/: Python bindings, prototypes, and validation tools (validate_cfml.py).
  • tests/: Unit and integration tests.
  • ccat/: Utilities for text processing (e.g., color cat).
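
To show how the color metadata carried by tokens can drive a colorizing tool such as ccat, here is a small sketch that turns a `0xRRGGBB` string (the format used for `textColor` in the language definition example in this README) into an ANSI 24-bit foreground escape sequence. The helper names `parse_color` and `colorize` are hypothetical, and whether ccat itself emits truecolor escapes is an assumption.

```python
def parse_color(value: str) -> tuple[int, int, int]:
    """Split a "0xRRGGBB" string into (red, green, blue) components."""
    rgb = int(value, 16)  # int() accepts an optional "0x" prefix for base 16
    if not 0 <= rgb <= 0xFFFFFF:
        raise ValueError(f"color out of range: {value}")
    return (rgb >> 16) & 0xFF, (rgb >> 8) & 0xFF, rgb & 0xFF


def colorize(text: str, color: str) -> str:
    """Wrap text in an ANSI truecolor foreground escape, then reset."""
    r, g, b = parse_color(color)
    return f"\x1b[38;2;{r};{g};{b}m{text}\x1b[0m"


if __name__ == "__main__":
    # Prints a gold-colored brace on terminals that support 24-bit color.
    print(colorize("{", "0xffd700"))
```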

Build Instructions

Prerequisites

  • CMake (version 3.15 or higher)
  • Ninja build system
  • A C compiler (GCC/Clang)

Building

You can use the provided build script for a quick start:

./build.sh

Alternatively, you can build using standard CMake commands:

cmake -B build -G Ninja
cmake --build build

Artifacts (libraries and executables) will be output to the bin/ directory.

Installation

Arch Linux

textparser is available on the Arch User Repository (AUR). You can install it using an AUR helper like yay:

yay -S textparser

Or view the package details at https://aur.archlinux.org/packages/textparser.

Usage

CLI Tool

The textparser CLI tool can be used to parse files and visualize the resulting token tree.

bin/textparser path/to/file.cfm

C Library Integration

To use TextParser in your C project, include textparser.h and link against libtextparser.

Basic Example:

#include <textparser.h>
#include <stdio.h>

// Assume 'my_lang_definition' is defined elsewhere
extern const textparser_language_definition my_lang_definition;

int main(void) {
    textparser_defer(handle); // Auto-cleanup

    // Open a file
    int err = textparser_openfile("example.txt", 0, &handle);
    if (err) {
        fprintf(stderr, "Failed to open file\n");
        return 1;
    }

    // Parse using the language definition
    err = textparser_parse(handle, &my_lang_definition);
    if (err) {
        fprintf(stderr, "Parse error\n");
        return 1;
    }

    // Iterate through tokens
    for (textparser_token_item *item = textparser_get_first_token(handle); item != NULL; item = item->next) {
        // ... process item ...
    }
    
    return 0;
}

Language Definition Example

TextParser uses a JSON-based format to define language grammars. This allows for defining complex syntax rules using regular expressions and hierarchical token structures.

Here is a simplified example of what a JSON definition might look like (based on definitions/json_definition.json):

{
  "name": "json",
  "version": 1.0,
  "startTokens": ["Object", "Array"],
  "tokens": {
    "Object": {
      "type": "StartStop",
      "startRegex": "{",
      "endRegex": "}",
      "textColor": "0xffd700",
      "nestedTokens": ["Key", "String", "Number", "ValueSeparator"]
    },
    "String": {
      "type": "StartStop",
      "startRegex": "\"",
      "endRegex": "\"",
      "textColor": "0xce9178",
      "nestedTokens": ["StringEscape"]
    },
    "Number": {
      "type": "SimpleToken",
      "startRegex": "-?\\d+(?:\\.\\d+)?",
      "textColor": "0xb5cea8"
    }
  }
}
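
To make the roles of `StartStop`, `SimpleToken`, and `nestedTokens` more concrete, the following Python sketch interprets a definition of this shape and builds a nested token tree. It reflects one plausible reading of the example above, not the library's confirmed behavior: it uses Python's `re` module rather than PCRE2, escapes the brace patterns, skips any character no token claims, and the `parse` function and token dictionary layout are invented for illustration.

```python
import re

# Trimmed-down definition in the same shape as the JSON example.
DEFINITION = {
    "startTokens": ["Object"],
    "tokens": {
        "Object": {"type": "StartStop", "startRegex": r"\{", "endRegex": r"\}",
                   "nestedTokens": ["Object", "String", "Number"]},
        "String": {"type": "StartStop", "startRegex": '"', "endRegex": '"',
                   "nestedTokens": []},
        "Number": {"type": "SimpleToken", "startRegex": r"-?\d+(?:\.\d+)?"},
    },
}


def parse(text, definition, names, pos=0, end_regex=None):
    """Return (tokens, new_pos) for the tokens found from pos onward."""
    tokens = []
    while pos < len(text):
        # The enclosing token's end pattern takes priority over nested tokens.
        if end_regex is not None:
            closing = re.compile(end_regex).match(text, pos)
            if closing:
                return tokens, closing.end()
        for name in names:
            spec = definition["tokens"][name]
            opening = re.compile(spec["startRegex"]).match(text, pos)
            if opening is None or opening.end() == pos:
                continue  # no match, or an empty match that would not advance
            token = {"name": name, "start": pos, "children": []}
            pos = opening.end()
            if spec["type"] == "StartStop":
                # Recurse until the matching end pattern is consumed.
                token["children"], pos = parse(
                    text, definition, spec.get("nestedTokens", []),
                    pos, spec["endRegex"])
            token["end"] = pos
            tokens.append(token)
            break
        else:
            pos += 1  # no token starts here; skip one character
    return tokens, pos


if __name__ == "__main__":
    source = '{"a": 1, "b": {"c": -2.5}}'
    tree, _ = parse(source, DEFINITION, DEFINITION["startTokens"])
    print(tree)
```

Checking the end pattern before trying nested tokens is what lets a closing quote terminate a `String` instead of opening a new one.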

Development and Verification

The python/ directory contains tools for verifying the parser's correctness, particularly for CFML.

  • validate_cfml.py: A validation script that compares the AST generated by this project against reference parsers (e.g., a Java-based CFML parser) to ensure high fidelity and correctness. Run it with:
python3 python/validate_cfml.py /path/to/cfml/files
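
The kind of comparison such a validation performs can be sketched as a recursive tree diff. The code below is not taken from validate_cfml.py; it assumes both parsers' output has already been normalized into lists of dictionaries with `name`, `start`, `end`, and `children` keys, which is a hypothetical shape chosen for this example.

```python
def diff_trees(ours, reference, path="root"):
    """Return human-readable differences between two token lists."""
    problems = []
    if len(ours) != len(reference):
        problems.append(
            f"{path}: {len(ours)} children vs {len(reference)} in reference")
    # zip stops at the shorter list; the length mismatch is reported above.
    for index, (a, b) in enumerate(zip(ours, reference)):
        here = f"{path}/{index}"
        for field in ("name", "start", "end"):
            if a[field] != b[field]:
                problems.append(f"{here}: {field} {a[field]!r} != {b[field]!r}")
        problems.extend(
            diff_trees(a.get("children", []), b.get("children", []), here))
    return problems


if __name__ == "__main__":
    ours = [{"name": "Tag", "start": 0, "end": 10, "children": [
        {"name": "Attr", "start": 4, "end": 8, "children": []}]}]
    reference = [{"name": "Tag", "start": 0, "end": 10, "children": [
        {"name": "Attr", "start": 4, "end": 9, "children": []}]}]
    for line in diff_trees(ours, reference):
        print(line)
```

Reporting a path such as `root/0/0` for each mismatch makes it easier to locate the offending token in a large file than a plain pass/fail result.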

License

See LICENSE file for details.
