Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
48 changes: 48 additions & 0 deletions .github/workflows/test.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,48 @@
name: Tests

on:
push:
branches: [ main, test-suite ]
pull_request:
branches: [ main ]

jobs:
test:
runs-on: ubuntu-latest
steps:
- name: Checkout code
uses: actions/checkout@v4

- name: Set up OCaml
uses: ocaml/setup-ocaml@v3
with:
ocaml-compiler: 5.1.1
dune-cache: true
opam-repositories: |
default: https://github.com/ocaml/opam-repository.git

- name: Install dependencies
run: |
sudo apt-get update
sudo apt-get install -y npm xz-utils libomp-dev llvm-dev
opam install . --deps-only --update-invariant
npm install --no-save typescript browserify pug-lexer pug-parser pug-walk

- name: Install QuickJS
run: |
curl https://bellard.org/quickjs/quickjs-2021-03-27.tar.xz > quickjs.tar.xz
tar xvf quickjs.tar.xz && rm quickjs.tar.xz
mv quickjs-2021-03-27 quickjs
cd quickjs && make

- name: Install Flow
run: |
git clone --branch v0.183.1 --depth 1 https://github.com/facebook/flow.git flow
ln -s "$(pwd)/flow/src/parser" src/flow_parser
ln -s "$(pwd)/flow/src/third-party/sedlex" src/sedlex
ln -s "$(pwd)/flow/src/hack_forked/utils/collections" src/collections

- name: Run tests
run: |
mkdir -p strings
opam exec -- dune runtest tests/
2 changes: 2 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -23,3 +23,5 @@ bad/
src/flow_parser
src/sedlex
src/collections

tests/integration_test_run/
89 changes: 89 additions & 0 deletions AGENTS.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,89 @@
# Agent Information - String Extractor

This repository contains an OCaml-based internationalization (i18n) string extraction tool. It parses source files (JS, TS, Vue, Pug, HTML) and extracts strings for translation management.

## Documentation

- **[ARCHITECTURE.md](ARCHITECTURE.md)**: Contains a deep-dive into the codebase layout, directory structure, and a comprehensive API reference. **Read this file first** when:
- Starting a new task to understand which files are relevant.
- Investigating the impact of changes across the system.
- Looking for specific functionality or function definitions before searching.

- **[DEVELOPMENT.md](DEVELOPMENT.md)**: Contains instructions for environment setup, build processes for various platforms, and release workflows. **Read this file first** when:
- Setting up the development environment or installing dependencies (OCaml, JS, QuickJS).
- Building the project for development or release.
- Executing the tool for manual verification or testing.
- Managing version numbers or release artifacts.

## Project Overview

- **Language**: OCaml (5.1.1) with some C++ (QuickJS bridge) and JavaScript (parsers via Browserify).
- **Architecture**:
- `src/cli/`: Main entry point, command-line interface, and output generation logic.
- `src/parsing/`: OCaml parsers using `Angstrom` for custom formats and `Flow_parser` for JS.
- `src/quickjs/`: Bridge to QuickJS to run JavaScript-based parsers (TypeScript/Pug) from OCaml.
- `src/utils/`: Common utilities for collection, timing, and I/O.
- **Key Libraries**: `Core`, `Eio` (concurrency), `Angstrom` (parsing), `Yojson`, `Ppx_jane`.

## Essential Commands

### Build
- **Development build**: `dune build src/cli/strings.exe`
- **Watch mode**: `dune build src/cli/strings.exe -w`
- **Release build (MacOS)**: `DUNE_PROFILE=release dune build src/cli/strings.exe`
- **Full release cycle**: See `DEVELOPMENT.md` for `cp`, `strip`, and Docker commands.

### Run
- After building: `./_build/default/src/cli/strings.exe [directory-to-extract-from]`
- The CLI expects to be run from the root of a project containing a `strings/` directory (or it will create one if a `.git` folder is present).

### Installation (Dev Setup)
Refer to `DEVELOPMENT.md` for specific `opam` and `npm` setup steps, as the project has several external dependencies (Flow, QuickJS, pug-lexer, etc.).

## Code Conventions & Patterns

### Parsing Strategy
1. **Direct Parsers**: Simple formats like `.strings`, `HTML`, and basic `Vue` tags are parsed using `Angstrom` in `src/parsing/`.
2. **JS/TS Parsing**:
- Javascript uses `Flow_parser` and a custom AST walker in `src/parsing/js_ast.ml`.
- TypeScript uses the official TS parser running inside QuickJS (`src/quickjs/`).
3. **Pug Parsing**: Has a "fast" OCaml implementation (`src/parsing/pug.ml`) and a "slow" official Pug implementation via QuickJS (`src/quickjs/`).

### Extraction Pattern
- Content is extracted into a `Utils.Collector.t`.
- The collector tracks found strings, potential scripts (to be further parsed), and file errors.
- **Convention**: Strings found inside `L("...")` calls are treated as translations in JS/TS.

### Concurrency
- Uses OCaml 5's `Eio` for multicore processing.
- Parallel traversal of directories is handled in `src/cli/strings.ml` via `Fiber.List.iter` and `Eio.Executor_pool`.
- JS workers (QuickJS) are managed via a pool in `src/quickjs/quickjs.ml`.

## Important Gotchas

- **QuickJS Dependency**: Requires a compiled `quickjs` directory at the project root for building. `dune` rules in `src/quickjs/dune` copy headers and libraries from there.
- **Generated Headers**: `src/quickjs/runtime.h` is generated from `src/quickjs/parsers.js` using `browserify` and `qjsc`.
- **Linking**: MacOS builds use specific link flags (e.g., `ld64.lld`) defined in `src/cli/link_flags.*`.
- **OCamlFormat**: `.ocamlformat` is present; ensure you format OCaml code before submitting.
- **Memory Safety**: Be cautious with C++ FFI code in `src/quickjs/quickjs.cpp`, particularly regarding OCaml's GC interaction (`CAMLparam`, `CAMLreturn`, `caml_release_runtime_system`).

## Testing Approach

- **Inline Tests**: The project uses `ppx_inline_test`. Parsers in `src/parsing/` can be tested directly within the OCaml files or in the `tests/` directory.
- **Test Suite**: A standard test suite is located in `tests/test_runner.ml`. It covers JS, HTML, Pug, and `.strings` file parsing.
- **Integration Tests**: Verification can be performed by running the built binary against fixtures in `tests/fixtures/` and checking the generated output in the `strings/` directory.
- **Debug Flags**: Use `--show-debugging` or `--debug-pug` / `--debug-html` flags in the CLI to inspect internal parsing results.

## Troubleshooting

### "File modified since last read"
If you receive an error stating that a file has been **"modified since it was last read"**, it usually indicates a discrepancy between the file's filesystem timestamp and the internal state tracking.

**Example Error:**
> `Edit failed: The file '/path/to/file' was modified since it was last read. Please read the file again before trying to edit it.`

**Recommended Fix:**
1. Execute `touch filename` to reset the file's modification time to the current system time.
2. Re-read the file using the `view` tool.
3. Attempt the edit again.

80 changes: 80 additions & 0 deletions ARCHITECTURE.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,80 @@
# Architecture Documentation - String Extractor

This document provides a high-level overview of the String Extractor's architecture, directory structure, and internal APIs.

## Project Entry Point

The main entry point of the application is **`src/cli/strings.ml`**. It handles command-line argument parsing using `Core.Command`, sets up the `Eio` runtime, and initiates the file traversal process.

## Directory Structure

```text
/
├── src/
│ ├── cli/ # Main CLI application logic
│ │ ├── strings.ml # CLI entry point, traversal coordination
│ │ ├── vue.ml # Vue-specific parsing and extraction logic
│ │ └── generate.ml # Localization file generation (.strings, .json)
│ ├── parsing/ # Core parsers using Angstrom and Flow
│ │ ├── basic.ml # Common parsing utilities and combinators
│ │ ├── js_ast.ml # Flow AST walker for string extraction
│ │ ├── js.ml # JavaScript string extraction entry point
│ │ ├── pug.ml # Native Pug template parsing
│ │ ├── html.ml # HTML template parsing
│ │ ├── strings.ml # .strings file parsing logic
│ │ └── ... # Other specialized parsers (vue blocks, styles)
│ ├── quickjs/ # Interface to QuickJS for JS/TS/Pug parsing
│ │ ├── quickjs.ml # OCaml FFI to QuickJS
│ │ ├── quickjs.cpp # C++ implementation of the bridge
│ │ └── parsers.js # JS-based parsers running in QuickJS
│ └── utils/ # Shared utility modules
│ ├── collector.ml # State container for collected strings/errors
│ ├── io.ml # I/O helpers
│ ├── timing.ml # Performance measurement
│ └── exception.ml # Exception handling
├── strings/ # Directory where .strings files are managed
├── dune-project # Dune build system configuration
└── README.md # Project overview and usage instructions
```

## Core API Reference

### `src/cli/`
- **`Strings.main`**: Coordinates the entire run, including directory traversal and result generation.
- **`Vue.parse`**: Splits a `.vue` file into its constituent parts (template, script, style).
- **`Generate.write_english`**: Creates `english.strings` and `english.json` from the collected strings.
- **`Generate.write_other`**: Updates existing translations for other languages.

### `src/parsing/`
- **`Parsing.Basic`**: Provides foundational Angstrom parsers for whitespace, strings, and standard error handling.
- **`Parsing.Js.extract_to_collector`**: Entry point for scanning JavaScript source code.
- **`Parsing.Js_ast.extract`**: A comprehensive walker for the Flow AST that identifies and extracts strings from `L("...")` calls.
- **`Parsing.Pug.collect`**: Traverses the native Pug AST to extract strings.
- **`Parsing.Strings.parse`**: Parses existing `.strings` files into a lookup table.

### `src/quickjs/`
- **`Quickjs.extract_to_collector`**: Offloads extraction to QuickJS for TypeScript and advanced Pug templates.

### `src/utils/`
- **`Utils.Collector.create`**: Initializes a new string collection state for a specific file. (type `t = { path: string; strings: string Queue.t; ... }`)
- **`Utils.Collector.blit_transfer`**: Merges results from one collector into another.

## Control Flow
1. **Initiation**: `strings.exe` starts, parses CLI flags, and identifies the target directory.
2. **Traversal**: Uses `Eio` to recursively walk the directory tree.
3. **Dispatch**: For each supported file extension, the corresponding parser in `src/parsing` is invoked.
4. **Collection**: Parsers find strings (usually inside `L()`) and add them to a `Collector.t`.
5. **Generation**: `Generate.ml` aggregates strings from all collectors and updates the `strings/` directory.

## Testing Setup

The project implements a multi-layered testing strategy:

1. **Inline Tests**: Using `ppx_inline_test`, logic can be tested directly within the source files. This is primarily used for parser validation in `src/parsing/`.
2. **Standard Test Suite**: Located in `tests/test_runner.ml`, this suite uses `ppx_expect` and `ppx_assert` to verify:
- JavaScript string extraction via `Flow_parser`.
- HTML and Pug extraction via `SZXX` and `Angstrom`.
- Apple-style `.strings` file parsing.
3. **Integration Testing**: The `tests/fixtures/` directory contains sample files of all supported types. The CLI can be run against these fixtures to verify end-to-end extraction and output generation (`.strings` and `.json` files).

The `tests/dune` file configures the test library and enables inline tests for the module.
17 changes: 17 additions & 0 deletions DEVELOPMENT.md
Original file line number Diff line number Diff line change
@@ -1,11 +1,14 @@
## Local development

### Setup

From the root of the repo:

```sh
brew install opam libomp llvm

opam switch create . ocaml-variants.5.1.1+options --no-install
eval $(opam env)
opam install . --deps-only -t

# Remove old Flow version
Expand All @@ -28,6 +31,7 @@ cd quickjs && make && cd -
```

### MacOS - Development

```sh
# Build
dune build src/cli/strings.exe -w
Expand All @@ -37,6 +41,7 @@ cp _build/default/src/cli/strings.exe strings.mac && ./strings.mac
```

### MacOS - Build & Run

```sh
# Don't forget to update the version number in [strings.ml]

Expand All @@ -46,6 +51,7 @@ rm -f strings.mac && dune clean && DUNE_PROFILE=release dune build src/cli/strin
```

### Docker (Linux) - Build & Run

```sh
# Don't forget to update the version number in [strings.ml]

Expand All @@ -68,3 +74,14 @@ docker run -it --rm \
## apt-get update && apt-get install musl
## /app/strings.linux frontend
```

### Testing

To run the automated tests and generate the translation files, first create the `strings/` directory at the project root, then run the tests. Ensure your opam environment is initialized:

```sh
eval $(opam env)
mkdir -p strings && opam exec -- dune runtest tests/
```

This command builds the project, executes the test suite, and populates the `strings/` directory with `english.strings` (extracted from fixtures) and merged `french.strings`.
2 changes: 1 addition & 1 deletion dune
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
(data_only_dirs node_modules quickjs)
(data_only_dirs node_modules quickjs flow)
(env
(dev
(flags (:standard -warn-error -A))
Expand Down
1 change: 1 addition & 0 deletions src/cli/link_flags.linux.dev.dune
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
()
21 changes: 19 additions & 2 deletions src/quickjs/dune

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 QuickJS dune rules cd to %{project_root}/../../ causing build to run outside repo

The src/quickjs/dune rules cd to %{project_root}/../../ before running browserify, and also reference QuickJS tools/headers via %{project_root}/../../quickjs/....

%{project_root} is already the repository root, so appending /../../ moves the working directory two levels above the repo. That makes paths like src/quickjs/parsers.js and quickjs/qjsc resolve to non-existent locations.

Example (new code):

cd %{project_root}/../../
npx browserify src/quickjs/parsers.js > "$DIR/runtime.js"

src/quickjs/parsers.js is expected to be in the repo root, but after cd it will not be.

Actual behavior: dune rules fail during build/test (cannot find src/quickjs/parsers.js / quickjs/qjsc / quickjs.h / libquickjs.a).
Expected behavior: these rules should run within the repo root (or otherwise reference repo-root-relative paths correctly).

Impact: CI and local builds that rely on these rules will break (QuickJS bridge can’t build, which typically blocks TypeScript/Pug parsing and may prevent the entire binary from linking).

(Refers to lines 24-50)

Recommendation: Remove the ../../ hops and operate from %{project_root} (repo root), e.g. cd %{project_root} and use %{project_root}/quickjs/... paths. Alternatively, avoid cd and pass absolute %{project_root}-based paths directly to browserify/qjsc/cp.

Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Original file line number Diff line number Diff line change
Expand Up @@ -53,8 +53,25 @@
(rule
(targets libomp.a)
(action (bash "
cp /usr/local/Cellar/libomp/17.0.6/lib/libomp.a . &> /dev/null \
|| cp /usr/lib/libgomp.a libomp.a
OUT=\"libomp.a\"
PAWTHS=\"
/usr/local/Cellar/libomp/17.0.6/lib/libomp.a
/usr/lib/x86_64-linux-gnu/libomp.a
/usr/lib/x86_64-linux-gnu/libgomp.a
/usr/lib/libgomp.a
/usr/lib/gcc/x86_64-linux-gnu/*/libgomp.a
/usr/lib/gcc/aarch64-redhat-linux/*/libgomp.a
\"
for p in $PAWTHS; do
for matched_path in $p; do
if [ -f \"$matched_path\" ]; then
cp \"$matched_path\" \"$OUT\"
exit 0
fi
done
done
echo \"Error: Could not find libomp.a or libgomp.a\" >&2
exit 1
"))
(mode standard)
)
54 changes: 54 additions & 0 deletions tests/dune
Original file line number Diff line number Diff line change
@@ -0,0 +1,54 @@
(library
(name parsing_tests)
(inline_tests)
(libraries parsing utils core eio_main)
(preprocess (pps ppx_jane ppx_inline_test))
)

(rule
(alias runtest)
(deps
../src/cli/strings.exe
(source_tree fixtures))
(action
(bash "
TMP_DIR=\"integration_test_run\"
rm -rf $TMP_DIR
mkdir -p $TMP_DIR/strings
mkdir -p $TMP_DIR/.git
printf '\"Hello from HTML\" = \"Bonjour de HTML\";\n' > $TMP_DIR/strings/french.strings
cp -r fixtures $TMP_DIR/
cd $TMP_DIR
../../src/cli/strings.exe fixtures --output strings &> /dev/null
cd ..

if ! grep -q \"Bonjour de HTML\" $TMP_DIR/strings/french.strings; then
echo \"Error: French translation lost in .strings\"
exit 1
fi
if ! grep -q \"Bonjour de HTML\" $TMP_DIR/strings/french.json; then
echo \"Error: French translation lost in .json\"
exit 1
fi
if ! grep -q \"MISSING TRANSLATION - demo.pug\" $TMP_DIR/strings/french.strings; then
echo \"Error: Missing translation marker not found in .strings\"
exit 1
fi

echo \"✅ French integration test passed\"
rm -rf $TMP_DIR

# Help user populate root strings/ if it exists
# We use absolute paths to ensure we hit the real source directory
# even if sandboxed in _build/default/tests
# The dune-project is used as a landmark for the root
# Traverse up to find the root of the source tree
# In dune, we are at _build/default/tests
ROOT_SRC=\"$(cd ../../.. && pwd)\"
if [ -d \"$ROOT_SRC/strings\" ]; then
# Extraction generates 5 strings.
# We pre-populate 3 translations.
printf '\"Hello from HTML\" = \"Bonjour de HTML\";\n\"Hello from JS\" = \"Bonjour de JS\";\n\"Hello from Pug\" = \"Bonjour de Pug\";\n' > \"$ROOT_SRC/strings/french.strings\"
./../src/cli/strings.exe \"$ROOT_SRC/tests/fixtures\" --output \"$ROOT_SRC/strings\"
fi
Comment on lines +47 to +53

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Integration test overwrites real strings/french.strings in repo root (data loss) when strings/ exists

The tests/dune runtest rule contains a “help user populate root strings/” block that writes directly into $ROOT_SRC/strings/french.strings and then runs the extractor with --output "$ROOT_SRC/strings".

New code:

ROOT_SRC="$(cd ../../.. && pwd)"
if [ -d "$ROOT_SRC/strings" ]; then
  printf '...\n' > "$ROOT_SRC/strings/french.strings"
  ./../src/cli/strings.exe "$ROOT_SRC/tests/fixtures" --output "$ROOT_SRC/strings"
fi

Actual behavior: running dune runtest tests/ on a checkout that already has a strings/ directory will clobber the developer’s real strings/french.strings (and mutate other generated files) as a side effect of the test suite.
Expected behavior: tests should be hermetic and should not modify the user’s working tree or overwrite translation files.

Impact: potential data loss / corruption of translation files during routine test runs; also makes tests non-reproducible depending on whether strings/ exists.

Recommendation: Remove this side-effecting block, or gate it behind an explicit env var (e.g. POPULATE_ROOT_STRINGS=1), and never overwrite existing files (use a temp output dir or append-only behavior).

Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: I did this for my convenience to verify that the test command produced the correct output. @SGrondin If you prefer some other location feel free to modify as you see fit.

Comment on lines +41 to +53

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Integration test helper computes repo root incorrectly (lands in _build instead of source root)

In tests/dune, the post-test helper tries to locate the repo root from the assumed working directory _build/default/tests:

# In dune, we are at _build/default/tests
ROOT_SRC="$(cd ../../.. && pwd)"

From _build/default/tests, cd ../../.. resolves to _build, not the repository root. As a result, the conditional if [ -d "$ROOT_SRC/strings" ]; then ... almost always evaluates false (because _build/strings typically doesn’t exist), and the intended helper extraction into the real strings/ directory never runs.

Actual: helper block is effectively skipped even when the real repo-root strings/ exists.
Expected: compute the true repository root (likely one more .., or use %{project_root} from dune in the rule).

Impact: reduces usefulness of the test rule for developers; does not typically break CI because the helper is conditional.

Recommendation: Use dune variables instead of relative traversal, e.g. ROOT_SRC="%{project_root}" (or %{workspace_root} depending on dune version), or adjust the traversal to cd ../../../.. if you truly start from _build/default/tests.

Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

")))
1 change: 1 addition & 0 deletions tests/fixtures/demo.html
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
<i18n>Hello from HTML</i18n>
1 change: 1 addition & 0 deletions tests/fixtures/demo.js
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
L('Hello from JS');
1 change: 1 addition & 0 deletions tests/fixtures/demo.pug
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
i18n Hello from Pug
13 changes: 13 additions & 0 deletions tests/fixtures/demo.vue
Original file line number Diff line number Diff line change
@@ -0,0 +1,13 @@
<template>
<i18n>Hello from Vue Template</i18n>
</template>

<script>
export default {
data() {
return {
msg: L('Hello from Vue Script')
}
}
}
</script>
Loading