Skip to content

Commit cd32d8a

Browse files
pirapiraclaude
andcommitted
Implement Phase 1 lexer: Python 3.12 tokenizer
Add a complete Python 3.12 lexer in 9 sub-modules under Lython/Lexer/: - Types, Keywords (35), Operators (32), Delimiters (11) - Monadic lexer (StateT + Except) with char peek/advance helpers - Number lexing: decimal/hex/octal/binary integers, floats, imaginary - String lexing: single/double/triple-quoted, escapes, raw, bytes, f-string start - INDENT/DEDENT generation from leading whitespace with tab expansion - Main tokenization loop with maximal-munch operator dispatch 128 passing #guard test assertions covering all token types, including integration tests for function/class definitions. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 9d85fed commit cd32d8a

File tree

13 files changed

+1525
-10
lines changed

13 files changed

+1525
-10
lines changed

CLAUDE.md

Lines changed: 16 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -13,15 +13,25 @@ Python 3.12 interpreter in Lean4, targeting the leanSpec Ethereum consensus spec
1313
```
1414
Lython.lean — umbrella import for the library
1515
Lython/
16-
Lexer.lean — tokenizer
17-
Parser.lean — PEG parser
18-
AST.lean — Python AST node types
19-
Interpreter.lean — tree-walking interpreter
20-
Runtime.lean — runtime support (types, exceptions, stdlib)
16+
Lexer.lean — tokenizer entry point (imports sub-modules, exposes tokenize)
17+
Lexer/
18+
Types.lean — SourcePos, SourceSpan, TokenKind, Token, LexError
19+
Keywords.lean — Python 3.12 keyword enum + lookup
20+
Operators.lean — Operator and Delimiter enums
21+
State.lean — LexerState, LexerM monad, char peek/advance helpers
22+
Char.lean — Character classification (isIdentStart, isHexDigit, etc.)
23+
Number.lean — Integer/float/imaginary literal lexing
24+
StringLit.lean — String/bytes/raw/f-string literal lexing
25+
Indent.lean — INDENT/DEDENT generation from leading whitespace
26+
Core.lean — Main tokenization loop, operator dispatch
27+
Parser.lean — PEG parser (stub)
28+
AST.lean — Python AST node types (stub)
29+
Interpreter.lean — tree-walking interpreter (stub)
30+
Runtime.lean — runtime support (stub)
2131
Main.lean — CLI entry point
2232
LythonTest.lean — test driver root
2333
LythonTest/
24-
Basic.lean — smoke tests
34+
Basic.lean — lexer tests (keywords, operators, numbers, strings, indent)
2535
```
2636

2737
## Code Style

Lython/Lexer.lean

Lines changed: 17 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,21 @@
1+
import Lython.Lexer.Types
2+
import Lython.Lexer.Keywords
3+
import Lython.Lexer.Operators
4+
import Lython.Lexer.State
5+
import Lython.Lexer.Char
6+
import Lython.Lexer.Indent
7+
import Lython.Lexer.Number
8+
import Lython.Lexer.StringLit
9+
import Lython.Lexer.Core
10+
111
set_option autoImplicit false
212

313
namespace Lython.Lexer
14+
15+
/-- Tokenize a Python source string into an array of tokens. -/
16+
def tokenize (source : String) : Except LexError (Array Token) :=
17+
match lexAll.run (LexerState.initial source) with
18+
| .ok (tokens, _) => .ok tokens
19+
| .error err => .error err
20+
421
end Lython.Lexer

Lython/Lexer/Char.lean

Lines changed: 29 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,29 @@
1+
set_option autoImplicit false
2+
3+
namespace Lython.Lexer
4+
5+
/-- Is the character a valid Python identifier start? (ASCII subset) -/
6+
def isIdentStart (c : Char) : Bool :=
7+
c.isAlpha || c == '_'
8+
9+
/-- Is the character a valid Python identifier continuation? (ASCII subset) -/
10+
def isIdentCont (c : Char) : Bool :=
11+
c.isAlphanum || c == '_'
12+
13+
/-- Is the character an octal digit? -/
14+
def isOctDigit (c : Char) : Bool :=
15+
c.val ≥ 0x30 && c.val ≤ 0x37
16+
17+
/-- Is the character a binary digit? -/
18+
def isBinDigit (c : Char) : Bool :=
19+
c == '0' || c == '1'
20+
21+
/-- Is the character a hex digit? -/
22+
def isHexDigit (c : Char) : Bool :=
23+
c.isDigit || (c.val ≥ 0x61 && c.val ≤ 0x66) || (c.val ≥ 0x41 && c.val ≤ 0x46)
24+
25+
/-- Is the character horizontal whitespace (not newline)? -/
26+
def isHorizWhitespace (c : Char) : Bool :=
27+
c == ' ' || c == '\t' || c == '\x0C'
28+
29+
end Lython.Lexer

0 commit comments

Comments
 (0)