@meilame-tayebjee commented Nov 7, 2025

  • Conceptually a text classification model is composed of:

    • a Tokenizer that outputs a (batch_size, output_dim) tensor:

      • this enables support for Hugging Face tokenizers, which can be trained, saved, and loaded from the HF Hub... HF tokenizers also provide token -> character-offset methods, so it is very easy to map token attributions back to characters and words
      • conceptually, TF-IDF is also a "tokenizer": the only difference is that the second dimension is vocabulary_size
      • all tokenizers should share the same output structure
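As a rough illustration of that shared output contract, here is a toy sketch in pure Python (the `ToyTokenizer` name and `encode_batch` method are illustrative stand-ins, not this PR's actual API):

```python
# Toy sketch of the tokenizer output contract: raw texts in, a rectangular
# (batch_size, output_dim) structure of token ids out. ToyTokenizer and
# encode_batch are hypothetical names, not this PR's API.

class ToyTokenizer:
    def __init__(self, vocab, output_dim):
        self.vocab = vocab            # token -> id; 0 is reserved for padding/unknown
        self.vocab_size = len(vocab) + 1
        self.output_dim = output_dim

    def encode_batch(self, texts):
        ids = [[self.vocab.get(tok, 0) for tok in text.split()] for text in texts]
        # truncate or right-pad every row so the output is rectangular
        return [row[:self.output_dim] + [0] * (self.output_dim - len(row))
                for row in ids]

tok = ToyTokenizer({"the": 1, "cat": 2, "sat": 3}, output_dim=5)
batch = tok.encode_batch(["the cat sat", "the cat"])
# batch == [[1, 2, 3, 0, 0], [1, 2, 0, 0, 0]]  -> shape (2, 5)
```

A real HF tokenizer or a TF-IDF vectorizer slots in behind the same contract; only the meaning of the second dimension changes.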
    • A PyTorch model, TextClassificationModel, that does not include the tokenizer and has nothing related to it

      • Its input should be a vectorized/tokenized text in the form of a 2D-tensor (not raw text), optionally an additional tensor (categorical variables)
      • It outputs raw scores (logits), not softmaxed probabilities.
      • No predict function here: taking raw text and outputting predictions is the wrapper's job, not the model's.
      • It is itself composed of 3 components:
        • The TextEmbedder:
          - It is composed of an nn.Embedding layer and potentially attention logic
          - The user cannot pass a custom PyTorch Module here (because we impose some structure, e.g. having an Embedding layer)
          - It is optional, because the input text tensor can already be "vectorized"! (ex: TF-IDF)
        • the CategoricalVariableNet: Embedding layers with the same forward logic as before
        • the ClassificationHead: a neural net that takes a (batch_size, embedding_dim) tensor and outputs a (batch_size, num_classes) tensor
          - here we provide some freedom: the user can pass any neural net provided the input and output dimensions are coherent with the other components
      • The PyTorch model checks that all the components fit together. Once you have managed to build it, the forward pass should work seamlessly
      • All three of the components can live independently, and all of them are nn.Module
        • For instance, after training, one can use only the TextEmbedder component to make visualizations of the embedding space, or even train a sklearn classifier on top! Or fine-tune with another ClassificationHead...
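To make the composition concrete, here is a pure-Python sketch of how the pieces fit together (the real components are nn.Modules, and the CategoricalVariableNet is omitted here for brevity; shapes follow the description above, but all names and internals are illustrative):

```python
# Pure-Python sketch of the component composition. In the real code these are
# nn.Modules; here the internals are toy stand-ins, only the shapes matter.
import random

random.seed(0)

class TextEmbedder:
    """(batch, seq_len) token ids -> (batch, embedding_dim), mean-pooled."""
    def __init__(self, vocab_size, embedding_dim):
        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
        self.table = [[random.gauss(0, 1) for _ in range(embedding_dim)]
                      for _ in range(vocab_size)]

    def forward(self, token_ids):
        out = []
        for row in token_ids:
            vecs = [self.table[i] for i in row]
            out.append([sum(col) / len(vecs) for col in zip(*vecs)])
        return out

class ClassificationHead:
    """(batch, embedding_dim) -> (batch, num_classes) raw scores (no softmax)."""
    def __init__(self, embedding_dim, num_classes):
        self.embedding_dim = embedding_dim
        self.weights = [[random.gauss(0, 1) for _ in range(embedding_dim)]
                        for _ in range(num_classes)]

    def forward(self, embeddings):
        return [[sum(w * x for w, x in zip(ws, emb)) for ws in self.weights]
                for emb in embeddings]

class TextClassificationModel:
    """Composes the components and checks that their dimensions fit together."""
    def __init__(self, text_embedder, classification_head):
        assert text_embedder.embedding_dim == classification_head.embedding_dim
        self.text_embedder = text_embedder
        self.classification_head = classification_head

    def forward(self, token_ids):
        return self.classification_head.forward(self.text_embedder.forward(token_ids))

model = TextClassificationModel(TextEmbedder(10, 4), ClassificationHead(4, 3))
scores = model.forward([[1, 2, 3], [4, 5, 0]])  # shape (2, 3), raw scores
```

The assertion in the constructor is the kind of coherence check the description mentions: a mismatched head is rejected at build time, not at forward time.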
    • LightningModule and dataset have minimal changes

    • The wrapper class:

      • its role is to create, orchestrate and validate the coherence between all those components given the hyperparameters and a tokenizer
        • for instance, it will make sure that the vocab_size of the tokenizer matches the vocab_size of the text embedding layer
      • it is the only object with a broad overview: all the components are otherwise completely independent by construction
      • it is the object in direct contact with the user, so it handles everything from raw text to prediction: this one has a predict method! Explainability lives here too.
      • Technically, it can directly inherit from mlflow.pyfunc.PythonModel
      • the train method works as before, etc.
  • So, all in all, you can let the wrapper manipulate the objects for you, or you can build the wrapper and/or the PyTorch model (TextClassificationModel) from your own custom objects (instantiate your own TextEmbedder, CategoricalVariableNet and ClassificationHead...). Great flexibility. And you can also use ANY tokenizer from Hugging Face.
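The wrapper's two duties described above (validate coherence at construction time, then go from raw text to a prediction) can be sketched like this; `Wrapper`, `StubTokenizer` and `StubModel` are hypothetical stand-ins, not this PR's actual classes:

```python
# Sketch of the wrapper's role: check tokenizer/model coherence up front,
# then chain tokenize -> forward -> argmax in predict. All names hypothetical.

class Wrapper:
    def __init__(self, tokenizer, model):
        # coherence check: tokenizer vocab must match the model's embedding table
        if tokenizer.vocab_size != model.vocab_size:
            raise ValueError(
                f"tokenizer vocab_size ({tokenizer.vocab_size}) != "
                f"model vocab_size ({model.vocab_size})"
            )
        self.tokenizer = tokenizer
        self.model = model

    def predict(self, texts):
        token_ids = self.tokenizer.encode_batch(texts)  # raw text -> (batch, dim) ids
        scores = self.model.forward(token_ids)          # ids -> raw scores
        return [max(range(len(row)), key=row.__getitem__) for row in scores]  # argmax

class StubTokenizer:
    vocab_size = 3
    def encode_batch(self, texts):
        return [[len(t) % self.vocab_size] for t in texts]

class StubModel:
    vocab_size = 3
    def forward(self, token_ids):
        # toy scoring: class 0 scored by the first token id, class 1 by a constant
        return [[float(row[0]), 0.5] for row in token_ids]

preds = Wrapper(StubTokenizer(), StubModel()).predict(["hi", "abc"])
# preds == [0, 1]
```

Because the coherence check lives in the constructor, a mismatched tokenizer/model pair fails loudly before any text is processed.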

reorder files
add first tests
ClassificationHead and CatVarNet objects
maximum flexibility
TextFeaturizer TODO
components are text_embedder, categorical_var_net and classification_head
text_embedder is optional, but tokenizer is not (TF-IDF is conceptually a tokenizer)
all are customizable
need to add doc at some point
…l, training, explainability

moved prediction logic here
constant output dim if specified, otherwise pad to longest sequence in batch
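The commit message above describes the padding rule; as a minimal pure-Python sketch (the `pad_batch` name is hypothetical):

```python
# Padding rule from the commit message: truncate/pad every row to a constant
# output_dim when one is specified, otherwise pad to the longest sequence in
# the batch. The pad_batch name is hypothetical.
def pad_batch(rows, output_dim=None, pad_id=0):
    width = output_dim if output_dim is not None else max(len(r) for r in rows)
    return [r[:width] + [pad_id] * (width - len(r)) for r in rows]

pad_batch([[1, 2, 3], [4]])                # -> [[1, 2, 3], [4, 0, 0]]
pad_batch([[1, 2, 3], [4]], output_dim=2)  # -> [[1, 2], [4, 0]]
```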
@meilame-tayebjee meilame-tayebjee marked this pull request as ready for review November 19, 2025 11:08
meilame-tayebjee and others added 3 commits November 20, 2025 12:41
has been put into tests into a suitable format
Related to c7307f5

Co-authored-by: Cédric Couralet <cedric.couralet@insee.fr>
@micedre micedre merged commit 4e07bff into main Nov 20, 2025
3 checks passed
@micedre micedre deleted the hf_tokenizer branch November 20, 2025 13:00