@meilame-tayebjee commented Nov 7, 2025

  • Conceptually a text classification model is composed of:

    • a Tokenizer that outputs a (batch_size, output_dim) tensor:

      • this enables support for Hugging Face tokenizers, which can be trained, saved, and loaded from the HF Hub... HF tokenizers also provide token -> character-offset methods, so it is very easy to map token attributions back to characters and words
      • conceptually, TF-IDF is also a "tokenizer": the only difference is that the second dimension is vocabulary_size
      • all tokenizers should share the same output structure
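As a rough illustration of that shared output contract, here is a toy sketch in pure Python (the `ToyTokenizer` name and `encode_batch` method are illustrative stand-ins, not this PR's actual API):

```python
# Toy sketch of the tokenizer output contract: raw texts in, a rectangular
# (batch_size, output_dim) structure of token ids out. ToyTokenizer and
# encode_batch are hypothetical names, not this PR's API.

class ToyTokenizer:
    def __init__(self, vocab, output_dim):
        self.vocab = vocab            # token -> id; 0 is reserved for padding/unknown
        self.vocab_size = len(vocab) + 1
        self.output_dim = output_dim

    def encode_batch(self, texts):
        ids = [[self.vocab.get(tok, 0) for tok in text.split()] for text in texts]
        # truncate or right-pad every row so the output is rectangular
        return [row[:self.output_dim] + [0] * (self.output_dim - len(row))
                for row in ids]

tok = ToyTokenizer({"the": 1, "cat": 2, "sat": 3}, output_dim=5)
batch = tok.encode_batch(["the cat sat", "the cat"])
# batch == [[1, 2, 3, 0, 0], [1, 2, 0, 0, 0]]  -> shape (2, 5)
```

A real HF tokenizer or a TF-IDF vectorizer slots in behind the same contract; only the meaning of the second dimension changes.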
    • A PyTorch model, TextClassificationModel, that does not include the tokenizer and has nothing related to it

      • Its input should be a vectorized/tokenized text in the form of a 2D-tensor (not raw text), optionally an additional tensor (categorical variables)
      • It outputs raw scores (logits), not softmaxed probabilities.
      • No predict function here: taking raw text and outputting predictions is the wrapper's job, not the model's.
      • It is itself composed of 3 components:
        • The TextEmbedder:
          - It is composed of an nn.Embedding layer and potentially attention logic
          - The user cannot pass a custom PyTorch Module here (because we impose some structure, e.g. having an Embedding layer)
          - It is optional, because the input text tensor can already be "vectorized"! (ex: TF-IDF)
        • the CategoricalVariableNet: Embedding layers with the same forward logic as before
        • the ClassificationHead: a neural net that takes a (batch_size, embedding_dim) tensor and outputs a (batch_size, num_classes) tensor
          - here we provide some freedom: the user can pass any neural net provided the input and output dimensions are coherent with the other components
      • The PyTorch model checks that all the components fit together. Once you have managed to build it, the forward pass should work seamlessly
      • All three of the components can live independently, and all of them are nn.Module
        • For instance, after training, one can use only the TextEmbedder component to make visualizations of the embedding space, or even train a sklearn classifier on top! Or fine-tune with another ClassificationHead...
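To make the composition concrete, here is a pure-Python sketch of how the pieces fit together (the real components are nn.Modules, and the CategoricalVariableNet is omitted here for brevity; shapes follow the description above, but all names and internals are illustrative):

```python
# Pure-Python sketch of the component composition. In the real code these are
# nn.Modules; here the internals are toy stand-ins, only the shapes matter.
import random

random.seed(0)

class TextEmbedder:
    """(batch, seq_len) token ids -> (batch, embedding_dim), mean-pooled."""
    def __init__(self, vocab_size, embedding_dim):
        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
        self.table = [[random.gauss(0, 1) for _ in range(embedding_dim)]
                      for _ in range(vocab_size)]

    def forward(self, token_ids):
        out = []
        for row in token_ids:
            vecs = [self.table[i] for i in row]
            out.append([sum(col) / len(vecs) for col in zip(*vecs)])
        return out

class ClassificationHead:
    """(batch, embedding_dim) -> (batch, num_classes) raw scores (no softmax)."""
    def __init__(self, embedding_dim, num_classes):
        self.embedding_dim = embedding_dim
        self.weights = [[random.gauss(0, 1) for _ in range(embedding_dim)]
                        for _ in range(num_classes)]

    def forward(self, embeddings):
        return [[sum(w * x for w, x in zip(ws, emb)) for ws in self.weights]
                for emb in embeddings]

class TextClassificationModel:
    """Composes the components and checks that their dimensions fit together."""
    def __init__(self, text_embedder, classification_head):
        assert text_embedder.embedding_dim == classification_head.embedding_dim
        self.text_embedder = text_embedder
        self.classification_head = classification_head

    def forward(self, token_ids):
        return self.classification_head.forward(self.text_embedder.forward(token_ids))

model = TextClassificationModel(TextEmbedder(10, 4), ClassificationHead(4, 3))
scores = model.forward([[1, 2, 3], [4, 5, 0]])  # shape (2, 3), raw scores
```

The assertion in the constructor is the kind of coherence check the description mentions: a mismatched head is rejected at build time, not at forward time.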
    • LightningModule and dataset have minimal changes

    • The wrapper class:

      • its role is to create, orchestrate and validate the coherence between all those components given the hyperparameters and a tokenizer
        • for instance, it will make sure that the vocab_size of the tokenizer matches the vocab_size of the text embedding layer
      • it is the only object with a broad overview: all the components are otherwise completely independent by construction
      • it is the object in direct contact with the user, so it handles everything from raw text to prediction: this one has a predict method! Explainability lives here too.
      • Technically, it can directly inherit from mlflow.pyfunc.PythonModel
      • the train method works as before, etc.
  • So, all in all, you can let the wrapper manipulate the objects for you, or you can build the wrapper and/or the PyTorch model (TextClassificationModel) from your own custom objects (instantiate your own TextEmbedder, CategoricalVariableNet and ClassificationHead...). Great flexibility. And you can also use ANY tokenizer from Hugging Face.
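The wrapper's two duties described above (validate coherence at construction time, then go from raw text to a prediction) can be sketched like this; `Wrapper`, `StubTokenizer` and `StubModel` are hypothetical stand-ins, not this PR's actual classes:

```python
# Sketch of the wrapper's role: check tokenizer/model coherence up front,
# then chain tokenize -> forward -> argmax in predict. All names hypothetical.

class Wrapper:
    def __init__(self, tokenizer, model):
        # coherence check: tokenizer vocab must match the model's embedding table
        if tokenizer.vocab_size != model.vocab_size:
            raise ValueError(
                f"tokenizer vocab_size ({tokenizer.vocab_size}) != "
                f"model vocab_size ({model.vocab_size})"
            )
        self.tokenizer = tokenizer
        self.model = model

    def predict(self, texts):
        token_ids = self.tokenizer.encode_batch(texts)  # raw text -> (batch, dim) ids
        scores = self.model.forward(token_ids)          # ids -> raw scores
        return [max(range(len(row)), key=row.__getitem__) for row in scores]  # argmax

class StubTokenizer:
    vocab_size = 3
    def encode_batch(self, texts):
        return [[len(t) % self.vocab_size] for t in texts]

class StubModel:
    vocab_size = 3
    def forward(self, token_ids):
        # toy scoring: class 0 scored by the first token id, class 1 by a constant
        return [[float(row[0]), 0.5] for row in token_ids]

preds = Wrapper(StubTokenizer(), StubModel()).predict(["hi", "abc"])
# preds == [0, 1]
```

Because the coherence check lives in the constructor, a mismatched tokenizer/model pair fails loudly before any text is processed.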

reorder files
add first tests
ClassificationHead and CatVarNet objects
maximum flexibility
TextFeaturizer TODO
components are text_embedder, categorical_var_net and classification_head
text_embedder is optional, but tokenizer is not (TF-IDF is conceptually a tokenizer)
all are customizable
need to add doc at some point
…l, training, explainability

moved prediction logic here
constant output dim if specified, otherwise pad to longest sequence in batch
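The commit message above describes the padding rule; as a minimal pure-Python sketch (the `pad_batch` name is hypothetical):

```python
# Padding rule from the commit message: truncate/pad every row to a constant
# output_dim when one is specified, otherwise pad to the longest sequence in
# the batch. The pad_batch name is hypothetical.
def pad_batch(rows, output_dim=None, pad_id=0):
    width = output_dim if output_dim is not None else max(len(r) for r in rows)
    return [r[:width] + [pad_id] * (width - len(r)) for r in rows]

pad_batch([[1, 2, 3], [4]])                # -> [[1, 2, 3], [4, 0, 0]]
pad_batch([[1, 2, 3], [4]], output_dim=2)  # -> [[1, 2], [4, 0]]
```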
@meilame-tayebjee meilame-tayebjee marked this pull request as ready for review November 19, 2025 11:08
meilame-tayebjee and others added 3 commits November 20, 2025 12:41
has been put into tests into a suitable format
Related to c7307f5

Co-authored-by: Cédric Couralet <cedric.couralet@insee.fr>
@micedre micedre merged commit 4e07bff into main Nov 20, 2025
3 checks passed
@micedre micedre deleted the hf_tokenizer branch November 20, 2025 13:00