Full refactor of the library #16
Merged
Conversation
reorder files, add first tests
ClassificationHead and CatVarNet objects: maximum flexibility; TextFeaturizer TODO
components are text_embedder, categorical_var_net and classification_head; text_embedder is optional, but the tokenizer is not (TF-IDF is conceptually a tokenizer); all are customizable; need to add docs at some point
in anticipation of adding TF-IDF
…l, training, explainability; moved prediction logic here
constant output dim if specified, otherwise pad to longest sequence in batch
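A minimal sketch of that padding behaviour, assuming a plain list of token-id sequences (the function name and signature are illustrative, not the library's API):

```python
import torch


def pad_token_ids(batch: list[list[int]], output_dim: int | None = None, pad_id: int = 0) -> torch.Tensor:
    """Pad (or truncate) a batch of token-id sequences to a (batch_size, output_dim) tensor.

    If output_dim is given, every sequence gets that length; otherwise the
    batch is padded to its longest sequence.
    """
    target = output_dim if output_dim is not None else max(len(seq) for seq in batch)
    padded = [seq[:target] + [pad_id] * (target - len(seq)) for seq in batch]
    return torch.tensor(padded, dtype=torch.long)


# Pad to the longest sequence in the batch (here: length 4)
print(pad_token_ids([[5, 2, 7], [1, 3, 4, 9]]).shape)  # torch.Size([2, 4])
```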
…ar level plot functions to be fixed!
micedre reviewed Nov 13, 2025
also add tests and a comparison with WordPiece
aligning with NGramTokenizer
for clarity and consistency
micedre reviewed Nov 20, 2025
has been put into a suitable format for the tests
Related to c7307f5
Co-authored-by: Cédric Couralet <cedric.couralet@insee.fr>
micedre approved these changes Nov 20, 2025
Conceptually a text classification model is composed of:

- a Tokenizer that outputs a `(batch_size, output_dim)` tensor
- a PyTorch model `TextClassificationModel` that does not include the tokenizer and has nothing related to it:
  - `TextEmbedder`:
    - it is composed of an `nn.Embedding` layer and potentially attention logic
    - the user cannot pass a custom PyTorch Module here (because we impose some structure, e.g. having an Embedding layer)
    - it is optional, because the input text tensor can already be "vectorized"! (e.g. TF-IDF)
  - `CategoricalVariableNet`: Embedding layers and the same forward logic as before
  - `ClassificationHead`: a neural net that takes a `(batch_size, embedding_dim)` tensor and outputs a `(batch_size, num_classes)` tensor; here we provide some freedom: the user can pass any neural net provided the input and output dimensions are coherent with the other components (see the sketch after this list)
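Below is a minimal sketch of how such components could fit together, only to illustrate the shape contract described above. The constructor signatures are assumptions, not the library's actual API, and `CategoricalVariableNet` is omitted for brevity:

```python
import torch
from torch import nn


class TextEmbedder(nn.Module):
    """Optional component: maps (batch_size, seq_len) token ids to (batch_size, embedding_dim)."""

    def __init__(self, vocab_size: int, embedding_dim: int):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # Mean-pool the token embeddings; the real component may add attention logic.
        return self.embedding(token_ids).mean(dim=1)


class ClassificationHead(nn.Module):
    """Any net mapping (batch_size, embedding_dim) to (batch_size, num_classes)."""

    def __init__(self, embedding_dim: int, num_classes: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embedding_dim, 64), nn.ReLU(), nn.Linear(64, num_classes)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class TextClassificationModel(nn.Module):
    """Composes the components; knows nothing about the tokenizer."""

    def __init__(self, text_embedder: nn.Module | None, classification_head: nn.Module):
        super().__init__()
        self.text_embedder = text_embedder
        self.classification_head = classification_head

    def forward(self, text: torch.Tensor) -> torch.Tensor:
        # Without a TextEmbedder the input is assumed to be already vectorized (e.g. TF-IDF).
        features = self.text_embedder(text) if self.text_embedder is not None else text
        return self.classification_head(features)
```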
You can reuse the `nn.Module` `TextEmbedder` component to make visualizations of the embedding space, or even train a sklearn classifier on top! Or even fine-tune it with another `ClassificationHead`...

The LightningModule and dataset have minimal changes.
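For instance, extracting embeddings from the `TextEmbedder` and fitting a scikit-learn classifier on top could look like the following. This reuses the hypothetical classes from the sketch above, with random data standing in for a real tokenized corpus:

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

embedder = TextEmbedder(vocab_size=1000, embedding_dim=32)  # in practice: the trained component
token_ids = torch.randint(1, 1000, (64, 20))                # stand-in for a tokenized corpus
labels = np.random.randint(0, 3, size=64)                   # stand-in for the class labels

with torch.no_grad():
    features = embedder(token_ids).numpy()                  # (64, 32) embedding matrix

clf = LogisticRegression(max_iter=1000).fit(features, labels)
print(clf.score(features, labels))
```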
The wrapper class:

- the `vocab_size` of the tokenizer is the `vocab_size` of the text embedding layer
- `mlflow.pyfunc`

So all in all, you can let the wrapper manipulate the objects for you. Or you can build the wrapper and/or the PyTorch model (`TextClassificationModel`) from your own custom objects (instantiate your own `TextEmbedder`, `CategoricalVariableNet` and `ClassificationHead`...). So there is great flexibility. And you can also use ANY tokenizer from HuggingFace.
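As an illustration of the "any tokenizer" point: since the PyTorch model only consumes a token-id tensor, a HuggingFace tokenizer can produce it, with the tokenizer's `vocab_size` matching the embedding layer's. This again reuses the hypothetical classes from the first sketch rather than the library's real constructors:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = TextClassificationModel(
    text_embedder=TextEmbedder(vocab_size=tokenizer.vocab_size, embedding_dim=32),
    classification_head=ClassificationHead(embedding_dim=32, num_classes=3),
)

batch = tokenizer(["first text", "second text"], padding=True, return_tensors="pt")
logits = model(batch["input_ids"])  # shape: (2, 3)
print(logits.shape)
```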