Merged
69 commits
64f38a5
feat: add HF dependencies (as a group)
meilame-tayebjee Oct 22, 2025
3703f48
feat: add WordPiece tokenize
meilame-tayebjee Oct 22, 2025
1266287
chore: rename file to ngram
meilame-tayebjee Oct 27, 2025
d2563ea
feat: improve base tokenizer, add HF abstract
meilame-tayebjee Oct 27, 2025
ae045ab
feat: change inheritance to HFTokenizer
meilame-tayebjee Oct 27, 2025
c6eac58
feat(dataset): init
meilame-tayebjee Oct 27, 2025
c25eb36
fix: add update of vocab size in post training
meilame-tayebjee Oct 27, 2025
d897bef
fix: categorical tensors set to None instead of empty tensors when no…
meilame-tayebjee Oct 27, 2025
51be1d1
feat: add ruff and datasets dep
meilame-tayebjee Oct 31, 2025
b53a10d
feat: first working example for model/module
meilame-tayebjee Oct 31, 2025
6f3417c
chore: fix signature
meilame-tayebjee Oct 31, 2025
c600f18
chore: default value for batch_idx in predict
meilame-tayebjee Nov 3, 2025
85cb8b8
feat!: violently modularize and simplify forward+checking
meilame-tayebjee Nov 3, 2025
dc863ff
chore: remove tokenizer (now it is ngram tokenizer)
meilame-tayebjee Nov 3, 2025
064b73f
feat!(components): first working example with full modularity
meilame-tayebjee Nov 4, 2025
164cccf
fix: avoid bugs with numpy arrays in boolean contexts
meilame-tayebjee Nov 5, 2025
c5b9673
feat: add smooth imports for HF and output_dim field
meilame-tayebjee Nov 5, 2025
a0fe18c
feat!(wrapper class): finalize orchestration tokenizer, dataset, mode…
meilame-tayebjee Nov 5, 2025
ddd7cec
fix: return only optimizer when scheduler is none
meilame-tayebjee Nov 5, 2025
32e6805
feat(test): clean tests (wip)
meilame-tayebjee Nov 5, 2025
8fdaf0c
chore: clean
meilame-tayebjee Nov 5, 2025
a7f71d3
feat: enable to choose context size in tokenizer
meilame-tayebjee Nov 5, 2025
0a9eda5
chore: pin_memory to default False (avoid warning on CPU run)
meilame-tayebjee Nov 7, 2025
6d951fe
feat: ad __repr__ for all components
meilame-tayebjee Nov 7, 2025
c31ad43
chore: format
meilame-tayebjee Nov 7, 2025
956b7a3
feat!(HF): enable load from pretrained
meilame-tayebjee Nov 7, 2025
a497697
chore: update description
meilame-tayebjee Nov 7, 2025
2fda9c2
feat: __call__ for tokenizers is tokenize
meilame-tayebjee Nov 7, 2025
13b9de4
feat(tokenizers): clean __call__ and __rep__, add offset return for e…
meilame-tayebjee Nov 7, 2025
f55452b
feat!(explainability): finalize explainability feature at word and ch…
meilame-tayebjee Nov 7, 2025
0262109
chore: remove useless file
meilame-tayebjee Nov 7, 2025
6bdb750
fix: typo in trainer_params max_epochs
meilame-tayebjee Nov 10, 2025
830a45c
feat!(tokenizer): ensure output is consistent across al tokenizers
meilame-tayebjee Nov 10, 2025
c7307f5
fix: move hf-dep to optional dependencies
meilame-tayebjee Nov 10, 2025
a5b3e4d
Merge branch 'main' into hf_tokenizer
meilame-tayebjee Nov 10, 2025
934b041
feat!(attention): enable attention logic
meilame-tayebjee Nov 10, 2025
5e150b2
fix: check if categorical var are present before checking their arrays
meilame-tayebjee Nov 10, 2025
162e296
fix: no persistent_workers if num_workers=0
meilame-tayebjee Nov 10, 2025
1591bd9
fix: closing parenthesis
meilame-tayebjee Nov 12, 2025
1af9e53
fix: truncation=True is needed
meilame-tayebjee Nov 12, 2025
4ca1807
add ipywidgets
meilame-tayebjee Nov 12, 2025
927a5e7
fix: check_Y problem of indexes
meilame-tayebjee Nov 12, 2025
7fdb4e3
fix: truncation=True is needed
meilame-tayebjee Nov 12, 2025
4e36940
rmeove unncessary print
meilame-tayebjee Nov 12, 2025
d44d051
progress on doc
meilame-tayebjee Nov 12, 2025
a179c37
fix: load model on cpu to avoid pb after training
meilame-tayebjee Nov 12, 2025
ea26799
progress on docs
meilame-tayebjee Nov 12, 2025
269c76a
fix!(explainability): remove nan words and fix plotting
meilame-tayebjee Nov 12, 2025
89cc8fe
examples : fix basic_classification after refactor
micedre Nov 12, 2025
1b62eee
Fix check for categorical variable
micedre Nov 12, 2025
704fe14
Adapt examples to new package architecture
micedre Nov 13, 2025
be28866
Merge branch 'hf_tokenizer' of https://github.com/InseeFrLab/torchTex…
meilame-tayebjee Nov 13, 2025
be4acf2
chore: first draft of example notebook. WIP
meilame-tayebjee Nov 13, 2025
0f9b4b4
refactor: replace cpu_run with accelerator in TrainingConfig
meilame-tayebjee Nov 17, 2025
5102f82
feat!(tokenizer-ngram): add very fast ngram tokenizer
meilame-tayebjee Nov 18, 2025
ab58e26
doc: clean example notebook
meilame-tayebjee Nov 19, 2025
45ace28
fix: better handling of truncation to avoid warning
meilame-tayebjee Nov 19, 2025
b2e797b
doc: fix readme
meilame-tayebjee Nov 19, 2025
84b118b
fix: allow tokenizer not to have train attribute
meilame-tayebjee Nov 20, 2025
3c0a85a
feat(ngram): add return offsets and word_ids + fix output_dim
meilame-tayebjee Nov 20, 2025
ab70485
fix: update vocab_size after training
meilame-tayebjee Nov 20, 2025
27a11bb
fix: add a flag for return_word_ids
meilame-tayebjee Nov 20, 2025
823467b
fix: add a flag for return_word_ids
meilame-tayebjee Nov 20, 2025
93a6e80
Merge branch 'hf_tokenizer' of https://github.com/InseeFrLab/torchTex…
meilame-tayebjee Nov 20, 2025
4e2ffa5
fix: replace _build_vocab by train
meilame-tayebjee Nov 20, 2025
519a32d
feat(test): add test of all pipeline with different tokenizers
meilame-tayebjee Nov 20, 2025
6017a22
chore: remove old file
meilame-tayebjee Nov 20, 2025
aa70919
fix: right command to install HF dependencies in warning
meilame-tayebjee Nov 20, 2025
41a15f0
chore: change HF opt. dep. group name to huggingface
meilame-tayebjee Nov 20, 2025
4 changes: 3 additions & 1 deletion .gitignore
@@ -174,4 +174,6 @@ fastTextAttention.py
poetry.lock

# vscode
.vscode/
.vscode/

benchmark_results/
142 changes: 12 additions & 130 deletions README.md
@@ -1,14 +1,18 @@
# torchTextClassifiers

A unified, extensible framework for text classification built on [PyTorch](https://pytorch.org/) and [PyTorch Lightning](https://lightning.ai/docs/pytorch/stable/).
A unified, extensible framework for text classification with categorical variables built on [PyTorch](https://pytorch.org/) and [PyTorch Lightning](https://lightning.ai/docs/pytorch/stable/).

## 🚀 Features

- **Unified API**: Consistent interface for different classifier wrappers
- **Extensible**: Easy to add new classifier implementations through wrapper pattern
- **FastText Support**: Built-in FastText classifier with n-gram tokenization
- **Flexible Preprocessing**: Each classifier can implement its own text preprocessing approach
- **Mixed input support**: Handle text data alongside categorical variables seamlessly.
- **Unified yet highly customizable**:
  - Use any tokenizer from HuggingFace or the original fastText n-gram tokenizer.
  - Manipulate the components (`TextEmbedder`, `CategoricalVariableNet`, `ClassificationHead`) to easily create custom architectures, including **self-attention**. All of them are `torch.nn.Module`!
  - The `TextClassificationModel` class combines these components and can be extended for custom behavior.
- **PyTorch Lightning**: Automated training with callbacks, early stopping, and logging
- **Easy experimentation**: Simple API for training, evaluating, and predicting with minimal code:
  - The `torchTextClassifiers` wrapper class orchestrates the tokenizer and the model for you
- **Additional features**: explainability using Captum
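
To illustrate what an n-gram tokenizer in the fastText style produces, here is a minimal pure-Python sketch. It is not the package's implementation: the function names `char_ngrams` and `word_ngrams` are hypothetical, and the real tokenizer additionally maps n-grams to integer ids, which this sketch leaves out.

```python
def char_ngrams(word: str, min_n: int = 3, max_n: int = 6) -> list[str]:
    """Character n-grams of `word`, wrapped in fastText-style boundary markers.

    Hypothetical helper for illustration only.
    """
    wrapped = f"<{word}>"  # "<" and ">" mark the start and end of the word
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n over the wrapped word
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


def word_ngrams(tokens: list[str], n: int = 2) -> list[str]:
    """Contiguous word n-grams (n=2 gives bigrams). Hypothetical helper."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


if __name__ == "__main__":
    print(char_ngrams("where", min_n=3, max_n=3))
    print(word_ngrams("this is a test".split(), n=2))
```

Subword n-grams let a model share parameters between words with common character sequences, which is the motivation usually given for this tokenization scheme.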


## 📦 Installation
@@ -25,140 +29,18 @@ uv sync
pip install -e .
```

## 🎯 Quick Start

### Basic FastText Classification

```python
import numpy as np
from torchTextClassifiers import create_fasttext

# Create a FastText classifier
classifier = create_fasttext(
    embedding_dim=100,
    sparse=False,
    num_tokens=10000,
    min_count=2,
    min_n=3,
    max_n=6,
    len_word_ngrams=2,
    num_classes=2
)

# Prepare your data
X_train = np.array([
    "This is a positive example",
    "This is a negative example",
    "Another positive case",
    "Another negative case"
])
y_train = np.array([1, 0, 1, 0])

X_val = np.array([
    "Validation positive",
    "Validation negative"
])
y_val = np.array([1, 0])

# Build the model
classifier.build(X_train, y_train)

# Train the model
classifier.train(
    X_train, y_train, X_val, y_val,
    num_epochs=50,
    batch_size=32,
    patience_train=5,
    verbose=True
)

# Make predictions
X_test = np.array(["This is a test sentence"])
predictions = classifier.predict(X_test)
print(f"Predictions: {predictions}")

# Validate on test set
accuracy = classifier.validate(X_test, np.array([1]))
print(f"Accuracy: {accuracy:.3f}")
```

### Custom Classifier Implementation

```python
import numpy as np
from torchTextClassifiers import torchTextClassifiers
from torchTextClassifiers.classifiers.simple_text_classifier import SimpleTextWrapper, SimpleTextConfig

# Example: TF-IDF based classifier (alternative to tokenization)
config = SimpleTextConfig(
    hidden_dim=128,
    num_classes=2,
    max_features=5000,
    learning_rate=1e-3,
    dropout_rate=0.2
)

# Create classifier with TF-IDF preprocessing
wrapper = SimpleTextWrapper(config)
classifier = torchTextClassifiers(wrapper)

# Text data
X_train = np.array(["Great product!", "Terrible service", "Love it!"])
y_train = np.array([1, 0, 1])

# Build and train
classifier.build(X_train, y_train)
# ... continue with training
```


### Training Customization

```python
# Custom PyTorch Lightning trainer parameters
trainer_params = {
    'accelerator': 'gpu',
    'devices': 1,
    'precision': 16,  # Mixed precision training
    'gradient_clip_val': 1.0,
}

classifier.train(
    X_train, y_train, X_val, y_val,
    num_epochs=100,
    batch_size=64,
    patience_train=10,
    trainer_params=trainer_params,
    verbose=True
)
```
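
The `patience_train` argument above suggests patience-based early stopping. As a rough sketch of that logic only, and assuming the monitored quantity is a validation loss where lower is better, the stopping rule can be written in plain Python. The function name `epoch_of_early_stop` and the `min_delta` parameter are hypothetical; the package delegates the actual mechanism to PyTorch Lightning callbacks.

```python
from typing import Optional, Sequence


def epoch_of_early_stop(
    val_losses: Sequence[float],
    patience: int = 5,
    min_delta: float = 0.0,
) -> Optional[int]:
    """Return the 1-based epoch at which training would stop, or None.

    Hypothetical helper: stops once `patience` consecutive epochs fail to
    improve the best validation loss by more than `min_delta`.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            # New best value: reset the counter
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


if __name__ == "__main__":
    losses = [1.0, 0.9, 0.95, 0.92, 0.91]
    print(epoch_of_early_stop(losses, patience=2))
```

A larger patience tolerates noisier validation curves at the cost of extra epochs.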

## 🔬 Testing

Run the test suite:

```bash
# Run all tests
uv run pytest

# Run with coverage
uv run pytest --cov=torchTextClassifiers

# Run specific test file
uv run pytest tests/test_torchTextClassifiers.py -v
```
## 📝 Usage

Check out the [notebook](notebooks/example.ipynb) for a quick start.

## 📚 Examples

See the [examples/](examples/) directory for:
- Basic text classification
- Multi-class classification
- Mixed features (text + categorical)
- Custom classifier implementation
- Advanced training configurations


- Prediction and explainability

## 📄 License
