This repository contains the official PyTorch implementation of SAGE (Semantic Alignment with Global Embedding). SAGE is a framework designed to enhance Sequential Recommendation Systems (SRS) by alleviating the long-tail problem through global semantic alignment using Large Language Models (LLMs).
SAGE integrates LLM-derived semantic information with collaborative signals. Key components include:
- Fuzzy-Membership Prototypes: Uses UMAP for dimensionality reduction and HDBSCAN for density-based clustering of items, then assigns fuzzy memberships so that tail items can inherit features from semantically related head items.
- Alignment-Based User Distillation: Retrieves semantically similar users to enrich sparse user representations.
- Dual-View Modeling: Aligns Semantic (LLM) and Collaborative (ID-based) spaces.
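To make the fuzzy-prototype idea concrete, here is a rough sketch of how tail-item embeddings can be enriched with fuzzy-weighted cluster centers. This is illustrative only; the function and variable names (`enrich_tail_items`, `fuzzy_U`, `centers`) are assumptions, not this repository's API.

```python
import numpy as np

def enrich_tail_items(item_emb, fuzzy_U, centers, tail_mask, blend=0.5):
    """Blend tail-item embeddings with their fuzzy-weighted cluster centers.

    item_emb:  (n_items, d) collaborative item embeddings
    fuzzy_U:   (n_items, n_clusters) fuzzy membership matrix (rows sum to 1)
    centers:   (n_clusters, d) semantic cluster centers
    tail_mask: (n_items,) boolean, True for tail items
    """
    prototypes = fuzzy_U @ centers  # per-item fuzzy semantic prototype
    out = item_emb.copy()
    # Head items keep their embeddings; tail items move toward their prototype.
    out[tail_mask] = (1 - blend) * item_emb[tail_mask] + blend * prototypes[tail_mask]
    return out
```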
Before training, you must generate LLM embeddings and perform the fuzzy clustering process.
Process the raw datasets (Beauty, Fashion, Yelp) into interaction sequences.
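Conceptually, this step groups each user's interactions into a time-ordered item sequence. A minimal sketch, assuming simple (user, item, timestamp) records (the actual script may differ):

```python
from collections import defaultdict

def build_sequences(interactions):
    """Group (user, item, timestamp) records into time-ordered item sequences."""
    by_user = defaultdict(list)
    for user, item, ts in interactions:
        by_user[user].append((ts, item))
    # Sort each user's records by timestamp, then keep only the item IDs.
    return {u: [item for _, item in sorted(recs)] for u, recs in by_user.items()}
```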
```bash
python data/data_process.py
```

Use the notebooks in data/{dataset}/ to generate semantic embeddings for items and users (e.g., using the OpenAI API or open-source LLMs):
- get_item_embedding.ipynb
- get_user_embedding.ipynb
- pca.ipynb (reduces the embedding dimension to 64 for efficiency)
Run the notebooks in the Clustering/ folder (e.g., Clustering/fashion.ipynb).
This step utilizes UMAP for reduction and HDBSCAN for density-based clustering to generate the following required artifacts in data/{dataset}/handled/:
- hdbscan_best_labels.pkl: Cluster assignments.
- hdbscan_cluster_centers.pkl: Weighted semantic centers.
- hdbscan_core_probs.pkl: Core-point probabilities.
- hdbscan_fuzzy_U.pkl: The Global Fuzzy Membership Matrix.
Note: The clustering logic ensures noise points (tail items) are assigned soft memberships based on distance to valid semantic clusters.
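The soft assignment can be sketched in the familiar fuzzy-c-means form, where the exponent corresponds to the --fuzzy_m argument. This is a sketch assuming Euclidean distances; `fuzzy_memberships` is a hypothetical helper, not the notebook's code.

```python
import numpy as np

def fuzzy_memberships(points, centers, m=2.0, eps=1e-12):
    """Fuzzy-c-means-style soft memberships of points to cluster centers.

    u_ik = 1 / sum_j (d_ik / d_ij)^(2/(m-1)); rows sum to 1, so noise/tail
    points receive graded membership based on distance to each center.
    """
    # (n_points, n_clusters) pairwise Euclidean distances, guarded against 0.
    d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=-1) + eps
    ratio = (d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))
    return 1.0 / ratio.sum(axis=2)
```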
Scripts to reproduce the experiments are located in experiments_sage/. We support three backbones: SASRec, Bert4Rec, and GRU4Rec.
```bash
cd experiments_sage
bash beauty.bash
bash yelp.bash
bash fashion.bash
```

| Argument | Description | Default |
|---|---|---|
| --model_name | Backbone selection (llmesr_sasrec, llmesr_bert4rec, llmesr_gru4rec) | llmesr_sasrec |
| --alpha | Weight for the User Alignment (Distillation) loss | 0.1 |
| --gamma | Weight for the Item Prototype (Fuzzy) loss | 0.05 |
| --beta | Weight for the Global Semantic-Collaborative Alignment loss | 0.1 |
| --fuzzy_m | Fuzzy exponent (controls fuzziness of membership) | 2 |
| --ts_user | Threshold defining tail users (interactions < threshold) | 9 |
| --ts_item | Threshold defining tail items | 4 |
| --user_sim_func | Similarity function for user retrieval (kd for distillation) | kd |
| --freeze | Freeze the pre-computed LLM Semantic Embeddings | True |
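Conceptually, the three weighted terms combine with the recommendation loss as follows. This is a sketch of how --alpha, --gamma, and --beta enter the objective; the actual loss terms are computed in the training code.

```python
def total_loss(rec_loss, user_align_loss, fuzzy_proto_loss, global_align_loss,
               alpha=0.1, gamma=0.05, beta=0.1):
    """Combine the recommendation loss with the three auxiliary losses
    using the --alpha / --gamma / --beta weights."""
    return (rec_loss
            + alpha * user_align_loss    # user alignment (distillation)
            + gamma * fuzzy_proto_loss   # item prototype (fuzzy)
            + beta * global_align_loss)  # global semantic-collaborative alignment
```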
Training logs are saved to the logs/ directory. The primary evaluation metrics are Hit Rate@10 (HR@10) and NDCG@10.
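For reference, both metrics follow their standard definitions given the 1-based rank of each user's held-out target item (a sketch; `hr_and_ndcg_at_k` is a hypothetical helper, not the repository's evaluator):

```python
import math

def hr_and_ndcg_at_k(ranks, k=10):
    """Hit Rate@k and NDCG@k from 1-based ranks of each user's target item."""
    n = len(ranks)
    # HR@k: fraction of users whose target item appears in the top k.
    hr = sum(r <= k for r in ranks) / n
    # NDCG@k: discounted gain 1/log2(rank + 1) for hits, 0 otherwise.
    ndcg = sum(1.0 / math.log2(r + 1) for r in ranks if r <= k) / n
    return hr, ndcg
```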
If you use this extension, please cite the original LLM-ESR work along with our paper (to be added).