Applied-Machine-Learning-Lab/WWW2026_SAGE-LLM

SAGE: Global Semantic Alignment with LLMs for Long-Tail Sequential Recommendation

This repository contains the official PyTorch implementation of SAGE (Semantic Alignment with Global Embedding). SAGE is a framework designed to enhance Sequential Recommendation Systems (SRS) by alleviating the long-tail problem through global semantic alignment using Large Language Models (LLMs).

🌟 Overview

SAGE integrates LLM-derived semantic information with collaborative signals. Key components include:

  • Fuzzy-Membership Prototypes: Uses HDBSCAN and UMAP to cluster items and assigns fuzzy memberships, allowing tail items to inherit features from semantically related head items.
  • Alignment-Based User Distillation: Retrieves semantically similar users to enrich sparse user representations.
  • Dual-View Modeling: Aligns Semantic (LLM) and Collaborative (ID-based) spaces.
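For intuition, the fuzzy-membership idea can be sketched as a membership-weighted blend of cluster prototypes, which lets a tail item borrow signal from the semantic clusters it partially belongs to. This is a minimal illustration; the names `prototype_enrich`, `fuzzy_U`, and `lam` are ours, not the repository's API:

```python
import numpy as np

def prototype_enrich(item_emb, fuzzy_U, centers, lam=0.5):
    """Blend each item embedding with its fuzzy-weighted prototype.

    item_emb: (n_items, d) semantic embeddings
    fuzzy_U:  (n_items, n_clusters) fuzzy membership matrix (rows sum to 1)
    centers:  (n_clusters, d) cluster prototype vectors
    lam:      interpolation weight toward the prototype view (illustrative)
    """
    proto = fuzzy_U @ centers                 # (n_items, d) membership-weighted prototypes
    return (1 - lam) * item_emb + lam * proto
```

A tail item with diffuse memberships is pulled toward several related prototypes, inheriting features from semantically related head items as described above.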

🚀 Data Preparation

Before training, you must generate LLM embeddings and perform the fuzzy clustering process.

1. Data Processing

Process the raw datasets (Beauty, Fashion, Yelp) into interaction sequences.

python data/data_process.py
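The core of this step is turning a raw interaction log into per-user chronological item sequences. A pure-Python sketch of that grouping (the tuple layout is an assumption; `data_process.py` reads the actual dataset files):

```python
# Hypothetical raw interaction log as (user_id, item_id, timestamp) tuples.
interactions = [
    (1, 10, 100), (1, 11, 200), (2, 20, 150), (1, 12, 300), (2, 21, 250),
]

# Sort by timestamp, then collect each user's chronological item sequence.
sequences = {}
for user, item, _ in sorted(interactions, key=lambda r: r[2]):
    sequences.setdefault(user, []).append(item)

# sequences == {1: [10, 11, 12], 2: [20, 21]}
```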

2. Generate LLM Embeddings

Use the notebooks in data/{dataset}/ to generate semantic embeddings for items and users (e.g., using OpenAI API or open-source LLMs).

  • get_item_embedding.ipynb
  • get_user_embedding.ipynb
  • pca.ipynb (reduces the embedding dimension to 64 for efficiency).
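The reduction step amounts to projecting the high-dimensional LLM embeddings onto their top principal components. A NumPy-only sketch of that projection (the notebook itself may use a library implementation such as scikit-learn):

```python
import numpy as np

def pca_reduce(emb, k=64):
    """Project embeddings onto their top-k principal components."""
    centered = emb - emb.mean(axis=0, keepdims=True)
    # SVD of the centered matrix; rows of vt are the principal directions.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T

# e.g. 1000 items with hypothetical 1536-dim API embeddings -> 64 dims
emb_64 = pca_reduce(np.random.randn(1000, 1536).astype(np.float32), k=64)
```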

3. Clustering & Fuzzy Membership (Crucial)

Run the notebooks in the Clustering/ folder (e.g., Clustering/fashion.ipynb). This step utilizes UMAP for reduction and HDBSCAN for density-based clustering to generate the following required artifacts in data/{dataset}/handled/:

  • hdbscan_best_labels.pkl: Cluster assignments.
  • hdbscan_cluster_centers.pkl: Weighted semantic centers.
  • hdbscan_core_probs.pkl: Core point probabilities.
  • hdbscan_fuzzy_U.pkl: The Global Fuzzy Membership Matrix.

Note: The clustering logic ensures noise points (tail items) are assigned soft memberships based on distance to valid semantic clusters.
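The soft assignment for noise points can be sketched in the fuzzy-c-means style: a point HDBSCAN labels as noise gets a membership over cluster centers proportional to an inverse power of its distance to each center. This is a simplified illustration of the idea, not the notebook's exact formula; `m` mirrors the `--fuzzy_m` exponent below:

```python
import numpy as np

def soft_membership(x, centers, m=2.0, eps=1e-12):
    """Fuzzy membership of one point over cluster centers.

    u_j is proportional to 1 / d_j^(2/(m-1)), normalized to sum to 1,
    so nearer centers receive higher membership.
    """
    d = np.linalg.norm(centers - x, axis=1) + eps
    w = d ** (-2.0 / (m - 1.0))
    return w / w.sum()
```

A larger `m` flattens the memberships (fuzzier assignment); as `m` approaches 1 the assignment hardens toward the nearest center.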

🏃 Training & Evaluation

Scripts to reproduce the experiments are located in experiments_sage/. We support three backbones: SASRec, BERT4Rec, and GRU4Rec.

Running a specific dataset

cd experiments_sage
bash beauty.bash
bash yelp.bash
bash fashion.bash

Key Hyperparameters

| Argument | Description | Default |
| --- | --- | --- |
| `--model_name` | Backbone selection (`llmesr_sasrec`, `llmesr_bert4rec`, `llmesr_gru4rec`) | `llmesr_sasrec` |
| `--alpha` | Weight for the user alignment (distillation) loss | `0.1` |
| `--gamma` | Weight for the item prototype (fuzzy) loss | `0.05` |
| `--beta` | Weight for the global semantic-collaborative alignment loss | `0.1` |
| `--fuzzy_m` | Fuzzy exponent (controls the fuzziness of memberships) | `2` |
| `--ts_user` | Threshold defining tail users (interactions < threshold) | `9` |
| `--ts_item` | Threshold defining tail items | `4` |
| `--user_sim_func` | Similarity function for user retrieval (`kd` for distillation) | `kd` |
| `--freeze` | Freeze the pre-computed LLM semantic embeddings | `True` |
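For a custom run, the flags above can be passed directly to the training script. The entry-point name `main.py` here is an assumption; check the bash scripts in experiments_sage/ for the exact command:

```shell
# Hypothetical direct invocation with the default hyperparameters.
python main.py \
    --model_name llmesr_sasrec \
    --alpha 0.1 --beta 0.1 --gamma 0.05 \
    --fuzzy_m 2 --ts_user 9 --ts_item 4 \
    --user_sim_func kd --freeze True
```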

📊 Results

The model saves logs to the logs/ directory. The primary metrics evaluated are Hit Rate@10 (HR@10) and NDCG@10.
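For reference, both metrics can be computed from each test item's predicted rank. This is the standard formulation (with a single relevant item per user), not necessarily the repository's exact evaluation code:

```python
import numpy as np

def hr_ndcg_at_k(ranks, k=10):
    """Hit Rate@k and NDCG@k from each target item's 1-based rank.

    HR@k:   fraction of users whose target item ranks within the top k.
    NDCG@k: 1/log2(rank + 1) when the item is in the top k, else 0,
            averaged over users.
    """
    ranks = np.asarray(ranks)
    hits = ranks <= k
    hr = hits.mean()
    ndcg = np.where(hits, 1.0 / np.log2(ranks + 1), 0.0).mean()
    return hr, ndcg
```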

Reference

If you use this extension, please cite the original LLM-ESR work along with our paper (to be added).
