SemLink - Automatic Semantic Note Linking

Transform unstructured markdown/text notes into an interconnected knowledge graph using semantic similarity analysis.

Features

Automatic Link Discovery - Infer semantic relationships between notes without manual linking
Multiple Embedding Methods - TF-IDF baseline, Sentence-BERT, or OpenAI embeddings
Graph Visualization - Interactive D3.js web interface or Pyvis HTML export
Community Detection - Discover topic clusters using Louvain algorithm
Topic Aggregation - Group related notes and generate topic summaries
Incremental Updates - SQLite storage for efficient vault synchronization
Link Reasoning - Understand why notes are connected with shared terms

Similar "document linking" solutions that aim for human interpretability fall short because they rely purely on keyword matching, like this one.

Some ideas for visualizing the semantic links are given here, although some of them are over-the-top and would only add UI complexity, which is to be avoided.

Objectives

Infer semantic relationships between textual notes
Compare keyword-based and embedding-based similarity methods
Construct a sparse, interpretable note graph
Analyze graph structure and semantic coherence
Provide an easy-to-use CLI tool

Installation

# Core installation (TF-IDF only, lightweight)
pip install semlink

# With neural embeddings (includes PyTorch)
pip install semlink[sbert]

# With OpenAI embeddings
pip install semlink[openai]

# With web server for React frontend
pip install semlink[server]

# With Pyvis visualization
pip install semlink[viz]

# Everything
pip install semlink[all]

Quick Start

# Run full pipeline on a vault
semlink run ./my-notes --output ./output

# Or use incremental sync with SQLite
semlink sync ./my-notes --db .semlink.db

# Start the web interface
semlink serve --db .semlink.db

# Aggregate notes into topics
semlink aggregate output/graph.json --notes output/notes.json -o topics/

CLI Commands

Command	Description
`ingest`	Load and preprocess notes from a vault directory
`embed`	Generate embeddings (TF-IDF, SBERT, or OpenAI)
`link`	Infer links between notes based on similarity
`analyze`	Compute graph metrics and detect communities
`visualize`	Generate HTML, PNG, D3 JSON, or Obsidian export
`compare`	Compare different embedding methods
`run`	Full pipeline: ingest → embed → link → analyze → visualize
`sync`	Incremental vault sync with SQLite storage
`export`	Export graph from SQLite database
`status`	Show database statistics
`aggregate`	Group notes into topics by community
`serve`	Start web server for React frontend
`info`	Display available models and strategies

Usage Examples

Basic Pipeline

# Process a vault with TF-IDF (default, lightweight)
semlink run ./vault --output ./output

# Use Sentence-BERT for better semantic matching
semlink run ./vault --method sbert --output ./output

# Filter out weak links (similarity below 0.25)
semlink run ./vault --min-weight 0.25 --output ./output

Incremental Workflow

# Initial sync (creates .semlink.db)
semlink sync ./vault

# Re-run after editing notes (only processes changes)
semlink sync ./vault

# Export graph for visualization
semlink export --db .semlink.db -o graph.json

# Check database status
semlink status
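
The README does not say how `sync` decides which notes have changed. One common approach is to store a content hash per note in SQLite and re-process only the notes whose hash differs. The sketch below illustrates that idea under stated assumptions: the `notes` table, its columns, and the `sync` helper are hypothetical and are not taken from SemLink's `storage.py`.

```python
import hashlib
import sqlite3


def content_hash(text):
    """Stable fingerprint of a note's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def sync(conn, notes):
    """Compare `notes` ({path: text}) with the stored hashes.

    Returns (changed, removed): paths that are new or edited, and paths
    that are in the database but no longer in the vault.
    Hypothetical schema, for illustration only.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS notes (path TEXT PRIMARY KEY, hash TEXT NOT NULL)"
    )
    stored = dict(conn.execute("SELECT path, hash FROM notes"))
    changed = []
    for path, text in notes.items():
        digest = content_hash(text)
        if stored.get(path) != digest:
            # New or edited note: record it and refresh the stored hash.
            changed.append(path)
            conn.execute(
                "INSERT OR REPLACE INTO notes (path, hash) VALUES (?, ?)",
                (path, digest),
            )
    removed = sorted(set(stored) - set(notes))
    conn.executemany("DELETE FROM notes WHERE path = ?", [(p,) for p in removed])
    conn.commit()
    return sorted(changed), removed


conn = sqlite3.connect(":memory:")
first = sync(conn, {"a.md": "alpha", "b.md": "beta"})      # both notes are new
second = sync(conn, {"a.md": "alpha", "b.md": "beta v2"})  # only b.md was edited
third = sync(conn, {"a.md": "alpha"})                      # b.md was deleted
print(first, second, third)
```

In a real pipeline, only the paths in `changed` would need new embeddings, and links touching `removed` paths would be dropped.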

Web Interface

# Build frontend (first time only)
cd frontend && npm install && npm run build && cd ..

# Start server
semlink serve --db .semlink.db

# Opens at http://localhost:8000

Topic Aggregation

# Generate topic clusters from graph
semlink aggregate graph.json --notes notes.json --format markdown -o topics/

# Export as Obsidian vault structure
semlink aggregate graph.json --notes notes.json --format obsidian -o vault/

# More granular topics (higher resolution)
semlink aggregate graph.json --notes notes.json --resolution 1.5 -k 7
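
Louvain community detection works by maximizing modularity, and in the standard formulation a resolution parameter scales the penalty for large communities. Assuming `--resolution` maps to that parameter (the README does not confirm this), the sketch below shows why a higher value favors more granular topics. The `modularity` helper and the toy graph are illustrative and not part of SemLink.

```python
from collections import defaultdict


def modularity(edges, community, resolution=1.0):
    """Modularity of a partition of an undirected, unweighted graph.

    Q = sum over communities c of  L_c / m - resolution * (d_c / (2m))**2
    where m is the number of edges, L_c the number of edges inside c,
    and d_c the total degree of the nodes in c.
    """
    m = len(edges)
    internal = defaultdict(int)
    degree = defaultdict(int)
    for u, v in edges:
        degree[community[u]] += 1
        degree[community[v]] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    return sum(internal[c] / m - resolution * (degree[c] / (2 * m)) ** 2 for c in degree)


# Toy graph: four triangles (nodes 0-2, 3-5, 6-8, 9-11).
edges = []
for start in (0, 3, 6, 9):
    a, b, c = start, start + 1, start + 2
    edges += [(a, b), (b, c), (a, c)]
# Triangles 0 and 1 are tied by three edges, as are triangles 2 and 3;
# a single edge bridges the two halves.
edges += [(0, 3), (1, 4), (2, 5), (6, 9), (7, 10), (8, 11), (5, 6)]

fine = {node: node // 3 for node in range(12)}    # one community per triangle
coarse = {node: node // 6 for node in range(12)}  # two larger communities

for gamma in (1.0, 1.5):
    print(gamma, round(modularity(edges, coarse, gamma), 4),
          round(modularity(edges, fine, gamma), 4))
```

On this graph the two-community partition scores higher at resolution 1.0, while the four-community partition scores higher at 1.5, which is the sense in which a higher resolution yields more granular topics.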

Embedding Methods

Method	Description	Requirements
`tfidf`	TF-IDF keyword matching (default)	scikit-learn
`sbert`	Sentence-BERT semantic similarity	sentence-transformers
`openai`	OpenAI text embeddings API	openai, tiktoken
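
To make the `tfidf` baseline concrete, here is a minimal, dependency-free sketch of TF-IDF weighting, cosine similarity, and the shared terms that could explain a link. It is an illustration only: SemLink's `tfidf` method lists scikit-learn as its requirement, and the tokenizer, the smoothed IDF formula, and the helper names below are assumptions rather than the project's actual code.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document.

    Uses raw term frequency and a smoothed IDF: log((1 + N) / (1 + df)) + 1.
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in Counter(tokens).items()} for tokens in tokenized]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def shared_terms(a, b, top=3):
    """Terms present in both vectors, ranked by their contribution to the dot product."""
    common = {t: a[t] * b[t] for t in a if t in b}
    return sorted(common, key=common.get, reverse=True)[:top]


notes = [
    "Gradient descent updates model weights using the loss gradient.",
    "Stochastic gradient descent samples mini-batches to estimate the gradient.",
    "Sourdough bread needs a long, slow fermentation.",
]
vectors = tfidf_vectors(notes)
print(round(cosine(vectors[0], vectors[1]), 3))  # related notes: similarity above 0
print(cosine(vectors[0], vectors[2]))            # no shared tokens: similarity is 0.0
print(shared_terms(vectors[0], vectors[1]))      # terms that explain the link
```

Keyword overlap of this kind cannot connect notes that discuss the same idea in different words, which is the gap the `sbert` and `openai` methods are meant to close.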

Link Strategies

Strategy	Description
`threshold`	Connect notes with similarity ≥ threshold
`knn`	Connect to k nearest neighbors
`mutual_knn`	Connect only if mutually nearest
`hybrid`	KNN + threshold (recommended)
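
The four strategies can be sketched over a precomputed similarity matrix as follows. This is a hedged illustration, not the code in `linker.py`: the function names are hypothetical, and `hybrid` is read here as "k-nearest-neighbor edges that also pass the threshold", which is one plausible interpretation of "KNN + threshold".

```python
def threshold_links(sim, threshold):
    """Connect every pair whose similarity is at least `threshold`."""
    n = len(sim)
    return {(i, j) for i in range(n) for j in range(i + 1, n) if sim[i][j] >= threshold}


def top_k(sim, i, k):
    """Indices of the k notes most similar to note i, excluding i itself."""
    others = [j for j in range(len(sim)) if j != i]
    return sorted(others, key=lambda j: sim[i][j], reverse=True)[:k]


def knn_links(sim, k):
    """Connect each note to its k nearest neighbors (edges are undirected)."""
    return {tuple(sorted((i, j))) for i in range(len(sim)) for j in top_k(sim, i, k)}


def mutual_knn_links(sim, k):
    """Keep a pair only if each note appears in the other's top-k list."""
    neighbors = [set(top_k(sim, i, k)) for i in range(len(sim))]
    return {(i, j) for i in range(len(sim)) for j in neighbors[i]
            if i < j and i in neighbors[j]}


def hybrid_links(sim, k, threshold):
    """Assumed reading of 'KNN + threshold': k-NN edges that also pass the threshold."""
    return {(i, j) for (i, j) in knn_links(sim, k) if sim[i][j] >= threshold}


# Symmetric similarity matrix for four notes (made-up values).
sim = [
    [1.0, 0.8, 0.3, 0.1],
    [0.8, 1.0, 0.4, 0.2],
    [0.3, 0.4, 1.0, 0.6],
    [0.1, 0.2, 0.6, 1.0],
]
print(sorted(threshold_links(sim, 0.35)))
print(sorted(knn_links(sim, 2)))
print(sorted(mutual_knn_links(sim, 2)))
print(sorted(hybrid_links(sim, 2, 0.35)))
```

Plain k-NN always gives every note k edges, even to weakly related notes, while a plain threshold can leave notes isolated or produce dense hubs. Combining the two keeps the graph sparse without forcing weak links, which may be why `hybrid` is the recommended strategy.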

Project Structure

semlink/
├── frontend/              # React + D3.js web interface
│   ├── src/
│   │   ├── components/    # ForceGraph, SidePanel, ControlPanel
│   │   ├── hooks/         # useGraphData
│   │   └── types/         # TypeScript definitions
│   └── package.json
├── src/semlink/
│   ├── core/
│   │   ├── ingest.py      # Note discovery and preprocessing
│   │   ├── chunk.py       # Chunking strategies
│   │   ├── tfidf.py       # TF-IDF embeddings
│   │   ├── embeddings.py  # SBERT, OpenAI embeddings
│   │   ├── linker.py      # Link inference strategies
│   │   ├── graph.py       # NetworkX graph building
│   │   ├── analysis.py    # Metrics, community detection
│   │   ├── visualize.py   # Pyvis, D3, Obsidian export
│   │   ├── evaluate.py    # Method comparison
│   │   ├── storage.py     # SQLite persistence
│   │   └── aggregate.py   # Topic aggregation
│   ├── server.py          # FastAPI backend
│   └── cli.py             # Typer CLI
└── pyproject.toml

Development

# Clone and install
git clone https://github.com/KreativeThinker/SemLink.git
cd SemLink
uv sync

# Install pre-commit hooks
pre-commit install

# Run CLI
uv run semlink --help

Abstract

The digital note-taking software market is experiencing significant growth, with projections reaching $1.35-1.5 billion by 2027-2028. Knowledge workers are increasingly adopting tools like Obsidian, Roam Research, and Logseq for personal knowledge management. However, manual linking does not scale beyond a few hundred notes, leaving valuable relationships unexpressed.

This project addresses the problem of automatically identifying and representing semantic relationships between notes without user intervention, using NLP techniques to infer contextual relationships and represent them as a knowledge graph.

Contributing

Please refer to the guidelines in CONTRIBUTING.md.

License

MIT License - see LICENSE for details.
