cosmol-studio/COSMolKit

COSMolKit

COSMolKit is a Python molecular toolkit backed by a Rust core. It provides value-style molecule operations, SMILES and SDF workflows, 2D depiction, fingerprints, batch processing, and protein-focused structural biology APIs.

The library is built around explicit behavior: supported operations return structured results, unsupported behavior fails visibly, and public molecule transforms return new values instead of mutating their inputs.

COSMolKit is designed around array-oriented access to structural data, so that molecular data fits naturally into NumPy, PyTorch, and model-building workflows.

Documentation

Installation

pip install cosmolkit

Core Concepts

  • Value-style molecules: methods such as with_hydrogens(), without_hydrogens(), with_kekulized_bonds(), and with_2d_coords() return new molecule values.
  • Explicit errors: invalid input and unsupported behavior are surfaced as errors instead of silent fallbacks.
  • Batch-native processing: MoleculeBatch keeps input order, supports structured per-record failures, and can run batch transforms and exports with configurable parallelism.
  • Array-friendly data access: coordinates, bounds matrices, fingerprints, and graph features are exposed in forms that fit Python numerical workflows.
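
The batch behavior described above (input order preserved, per-record failures kept, configurable parallelism) can be illustrated with a small standard-library sketch. This is not COSMolKit's implementation: RecordResult, run_batch, and toy_parse are hypothetical names invented for this example, and toy_parse is only a stand-in for a real parser.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Generic, List, Optional, TypeVar

T = TypeVar("T")
R = TypeVar("R")


@dataclass(frozen=True)
class RecordResult(Generic[R]):
    """Outcome for one input record: either a value or an error message."""
    index: int
    value: Optional[R] = None
    error: Optional[str] = None

    @property
    def ok(self) -> bool:
        return self.error is None


def run_batch(
    items: List[T], func: Callable[[T], R], n_jobs: int = 4
) -> List[RecordResult[R]]:
    """Apply func to every item, keeping input order and capturing failures."""

    def guarded(pair):
        index, item = pair
        try:
            return RecordResult(index=index, value=func(item))
        except Exception as exc:  # keep the failure instead of aborting the batch
            return RecordResult(index=index, error=f"{type(exc).__name__}: {exc}")

    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        # Executor.map yields results in input order, regardless of finish order.
        return list(pool.map(guarded, enumerate(items)))


def toy_parse(text: str) -> int:
    """Hypothetical stand-in for a parser: counts characters, rejects empty input."""
    if not text:
        raise ValueError("empty record")
    return len(text)


results = run_batch(["CCO", "", "c1ccccc1"], toy_parse, n_jobs=2)
valid_mask = [r.ok for r in results]
print(valid_mask)
```

The failed record stays in its original position with an error attached, so downstream code can align results with inputs by index.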

Value-Style Transformations

Normal molecule operations return new objects and do not mutate their inputs. This follows the same explicit-dataflow direction as modern dataframe libraries: users can reason about each transformation as a new value while the Rust core can share unchanged internal storage efficiently.

from cosmolkit import Molecule

mol = Molecule.from_smiles("CCO")
mol_h = mol.with_hydrogens()

assert mol is not mol_h
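
To make the idea of value semantics with shared storage concrete, here is a minimal standard-library sketch. It does not use COSMolKit and does not reflect how the Rust core stores molecules: TinyMolecule, with_atom, and renamed_atom are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class TinyMolecule:
    """Hypothetical immutable molecule: atoms and bonds are stored as tuples."""
    atoms: Tuple[str, ...]
    bonds: Tuple[Tuple[int, int], ...]

    def _check_index(self, index: int) -> None:
        if not 0 <= index < len(self.atoms):
            raise IndexError(f"no atom at index {index}")

    def with_atom(self, symbol: str, attached_to: int) -> "TinyMolecule":
        """Return a new molecule with one extra atom bonded to an existing one."""
        self._check_index(attached_to)
        new_index = len(self.atoms)
        return replace(
            self,
            atoms=self.atoms + (symbol,),
            bonds=self.bonds + ((attached_to, new_index),),
        )

    def renamed_atom(self, index: int, symbol: str) -> "TinyMolecule":
        """Return a new molecule with one atom relabeled; bonds are reused as-is."""
        self._check_index(index)
        atoms = self.atoms[:index] + (symbol,) + self.atoms[index + 1:]
        return replace(self, atoms=atoms)


ethane = TinyMolecule(atoms=("C", "C"), bonds=((0, 1),))
ethanol = ethane.with_atom("O", attached_to=1)
methanol = ethane.renamed_atom(1, "O")

print(ethane.atoms)                    # the input is unchanged
print(ethanol.atoms)
print(methanol.bonds is ethane.bonds)  # the untouched bond tuple is shared
```

Because the fields are immutable, sharing the unchanged bond tuple between the old and new values is safe, which is the property that lets a value-style API avoid full copies.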

Python Quick Start

from cosmolkit import Molecule, MoleculeBatch

mol = Molecule.from_smiles("c1ccccc1O")
mol_2d = mol.with_2d_coords()

print(mol_2d.to_smiles())
print(mol_2d.coords_2d())

svg = mol_2d.to_svg(width=400, height=300)
mol_2d.write_png("phenol.png", width=400, height=300)

fp = mol.fingerprint_morgan(radius=2, n_bits=2048)
print(fp.on_bits())

batch = (
    MoleculeBatch.from_smiles_list(
        ["CCO", "c1ccccc1", "CC(=O)O"],
        sanitize=True,
        errors="keep",
    )
    .with_parallel_jobs(8)
    .with_progress_bar(False)
)

prepared = batch.add_hydrogens(errors="keep").compute_2d_coords(errors="keep")
print(prepared.valid_mask())
print(prepared.to_smiles_list())

prepared.to_images(
    "molecule_images",
    format="png",
    size=(300, 300),
    errors="keep_errors",
    filenames=["ethanol", "benzene", "acetic_acid"],
)

Protein Structures

Use Protein when the workflow is focused on protein chains rather than the full structural table.

from cosmolkit import Protein

protein = Protein.from_pdb("1crn.pdb")

print(protein.num_chains())
print(protein.num_residues())
print(protein.num_atoms())

for chain in protein.chains():
    print(chain.index(), chain.kind(), len(chain))
    for residue in chain.residues():
        print(residue.name(), residue.kind(), len(residue))

For lower-level structural workflows, COSMolKit also exposes BioStructure types in Rust and Python.
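
The chain, residue, and atom hierarchy shown above can be sketched directly from PDB ATOM records, which use a fixed-column layout. The following is a simplified standard-library illustration, not COSMolKit's parser: it ignores alternate locations, insertion codes, and multiple models, and the names AtomRecord, format_atom_line, parse_atom_line, and group_by_chain are hypothetical.

```python
from typing import Dict, List, NamedTuple, Tuple


class AtomRecord(NamedTuple):
    name: str
    residue_name: str
    chain_id: str
    residue_number: int
    xyz: Tuple[float, float, float]


def format_atom_line(serial, name, residue_name, chain_id, residue_number, x, y, z):
    """Build a fixed-width ATOM line for the demo data (simplified name alignment)."""
    return (
        f"{'ATOM':<6}{serial:>5} {name:<4} {residue_name:>3} {chain_id}"
        f"{residue_number:>4}    {x:8.3f}{y:8.3f}{z:8.3f}"
    )


def parse_atom_line(line: str) -> AtomRecord:
    """Slice one ATOM/HETATM line by the PDB column layout (0-based slices)."""
    return AtomRecord(
        name=line[12:16].strip(),          # columns 13-16
        residue_name=line[17:20].strip(),  # columns 18-20
        chain_id=line[21],                 # column 22
        residue_number=int(line[22:26]),   # columns 23-26
        xyz=(float(line[30:38]), float(line[38:46]), float(line[46:54])),
    )


ResidueKey = Tuple[int, str]


def group_by_chain(lines: List[str]) -> Dict[str, Dict[ResidueKey, List[AtomRecord]]]:
    """Return {chain_id: {(residue_number, residue_name): [AtomRecord, ...]}}."""
    chains: Dict[str, Dict[ResidueKey, List[AtomRecord]]] = {}
    for line in lines:
        if not line.startswith(("ATOM", "HETATM")):
            continue  # skip headers, TER, END, and other record types
        atom = parse_atom_line(line)
        residues = chains.setdefault(atom.chain_id, {})
        residues.setdefault((atom.residue_number, atom.residue_name), []).append(atom)
    return chains


demo = [
    "HEADER    DEMO",
    format_atom_line(1, "N", "THR", "A", 1, 17.047, 14.099, 3.625),
    format_atom_line(2, "CA", "THR", "A", 1, 16.967, 12.784, 4.338),
    format_atom_line(3, "N", "THR", "A", 2, 15.115, 11.555, 5.265),
    "END",
]
chains = group_by_chain(demo)
for chain_id, residues in chains.items():
    print(chain_id, len(residues), [len(atoms) for atoms in residues.values()])
```

The nested dictionaries keep insertion order, so chains and residues come back in file order, matching the nested iteration pattern in the Protein example.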

SDF and Dataset Workflows

SdfDataset builds a lightweight index of SDF record byte ranges, so individual records and chunks can be read without loading an entire file into memory.

from cosmolkit import SdfDataset

dataset = SdfDataset.open("library.sdf")
print(len(dataset))

record = dataset[0]
mol = record.molecule()

for batch in dataset.batches(size=1024, errors="keep_errors", n_jobs=8):
    smiles = batch.to_smiles_list()
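
The idea of indexing record byte ranges can be sketched with the standard library alone. This is an illustration of the general technique under the assumption that records end with a `$$$$` delimiter line, as in the SDF format; it is not SdfDataset's implementation, and index_sdf and read_record are hypothetical names.

```python
import io
from typing import BinaryIO, List, Tuple


def index_sdf(stream: BinaryIO) -> List[Tuple[int, int]]:
    """Scan the stream once and return one (start, end) byte range per record."""
    ranges = []
    start = stream.tell()
    has_content = False
    for line in iter(stream.readline, b""):
        if line.rstrip(b"\r\n") == b"$$$$":
            # The delimiter line closes the current record.
            ranges.append((start, stream.tell()))
            start = stream.tell()
            has_content = False
        elif line.strip():
            has_content = True
    if has_content:
        # Keep a trailing record that lacks a final delimiter line.
        ranges.append((start, stream.tell()))
    return ranges


def read_record(stream: BinaryIO, byte_range: Tuple[int, int]) -> str:
    """Seek to one record and read only its bytes."""
    start, end = byte_range
    stream.seek(start)
    return stream.read(end - start).decode("utf-8")


# Toy records: placeholder text, since only the $$$$ delimiter matters for indexing.
sdf_bytes = (
    b"ethanol\n  demo\n\nM  END\n$$$$\n"
    b"benzene\n  demo\n\nM  END\n$$$$\n"
)
stream = io.BytesIO(sdf_bytes)
index = index_sdf(stream)
print(len(index))
print(read_record(stream, index[1]).splitlines()[0])
```

Only the list of offsets stays in memory, so random access to a record costs one seek and one bounded read regardless of file size.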

Feature Areas

  • Molecular graph construction and inspection
  • SMILES parsing and writing
  • MOL/SDF reading and writing
  • Hydrogen transforms and Kekulization
  • Sanitization and chemistry problem detection
  • 2D coordinate generation and SVG/PNG depiction
  • Morgan and Avalon fingerprints
  • Distance-geometry bounds matrices
  • Substructure matching and SMARTS parsing
  • Ordered batch transforms and exports
  • PDB/mmCIF parsing and protein projection APIs
  • Support-status metadata for public features

Design Principles

COSMolKit aims to be Python-friendly, batch-friendly, and suitable for model-building workflows.

  • Correctness comes before breadth.
  • Public transforms use value semantics.
  • Mutation-capable workflows are explicit.
  • Unsupported chemistry should fail clearly.
  • RDKit-compatible behavior is the correctness floor for supported cheminformatics features.
  • High-throughput APIs should preserve input order and expose per-record failures.

Examples

Python examples live in python/examples/.

Roadmap

Status labels:

  • ✅ available in the public Python API
  • 🧪 implemented or partially available, still being hardened
  • 🚧 planned / not yet public

Chemistry Core

Goal: keep the supported molecular core correct before expanding breadth.

  • ✅ Molecule, atom, and bond graph model
  • ✅ SMILES parsing
  • ✅ SMILES writing with RDKit-style writer options for supported branches
  • ✅ Ring perception, valence handling, aromaticity, and Kekulization
  • ✅ Hydrogen addition and removal
  • ✅ Sanitization for supported chemistry workflows
  • ✅ Stereochemistry inspection for supported atom and bond states
  • ✅ Distance-geometry bounds matrices
  • ✅ Morgan fingerprints and Tanimoto similarity
  • 🧪 Avalon fingerprints
  • 🧪 Substructure matching and SMARTS parsing
  • 🚧 Broader descriptor APIs such as formula, molecular weight, and ring statistics
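
For readers unfamiliar with Tanimoto similarity, the measure is the number of shared on-bits divided by the number of distinct on-bits across both fingerprints. The sketch below assumes fingerprints are available as collections of on-bit indices; the tanimoto function is a hypothetical helper for illustration, not a COSMolKit API, and returning 0.0 for two empty fingerprints is a convention chosen here.

```python
from typing import Iterable


def tanimoto(on_bits_a: Iterable[int], on_bits_b: Iterable[int]) -> float:
    """Tanimoto similarity |A & B| / |A | B| over sets of on-bit indices."""
    a, b = set(on_bits_a), set(on_bits_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / union


fp_a = [1, 5, 9, 42]
fp_b = [5, 9, 100]
print(tanimoto(fp_a, fp_b))  # 2 shared bits out of 5 distinct bits
```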

File I/O and Depiction

Goal: make common molecule import, export, and visualization workflows usable from Python.

  • ✅ MOL/SDF reading
  • ✅ SDF dataset indexing for large files
  • ✅ SDF writing for supported V2000/V3000 branches
  • ✅ 2D coordinate generation
  • ✅ SVG drawing
  • ✅ PNG export
  • 🧪 RDKit-style visual parity testing for supported depiction output
  • 🚧 Annotation overlays and richer drawing customization
  • 🚧 3D conformer generation and embedding APIs

Batch-Native Workflows

Goal: make high-throughput molecule preparation and export a first-class part of the library.

  • ✅ Ordered MoleculeBatch.from_smiles_list()
  • ✅ Batch transforms for sanitization, hydrogens, Kekulization, and 2D coordinates
  • ✅ Configurable parallelism with with_parallel_jobs()
  • ✅ Configurable progress display with with_progress_bar()
  • ✅ Per-record errors, valid masks, and error reports
  • ✅ Batch SMILES, image, and SDF export paths
  • 🧪 Golden parity tests for parallel batch behavior
  • 🚧 More streaming and chunked dataset workflows

Protein and Structural Biology

Goal: provide practical Biopython-like structure workflows without forcing users through low-level structural tables.

  • ✅ Protein.from_pdb() / Protein.from_mmcif() high-level entry points
  • ✅ Protein chain, residue, and atom iteration
  • ✅ Protein-only projection from broader structural data
  • 🧪 PDB/mmCIF structural parsing
  • 🧪 Lower-level BioStructure access for advanced workflows
  • 🚧 Selection utilities for chains, residues, atoms, and neighborhoods
  • 🚧 Ligand, nucleic-acid, and mixed-structure ergonomic APIs

Python API and ML Readiness

Goal: expose verified Rust-backed behavior through a practical Python interface.

  • ✅ Value-style molecule transformations
  • ✅ Graph, coordinate, fingerprint, and bounds-matrix accessors
  • ✅ Python examples for drawing, SDF-to-SMILES, batch processing, and proteins
  • 🧪 Type stubs and documentation coverage
  • 🚧 Stable model-ready graph exports
  • 🚧 NumPy / PyTorch oriented adapters
  • 🚧 Molecular tokenization and AI-native geometry helpers

Browser and Deployment

Goal: support lightweight chemistry workflows outside native Python processes.

  • 🚧 WASM compilation target
  • 🚧 JavaScript bindings
  • 🚧 Browser-native SMILES/SDF parsing and depiction

Respect for RDKit

COSMolKit is developed with deep respect for RDKit and the broader open-source cheminformatics community. The goal is an independent Rust-native implementation that preserves interoperability and behavioral compatibility where appropriate, while offering a more deterministic Python API and an AI-native extension surface.

About

COSMolKit is a Rust-native cheminformatics and structural biology toolkit for molecules, SMILES/SDF/MolBlock parsing, molecular graphs, conformers, coordinates, and AI-ready batch workflows.
