78 changes: 45 additions & 33 deletions CONTRIBUTING.md
@@ -1,6 +1,6 @@
# Contributing to malariagen-data-python

Thanks for your interest in contributing to this project! This guide will help you get started.
Thanks for your interest in contributing! Whether you're fixing a bug, adding a feature, or improving the docs, we're glad to have you here. This guide will help you get your environment set up and walk you through the contribution process.

## About the project

@@ -10,18 +10,18 @@ This package provides Python tools for accessing and analyzing genomic data from

### Prerequisites

You'll need:
You'll need two tools before you start:

- [pipx](https://pipx.pypa.io/) for installing Python tools
- [git](https://git-scm.com/) for version control
- [pipx](https://pipx.pypa.io/) — installs Python CLI tools in isolated environments
- [git](https://git-scm.com/) for version control

Both of these can be installed using your distribution's package manager or [Homebrew](https://brew.sh/) on Mac.
Both can be installed via your distribution's package manager or [Homebrew](https://brew.sh/) on Mac.

### Initial setup

1. **Fork and clone the repository**

Fork the repository on GitHub, then clone your fork:
Fork the repository on GitHub so you have your own copy, then clone it locally:

```bash
git clone git@github.com:[your-username]/malariagen-data-python.git
@@ -30,19 +30,23 @@ Both of these can be installed using your distribution's package manager or [Hom

2. **Add the upstream remote**

This lets you pull in future changes from the main project:

```bash
git remote add upstream https://github.com/malariagen/malariagen-data-python.git
```

3. **Install Poetry**

[Poetry](https://python-poetry.org/) manages the project's dependencies and virtual environment:

```bash
pipx install poetry
```

4. **Install Python 3.12**

Python 3.12 is tested in the CI-system and is the recommended version to use.
Python 3.12 is the recommended version — it's what CI uses and what the team develops against:

```bash
poetry python install 3.12
@@ -87,19 +91,21 @@ Both of these can be installed using your distribution's package manager or [Hom

6. **Install pre-commit hooks**

Pre-commit hooks run the linter and formatter automatically before every commit, so code quality issues are caught early:

```bash
pipx install pre-commit
pre-commit install
```

Pre-commit hooks will automatically run `ruff` (linter and formatter) on your changes before each commit.

## Development workflow

### Creating a new feature or fix

1. **Sync with upstream**

Before starting, make sure your local `master` is up to date:

```bash
git checkout master
git pull upstream master
@@ -121,48 +127,48 @@ Both of these can be installed using your distribution's package manager or [Hom

4. **Run tests locally**

Fast unit tests using simulated data (no external data access):
Fast unit tests using simulated data (no external data access needed):

```bash
poetry run pytest -v tests --ignore tests/integration
```

To run integration tests which read data from GCS, you'll need to [request access to MalariaGEN data on GCS](https://malariagen.github.io/vector-data/vobs/vobs-data-access.html).
To run integration tests that read data from GCS, you'll first need to [request access to MalariaGEN data on GCS](https://malariagen.github.io/vector-data/vobs/vobs-data-access.html).

Once access has been granted, [install the Google Cloud CLI](https://cloud.google.com/sdk/docs/install). For example, on Linux:

```bash
./install_gcloud.sh
```

You'll then need to obtain application-default credentials, e.g.:
Then obtain application-default credentials:

```bash
./google-cloud-sdk/bin/gcloud auth application-default login
```

Once this is done, you can run integration tests:
Once authenticated, run integration tests:

```bash
poetry run pytest -v tests/integration
```

Tests will run slowly the first time, as data required for testing will be read from GCS. Subsequent runs will be faster as data will be cached locally in the "gcs_cache" folder.
Tests will run slowly the first time, as data will be read from GCS and cached locally in the `gcs_cache` folder.

6. **Run typechecking**
5. **Check code quality**

Run static typechecking with mypy:
The pre-commit hooks will run automatically on commit, but you can also run them manually at any time:

```bash
poetry run mypy malariagen_data tests --ignore-missing-imports
pre-commit run --all-files
```

5. **Check code quality**
6. **Run typechecking**

The pre-commit hooks will run automatically, but you can also run them manually:
Run static typechecking with mypy:

```bash
pre-commit run --all-files
poetry run mypy malariagen_data tests --ignore-missing-imports
```

### Code style
@@ -205,6 +211,8 @@ poetry run pytest -v tests --typeguard-packages=malariagen_data,malariagen_data.

### Before opening a pull request

Run through this checklist to make sure your PR is ready for review:

- [ ] Tests pass locally
- [ ] Pre-commit hooks pass (or run `pre-commit run --all-files`)
- [ ] Code is well-documented
@@ -224,18 +232,20 @@ poetry run pytest -v tests --typeguard-packages=malariagen_data,malariagen_data.
- Select your fork and branch
- Write a clear title and description

3. **Pull request description should include:**
3. **A good PR description includes:**
- What problem does this solve?
- How does it solve it?
- Any relevant issue numbers (e.g., "Fixes #123")
- Testing done
- Relevant issue numbers (e.g., "Fixes #123")
- What testing you did
- Any breaking changes or migration notes

### Review process

- PRs require approval from a project maintainer
- CI tests must pass (pytest on Python 3.10 with NumPy 1.26.4)
- Address review feedback by pushing new commits to your branch
Once your PR is open, a project maintainer will review it. Here's what to expect:

- PRs require approval from a project maintainer before merging
- CI tests must pass (pytest on Python 3.10, 3.11, and 3.12, with NumPy versions `==2.0.2` and `>=2.0.2,<2.1`)
- Address review feedback by pushing new commits to your branch — no need to open a new PR
- Once approved, a maintainer will merge your PR
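
As a quick local sanity check before pushing, a contributor could compare their interpreter against the CI matrix described above. This is a minimal sketch, not part of the project: `CI_PYTHON_VERSIONS` and `is_ci_tested` are hypothetical names, and the version set is copied by hand from the review-process text, so it will drift if CI changes.

```python
import sys
from typing import Sequence, Set, Tuple

# Assumption: versions copied from the review-process text; not read from CI config.
CI_PYTHON_VERSIONS: Set[Tuple[int, int]] = {(3, 10), (3, 11), (3, 12)}


def is_ci_tested(version: Sequence[int] = tuple(sys.version_info[:2])) -> bool:
    """Return True if the major.minor part of `version` is in the CI matrix."""
    return (version[0], version[1]) in CI_PYTHON_VERSIONS


if __name__ == "__main__":
    major, minor = sys.version_info[:2]
    status = "is" if is_ci_tested() else "is not"
    print(f"Python {major}.{minor} {status} covered by the CI matrix")
```

A passing check only says the interpreter version is one that CI exercises; it says nothing about the installed NumPy version.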

## Communication
@@ -247,18 +257,20 @@ poetry run pytest -v tests --typeguard-packages=malariagen_data,malariagen_data.

## Finding something to work on

- Look for issues labeled [`good first issue`](https://github.com/malariagen/malariagen-data-python/labels/good%20first%20issue)
- Check for issues labeled [`help wanted`](https://github.com/malariagen/malariagen-data-python/labels/help%20wanted)
- Improve documentation or add examples
Not sure where to start? Here are some good entry points:

- Issues labeled [`good first issue`](https://github.com/malariagen/malariagen-data-python/labels/good%20first%20issue) — designed to be approachable for new contributors
- Issues labeled [`help wanted`](https://github.com/malariagen/malariagen-data-python/labels/help%20wanted) — areas where the team would love community help
- Improve documentation or add usage examples
- Increase test coverage

## Questions?

If you're unsure about anything, feel free to:
Don't hesitate to ask — we'd rather help you get unstuck than have you spin your wheels:

- Open an issue to ask
- Start a discussion on GitHub Discussions
- Ask in your pull request
- Open an issue to ask a question
- Start a discussion on [GitHub Discussions](https://github.com/malariagen/malariagen-data-python/discussions)
- Ask directly in your pull request

We appreciate your contributions and will do our best to help you succeed!

99 changes: 75 additions & 24 deletions LINUX_SETUP.md
@@ -1,67 +1,118 @@
# Developer setup (Linux)

To get setup for development, see [this video if you prefer VS Code](https://youtu.be/zddl3n1DCFM), or [this older video if you prefer PyCharm](https://youtu.be/QniQi-Hoo9A), and the instructions below.
Welcome! This guide walks you through getting a local development environment up and running on Linux.
If you prefer a video walkthrough, check out [this VS Code version](https://youtu.be/zddl3n1DCFM) or [this older PyCharm version](https://youtu.be/QniQi-Hoo9A).

## 1. Fork and clone this repo

Start by creating your own copy of the repo on GitHub (fork it), then download it to your machine:

```bash
git clone git@github.com:[username]/malariagen-data-python.git
cd malariagen-data-python
```

## 2. Install Python

We recommend Python 3.12, which is the version used in CI. How you install it depends on your distro:

**Ubuntu** — add the Deadsnakes PPA and install Python 3.12:
```bash
sudo add-apt-repository ppa:deadsnakes/ppa
sudo apt install python3.10 python3.10-venv
sudo apt install python3.12 python3.12-venv
```

**Debian or other Linux distributions** — the Deadsnakes PPA is Ubuntu-only and won't work here.
Use [pyenv](https://github.com/pyenv/pyenv) instead, which compiles Python from source and works on any distro:

```bash
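# 1. Install the libraries Python needs to compile
#    (apt is shown here; on non-Debian distros, install the equivalent
#    packages with your own package manager)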
sudo apt install -y build-essential libssl-dev zlib1g-dev libbz2-dev \
libreadline-dev libsqlite3-dev curl libncursesw5-dev xz-utils \
tk-dev libxml2-dev libxmlsec1-dev libffi-dev liblzma-dev

# 2. Download and install pyenv
curl https://pyenv.run | bash

# 3. Tell your shell where pyenv lives (add these lines to ~/.bashrc or ~/.zshrc)
export PYENV_ROOT="$HOME/.pyenv"
export PATH="$PYENV_ROOT/bin:$PATH"
eval "$(pyenv init -)"

# 4. Reload your shell configuration so the changes take effect
source ~/.bashrc # or: source ~/.zshrc

# 5. Install Python 3.12 and activate it for this session
pyenv install 3.12
pyenv shell 3.12
```

## 3. Install pipx and poetry
> **Note:** `pyenv install` compiles Python from source, so it may take a few minutes.

## 3. Install pipx and Poetry

[pipx](https://pipx.pypa.io/) installs Python command-line tools in isolated environments so they don't interfere with your project.
[Poetry](https://python-poetry.org/) manages the project's dependencies and virtual environment.

```bash
python3.10 -m pip install --user pipx
python3.10 -m pipx ensurepath
python3.12 -m pip install --user pipx
python3.12 -m pipx ensurepath
pipx install poetry
```

## 4. Create and activate development environment
> **Tip:** After running `pipx ensurepath`, you may need to open a new terminal for the `pipx` and `poetry` commands to be found.

## 4. Create and activate the development environment

Poetry will create a virtual environment and install all the project's dependencies:

```bash
poetry install
poetry shell
```

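Once inside `poetry shell`, you're working inside the project's virtual environment. You can type `exit` to leave it. Note that Poetry 2.0 and later no longer bundle the `shell` command: it is provided by the `poetry-plugin-shell` plugin, or you can run `poetry env activate`, which prints the command needed to activate the environment.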

## 5. Install pre-commit hooks

Pre-commit hooks run the linter and formatter automatically before each of your commits, catching issues early:

```bash
pipx install pre-commit
pre-commit install
```

Run pre-commit checks manually:
To run all the checks manually at any time:
```bash
pre-commit run --all-files
```

## 6. Run tests

Run fast unit tests using simulated data:
Check that everything is working with the fast unit tests (no internet access needed):

```bash
poetry run pytest -v tests/anoph
```

## 7. Google Cloud authentication (for legacy tests)

To run legacy tests which read data from GCS, you'll need to [request access to MalariaGEN data on GCS](https://malariagen.github.io/vector-data/vobs/vobs-data-access.html).
If the tests pass, you're all set! 🎉

Once access has been granted, [install the Google Cloud CLI](https://cloud.google.com/sdk/docs/install):
```bash
./install_gcloud.sh
```
## 7. Google Cloud authentication (for legacy tests)

Then obtain application-default credentials:
```bash
./google-cloud-sdk/bin/gcloud auth application-default login
```
Most development doesn't require this step. Legacy integration tests read real data directly from Google Cloud Storage (GCS), so you'll need to apply for data access first.

Once authenticated, run legacy tests:
```bash
poetry run pytest --ignore=tests/anoph -v tests
```
1. [Request access to MalariaGEN data on GCS](https://malariagen.github.io/vector-data/vobs/vobs-data-access.html).
2. Once access is granted, install the Google Cloud CLI:
```bash
./install_gcloud.sh
```
3. Authenticate with your Google account:
```bash
./google-cloud-sdk/bin/gcloud auth application-default login
```
4. Run the legacy tests:
```bash
poetry run pytest --ignore=tests/anoph -v tests
```

Tests will run slowly the first time, as data will be read from GCS and cached locally in the `gcs_cache` folder.
> **Heads up:** Tests will be slow on the first run because data is downloaded from GCS. After that, it's cached locally in `gcs_cache/` so subsequent runs are much faster.
2 changes: 1 addition & 1 deletion malariagen_data/anoph/base.py
@@ -134,7 +134,7 @@ def __init__(
storage_options = dict()
try:
self._fs, self._base_path = _init_filesystem(self._url, **storage_options)
except Exception as exc: # pragma: no cover
except (OSError, ImportError, ValueError) as exc: # pragma: no cover
raise IOError(
"An error occurred establishing a connection to the storage system. Please see the nested exception for more details."
) from exc
13 changes: 7 additions & 6 deletions malariagen_data/anoph/snp_data.py
@@ -67,12 +67,13 @@ def __init__(
self._default_site_mask = default_site_mask

# Set up caches.
# TODO review type annotations here, maybe can tighten
self._cache_snp_sites = None
self._cache_snp_genotypes: Dict = dict()
self._cache_site_filters: Dict = dict()
self._cache_site_annotations = None
self._cache_locate_site_class: Dict = dict()
self._cache_snp_sites: Optional[zarr.hierarchy.Group] = None
self._cache_snp_genotypes: Dict[str, zarr.hierarchy.Group] = dict()
self._cache_site_filters: Dict[str, zarr.hierarchy.Group] = dict()
self._cache_site_annotations: Optional[zarr.hierarchy.Group] = None
self._cache_locate_site_class: Dict[
Tuple[Region, Optional[str], str], np.ndarray
] = dict()

# Create the SNP-calls cache as a per-instance lru_cache wrapping the
# bound method. Storing it on the instance (rather than using a
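
The two caching patterns touched in this hunk can be sketched with the standard library only: caches annotated with `Optional[...]` and `Dict[...]`, and an `lru_cache` created per instance around a bound method. This is an illustration under stated assumptions, not the project's code: `SiteStore`, `_compute_calls`, and the other names are hypothetical, and plain lists and strings stand in for the zarr groups and NumPy arrays used in the real class.

```python
from functools import lru_cache
from typing import Dict, List, Optional, Tuple


class SiteStore:
    def __init__(self) -> None:
        # None means "not loaded yet"; Optional makes that explicit to mypy.
        self._cache_sites: Optional[List[int]] = None
        # Keyed cache with a tuple key; Optional[str] allows "no mask".
        self._cache_locate: Dict[Tuple[str, Optional[str]], List[int]] = dict()
        self.compute_count = 0
        # Per-instance LRU cache wrapping the bound method, so each
        # instance has its own cache rather than one shared by the class.
        self._cached_calls = lru_cache(maxsize=8)(self._compute_calls)

    def sites(self) -> List[int]:
        if self._cache_sites is None:
            self._cache_sites = [10, 20, 30]  # placeholder for a real load
        return self._cache_sites

    def locate(self, region: str, mask: Optional[str] = None) -> List[int]:
        key = (region, mask)
        if key not in self._cache_locate:
            self._cache_locate[key] = [p for p in self.sites() if p >= 20]
        return self._cache_locate[key]

    def _compute_calls(self, region: str) -> str:
        self.compute_count += 1
        return f"calls for {region}"

    def calls(self, region: str) -> str:
        return self._cached_calls(region)
```

Because the annotations spell out key and value types, mypy can flag a lookup with the wrong key shape or a missing `None` check on the `Optional` cache.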