203 changes: 203 additions & 0 deletions docs/datasets/ingestion-guide/index.md
@@ -0,0 +1,203 @@
# Dataset Ingestion Guide

Once you have your database and the SciCat API server up and running, several options exist for ingesting a dataset into SciCat, ranging from quick, one-time ingestion via `curl` (not recommended) to a fully automated ingestion setup using Python software for SciCat API access based on [`pydantic`](https://docs.pydantic.dev/latest/concepts/models/) (a Python validation library that ensures valid formats), for example [`pyscicat`](https://www.scicatproject.org/pyscicat/) or SciCat's `python-sdk`.

Another example, using a Jupyter notebook in SciCatLive, can be found [here](https://github.com/SciCatProject/scicatlive/blob/main/services/jupyter/config/notebooks/pyscicat.ipynb); it shows how to authenticate, create a dataset, add datablocks, and upload an attachment.

## The `curl` command
The best way to make a successful request to one of SciCat's endpoints is to learn from Swagger: browse it to see the request syntax, obtain a valid token (via the auth login endpoint, or from the user settings page in the frontend), and provide the correct fields while excluding the forbidden ones. The following sections give skeleton examples.

### Simple GET request
with a known `pid`, as an authenticated user (replace the placeholders; URL-encode the `pid` if it contains characters such as `/`):

```bash
curl -X 'GET' \
"${URL}/api/v4/datasets/${pid}" \
-H 'accept: application/json' \
-H "Authorization: Bearer ${TOKEN}"
```
If you are only interested in public records, you do not need to authenticate; simply repeat the command without the last line (`-H "Authorization: Bearer ${TOKEN}"`).

### GET identifiers of `origdatablocks` (OIDs) of a dataset

The dataset is identified by its `pid`. We use the v4 API and first define a filter:

```bash
export URL="valid_url"
export TOKEN="valid_token"
export pid="example_pid"

# -- GET the oids ------------------
FILTER_JSON='{
"where": { "pid": "'${pid}'" },
"include": [
{
"relation": "origdatablocks",
"scope": {
"fields": ["_id"],
"limit": 20,
"order": "filename ASC"
}
}
],
"fields": ["datasetName"],
"limit": 10
}'

curl -G "${URL}/api/v4/datasets" \
--data-urlencode "filter=${FILTER_JSON}" \
-H "Accept: application/json" \
-H "Authorization: Bearer ${TOKEN}"

```
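The same query can be issued from Python using only the standard library. This is a sketch, not an official client: `url`, `token`, and `pid` are placeholders you must replace, and the filter simply mirrors the bash example above.

```python
import json
import urllib.parse
import urllib.request


def build_origdatablock_filter(pid: str) -> str:
    """Build the filter JSON used in the curl example above."""
    return json.dumps({
        "where": {"pid": pid},
        "include": [
            {
                "relation": "origdatablocks",
                "scope": {"fields": ["_id"], "limit": 20, "order": "filename ASC"},
            }
        ],
        "fields": ["datasetName"],
        "limit": 10,
    })


def get_datasets(url: str, token: str, pid: str) -> list:
    """GET /api/v4/datasets with the filter; return the decoded JSON response."""
    query = urllib.parse.urlencode({"filter": build_origdatablock_filter(pid)})
    request = urllib.request.Request(
        f"{url}/api/v4/datasets?{query}",
        headers={"Accept": "application/json", "Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Example call (placeholders, not executed here):
# get_datasets("http://localhost:3000", "valid_token", "example_pid")
```

`urllib.parse.urlencode` takes care of the URL-encoding that `--data-urlencode` performs in the curl version.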

### Create a dataset
Make sure the JSON is well formed. Some fields are mandatory; you can check in Swagger which ones are and which are not, see also the [swagger documentation](../../swagger/index.md#tips-and-tricks).
```bash
curl -X 'POST' \
"${URL}/api/v4/datasets" \
-H 'accept: application/json' \
-H "Authorization: Bearer ${TOKEN}" \
-H 'Content-Type: application/json' \
-d '{
"ownerGroup": "string",
"accessGroups": [
"string"
],
"instrumentGroup": "string",
"owner": "string",
"ownerEmail": "user@example.com",
"orcidOfOwner": "string",
"contactEmail": "string",
"sourceFolder": "string",
"sourceFolderHost": "string",
"size": 0,
"packedSize": 0,
"numberOfFiles": 0,
"numberOfFilesArchived": 0,
"creationTime": "2026-04-22T10:54:53.642Z",
"validationStatus": "string",
"keywords": [
"string"
],
"description": "string",
"datasetName": "string",
"classification": "string",
"license": "string",
"isPublished": false,
"techniques": [],
"sharedWith": [],
"relationships": [],
"datasetlifecycle": {},
"scientificMetadata": {},
"scientificMetadataSchema": "string",
"scientificMetadataValid": true,
"comment": "string",
"dataQualityMetrics": 0,
"principalInvestigators": [
"string"
],
"startTime": "2026-04-22T10:54:53.642Z",
"endTime": "2026-04-22T10:54:53.642Z",
"creationLocation": "string",
"dataFormat": "string",
"proposalIds": [
"string"
],
"sampleIds": [
"string"
],
"instrumentIds": [
"string"
],
"inputDatasets": [
"string"
],
"usedSoftware": [
"string"
],
"jobParameters": {},
"jobLogData": "string",
"runNumber": "string",
"pid": "string",
"type": "string"
}'

```
Note that datafiles and attachments related to a SciCat dataset need to be POSTed separately, as they are a priori independent entities.
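In practice it is easier to build only the fields you actually need rather than the full schema above. The sketch below assembles a minimal raw-dataset payload; which fields are mandatory depends on your SciCat version and the dataset `type`, so treat this particular selection as an assumption and verify it against Swagger.

```python
import json
from datetime import datetime, timezone


def minimal_dataset_payload(
    dataset_name: str,
    owner: str,
    contact_email: str,
    owner_group: str,
    source_folder: str,
    creation_location: str,
) -> dict:
    """Assemble a minimal 'raw' dataset payload (assumed field set; check Swagger)."""
    return {
        "datasetName": dataset_name,
        "owner": owner,
        "contactEmail": contact_email,
        "ownerGroup": owner_group,
        "sourceFolder": source_folder,
        "creationLocation": creation_location,
        "creationTime": datetime.now(timezone.utc).isoformat(),
        "type": "raw",
    }


# Placeholder values for illustration only:
payload = minimal_dataset_payload(
    "my-dataset", "Jane Doe", "jane@example.com",
    "group1", "/data/raw/run42", "/EXAMPLE/BEAMLINE",
)
print(json.dumps(payload, indent=2))
```

The printed JSON can be passed directly as the `-d` body of the POST request above.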

### Adding datafiles
Associated datafiles are organised in datablocks. There are original datablocks (`origdatablocks`) and `datablocks`; the latter are obsolete, and their functionality has fully moved to `origdatablocks`. To attach the metadata of these associated datafiles to the dataset, use e.g. `/api/v4/origdatablocks`. Note that the dataset to which a block belongs is indicated by `datasetId`, which corresponds to the `pid` field of the dataset itself. This can be achieved by the following command (with placeholders replaced):

```bash
curl -X 'POST' \
'http://localhost:3000/api/v4/origdatablocks' \
-H 'accept: application/json' \
-H "Authorization: Bearer ${TOKEN}" \
-H 'Content-Type: application/json' \
-d '{
"ownerGroup": "string",
"accessGroups": [
"string"
],
"instrumentGroup": "string",
"size": 0,
"chkAlg": "string",
"dataFileList": [
{
"path": "string",
"size": 0,
"time": "2026-04-22T11:28:56.870Z",
"chk": "string",
"uid": "string",
"gid": "string",
"perm": "string",
"type": "string"
}
],
"isPublished": true,
"datasetId": "string"
}'
```
Note also that the `ownerGroup` must match the same field in the dataset. The same holds for attachments.
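The `dataFileList` entries can be gathered from the filesystem. The sketch below walks a folder with the Python standard library and computes each file's size, modification time, and a SHA-256 checksum. The field names match the example above, but the checksum algorithm (`chkAlg`) is site-specific, so SHA-256 here is only an assumption.

```python
import hashlib
import os
from datetime import datetime, timezone


def scan_folder(folder: str) -> tuple[list[dict], int]:
    """Return (dataFileList, total size in bytes) for every file below folder."""
    files, total = [], 0
    for root, _dirs, names in os.walk(folder):
        for name in sorted(names):
            full = os.path.join(root, name)
            stat = os.stat(full)
            # Checksum of the full file contents; fine for small files,
            # stream in chunks for large ones.
            with open(full, "rb") as fh:
                chk = hashlib.sha256(fh.read()).hexdigest()
            files.append({
                "path": os.path.relpath(full, folder),
                "size": stat.st_size,
                "time": datetime.fromtimestamp(stat.st_mtime, timezone.utc).isoformat(),
                "chk": chk,
            })
            total += stat.st_size
    return files, total
```

The resulting list and total then slot into the `dataFileList` and `size` fields of the POST body above, with `chkAlg` set to `"sha256"`.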

### Adding attachments

Here, too, a POST request ingests attachments to the dataset, e.g. like this (with placeholders replaced):

```bash
curl -X 'POST' \
'http://localhost:3000/api/v4/attachments' \
-H 'accept: application/json' \
-H "Authorization: Bearer ${TOKEN}" \
-H 'Content-Type: application/json' \
-d '{
"ownerGroup": "string",
"accessGroups": [
"string"
],
"instrumentGroup": "string",
"thumbnail": "string",
"caption": "string",
"relationships": [
{
"targetId": "string",
"targetType": "dataset",
"relationType": "is attached to"
}
],
"isPublished": true,
"aid": "string"
}'
```
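The `thumbnail` field holds the image data itself, not a URL: it is commonly sent as a base64-encoded data URL. A small helper could look like the sketch below (assumptions: your deployment accepts the `data:image/png;base64,` form, and the input is a PNG; adjust the MIME type for other formats).

```python
import base64


def encode_thumbnail(path: str, mime: str = "image/png") -> str:
    """Read an image file and return it as a base64 data URL for `thumbnail`."""
    with open(path, "rb") as fh:
        encoded = base64.b64encode(fh.read()).decode("ascii")
    return f"data:{mime};base64,{encoded}"
```

Keep thumbnails small; they are stored inline with the attachment record.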


## Pythonic way: python-sdk
The Python software development kit (SDK) is entirely generated from the backend, based on the OpenAPI initiative and the Swagger definitions. More information can be found [here](https://www.piwheels.org/project/scicat-sdk-py/).

## Pythonic way: pyscicat
`pyscicat` offers the same functionality as the Python SDK but is meant to be more user friendly; it is maintained by [dmcreyno](https://pypi.org/user/dmcreyno/). Some intuitive examples and documentation on how to ingest can be found [here](https://www.scicatproject.org/pyscicat/howto/ingest.html).

## Pythonic way: scitacean
Scitacean is a high level Python package for downloading and uploading datasets from and to SciCat.

See the [documentation](https://www.scicatproject.org/scitacean/) for installation and usage instructions.

If you need help, have a look at our [GitHub discussions](https://github.com/SciCatProject/scitacean/discussions). If you can't find an answer there, please start a Q&A discussion.
7 changes: 2 additions & 5 deletions docs/operator-guide/index.md
@@ -14,13 +14,10 @@ SciCat covers these core aspects in a flexible way:
1. Searchable metadata fields, most common and highly specific ones. SciCat was developed by the PaNoSc community and has been successfully used more widely. This is because SciCat is highly configurable.
2. Provision of unique persistent identifiers not only for the internal catalogue, but also connecting to the global DOI system through e.g. ready pathway to publication via [DataCite](https://datacite.org/).

SciCat is an open source project can can be developed in accordance with our [license](https://github.com/SciCatProject/scicat-backend-next?tab=BSD-3-Clause-1-ov-file#readme).
SciCat is an open source project and can be developed in accordance with our [license](https://github.com/SciCatProject/backend?tab=BSD-3-Clause-1-ov-file#readme).

## Dataset ingestion
You find here a pythonic way of metadata ingestion using SciCats API based on the PySciCat client:
See this [how-to-ingest doc](https://www.scicatproject.org/pyscicat/howto/ingest.html) to get started.

Another example that uses Jupyter Notebook in SciCatLive (see below) can be found [here]([https://github.com/SciCatProject/scicatlive/blob/main/services/jupyter/config/notebooks/pyscicat.ipynb) which includes how to authenticate, create a dataset, add datablocks and upload an attachement.
There are several ways of ingesting a dataset into SciCat, [here](../datasets/ingestion-guide/index.md) are the details.

## Up-to-date operator's information
Generally, the [**scicatlive**](https://www.scicatproject.org/scicatlive/latest/) documentation contains up-to-date information on how to set up and run `SciCat` and interface it with various external, site-specific services. For troubleshooting issues, please refer to [the User's Guide](../troubleshoot/index.md).
Binary file added docs/sites/img/SciCatATPSI.png
> **Collaborator:** should this be in here? It seems to have crept in from another branch?
>
> **@rkweehinzmann (Member Author), Apr 30, 2026:** actually this is not complete. Can one reject this file to be included in this PR? If not, it will require a separate fix unrelated to ingestion guide!

Binary file added docs/swagger/img/swagger_required_fields.png
9 changes: 8 additions & 1 deletion docs/swagger/index.md
@@ -13,6 +13,13 @@ You need to authenticate twice:
1. Get the **SciCat token** from the user settings when logged into SciCat via the main GUI. Copy-paste it into the "Authorize" field in the explorer at the top right. ![swagger login](img/swagger_getToken.png)
<br>
<br>
2. Login on the explorer page again with the same credentials. ![swagger login](img/swaggerLogin.png)
2. Login on the explorer page again with the same credentials using the token. ![Swagger login](img/swaggerLogin.png)

## Tips and Tricks

To see which fields are required, check the "Schema" link next to the _Example Value_ in the **Request body** section; required fields are marked with a red asterisk. ![required fields](img/swagger_required_fields.png)