Datasets ingestion guide and SciCat at PSI #34
Open
rkweehinzmann wants to merge 10 commits into SciCatProject:main from rkweehinzmann:datasets-ingestion-guide
b9e13f2 adding SciCat at PSI png (rkweehinzmann)
c730b2d Merge branch 'main' of github.com:rkweehinzmann/user-documentation (rkweehinzmann)
920aa33 Merge branch 'SciCatProject:main' into main (rkweehinzmann)
2af1ff5 add ingestor guide (rkweehinzmann)
389745d adding example curl commands for origdatablocks and attachments (rkweehinzmann)
534f791 add datafiles and attachment ingestion with curl and add swagger tips… (rkweehinzmann)
4e3b306 adding link to swagger section (rkweehinzmann)
2329b1f Merge pull request #6 from SciCatProject/main (rkweehinzmann)
67ba0e6 Apply suggestions from code review (rkweehinzmann)
f014a6d Apply suggestions from code review (rkweehinzmann)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,203 @@ | ||
| # Dataset Ingestion Guide | ||
|
|
||
| Once your database and the SciCat API server are up and running, there are several ways to ingest a dataset into SciCat. They range from a quick, one-time ingestion via `curl` (not recommended) to a fully automated ingestion setup that uses Python software for SciCat API access, for example [`pyscicat`](https://www.scicatproject.org/pyscicat/) or SciCat's `python-sdk`. Such clients are based on [`pydantic`](https://pydantic.dev/docs/validation/latest/concepts/models/), a Python data validation library that ensures valid formats. | ||
|
|
||
| Another example, a Jupyter notebook in SciCatLive, can be found [here](https://github.com/SciCatProject/scicatlive/blob/main/services/jupyter/config/notebooks/pyscicat.ipynb). It shows how to authenticate, create a dataset, add datablocks and upload an attachment. | ||
|
|
||
| ## The `curl` command | ||
| The most reliable way to build a successful request to one of SciCat's endpoints is to learn from Swagger. Browse it to see the request syntax, obtain a valid token via the auth login endpoint or by looking it up in the user settings on the frontend, and provide the correct fields while leaving out the forbidden ones. The following sections give skeleton examples. | ||
|
|
||
| ### Simple GET request | ||
| Fetch a dataset with a known `pid` as an authenticated user (replace the placeholders): | ||
|
|
||
| ```bash | ||
| # If the pid contains a "/", it may need to be percent-encoded as %2F in the URL path. | ||
| curl -X GET \ | ||
| "http://${URL}/api/v4/datasets/${pid}" \ | ||
| -H "Accept: application/json" \ | ||
| -H "Authorization: Bearer ${TOKEN}" | ||
| ``` | ||
| If you are only interested in public records, you do not need to authenticate: repeat the command without the last line, `-H "Authorization: Bearer ${TOKEN}"`. | ||
|
|
||
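SciCat pids frequently contain a `/`, which would otherwise be read as a path separator in the request URL. The following is a minimal sketch of percent-encoding a pid before it goes into the path. The helper name `encode_pid` and the example pid are made up for illustration, and the helper only handles `%`, `/` and spaces rather than every reserved character.

```bash
#!/bin/sh
# encode_pid: percent-encode the characters most likely to appear in a pid.
# Hypothetical helper; it only handles "%", "/" and spaces.
# "%" is replaced first so that the other replacements are not encoded twice.
encode_pid() {
  printf '%s' "$1" | sed -e 's|%|%25|g' -e 's|/|%2F|g' -e 's| |%20|g'
}

pid="20.500.12269/example-id"   # made-up pid for illustration
encoded_pid="$(encode_pid "${pid}")"
echo "${encoded_pid}"
```

The encoded value can then be used in place of `${pid}` in the GET request, for example `"http://${URL}/api/v4/datasets/${encoded_pid}"`.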
| ### GET identifiers of `origdatablocks` (OIDs) of a dataset | ||
|
|
||
| The dataset is identified by its `pid`. We use the v4 API and first define a filter: | ||
|
|
||
| ```bash | ||
| export URL="valid_url" | ||
| export TOKEN="valid_token" | ||
| export pid="example_pid" | ||
|
|
||
| # -- GET the oids ------------------ | ||
| # The pid is spliced in between two single-quoted parts; the double quotes around ${pid} keep it one word. | ||
| FILTER_JSON='{ | ||
|   "where": { "pid": "'"${pid}"'" }, | ||
|   "include": [ | ||
|     { | ||
|       "relation": "origdatablocks", | ||
|       "scope": { | ||
|         "fields": ["_id"], | ||
|         "limit": 20, | ||
|         "order": "filename ASC" | ||
|       } | ||
|     } | ||
|   ], | ||
|   "fields": ["datasetName"], | ||
|   "limit": 10 | ||
| }' | ||
|
|
||
| curl -G "${URL}/api/v4/datasets" \ | ||
| --data-urlencode "filter=${FILTER_JSON}" \ | ||
| -H "Accept: application/json" \ | ||
| -H "Authorization: Bearer ${TOKEN}" | ||
|
|
||
| ``` | ||
|
|
||
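If the same filter is needed for several pids, the quote juggling can be kept in one place. The sketch below wraps the filter from the example above in a small function. The name `build_filter` and its optional limit arguments are hypothetical, and it assumes the pid contains no double quotes or backslashes.

```bash
#!/bin/sh
# build_filter: print the filter JSON for one pid on a single line.
# Hypothetical helper; $1 = pid, $2 = origdatablock limit (default 20),
# $3 = dataset limit (default 10). Assumes the pid needs no JSON escaping.
build_filter() {
  printf '{"where":{"pid":"%s"},"include":[{"relation":"origdatablocks","scope":{"fields":["_id"],"limit":%d,"order":"filename ASC"}}],"fields":["datasetName"],"limit":%d}\n' \
    "$1" "${2:-20}" "${3:-10}"
}

FILTER_JSON="$(build_filter "example_pid")"
echo "${FILTER_JSON}"
```

The resulting `FILTER_JSON` can be passed to `curl -G` with `--data-urlencode "filter=${FILTER_JSON}"` exactly as in the command above.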
| ### Create a dataset | ||
| Make sure the JSON is well formed. Some fields are mandatory; you can check which ones in Swagger, see also the [swagger documentation](../../swagger/index.md#tips-and-tricks). | ||
| ```bash | ||
| curl -X POST \ | ||
| "https://${URL}/api/v4/datasets" \ | ||
| -H "Accept: application/json" \ | ||
| -H "Authorization: Bearer ${TOKEN}" \ | ||
| -H "Content-Type: application/json" \ | ||
| -d '{ | ||
| "ownerGroup": "string", | ||
| "accessGroups": [ | ||
| "string" | ||
| ], | ||
| "instrumentGroup": "string", | ||
| "owner": "string", | ||
| "ownerEmail": "user@example.com", | ||
| "orcidOfOwner": "string", | ||
| "contactEmail": "string", | ||
| "sourceFolder": "string", | ||
| "sourceFolderHost": "string", | ||
| "size": 0, | ||
| "packedSize": 0, | ||
| "numberOfFiles": 0, | ||
| "numberOfFilesArchived": 0, | ||
| "creationTime": "2026-04-22T10:54:53.642Z", | ||
| "validationStatus": "string", | ||
| "keywords": [ | ||
| "string" | ||
| ], | ||
| "description": "string", | ||
| "datasetName": "string", | ||
| "classification": "string", | ||
| "license": "string", | ||
| "isPublished": false, | ||
| "techniques": [], | ||
| "sharedWith": [], | ||
| "relationships": [], | ||
| "datasetlifecycle": {}, | ||
| "scientificMetadata": {}, | ||
| "scientificMetadataSchema": "string", | ||
| "scientificMetadataValid": true, | ||
| "comment": "string", | ||
| "dataQualityMetrics": 0, | ||
| "principalInvestigators": [ | ||
| "string" | ||
| ], | ||
| "startTime": "2026-04-22T10:54:53.642Z", | ||
| "endTime": "2026-04-22T10:54:53.642Z", | ||
| "creationLocation": "string", | ||
| "dataFormat": "string", | ||
| "proposalIds": [ | ||
| "string" | ||
| ], | ||
| "sampleIds": [ | ||
| "string" | ||
| ], | ||
| "instrumentIds": [ | ||
| "string" | ||
| ], | ||
| "inputDatasets": [ | ||
| "string" | ||
| ], | ||
| "usedSoftware": [ | ||
| "string" | ||
| ], | ||
| "jobParameters": {}, | ||
| "jobLogData": "string", | ||
| "runNumber": "string", | ||
| "pid": "string", | ||
| "type": "string" | ||
| }' | ||
|
|
||
| ``` | ||
| Note that datafiles and attachments related to a SciCat dataset need to be POSTed separately, as they are a priori independent entities. | ||
|
|
||
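Long inline payloads are easy to break with a stray quote. One option is to keep the payload in a file and let `curl` read it with `-d @file`, so the JSON can be checked with any JSON tool before it is sent. The sketch below assumes that `URL` includes the scheme and that `TOKEN` is set. The field subset is picked for brevity and is not the list of mandatory fields, and `"type": "raw"` and all other values are placeholders; check Swagger for what your instance requires.

```bash
#!/bin/sh
# Placeholder values for illustration only.
OWNER_GROUP="demo-group"
DATASET_NAME="demo"

# Write the payload to a temporary file; the shell expands the variables.
PAYLOAD_FILE="$(mktemp)"
cat > "${PAYLOAD_FILE}" <<EOF
{
  "ownerGroup": "${OWNER_GROUP}",
  "owner": "string",
  "contactEmail": "user@example.com",
  "sourceFolder": "/path/to/data",
  "creationTime": "$(date -u +%Y-%m-%dT%H:%M:%SZ)",
  "datasetName": "${DATASET_NAME}",
  "type": "raw"
}
EOF

# post_dataset: hypothetical wrapper; $1 = path to the JSON payload file.
# It is only defined here, not called, so nothing is sent yet.
post_dataset() {
  curl -X POST "${URL}/api/v4/datasets" \
    -H "Accept: application/json" \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer ${TOKEN}" \
    -d @"$1"
}

echo "payload written to ${PAYLOAD_FILE}"
```

Once the file looks right, `post_dataset "${PAYLOAD_FILE}"` sends it.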
| ### Adding datafiles | ||
| Associated datafiles are organised in datablocks. There are original datablocks (`origdatablocks`) and `datablocks`. The latter are obsolete and their functionality has fully moved to `origdatablocks`. To attach the metadata of these datafiles to the dataset, use e.g. `/api/v4/origdatablocks`. Note that the dataset to which the blocks belong is indicated by `datasetId`, which corresponds to the `pid` field of the dataset itself. The following command does this (replace the placeholders): | ||
|
|
||
| ```bash | ||
| curl -X POST \ | ||
| 'http://localhost:3000/api/v4/origdatablocks' \ | ||
| -H 'accept: application/json' \ | ||
| -H "Authorization: Bearer ${TOKEN}" \ | ||
| -H 'Content-Type: application/json' \ | ||
| -d '{ | ||
| "ownerGroup": "string", | ||
| "accessGroups": [ | ||
| "string" | ||
| ], | ||
| "instrumentGroup": "string", | ||
| "size": 0, | ||
| "chkAlg": "string", | ||
| "dataFileList": [ | ||
| { | ||
| "path": "string", | ||
| "size": 0, | ||
| "time": "2026-04-22T11:28:56.870Z", | ||
| "chk": "string", | ||
| "uid": "string", | ||
| "gid": "string", | ||
| "perm": "string", | ||
| "type": "string" | ||
| } | ||
| ], | ||
| "isPublished": true, | ||
| "datasetId": "string" | ||
| }' | ||
| ``` | ||
| Also note that the `ownerGroup` must match the same field in the dataset. The same holds for attachments. | ||
|
|
||
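The two constraints above (`datasetId` equals the dataset `pid`, and `ownerGroup` equals the dataset's `ownerGroup`) are easier to satisfy when the payload is derived from the dataset's values rather than typed by hand. The following is a rough sketch for a single local file. The function name `origdatablock_payload` is hypothetical, only a few fields are filled in, and the fields left out here (for example `chk`, `uid`, `gid`, `perm`) may be needed by your instance.

```bash
#!/bin/sh
# origdatablock_payload: print an origdatablock payload for one local file.
# Hypothetical helper; $1 = dataset pid, $2 = dataset ownerGroup,
# $3 = file path, $4 = ISO 8601 timestamp for the file.
# The file path is used as given; no JSON escaping is applied to any argument.
origdatablock_payload() {
  # wc -c prints the byte count; strip any padding some platforms add.
  size="$(wc -c < "$3" | tr -d '[:space:]')"
  printf '{"ownerGroup":"%s","size":%s,"dataFileList":[{"path":"%s","size":%s,"time":"%s"}],"datasetId":"%s"}\n' \
    "$2" "${size}" "$3" "${size}" "$4" "$1"
}

# Demo with a throwaway five-byte file and made-up identifiers.
demo_file="$(mktemp)"
printf 'hello' > "${demo_file}"
origdatablock_payload "example_pid" "demo-group" "${demo_file}" "2026-04-22T11:28:56Z"
```

The printed JSON can be saved to a file and sent to `/api/v4/origdatablocks` with `curl -d @file`.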
| ### Adding attachments | ||
|
|
||
| Here too, a POST request ingests an attachment for the dataset, for example (replace the placeholders): | ||
|
|
||
| ```bash | ||
| curl -X POST \ | ||
| 'http://localhost:3000/api/v4/attachments' \ | ||
| -H 'accept: application/json' \ | ||
| -H "Authorization: Bearer ${TOKEN}" \ | ||
| -H 'Content-Type: application/json' \ | ||
| -d '{ | ||
| "ownerGroup": "string", | ||
| "accessGroups": [ | ||
| "string" | ||
| ], | ||
| "instrumentGroup": "string", | ||
| "thumbnail": "string", | ||
| "caption": "string", | ||
| "relationships": [ | ||
| { | ||
| "targetId": "string", | ||
| "targetType": "dataset", | ||
| "relationType": "is attached to" | ||
| } | ||
| ], | ||
| "isPublished": true, | ||
| "aid": "string" | ||
| }' | ||
| ``` | ||
|
|
||
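The `thumbnail` field in the payload above is a string. Assuming your instance expects the image as a base64 data URI there (an assumption to verify in Swagger), the value can be generated from an image file as sketched below. The helper name `thumbnail_data_uri` is hypothetical, and the demo file is not a real image.

```bash
#!/bin/sh
# thumbnail_data_uri: print a base64 data URI for a file.
# Hypothetical helper; $1 = file path, $2 = MIME type (defaults to image/png).
# Line breaks in the base64 output are removed so the value fits in one JSON string.
thumbnail_data_uri() {
  printf 'data:%s;base64,%s\n' "${2:-image/png}" "$(base64 < "$1" | tr -d '\n')"
}

# Demo with a two-byte stand-in file instead of a real image.
thumb_file="$(mktemp)"
printf 'hi' > "${thumb_file}"
thumbnail_data_uri "${thumb_file}"
```

The printed string would then replace the `"string"` placeholder of `thumbnail` in the attachment payload.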
|
|
||
| ## Pythonic way: python sdk | ||
| The Python software development kit (SDK) is generated entirely from the backend, based on the OpenAPI and Swagger definitions. Find more information [here](https://www.piwheels.org/project/scicat-sdk-py/). | ||
|
|
||
| ## Pythonic way: pyscicat | ||
| `pyscicat` has the same functionality as the Python SDK but is meant to be more user friendly; it is maintained by [dmcreyno](https://pypi.org/user/dmcreyno/). Some intuitive examples and documentation on how to ingest can be found [here](https://www.scicatproject.org/pyscicat/howto/ingest.html). | ||
|
|
||
| ## Pythonic way: scitacean | ||
| Scitacean is a high-level Python package for downloading and uploading datasets from and to SciCat. | ||
|
|
||
| See the [documentation](https://www.scicatproject.org/scitacean/) for installation and usage instructions. | ||
|
|
||
| If you need help, have a look at our [GitHub discussions](https://github.com/SciCatProject/scitacean). If you can't find an answer there, please start a Q&A discussion. |
should this be in here? It seems to have crept in from another branch?
Actually, this file is not complete. Can it be excluded from this PR? If not, it will require a separate fix unrelated to the ingestion guide!