@sgfost sgfost commented Jan 9, 2025

this relies on changes in #790


part 1 (1-way mirror):

adds a button that allows model submitters to create an auto-updating, read-only git repository archive which is hosted on a central organization

the 'mirror' git repository consists of commits for each release that are tagged and branched off so that metadata can be updated for individual releases without re-writing history

additions

  • library.fs.CodebaseGitRepositoryApi: functionality for building/updating a git repository from a Codebase
  • library.github.GithubApi: provides an interface over PyGithub for interacting with repositories on github
  • library.github.GithubRepoNameValidator: provides validate() to make sure a repo name is valid and unused
  • mirror_codebase() and update_mirrored_codebase() huey tasks which call the CodebaseGitRepositoryApi to build the git repo on the file system and then GithubApi to create/push to the remote
  • update_mirrored_release_metadata() huey task which is triggered when there is an update to codemeta, and updates the corresponding release branch in git after the submission package is rebuilt
  • feature overview page at /github/
  • button + form on the release detail page for mirroring a codebase
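
As a rough illustration of the kind of check `GithubRepoNameValidator.validate()` performs, here is a minimal sketch. It is not the class from this PR: the function name, the exception, and the `existing_names` set (a stand-in for asking the GitHub API whether the name is taken) are all hypothetical, and the character/length rules are an assumption based on GitHub's documented naming restrictions.

```python
import re

# Assumption: repo names may contain ASCII letters, digits, '.', '-' and '_',
# and are at most 100 characters long.
_REPO_NAME_RE = re.compile(r"[A-Za-z0-9._-]{1,100}")


class RepoNameError(ValueError):
    """Raised when a proposed repository name cannot be used."""


def validate_repo_name(name: str, existing_names: set[str]) -> str:
    """Return the name unchanged if it is well-formed and unused.

    `existing_names` is a hypothetical stand-in for a remote lookup of the
    repositories that already exist in the organization.
    """
    if not _REPO_NAME_RE.fullmatch(name):
        raise RepoNameError(f"invalid repository name: {name!r}")
    if name in (".", ".."):
        raise RepoNameError("'.' and '..' are reserved names")
    # compare case-insensitively so 'My-Model' collides with 'my-model'
    if name.lower() in {n.lower() for n in existing_names}:
        raise RepoNameError(f"repository name already in use: {name!r}")
    return name
```

Raising a `ValueError` subclass keeps the check easy to call from a form's validation step, which can turn the message into a field error.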

configuration steps

  1. create an app on the comses-model-library organization with the following permissions:
    • Administration: read and write
    • Contents: read and write
    • Metadata: read only
  2. create a webhook secret and set the webhook url to <HOST>/github-sync-webhook/ (note: the trailing slash is very important for some reason)
  3. subscribe to the Release webhook event
  4. generate private key
  5. install the app on the organization and add the installation id to .env
  6. add the app id, app name, and organization name to .env
  7. add the private key and webhook secret to secrets/
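
The webhook secret from step 2 is what lets the server confirm that a delivery really came from GitHub. A minimal sketch of that check, assuming the standard `X-Hub-Signature-256` header (`sha256=` followed by an HMAC-SHA256 hex digest of the raw request body); the function name is hypothetical and this is not necessarily how the handler in this PR does it:

```python
import hashlib
import hmac


def is_valid_signature(secret: bytes, body: bytes, signature_header: str) -> bool:
    """Check a webhook delivery against the shared webhook secret.

    `signature_header` is the value of the X-Hub-Signature-256 header and
    `body` is the raw, unparsed request body.
    """
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    # constant-time comparison to avoid leaking how many characters matched
    return hmac.compare_digest(expected.encode(), signature_header.encode())
```

The digest has to be computed over the raw bytes of the request, not over re-serialized JSON, or valid deliveries will be rejected.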

sgfost force-pushed the feat/git-mirroring branch from 73bee28 to b25ba7b on January 9, 2025 21:58
sgfost changed the title from "git mirroring" to "github integration - mirroring" on Jan 10, 2025
sgfost force-pushed the feat/git-mirroring branch from b25ba7b to a64a05e on January 10, 2025 20:02
https://huey.readthedocs.io/

the huey consumer runs as a runit daemon in the server service

the default dev mode behavior of immediate mode (tasks run
synchronously) is currently disabled with immediate=False for testing purposes
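
For reference, a sketch of what the relevant settings fragment could look like. The keys follow huey's Django integration; the backend class and queue name are assumptions, not values taken from this PR.

```python
# settings fragment (sketch, not the project's actual configuration)
HUEY = {
    "huey_class": "huey.RedisHuey",  # assumption: redis-backed task queue
    "name": "comses",  # hypothetical queue name
    # False: tasks are enqueued and picked up by the consumer daemon,
    # even in dev, instead of running inline in the calling process
    "immediate": False,
}
```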
adding these manually was an easily forgotten step that wouldn't be
noticed in dev but would fail to build in prod
codemeta_snapshot will be used to keep a codemeta data structure updated
along with changes to metadata, which makes it easier to watch for
changes and speeds up access

license text is created from a template for each release and included in
the fs package as LICENSE file

ref comses/planning#234
and replace redis caching of codebase all_contributor lists with
querysets (did save a few queries but doesn't seem to have any
meaningful performance impact)

codebases were considering any citable release contributor as an author
and releases considered anyone with a role of "author" to be an author.
Now we use a union of the two -- not sure if this is the best way but
regardless, it's easier to change since it all stems from authors()
and nonauthors() on the ReleaseContributorQuerySet

codebase and release both now have 'nonauthor contributor' accessors,
which is useful because this is what things like codemeta/datacite/etc.
consider 'contributors'
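
A plain-Python sketch of the union rule described above, without the Django queryset machinery. The dataclass and its field names (`roles`, `include_in_citation`) are assumptions for illustration; only the rule itself (citable OR has the "author" role) comes from the commit message.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReleaseContributor:
    name: str
    roles: tuple[str, ...]  # hypothetical field
    include_in_citation: bool  # hypothetical field


def is_author(rc: ReleaseContributor) -> bool:
    # union of the two older definitions: citable, or holding the "author" role
    return rc.include_in_citation or "author" in rc.roles


def authors(contributors: list[ReleaseContributor]) -> list[ReleaseContributor]:
    return [rc for rc in contributors if is_author(rc)]


def nonauthors(contributors: list[ReleaseContributor]) -> list[ReleaseContributor]:
    # the complement: what codemeta/datacite would list as 'contributors'
    return [rc for rc in contributors if not is_author(rc)]
```

Keeping the rule in one predicate mirrors the point made above: if the definition of an author changes, only one place needs to change.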
move metadata conversion to a metadata module which provides converters
for different formats. codemeta is used as the primary format which the
others (datacite, cff) can be derived from

the primary codemeta accessor is the codemeta_snapshot json field, which
is rebuilt each time a codebase/release is saved
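
To illustrate the "derive other formats from codemeta" idea, here is a minimal sketch of a codemeta to CFF conversion over plain dicts. It only handles a handful of fields and is not the converter in the metadata module; the function name and the citation message text are assumptions.

```python
def codemeta_to_cff(codemeta: dict) -> dict:
    """Derive a minimal CITATION.cff-style mapping from a codemeta dict."""
    authors = []
    for person in codemeta.get("author", []):
        entry = {}
        if "givenName" in person:
            entry["given-names"] = person["givenName"]
        if "familyName" in person:
            entry["family-names"] = person["familyName"]
        if entry:
            authors.append(entry)

    cff = {
        "cff-version": "1.2.0",
        "message": "If you use this software, please cite it using these metadata.",
        "title": codemeta["name"],
        "authors": authors,
    }
    # optional fields are only copied over when codemeta has them
    if "version" in codemeta:
        cff["version"] = codemeta["version"]
    if "datePublished" in codemeta:
        cff["date-released"] = codemeta["datePublished"]
    return cff
```

Because the converter is a pure function of the codemeta dict, it can run against the stored `codemeta_snapshot` without touching other models.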

* add `update_codebase_metadata` command to update the codemeta snapshot
  for all objects, then update packages on the fs
* add CITATION.cff file to fs package
usage of the old datacite metadata generation still needs to be replaced
* visually indicate that the release metadata form is saving, since this
  takes a little bit longer now
sgfost force-pushed the refactor/metadata-generation branch from 17f1895 to 9cbc597 on January 14, 2025 00:23
sgfost force-pushed the feat/git-mirroring branch from a64a05e to 81edfd5 on January 14, 2025 02:23
and resolve some edge case bugs with metadata generation.

test_codemeta was primarily checking to make sure that codemeta was
conforming to the expected schema, and this is implicit now

we may still want some test module that uses hypothesis, but it would be
even more useful to do this at a higher level e.g. create a bunch of
codebase+releases and see if anything goes wrong downstream
this API is responsible for managing a local git repository mirror for a
comses codebase. PUBLIC release archives are commits/tags in the
history. Release branches are created for each release and only added to
if there is an update to metadata

`build()` and `append_releases()` are the two main API methods which
construct (or rebuild) a git repo and add new releases to the repo,
respectively

`update_release_branch()` will add a new commit containing changes to a
release branch (and update main if they point to the same thing). This
will mainly be used for updating metadata
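
The branch/tag bookkeeping described above can be shown with a toy in-memory model. This is not the `CodebaseGitRepositoryApi` and does not call git; the class, the `release/<version>` branch naming, and the integer commit ids are all hypothetical. It only demonstrates the rule that a metadata update adds a commit to the release branch, leaves the tag alone, and moves main only when main pointed at the same commit.

```python
from __future__ import annotations


class ToyRepo:
    """In-memory stand-in for a git repo: commits plus branch and tag refs."""

    def __init__(self) -> None:
        self.commits: list[tuple[int | None, str]] = []  # (parent id, message)
        self.branches: dict[str, int] = {}
        self.tags: dict[str, int] = {}

    def _commit(self, parent: int | None, message: str) -> int:
        self.commits.append((parent, message))
        return len(self.commits) - 1

    def append_release(self, version: str) -> None:
        """Commit a release on main, tag it, and branch off a release branch."""
        head = self._commit(self.branches.get("main"), f"release {version}")
        self.branches["main"] = head
        self.tags[version] = head
        self.branches[f"release/{version}"] = head

    def update_release_branch(self, version: str, message: str) -> None:
        """Add a commit to a release branch without rewriting history."""
        branch = f"release/{version}"
        old_head = self.branches[branch]
        new_head = self._commit(old_head, message)
        # main follows only if it pointed at the same commit as the branch
        if self.branches["main"] == old_head:
            self.branches["main"] = new_head
        self.branches[branch] = new_head
```

Updating an older release leaves main and every tag untouched, which is what allows metadata fixes on individual releases without rewriting history.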
the GithubApi provides access to auth and repository actions

adds 3 huey (async) tasks for creating a mirror, updating a mirror, and
updating metadata for a single release of a mirror
* /github page to describe the integration features
* sidebar element on release detail page will show information about
  integration status for that codebase, and allow users with edit
  permissions to create a new mirror
sgfost force-pushed the feat/git-mirroring branch 2 times, most recently from a92874c to 8c42b35 on January 23, 2025 00:10
sgfost referenced this pull request Jan 23, 2025
- include in .env PATH and make docker-compose.yml target depend on .env
- replace deprecated usage of self.assertEquals
https://docs.python.org/3/whatsnew/3.12.html#id3
sgfost force-pushed the feat/git-mirroring branch from 8c42b35 to baed067 on January 25, 2025 01:42
* use installation access tokens for user repos instead of user access
  tokens. this is a more secure workflow
* add GithubIntegrationAppInstallation model for recording app
  installations (this will need to be created/updated using webhooks)
* CodebaseGitMirror/"mirror" now refers to the local git repository
* a CodebaseGitMirror can have multiple CodebaseGitRemotes, which keep
  track of all the information needed to push to/archive from remote
  repositories

TODO: re-implement views
sgfost force-pushed the feat/git-mirroring branch from baed067 to d3432cb on January 25, 2025 01:43
sgfost changed the title from "github integration - mirroring" to "git(hub) integration feature" on Jan 25, 2025
The distinguishing feature is whether the release has a non-null
external_release_package

This will be used to 'archive' or pull in releases made on github for
synced repos

currently, the release assets/package is not stored on the filesystem,
instead relying on an external download url, and being only concerned
with metadata
App installation tokens did not give access to get/create repos on a
user's account. Still trying to avoid storing user access tokens
(https://docs.github.com/en/apps/creating-github-apps/authenticating-with-a-github-app/generating-a-user-access-token-for-a-github-app)
so as a workaround, we will direct the submitter to create a new bare
repo before continuing

* add handler for webhook events for the github sync app
* add form/wizard for linking a pre-existing github repo (archive only)
allows setting up a push/archive sync, where the submitter provides an
empty repo that the generated git repo is automatically pushed to, as
well as an archive-only sync, where the submitter provides any github repo

in both cases, the submitter needs to:
- link their github account with the regular oauth flow
  (so we have a way to match users with a github account)
- install the provided 'GitHub Sync' app on their github account with
  access to any repository that will be synced
and squash migrations

"import(ing)" is the wording I keep finding describes the process the
best

other potential names and their issues:
* publish - same name as the direct publishing, releases need to be
  manually published after they are imported
* pull - git command that is not used
* fetch - git command that is not used
* ingest - ok and similar to import but not quite as clear
sgfost force-pushed the feat/git-mirroring branch from b4786f6 to 8c05280 on February 13, 2025 17:58
* re-order and clarify the steps to set up a sync (app installation
  takes place after creating a repo so that permissions can be
  restricted)
* fix push log to actually show useful information
* when toggling push back on, do a build/update + push on the spot
* better error/success messages
sgfost force-pushed the feat/git-mirroring branch from ded78e4 to f3b298d on February 22, 2025 01:49
sgfost added 4 commits March 5, 2025 15:15
refactors the `CodebaseReleaseFsApi` to inherit from an abstract
`BaseCodebaseReleaseFsApi`, along with the new
`ImportedCodebaseReleaseFsApi`

the main function of the imported release fs api is to import releases
from a remote source by downloading an archive to originals/, extracting
to sip/data/ and then using the inherited functionality to build
archives for review/publishing. It does not implement methods for
dealing with files directly outside of this importing workflow like
add(), delete(), rebuild(), etc.
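
A minimal sketch of the import workflow described above, using only the standard library. It is not the `ImportedCodebaseReleaseFsApi`: the function name is hypothetical, a local file copy stands in for the download step, and only zip archives are handled. The `originals/` and `sip/data/` locations come from the description.

```python
import shutil
import zipfile
from pathlib import Path


def import_release_archive(archive: Path, release_dir: Path) -> list[str]:
    """Store an archive under originals/ and extract it into sip/data/.

    Returns the extracted file paths relative to sip/data/, sorted.
    """
    originals = release_dir / "originals"
    sip_data = release_dir / "sip" / "data"
    originals.mkdir(parents=True, exist_ok=True)
    sip_data.mkdir(parents=True, exist_ok=True)

    # keep the untouched archive alongside the extracted contents
    stored = originals / archive.name
    shutil.copy2(archive, stored)  # stand-in for downloading from the remote

    with zipfile.ZipFile(stored) as zf:
        zf.extractall(sip_data)

    return sorted(
        p.relative_to(sip_data).as_posix() for p in sip_data.rglob("*") if p.is_file()
    )
```

Keeping the original archive separate from the extracted tree means the extraction can be redone later without fetching the archive again.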

the imported release fs api manages a manifest for keeping track of file
categories. Using this for all releases is potentially something to do
in the future but is a rather complicated refactor that had to be dialed
back here
* webhook handler watches for github release events and creates a db
  record and delegates to the FS api to download and extract a release
  archive into the library FS
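
A sketch of the first half of that handler: deciding whether a delivery is a release event worth importing and pulling out the fields needed to create a record and download the archive. The payload keys used (`action`, `release.tag_name`, `release.zipball_url`, `release.draft`, `release.prerelease`, `repository.full_name`) are part of GitHub's release webhook payload; the dataclass, the function name, and the choice to act only on the `released` action are assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ReleaseEvent:
    repo_full_name: str
    tag_name: str
    archive_url: str


def parse_release_event(event_name: str, payload: dict) -> ReleaseEvent | None:
    """Return the details of a newly released version, or None to ignore.

    `event_name` is the value of the X-GitHub-Event header.
    """
    if event_name != "release" or payload.get("action") != "released":
        return None
    release = payload["release"]
    # assumption: drafts and pre-releases are not imported
    if release.get("draft") or release.get("prerelease"):
        return None
    return ReleaseEvent(
        repo_full_name=payload["repository"]["full_name"],
        tag_name=release["tag_name"],
        archive_url=release["zipball_url"],
    )
```

Returning `None` for everything else lets the view answer uninteresting deliveries with a plain 200 without touching the database.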

TODO:
- tailor the release editor UI for imported releases, including a
  way to categorize files
- extract initial metadata from the github release archive
including file categorization with an improved file tree
currently pulls out: license, release notes, languages, platforms, os
from: github repo/release data, any found codemeta file, any found
      CITATION.cff file

sgfost commented Mar 21, 2025

I believe all of the main functionality for the 2-way sync is now implemented and works for simple cases. Likely not bulletproof nor polished so I now need to do lots and lots of testing

sgfost force-pushed the feat/git-mirroring branch 3 times, most recently from a4e039d to 4b4dd19 on March 27, 2025 04:13
and fix some typescript complaints
sgfost force-pushed the feat/git-mirroring branch from 4b4dd19 to d00c9b3 on March 27, 2025 05:02
alee force-pushed the refactor/metadata-generation branch from 2254cdb to a3d2e14 on April 1, 2025 21:12
sgfost changed the base branch from refactor/metadata-generation to main on May 23, 2025 23:46
sgfost changed the base branch from main to refactor/metadata-generation on May 23, 2025 23:57

sgfost commented May 24, 2025

replaced by #815

sgfost closed this on May 24, 2025