Commit e03e162

Implement per-resource last_update timestamps
Closes josegonzalez#62
1 parent e1d8c33 commit e03e162

4 files changed

Lines changed: 348 additions & 25 deletions

CHANGES.rst

Lines changed: 5 additions & 0 deletions
@@ -7,6 +7,11 @@ Unreleased
   optional attachment downloads, and per-repository incremental checkpoints.
 - Add pull request review backups with ``--pull-reviews`` and one-time
   incremental backfill for existing backups.
+- Store incremental ``last_update`` checkpoints per repository resource instead
+  of using one global checkpoint for the whole output directory. Existing
+  backups use the legacy global checkpoint as a migration fallback, and the
+  legacy file is removed once existing issue/pull backups have resource
+  checkpoints (#62).
 - Add ``--token-from-gh`` to read authentication from ``gh auth token``.
 

README.rst

Lines changed: 8 additions & 4 deletions
@@ -347,15 +347,19 @@ About pull request reviews
 
 Use ``--pull-reviews`` with ``--pulls`` to include GitHub pull request review metadata under each pull request's ``review_data`` key. Reviews are separate from review comments: ``--pull-comments`` backs up inline review comments via ``comment_data`` and regular PR conversation comments via ``comment_regular_data``, while ``--pull-reviews`` backs up review state, submitted time, commit ID, and the top-level review body.
 
-``--pull-reviews`` is included in ``--all``. Incremental backups use a per-repository checkpoint at ``repositories/{repo}/pulls/reviews_last_update``. If ``--pull-reviews`` is enabled on an existing incremental backup, the first run performs a one-time backfill for pull request reviews so older PRs are not skipped by the existing repository checkpoint. Existing ``comment_data``, ``comment_regular_data`` and ``commit_data`` fields are preserved when only review data is being added.
+``--pull-reviews`` is included in ``--all``. Incremental backups use a per-repository checkpoint at ``repositories/{repo}/pulls/reviews_last_update``. If ``--pull-reviews`` is enabled on an existing incremental backup, the first run performs a one-time backfill for pull request reviews so older PRs are not skipped by the existing pull request checkpoint. Existing ``comment_data``, ``comment_regular_data`` and ``commit_data`` fields are preserved when only review data is being added.
 
 
 Incremental Backup
 ------------------
 
-Using (``-i, --incremental``) will only request new data from the API **since the last run (successful or not)**. e.g. only request issues from the API since the last run.
+Using (``-i, --incremental``) will only request new data from the API **since the last successful resource backup**. e.g. only request issues from the API since the last issue backup for that repository.
 
-This means any blocking errors on previous runs can cause a large amount of missing data in backups.
+Incremental checkpoints for issue and pull request API backups are stored per resource in that repository's backup directory (for example ``repositories/{repo}/issues/last_update``, ``repositories/{repo}/pulls/last_update`` or ``starred/{owner}/{repo}/pulls/last_update``). Older versions stored a single global ``last_update`` file in the output directory root. During migration, the legacy global checkpoint is used as a fallback only for resource directories that already contain backup data but do not yet have their own checkpoint. New repositories or newly enabled resources with no existing data get a full backup instead of inheriting an unrelated global checkpoint.
+
+After all existing issue and pull request resource directories have per-resource checkpoints, the legacy global ``last_update`` file is removed automatically.
+
+This means any blocking errors on previous runs can cause missing data in backups for the affected repository resource.
 
 Using (``--incremental-by-files``) will request new data from the API **based on when the file was modified on filesystem**. e.g. if you modify the file yourself you may miss something.

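The fallback order described in the README text of this hunk can be sketched as follows. This is a minimal illustration under assumptions, not the code from this commit: `resolve_since` and `demo` are hypothetical names, and the "has data" check is simplified to "any entry other than the checkpoint file".

```python
import os
import tempfile

CHECKPOINT_NAME = "last_update"  # filename taken from the README text


def resolve_since(resource_dir, legacy_checkpoint=None):
    """Hypothetical helper: pick the `since` value for one resource directory."""
    checkpoint_path = os.path.join(resource_dir, CHECKPOINT_NAME)
    # 1. A per-resource checkpoint always wins.
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as handle:
            return handle.read().strip()

    # 2. The legacy global value applies only to directories that already
    #    hold backup data (simplified check, an assumption of this sketch).
    has_data = os.path.isdir(resource_dir) and any(
        name != CHECKPOINT_NAME for name in os.listdir(resource_dir)
    )
    if legacy_checkpoint and has_data:
        return legacy_checkpoint

    # 3. Otherwise there is no checkpoint: the caller does a full backup.
    return None


def demo():
    with tempfile.TemporaryDirectory() as root:
        issues = os.path.join(root, "repositories", "demo", "issues")
        pulls = os.path.join(root, "repositories", "demo", "pulls")
        os.makedirs(issues)
        with open(os.path.join(issues, "1.json"), "w") as handle:
            handle.write("{}")

        legacy = "2024-01-01T00:00:00Z"
        migrated = resolve_since(issues, legacy)  # existing data, no checkpoint
        fresh = resolve_since(pulls, legacy)      # newly enabled resource

        with open(os.path.join(issues, CHECKPOINT_NAME), "w") as handle:
            handle.write("2024-06-01T00:00:00Z\n")
        own = resolve_since(issues, legacy)       # own checkpoint now present
        return migrated, fresh, own
```

In this sketch the newly enabled `pulls` directory returns `None`, which corresponds to the full backup the README describes for resources with no existing data.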
@@ -368,7 +372,7 @@ Known blocking errors
 
 Some errors will block the backup run by exiting the script. e.g. receiving a 403 Forbidden error from the Github API.
 
-If the incremental argument is used, this will result in the next backup only requesting API data since the last blocked/failed run. Potentially causing unexpected large amounts of missing data.
+If the incremental argument is used, per-resource checkpoints are only advanced after that resource's backup work completes. A blocking error can still abort the overall run, but repositories and resources that were not processed will keep their previous checkpoints.
 
 It's therefore recommended to only use the incremental argument if the output/result is being actively monitored, or complimented with periodic full non-incremental runs, to avoid unexpected missing data in a regular backup runs.

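The "advance only after the work completes" behaviour described above can be illustrated with a small sketch. It assumes a hypothetical wrapper `run_resource_backup` and a stand-in `BlockingError`; neither name comes from the project.

```python
import os
import tempfile


class BlockingError(Exception):
    """Stand-in for an error that aborts the run (for example an HTTP 403)."""


def run_resource_backup(resource_dir, do_backup, checkpoint_time):
    """Hypothetical wrapper: write the checkpoint only if do_backup() returns."""
    os.makedirs(resource_dir, exist_ok=True)
    do_backup()  # an exception raised here skips the write below
    with open(os.path.join(resource_dir, "last_update"), "w") as handle:
        handle.write(checkpoint_time)


def read_checkpoint(resource_dir):
    with open(os.path.join(resource_dir, "last_update")) as handle:
        return handle.read().strip()


def failing_backup():
    raise BlockingError("403 Forbidden")


def demo():
    with tempfile.TemporaryDirectory() as root:
        issues = os.path.join(root, "repositories", "demo", "issues")

        # First run succeeds and records a checkpoint.
        run_resource_backup(issues, lambda: None, "2024-01-01T00:00:00Z")

        # Second run is blocked: the checkpoint must stay where it was.
        try:
            run_resource_backup(issues, failing_backup, "2024-02-01T00:00:00Z")
        except BlockingError:
            pass
        return read_checkpoint(issues)
```

Because the write happens after the backup call, a blocked run leaves the earlier checkpoint in place and the next incremental run requests the same window again.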
github_backup/github_backup.py

Lines changed: 146 additions & 21 deletions
@@ -1920,26 +1920,138 @@ def filter_repositories(args, unfiltered_repositories):
     return repositories
 
 
+INCREMENTAL_LAST_UPDATE_FILENAME = "last_update"
+INCREMENTAL_RESOURCE_DIRECTORIES = ("issues", "pulls")
+
+
+def get_repository_checkpoint_time(repository):
+    timestamps = [
+        timestamp
+        for timestamp in (repository.get("updated_at"), repository.get("pushed_at"))
+        if timestamp
+    ]
+    if timestamps:
+        return max(timestamps)
+
+    return time.strftime("%Y-%m-%dT%H:%M:%SZ", time.localtime())
+
+
+def resource_backup_exists(resource_cwd):
+    if not os.path.isdir(resource_cwd):
+        return False
+
+    ignored_names = {
+        INCREMENTAL_LAST_UPDATE_FILENAME,
+        PULL_REVIEWS_LAST_UPDATE_FILENAME,
+    }
+    for name in os.listdir(resource_cwd):
+        if name in ignored_names or name.endswith(".temp"):
+            continue
+        return True
+
+    return False
+
+
+def read_legacy_last_update(args, output_directory):
+    if not args.incremental:
+        return None, None
+
+    last_update_path = os.path.join(output_directory, INCREMENTAL_LAST_UPDATE_FILENAME)
+    if os.path.exists(last_update_path):
+        return last_update_path, open(last_update_path).read().strip()
+
+    return last_update_path, None
+
+
+def read_resource_last_update(args, resource_cwd, legacy_last_update=None):
+    if not args.incremental:
+        return None
+
+    last_update_path = os.path.join(resource_cwd, INCREMENTAL_LAST_UPDATE_FILENAME)
+    if os.path.exists(last_update_path):
+        return open(last_update_path).read().strip()
+
+    if legacy_last_update and resource_backup_exists(resource_cwd):
+        return legacy_last_update
+
+    return None
+
+
+def write_resource_last_update(args, resource_cwd, repository):
+    if not args.incremental:
+        return
+
+    mkdir_p(resource_cwd)
+    last_update_path = os.path.join(resource_cwd, INCREMENTAL_LAST_UPDATE_FILENAME)
+    open(last_update_path, "w").write(get_repository_checkpoint_time(repository))
+
+
+def iter_incremental_resource_dirs(output_directory):
+    repositories_dir = os.path.join(output_directory, "repositories")
+    if os.path.isdir(repositories_dir):
+        for repository_name in os.listdir(repositories_dir):
+            repo_cwd = os.path.join(repositories_dir, repository_name)
+            if not os.path.isdir(repo_cwd):
+                continue
+            for resource_name in INCREMENTAL_RESOURCE_DIRECTORIES:
+                yield os.path.join(repo_cwd, resource_name)
+
+    starred_dir = os.path.join(output_directory, "starred")
+    if os.path.isdir(starred_dir):
+        for owner_name in os.listdir(starred_dir):
+            owner_cwd = os.path.join(starred_dir, owner_name)
+            if not os.path.isdir(owner_cwd):
+                continue
+            for repository_name in os.listdir(owner_cwd):
+                repo_cwd = os.path.join(owner_cwd, repository_name)
+                if not os.path.isdir(repo_cwd):
+                    continue
+                for resource_name in INCREMENTAL_RESOURCE_DIRECTORIES:
+                    yield os.path.join(repo_cwd, resource_name)
+
+
+def has_unmigrated_incremental_resources(output_directory):
+    for resource_cwd in iter_incremental_resource_dirs(output_directory):
+        last_update_path = os.path.join(
+            resource_cwd, INCREMENTAL_LAST_UPDATE_FILENAME
+        )
+        if resource_backup_exists(resource_cwd) and not os.path.exists(
+            last_update_path
+        ):
+            return True
+
+    return False
+
+
+def remove_legacy_last_update_if_migrated(
+    args, output_directory, legacy_last_update_path
+):
+    if not args.incremental or not legacy_last_update_path:
+        return
+    if not os.path.exists(legacy_last_update_path):
+        return
+    if has_unmigrated_incremental_resources(output_directory):
+        logger.info(
+            "Keeping legacy global last_update until all existing issue/pull "
+            "backups have per-resource checkpoints"
+        )
+        return
+
+    os.remove(legacy_last_update_path)
+    logger.info(
+        "Removed legacy global last_update after migrating incremental checkpoints"
+    )
+
+
 def backup_repositories(args, output_directory, repositories):
     logger.info("Backing up repositories")
     repos_template = "https://{0}/repos".format(get_github_api_host(args))
+    legacy_last_update_path, legacy_last_update = read_legacy_last_update(
+        args, output_directory
+    )
+    incremental_resource_work_attempted = False
 
-    if args.incremental:
-        last_update_path = os.path.join(output_directory, "last_update")
-        if os.path.exists(last_update_path):
-            args.since = open(last_update_path).read().strip()
-        else:
-            args.since = None
-    else:
-        args.since = None
-
-    last_update = "0000-00-00T00:00:00Z"
     for repository in repositories:
-        if repository.get("updated_at") and repository["updated_at"] > last_update:
-            last_update = repository["updated_at"]
-        elif repository.get("pushed_at") and repository["pushed_at"] > last_update:
-            last_update = repository["pushed_at"]
-
         if repository.get("is_gist"):
             repo_cwd = os.path.join(output_directory, "gists", repository["id"])
         elif repository.get("is_starred"):
@@ -2002,18 +2114,32 @@ def backup_repositories(args, output_directory, repositories):
                     no_prune=args.no_prune,
                 )
             if args.include_issues or args.include_everything:
+                incremental_resource_work_attempted = True
+                issue_cwd = os.path.join(repo_cwd, "issues")
+                args.since = read_resource_last_update(
+                    args, issue_cwd, legacy_last_update
+                )
                 backup_issues(args, repo_cwd, repository, repos_template)
+                write_resource_last_update(args, issue_cwd, repository)
 
             if args.include_pulls or args.include_everything:
+                incremental_resource_work_attempted = True
+                pulls_cwd = os.path.join(repo_cwd, "pulls")
+                args.since = read_resource_last_update(
+                    args, pulls_cwd, legacy_last_update
+                )
                 backup_pulls(args, repo_cwd, repository, repos_template)
+                write_resource_last_update(args, pulls_cwd, repository)
 
             if args.include_discussions or args.include_everything:
                 backup_discussions(args, repo_cwd, repository)
 
             if args.include_milestones or args.include_everything:
                 backup_milestones(args, repo_cwd, repository, repos_template)
 
-            if args.include_security_advisories or (args.include_everything and not repository.get("private", False)):
+            if args.include_security_advisories or (
+                args.include_everything and not repository.get("private", False)
+            ):
                 backup_security_advisories(args, repo_cwd, repository, repos_template)
 
             if args.include_labels or args.include_everything:
@@ -2037,11 +2163,10 @@ def backup_repositories(args, output_directory, repositories):
             logger.info(f"Skipping remaining resources for {repository['full_name']}")
             continue
 
-    if args.incremental:
-        if last_update == "0000-00-00T00:00:00Z":
-            last_update = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.localtime())
-
-        open(last_update_path, "w").write(last_update)
+    if incremental_resource_work_attempted:
+        remove_legacy_last_update_if_migrated(
+            args, output_directory, legacy_last_update_path
+        )
 
 
 DISCUSSION_PAGE_SIZE = 100

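A review note on `get_repository_checkpoint_time` in the diff above: its fallback formats `time.localtime()` with a literal `Z` suffix, which would label local time as UTC on hosts not running in UTC (the removed code did the same). The sketch below shows one possible UTC-based variant, offered as a suggestion rather than as part of this commit; `checkpoint_time` and its `now` parameter are hypothetical names used only for illustration.

```python
import time


def checkpoint_time(repository, now=None):
    """Hypothetical variant of the checkpoint helper with a UTC fallback."""
    timestamps = [
        value
        for value in (repository.get("updated_at"), repository.get("pushed_at"))
        if value
    ]
    if timestamps:
        # Fixed-width ISO 8601 UTC strings sort chronologically as plain text,
        # so max() on the strings picks the later timestamp.
        return max(timestamps)

    # time.gmtime(None) means "now"; passing seconds makes the sketch testable.
    return time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(now))


later = checkpoint_time(
    {"updated_at": "2024-01-02T00:00:00Z", "pushed_at": "2024-03-01T00:00:00Z"}
)
epoch = checkpoint_time({}, now=0)
```

The string comparison relies on both timestamps using the same fixed-width format, which this sketch assumes for the `updated_at` and `pushed_at` values.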