Define the archive format #9

@killercup

Let's define the format of our archives.

Current state

A binary file that is actually just concatenated gzip blobs.

Features:

  1. Extract gzip files
  2. Append is trivial
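To make the current state concrete, here is a minimal Python sketch of "concatenated gzip blobs plus an index". It is not the project's actual code: `append_blob`, `read_blob`, the file name `site.archive`, and the in-memory `index` dict are all hypothetical names chosen for illustration.

```python
import gzip
import os
import tempfile


def append_blob(archive_path, data):
    """Append `data` as one gzip blob; return its (offset, length) in the archive."""
    blob = gzip.compress(data)
    # "ab" creates the file if needed and always writes at the end.
    with open(archive_path, "ab") as archive:
        offset = archive.seek(0, os.SEEK_END)
        archive.write(blob)
    return offset, len(blob)


def read_blob(archive_path, offset, length):
    """Return the still-compressed bytes of one blob, ready to serve as gzip."""
    with open(archive_path, "rb") as archive:
        archive.seek(offset)
        return archive.read(length)


# Hypothetical usage: the index maps a file name to its slice of the archive.
path = os.path.join(tempfile.mkdtemp(), "site.archive")
index = {
    "index.html": append_blob(path, b"<h1>hello</h1>"),
    "style.css": append_blob(path, b"body { margin: 0 }"),
}
```

Appending never touches existing bytes, so earlier index entries stay valid; only the index itself needs a new entry.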

Prior art

  • WARC. An implementation seems to live here.
    • I have never used this, but someone pointed it out on Twitter.
    • It's a long spec
    • Not sure append is possible
  • .tar.gz files
    • It's well-known
    • It's from the 70s with all the 'features' that come with it
    • Append?

What I learned: GZIP members

While reading the WARC spec I found this interesting section:

As specified in 2.2 of the GZIP specification (see [RFC 1952]), a valid GZIP file consists of any number of GZIP “members”, each independently compressed.

Where possible, this property should be exploited to compress each record of a WARC file independently. This results in a valid GZIP file whose per-record subranges also stand alone as valid GZIP files.

External indexes of WARC file content may then be used to record each record’s starting position in the GZIP file, allowing for random access of individual records without requiring decompression of all preceding records.

I did not know this about gzip! If I'm reading this correctly, it means that we can, in theory, use files compatible with tar (or WARC), with the additional requirement that each file is a new GZIP member (so that the slices our index file points to remain valid gzip files we can serve).
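A quick way to check the member property is a few lines of Python using only the standard `gzip` module. The record contents are made up for the example.

```python
import gzip

# Two independently compressed GZIP members.
first = gzip.compress(b"first record\n")
second = gzip.compress(b"second record\n")

# Their plain concatenation is itself a valid GZIP file...
combined = first + second
whole = gzip.decompress(combined)

# ...and the byte range of a single member still stands alone.
start = len(first)
only_second = gzip.decompress(combined[start:start + len(second)])
```

Here `whole` holds both records back to back, while `only_second` holds just the second one, without decompressing the first.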

Options

  • continue to use a custom archive format, but specify it, and maybe add some stuff to ensure forward compatibility
  • use tar, and find a way to ensure gzip members are used
    • research how to do append, or skip appending as an update strategy altogether
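As a rough sketch of the tar option, the Python below writes each tar header, each file body, and any block padding as separate GZIP members, so the index can point at a member that contains only the file body. This is one possible layout under my own assumptions, not an established convention: `add_entry`, `BLOCK`, and the `index` dict are hypothetical, and it has only been thought through against Python's `tarfile`, not other tar tools.

```python
import gzip
import io
import tarfile

BLOCK = 512  # tar works in 512-byte blocks


def add_entry(archive, index, name, data):
    """Append one file to `archive` (a bytearray) as separate GZIP members."""
    info = tarfile.TarInfo(name)
    info.size = len(data)
    # Member 1: the 512-byte tar header for this file.
    archive.extend(gzip.compress(info.tobuf(format=tarfile.USTAR_FORMAT)))
    # Member 2: the file body alone; this is the slice the index records.
    body = gzip.compress(data)
    index[name] = (len(archive), len(body))
    archive.extend(body)
    # Member 3: zero padding up to the next block boundary, if any is needed.
    padding = -len(data) % BLOCK
    if padding:
        archive.extend(gzip.compress(b"\0" * padding))


archive, index = bytearray(), {}
add_entry(archive, index, "index.html", b"<h1>hello</h1>")
add_entry(archive, index, "style.css", b"body { margin: 0 }")

# Two zero blocks mark the end of a tar archive; kept as its own member here.
finished = bytes(archive) + gzip.compress(b"\0" * 2 * BLOCK)

# The whole thing reads as an ordinary .tar.gz ...
with tarfile.open(fileobj=io.BytesIO(finished), mode="r:gz") as tar:
    names = tar.getnames()
    css = tar.extractfile("style.css").read()

# ... while an index slice is a standalone gzip file of just one body.
offset, length = index["index.html"]
served = gzip.decompress(finished[offset:offset + length])
```

Because the end-of-archive marker is a separate trailing member in this sketch, appending would mean adding new entries in front of it (or overwriting it and writing it again). Whether that is acceptable as an update strategy is still the open question above.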

cc @QuietMisdreavus
