What happened? Was there any interest on the OCR side? #5
Description
I'm curious as to what happened with this. It looks like you did an excellent job on the architecture and documentation.
Update
I got the main Tracks database down to 800 KB (270 KB gzipped) using CSV with these columns:
ID Title Tags Posted Length Bitrate FileName FileSize YouTube
I think I might do another 2 like that, for Games + Songs and for Artists / Composers, and still keep the total data under 1 MB. I'd also provide a small snippet of JavaScript that fetches the 3 files and links them by numerical ID on the client side, along with hard-coded numeric ids for Tags and the list of Mirrors.
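A rough sketch of the "link by numerical ID" step, written in Python only to show the logic (the real thing would be client-side JavaScript). Only the Tracks column names come from the header above; the GameID column, the Games layout, the space-separated tag ids, and every value in the sample data are made-up assumptions.

```python
import csv
import io

# Hypothetical hard-coded tag table; the real ids and names are not given above.
TAGS = {1: "piano", 2: "orchestral", 3: "electronic"}

# Tiny stand-ins for two of the CSV files. The GameID column and the
# Games layout are assumptions made for this example.
TRACKS_CSV = """ID,Title,Tags,GameID
1,Example Remix A,1 3,10
2,Example Remix B,2,11
"""
GAMES_CSV = """ID,Name
10,Example Game X
11,Example Game Y
"""


def load_by_id(text):
    """Parse CSV text into a dict keyed by the integer ID column."""
    return {int(row["ID"]): row for row in csv.DictReader(io.StringIO(text))}


def link_tracks(tracks_text, games_text, tags):
    """Resolve each track's numeric tag and game ids into names."""
    games = load_by_id(games_text)
    linked = []
    for track_id, row in load_by_id(tracks_text).items():
        tag_ids = [int(t) for t in row["Tags"].split()]
        linked.append({
            "id": track_id,
            "title": row["Title"],
            "tags": [tags[t] for t in tag_ids],
            "game": games[int(row["GameID"])]["Name"],
        })
    return linked


LINKED = link_tracks(TRACKS_CSV, GAMES_CSV, TAGS)
```

Keeping the join on the client means each file stays a flat table of ids, which is what keeps the total download small.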
Implementation Ideas / Notes
(for people who end up on this repo like I did)
Since there are fewer than 10,000 remixes + albums, and it's highly unlikely that there will be 100,000 items within our lifetimes, it would probably be simpler to:
- ship the entire database as a single JSON file for GET / filter / etc with per-IP rate limit (to encourage proper caching)
  - could be brute-force optimally precompressed w/ gzip, zstd, and brotli
  - could also use single-digit ids for relationships (like a typical db rather than a typical api, though I don't like this idea as much)
- possibly use permanent, browser-level caching for ID ranges e.g. 1-2000, 2001-4000, and dynamic for newer IDs
- could also use multiple CSVs rather than a single JSON... maybe
- POST by ID for atomic updates
- have a GET that only hands back updates since a last_updated_at parameter
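Two of the ideas above are small enough to sketch: brute-force precompression and the "updates since" filter. This is only an illustration under assumptions. It covers gzip alone because that is in the Python standard library (zstd and brotli would need extra libraries), and the `updated_at` field stored as Unix seconds is a hypothetical record layout, not something defined in this repo.

```python
import gzip
import json


def smallest_gzip(payload: bytes) -> bytes:
    """Try every gzip level (1-9) and keep whichever output is smallest."""
    candidates = [
        gzip.compress(payload, compresslevel=level, mtime=0)
        for level in range(1, 10)
    ]
    return min(candidates, key=len)


def updates_since(records, last_updated_at):
    """Return records changed strictly after the given Unix timestamp."""
    return [r for r in records if r["updated_at"] > last_updated_at]


# Hypothetical records; ids and timestamps are made up for the example.
RECORDS = [
    {"id": 1, "updated_at": 100},
    {"id": 2, "updated_at": 200},
    {"id": 3, "updated_at": 300},
]

BODY = json.dumps(RECORDS).encode("utf-8")
PRECOMPRESSED = smallest_gzip(BODY)
CHANGED = updates_since(RECORDS, 150)
```

Since the file only changes when something is posted, the compression cost is paid once per update and not once per request, so trying every level is cheap.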
Scraping
I'm going to give this a shot myself with a little help from Grok to save on the tedious HTML parsing.
- https://ocremix.org/remixes/?&offset=0&sort=datedesc gives
  - the approximate number of OCR ids
  - the actual last OCR id
  - offset increments by 30 in sorted mode
  - in non-sorted mode there is a duplicate row for the Game, followed by remixes for the game
- Remixes
  - start at https://ocremix.org/remix/OCR00001
  - end at https://ocremix.org/remix/OCR04851
  - about 5% no longer exist (maybe pulled for copyright or never finished submission?)
  - all seem to have md5sum hash ids, and direct download links which match the torrent names
- Albums
  - many IDs between https://ocremix.org/album/1 and https://ocremix.org/album/100 happen to be valid
  - other IDs are random (?)
  - must crawl https://ocremix.org/albums/?&offset=0&sort=datedesc to get all album ids
  - some have direct downloads, others have dedicated pages
  - !! not all remixes from /albums appear in the /remixes
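The URL patterns in the notes above are enough to generate a crawl list without touching the network. A minimal sketch: the function names are mine, and the default upper bound of 4851 is just the last id noted above, so it would need updating as new remixes are posted.

```python
REMIXES_LIST = "https://ocremix.org/remixes/?&offset={offset}&sort=datedesc"
REMIX_PAGE = "https://ocremix.org/remix/OCR{number:05d}"
PAGE_SIZE = 30  # offset step in sorted mode, per the notes above


def listing_urls(total, page_size=PAGE_SIZE):
    """Build one sorted-listing URL per page needed to cover `total` rows."""
    return [REMIXES_LIST.format(offset=o) for o in range(0, total, page_size)]


def remix_urls(first=1, last=4851):
    """Build remix page URLs with the zero-padded five-digit OCR id."""
    return [REMIX_PAGE.format(number=n) for n in range(first, last + 1)]


LISTING = listing_urls(61)
REMIXES = remix_urls()
```

A crawler built on this should treat a missing page as normal, since about 5% of the remix ids no longer exist, and album ids still have to come from crawling the /albums listing because they can't be enumerated this way.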