I’m a bit of a fan of 1930s popular music on gramophone records, so much so that I own an original early-30s gramophone player and an extensive collection of discs. So the announcement that the Internet Archive had released a collection of 29,000 records was pretty amazing.
[Edit: If you want a light introduction to this, I recommend this double CD]
I wanted to download it … all!
But apart from this gnomic explanation it isn’t obvious how, so I had to work it out. Here’s how I did it …
First, start with the Advanced Search form. Using the second form on that page, put collection:georgeblood in the query box, select the identifier field (only), and set the format to CSV. Set the limit to 30000 (there are just under 26,000 records), and download the huge CSV:
$ ls -l search.csv
-rw-rw-r--. 1 rjones rjones 2186375 Aug 14 21:03 search.csv
$ wc -l search.csv
25992 search.csv
$ head -5 search.csv
"identifier"
"78_jeannine-i-dream-of-you-lilac-time_bar-harbor-society-orch.-irving-kaufman-shilkr_gbia0010841b"
"78_a-prisoners-adieu_jerry-irby-modern-mountaineers_gbia0000549b"
"78_if-i-had-the-heart-of-a-clown_bobby-wayne-joe-reisman-rollins-nelson-kane_gbia0004921b"
"78_how-many-times-can-i-fall-in-love_patty-andrews-and-tommy-dorsey-victor-young-an_gbia0013066b"
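If you would rather skip the form, the same search can probably be expressed as a single URL. The sketch below builds one, assuming the form submits to archive.org's advancedsearch.php with the parameters q, fl[], rows and output. Those parameter names are an assumption here, so compare the result with the URL your browser actually requests. The function name search_csv_url is made up for illustration.

```python
from urllib.parse import urlencode


def search_csv_url(collection, rows=30000):
    """Build a search URL that should return only identifiers, as CSV.

    The parameter names (q, fl[], rows, output) are assumed to match
    what the Advanced Search form sends; verify before relying on them.
    """
    params = [
        ("q", "collection:%s" % collection),
        ("fl[]", "identifier"),
        ("rows", str(rows)),
        ("output", "csv"),
    ]
    # urlencode percent-escapes the colon and the square brackets.
    return "https://archive.org/advancedsearch.php?" + urlencode(params)


print(search_csv_url("georgeblood"))
```

Fetching that URL with wget or curl and saving the result as search.csv should then stand in for the manual download.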
A bit of URL exploration found a fairly straightforward way to turn those identifiers into directory listings. For example:
78_jeannine-i-dream-of-you-lilac-time_bar-harbor-society-orch.-irving-kaufman-shilkr_gbia0010841b
→ https://archive.org/download/78_jeannine-i-dream-of-you-lilac-time_bar-harbor-society-orch.-irving-kaufman-shilkr_gbia0010841b
What I want to do is pick the first MP3 file in each directory and download it. I’m not fussy about how to do that, and Python has a CSV library as well as libraries for fetching and parsing HTML. This script turns the CSV file of identifiers into a list of MP3 URLs. You could easily adapt it to download FLAC files instead.
#!/usr/bin/python
# Python 2 script; needs the BeautifulSoup 3 package.

import csv
import re
import urllib2
import urlparse
from BeautifulSoup import BeautifulSoup

with open('search.csv', 'rb') as csvfile:
    csvreader = csv.reader(csvfile, delimiter=',', quotechar='"')
    for row in csvreader:
        # Skip the CSV header line.
        if row[0] == "identifier":
            continue
        url = "https://archive.org/download/%s/" % row[0]
        page = urllib2.urlopen(url).read()
        soup = BeautifulSoup(page)
        links = soup.findAll('a', attrs={'href': re.compile(r"\.mp3$")})
        # Skip any item whose directory listing has no MP3 link.
        if not links:
            continue
        # Only want the first link in the page.
        link = links[0]
        link = link.get('href', None)
        link = urlparse.urljoin(url, link)
        print link
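The script above uses urllib2, urlparse and the old BeautifulSoup import, none of which exist under those names in Python 3. A rough Python 3 equivalent using only the standard library might look like the sketch below. The names Mp3LinkParser, first_mp3_url and mp3_urls are invented for illustration, and the parsing relies on html.parser rather than BeautifulSoup, so treat it as an approximation of the original rather than a drop-in replacement.

```python
import csv
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class Mp3LinkParser(HTMLParser):
    """Collect the href of every <a> tag whose target ends in .mp3."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.endswith(".mp3"):
            self.links.append(href)


def first_mp3_url(base_url, html):
    """Return the absolute URL of the first MP3 link in html, or None."""
    parser = Mp3LinkParser()
    parser.feed(html)
    if not parser.links:
        return None
    return urljoin(base_url, parser.links[0])


def mp3_urls(csv_path):
    """Yield one MP3 URL per identifier in the search CSV (uses the network)."""
    with open(csv_path, newline="") as csvfile:
        for row in csv.reader(csvfile):
            # Skip blank lines and the "identifier" header.
            if not row or row[0] == "identifier":
                continue
            url = "https://archive.org/download/%s/" % row[0]
            with urlopen(url) as response:
                html = response.read().decode("utf-8", errors="replace")
            link = first_mp3_url(url, html)
            if link is not None:
                yield link
```

Looping over mp3_urls("search.csv") and printing each value gives the same kind of URL list as the original script. Changing ".mp3" to ".flac" in the parser is one way to pick FLAC files instead.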
When you run this it converts each identifier into a download URL:
Edit: Amusingly WordPress turns the next pre section with MP3 URLs into music players. I recommend listening to them!
$ ./download.py | head -10
And after that you can download as many 78s as you can handle 🙂 by doing:
$ ./download.py > downloads
$ wget -nc -i downloads
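If wget is not to hand, the download step can be approximated in Python. This is only a sketch of the same idea: read a list of URLs and skip any file that already exists locally, roughly what wget -nc does. The helper names local_name and download_missing are hypothetical, and the fetch parameter exists only so the function can be exercised without touching the network.

```python
import os
import shutil
from urllib.parse import unquote, urlsplit
from urllib.request import urlopen


def local_name(url):
    """Derive a local file name from the last path component of the URL."""
    return os.path.basename(unquote(urlsplit(url).path))


def download_missing(urls, dest_dir=".", fetch=None):
    """Download each URL unless a file of that name already exists.

    Returns the list of file names that were newly written.
    """
    if fetch is None:
        fetch = urlopen
    written = []
    for url in urls:
        name = local_name(url)
        if not name:
            continue
        path = os.path.join(dest_dir, name)
        if os.path.exists(path):
            continue  # already downloaded, so leave it alone
        with fetch(url) as response, open(path, "wb") as out:
            shutil.copyfileobj(response, out)
        written.append(name)
    return written
```

Passing it the non-empty lines of the downloads file is enough to get going. Note that an interrupted run can leave a partial file behind, which a later run would then skip, so wget remains the more robust choice.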
Update
I only downloaded about 5% of the tracks, but it looks as if downloading it all would be ~ 100 GB. Also most of these tracks are still in copyright (thanks to insane copyright terms), so they may not be suitable for sampling on your next gramophone-rap record.
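An estimate like this is a simple extrapolation: take the average size of the files downloaded so far and multiply by the number of records. The sketch below shows the arithmetic. The three file sizes are made-up placeholders, not measurements, and 25991 is the line count of search.csv minus the header line.

```python
def estimate_total_bytes(sample_sizes, total_count):
    """Scale the mean size of a sample of files up to the whole collection."""
    if not sample_sizes:
        raise ValueError("need at least one sample size")
    mean_size = sum(sample_sizes) / len(sample_sizes)
    return mean_size * total_count


# Hypothetical sizes in bytes; substitute the real sizes of your downloads.
sample = [3500000, 4200000, 3900000]
total = estimate_total_bytes(sample, 25991)
print("%.1f GB" % (total / 1e9))
```

In practice the sample sizes could come from os.path.getsize on each downloaded file.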
Update #2
Don’t forget to donate to the Internet Archive. I gave them $50 to continue their excellent work.