Downloading all the 78rpm rips at the Internet Archive

I’m a bit of a fan of 1930s popular music on gramophone records, so much so that I own an original early-30s gramophone player and an extensive collection of discs. So the announcement that the Internet Archive had released a collection of 29,000 records was pretty amazing.

[Edit: If you want a light introduction to this, I recommend this double CD]

I wanted to download it … all!

But apart from this gnomic explanation it isn’t obvious how, so I had to work it out. Here’s how I did it …

First, start with the Advanced Search form. In the second form on that page, put collection:georgeblood in the query box, select only the identifier field, set the format to CSV and the limit to 30000 (there are more than 25,000 records), then download the huge CSV:

$ ls -l search.csv
-rw-rw-r--. 1 rjones rjones 2186375 Aug 14 21:03 search.csv
$ wc -l search.csv
25992 search.csv
$ head -5 search.csv
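
If you would rather script this step than click through the form, here is a minimal sketch (Python 3, standard library only) that builds the equivalent search URL. The helper name `advanced_search_csv_url` is made up for illustration, and the query parameters (`q`, `fl[]`, `rows`, `output`) are my assumption about what the Advanced Search form submits, so check them against the URL your browser actually requests.

```python
from urllib.parse import urlencode


def advanced_search_csv_url(collection, rows=30000):
    """Build a CSV export URL listing the identifiers in a collection.

    Hypothetical helper: the parameter names are assumed to match
    what the Advanced Search form sends.
    """
    params = [
        ("q", "collection:%s" % collection),  # the query box
        ("fl[]", "identifier"),               # only the identifier field
        ("rows", str(rows)),                  # the limit
        ("output", "csv"),                    # the format
    ]
    return "https://archive.org/advancedsearch.php?" + urlencode(params)


print(advanced_search_csv_url("georgeblood"))
```

Fetching that URL and saving the response as search.csv should give the same file as the manual route.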

A bit of URL exploration found a fairly straightforward way to turn those identifiers into directory listings: append the identifier to https://archive.org/download/ and add a trailing slash.


What I want to do is pick the first MP3 file in each directory and download it. I’m not fussy about how to do that, and Python has both a CSV library and an HTML fetching library. The following script turns the CSV file of identifiers into a list of MP3 URLs. You could easily adapt it to download FLAC files instead.


import csv
import re
import urllib2
import urlparse
from BeautifulSoup import BeautifulSoup

with open('search.csv', 'rb') as csvfile:
    csvreader = csv.reader(csvfile, delimiter=',', quotechar='"')
    for row in csvreader:
        # Skip the CSV header row.
        if row[0] == "identifier":
            continue
        # Each identifier maps to a directory listing of its files.
        url = "https://archive.org/download/%s/" % row[0]
        page = urllib2.urlopen(url).read()
        soup = BeautifulSoup(page)
        links = soup.findAll('a', attrs={'href': re.compile(r"\.mp3$")})
        # Skip items that have no MP3 at all.
        if not links:
            continue
        # Only want the first link in the page.
        link = links[0]
        link = link.get('href', None)
        link = urlparse.urljoin(url, link)
        print link
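
The script above is Python 2 with the old BeautifulSoup 3 module. If you only have Python 3, the link-picking step can be sketched with the standard library alone. This is not a drop-in replacement, just an illustration of the same idea: `Mp3LinkParser` and `first_mp3_url` are names I made up, and `example-item` is a placeholder identifier, not a real record.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class Mp3LinkParser(HTMLParser):
    """Collect the href of every <a> tag that ends in .mp3."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.endswith(".mp3"):
            self.links.append(href)


def first_mp3_url(listing_url, html):
    """Return the absolute URL of the first MP3 link, or None."""
    parser = Mp3LinkParser()
    parser.feed(html)
    if not parser.links:
        return None
    # Links in the listing are relative, so resolve against the page URL.
    return urljoin(listing_url, parser.links[0])


# A tiny stand-in for a directory listing page.
sample = ('<a href="../">up</a>'
          '<a href="side-a.mp3">a</a>'
          '<a href="side-b.mp3">b</a>')
print(first_mp3_url("https://archive.org/download/example-item/", sample))
```

To use it for real you would fetch each listing page (for example with urllib.request.urlopen) and pass the decoded HTML in as the second argument.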

When you run this it converts each identifier into a download URL:

Edit: Amusingly WordPress turns the next pre section with MP3 URLs into music players. I recommend listening to them!

$ ./download.py | head -10

And after that you can download as many 78s as you can handle 🙂 by doing:

$ ./download.py > downloads
$ wget -nc -i downloads
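
The -nc flag tells wget not to re-download files that already exist, so you can interrupt and restart the run. If you want the same skip-what-I-have behaviour in Python, a rough sketch follows. The helpers `local_name` and `not_yet_downloaded` are hypothetical, and naming files after the last path component of the URL is my assumption; wget's own naming rules may differ in corner cases.

```python
import os
from urllib.parse import unquote, urlparse


def local_name(url):
    """Guess a local file name: the last path component, URL-decoded."""
    return unquote(os.path.basename(urlparse(url).path))


def not_yet_downloaded(urls, dest_dir="."):
    """Return only the URLs with no matching file in dest_dir."""
    return [
        url for url in urls
        if not os.path.exists(os.path.join(dest_dir, local_name(url)))
    ]
```

Reading the downloads file into a list and passing it through not_yet_downloaded gives you the URLs still left to fetch.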


I only downloaded about 5% of the tracks, but it looks as if downloading them all would take about 100 GB. Also most of these tracks are still in copyright (thanks to insane copyright terms), so they may not be suitable for sampling on your next gramophone-rap record.

Update #2

Don’t forget to donate to the Internet Archive. I gave them $50 to continue their excellent work.



