Downloading all the 78rpm rips at the Internet Archive

I’m a bit of a fan of 1930s popular music on gramophone records, so much so that I own an original early-30s gramophone player and an extensive collection of discs. So the announcement that the Internet Archive had released a collection of 29,000 records was pretty amazing.

[Edit: If you want a light introduction to this, I recommend this double CD]

I wanted to download it … all!

But apart from this gnomic explanation it isn’t obvious how, so I had to work it out. Here’s how I did it …

First, you need to start with the Advanced Search form. Using the second form on that page, put collection:georgeblood in the query box, select the identifier field (only), and set the format to CSV. Set the limit to 30000 (there are around 26,000 records), then download the huge CSV:

$ ls -l search.csv
-rw-rw-r--. 1 rjones rjones 2186375 Aug 14 21:03 search.csv
$ wc -l search.csv
25992 search.csv
$ head -5 search.csv

A bit of URL exploration found a fairly straightforward way to turn those identifiers into directory listings.


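The listing URL appears to follow the Archive's standard download endpoint, https://archive.org/download/&lt;identifier&gt;/ (the exact pattern is an assumption based on how archive.org serves item files). A tiny helper:

```python
def listing_url(identifier):
    # Assumed pattern: archive.org serves an item's files
    # under /download/<identifier>/
    return "https://archive.org/download/%s/" % identifier

print(listing_url("some-identifier"))
```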
What I want to do is pick the first MP3 file in the directory and download it. I’m not fussy about how to do that, and Python has both a CSV library and an HTML fetching library. This turns the CSV file of links into a list of MP3 URLs. You could easily adapt this to download FLAC files instead.


import csv
import re
import urllib2
import urlparse
from BeautifulSoup import BeautifulSoup

with open('search.csv', 'rb') as csvfile:
    csvreader = csv.reader(csvfile, delimiter=',', quotechar='"')
    for row in csvreader:
        # Skip the CSV header row.
        if row[0] == "identifier":
            continue
        # Standard Internet Archive download endpoint for an item.
        url = "https://archive.org/download/%s/" % row[0]
        page = urllib2.urlopen(url).read()
        soup = BeautifulSoup(page)
        links = soup.findAll('a', attrs={'href': re.compile(r"\.mp3$")})
        # Some items may have no MP3s at all.
        if not links:
            continue
        # Only want the first link in the page.
        link = links[0].get('href', None)
        link = urlparse.urljoin(url, link)
        print link
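The script above is Python 2 (urllib2, old BeautifulSoup). For what it's worth, the link-extraction step could be sketched in Python 3 with only the standard library (a hypothetical port, not the original script; fetching the page is left to urllib.request):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class MP3LinkFinder(HTMLParser):
    """Collect hrefs ending in .mp3 from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href and href.endswith(".mp3"):
                self.links.append(href)

def first_mp3_url(page_html, base_url):
    """Return the absolute URL of the first .mp3 link, or None."""
    finder = MP3LinkFinder()
    finder.feed(page_html)
    return urljoin(base_url, finder.links[0]) if finder.links else None
```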

When you run this it converts each identifier into a download URL:

Edit: Amusingly WordPress turns the next pre section with MP3 URLs into music players. I recommend listening to them!

$ ./ | head -10

And after that you can download as many 78s as you can handle 🙂 by doing:

$ ./ > downloads
$ wget -nc -i downloads


I only downloaded about 5% of the tracks, but it looks as if downloading it all would be ~ 100 GB. Also most of these tracks are still in copyright (thanks to insane copyright terms), so they may not be suitable for sampling on your next gramophone-rap record.

Update #2

Don’t forget to donate to the Internet Archive. I gave them $50 to continue their excellent work.




5 responses to “Downloading all the 78rpm rips at the Internet Archive”

  1. kun

    Demand is the drive of creation

  2. Is it just me, or is this incredibly slow?

  3. Anyway: is there a simple way to create a link containing FLACs if MP3s are not available? I’m currently using an if-else-if-else ladder for this, but I’d like to instead do something like:

    links = soup.findAll('a', attrs={'href': re.compile("\.mp3$" or "\.ogg$" or "\.flac$")})

    which I don’t really expect to work 🙂

  4. markov035

    If somebody would post a monthly “downloads” file, that would help us a lot. Now I see it takes an hour for ~1500 links, so maybe 2 days before I get the whole downloads file …

    My internet is about 2Mbps down.


  5. Rob

    There’s a slightly easier way to do that. Use the ia download utility to generate an itemlist, and then work on that to make urls.
    ia search "collection:georgeblood" --itemlist > items
    That gives you a list of identifiers. Then you can split it into smaller chunks so you don’t have one huge file to deal with all at once, like so.
    mkdir items
    cd items
    split -d -l 100 ../items
    That gives you a series of files with 100 identifiers. Then you can generate your urls.
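On the FLAC/OGG question in comment 3: `re.compile` won’t do what’s hoped with `or`-chained strings (Python evaluates `"a" or "b"` to `"a"` before compile ever sees it), but a single alternation pattern handles all three extensions. A sketch:

```python
import re

# One pattern matching any of the three extensions:
AUDIO_RE = re.compile(r"\.(mp3|ogg|flac)$")

# Drop-in replacement for the findAll call in the post:
#   links = soup.findAll('a', attrs={'href': AUDIO_RE})
```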

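Rob’s final step (turning each chunk of identifiers into URLs) isn’t shown; one way to do it is with sed, using the same download endpoint as the script in the post (URL pattern assumed):

```shell
# Simulate one chunk file as produced by `split -d -l 100 ../items`
printf '78_example-one\n78_example-two\n' > x00

# Prefix each identifier with the Archive's download endpoint
for chunk in x*; do
    sed 's|^|https://archive.org/download/|' "$chunk" > "urls-$chunk"
done

cat urls-x00
```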