Downloading all the 78rpm rips at the Internet Archive

I’m a bit of a fan of 1930s popular music on gramophone records, so much so that I own an original early-30s gramophone player and an extensive collection of discs. So the announcement that the Internet Archive had released a collection of 29,000 records was pretty amazing.

[Edit: If you want a light introduction to this, I recommend this double CD]

I wanted to download it … all!

But apart from this gnomic explanation it isn’t obvious how, so I had to work it out. Here’s how I did it …

First, you need to start with the Advanced Search form. Using the second form on that page, put collection:georgeblood in the query box, select the identifier field (only), and set the format to CSV. Set the limit to 30000 (there are around 26,000 records), then download the huge CSV:

$ ls -l search.csv
-rw-rw-r--. 1 rjones rjones 2186375 Aug 14 21:03 search.csv
$ wc -l search.csv
25992 search.csv
$ head -5 search.csv

A bit of URL exploration found a fairly straightforward way to turn those identifiers into directory listings.


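The listing URL appears to follow the Archive's standard download endpoint, https://archive.org/download/&lt;identifier&gt;/ (the exact pattern is an assumption based on how archive.org serves item files). A tiny helper:

```python
def listing_url(identifier):
    # Assumed pattern: archive.org serves an item's files
    # under /download/<identifier>/
    return "https://archive.org/download/%s/" % identifier

print(listing_url("some-identifier"))
```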
What I want to do is pick the first MP3 file in the directory and download it. I’m not fussy about how to do that, and Python has both a CSV library and an HTML fetching library. This turns the CSV file of links into a list of MP3 URLs. You could easily adapt this to download FLAC files instead.


import csv
import re
import urllib2
import urlparse
from BeautifulSoup import BeautifulSoup

with open('search.csv', 'rb') as csvfile:
    csvreader = csv.reader(csvfile, delimiter=',', quotechar='"')
    for row in csvreader:
        # Skip the CSV header row.
        if row[0] == "identifier":
            continue
        # Standard Internet Archive download endpoint for an item.
        url = "https://archive.org/download/%s/" % row[0]
        page = urllib2.urlopen(url).read()
        soup = BeautifulSoup(page)
        links = soup.findAll('a', attrs={'href': re.compile(r"\.mp3$")})
        # Some items may have no MP3s at all.
        if not links:
            continue
        # Only want the first link in the page.
        link = links[0].get('href', None)
        link = urlparse.urljoin(url, link)
        print link
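The script above is Python 2 (urllib2, old BeautifulSoup). For what it's worth, the link-extraction step could be sketched in Python 3 with only the standard library (a hypothetical port, not the original script; fetching the page is left to urllib.request):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class MP3LinkFinder(HTMLParser):
    """Collect hrefs ending in .mp3 from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href and href.endswith(".mp3"):
                self.links.append(href)

def first_mp3_url(page_html, base_url):
    """Return the absolute URL of the first .mp3 link, or None."""
    finder = MP3LinkFinder()
    finder.feed(page_html)
    return urljoin(base_url, finder.links[0]) if finder.links else None
```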

When you run this it converts each identifier into a download URL:

Edit: Amusingly WordPress turns the next pre section with MP3 URLs into music players. I recommend listening to them!

$ ./ | head -10

And after that you can download as many 78s as you can handle 🙂 by doing:

$ ./ > downloads
$ wget -nc -i downloads


I only downloaded about 5% of the tracks, but it looks as if downloading it all would be ~ 100 GB. Also most of these tracks are still in copyright (thanks to insane copyright terms), so they may not be suitable for sampling on your next gramophone-rap record.

Update #2

Don’t forget to donate to the Internet Archive. I gave them $50 to continue their excellent work.




5 responses to “Downloading all the 78rpm rips at the Internet Archive”

  1. kun

    Demand is the drive of creation

  2. Is it just me, or is this incredibly slow?

  3. Anyway: is there a simple way to create a link containing FLACs if MP3s are not available? I’m currently using an if-else-if-else ladder for this, but I’d like to instead do something like:

    links = soup.findAll('a', attrs={'href': re.compile("\.mp3$" or "\.ogg$" or "\.flac$")})

    which I don’t really expect to work 🙂

  4. markov035

    If somebody would post a monthly “downloads” file, that would help us a lot. Now I see it takes an hour for ~1500 links, so maybe 2 days before I get the whole downloads file …

    My internet is about 2Mbps down.


  5. Rob

    There’s a slightly easier way to do that. Use the ia download utility to generate an itemlist, and then work on that to make urls.
    ia search "collection:georgeblood" --itemlist > items
    That gives you a list of identifiers. Then you can split it into smaller chunks so you don’t have one huge file to deal with all at once, like so.
    mkdir items
    cd items
    split -d -l 100 ../items
    That gives you a series of files with 100 identifiers. Then you can generate your urls.
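On the FLAC/OGG question in comment 3: `re.compile` won’t do what’s hoped with `or`-chained strings (Python evaluates `"a" or "b"` to `"a"` before compile ever sees it), but a single alternation pattern handles all three extensions. A sketch:

```python
import re

# One pattern matching any of the three extensions:
AUDIO_RE = re.compile(r"\.(mp3|ogg|flac)$")

# Drop-in replacement for the findAll call in the post:
#   links = soup.findAll('a', attrs={'href': AUDIO_RE})
```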

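Rob’s final step (turning each chunk of identifiers into URLs) isn’t shown; one way to do it is with sed, using the same download endpoint as the script in the post (URL pattern assumed):

```shell
# Simulate one chunk file as produced by `split -d -l 100 ../items`
printf '78_example-one\n78_example-two\n' > x00

# Prefix each identifier with the Archive's download endpoint
for chunk in x*; do
    sed 's|^|https://archive.org/download/|' "$chunk" > "urls-$chunk"
done

cat urls-x00
```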