Downloading all the 78rpm rips at the Internet Archive

I’m a bit of a fan of 1930s popular music on gramophone records, so much so that I own an original early-30s gramophone player and an extensive collection of discs. So the announcement that the Internet Archive had released a collection of 29,000 records was pretty amazing.

[Edit: If you want a light introduction to this, I recommend this double CD]

I wanted to download it … all!

But apart from this gnomic explanation it isn’t obvious how, so I had to work it out. Here’s how I did it …

First you need the Advanced Search form. Using the second form on that page, put collection:georgeblood in the query box, select only the identifier field, and set the format to CSV. Set the limit to 30000 (there are just under 26,000 records) and download the huge CSV:

$ ls -l search.csv
-rw-rw-r--. 1 rjones rjones 2186375 Aug 14 21:03 search.csv
$ wc -l search.csv
25992 search.csv
$ head -5 search.csv
"identifier"
"78_jeannine-i-dream-of-you-lilac-time_bar-harbor-society-orch.-irving-kaufman-shilkr_gbia0010841b"
"78_a-prisoners-adieu_jerry-irby-modern-mountaineers_gbia0000549b"
"78_if-i-had-the-heart-of-a-clown_bobby-wayne-joe-reisman-rollins-nelson-kane_gbia0004921b"
"78_how-many-times-can-i-fall-in-love_patty-andrews-and-tommy-dorsey-victor-young-an_gbia0013066b"

A bit of URL exploration found a fairly straightforward way to turn those identifiers into directory listings. For example:

78_jeannine-i-dream-of-you-lilac-time_bar-harbor-society-orch.-irving-kaufman-shilkr_gbia0010841b → https://archive.org/download/78_jeannine-i-dream-of-you-lilac-time_bar-harbor-society-orch.-irving-kaufman-shilkr_gbia0010841b

What I want to do is pick the first MP3 file in each directory and download it. I’m not fussy about how to do that, and Python has a CSV library, an HTTP library, and (via BeautifulSoup) an HTML parser. The script below turns the CSV file of identifiers into a list of MP3 URLs. You could easily adapt it to download the FLAC files instead.

#!/usr/bin/python

import csv
import re
import urllib2
import urlparse
from BeautifulSoup import BeautifulSoup

with open('search.csv', 'rb') as csvfile:
    csvreader = csv.reader(csvfile, delimiter=',', quotechar='"')
    for row in csvreader:
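        # Skip the CSV header row.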
        if row[0] == "identifier":
            continue
        url = "https://archive.org/download/%s/" % row[0]
        page = urllib2.urlopen(url).read()
        soup = BeautifulSoup(page)
        links = soup.findAll('a', attrs={'href': re.compile(r"\.mp3$")})
        # Only want the first link in the page.
        link = links[0]
        link = link.get('href', None)
        link = urlparse.urljoin(url, link)
        print link

When you run this it converts each identifier into a download URL:

Edit: Amusingly WordPress turns the next pre section with MP3 URLs into music players. I recommend listening to them!

$ ./download.py | head -10
[the first ten MP3 download URLs, which WordPress renders as embedded audio players]

And after that you can download as many 78s as you can handle 🙂 by doing:

$ ./download.py > downloads
$ wget -nc -i downloads

Update

I only downloaded about 5% of the tracks, but it looks as if downloading it all would be ~ 100 GB. Also most of these tracks are still in copyright (thanks to insane copyright terms), so they may not be suitable for sampling on your next gramophone-rap record.

Update #2

Don’t forget to donate to the Internet Archive. I gave them $50 to continue their excellent work.


Fedora 26 is out, virt-builder images available

Fedora 26 is released today. virt-builder images are already available for almost all architectures:

$ virt-builder -l | grep fedora-26
fedora-26                aarch64    Fedora® 26 Server (aarch64)
fedora-26                armv7l     Fedora® 26 Server (armv7l)
fedora-26                i686       Fedora® 26 Server (i686)
fedora-26                ppc64      Fedora® 26 Server (ppc64)
fedora-26                ppc64le    Fedora® 26 Server (ppc64le)
fedora-26                x86_64     Fedora® 26 Server

For example:

$ virt-builder fedora-26
$ qemu-system-x86_64 -machine accel=kvm:tcg -cpu host -m 2048 \
    -drive file=fedora-26.img,format=raw,if=virtio

Why not s390x? That’s because qemu doesn’t yet emulate enough of the s390x instruction set / architecture for us to run Fedora under TCG emulation.


Patch review and message brokers

One thing I’ve wanted to do for a long time is get better at patch review. It’s pretty important for successful open source projects to provide feedback to developers quickly, and as anyone who follows me on the libguestfs mailing list will know, I’m terrible at it.

One thing I could do to make it a bit better is to automate the boring bits: Does the patch series apply? Does it compile? Does it pass the test suite? If one of those things isn’t true then we tell the submitter to fix it.

Some projects, the Linux Kernel Mailing List (LKML) for instance, provide basic feedback automatically. For LKML this is provided by Intel’s 0-day test service. If you post a patch on LKML then sooner or later you’ll receive an automated reply like this one.

Today I thought I’d write something like this, partly to reinvent the wheel, but mostly to learn more about the RabbitMQ message broker.

You see, if you have to receive emails, run large tests, and send more emails, then at least two and possibly more machines are going to be involved, and as soon as you are using two or more machines you are writing a distributed system and need to use the right tools. Message brokers, and RabbitMQ in particular, make writing distributed systems easy; trust me, I’ll show you how!

Receiving emails

Our first task is getting the emails into the system. We can use a procmail rule to copy emails to a command of our choice, but the command only sees one email at a time. Patch series are spread over many individual emails, you don’t always get them all at once, and you certainly aren’t guaranteed to get them in order.

So first of all I set up a RabbitMQ queue which just takes in emails in any order and queues them:

The input to this queue is a simple script which can inject single emails or (for testing) mboxes.
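
The pika side of such a script is only a few lines. A minimal sketch, where the queue name patchq_input and the other details are illustrative choices rather than necessarily what the real script uses:

#!/usr/bin/python
# Sketch: publish one email (read from stdin) onto the input queue.
# The queue name 'patchq_input' is an assumption for illustration.

import sys

import pika

msg = sys.stdin.read()

conn = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = conn.channel()
# Durable queue + persistent messages, so queued emails survive a restart.
channel.queue_declare(queue='patchq_input', durable=True)
channel.basic_publish(
    exchange='',                  # default exchange: routes by queue name
    routing_key='patchq_input',
    body=msg,
    properties=pika.BasicProperties(delivery_mode=2))
conn.close()

The procmail rule can then pipe each incoming message straight into a script like this.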

Threading emails

Reconstructing the email thread, filtering out non-patch emails, and putting the patches into the right order, is done by a second task which runs periodically from a cron job.

The threading task examines the input queue and tries to reconstruct whole patch series from it. If it gets a patch series, it removes those messages from the input queue and places the whole patch series as a single message on a second queue:

What makes this possible is that RabbitMQ allows you to get messages from a queue and acknowledge (or not acknowledge) them later. So the threader gets all the available messages and tries to assemble them into threads. If it finds a complete patch series, it acknowledges all of those emails, which deletes them from the input queue. For incomplete patch series it doesn’t bother to acknowledge them, so they stay on the queue for next time.
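
In pika terms the fetch-now, acknowledge-later pattern looks roughly like this. It is only a sketch: the queue and exchange names are illustrative, and the real thread reconstruction is reduced to a stub:

#!/usr/bin/python
# Sketch of the threader's fetch-now, acknowledge-later pattern.
# Queue/exchange names are illustrative; complete_series() is a stub
# standing in for the real thread-reconstruction logic.

import pika

def complete_series(emails):
    """Return lists of (delivery_tag, body) pairs, one list per complete,
    correctly ordered patch series found among the queued emails."""
    return []   # stub: real code would follow Message-Id/References headers

conn = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = conn.channel()
channel.queue_declare(queue='patchq_input', durable=True)
channel.exchange_declare(exchange='patchq_series', exchange_type='fanout')

# Drain everything currently queued, without acknowledging anything yet.
emails = []
while True:
    method, _props, body = channel.basic_get(queue='patchq_input')
    if method is None:
        break                     # input queue is empty for now
    emails.append((method.delivery_tag, body))

for series in complete_series(emails):
    # Publish the whole series as a single message ...
    channel.basic_publish(exchange='patchq_series', routing_key='',
                          body='\n'.join(b for _tag, b in series))
    # ... then acknowledge its emails, which deletes them from the input queue.
    for tag, _b in series:
        channel.basic_ack(delivery_tag=tag)

# Unacknowledged messages are requeued when the connection closes, so
# incomplete series simply wait on the input queue for the next run.
conn.close()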

By the magic of message brokers, the threader doesn’t even need to run on the same machine. Conceivably you could even run it on multiple machines if there was a very high load situation, and it would still work reliably.

Performing the tests

Once we have our fully assembled patch series threads, these are distributed to multiple queues by a RabbitMQ fanout exchange. There is one queue for each type of test we need to run:
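
In pika this is a one-off fanout declaration plus one bound queue per test; a sketch with made-up exchange, queue and test names:

#!/usr/bin/python
# Sketch: a fanout exchange with one bound queue per type of test.
# The exchange, queue and test names here are illustrative only.

import pika

conn = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = conn.channel()

channel.exchange_declare(exchange='patchq_series', exchange_type='fanout')

for test in ['apply', 'build', 'check']:      # one queue per test script
    q = 'patchq_test_' + test
    channel.queue_declare(queue=q, durable=True)
    channel.queue_bind(queue=q, exchange='patchq_series')

# Every patch series published to 'patchq_series' now lands on every
# test queue, where a perform-tests.py instance can pick it up.
conn.close()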

An instance of a third task called perform-tests.py picks up the patches and tests them using custom scripts that you have to write.

The tests usually run inside a virtual machine to keep them out of harm’s way. Again the use of a message broker makes this trivial, and you can even distribute the tests over many machines if you want with no extra programming.

Reporting

There is a final queue: When tests finish, we don’t necessarily want to email out the report from the test machine. There would be several problems with that: it would reveal details of your testing infrastructure in email headers; SMTP servers aren’t necessarily available all the time; and you don’t always want your test machines to have access to the public internet.

Instead the result is placed on the patchq_reports queue, and a final task called send-reports.py picks these reports up and sends them out periodically. The report emails have the proper headers so they are threaded into the original mailing list postings.
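
A sketch of what such a sender could look like. The queue name, report format and SMTP details here are assumptions, not necessarily how send-reports.py really works:

#!/usr/bin/python
# Sketch: drain the reports queue and email each report out.
# Assumes every report is a complete RFC 2822 message whose In-Reply-To
# and References headers were already set by the test task.

import email
import smtplib

import pika

conn = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = conn.channel()
channel.queue_declare(queue='patchq_reports', durable=True)

smtp = smtplib.SMTP('localhost')    # a host that is allowed to send mail

while True:
    method, _props, body = channel.basic_get(queue='patchq_reports')
    if method is None:
        break                       # no more reports queued
    msg = email.message_from_string(body)
    smtp.sendmail(msg['From'], [msg['To']], body)
    channel.basic_ack(delivery_tag=method.delivery_tag)  # only after sending

smtp.quit()
conn.close()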

Conclusion

It’s a simple but powerful multi-machine test framework, all in under 600 lines of code.


virt-builder Debian 9 image available

Debian 9 (“Stretch”) was released last week and now it’s available in virt-builder, the fast way to build virtual machine disk images:

$ virt-builder -l | grep debian
debian-6                 x86_64     Debian 6 (Squeeze)
debian-7                 sparc64    Debian 7 (Wheezy) (sparc64)
debian-7                 x86_64     Debian 7 (Wheezy)
debian-8                 x86_64     Debian 8 (Jessie)
debian-9                 x86_64     Debian 9 (stretch)

$ virt-builder debian-9 \
    --root-password password:123456
[   0.5] Downloading: http://libguestfs.org/download/builder/debian-9.xz
[   1.2] Planning how to build this image
[   1.2] Uncompressing
[   5.5] Opening the new disk
[  15.4] Setting a random seed
virt-builder: warning: random seed could not be set for this type of guest
[  15.4] Setting passwords
[  16.7] Finishing off
                   Output file: debian-9.img
                   Output size: 6.0G
                 Output format: raw
            Total usable space: 3.9G
                    Free space: 3.1G (78%)

$ qemu-system-x86_64 \
    -machine accel=kvm:tcg -cpu host -m 2048 \
    -drive file=debian-9.img,format=raw,if=virtio \
    -serial stdio


New in libguestfs: Rewriting bits of the daemon in OCaml

libguestfs is a C library for creating and editing disk images. In the most common (but not the only) configuration, it uses KVM to sandbox access to disk images. The C library talks to a separate daemon running inside a KVM appliance, as in this Unicode-art diagram taken from the fine manual:

 ┌───────────────────┐
 │ main program      │
 │                   │
 │                   │           child process / appliance
 │                   │          ┌──────────────────────────┐
 │                   │          │ qemu                     │
 ├───────────────────┤   RPC    │      ┌─────────────────┐ │
 │ libguestfs  ◀╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌▶ guestfsd        │ │
 │                   │          │      ├─────────────────┤ │
 └───────────────────┘          │      │ Linux kernel    │ │
                                │      └────────┬────────┘ │
                                └───────────────│──────────┘
                                                │
                                                │ virtio-scsi
                                         ┌──────┴──────┐
                                         │  Device or  │
                                         │  disk image │
                                         └─────────────┘

The library has to be written in C because it needs to be linked to any main program. The daemon (guestfsd in the diagram) is also written in C, but there’s no specific reason for that, except that it’s what we did historically.

The daemon is essentially a big pile of functions, most corresponding to a libguestfs API. Writing the daemon in C is painful to say the least. Because it’s a long-running process running in a memory-constrained environment, we have to be very careful about memory management, religiously checking every return from malloc, strdup etc., making even the simplest task non-trivial and full of untested code paths.

So last week I modified libguestfs so you can now write APIs in OCaml if you want to. OCaml is a high level language that compiles down to object files, and it’s entirely possible to link the daemon from a mix of C object files and OCaml object files. Another advantage of OCaml is that you can call between C and OCaml with relatively little glue code (although a disadvantage is that you still need to write that glue mostly by hand). Most simple calls turn into direct CALL instructions with just a simple bitshift required to convert between ints and bools on the C and OCaml sides. More complex calls passing strings and structures are not too difficult either.

OCaml also turns memory errors into a single exception, which unwinds the stack cleanly, so we don’t litter the code with memory handling. We can still run the mixed C/OCaml binary under valgrind.

Code gets quite a bit shorter. For example the case_sensitive_path API โ€” all string handling and directory lookups โ€” goes from 183 lines of C code to 56 lines of OCaml code (and much easier to understand too).

I’m reimplementing a few APIs in OCaml, but the plan is definitely not to convert them all. I think we’ll have C and OCaml APIs in the daemon for a very long time to come.


AMD Seattle LeMaker Cello

I was a bit optimistic when I said:

the LeMaker Cello is available for preorder with delivery next month.

back in March 2016 (sic).

But hey, better late than never.

AMD seem to have decided to give up on ARM, making this board now only a historical curiosity. And look at that heatsink! I suspect these early chips have cooling problems.


How many disks can you add to a (virtual) Linux machine? (contd)

In my last post I tried to see what happens when you add thousands of virtio-scsi disks to a Linux virtual machine. Above 10,000 disks the qemu command line grew too long for the host to handle. Several people pointed out that I could use the qemu -readconfig parameter to read the disks from a file. So I modified libguestfs to allow that. What will be the next limit?
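
If you want to play along, the libguestfs Python bindings make the basic experiment short. This is only a rough sketch using throwaway scratch disks, not the exact script used for these tests:

#!/usr/bin/python
# Rough sketch: attach N throwaway scratch disks to the libguestfs
# appliance. Not the exact script used for these tests, just the idea.

import guestfs

N = 1000                              # increase this to find your own limits

g = guestfs.GuestFS(python_return_dict=True)
for i in range(N):
    g.add_drive_scratch(1024 * 1024)  # a 1 MB temporary scratch disk each
g.launch()                            # boot the appliance with all the disks
print("appliance sees %d block devices" % len(g.list_devices()))
g.shutdown()
g.close()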

18,278

Linux uses a strange scheme for naming disks which I’ve covered before on this blog. In brief, disks are named /dev/sda through /dev/sdz, then /dev/sdaa through /dev/sdzz, and so on; the 18,278th disk is /dev/sdzzz. What’s special about zzz? Nothing really, but historically Linux device drivers would fail after this point, although that is not a problem for modern Linux.
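
The scheme is just bijective base 26, which also gives a quick way to check that 18,278 figure (26 + 26² + 26³ = 18,278):

# Map a 0-based disk index to its Linux name: 0 -> sda, 25 -> sdz,
# 26 -> sdaa, ..., 18277 -> sdzzz (the 18,278th disk).

def disk_name(index):
    name = ""
    index += 1                     # switch to 1-based bijective numbering
    while index > 0:
        index, rem = divmod(index - 1, 26)
        name = chr(ord('a') + rem) + name
    return "sd" + name

assert disk_name(0) == "sda"
assert disk_name(25) == "sdz"
assert disk_name(26) == "sdaa"
assert disk_name(18277) == "sdzzz"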

20,000

In any case I created a Linux guest with 20,000 drives with no problem, except for the enormous boot time: it had been booting for over 12 hours when I killed it. Most of the time was being spent in:

-   72.62%    71.30%  qemu-system-x86  qemu-system-x86_64  [.] drive_get
   - 72.62% drive_get
      - 1.26% __irqentry_text_start
         - 1.23% smp_apic_timer_interrupt
            - 1.00% local_apic_timer_interrupt
               - 1.00% hrtimer_interrupt
                  - 0.82% __hrtimer_run_queues
                       0.53% tick_sched_timer

Drives are stored inside qemu on a linked list, and the drive_get function iterates over this linked list, so each lookup is O(n) and setting up n drives is O(n²); of course everything is extremely slow when the list grows this long.

QEMU bug filed: https://bugs.launchpad.net/qemu/+bug/1686980

Edit: Dan Berrange posted a hack which gets me past this problem, so now I can add 20,000 disks.

The guest boots fine, albeit taking about 30 minutes (and udev hasn’t completed device node creation in that time, it’s still going on in the background).

><rescue> ls -l /dev/sd[Tab]
Display all 20001 possibilities? (y or n)
><rescue> mount
/dev/sdacog on / type ext2 (rw,noatime,block_validity,barrier,user_xattr,acl)

As you can see, the modern Linux kernel and userspace handle “four letter” drive names like a champ.

Over 30,000

I managed to create a guest with 30,000 drives. I had to give the guest 50 GB (yes, not a mistake) of RAM to get this far. With less RAM, disk probing fails with:

scsi_alloc_sdev: Allocation failure during SCSI scanning, some SCSI devices might not be configured

I’d seen SCSI probing run out of memory before, and I made a back-of-the-envelope calculation that each disk consumed 200 KB of RAM. However that cannot be correct: 200 KB × 30,000 disks is only about 6 GB, nowhere near the 50 GB needed, so there must be a non-linear relationship between the number of disks and the RAM used by the kernel.

Because my development machine simply doesn’t have enough RAM to go further, I wasn’t able to add more than 30,000 drives, so that’s where we have to end this little experiment, at least for the time being.

><rescue> ls -l /dev/sd???? | tail
brw------- 1 root root  66, 30064 Apr 28 19:35 /dev/sdarin
brw------- 1 root root  66, 30080 Apr 28 19:35 /dev/sdario
brw------- 1 root root  66, 30096 Apr 28 19:35 /dev/sdarip
brw------- 1 root root  66, 30112 Apr 28 19:35 /dev/sdariq
brw------- 1 root root  66, 30128 Apr 28 19:35 /dev/sdarir
brw------- 1 root root  66, 30144 Apr 28 19:35 /dev/sdaris
brw------- 1 root root  66, 30160 Apr 28 19:35 /dev/sdarit
brw------- 1 root root  66, 30176 Apr 28 19:24 /dev/sdariu
brw------- 1 root root  66, 30192 Apr 28 19:22 /dev/sdariv
brw------- 1 root root  67, 29952 Apr 28 19:35 /dev/sdariw
