Installing Fedora 34 on my Turing Pi 7 node cluster

I now have Fedora 34 running on all 7 nodes of my Turing Pi 1 cluster. It was tedious to install, so these are just my notes.

Compile https://github.com/raspberrypi/usbboot
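
Roughly, assuming the libusb-1.0 development headers and pkg-config are installed:

$ git clone --depth=1 https://github.com/raspberrypi/usbboot
$ cd usbboot
$ make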

Insert the compute module you want to program, set the jumper to flash mode, connect the USB cable, power it on and run ./rpiboot.

A /dev/sdX device will appear once the CM is in flash mode. Then flash the Fedora image with arm-image-installer:

arm-image-installer --image=$HOME/Fedora-Server-34-1.2.aarch64.raw.xz --target=rpi3 --media=/dev/sdb --resizefs --relabel --addkey $HOME/.ssh/id_rsa.pub

This takes quite a long time to run. You also need to mask the initial-setup service:

virt-customize -a /dev/sdb --link /dev/null:/etc/systemd/system/initial-setup.service --selinux-relabel

If it’s successful, reboot the Turing Pi with the jumper in the boot setting and check that the first CM comes up (I used an HDMI monitor).

Repeat with the other 6 boards(!)

Edit: I couldn’t work out how to make the MAC addresses stable across reboots, so I set one manually on each node:

nmcli c modify "Wired connection 1" 802-3-ethernet.cloned-mac-address aa:bb:cc:dd:ee:ff
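
You can check that the setting took effect with:

$ nmcli c show "Wired connection 1" | grep cloned-mac-address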

Turing Pi 1

Finally getting down to assembling and installing this 7 node cluster.

Mini Cloud/Cluster v2.0

Last year I wrote and rewrote a little command line tool for managing my virtualization cluster.

Of course I could use OpenStack RDO, but OpenStack is a vast box of somewhat-working bits and pieces. I think that for a small cluster like mine you can get the essential functionality of OpenStack a lot more simply: in 1,300 lines of code, as it turns out.

The first thing that small cluster management software doesn’t need is any permanent daemon running on the nodes. The reason is that we already have sshd (for secure management access) and libvirtd (to manage the guests) out of the box. That’s quite sufficient to manage all the state we care about. My Mini Cloud/Cluster software just goes out and queries each node for that information whenever it needs it (in parallel of course). Nodes that are switched off are handled by ignoring them.

The second thing is that for a small cloud we can toss features that aren’t needed at all: multi-user/multi-tenant, failover, VLANs, a nice GUI.

The old mclu (Mini Cluster) v1.0 was written in Python and used Ansible to query nodes. If you’re not familiar with Ansible, it’s basically parallel ssh on steroids. This was convenient to get the implementation working, but I ended up rewriting this essential feature of Ansible in ~ 60 lines of code.

The huge down-side of Python is that even such a small program has loads of hidden bugs, because there’s no safety at all. The rewrite (in OCaml) is 1,300 lines of code, so a fraction larger, but I have a far higher confidence that it is mostly bug free.

I also changed around the way the software works to make it more “cloud like” (hence the name change from “Mini Cluster” to “Mini Cloud”). Guests are now created from templates using virt-builder, and are stateless “cattle” (although you can mix in “pets” and mclu will manage those perfectly well, because all it’s doing is running remote libvirt-over-ssh commands).
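
In essence, everything mclu does boils down to commands like this one (ham0 being one of my nodes):

$ virsh -c qemu+ssh://root@ham0/system list --all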

$ mclu status
ham0                     on
                           total: 8pcpus 15.2G
                            used: 8vcpus 8.0G by 2 guest(s)
                            free: 6.2G
ham1                     on
                           total: 8pcpus 15.2G
                            free: 14.2G
ham2                     on
                           total: 8pcpus 30.9G
                            free: 29.9G
ham3                     off

You can grab mclu v2.0 from the git repository.

LVM cache contd: Tip: Using tags

(Thanks Alasdair Kergon)

# pvchange --addtag ssd /dev/sdc1
  Physical volume "/dev/sdc1" changed
  1 physical volume changed / 0 physical volumes not changed
# pvchange --addtag slow /dev/md127
  Physical volume "/dev/md127" changed
  1 physical volume changed / 0 physical volumes not changed
# pvs -o+tags
  PV         VG        Fmt  Attr PSize   PFree   PV Tags
  /dev/md127 vg_guests lvm2 a--    1.82t 962.89g slow
  /dev/sdc1  vg_guests lvm2 a--  232.88g 232.88g ssd

These tags can be used when placing logical volumes onto specific physical volumes:

# lvcreate -L 800G -n testorigin vg_guests @slow
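
The same works for the fast disk; for example, the cache LVs from the LVM cache posts below could be placed by tag instead of by device name:

# lvcreate -L 1G -n lv_cache_meta vg_guests @ssd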

References:

  1. https://www.redhat.com/archives/linux-lvm/2014-May/msg00047.html
  2. https://www.redhat.com/archives/linux-lvm/2014-May/msg00048.html

Cluster performance: baseline testing

I’m using fio (as recommended by Linus!) to baseline test my virtualization cluster. My fio script is supposed to look a bit like a qemu process:

[virt]
ioengine=libaio
iodepth=4
rw=randrw
bs=64k
direct=1
size=1g
numjobs=4

It has 4 large “disks” (size=1g) and 4 “qemu processes” (numjobs=4) running in parallel. Each test thread can have up to 4 IOs in flight (iodepth=4), and the size of each IO is 64K, which matches the default qcow2 cluster size. I enabled O_DIRECT (direct=1) because we normally run qemu with cache=none so that live migration works.
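
Save the job file as, say, virt.fio (the name is arbitrary) and run it with:

$ fio virt.fio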

The first node now has a RAID 1 array of spinning rust (hard disks) and a smaller SSD, and the plan is to use LVM-cache so the SSD can sit on top of the RAID array.

Performance of the RAID 1 array of hard disks

The raw performance of the RAID 1 array (this includes the filesystem) is fairly dismal:

virt-ham0-raid1.txt

Performance of the SSD

The SSD in contrast does a lot better:

virt-ham0-ssd.txt

However you look at the details, the fact is that the test runs 11 times faster on the SSD.

The effect of NFS

What about when we NFS-mount the RAID array or the SSD on another node? This should tell us the effect of NFS.

virt-ham1-raid1-nfs.txt

NFS makes this test run 3 times slower.

For the NFS-mounted SSD:

virt-ham1-ssd-nfs.txt

NFS makes this test run 4.5 times slower.

The effect of virtualization

By running the virtual machine on the first node (with the disks) it should be possible to see just the effect of virtualization. Since this is backed by the RAID 1 hard disk array, not SSDs or LVM cache, it should be compared only to the RAID 1 performance.

virt-vm-on-ham0.txt

The effect of virtualization (virtio-scsi in this case) is about an 8% drop in performance, which is not something I’m going to worry about.

Conclusions

  • The gains from the SSD (ie. using LVM cache) could outweigh the losses from having to use NFS to share the disk images.
  • It’s worth looking at alternative high-bandwidth, low-latency interconnects (instead of 1 gigE) to make NFS perform better. I’m going to investigate using Infiniband soon.

These are just the baseline measurements without LVM cache.

I’ve included links to the full test results. fio gives a huge amount of detail, and it’s helpful to keep the fio HOWTO open so you can understand all the figures it produces.

Removing the cache from an LV

If you’ve cached an LV (see yesterday’s post), how do you remove the cache and go back to just a simple origin LV?

It turns out to be simple, but you must make sure you are removing the cache pool (not the origin LV, not the CacheMetaLV):

# lvremove vg_guests/lv_cache
  Flushing cache for testoriginlv.
  0 blocks must still be flushed.
  Logical volume "lv_cache" successfully removed

This command deletes the CacheDataLV and CacheMetaLV. To re-enable the cache you have to go through the whole process again.

This is also what you must (currently) do if you want to resize the LV — ie. remove the cache pool, resize the origin LV, then recreate the cache on top. Apparently they’re going to get lvresize to do the right thing in a future version of LVM.
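
A sketch of that workaround, using the LV names from these posts (the sizes are illustrative):

# lvremove vg_guests/lv_cache
# lvresize -L +100G vg_guests/testoriginlv
# resize2fs /dev/vg_guests/testoriginlv

Then recreate the cache pool and attach it as described in yesterday’s post.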

Using LVM’s new cache feature

If you have a machine with slow hard disks and fast SSDs, and you want to use the SSDs to act as fast persistent caches to speed up access to the hard disk, then until recently you had three choices: bcache or dm-cache (both upstream), or Flashcache/EnhanceIO (not upstream). dm-cache required you to first sit down with a calculator to compute block offsets, so bcache was the sanest of the three choices.

But recently LVM has added caching support (built on top of dm-cache), so in theory you can take your existing logical volumes and convert them to be cached devices.

The Set-up

To find out how well this works in practice I have added 3 disks to my previously diskless virtualization cluster:

[Photo: the new drives installed in the node]

There are two 2 TB WD hard disks in mirrored configuration. Those are connected by the blue (“cold”) wires. And on the left there is one Samsung EVO 250 GB SSD, which is the red (“hot”) drive that will act as the cache.

In other news: wow, SSDs from brand manufacturers are getting really cheap now!

In the lsblk output below, sda and sdb are the WD hard drives, and sdc is the Samsung SSD:

# lsblk
NAME                                     MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda                                        8:0    0   1.8T  0 disk  
└─sda1                                     8:1    0   1.8T  0 part  
  └─md127                                  9:127  0   1.8T  0 raid1 
sdb                                        8:16   0   1.8T  0 disk  
└─sdb1                                     8:17   0   1.8T  0 part  
  └─md127                                  9:127  0   1.8T  0 raid1 
sdc                                        8:32   0 232.9G  0 disk  
└─sdc1                                     8:33   0 232.9G  0 part  

Performance

Before starting to set up the caching layer, let’s find out how fast the hard disks are. Note that these figures include the ext4 and LVM overhead (ie. they are done on files on a filesystem, not on the raw block devices). I also used O_DIRECT.

HDD writes: 114 MBytes/sec
HDD reads: 138 MBytes/sec
SSD writes: 157 MBytes/sec
SSD reads: 197 MBytes/sec
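
(For reference, these were simple dd runs on a file on the mounted filesystem, something like the following; the paths are hypothetical:)

# dd if=/dev/zero of=/mnt/tmp/test.file bs=1M count=4096 oflag=direct
# dd if=/mnt/tmp/test.file of=/dev/null bs=1M iflag=direct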

Note these numbers don’t show the real benefit of SSDs — namely that performance doesn’t collapse as soon as you randomly access the disk.

Terminology

The lvmcache(7) [so new there is no copy online yet] documentation defines various terms that I will use:

origin LV           OriginLV      large slow LV
cache data LV       CacheDataLV   small fast LV for cache pool data
cache metadata LV   CacheMetaLV   small fast LV for cache pool metadata
cache pool LV       CachePoolLV   CacheDataLV + CacheMetaLV
cache LV            CacheLV       OriginLV + CachePoolLV

Creating the LVs

Since the documentation contains a frankly rather scary and confusing section about all the ways that removing the wrong LV will completely nuke your OriginLV, for the purposes of testing I created a dummy OriginLV with some dummy disk images on the slow HDDs:

# lvcreate -L 100G -n testoriginlv vg_guests
  Logical volume "testoriginlv" created
# mkfs -t ext4 /dev/vg_guests/testoriginlv

Also note that resizing cached LVs is not currently supported (coming later — for now you can work around it by removing the cache, resizing, then recreating the cache).

Creating the cache layer

What is not clear from the documentation is that everything must be in a single volume group. That is, you must create a volume group which includes both the slow and fast disks — it simply doesn’t work otherwise.

Therefore my first step is to extend my existing VG to include the fast disk:

# vgextend vg_guests /dev/sdc1
  Volume group "vg_guests" successfully extended

I create two LVs on the fast SSD. One is the CacheDataLV, which is where the caching takes place. The other is the CacheMetaLV, which is used to store an index of the data blocks that are cached on the CacheDataLV. The documentation says that the CacheMetaLV should be 1/1000th of the size of the CacheDataLV, with a minimum of 8MB. Since my total available fast space is 232GB, and I want roughly a 1000:1 split, I choose a generous 1GB for the CacheMetaLV and 229G for the CacheDataLV, which leaves some left-over space (my eventual split turns out to be 229:1).

# lvcreate -L 1G -n lv_cache_meta vg_guests /dev/sdc1
  Logical volume "lv_cache_meta" created
# lvcreate -L 229G -n lv_cache vg_guests /dev/sdc1
  Logical volume "lv_cache" created
# lvs
  LV                     VG        Attr       LSize
  lv_cache               vg_guests -wi-a----- 229.00g
  lv_cache_meta          vg_guests -wi-a-----   1.00g
  testoriginlv           vg_guests -wi-a----- 100.00g
# pvs
  PV         VG        Fmt  Attr PSize   PFree  
  /dev/md127 vg_guests lvm2 a--    1.82t 932.89g
  /dev/sdc1  vg_guests lvm2 a--  232.88g   2.88g

(You’ll notice that my cache is bigger than my test OriginLV, but that’s fine as once I’ve worked out all the gotchas, my real OriginLV will be over 1 TB).

Why did I leave 2.88GB of free space in the PV? I’m not sure actually. However the first time I did this, I didn’t leave any space, and the lvconvert command [below] complained that it needed 256 extents (1GB) of workspace. See Alex’s comment below.

Convert the CacheDataLV and CacheMetaLV into a “cache pool”:

# lvconvert --type cache-pool --poolmetadata vg_guests/lv_cache_meta vg_guests/lv_cache
  Logical volume "lvol0" created
  Converted vg_guests/lv_cache to cache pool.

Now attach the cache pool to the OriginLV to create the final cache object:

# lvconvert --type cache --cachepool vg_guests/lv_cache vg_guests/testoriginlv
  vg_guests/testoriginlv is now cached.
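
You can check the result at this point; lvs -a also shows the hidden cache pool components:

# lvs -a vg_guests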

Benchmark

Looks good, but how well does it work? I repeated my benchmarks above on the cached LV:

LV-cache writes: 114 MBytes/sec
LV-cache reads: 138 MBytes/sec

That is exactly the same as the backing hard disk.

Luckily this is correct behaviour. Mike Snitzer gave me an explanation of why my test using dd isn’t a useful test of dm-cache.

What I’m going to do next is to start setting up guests, and check the performance inside each guest (which is what in the end I care about).

Mini-cluster (mclu) rewritten to use ansible

As (kind of) requested in several comments on the previous post I’ve rewritten mclu so it uses Ansible for most communications. (diff)

Since Ansible handles communicating with ssh servers in parallel, it’s a bit faster and in theory a bit more scalable.

Update

Another advantage is that you can run regular ansible commands, for example:

ansible cluster -u root -a "yum -y update"

Mini-cluster (mclu) command line tool

After setting up my virtualization cluster I was disappointed by the available cluster management tools. You can either “go large” (OpenStack) with all the associated trauma of setting that up, or go with various GUIs. Or you can run ssh & virsh commands, which gets tedious quickly.

Therefore I wrote some simple scripts to help perform common cluster operations on small clusters (up to about 10 nodes). It’s only 1000 lines of code, and the advantage is you only need sshd and libvirtd on each node, which you almost certainly have already.

You can download them from this git repository.

Get cluster status:

$ mclu status
ham0 (ham0.home.annexia.org) down
ham1 (ham1.home.annexia.org) down
ham2 (ham2.home.annexia.org) down
ham3 (ham3.home.annexia.org) down
$ mclu wake --all
$ mclu status
ham0 (ham0.home.annexia.org) up ssh: OK libvirt: OK
ham1 (ham1.home.annexia.org) up ssh: OK libvirt: OK
ham2 (ham2.home.annexia.org) up ssh: OK libvirt: OK
ham3 (ham3.home.annexia.org) up ssh: OK libvirt: OK

Build a new VM (using virt-builder):

$ mclu build --size=20G -- \
   ham0:tmp-f20-1 fedora-20 \
   --root-password password:123456

List running VMs and live-migrate the new one around. Note that wildcards can be used when starting, stopping and migrating VMs:

$ mclu list
ham0:tmp-f20-1	running
$ mclu migrate \* ham2:
$ mclu list
ham2:tmp-f20-1	running
$ mclu stop ham2:*
$ mclu list
tmp-f20-1	inactive

Open a console:

$ mclu start ham3:tmp-f20-1
$ mclu console tmp-f20-1
Connected to domain tmp-f20-1
Escape character is ^]
^]
$ mclu viewer tmp-f20-1
(graphical window opens)

Setting up virtlockd on NFS

virtlockd is a lock manager implementation for libvirt. It’s designed to prevent you from starting two virtual machines (eg. on different nodes in your cluster) which are backed by the same writable disk image, something which can cause disk corruption. It uses plain fcntl-based file locking, so it is ideal for use when you are using NFS to share your disk images.

Since documentation is rather lacking, this post summarises how to set up virtlockd. I am using NFS to share /var/lib/libvirt/images across all the nodes in my virtualization cluster.

Firstly it is not clear from the documentation, but virtlockd runs alongside libvirtd on every node. The reason for this is so that libvirtd can be killed without dropping all the locks, which would leave all your VMs unprotected. (You can restart virtlockd independently when it is safe to do so.) I guess the other reason is that POSIX file locking is so fscking crazy unless you use it from an independent process.

Another thing which is not clear from the documentation: virtlockd doesn’t listen on any TCP ports, so you don’t need to open up the firewall. The local libvirtd and virtlockd processes communicate over a private Unix domain socket and virtlockd doesn’t need to communicate with anything else.

There are two ways that virtlockd can work. It can either lock the images directly (this is contrary to what the current documentation says, but Dan told me this so it must be true).

Or you can set up a separate lock file directory, where virtlockd will create zero-sized lock files. This lock file directory must be shared with all nodes over NFS. The lock directory is only needed if you’re not using disk image files (eg. you’re using iSCSI LUNs or something). The reason is that you can’t lock things like devices using fcntl. If you want to go down this route, apart from setting up the shared lock directory somewhere, exporting it from your NFS server, and mounting it on all nodes, you will also have to edit /etc/libvirt/qemu-lockd.conf. The comments are fairly self-explanatory.
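
For example, the key setting is the lockspace directory (this path is just an example; it must be the NFS-shared directory):

--- /etc/libvirt/qemu-lockd.conf ---
file_lockspace_dir = "/var/lib/libvirt/lockd/files"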

However I’m using image files, so I’m going to opt for locking the files directly. This is easy to set up because there is hardly any configuration at all: as long as virtlockd is running, it will just lock the image files. All you have to do is make sure the virtlockd service is installed on every node (it is socket-activated, so you don’t need to enable it) and tell libvirt’s qemu driver to use it:

--- /etc/libvirt/qemu.conf ---
lock_manager = "lockd"
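
Then restart libvirtd on each node to pick up the change:

# systemctl restart libvirtd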
