This is a disclaimer:
Using the notes below is dangerous for both your sanity and peace of mind.
If you still want to read them, beware: they may well be "not even wrong".

Everything I write in here is just a mnemonic device to give me a chance to
fix things I badly broke because I'm bloody stupid and think I can tinker with stuff
that is way above my head and get away with it. It reminds me of Gandalf's warning:
"Perilous to all of us are the devices of an art deeper than we ourselves possess."

Moreover, a lot of it I blatantly stole off the net from people obviously cleverer
than me -- not very hard. Forgive me. My bad.

Please consider it and go away. You have been warned!

Virtualized Pacemaker Cluster using Xen and GlusterFS.

These notes are just a mnemonic device for my frequent brain-farts. You are on your own if you’re crazy enough to use them. You have been warned.


Dom0

I have slapped the OS (Debian/squeeze) on a mirrored system disk using Linux software raid (md).


Xen:

Install the Xen stuff: kernel, hypervisor and some tools:

~# apt-get install linux-image-2.6-xen-amd64
~# apt-get install xen-hypervisor-4.0-amd64
~# apt-get install xen-tools
~# apt-get -t squeeze-backports install libvirt-bin

Grub2:

Squeeze has a bug in how grub2 passes arguments to the hypervisor and to the kernel image (the equivalent of xenhopt= and xenkopt= in grub legacy).

See http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=639931

It is fixed upstream in grub-1.99 but has not yet been backported to the 1.98 shipped with squeeze. In the meantime one has to carefully update /boot/grub/grub.cfg by hand (see the sketch below).

See: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=617538
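
For reference, a Xen menuentry in /boot/grub/grub.cfg looks roughly like the sketch below (the UUID and root device are placeholders, not values from this box); the hypervisor arguments live on the multiboot line, the dom0 kernel arguments on the first module line:

menuentry 'Debian GNU/Linux, with Linux 2.6.32-5-xen-amd64 and XEN 4.0-amd64' {
        insmod part_msdos
        insmod ext2
        search --no-floppy --fs-uuid --set <uuid-of-the-boot-filesystem>
        # hypervisor arguments go here, after the hypervisor image
        multiboot /boot/xen-4.0-amd64.gz dom0_mem=2048M dom0_max_vcpus=2
        # dom0 kernel arguments go here, after the kernel image
        module /boot/vmlinuz-2.6.32-5-xen-amd64 root=<root-device> ro console=tty0
        module /boot/initrd.img-2.6.32-5-xen-amd64
}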

Add the following to /etc/default/grub to boot from the quoted menuentry:

GRUB_DEFAULT='Debian GNU/Linux, with Linux 2.6.32-5-xen-amd64 and XEN 4.0-amd64'
GRUB_SAVEDEFAULT=true

See http://www.gnu.org/software/grub/manual/html_node/Simple-configuration.html

and look for the variables GRUB_CMDLINE_XEN and GRUB_CMDLINE_XEN_DEFAULT.

Note[20120808]

The fix appears to have been backported, as putting the following in /etc/default/grub:

GRUB_CMDLINE_XEN="dom0_mem=2048M dom0_max_vcpus=2 loglvl=all guest_loglvl=all console=tty0"

seems to do the job.


Xen Config.

  • disable the network-manager

On the Dom0, disable network-manager if it is enabled, as it will mess with the Xen network scripts.

~# /etc/init.d/network-manager stop
~# update-rc.d network-manager remove
  • dummy network interface kernel module

Add ‘dummy’ to /etc/modules, and update the initramfs:

~# modprobe dummy
~# depmod -a
~# update-initramfs -u -k `uname -r`

If you want more than one dummy network interface you will have to pass an option to the dummy.ko kernel module:

~# modinfo dummy
filename:       /lib/modules/2.6.32-5-xen-amd64/kernel/drivers/net/dummy.ko
alias:          rtnl-link-dummy
license:        GPL
depends:        
vermagic:       2.6.32-5-xen-amd64 SMP mod_unload modversions 
parm:           numdummies:Number of dummy pseudo devices (int)
~# modprobe dummy numdummies=2

To have this option applied automatically when the module is loaded at boot time:

~# echo "options dummy numdummies=2" > /etc/modprobe.d/dummy.conf
  • Xen bridging setup

Create a network script in /etc/xen/scripts/my_network_script:

~# cat <<'EOF' >/etc/xen/scripts/my_network_script
#!/bin/sh
dir=$(dirname "$0")
"$dir/network-bridge" "$@" vifnum=0 netdev=eth0 bridge=xenbr0
"$dir/network-bridge" "$@" vifnum=1 netdev=dummy0 bridge=xenbr1
"$dir/network-bridge" "$@" vifnum=2 netdev=dummy1 bridge=xenbr2
EOF

Make sure it is executable by all (see below). As said above, you have to modify the way the dummy module is loaded (in /etc/modprobe.d/dummy.conf) if you want to go beyond one dummy interface (the default).
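
Something like:

~# chmod 755 /etc/xen/scripts/my_network_script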

Modify the network interfaces file /etc/network/interfaces and append the dummy interfaces' configuration:

~# cat <<EOF >>/etc/network/interfaces
# Xen Backend
auto dummy0
iface dummy0 inet static
  address 172.16.16.56
  gateway 172.16.16.1
  broadcast 172.16.16.255
  netmask 255.255.255.0

auto dummy1
iface dummy1 inet static
  address 10.0.0.56
  gateway 10.0.0.1
  broadcast 10.0.0.255
  netmask 255.255.255.0
EOF

This will create three bridges: xenbr0, xenbr1 and xenbr2.

The ‘network-bridge’ script works as follows:

* a new bridge named xenbr0 is created
* the "real" ethernet interface eth0 is brought down
* the IP and MAC addresses of eth0 are copied to the virtual network interface veth0
* the real interface eth0 is renamed peth0
* the virtual interface veth0 is renamed eth0
* peth0 and vif0.0 are attached to the bridge xenbr0
* the bridge, peth0, eth0 and vif0.0 are brought up

Note that this is no longer the ‘canonical’ way of bringing up networking in Xen; the method is deprecated but still supported. It’s better to use the native OS tools to create and configure the Xen network stack (see the sketch after the next bullet).

  • I’m lazy.
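
For the record, the ‘native’ way would be to forget about the network-bridge script altogether and declare the bridge in /etc/network/interfaces with the bridge-utils hooks, roughly as sketched below (not what this setup uses; the address is this Dom0's public one, the bridge_* options are the usual bridge-utils ones):

auto xenbr0
iface xenbr0 inet static
  bridge_ports eth0
  bridge_stp off
  bridge_fd 0
  address 132.206.178.56
  netmask 255.255.255.0
  gateway 132.206.178.1

xend would then be told to keep its hands off the network, e.g. by pointing network-script at /bin/true in /etc/xen/xend-config.sxp.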

You can see the bridges' configuration with:

~# brctl show
bridge name     bridge id               STP enabled     interfaces
pan0            8000.000000000000       no
xenbr0          8000.00e0815ffc95       no              peth0
                                                        vif3.0
                                                        vif4.0
xenbr1          8000.764cb6b0d299       no              pdummy0
                                                        vif3.1
                                                        vif4.1
xenbr2          8000.7670adb674da       no              pdummy1
                                                        vif3.2
                                                        vif4.2

In this case we have 3 bridges. xenbr0 is attached to the physical network interface of the Dom0, along with 2 virtual interfaces, one per DomU (vif3.0 is eth0 of the DomU with id=3, vif4.0 is eth0 of the DomU with id=4).

The same goes for the other bridges: vif3.1 is the virtual interface for eth1 of DomU id=3, attached to xenbr1. pdummy0 is just a dummy interface used as a bookkeeping device.


Xen hypervisor

Modify the xend (Xen daemon) config file:

/etc/xen/xend-config.sxp:

(xend-http-server yes)
(xend-unix-server yes)
(xend-relocation-server yes)
(xend-address '')
(xend-relocation-address '')
(xend-relocation-hosts-allow '')
(network-script 'my_network_script')
(vif-script vif-bridge)
(dom0-min-mem 196)
(enable-dom0-ballooning no)
(total_available_memory 0) 
(dom0-cpus 0)
(vncpasswd '')

Notice the ‘network-script’ line, which refers to the network script created above.
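
Don't forget to restart xend (or just reboot the Dom0) so the new network script is picked up; on squeeze that should be:

~# /etc/init.d/xend restart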


Xen Tools:

/etc/xen-tools/xen-tools.conf:

dir = /opt/xen
install-method = debootstrap
size   = 8Gb      # Disk image size.
memory = 1024Mb    # Memory size
swap   = 1024Mb    # Swap size
fs     = ext3     # use the EXT3 filesystem for the disk image.
dist   = `xt-guess-suite-and-mirror --suite` # Default distribution to install.
image  = sparse   # Specify sparse vs. full disk images.
gateway    = 132.206.178.1
netmask    = 255.255.255.0
broadcast  = 132.206.178.255
nameserver = 132.206.178.7
bridge = eth0
kernel = /boot/vmlinuz-`uname -r`
initrd = /boot/initrd.img-`uname -r`
mirror = `xt-guess-suite-and-mirror --mirror`
ext3_options     = noatime,nodiratime,errors=remount-ro
ext2_options     = noatime,nodiratime,errors=remount-ro
xfs_options      = defaults
reiserfs_options = defaults
btrfs_options    = defaults
  • Create the guests:
~# xen-create-image --hostname node0 --ip=132.206.178.230 --arch=amd64

...

DomU node0:

Installation Summary
---------------------
Hostname        :  node0
Distribution    :  squeeze
IP-Address(es)  :  132.206.178.230 
RSA Fingerprint :  21:79:67:ad:c1:c9:c9:7e:ff:3b:ce:3e:bf:64:82:b1
Root Password   :  ********

~# xen-create-image --hostname node1 --ip=132.206.178.231 --arch=amd64

...

DomU node1:

Installation Summary
---------------------
Hostname        :  node1
Distribution    :  squeeze
IP-Address(es)  :  132.206.178.231 
RSA Fingerprint :  00:a2:4c:eb:04:2b:ec:36:c8:77:f9:0e:76:2f:c9:37
Root Password   :  ********
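
  • Boot the guests:

xen-tools drops one config file per guest under /etc/xen/ (node0.cfg and node1.cfg here, assuming the default naming), so booting them should just be:

~# xm create /etc/xen/node0.cfg
~# xm create /etc/xen/node1.cfg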

Libvirt:

A tip Florian Haas gave me last winter (2012):

http://www.hastexo.com/resources/hints-and-kinks/fencing-virtual-cluster-nodes

In order for your hypervisor to listen on an unauthenticated, insecure, unencrypted network socket (did we mention that’s unsuitable for production?), add the following lines to your libvirtd configuration file:

  • Edit /etc/default/libvirt-bin and make sure to have:
# options passed to libvirtd, add "-l" to listen on tcp
libvirtd_opts="-d -l -v -t 30"
  • Edit the libvirtd config file /etc/libvirt/libvirtd.conf and add:
listen_tls = 0
listen_tcp = 1
tcp_port = "16509"
auth_tcp = "none"
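  • Restart libvirtd so the changes take effect (on squeeze the init script carries the libvirt-bin package name):
~# /etc/init.d/libvirt-bin restart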

You can then query the Xen guests' status remotely. For instance, after having booted the guests, one can connect (insecurely!) to the Dom0 with:

malin@cicero:~$ virsh --connect xen+tcp://132.206.178.56 list
 Id Name                 State
----------------------------------
  0 Domain-0             running
  3 node0                idle
  4 node1                idle

Corosync:

  • install corosync, pacemaker and openais.
~# apt-get install corosync pacemaker openais
  • generate the key for corosync's secure communication:
~# corosync-keygen

This will create a key in /etc/corosync/authkey. Copy the key to the other node(s) and make sure that only root has access.
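
Something along these lines (prompts and paths assumed, adjust to taste):

node0:~# scp -p /etc/corosync/authkey root@node1:/etc/corosync/authkey
node1:~# chmod 400 /etc/corosync/authkey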

  • enable core files with the PID appended to their name, to ease debugging if need be:
~# echo 1 > /proc/sys/kernel/core_uses_pid
~# echo core.%e.%p > /proc/sys/kernel/core_pattern 

~# cat <<EOF > /etc/sysctl.d/core_uses_pid.conf
kernel.core_uses_pid = 1
kernel.core_pattern = core.%e.%p
EOF
  • edit /etc/corosync/corosync.conf and stuff the following in the totem{} section:
        rrp_mode: active

        interface {
                # The following values need to be set based on your environment 
                ringnumber: 0
                bindnetaddr: 172.16.16.0
                mcastaddr: 226.94.1.1
                mcastport: 5405
        }
        interface {
                # The following values need to be set based on your environment 
                ringnumber: 1
                bindnetaddr: 10.0.0.0
                mcastaddr: 226.94.1.1
                mcastport: 5405
        }
  • enable the start of corosync at boot in /etc/default/corosync (see below) and start it up:
~# /etc/init.d/corosync start
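
On squeeze the init script will refuse to start corosync until /etc/default/corosync says so; make sure it contains:

START=yes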
  • one should then see both rings of the corosync messaging layer:
~# corosync-cfgtool -s
Printing ring status.
Local node ID 34607276
RING ID 0
        id      = 172.16.16.2
        status  = ring 0 active with no faults
RING ID 1
        id      = 10.0.0.2
        status  = ring 1 active with no faults

Pacemaker:

With no cluster config, the cluster monitor reports:

~# crm_mon -1
============
Last updated: Wed Aug  8 11:45:59 2012
Stack: openais
Current DC: node0 - partition with quorum
Version: 1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b
2 Nodes configured, 2 expected votes
0 Resources configured.
============

Online: [ node0 node1 ]

Note: I have upgraded all pacemaker components (cluster-glue, openais, etc.) to squeeze-backports, so the version shown above is not what is currently displayed on the live cluster.


Original pacemaker testbed on testgod:

node node1 \
        attributes standby="off"
node node2 \
        attributes standby="off"
primitive A ocf:heartbeat:Dummy \
        meta target-role="Started" \
        op start interval="0" timeout="20s" \
        op monitor interval="20s" timeout="60s"
primitive B ocf:heartbeat:Dummy \
        meta target-role="Started" \
        op monitor interval="20s" timeout="60s"
primitive C ocf:heartbeat:Dummy \
        op monitor interval="20s" timeout="60s"
primitive clusterIP ocf:heartbeat:IPaddr2 \
        params ip="132.206.178.246" cidr_netmask="24" \
        op monitor interval="20s" timeout="60s" \
        meta target-role="Started"
primitive fence_node1 stonith:external/libvirt \
        params hostlist="node1" hypervisor_uri="xen+tcp://172.16.16.1"
pcmk_host_check="dynamic-list" \
        op monitor interval="60s" \
        meta target-role="Started"
primitive fence_node2 stonith:external/libvirt \
        params hostlist="node2" hypervisor_uri="xen+tcp://172.16.16.1"
pcmk_host_check="dynamic-list" \
        op monitor interval="60s" \
        meta target-role="Started"
primitive resPing ocf:pacemaker:ping \
        params dampen="5s" multiplier="100" host_list="132.206.178.82"
attempts="3" \
        op start interval="0" timeout="60s" \
        op monitor interval="20s" timeout="60s"
clone cloPing resPing \
        meta globally-unique="false"
location lo_fence_node1 fence_node1 -inf: node1
location lo_fence_node2 fence_node2 -inf: node2
location locIP clusterIP \
        rule $id="locIP-rule" -inf: not_defined pingd or pingd lte 0
colocation coloc-1 inf: A ( B C )
property $id="cib-bootstrap-options" \
        dc-version="1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c" \
        cluster-infrastructure="openais" \
        expected-quorum-votes="2" \
        no-quorum-policy="ignore" \
        default-resource-stickiness="100" \
        pe-input-series-max="672" \
        pe-error-series-max="672" \
        pe-warn-series-max="672" \
        stonith-enabled="true" \
        last-lrm-refresh="1339688713" \
        start-failure-is-fatal="true"

New, improved(TM) cluster.

Three dummy resources, A, B and C. One virtual IP, along with a cloned ping resource to detect network failure, and one dummy stateful (master-slave) resource. Finally, a stonith resource using the libvirt agent is deployed over an insecure xen+tcp connection. Definitely not for production. Apart from the resource definitions, the whole cluster logic reduces to:

clone clo-ping res-ping \
        meta target-role="Started"
location lo-fence-node0 fence-node0 -inf: node0
location lo-fence-node1 fence-node1 -inf: node1
location lo-ip cluster-ip \
        rule $id="lo-ip-rule" -inf: not_defined pingd or pingd lte 0
colocation co-ip-on-master inf: cluster-ip ms-stateful:Master
order or-master-before-ip inf: ms-stateful:promote cluster-ip:start
order order1 inf: A B C

The CIB in its full glory:

node0:~# crm configure show
node node0
node node1 \
        attributes standby="off"
primitive A ocf:heartbeat:Dummy \
        meta target-role="Started" \
        op start interval="0" timeout="20s" \
        op monitor interval="20s" timeout="60s"
primitive B ocf:heartbeat:Dummy \
        meta target-role="Started" \
        op monitor interval="20s" timeout="60s"
primitive C ocf:heartbeat:Dummy \
        op monitor interval="20s" timeout="60s"
primitive S ocf:heartbeat:Stateful \
        op start interval="0" timeout="30s" \
        op stop interval="0" timeout="30s" \
        op monitor interval="15s" timeout="30s"
primitive cluster-ip ocf:heartbeat:IPaddr2 \
        params ip="132.206.178.245" cidr_netmask="24" iflabel="eth0" \
        op monitor interval="20s" timeout="60s" \
        meta target-role="Started"
primitive fence-node0 stonith:external/libvirt \
        params hostlist="node0" hypervisor_uri="xen+tcp://132.206.178.56" \
        op start interval="0" timeout="60s" \
        op monitor interval="3600s" timeout="60"
primitive fence-node1 stonith:external/libvirt \
        params hostlist="node1" hypervisor_uri="xen+tcp://132.206.178.56" \
        op start interval="0" timeout="60s" \
        op monitor interval="3600s" timeout="60s"
primitive res-ping ocf:pacemaker:ping \
        params dampen="5s" multiplier="100" host_list="132.206.178.56" attempts="3" \
        op start interval="0" timeout="60s" \
        op monitor interval="20s" timeout="60s"
ms ms-S S \
        meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" ordered="false" globally-unique="false" is-managed="true"
clone clo-ping res-ping \
        meta target-role="Started"
location lo-fence-node0 fence-node0 -inf: node0
location lo-fence-node1 fence-node1 -inf: node1
location lo-ip cluster-ip \
        rule $id="lo-ip-rule" -inf: not_defined pingd or pingd lte 0
colocation co-ip-on-master inf: cluster-ip ms-S:Master
order or-master-before-ip inf: ms-S:promote cluster-ip:start
order order1 inf: A ( B C )
property $id="cib-bootstrap-options" \
        dc-version="1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff" \
        cluster-infrastructure="openais" \
        expected-quorum-votes="2" \
        no-quorum-policy="ignore" \
        default-resource-stickiness="100" \
        pe-input-series-max="672" \
        pe-error-series-max="672" \
        pe-warn-series-max="672" \
        stonith-enabled="true" \
        last-lrm-refresh="1344628645"

GlusterFS shit.

http://www.howtoforge.com/high-availability-storage-with-glusterfs-3.0.x-on-debian-squeeze-automatic-file-replication-across-two-storage-servers

http://www.gluster.org/2012/08/gluster-new-user-guide/

  • Install GlusterFS on each node.

First, a few necessary packages:

~# apt-get install bison automake autoconf flex make libtool portmap fuse-utils nfs-common

On the gluster servers:

~# apt-get -t squeeze-backports install glusterfs-server glusterfs-examples

On the client:

~# apt-get -t squeeze-backports install glusterfs-client glusterfs-examples
  • Attach the block devices used by Gluster to the DomU’s (gluster servers):
~# virsh list
 Id    Name                           State
----------------------------------------------------
 0     Domain-0                       running
 3     node1                          idle
 4     node0                          idle
 6     node7                          idle
~# virsh attach-disk 4 --source /dev/mapper/raid--vg-scratch1 --target xvdb1
~# virsh attach-disk 3 --source /dev/mapper/raid--vg-scratch2 --target xvdb1

Save the Xen XML config files of the DomUs for later:

~# virsh dumpxml 4 > node0.xml
~# virsh dumpxml 3 > node1.xml
~# virsh dumpxml 6 > node7.xml
  • Gluster Server Config:

First, we will use the following private network addressing between the Xen guests:

servers node0 -> 10.0.0.2
        node1 -> 10.0.0.3
client  node7 -> 10.0.0.7
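
It is worth checking that the guests actually see each other over the private (xenbr2) network before going further, e.g.:

node0:~# ping -c 1 10.0.0.3
node0:~# ping -c 1 10.0.0.7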
  • Save the original config file:
node0:~# cp /etc/glusterfs/glusterd.vol{,_orig}
  • Create some storage space on each guest.

The block devices passed through to the guests (xvdb1) have already been formatted as XFS filesystems on the Dom0, so one just needs to create a mount point and mount the filesystem (/data/export):

node0:~# mkdir -p /data/export
node0:~# mount -t auto /dev/xvdb1 /data/export 
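
To have the mount survive a reboot, a line along these lines in /etc/fstab of each guest would do (a sketch; mounting by UUID works too):

/dev/xvdb1   /data/export   xfs   defaults,noatime   0   0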
  • Edit the glusterd config file:
node0:~# cat /etc/glusterfs/glusterd.vol
volume posix
  type storage/posix
  option directory /data/export
end-volume

volume locks
  type features/locks
  subvolumes posix
end-volume

volume brick
  type performance/io-threads
  option thread-count 8
  subvolumes locks
end-volume

volume server
  type protocol/server
  option transport-type tcp
  option auth.addr.brick.allow 10.0.0.7
  subvolumes brick
end-volume
  • Do the same for the other guest.
  • On the gluster client:
node7:~# cat /etc/glusterfs/glusterfs.vol 
volume node0
  type protocol/client
  option transport-type tcp
  option remote-host 10.0.0.2
  option remote-subvolume brick
end-volume

volume node1
  type protocol/client
  option transport-type tcp
  option remote-host 10.0.0.3
  option remote-subvolume brick
end-volume

volume replicate
  type cluster/replicate
  subvolumes node0 node1
end-volume

volume writebehind
  type performance/write-behind
  option window-size 1MB
  subvolumes replicate
end-volume

volume cache
  type performance/io-cache
  option cache-size 512MB
  subvolumes writebehind
end-volume
  • Client Mount
node7:~# glusterfs -f /etc/glusterfs/glusterfs.vol /mnt/glusterfs
  • stuff it in /etc/rc.local.d/glusterfs.sh
node7:~# cat /etc/rc.local.d/glusterfs.sh 
#!/bin/sh

/bin/mount -t glusterfs /etc/glusterfs/glusterfs.vol /mnt/glusterfs
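
Assuming /etc/rc.local is wired to run whatever sits in /etc/rc.local.d (that is this box's own convention, not stock Debian), don't forget to make the script executable:

node7:~# chmod +x /etc/rc.local.d/glusterfs.sh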