You are crazy if you use this. These are just my personal, scattered notes on setting up a cluster. DO NOT USE! You have been warned!
Pacemaker/Corosync Cluster + DRBD + STONITH + Xen
(:#toc:)
Base Packages Installation
ii  drbd8-module-2.6.32-5-xen-amd64    2:8.3.8-0+2.6.32-30  RAID 1 over tcp/ip for Linux kernel module
ii  drbd8-module-source                2:8.3.8-0            RAID 1 over tcp/ip for Linux module source
ii  drbd8-utils                        2:8.3.8-0            RAID 1 over tcp/ip for Linux utilities
ii  libxenstore3.0                     4.0.1-2              Xenstore communications library for Xen
ii  linux-headers-2.6.32-5-common-xen  2.6.32-34squeeze1    Common header files for Linux 2.6.32-5-xen
ii  linux-headers-2.6.32-5-xen-amd64   2.6.32-34squeeze1    Header files for Linux 2.6.32-5-xen-amd64
ii  linux-image-2.6.32-5-xen-amd64     2.6.32-34squeeze1    Linux 2.6.32 for 64-bit PCs, Xen dom0 support
ii  linux-headers-2.6.32-5-common-xen  2.6.32-35            Common header files for Linux 2.6.32-5-xen
ii  linux-headers-2.6.32-5-xen-amd64   2.6.32-35            Header files for Linux 2.6.32-5-xen-amd64
ii  linux-image-2.6.32-5-xen-amd64     2.6.32-35            Linux 2.6.32 for 64-bit PCs, Xen dom0 support
ii  xen-hypervisor-4.0-amd64           4.0.1-2              The Xen Hypervisor on AMD64
ii  xen-tools                          4.2-1                Tools to manage Xen virtual servers
ii  xen-utils-4.0                      4.0.1-2              XEN administrative tools
ii  xen-utils-common                   4.0.0-1              XEN administrative tools - common files
ii  xenstore-utils                     4.0.1-2              Xenstore utilities for Xen
ii  xenwatch                           0.5.4-2              Virtualization utilities, mostly for Xen
ii  libvirt-bin                        0.8.3-5              the programs for the libvirt library
ii  libvirt-doc                        0.8.3-5              documentation for the libvirt library
ii  libvirt0                           0.8.3-5              library for interfacing with different virtualization systems
ii  python-libvirt                     0.8.3-5              libvirt Python bindings
ii  virt-manager                       0.8.4-8              desktop application for managing virtual machines
ii  virt-viewer                        0.2.1-1              Displaying the graphical console of a virtual machine
ii  virtinst                           0.500.3-2            Programs to create and clone virtual machines
ii  ocfs2-tools                        1.4.4-3              tools for managing OCFS2 cluster filesystems
ii  ocfs2-tools-pacemaker              1.4.4-3              tools for managing OCFS2 cluster filesystems - pacemaker support
ii  ocfs2console                       1.4.4-3              tools for managing OCFS2 cluster filesystems - graphical interface
ii  corosync                           1.2.1-4              Standards-based cluster framework (daemon and modules)
ii  pacemaker                          1.0.9.1+hg15626-1    HA cluster resource manager
ii  libopenais3                        1.1.2-2              Standards-based cluster framework (libraries)
ii  openais                            1.1.2-2              Standards-based cluster framework (daemon and modules)
ii  dlm-pcmk                           3.0.12-2             Red Hat cluster suite - DLM pacemaker module
ii  libdlm3                            3.0.12-2             Red Hat cluster suite - distributed lock manager library
ii  libdlmcontrol3                     3.0.12-2             Red Hat cluster suite - distributed lock manager library
ii  ipmitool                           1.8.11-2             utility for IPMI control with kernel driver or LAN interface
ii  libopenipmi0                       2.0.16-1.2           Intelligent Platform Management Interface - runtime
ii  openhpi-plugin-ipmidirect          2.14.1-1             OpenHPI plugin module for direct IPMI over LAN (RMCP) or SMI
ii  openipmi                           2.0.16-1.2           Intelligent Platform Management Interface (for servers)
ii  libopenhpi2                        2.14.1-1             OpenHPI libraries (runtime and support files)
ii  openhpi-plugin-ipmidirect          2.14.1-1             OpenHPI plugin module for direct IPMI over LAN (RMCP) or SMI
ii  openhpid                           2.14.1-1             OpenHPI daemon, supports gathering of manageability information
Power and Physical network connections
The goal is to have no single point of failure (SPOF). Right off the bat there is one: both nodes have a single PSU (cash constraint!) so it is important to have all the power feeds as distributed as possible. The nodes are connected to an Automatic Transfer Switch (R3-ATS-SW-1 in rack-3) as are the network switches. There is still the possibility that one ATS fails…
helena: power feed from R9-ATS-SW-2 (APS-C and UPS-1500-1)
    eth0            eth1      eth2          IPMI
    132.206.178.60  10.0.0.2  192.168.1.19  192.168.1.17
    (sw3-p17)       p-to-p    (sw4-p19)     (sw7-p21)

puck: power feed from R3-ATS-SW-1 (APS-A and UPS-B)
    eth0            eth1      eth2          IPMI
    132.206.178.61  10.0.0.1  192.168.1.18  192.168.1.16
    (sw6-p18)       p-to-p    (sw7-p23)     (sw7-p22)

sw3: power feed from R9-ATS-SW-2 (APS-C and UPS-1500-1)
sw4: power feed from R3-ATS-SW-1 (APS-A and UPS-B)
sw7: power feed from R3-ATS-SW-1 (APS-A and UPS-B)
sw6: power feed from R3-ATS-SW-1 (APS-A and UPS-B)
Note the weakness:
- IPMI ports and one eth2 (one of puck’s corosync dual link) are on the same switch.
20150112. The above is obsolete and kept only for historical reasons and tracking. Both cluster nodes have been moved to their new location in Rack-A4-U33 and Rack-A4-U34. This entails a few changes to the power and network connections. This is in flux and not the final layout. Stay tuned.
helena: power feed from ATS-A3-U26
    eth0            eth1      eth2          IPMI
    132.206.178.60  10.0.0.2  192.168.1.19  192.168.1.17
    (SW-A1-p25)     p-to-p    (SW-B4-p50)   (SW-A4-p17)
    yellow          yellow    blue          white

puck: power feed from ATS-A3-U26 ***
    eth0            eth1      eth2          IPMI
    132.206.178.61  10.0.0.1  192.168.1.18  192.168.1.16
    (SW-A4-p08)     p-to-p    (SW-B2-p49)   (SW-A1-p29)
    yellow          yellow    blue          white

SW-A1-U35 and SW-A4-U35: power feed from ATS-A3-U26 (UPS-A1-U1 + UPS-A1-U3)
SW-B2-U35 and SW-B4-U35: power feed from ATS-B2-U13 (UPS-B1-U1 + UPS-B1-U3)
A graphical view of the physical network connections between the nodes and network switches.

Network Configuration
Stuff the following in the file /etc/network/interfaces for helena:
(:source:)
# local loopback
auto lo
iface lo inet loopback

# eth0 - wan
auto eth0
iface eth0 inet static
    address 132.206.178.60
    netmask 255.255.255.0
    broadcast 132.206.178.255
    gateway 132.206.178.1

# eth1 - drbd pt2pt replication link
auto eth1
iface eth1 inet static
    address 10.0.0.2
    netmask 255.0.0.0
    broadcast 10.0.0.255
    pointopoint 10.0.0.1

# eth2 - corosync ring
auto eth2
iface eth2 inet static
    address 192.168.1.19
    netmask 255.255.255.0
    broadcast 192.168.1.255
    gateway 192.168.1.1
(:sourceend:)
Make sure to change the relevant IPs for puck.
Logical Volume Manager (LVM) Configuration
As explained in the DRBD user guide at http://www.drbd.org/users-guide-legacy/s-nested-lvm.html it is possible, if slightly advanced, to both use Logical Volumes as backing devices for DRBD and at the same time use a DRBD device itself as a Physical Volume (PV). In order to enable this configuration, follow these steps:
- Set an appropriate filter option in /etc/lvm/lvm.conf:
(:source:) filter = [ "a|/dev/drbd.*|", "a|/dev/md0|", "a|/dev/md/0|", "r|.*|" ] (:sourceend:)
This filter expression accepts PV signatures found on any DRBD devices and on the underlying md0 mirror, while rejecting (ignoring) all others.
After modifying the lvm.conf file, you must run the vgscan command so LVM discards its configuration cache and re-scans devices for PV signatures.
- Disable the LVM cache by setting:
(:source:) write_cache_state = 0 (:sourceend:)
After disabling the LVM cache, make sure you remove any stale cache entries by deleting /etc/lvm/cache/.cache.
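Putting the two steps together, a quick sketch of applying the change (on each node):
(:source:)
# after editing /etc/lvm/lvm.conf (filter + write_cache_state = 0)
rm -f /etc/lvm/cache/.cache    # remove the stale cache entries
vgscan                         # re-scan devices for PV signatures
(:sourceend:)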
DRBD Installation and Configuration
Important note: the module compilation and installation depend on the installed kernel! In this case I have:
ii  libxenstore3.0                     4.0.1-2     Xenstore communications library for Xen
ii  linux-headers-2.6.32-5-common-xen  2.6.32-35   Common header files for Linux 2.6.32-5-xen
ii  linux-headers-2.6.32-5-xen-amd64   2.6.32-35   Header files for Linux 2.6.32-5-xen-amd64
ii  linux-image-2.6.32-5-xen-amd64     2.6.32-35   Linux 2.6.32 for 64-bit PCs, Xen dom0 support
ii  xen-hypervisor-4.0-amd64           4.0.1-2     The Xen Hypervisor on AMD64
ii  xen-tools                          4.2-1       Tools to manage Xen virtual servers
ii  xen-utils-4.0                      4.0.1-2     XEN administrative tools
ii  xen-utils-common                   4.0.0-1     XEN administrative tools - common files
ii  xenstore-utils                     4.0.1-2     Xenstore utilities for Xen
ii  xenwatch                           0.5.4-2     Virtualization utilities, mostly for Xen
Check out sources from the public DRBD source repository
(Do this as a non-privileged user)
malin::~> git clone git://git.drbd.org/drbd-8.3.git
malin::~> cd drbd-8.3
malin::~> git checkout drbd-8.3.8
malin::~> dpkg-buildpackage -rfakeroot -b -uc
This build process will create two Debian packages:
- A package containing the DRBD userspace tools, named drbd8-utils_x.y.z-BUILD_ARCH.deb
- A module source package suitable for module-assistant named drbd8-module-source_x.y.z-BUILD_all.deb.
Package and module installation
Install the packages with dpkg and then run module-assistant:
~# module-assistant --text-mode --force auto-install drbd8
I've had problems loading the new out-of-tree drbd module on helena when doing an upgrade. I don't know why, but it seems that the in-tree drbd module (8.3.7) requires the lru_cache module, and while that one is loaded, modprobe will fail to load drbd-8.3.8. My solution is to unload the drbd module, then unload lru_cache, move the in-tree kernel drbd module away, rebuild the depmod file and modprobe drbd. Weird.
Nov  1 15:58:13 helena kernel: [ 1920.589633] drbd: exports duplicate symbol lc_seq_dump_details (owned by lru_cache)
Nov  1 16:10:36 helena kernel: [ 2663.655588] drbd: exports duplicate symbol lc_seq_dump_details (owned by lru_cache)
~# rmmod drbd
~# rmmod lru_cache
~# mv /lib/modules/2.6.32-5-xen-amd64/kernel/drivers/block/drbd/drbd.ko{,8.3.7}
~# depmod -a
~# modprobe drbd
After that the system file modules.dep should contain the following:
~# grep drbd /lib/modules/`uname -r`/modules.dep
kernel/drivers/block/drbd.ko: kernel/drivers/connector/cn.ko
DRBD Storage configuration
Create a raid1 (mirror) with /dev/sdc1 and /dev/sdd1. Just follow the recipe given above in System disk migration to software raid1 (mirror).
Create one volume group vg_xen spanning the entire raid1 /dev/md0 and create a 24GB logical volume on it for the VM disk image and swap file (create as many logical volumes as you will need Xen guests).
helena:~# pvcreate /dev/md0
  Physical volume "/dev/md0" successfully created
helena:~# vgcreate vg_xen /dev/md0
  Volume group "vg_xen" successfully created
helena:~# lvcreate --size 24G --name xen_lv0 vg_xen
  Logical volume "xen_lv0" created
helena:~# lvcreate --size 24G --name xen_lv1 vg_xen
  Logical volume "xen_lv1" created
helena:~# lvcreate --size 24G --name xen_lv2 vg_xen
  Logical volume "xen_lv2" created
helena:~# pvs
  PV         VG     Fmt  Attr PSize   PFree
  /dev/md0   vg_xen lvm2 a-   465.76g 393.76g
helena:~# vgs
  VG     #PV #LV #SN Attr   VSize   VFree
  vg_xen   1   3   0 wz--n- 465.76g 393.76g
helena:~# lvs
  LV      VG     Attr   LSize  Origin Snap%  Move Log Copy%  Convert
  xen_lv0 vg_xen -wi-ao 24.00g
  xen_lv1 vg_xen -wi-ao 24.00g
  xen_lv2 vg_xen -wi-ao 24.00g
Do exactly the same on the other node!
DRBD resource r1
- The DRBD replication link is the point-to-point eth1-eth1 connection using the 10.0.0.X addresses.
- The become-primary-on both option in the startup section and allow-two-primaries in the net section are mandatory for Xen live migration, but we will not be using it.
- The fencing stuff in the handlers section is required to avoid a split-brain situation when stonith is enabled in the cluster. It is absolutely essential to have fencing enabled in order to avoid potentially disastrous split-brain situations. There might be issues with the fencing resource-and-stonith option whereby one has multiple guests running on different nodes and an event triggers a stonith action. I still haven't fully understood the implications, but suffice to say that a stonith deathmatch is possible. Another possibility is that the cluster could create resource constraints that prevent one or more guests from running anywhere.
- The al-extents value in the syncer section (the Activity Log extents, 4MB each) is the nearest prime number close to E = (Sync_Rate)*(Sync_Time)/4. I used Sync_Rate = 30MB/s and Sync_Time = 120s, yielding 900 (907 is the nearest prime).
- The resource configuration file is located in /etc/drbd.d/r1.res, and the DRBD config file /etc/drbd.conf is set up so that all files matching /etc/drbd.d/*.res are loaded when drbd starts.
(:source:)
# /etc/drbd.d/r1.res
resource r1 {
  device     /dev/drbd1;
  disk       /dev/vg_xen/xen_lv1;
  meta-disk  internal;
  startup {
    degr-wfc-timeout 30;
    wfc-timeout      30;
    # become-primary-on both;
  }
  net {
    # allow-two-primaries;
    cram-hmac-alg sha1;
    shared-secret "lucid";
    after-sb-0pri discard-zero-changes;
    after-sb-1pri discard-secondary;
    after-sb-2pri disconnect;
    rr-conflict   disconnect;
  }
  disk {
    fencing resource-only;
    # fencing resource-and-stonith;
    on-io-error detach;
  }
  handlers {
    fence-peer          "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
    outdate-peer        "/usr/lib/drbd/outdate-peer.sh";
    split-brain         "/usr/lib/drbd/notify-split-brain.sh root";
    pri-on-incon-degr   "/usr/lib/drbd/notify-pri-on-incon-degr.sh root";
    pri-lost-after-sb   "/usr/lib/drbd/notify-pri-lost-after-sb.sh root";
    local-io-error      "/usr/lib/drbd/notify-io-error.sh root";
  }
  syncer {
    rate       30M;
    csums-alg  sha1;
    al-extents 809;
    verify-alg sha1;
  }
  on puck {
    address 10.0.0.1:7789;
  }
  on helena {
    address 10.0.0.2:7789;
  }
}
(:sourceend:)
Make the obvious changes for the other drbd resources: the resource name, the DRBD block device and the backing device name.
Make absolutely sure to use a unique replication port number (7789 above) for each and every resource! Also, in device=/dev/drbdX the value X MUST BE AN INTEGER (it will be the device minor number); anything else will trigger a drbd shutdown with massive disruption/corruption. Yes, I tried it and it's not pretty :)
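For illustration only, a second resource might look like the following sketch; the LV name xen_lv2, the minor number 2 and the port 7790 are assumptions, not values taken from the running setup:
(:source:)
# /etc/drbd.d/r2.res (sketch)
resource r2 {
  device    /dev/drbd2;                 # integer minor number, as stressed above
  disk      /dev/vg_xen/xen_lv2;        # its own backing logical volume
  meta-disk internal;
  # net, disk, handlers and syncer sections as in r1
  on puck   { address 10.0.0.1:7790; }  # unique port per resource!
  on helena { address 10.0.0.2:7790; }
}
(:sourceend:)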
DRBD/Pacemaker fencing ssh keys:
The script /usr/lib/drbd/outdate-peer.sh (referenced above in the handlers section of the resource config file) requires a password-less ssh key between the 2 nodes:
puck:~/root/.ssh/authorized_keys:
  from="helena,helena.bic.mni.mcgill.ca,132.206.178.60,10.0.0.2,192.168.1.19"...
helena:~/root/.ssh/authorized_keys:
  from="puck,puck.bic.mni.mcgill.ca,132.206.178.61,10.0.0.1,192.168.1.18"...
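If the password-less keys do not exist yet, a rough sketch of setting them up (key type and file names are just the defaults, adjust to taste):
(:source:)
# On helena: generate a password-less key for root and copy it to puck.
ssh-keygen -t rsa -N "" -f /root/.ssh/id_rsa
ssh-copy-id root@puck
# Then edit puck:/root/.ssh/authorized_keys to prepend the from="..."
# restriction shown above, and repeat in the other direction (puck -> helena).
(:sourceend:)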
Enabling DRBD resource and initial synchronization
helena:~# dd if=/dev/zero of=/dev/vg_xen/xen_lv1 bs=512 count=1024
1024+0 records in
1024+0 records out
533504 bytes (534 kB) copied, 0.0767713 s, 6.9 MB/s
helena:~# drbdadm create-md r1
Writing meta data...
initializing activity log
NOT initialized bitmap
New drbd meta data block successfully created.
helena:~# modprobe drbd
helena:~# drbdadm attach r1
helena:~# cat /proc/drbd
version: 8.3.8 (api:88/proto:86-94)
GIT-hash: d78846e52224fd00562f7c225bcc25b2d422321d build by root@helena, 2010-09-09 17:41:03
 1: cs:StandAlone ro:Secondary/Unknown ds:Inconsistent/DUnknown   r----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:65009692
helena:~# drbdadm syncer r1
helena:~# cat /proc/drbd
version: 8.3.8 (api:88/proto:86-94)
GIT-hash: d78846e52224fd00562f7c225bcc25b2d422321d build by root@helena, 2010-09-09 17:41:03
 1: cs:StandAlone ro:Secondary/Unknown ds:Inconsistent/DUnknown   r----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:65009692
helena:~# drbdadm connect r1
Do the same steps on the second node. The attach/syncer/connect steps can be done in one go with drbdadm up <resource>. The DRBD state should then look like:
helena:~# cat /proc/drbd
version: 8.3.8 (api:88/proto:86-94)
GIT-hash: d78846e52224fd00562f7c225bcc25b2d422321d build by root@helena, 2010-09-09 17:41:03
 1: cs:Connected ro:Secondary/Secondary ds:Inconsistent/Inconsistent C r----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:67106780
If the devices are empty (no data) there is no preferred node on which to do the following, but it must be done on only one of them! Disable become-primary-on both (if it is enabled) while doing this: edit the resource file and comment out the line, then update the DRBD state with drbdadm -d adjust <res_name>. The -d flag is for a 'dry run'; if there is no error, run the command without it.
helena:~# drbdadm -- --overwrite-data-of-peer primary r1
helena:~# cat /proc/drbd
version: 8.3.8 (api:88/proto:86-94)
GIT-hash: d78846e52224fd00562f7c225bcc25b2d422321d build by root@helena, 2010-09-09 17:41:03
 1: cs:SyncSource ro:Primary/Secondary ds:UpToDate/Inconsistent C r----
    ns:32 nr:0 dw:0 dr:81520 al:0 bm:4 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:67025468
        [>....................] sync'ed:  0.2% (65452/65532)M delay_probe: 0
        finish: 0:41:04 speed: 27,104 (27,104) K/sec
Do the same for all the other drbd resources. Note that a drbd device has to be in primary mode in order to access any Volume Group (VG) living on top of it.
Manually Resolving a Split-Brain Situation
See http://www.drbd.org/users-guide-legacy/s-resolve-split-brain.html for DRBD 8.3.x. It is still supported, but the newer 8.4.x has a different syntax, as explained in http://www.drbd.org/users-guide/s-resolve-split-brain.html
I will just copy what’s in the drbd user guide:
Manual split brain recovery
DRBD detects split brain at the time connectivity becomes available again and the peer nodes exchange the initial DRBD protocol handshake. If DRBD detects that both nodes are (or were at some point, while disconnected) in the primary role, it immediately tears down the replication connection. The tell-tale sign of this is a message like the following appearing in the system log:
Split-Brain detected, dropping connection!
After split brain has been detected, one node will always have the resource in a StandAlone connection state. The other might either also be in the StandAlone state (if both nodes detected the split brain simultaneously), or in WFConnection (if the peer tore down the connection before the other node had a chance to detect split brain).
At this point, unless you configured DRBD to automatically recover from split brain, you must manually intervene by selecting one node whose modifications will be discarded (this node is referred to as the split brain victim). This intervention is made with the following commands:
~# drbdadm secondary resource
~# drbdadm -- --discard-my-data connect resource
On the other node (the split brain survivor), if its connection state is also StandAlone, you would enter:
~# drbdadm connect resource
You may omit this step if the node is already in the WFConnection state; it will then reconnect automatically. If the resource affected by the split brain is a stacked resource, use drbdadm --stacked instead of just drbdadm. Upon connection, your split brain victim immediately changes its connection state to SyncTarget, and has its modifications overwritten by the remaining primary node.
After re-synchronization has completed, the split brain is considered resolved and the two nodes form a fully consistent, redundant replicated storage system again.
Xen Setup
A good tutorial for Xen on Lenny 64bit can be found at http://www.howtoforge.com/virtualization-with-xen-on-debian-lenny-amd64.
Some notes on Xen-4.0 for Debian/Squeeze http://wiki.xensource.com/xenwiki/Xen4.0
Xen boot Serial Console Redirection
Setting up a serial console on a laptop to possibly capture Xen kernel boot messages can be useful for debugging. Here’s how.
Use the following grub menu.lst stanza (replace the root=UUID= with your boot disk):
(:source:)
title Raid1 (hd1) Xen 4.0-amd64 / Debian GNU/Linux, kernel 2.6.32-5-xen-amd64 / Serial Console
root (hd1,0)
kernel /boot/xen-4.0-amd64.gz dom0_mem=2048M dom0_max_vcpus=2 dom0_vcpus_pin loglvl=all guest_loglvl=all sync_console console_to_ring console=vga com1=38400,8n1 console=com1
module /boot/vmlinuz-2.6.32-5-xen-amd64 root=UUID=c0186fd3-605d-4a23-b718-3b4c869040e2 ro console=hvc0 earlyprintk=xen nomodeset
module /boot/initrd.img-2.6.32-5-xen-amd64
(:sourceend:)
This uses the first onboard serial port (COM1) on the host at 38400 baud with 8 data bits, no parity and 1 stop bit.
The pv_ops dom0 Linux kernel is configured to use the Xen (hvc0) console. Dom0 Linux kernel console output will go to the serial console through Xen, so both Xen hypervisor and dom0 linux kernel output will go to the same serial console, where they can then be captured. More info can be found on the Xen Wiki http://wiki.xensource.com/xenwiki/XenSerialConsole.
Note in the above that dom0 memory is restricted to 2GB (dom0_mem=2048M), that it will use at most 2 virtual cpus (dom0_max_vcpus=2), and that they are pinned with dom0_vcpus_pin. See the section on VCPUs and Memory below for extra info.
The Xen boot options can be found here: http://wiki.xensource.com/xenwiki/XenHypervisorBootOptions. Note that this is Work In Progress (TM).
Use the following grub menu entry to hide the Xen boot messages and send the guest boot messages to the console (as usual) but keep the debug info for the Dom0 and the guests:
(:source:)
title Raid1 (hd1) Xen 4.0-amd64 / Debian GNU/Linux, kernel 2.6.32-5-xen-amd64
root (hd1,0)
kernel /boot/xen-4.0-amd64.gz dom0_mem=2048M dom0_max_vcpus=2 loglvl=all guest_loglvl=all console=tty0
module /boot/vmlinuz-2.6.32-5-xen-amd64 root=UUID=c0186fd3-605d-4a23-b718-3b4c869040e2 ro
module /boot/initrd.img-2.6.32-5-xen-amd64
(:sourceend:)
Here is a procedure to gather information in case of a kernel hang:
- In the Linux kernel, hit SysRq-L, SysRq-T
- Go in the hypervisor, hit Ctrl-A three times.
You should see a prompt saying (XEN) ** Serial …
- Hit * - that will collect all of the relevant information.
Network Bridging
Network bridging can be illustrated by this diagram (stolen from http://wiki.prgmr.com/mediawiki/index.php/Chapter_5:_Networking)

Here is the chain of events that happens when Xend starts up and runs the network-bridge script on the dom0:
- creates a new bridge named xenbr0
- “real” ethernet interface eth0 is brought down
- the IP and MAC addresses of eth0 are copied to virtual network interface veth0
- real interface eth0 is renamed peth0
- virtual interface veth0 is renamed eth0
- peth0 and vif0.0 are attached to bridge xenbr0. Note that in xen 3.3 and 4.0.x, the default bridge name is the same as the interface it is attached to, e.g. bridge name eth0, eth1 or ethX.VlanID
- the bridge, peth0, eth0 and vif0.0 are brought up
Enable network bridging for Xend by enabling (network-script network-bridge) in /etc/xen/xend-config.sxp. Make sure to specify the netdev option with the proper value if there is more than one physical network interface:
(network-script 'network-bridge netdev=eth0')
Reboot. Have a look at http://wiki.xen.org/xenwiki/XenNetworking for an explanation of the different network setups and topologies for Xen, including hosts with more than one network interface, as is the case here.
Xen /dev and /proc files
Verify that Xend is up and running and that the /proc/xen filesystem is mounted:
node1~: ls -l /proc/xen
-r--r--r-- 1 root root 0 Sep 15 13:42 capabilities
-rw------- 1 root root 0 Sep 15 13:42 privcmd
-rw------- 1 root root 0 Sep 15 13:42 xenbus
-rw------- 1 root root 0 Sep 15 13:42 xsd_kva
-rw------- 1 root root 0 Sep 15 13:42 xsd_port
node1~: cat /proc/xen/capabilities
control_d
The control_d means that you are in the Xen Dom0 (control domain). You should have the following devices in /dev/xen
crw------- 1 root root 10, 56 Sep 15 13:42 evtchn
crw------- 1 root root 10, 62 Sep 15 13:42 gntdev
and their minor/major device numbers should correspond to those in /proc/misc
56 xen/evtchn
62 xen/gntdev
Xend and xen-tools Configuration
Some default installation settings must be changed if one wants to enable live migration and ultimately have it supervised by pacemaker. It is important to restrict which hosts can issue a relocation request with (xend-relocation-hosts-allow ...), and be sure to check that there is no typo in there!
(:source:)
puck:~# ~malin/bin/crush /etc/xen/xend-config.sxp
(xend-http-server yes)
(xend-unix-server yes)
(xend-relocation-server yes)
(xend-port 8000)
(xend-relocation-port 8002)
(xend-address localhost)
(xend-relocation-address '')
# (xend-relocation-hosts-allow 'localhost helena\.bic\.mni\.mcgill\.ca puck\.bic\.mni\.mcgill\.ca')
(xend-relocation-hosts-allow '^localhost$ ^helena\\.bic\\.mni\\.mcgill\\.ca$ ^puck\\.bic\\.mni\\.mcgill\\.ca$ ^132\\.206\\.178\\.60$ ^132\\.206\\.178\\.61$')
(network-script 'network-bridge netdev=eth0')
(vif-script vif-bridge)
(dom0-min-mem 2048)
(enable-dom0-ballooning no)
(total_available_memory 0)
(dom0-cpus 0)
(vncpasswd '')
(:sourceend:)
(:source:)
puck:~# ~malin/bin/crush /etc/xen-tools/xen-tools.conf
lvm = xen
install-method = debootstrap
debootstrap-cmd = /usr/sbin/debootstrap
size = 20Gb
memory = 2Gb
swap = 3Gb
fs = ext3
dist = `xt-guess-suite-and-mirror --suite`
image = full
gateway = 132.206.178.1
netmask = 255.255.255.0
broadcast = 132.206.178.255
nameserver = 132.206.178.7
bridge = eth0
kernel = /boot/vmlinuz-`uname -r`
initrd = /boot/initrd.img-`uname -r`
arch = amd64
mirror = `xt-guess-suite-and-mirror --mirror`
ext3_options = noatime,nodiratime,errors=remount-ro
ext2_options = noatime,nodiratime,errors=remount-ro
xfs_options = defaults
reiserfs_options = defaults
btrfs_options = defaults
serial_device = hvc0
disk_device = xvda
output = /etc/xen
extension = .cfg
copyhosts = 1
(:sourceend:)
(:source:)
puck:~# ~malin/bin/crush /etc/default/xendomains
XENDOMAINS_SYSRQ=""
XENDOMAINS_USLEEP=100000
XENDOMAINS_CREATE_USLEEP=5000000
XENDOMAINS_MIGRATE=""
XENDOMAINS_SAVE=/var/lib/xen/save
XENDOMAINS_SHUTDOWN="--halt --wait"
XENDOMAINS_SHUTDOWN_ALL="--all --halt --wait"
XENDOMAINS_RESTORE=true
XENDOMAINS_AUTO=/etc/xen/auto
XENDOMAINS_AUTO_ONLY=false
XENDOMAINS_STOP_MAXWAIT=300
(:sourceend:)
See http://wiki.xensource.com/xenwiki/XenBestPractices for more info. It's best to allocate a fixed amount of memory to the Dom0 (dom0-min-mem 2048) and also to disable dom0 memory ballooning with the setting (enable-dom0-ballooning no) in the xend config file. This will make sure that the Dom0 never runs out of memory.
The Xen Dom0 physical network device should be present:
puck:~# ifconfig eth0 Link encap:Ethernet HWaddr 00:1b:21:50:0f:ec inet addr:132.206.178.61 Bcast:132.206.178.255 Mask:255.255.255.0 inet6 addr: fe80::21b:21ff:fe50:fec/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:2621736 errors:0 dropped:0 overruns:0 frame:0 TX packets:111799 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:0 RX bytes:290640331 (277.1 MiB) TX bytes:17035127 (16.2 MiB) eth1 Link encap:Ethernet HWaddr 00:30:48:9e:ff:04 inet addr:10.0.0.1 Bcast:10.0.0.255 Mask:255.0.0.0 inet6 addr: fe80::230:48ff:fe9e:ff04/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:62308 errors:0 dropped:0 overruns:0 frame:0 TX packets:62314 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:4485102 (4.2 MiB) TX bytes:3258684 (3.1 MiB) Memory:fbce0000-fbd00000 eth2 Link encap:Ethernet HWaddr 00:30:48:9e:ff:05 inet addr:192.168.1.18 Bcast:192.168.1.1 Mask:255.255.255.0 inet6 addr: fe80::230:48ff:fe9e:ff05/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:69006 errors:0 dropped:0 overruns:0 frame:0 TX packets:24 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:4140880 (3.9 MiB) TX bytes:1934 (1.8 KiB) Memory:fbde0000-fbe00000 lo Link encap:Local Loopback inet addr:127.0.0.1 Mask:255.0.0.0 inet6 addr: ::1/128 Scope:Host UP LOOPBACK RUNNING MTU:16436 Metric:1 RX packets:5590 errors:0 dropped:0 overruns:0 frame:0 TX packets:5590 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:0 RX bytes:6419836 (6.1 MiB) TX bytes:6419836 (6.1 MiB) peth0 Link encap:Ethernet HWaddr 00:1b:21:50:0f:ec inet6 addr: fe80::21b:21ff:fe50:fec/64 Scope:Link UP BROADCAST RUNNING PROMISC MULTICAST MTU:1500 Metric:1 RX packets:73515001 errors:0 dropped:0 overruns:0 frame:0 TX packets:114824 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:72955828195 (67.9 GiB) TX bytes:17247510 (16.4 MiB) Memory:fbb20000-fbb40000
and the bridge configuration can be inspected with the brctl command:
puck:~# brctl show
bridge name     bridge id               STP enabled     interfaces
eth0            8000.001b21500fec       no              peth0
Create the Guest (LVM Disk Device)
Each Xen guest host will live on its own volume group as 2 logical volumes (one for its virtual disk and another for the virtual swap device). This volume group is located on top of its own replication device (so-called nested DRBD).
See below for how to create a guest host using disk images rather than LVMs.
You will hit a buglet in xen-create-image if you specify the disk image size or swap in units of Kb in the file /etc/xen-tools/xen-tools.conf.
Ideally the domU config file should be shared between the nodes. Just make sure that both nodes have identical copies. For the moment I just stuff them in the default /etc/xen/<guest_name>.bic.mni.mcgill.ca.cfg and sync them manually to the other node. The command below will create a guest host called kurma with IP address 132.206.178.241 and with the virtual disk devices on the volume group vg1. Obviously the volume group must be accessible, and that means the drbd device under it must be in primary mode. If it's not, then promote it to primary and force a scan of the volume groups with vgscan.
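If the volume group is not visible, a quick sketch of making it so (this assumes vg1 sits on top of a drbd device configured like r1 above; adjust the resource name to your setup):
(:source:)
drbdadm primary r1      # promote the replication device under vg1
vgscan                  # re-read PV signatures
vgchange -ay vg1        # activate the volume group and its logical volumes
(:sourceend:)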
puck:~# xen-create-image --hostname=kurma.bic.mni.mcgill.ca --ip=132.206.178.241 --arch=amd64 --role=udev --lvm=vg1 General Information -------------------- Hostname : kurma.bic.mni.mcgill.ca Distribution : squeeze Mirror : http://ftp.ca.debian.org/debian/ Partitions : swap 3Gb (swap) / 20Gb (ext3) Image type : full Memory size : 2Gb Kernel path : /boot/vmlinuz-2.6.32-5-xen-amd64 Initrd path : /boot/initrd.img-2.6.32-5-xen-amd64 Networking Information ---------------------- IP Address 1 : 132.206.178.241 [MAC: 00:16:3E:85:58:34] Netmask : 255.255.255.0 Broadcast : 132.206.178.255 Gateway : 132.206.178.1 Nameserver : 132.206.178.7 Logical volume "kurma.bic.mni.mcgill.ca-swap" created Creating swap on /dev/vg1/kurma.bic.mni.mcgill.ca-swap mkswap: /dev/vg1/kurma.bic.mni.mcgill.ca-swap: warning: don't erase bootbits sec tors on whole disk. Use -f to force. Setting up swapspace version 1, size = 3145724 KiB no label, UUID=cd23c566-4572-43b0-aed9-f8643eebbad8 Done Logical volume "kurma.bic.mni.mcgill.ca-disk" created Creating ext3 filesystem on /dev/vg1/kurma.bic.mni.mcgill.ca-disk mke2fs 1.41.12 (17-May-2010) Filesystem label= OS type: Linux Block size=4096 (log=2) Fragment size=4096 (log=2) Stride=0 blocks, Stripe width=0 blocks 1310720 inodes, 5242880 blocks 262144 blocks (5.00%) reserved for the super user First data block=0 Maximum filesystem blocks=4294967296 160 block groups 32768 blocks per group, 32768 fragments per group 8192 inodes per group Superblock backups stored on blocks: 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 4096000 Writing inode tables: done Writing superblocks and filesystem accounting information: done This filesystem will be automatically checked every 20 mounts or 180 days, whichever comes first. Use tune2fs -c or -i to override. Done Installation method: debootstrap Copying files from host to image. Copying files from /var/cache/apt/archives/ -> /tmp/2eMPe0va9g/var/cache/apt/archives Done Done I: Retrieving Release I: Retrieving Packages I: Validating Packages I: Resolving dependencies of required packages... I: Resolving dependencies of base packages... I: Found additional required dependencies: insserv libbz2-1.0 libdb4.8 libslang2 I: Found additional base dependencies: libnfnetlink0 libsqlite3-0 I: Checking component main on http://ftp.ca.debian.org/debian... I: Validating libacl1 I: Validating adduser I: Validating apt-utils I: Validating apt Copying files from new installation to host. Copying files from /tmp/2eMPe0va9g/var/cache/apt/archives -> /var/cache/apt/archives/ Done Done Done Running hooks Running hook 01-disable-daemons hook 01-disable-daemons: done. Running hook 05-shadowconfig-on Shadow passwords are now on. hook 05-shadowconfig-on: done. Running hook 15-disable-hwclock update-rc.d: using dependency based boot sequencing hook 15-disable-hwclock: done. Running hook 20-setup-apt Get:1 http://ftp.ca.debian.org squeeze Release.gpg [835 B] Ign http://ftp.ca.debian.org/debian/ squeeze/contrib Translation-en Ign http://ftp.ca.debian.org/debian/ squeeze/main Translation-en Ign http://ftp.ca.debian.org/debian/ squeeze/non-free Translation-en Hit http://ftp.ca.debian.org squeeze Release [...] Creating Xen configuration file Done Setting up root password Generating a password for the new guest. All done Installation Summary --------------------- Hostname : kurma.bic.mni.mcgill.ca Distribution : squeeze IP-Address(es) : 132.206.178.241 RSA Fingerprint : 34:03:64:21:dd:9a:bb:c0:f5:4d:10:05:ce:20:6d:63 Root Password : *********
It is vital to explicitly specify the volume group name that will host the guest domU disks with --lvm=. Otherwise the default value from xen-tools.conf will be used.
Boot the Guest
puck:~# xm create /xen_cluster/xen0/matsya.bic.mni.mcgill.ca.cfg
[...]
~# xm list
Name                          ID   Mem VCPUs      State   Time(s)
Domain-0                       0  2043     2     r-----  137872.3
matsya.bic.mni.mcgill.ca       3  2048     3     -b----    9275.4
~# xm vcpu-list
Name                          ID  VCPU   CPU State   Time(s) CPU Affinity
Domain-0                       0     0     0   r--   63861.3 0
Domain-0                       0     1     1   -b-   74013.4 1
matsya.bic.mni.mcgill.ca       3     0     6   -b-    3536.2 any cpu
matsya.bic.mni.mcgill.ca       3     1     7   -b-    2450.8 any cpu
matsya.bic.mni.mcgill.ca       3     2     4   -b-    3288.4 any cpu
(:source:)
~# ~malin/bin/crush /etc/xen/matsya.bic.mni.mcgill.ca.cfg
kernel  = '/boot/vmlinuz-2.6.32-5-xen-amd64'
ramdisk = '/boot/initrd.img-2.6.32-5-xen-amd64'
vcpus   = '2'
memory  = '2048'
root    = '/dev/xvda2 ro'
disk    = [
    'phy:/dev/vg0/xen0-disk,xvda2,w',
    'phy:/dev/vg0/xen0-swap,xvda1,w',
]
name    = 'matsya.bic.mni.mcgill.ca'
vif     = [ 'ip=132.206.178.240,mac=00:16:3E:17:1A:86,bridge=eth0' ]
on_poweroff = 'destroy'
on_reboot   = 'restart'
on_crash    = 'restart'
(:sourceend:)
Once the guest is created and running, Xen will set up the network. For instance, here is the output of all the chains defined in the iptables of the Dom0 in the case of 2 running guests, matsya.bic.mni.mcgill.ca and kurma.bic.mni.mcgill.ca:
helena:~# brctl show bridge name bridge id STP enabled interfaces eth0 8000.001b21538ed6 no peth0 vif7.0 vif8.0 helena:~# iptables -L Chain INPUT (policy ACCEPT) target prot opt source destination Chain FORWARD (policy ACCEPT) target prot opt source destination ACCEPT all -- anywhere anywhere state RELATED,ESTABLISHED PHYSDEV match --physdev-out vif8.0 ACCEPT udp -- anywhere anywhere PHYSDEV match --physdev-in vif8.0 udp spt:bootpc dpt:bootps ACCEPT all -- anywhere anywhere state RELATED,ESTABLISHED PHYSDEV match --physdev-out vif8.0 ACCEPT all -- xennode-1.bic.mni.mcgill.ca anywhere PHYSDEV match --physdev-in vif8.0 ACCEPT all -- anywhere anywhere state RELATED,ESTABLISHED PHYSDEV match --physdev-out vif7.0 ACCEPT udp -- anywhere anywhere PHYSDEV match --physdev-in vif7.0 udp spt:bootpc dpt:bootps ACCEPT all -- anywhere anywhere state RELATED,ESTABLISHED PHYSDEV match --physdev-out vif7.0 ACCEPT all -- xennode-2.bic.mni.mcgill.ca anywhere PHYSDEV match --physdev-in vif7.0 Chain OUTPUT (policy ACCEPT) target prot opt source destination
You can peek at the networking setup of the guest using the xenstore command xenstore-ls fed with the Xen backend path of the vif (virtual interface):
puck:~# xm network-list kurma.bic.mni.mcgill.ca
Idx BE     MAC Addr.     handle state evt-ch tx-/rx-ring-ref BE-path
0   0  00:16:3E:85:58:34    0     4      10    768  /769     /local/domain/0/backend/vif/11/0
puck:~# xenstore-ls /local/domain/0/backend/vif/11/0
bridge = "eth0"
domain = "kurma.bic.mni.mcgill.ca"
handle = "0"
uuid = "f0cef322-dc1b-c935-223b-62a44a985181"
script = "/etc/xen/scripts/vif-bridge"
ip = "132.206.178.241"
state = "4"
frontend = "/local/domain/11/device/vif/0"
mac = "00:16:3E:85:58:34"
online = "1"
frontend-id = "11"
feature-sg = "1"
feature-gso-tcpv4 = "1"
feature-rx-copy = "1"
feature-rx-flip = "0"
feature-smart-poll = "1"
hotplug-status = "connected"
Guest Console Login
malin@cicero:~$ ssh kurma
Linux kurma 2.6.32-5-xen-amd64 #1 SMP Wed Jan 12 05:46:49 UTC 2011 x86_64

The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
kurma::~> pwd
/home/bic/malin
kurma::~> df
Filesystem            Size  Used Avail Use% Mounted on
/dev/xvda2             20G  1.1G   18G   6% /
tmpfs                1023M     0 1023M   0% /lib/init/rw
udev                  987M   32K  987M   1% /dev
tmpfs                1023M  4.0K 1023M   1% /dev/shm
gustav:/raid/home/bic 503G  161G  342G  33% /home/bic
kurma::~>
Hit CTRL-] to disconnect from the virtual console. At this point the domU is also accessible through ssh. Editing in the console can lead to frustration as the terminal device emulation is a bit screwy, so it's better to connect to the guest over a secure ssh connection.
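For reference, attaching to the virtual console in the first place is just (guest name as above):
(:source:)
xm console kurma.bic.mni.mcgill.ca    # attach to the guest's console; CTRL-] detaches
(:sourceend:)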
Xen Block Device, Scheduling, Virtual CPUs and Memory Management
See http://book.xen.prgmr.com/mediawiki/index.php/Scheduling for good info.
http://publib.boulder.ibm.com/infocenter/lnxinfo/v3r0m0/index.jsp?topic=/liaai/xen/rhel/liaaixenrbindpin.htm
http://wiki.xensource.com/xenwiki/XenBestPractices for Xen Best Practices, like disabling ballooning for dom0, pinning core(s) to it and giving it more cpu time to service outstanding IO requests.
The xm-tools are deprecated. A new tool called xl is now available.
Virtual Block Devices
You can attach and detach block devices to a guest using the xm-tools commands:
~# xm block-attach domain-id be-dev fe-dev mode [bedomain-id]
- domain-id
The domain id of the guest domain that the device will be attached to.
- be-dev
The device in the backend domain (usually domain 0) to be exported. This can be specified as a physical partition (phy:/dev/sda7) or as a file mounted as loopback (file://path/to/loop.iso).
- fe-dev
How the device should be presented to the guest domain. It can be specified as either a symbolic name, such as /dev/xvdb1 or even xvdb1.
- mode
The access mode for the device from the guest domain. Supported modes are w (read/write) or r (read-only).
- bedomain-id
The back end domain hosting the device. This defaults to domain 0.
To detach a domain’s virtual block device use:
~# xm block-detach domain-id devid [--force]
devid may be the symbolic name or the numeric device id given to the device by domain 0. You will need to run xm block-list to determine that number.
You need the cooperation of the guest to detach a device (it may be open or in use inside the guest), and using --force can lead to IO errors. Be careful!
As an example, let's say one wants to give a guest with guest id 1 access to a logical volume lv0 created on the dom0 using the standard LVM tools:
~# lvcreate -L 250G -n lv0 vg_xen
~# xm block-attach 1 phy:/dev/vg_xen/lv0 /dev/xvdb1 w
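To undo it, something along these lines (the symbolic name xvdb1 matches the attach above; the numeric devid reported by xm block-list works too):
(:source:)
~# xm block-list 1                # note the devid of the attached device
~# xm block-detach 1 xvdb1        # detach it; add --force only as a last resort
(:sourceend:)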
Scheduling
By default, unless specified with the Xen boot line sched= parameter, the domU scheduler is set to credit. The xm sched-credit syntax:
xm sched-credit [ -d domain-id [ -w[=WEIGHT] | -c[=CAP] ] ]
This sets the credit scheduler parameters. The credit scheduler is a proportional fair share CPU scheduler built from the ground up to be work conserving on SMP hosts.
Each domain (including Domain0) is assigned a weight and a cap.
PARAMETERS:
- WEIGHT A domain with a weight of 512 will get twice as much CPU as a domain with a weight of 256 on a contended host. Legal weights range from 1 to 65535 and the default is 256.
- CAP The cap optionally fixes the maximum amount of CPU a domain will be able to consume, even if the host system has idle CPU cycles. The cap is expressed in percentage of one physical CPU: 100 is 1 physical CPU, 50 is half a CPU, 400 is 4 CPUs, etc. The default, 0, means there is no upper cap.
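For instance, doubling a guest's share of the CPU while capping it at one physical CPU could look like this (domain name from earlier, values purely illustrative):
(:source:)
~# xm sched-credit -d matsya.bic.mni.mcgill.ca -w 512 -c 100
~# xm sched-credit -d matsya.bic.mni.mcgill.ca     # show the current weight/cap
(:sourceend:)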
VCPUs and Memory
To restrict Dom0 to a certain number of CPUs one has to add the stanza dom0_max_vcpus=X to the Xen boot line in /boot/grub/menu.lst, where X is the number of vcpus dedicated to Dom0. The grub menu entry for a dom0 with an allocation of 2GB of ram and 2 virtual CPUs would look like:
(:source:)
title Raid1 (hd1) Xen 4.0-amd64 / Debian GNU/Linux, kernel 2.6.32-5-xen-amd64 / Serial Console
root (hd1,0)
kernel /boot/xen-4.0-amd64.gz dom0_mem=2048M dom0_max_vcpus=2 loglvl=all guest_loglvl=all sync_console console_to_ring console=vga com1=38400,8n1 console=com1
module /boot/vmlinuz-2.6.32-5-xen-amd64 root=UUID=c0186fd3-605d-4a23-b718-3b4c869040e2 ro console=hvc0 earlyprintk=xen nomodeset
module /boot/initrd.img-2.6.32-5-xen-amd64
(:sourceend:)
The memory and virtual cpus allocated to an unprivileged Xen domain (domU) can be set at domain creation and changed while the domU is running. As an example, here is a domU allocated 4 virtual cpus (vcpus = '4') pinned to the third through sixth logical cpus (cpus = ['2','3','4','5']) and 2GB of memory (memory = '2048'):
(:source:)
kernel  = '/boot/vmlinuz-2.6.32-5-xen-amd64'
ramdisk = '/boot/initrd.img-2.6.32-5-xen-amd64'
vcpus   = '4'
cpus    = ['2','3','4','5']
memory  = '2048'
root    = '/dev/xvda1 ro'
disk    = [
    'file:/xen_cluster/r1/disk.img,xvda1,w',
    'file:/xen_cluster/r2/swap.img,xvda2,w',
]
name    = 'xennode-1.bic.mni.mcgill.ca'
vif     = [ 'ip=132.206.178.241,mac=00:16:3E:12:94:DF,bridge=eth0' ]
on_poweroff = 'destroy'
on_reboot   = 'restart'
on_crash    = 'restart'
(:sourceend:)
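The same knobs can also be adjusted at runtime with the standard xm subcommands; a hedged sketch (values are only an illustration, and mem-set needs the balloon driver in the guest):
(:source:)
~# xm vcpu-set xennode-1.bic.mni.mcgill.ca 2      # use only 2 of the 4 vcpus
~# xm vcpu-pin xennode-1.bic.mni.mcgill.ca 0 2    # pin vcpu 0 to physical cpu 2
~# xm mem-set xennode-1.bic.mni.mcgill.ca 1024    # shrink the guest to 1GB
(:sourceend:)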
Xen resources management
A multitude of tools exist to manage and gather information and/or statistics from a Xen virtual domain. Have a look at http://wiki.xensource.com/xenwiki/XenManagementTools for a list.
A GUI that can connect remotely over secured channels is virt-manager, for example. In order to allow connections one must reconfigure Xend by enabling (xend-unix-server yes) in /etc/xen/xend-config.sxp and restart xend with /etc/init.d/xend restart. After that, one exports the environment variable VIRSH_DEFAULT_CONNECT_URI=xen:///.
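A minimal usage sketch for the command-line side (virsh honours this variable; the GUI gets its connection configured from within virt-manager itself):
(:source:)
~# export VIRSH_DEFAULT_CONNECT_URI=xen:///
~# virsh list      # should show Domain-0 and the running guests
(:sourceend:)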

The command xentop will show the state of the Dom0 and running DomUs:
puck:~# xentop xentop - 16:11:50 Xen 4.0.1 2 domains: 1 running, 1 blocked, 0 paused, 0 crashed, 0 dying, 0 shutdown Mem: 12573788k total, 4351752k used, 8222036k free CPUs: 8 @ 2000MHz NAME STATE CPU(sec) CPU(%) MEM(k) MEM(%) MAXMEM(k) MAXMEM(%) VCPUS NETS NETTX(k) NETRX(k) VBDS VBD_OO VBD_RD VBD_WR VBD_RSECT VBD_WSECT SSID Domain-0 -----r 514 1.6 2092288 16.6 no limit n/a 2 0 0 0 0 0 0 0 0 0 0 VCPUs(sec): 0: 362s 1: 152s xennode-1. --b--- 15 0.8 2089984 16.6 2097152 16.7 4 1 1199 6 1 0 2 164 16 2792 0 VCPUs(sec): 0: 12s 1: 1s 2: 1s 3: 1s Net0 RX: 6258bytes 39pkts 0err 0drop TX: 1227871bytes 8872pkts 0err 9983drop VBD BlkBack 51713 [ca: 1] OO: 0 RD: 2 WR: 164 RSECT: 16 WSECT: 2792
From the Xen-tools toolbox comes xm, which has a fairly rich syntax:
helena:~# xm vcpu-list
Name                          ID  VCPU   CPU State   Time(s) CPU Affinity
Domain-0                       0     0     7   r--     368.5 any cpu
Domain-0                       0     1     1   -b-     154.9 any cpu
xennode-1.bic.mni.mcgill.ca    2     0     2   -b-      14.7 2
xennode-1.bic.mni.mcgill.ca    2     1     3   -b-       1.2 3
xennode-1.bic.mni.mcgill.ca    2     2     4   -b-       1.1 4
xennode-1.bic.mni.mcgill.ca    2     3     5   -b-       1.1 5
Another one is libvirt and its related binaries (virsh help):
helena:~# virsh list
 Id Name                          State
----------------------------------
  0 Domain-0                      running
  2 xennode-1.bic.mni.mcgill.ca   idle
helena:~# virsh dominfo xennode-1.bic.mni.mcgill.ca
Id:             2
Name:           xennode-1.bic.mni.mcgill.ca
UUID:           7df57c76-8555-bd41-b9da-31054aef3521
OS Type:        linux
State:          idle
CPU(s):         4
CPU time:       21.6s
Max memory:     2097152 kB
Used memory:    2089984 kB
Persistent:     no
Autostart:      disable
Xen Virtual Block Devices (VBDs) and DRBD
There is a bug in /etc/xen/scripts/block that prevents /etc/xen/scripts/block-drbd from being run if you use DRBD VBDs. To fix it, just add a line XEN_SCRIPT_DIR=/etc/xen/scripts before the last 2 lines of the shell script. Even after making that change I find that the virtual drbd device stuff doesn't work with Debian/Squeeze. It seems that the Xen packages miss a hotplug device script.
In order to use a DRBD resource (<resource> is the resource name, NOT the block device) as the virtual block device xvda1, you must add a line like the following to your Xen domU configuration:
disk = [ 'drbd:<resource>,xvda1,w' ]
See http://www.drbd.org/users-guide/ch-xen.html for more details. But suffice to say that It Just Doesn’t Work (TM) for Debian/Squeeze.
Corosync Installation and Configuration
First, what is Corosync?
Corosync is the first layer in the cluster stack (membership and messaging), Pacemaker is the second layer (cluster resource management), and services are on the third layer. We’ll cover Pacemaker later.
Create a new file /etc/apt/sources.list.d/pacemaker.list that contains:
deb http://people.debian.org/~madkiss/ha lenny main
Add the Madkiss key to your package system:
>~ apt-key adv --keyserver pgp.mit.edu --recv-key 1CFA3E8CD7145E30
If you omit this step you will get this error:
W: GPG error: http://people.debian.org lenny Release: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY 1CFA3E8CD7145E30
Update the package list
>~ aptitude update
Installing the package pacemaker will install pacemaker with corosync. If you need openais later on, you can install that as a plugin in corosync. OpenAIS is needed for example for DLM or cLVM, but that's beyond the scope of this howto.
>~ aptitude install pacemaker
Create an auth key for nodes communication
To create an authkey for corosync communication between your two nodes do this on the first node:
node1~: sudo corosync-keygen
This creates a key in /etc/corosync/authkey. If it's too slow, that's because there is not enough entropy; get the system active by running something like find / -name blah. You need to copy this file to the second node and put it in the /etc/corosync directory with the right permissions. So on the first node:
node1~: scp /etc/corosync/authkey node2:/etc/corosync/authkey
And on the second node:
node2~: sudo mv ~/authkey /etc/corosync/authkey
node2~: sudo chown root:root /etc/corosync/authkey
node2~: sudo chmod 400 /etc/corosync/authkey
Make core files with the executable name and PID appended, on both nodes:
node1~: echo 1 > /proc/sys/kernel/core_uses_pid
node1~: echo core.%e.%p > /proc/sys/kernel/core_pattern
and make the change permanent on both nodes with:
node1~: cat /etc/sysctl.d/core_uses_pid.conf
kernel.core_uses_pid = 1
kernel.core_pattern = core.%e.%p
Edit corosync config file
Most of the options in the /etc/corosync/corosync.conf file are OK to start with, but you must check that corosync can communicate, so make sure to adjust the interface section:
(:source:)
interface {
    # The following values need to be set based on your environment
    ringnumber: 0
    bindnetaddr: 192.168.1.0
    mcastaddr: 226.94.1.1
    mcastport: 5405
}
(:sourceend:)
Adjust bindnetaddr to your local subnet: if you have configured the IP 10.0.0.1 for the first node and 10.0.0.2 for the second node, set bindnetaddr to 10.0.0.0. In my case the communication ring is redundant: one link goes from IP 132.206.178.60 to IP 132.206.178.61 (implying a bindnetaddr of 132.206.178.0), and the other link is on a private network from IP 192.168.1.18 to 192.168.1.19 with the associated bindnetaddr set to 192.168.1.0.
Corosync redundant dual ring config
One ring is on the public network 132.206.178.0/24 and the other ring on the private network 192.168.1.0/24. /etc/corosync/corosync.conf:
(:source:)
totem {
    version: 2
    token: 3000
    token_retransmits_before_loss_const: 10
    join: 60
    consensus: 4500
    vsftype: none
    max_messages: 20
    clear_node_high_bit: yes
    secauth: off
    threads: 0
    rrp_mode: active
    interface {
        ringnumber: 0
        bindnetaddr: 132.206.178.0
        mcastaddr: 226.94.1.0
        mcastport: 5400
    }
    interface {
        ringnumber: 1
        bindnetaddr: 192.168.1.0
        mcastaddr: 226.94.1.1
        mcastport: 5401
    }
}
amf {
    mode: disabled
}
service {
    ver: 0
    name: pacemaker
}
aisexec {
    user: root
    group: root
}
logging {
    fileline: off
    to_stderr: yes
    to_logfile: no
    to_syslog: yes
    syslog_facility: daemon
    debug: off
    timestamp: on
    logger_subsys {
        subsys: AMF
        debug: off
        tags: enter|leave|trace1|trace2|trace3|trace4|trace6
    }
}
(:sourceend:)
Now that you have configured both nodes you can start the cluster on both sides.
Edit /etc/default/corosync to enable corosync at startup and start the services:
node1~: sudo /etc/init.d/corosync start
Starting corosync daemon: corosync.
node2~: sudo /etc/init.d/corosync start
Starting corosync daemon: corosync.
The following processes should appear on both nodes:
root  /usr/sbin/corosync
root   \_ /usr/lib/heartbeat/stonithd
112    \_ /usr/lib/heartbeat/cib
root   \_ /usr/lib/heartbeat/lrmd
112    \_ /usr/lib/heartbeat/attrd
112    \_ /usr/lib/heartbeat/pengine
112    \_ /usr/lib/heartbeat/crmd

stonithd -> shoot the other node in the head daemon
cib      -> cluster information base
lrmd     -> local resource manager daemon
pengine  -> policy engine
crmd     -> cluster resource manager daemon
Corosync Status and Ring Management
The ring state is displayed using the command:
node1:~# corosync-cfgtool -s
Printing ring status.
Local node ID 1018351236
RING ID 0
        id      = 132.206.178.60
        status  = ring 0 active with no faults
RING ID 1
        id      = 192.168.1.19
        status  = ring 1 active with no faults
To re-enable a ring that was faulty but has since been fixed, use the command corosync-cfgtool -r.
There is an annoying bug in corosync-1.2.x (supposed to be fixed in the upcoming Weaver’s Needle release — a.k.a. corosync 2.y.z) in that if the ring doesn’t initialize correctly there is no way out short of a reboot: restarting the network just doesn’t work as corosync processes simply hang:
~# corosync-cfgtool -s
Could not initialize corosync configuration API error 6
What happens is that corosync cannot bind to its network socket(s) and everything becomes fubar. Some info can be found here: https://lists.linux-foundation.org/pipermail/openais/2011-January/015626.html
IPMI Configuration for STONITH
In order to use the IPMI resource agent for stonith one must first install some packages on the cluster nodes.
puck:~# apt-get update
puck:~# apt-get install ipmitool openipmi
To have the IPMI device handler modules loaded, add the following lines to /etc/modules:
# /etc/modules: kernel modules to load at boot time.
#
# This file contains the names of kernel modules that should be loaded
# at boot time, one per line. Lines beginning with "#" are ignored.
# Parameters can be specified after the module name.

ipmi_si
ipmi_devintf
To load the modules do modprobe ipmi_si and modprobe ipmi_devintf. If successful, a new device /dev/ipmi<X> should be created that allows in-band communication with the BMC.
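A quick sketch of loading the modules by hand and checking that the in-band interface works (the device is /dev/ipmi0 here, an assumption; the minor number may differ on other hardware):
(:source:)
~# modprobe ipmi_si
~# modprobe ipmi_devintf
~# ls -l /dev/ipmi0          # character device created by ipmi_devintf
~# ipmitool sel              # in-band query through the local BMC
(:sourceend:)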
The BMC network config is done at the BIOS level, using a static IP address along with broadcast, gateway and netmask (using DHCP is possible). Log in through the web interface using the manufacturer default username and password ADMIN/ADMIN. Change the password and create another user called root with administrative privileges.
Use the ipmitool command to attempt a connection to the remote node's BMC, with user root and the password stored in the file /root/ipmipass:
puck:~# ipmitool -I lanplus -H 192.168.1.17 -U root -f /root/ipmipass sel SEL Information Version : 1.5 (v1.5, v2 compliant) Entries : 0 Free Space : 10240 bytes Percent Used : 0% Last Add Time : Not Available Last Del Time : Not Available Overflow : false Supported Cmds : 'Reserve' 'Get Alloc Info' # of Alloc Units : 512 Alloc Unit Size : 20 # Free Units : 512 Largest Free Blk : 512 Max Record Size : 2 puck:~# ipmitool -I lanplus -H 192.168.1.17 -U root -f ./ipmipass sdr FAN 1 | 11881 RPM | ok FAN 2 | 11881 RPM | ok FAN 3 | 11881 RPM | ok FAN 4 | 10404 RPM | ok FAN 5 | disabled | ns FAN 6 | disabled | ns CPU1 Vcore | 0.95 Volts | ok CPU2 Vcore | 0.96 Volts | ok +1.5 V | 1.51 Volts | ok +5 V | 5.09 Volts | ok +5VSB | 5.09 Volts | ok +12 V | 12.14 Volts | ok -12 V | -12.29 Volts | ok +3.3VCC | 3.31 Volts | ok +3.3VSB | 3.26 Volts | ok VBAT | 3.24 Volts | ok CPU1 Temp | 0 unspecified | ok CPU2 Temp | 0 unspecified | ok System Temp | 29 degrees C | ok Chassis Intru | 0 unspecified | ok PS Status | 0 unspecified | nc puck:~# ipmitool -I lanplus -H 192.168.1.17 -U root -f /root/ipmipass chassis power status Chassis Power is on
Pacemaker/CRM (Cluster Resource Manager) Configuration
Cluster Architecture and Internals.
The Pacemaker architecture is described in the Pacemaker documentation available at http://www.clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/s-intro-architecture.html.
A quick view of the stack (from the doc mentioned above):

The pacemaker internals consist of four key components:
- CIB (aka. Cluster Information Base)
- CRMd (aka. Cluster Resource Management daemon)
- PEngine (aka. PE or Policy Engine)
- STONITHd

Initial Configuration
To configure the cluster stack one can either go the brutal way with cibadmin (must speak/grok XML) or use the crm shell command line interface.
See http://www.clusterlabs.org/wiki/DRBD_HowTo_1.0 and http://www.clusterlabs.org/doc/crm_cli.html
http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/index.html
First, one must disable a few services that are started automatically at boot time by the LSB init scripts. This is needed as the cluster will from now on manage the start/stop/restart and monitoring of those services and having them started at boot time before pacemaker has had a chance to get hold of them can cause confusion and even bring down the cluster.
Strictly speaking, disabling corosync at boot time is not really necessary, but for the paranoid mini-me inside me: when a node in an HA cluster has been rebooted, I want to go through the logs and find out why, and only when satisfied that all is good will I manually restart the cluster stack.
puck:~# update-rc.d -f corosync remove
puck:~# update-rc.d -f drbd remove
puck:~# update-rc.d -f o2cb remove
puck:~# update-rc.d -f xendomains remove
You can always reinstate them with update-rc.d [-n] <name> defaults.
node1 -> puck
node2 -> helena
After starting corosync on both systems one should see something like this (dates and versions will be different though, this is just from an old run of mine):
puck:~# crm_mon --one-shot -V
crm_mon[7363]: 2009/07/26_22:05:40 ERROR: unpack_resources: No STONITH resources have been defined
crm_mon[7363]: 2009/07/26_22:05:40 ERROR: unpack_resources: Either configure some or disable STONITH with the stonith-enabled option
crm_mon[7363]: 2009/07/26_22:05:40 ERROR: unpack_resources: NOTE: Clusters with shared data need STONITH to ensure data integrity

# crm_mon -1
============
Last updated: Wed Oct 20 16:39:53 2010
Stack: openais
Current DC: helena - partition with quorum
Version: 1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b
2 Nodes configured, 2 expected votes
0 Resources configured.
============

Online: [ puck helena ]
Cluster Options and Properties.
First, we set up some cluster options that will affect its behaviour as a whole.
In a two-node cluster, the concept of quorum does not apply. It is thus safe to set Pacemaker’s “no-quorum-policy” to ignore loss of quorum.
We also set a default stickiness greater than zero so that resources have a tendency to ‘stick’ to a node. This avoids resources being bounced around when a node is restarted/rebooted or a resource is unmigrated.
The Policy Engine creates files in /var/lib/pengine/pe-{input,warn,error}.bz2
—those are the transition files used by the PE to plot a path to attain a new state when events are detected. By default, they will just indefinitely accumulate. With one transition check every 15mins, we keep 672 (worth a week). They are useful in case an hb_report has to be submitted to the Pacemaker Mailing List.
Finally we disable stonith (just for the moment — it will be enabled later on).
puck:~# crm configure
crm(live)configure# property no-quorum-policy=ignore
crm(live)configure# property default-resource-stickiness=100
crm(live)configure# property pe-input-series-max="672"
crm(live)configure# property pe-error-series-max="672"
crm(live)configure# property pe-warn-series-max="672"
crm(live)configure# property stonith-enabled=false
crm(live)configure# verify
crm(live)configure# commit
crm(live)configure# exit
bye
Actually, for a 2-node cluster, failure to disable quorum might lead to a stonith deathmatch when stonith is enabled. See the entry How STONITH works below.
LVM, DRBD and Xen
Next, define the resources for LVM, DRBD and the Xen guest using the crm shell.
puck:~# crm configure edit
and stick the following in it:
(:source:)
primitive resDRBDr0 ocf:linbit:drbd \
    params drbd_resource="r0" \
    op start interval="0" timeout="240s" \
    op stop interval="0" timeout="120s" \
    op monitor interval="20s" role="Master" timeout="240s" \
    meta migration-threshold="3" failure-timeout="60s"
primitive resLVM0 ocf:heartbeat:LVM \
    params volgrpname="vg0" \
    op monitor interval="10s" timeout="60s" \
    op start interval="0" timeout="60s" \
    op stop interval="0" timeout="60s" \
    meta migration-threshold="3" failure-timeout="60s"
primitive resXen0 ocf:heartbeat:Xen \
    params xmfile="/etc/xen/matsya.bic.mni.mcgill.ca.cfg" name="matsya.bic.mni.mcgill.ca" \
    op monitor interval="20s" timeout="60s" \
    op start interval="0" timeout="90s" \
    op stop interval="0" timeout="90s" \
    meta migration-threshold="3" failure-timeout="60s" target-role="Stopped"
ms msDRBDr0 grpDRBDr0 meta notify="true" interleave="true"
colocation colLVM0-on-DRBDMaster0 inf: resLVM0 msDRBDr0:Master
colocation colXen0-on-LVM0 inf: resXen0 resLVM0
order ordDRBDr0-before-LVM0 inf: msDRBDr0:promote resLVM0:start
order ordLVM0-before-Xen0 inf: resLVM0 resXen0
(:sourceend:)
A few notes regarding the above.
- Modify the LVM Volume Group, the drbd resource name and the Xen config name, etc, according to your local setup.
- The resource meta property migration-threshold=N will move the resource away to a new node after N failures. A default value is not set by the cluster so it has to be explicitly set.
- The resource meta property failure-timeout="60s" sets a timeout to expire the above constraint, so in this case the node will be able to run a failed resource again after 60s. There is no default value for this meta property so again we must set it explicitly.
- The Xen resource is created in a Stopped state and has explicit timeouts set for migration to or away from a node. Once we are done setting up the CIB we can start the resource with crm resource start resXen0.
The bits:
(:source:)
ms msDRBDr1 resDRBDr1 meta notify="true" interleave="true" target-role="Started"
(:sourceend:)
creates a Master/Slave resource for the DRBD primitive.
- The resources must be started in a specific order and colocation constraints between them must be established:
- The LVM Volume Group can only be activated if the drbd block device is in primary mode.
- The Xen DomU can run if and only if it has access to its disk images on the volume group.
- The drbd resource must first be promoted to Master (primary) before LVM scans for volume groups.
- The LVM resource must be started (ie, made available by scanning for VG signatures) before the Xen DomU can be started.
Network connectivity and ping
After that one sets up a ping resource to monitor network connectivity, clones it on both nodes and restricts the Xen guests to run on a node if and only if the ping resource can reach a configured gateway, 132.206.178.1 in our case.
(:source:)
primitive resPing ocf:pacemaker:ping params dampen="5s" multiplier="100" host_list="132.206.178.1" attempts="3" op monitor interval="20s" timeout="60s" op start interval="0" timeout="65s" op stop interval="0" timeout="30s"
clone cloPing resPing meta globally-unique="false"
location locPing resXen0 rule -inf: not_defined pingd or pingd lte 0
(:sourceend:)
STONITH and IPMI
Finally one must have a way of fencing a node in a dual-primary cluster setup in case it behaves strangely or doesn't respond. Failure to configure a fencing device will most likely lead to data corruption (a case of split-brain: both nodes are primary but the data is not in a consistent state) and a manual intervention will be necessary. See Manually Resolving a Split-Brain Situation for instructions on how to do that, God forbid.
The split-brain situation is avoided by using a stonith resource configured to use IPMI on both nodes. Once the primitives are defined one sets up location constraints such that each node runs one instance of the stonith resource, so that it can shoot the other, offending node in case of failure (ie, a resource won't stop properly, a node seems to be available but doesn't respond, etc). One then enables the cluster property stonith-enabled="true" along with the option stonith-action="poweroff" to power off a node when stonith'ed rather than reset it (the default).
(:source:)
primitive resStonithHelena stonith:external/ipmi params hostname="helena" ipaddr="192.168.1.17" userid="root" passwd="********" interface="lanplus" op start interval="0" timeout="60s" op stop interval="0" timeout="60s" op monitor interval="3600s" timeout="60"
primitive resStonithPuck stonith:external/ipmi params hostname="puck" ipaddr="192.168.1.16" userid="root" passwd="********" interface="lanplus" op start interval="0" timeout="60s" op stop interval="0" timeout="60s" op monitor interval="3600s" timeout="60"
location locStonithHelena resStonithHelena -inf: helena
location locStonithPuck resStonithPuck -inf: puck
property stonith-enabled="true"
property stonith-action="poweroff"
(:sourceend:)
Final Setup
After all this, here's the resulting CIB, with the following properties:
- 4 Xen guests, each running on disk devices located on volume groups sitting on top of DRBD block devices, which in turn sit on top of LVM logical volumes.
- Active/Passive cluster, with DRBD in primary/secondary mode (no live migration).
- Stonith is enabled, using IPMI, to poweroff offending nodes.
- Ping resources are used to restrict guests to run on nodes with outside network connectivity.
- One guest has a block device attached, used as a disk for web content.
- Policy Engine files are restricted to 4*24*7 = 672 (a week's worth).
~# crm configure show
node helena
node puck
primitive resDRBDr-www ocf:linbit:drbd \
    params drbd_resource="r-www" \
    op start interval="0" timeout="240s" \
    op stop interval="0" timeout="120s" \
    op monitor interval="20s" role="Master" timeout="240s" \
    meta migration-threshold="3" failure-timeout="60s"
primitive resDRBDr0 ocf:linbit:drbd \
    params drbd_resource="r0" \
    op start interval="0" timeout="240s" \
    op stop interval="0" timeout="120s" \
    op monitor interval="20s" role="Master" timeout="240s" \
    meta migration-threshold="3" failure-timeout="60s"
primitive resDRBDr1 ocf:linbit:drbd \
    params drbd_resource="r1" \
    op start interval="0" timeout="240s" \
    op stop interval="0" timeout="120s" \
    op monitor interval="20s" role="Master" timeout="240s" \
    meta migration-threshold="3" failure-timeout="60s"
primitive resDRBDr2 ocf:linbit:drbd \
    params drbd_resource="r2" \
    op start interval="0" timeout="240s" \
    op stop interval="0" timeout="120s" \
    op monitor interval="20s" role="Master" timeout="240s" \
    meta migration-threshold="3" failure-timeout="60s"
primitive resDRBDr3 ocf:linbit:drbd \
    params drbd_resource="r3" \
    op start interval="0" timeout="240s" \
    op stop interval="0" timeout="120s" \
    op monitor interval="20s" role="Master" timeout="240s" \
    meta migration-threshold="3" failure-timeout="60s"
primitive resLVM0 ocf:heartbeat:LVM \
    params volgrpname="vg0" \
    op monitor interval="10s" timeout="60s" \
    op start interval="0" timeout="60s" \
    op stop interval="0" timeout="60s" \
    meta migration-threshold="3" failure-timeout="60s"
primitive resLVM1 ocf:heartbeat:LVM \
    params volgrpname="vg1" \
    op monitor interval="10s" timeout="60s" \
    op start interval="0" timeout="60s" \
    op stop interval="0" timeout="60s" \
    meta migration-threshold="3" failure-timeout="60s"
primitive resLVM2 ocf:heartbeat:LVM \
    params volgrpname="vg2" \
    op monitor interval="10s" timeout="60s" \
    op start interval="0" timeout="60s" \
    op stop interval="0" timeout="60s" \
    meta migration-threshold="3" failure-timeout="60s" target-role="Started"
primitive resLVM3 ocf:heartbeat:LVM \
    params volgrpname="vg3" \
    op monitor interval="10s" timeout="60s" \
    op start interval="0" timeout="60s" \
    op stop interval="0" timeout="60s" \
    meta migration-threshold="3" failure-timeout="60s"
primitive resPing ocf:pacemaker:ping \
    params dampen="5s" multiplier="100" host_list="132.206.178.1" attempts="3" \
    op monitor interval="20s" timeout="60s" \
    op start interval="0" timeout="65s" \
    op stop interval="0" timeout="30s"
primitive resStonithHelena stonith:external/ipmi \
    params hostname="helena" ipaddr="192.168.1.17" userid="root" passwd="********" interface="lanplus" \
    op start interval="0" timeout="60s" \
    op stop interval="0" timeout="60s" \
    op monitor interval="3600s" timeout="60"
primitive resStonithPuck stonith:external/ipmi \
    params hostname="puck" ipaddr="192.168.1.16" userid="root" passwd="********" interface="lanplus" \
    op start interval="0" timeout="60s" \
    op stop interval="0" timeout="60s" \
    op monitor interval="3600s" timeout="60"
primitive resXen0 ocf:heartbeat:Xen \
    params xmfile="/etc/xen/matsya.bic.mni.mcgill.ca.cfg" name="matsya.bic.mni.mcgill.ca" \
    op monitor interval="20s" timeout="60s" \
    op start interval="0" timeout="90s" \
    op stop interval="0" timeout="90s" \
    meta migration-threshold="3" failure-timeout="60s"
primitive resXen1 ocf:heartbeat:Xen \
    params xmfile="/etc/xen/kurma.bic.mni.mcgill.ca.cfg" name="kurma.bic.mni.mcgill.ca" \
    op monitor interval="20s" timeout="60s" \
    op start interval="0" timeout="90s" \
    op stop interval="0" timeout="90s" \
    meta migration-threshold="3" failure-timeout="60s" target-role="Started"
primitive resXen2 ocf:heartbeat:Xen \
    params xmfile="/etc/xen/varaha.bic.mni.mcgill.ca.cfg" name="varaha.bic.mni.mcgill.ca" \
    op monitor interval="20s" timeout="60s" \
    op start interval="0" timeout="90s" \
    op stop interval="0" timeout="90s" \
    meta migration-threshold="3" failure-timeout="60s" target-role="Started" is-managed="true"
primitive resXen3 ocf:heartbeat:Xen \
    params xmfile="/etc/xen/narasimha.bic.mni.mcgill.ca.cfg" name="narasimha.bic.mni.mcgill.ca" \
    op monitor interval="20s" timeout="60s" \
    op start interval="0" timeout="90s" \
    op stop interval="0" timeout="90s" \
    meta migration-threshold="3" failure-timeout="60s"
group grpDRBDweb resDRBDr2 resDRBDr-www
ms msDRBDr0 resDRBDr0 \
    meta notify="true" interleave="true"
ms msDRBDr1 resDRBDr1 \
    meta notify="true" interleave="true"
ms msDRBDr2 grpDRBDweb \
    meta notify="true" interleave="true" target-role="Started"
ms msDRBDr3 resDRBDr3 \
    meta notify="true" interleave="true"
clone cloPing resPing \
    meta globally-unique="false"
location locPing0 resXen0 \
    rule $id="locPing-rule" -inf: not_defined pingd or pingd lte 0
location locPing1 resXen1 \
    rule $id="locPing1-rule" -inf: not_defined pingd or pingd lte 0
location locPing2 resXen2 \
    rule $id="locPing2-rule" -inf: not_defined pingd or pingd lte 0
location locPing3 resXen3 \
    rule $id="locPing3-rule" -inf: not_defined pingd or pingd lte 0
location locStonithHelena resStonithHelena -inf: helena
location locStonithPuck resStonithPuck -inf: puck
colocation colLVM0-on-DRBDMaster0 inf: resLVM0 msDRBDr0:Master
colocation colLVM1-on-DRBDMaster1 inf: resLVM1 msDRBDr1:Master
colocation colLVM2-on-DRBDMaster2 inf: resLVM2 msDRBDr2:Master
colocation colLVM3-on-DRBDMaster3 inf: resLVM3 msDRBDr3:Master
colocation colXen0-on-LVM0 inf: resXen0 resLVM0
colocation colXen1-on-LVM1 inf: resXen1 resLVM1
colocation colXen2-on-LVM2 inf: resXen2 resLVM2
colocation colXen3-on-LVM3 inf: resXen3 resLVM3
order ordDRBDr0-before-LVM0 inf: msDRBDr0:promote resLVM0:start
order ordDRBDr1-before-LVM1 inf: msDRBDr1:promote resLVM1:start
order ordDRBDr2-before-LVM2 inf: msDRBDr2:promote resLVM2:start
order ordDRBDr3-before-LVM3 inf: msDRBDr3:promote resLVM3:start
order ordLVM0-before-Xen0 inf: resLVM0 resXen0
order ordLVM1-before-Xen1 inf: resLVM1 resXen1
order ordLVM2-before-Xen2 inf: resLVM2 resXen2
order ordLVM3-before-Xen3 inf: resLVM3 resXen3
property $id="cib-bootstrap-options" \
    dc-version="1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b" \
    cluster-infrastructure="openais" \
    expected-quorum-votes="2" \
    no-quorum-policy="ignore" \
    default-resource-stickiness="100" \
    pe-input-series-max="672" \
    pe-error-series-max="672" \
    pe-warn-series-max="672" \
    stonith-enabled="true" \
    last-lrm-refresh="1315482708" \
    stonith-action="poweroff"
Colocation Crazyness
This doesn't belong here but for the moment it will suffice. This is just for me to remember to look at the colocation semantics more carefully. This is from the pacemaker mailing list gurus (and it's really counterintuitive to say the least).
> If you ever consider do something about it, here is another thing that
> can be lived with, but is non-intuitive.
>
> 1) colocation c1 inf: A B
>
> the most significant is B (if B is stopped nothing else will be running)
>
> 2) colocation c2 inf: A B C
>
> most significant - A
>
> 3) colocation c3 inf: ( A B ) C
>
> most significant - C
>
> 4) colocation c4 inf: ( A B ) C D
>
> most significant - C again
>
> I am trying to find a logic to remember this, but fails so far :)

No wonder. I have a patch ready to fix this, but have never been happy with it. Resources within a resource set have opposite semantics to 2-rsc collocations. But two adjacent resource sets are again as 2-rsc collocations, i.e. the left set follows the right set. Now, just to add to the confusion, in the example 4) above it is not very obvious that there are two sets, and that the second set is "C D". So, 4) should be equivalent to these 5 2-rsc collocations:

A C
B C
A D
B D
D C

What is the difference (if any) of the above to

A D
B D
D C

Well, we leave that to the interested as an exercise ;-)
More info on colocation semantics
colocation foo_on_bar inf: foo bar means that:
1. foo is to run wherever bar runs;
2. if bar is not started, then foo won't run at all;
3. if bar fails and cannot recover, then foo will shut down;
4. if foo is not started, bar will of course run;
5. if foo fails and cannot recover, bar will continue to run.
Cluster Resource Management
The full blown syntax is explained in the pacemaker documentation at
http://www.clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/s-moving-resource.html
The exit return codes of resource agents are explained in http://refspecs.linuxfoundation.org/LSB_4.1.0/LSB-Core-generic/LSB-Core-generic/iniscrptact.html for the LSB-compliant RAs and in http://www.linux-ha.org/wiki/OCF_Resource_Agents for the OCF agents.
See http://www.linux-ha.org/doc/dev-guides/ra-dev-guide.html for info on how to write (and test!) OCF-compliant RAs.
The main tool to manage a pacemaker cluster through a command line interface is the Cluster Resource Manager shell, or crm. There are graphical interfaces out there but they are finicky at times. One that looks promising is lcmc (Linux Cluster Manager Console), a Java GUI that can configure, manage and visualize Linux clusters. It requires Sun (Oracle, pouah!) Java6 but can work on other JREs, or so they say. In the following I'll use crm.
Cluster Status
crm
Use the command crm_mon to query the cluster status along with resource states, etc. The flag -1 is for a one-shot view of the cluster state. You can also group the resources per node (--group-by-node) and show the inactive ones (not shown by default) with --inactive.
puck:~# crm_mon --group-by-node --inactive -1
============
Last updated: Tue Sep 13 14:32:49 2011
Stack: openais
Current DC: helena - partition with quorum
Version: 1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b
2 Nodes configured, 2 expected votes
15 Resources configured.
============
Node puck: online
    resStonithHelena (stonith:external/ipmi) Started
    resDRBDr1:1 (ocf::linbit:drbd) Slave
    resDRBDr3:1 (ocf::linbit:drbd) Slave
    resPing:0 (ocf::pacemaker:ping) Started
    resDRBDr0:1 (ocf::linbit:drbd) Slave
    resDRBDr-www:1 (ocf::linbit:drbd) Slave
    resDRBDr2:1 (ocf::linbit:drbd) Slave
Node helena: online
    resStonithPuck (stonith:external/ipmi) Started
    resDRBDr-www:0 (ocf::linbit:drbd) Master
    resDRBDr1:0 (ocf::linbit:drbd) Master
    resDRBDr2:0 (ocf::linbit:drbd) Master
    resDRBDr3:0 (ocf::linbit:drbd) Master
    resLVM1 (ocf::heartbeat:LVM) Started
    resXen2 (ocf::heartbeat:Xen) Started
    resLVM2 (ocf::heartbeat:LVM) Started
    resLVM3 (ocf::heartbeat:LVM) Started
    resXen3 (ocf::heartbeat:Xen) Started
    resXen1 (ocf::heartbeat:Xen) Started
    resDRBDr0:0 (ocf::linbit:drbd) Master
    resXen0 (ocf::heartbeat:Xen) Started
    resPing:1 (ocf::pacemaker:ping) Started
    resLVM0 (ocf::heartbeat:LVM) Started
Inactive resources:
The crm shell can also be used with crm status.
lcmc
You remotely connect to a cluster using a secure ssh session, from a host with an X11 display, using the Linux Cluster Manager Console with the following command (the jar file can be found on the LCMC web site http://lcmc.sourceforge.net/):
~# java -jar LCMC-1.1.2.jar
After connecting you'll get a graphical view of the cluster (screenshot omitted).
Changing Node Status
To put a node in standby mode issue the command crm node standby <node_name>. The resources located on node_name will then be stopped in an orderly fashion and restarted on the other node, IF there are no cluster constraints that forbid the online node to run the resources!
For instance, if a resource originally running on node1 is manually migrated to node2, then it won't restart anywhere upon putting node2 in standby unless one un-migrates the resource prior to putting node2 offline. See Manually Migrating (Moving) Resources Around for details.
To put the node back online type crm node online <node_name>. Resources might start migrating around, all depending on their stickiness.
Manually Migrating (Moving) Resources Around
To move resources around the cluster use the command crm resource move <rsc> [<node>] [<lifetime>] (or replace move by migrate). On a 2-node cluster you obviously don't have to specify the target node. Used this way, this command will move the specified resource away from its current node. Let's say that on node node-1 you issue the command crm resource move myRes; then behind the scenes a constraint is introduced in the CIB:
<rsc_location rsc="myRes" node="node-1" score="-INFINITY"/>
This will achieve the desired effect but will also have long-term consequences. As the tool will warn you, the creation of a -INFINITY constraint will prevent the resource from ever running on node node-1 again until crm resource unmove myRes (or unmigrate) is issued. This includes the situation where every other cluster node is no longer available.
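A round-trip therefore looks like this (using the hypothetical resource myRes from above):
node-1:~# crm resource move myRes       # adds the -INFINITY location constraint for node-1
node-1:~# crm_mon -1                    # myRes should now be running on the other node
node-1:~# crm resource unmove myRes     # remove the constraint once maintenance is over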
Starting and Stopping Resources
You can stop and start a resource MyRes using the commands:
~# crm resource stop MyRes ~# crm resource start MyRes
There are subtle differences between starting, moving and migrating multi-state resources (master/slave).
One would think that crm resource start would be the inverse of crm resource stop, but in fact it is not… Try crm resource promote for ms resources, or even better, crm resource meta <resource-id> delete target-role, which reverts it to the respective defaults and thus in fact is the inverse of crm resource stop for all resource types.
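As a concrete (hypothetical) example with the master/slave DRBD resource defined earlier:
~# crm resource stop msDRBDr0                       # sets target-role="Stopped" on the ms resource
~# crm resource meta msDRBDr0 delete target-role    # true inverse: back to defaults, masters get promoted again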
To stop all the resources cluster-wide:
~# crm configure property stop-all-resources=true
Cluster-Wide Properties and Resources Management
You can list the cluster resources with any one of the following commands (crm is the cluster shell):
~# crm_resource --list ~# crm resource list ~# crm resource show
To display the current location of ‘myResource’:
~# crm_resource --resource myResource --locate
Use the following to tell the cluster not to manage 'myResource'. The cluster will not attempt to start or stop an unmanaged resource under any circumstances. This is useful when performing maintenance tasks on a resource.
~# crm_resource --resource myResource --set-parameter is-managed --meta --parameter-value false ~# crm resource unmanage myResource
It is possible to shut down Pacemaker but keep the current service state, ie. all services will be kept running on their current node. This is typically used to perform an upgrade of the cluster stack without disturbing its state.
~# crm configure property maintenance-mode=true
This overrides all individual "is-managed=true" instance attributes. Note that there is a subtle difference between the effect of the command above and the following one:
crm configure property is-managed-default=false
as, unlike maintenance-mode, this one is still going to run monitor ops. See http://www.gossamer-threads.com/lists/linuxha/pacemaker/70606#70606 for more details.
Managing Resources Failovers and Failcounts
See http://wiki.lustre.org/index.php/Using_Pacemaker_with_Lustre for example.
Resetting the failcounter
If Pacemaker monitors a resource and finds that it isn't running, by default it will try to restart it on the node. If the resource cannot be restarted on the node, it then migrates the resource to another node. The resource meta attributes migration-threshold="X" failure-timeout="60s" modify this behaviour. In this case the cluster manager will try to restart the resource X times, and upon reaching the threshold will simply update the failcount to infinity (hence forbidding the resource to ever run on this node) and restart it on another node. The failure-timeout="60s" is supposed to reset the failcount (note: in my tests it didn't appear to do so).
It is the administrator’s task to find out the cause of the error, fix it, and reset the failcounter of the resource. This can be achieved by entering:
~# crm resource failcount <resource> delete <node>
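For example, to clear the failcount of the Xen guest resXen1 on puck after the underlying problem has been fixed:
~# crm resource failcount resXen1 delete puck
~# crm_mon -1 --failcounts      # confirm the counter is gone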
“Cleaning up” a Resource
Sometimes it is necessary to “clean up” a resource. Internally, this command removes any information about a resource from the Local Resource Manager on every node and thus forces a complete re-read of the status of that resource. The command syntax is:
~# crm resource cleanup resMyOST
This command removes information about the resource called resMyOST on all nodes. You can see the resources' failcounts with:
~# crm_mon -1 --failcounts ~# crm_mon -1 -f
How to Safely Modify the Cluster Configuration
When one uses the crm shell to modify the CIB of the cluster it is always advisable to verify the cluster's response to the changes one is about to make before committing them. The crm shell will warn of any syntax or semantic errors, but even if it doesn't complain, it doesn't mean that the cluster will react as you think! So the proper way of modifying the CIB is to stay on the safe side and create a shadow config instead of modifying the live CIB. Once satisfied with the shadow CIB, just commit/apply it.
The following creates a shadow CIB called gautama:
- crm(live)configure# cib new gautama
- do your edits/modifications and, once done, at the crm prompt: verify
- if no errors or warnings are issued, use crm configure ptest to calculate the Policy Engine's would-be response to the supplied cluster state change. Note that if dotty is installed and an X11 display session is available, a graph will pop up, eg:

If satisfied, just commit it.
If you don't feel like committing the shadow CIB right away, just save it and return to it later. You can list the shadow CIBs with cib list:
~# crm configure cib list
gautama
then just use it:
~# crm configure cib use gautama
crm(gautama)configure#
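Putting it all together, a full shadow-CIB session (reusing the gautama name from above) might look like this sketch; it assumes the crm shell shipped with Pacemaker 1.0, so the exact prompts and sub-commands may differ slightly:
~# crm configure
crm(live)configure# cib new gautama          # create the shadow CIB and switch to it
crm(gautama)configure# edit                  # make the changes in the shadow CIB
crm(gautama)configure# verify                # check for syntax/semantic errors
crm(gautama)configure# ptest                 # ask the Policy Engine what it would do
crm(gautama)configure# cib commit gautama    # push the shadow CIB to the live cluster
crm(gautama)configure# quit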
Cluster Events and System Logs and Logcheck
Pacemaker and Corosync are extremely verbose. Even in case of a very quiescent cluster a lot is written to the syslog.
A few of those warnings are however completely irrelevant and are just leftovers from code development logging that the developers forgot to remove before committing. They will appear every time the pengine (Policy Engine) timer pops. They can be safely dismissed.
Mar 25 06:35:37 puck crmd: [6124]: info: crm_timer_popped: PEngine Recheck Timer (I_PE_CALC) just popped!
...
Mar 25 06:35:37 puck pengine: [6123]: ERROR: create_notification_boundaries: Creating boundaries for msDRBDr1
Mar 25 06:35:37 puck pengine: [6123]: ERROR: create_notification_boundaries: Creating boundaries for msDRBDr1
Mar 25 06:35:37 puck pengine: [6123]: ERROR: create_notification_boundaries: Creating boundaries for msDRBDr1
Mar 25 06:35:37 puck pengine: [6123]: ERROR: create_notification_boundaries: Creating boundaries for msDRBDr1
Mar 25 06:35:37 puck pengine: [6123]: ERROR: create_notification_boundaries: Creating boundaries for cloOCFSr1
The following events though are a little more worrying, but as long as they don't occur too frequently things should be ok. I've asked on the pacemaker mailing list what they are about, just to have one of the developers tell me to submit a bug report if they show up too often!
Mar 25 11:30:06 puck lrmd: [6121]: WARN: G_SIG_dispatch: Dispatch function for SIGCHLD was delayed 880 ms (> 100 ms) before being called (GSource: 0x22f6fc0)
Mar 25 11:30:06 puck lrmd: [6121]: info: G_SIG_dispatch: started at 1751169259 should have started at 1751169171
Mar 25 11:53:27 puck lrmd: [6121]: WARN: G_SIG_dispatch: Dispatch function for SIGCHLD was delayed 200 ms (> 100 ms) before being called (GSource: 0x22f6fc0)
Mar 25 11:53:27 puck lrmd: [6121]: info: G_SIG_dispatch: started at 1751309357 should have started at 1751309337
The cluster communication ring will sometimes have hiccups:
corosync[6130]: [TOTEM ] Incrementing problem counter for seqid 1938641 iface 132.206.178.60 to [1 of 10]
corosync[6130]: [TOTEM ] Incrementing problem counter for seqid 1938649 iface 132.206.178.60 to [2 of 10]
corosync[6130]: [TOTEM ] Decrementing problem counter for iface 132.206.178.60 to [1 of 10]
corosync[6130]: [TOTEM ] ring 0 active with no faults
If the problem counter ever reaches 10 then the ring will be shut down and manual intervention is required. Note that since there is a dual communication ring, corosync will happily continue its duties if the redundant ring is configured as an active one in /etc/corosync/corosync.conf:
totem {
    version: 2
    ...
    # This specifies the mode of redundant ring, which may be none, active, or passive.
    rrp_mode: active
    ...
}
As explained in the Corosync Configuration section, one can reinstate the dual ring once it has been verified that there is really no problem with it by using the command corosync-cfgtool -r
~# corosync-cfgtool -s
Printing ring status.
Local node ID 1018351236
RING ID 0
    id      = 132.206.178.60
    status  = ring 0 active with no faults
RING ID 1
    id      = 192.168.1.19
    status  = administrative fault
~# corosync-cfgtool -r
~# corosync-cfgtool -s
Printing ring status.
Local node ID 1018351236
RING ID 0
    id      = 132.206.178.60
    status  = ring 0 active with no faults
RING ID 1
    id      = 192.168.1.19
    status  = ring 1 active with no faults
Stonith Events
A small script /root/bin/whodidit.sh
will show stonith events:
Looking for STONITH events in today's log /var/log/syslog:
   ...nope!
Looking for STONITH events in yesterdays's log /var/log/syslog.1:
   ...nope!
Looking for STONITH events in older logs:
  >>> In /var/log/syslog.2.gz ?
   ...nope!
  >>> In /var/log/syslog.3.gz ?
   ...nope!
  >>> In /var/log/syslog.4.gz ?
Mar 21 15:45:51 puck crmd: [6124]: info: te_fence_node: Executing reboot fencing operation (204) on helena (timeout=60000)
Mar 21 15:45:51 puck stonithd: [6119]: info: client tengine [pid: 6124] requests a STONITH operation RESET on node helena
Mar 21 15:45:52 puck stonithd: [6119]: info: Succeeded to STONITH the node helena: optype=RESET. whodoit: puck
  >>> In /var/log/syslog.5.gz ?
   ...nope!
  >>> In /var/log/syslog.6.gz ?
   ...nope!
  >>> In /var/log/syslog.7.gz ?
   ...nope!
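The script itself is not reproduced in these notes; a minimal sketch that produces this kind of output could look like the following (the actual /root/bin/whodidit.sh on the nodes may differ):
(:source:)
#!/bin/sh
# whodidit.sh -- grep the syslogs for stonith/fencing events (sketch)
PATTERN='te_fence_node|requests a STONITH operation|Succeeded to STONITH'

echo "Looking for STONITH events in today's log /var/log/syslog:"
egrep "$PATTERN" /var/log/syslog || echo "   ...nope!"

echo "Looking for STONITH events in yesterday's log /var/log/syslog.1:"
egrep "$PATTERN" /var/log/syslog.1 || echo "   ...nope!"

echo "Looking for STONITH events in older logs:"
for f in /var/log/syslog.[2-7].gz; do
    echo "  >>> In $f ?"
    zcat "$f" | egrep "$PATTERN" || echo "   ...nope!"
done
(:sourceend:)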
Logcheck Filtering
The logcheck utility will generate a lot of warnings and after a while I got annoyed at the amount of email about what looked to me like pretty uninteresting events. So I modified the file /etc/logcheck/ignore.d.server/pacemaker to filter out some of the false-positive warnings:
(:source:)
^\w{3} [ :0-9]{11} [._[:alnum:]-]+ pengine: \[[0-9]+\]: ERROR: create_notification_boundaries: Creating boundaries for.*$
^\w{3} [ :0-9]{11} [._[:alnum:]-]+ corosync\[[0-9]+\]: \[TOTEM \] Incrementing problem counter for seqid [0-9]+ iface [._[:alnum:]-]+ to \[1 of 10\]$
^\w{3} [ :0-9]{11} [._[:alnum:]-]+ corosync\[[0-9]+\]: \[TOTEM \] Decrementing problem counter for iface [._[:alnum:]-]+ to \[1 of 10\]$
^\w{3} [ :0-9]{11} [._[:alnum:]-]+ corosync\[[0-9]+\]: \[TOTEM \] Incrementing problem counter for seqid [0-9]+ iface [._[:alnum:]-]+ to \[2 of 10\]$
^\w{3} [ :0-9]{11} [._[:alnum:]-]+ corosync\[[0-9]+\]: \[TOTEM \] Decrementing problem counter for iface [._[:alnum:]-]+ to \[2 of 10\]$
^\w{3} [ :0-9]{11} [._[:alnum:]-]+ corosync\[[0-9]+\]: \[TOTEM \] ring 0 active with no faults$
^\w{3} [ :0-9]{11} [._[:alnum:]-]+ corosync\[[0-9]+\]: \[TOTEM \] ring 1 active with no faults$
^\w{3} [ :0-9]{11} [._[:alnum:]-]+ lrmd: \[[0-9]+\]: WARN: G_SIG_dispatch: Dispatch function for SIGCHLD was delayed.*$
(:sourceend:)
How STONITH Works
In Pacemaker clusters, all operations (start/monitor/stop) have a timeout.
If this timeout elapses prior to the completion of the operation, the operation is considered failed.
- An HA resource is started on one node.
- If the start succeeds, the resource will be monitored indefinitely.
- If the start fails, the resource will be stopped, then re-started on either the current node, or another node.
- While the resource is being monitored, if the monitor ever fails, the resource will be stopped, then re-started on either the current node, or another node.
- If a resource needs to be stopped, and the stop succeeds, the resource is re-started on either the current node, or another node.
- If a stop fails, the node will be fenced/STONITHd because this is the only safe thing to do. If you can't safely know you've terminated the HA resource, the least-worst course of action is to kill the entire node hard, and right now, because the alternative is potential data corruption and/or data loss.

Given this chain of events, it is critically important that creators of resource agents (i.e. the scripts that start, stop and monitor HA resources) ensure that stop operations always succeed, unless the resource cannot actually be stopped.
The best worst-case failover time is the least amount of time it will take, under the most adverse conditions, for a highly available resource to fail on one node, restart on another node, and become accessible to client systems.
See also http://www.mail-archive.com/pacemaker@oss.clusterlabs.org/msg04312.html
Quick Resume: Provisioning a Xen Guest to Pacemaker
In the following, replace X by the appropriate thing, ie, 1, 2 or higher. So if you want a new Xen guest living on top of logical volume xen_lv4, then replace X by 4. You'll then have a drbd device /dev/drbd4 using resource file /etc/drbd.d/r4.res.
One then creates a physical volume on the drbd block device; that device will be the backing device for a new volume group vg4.
On this VG one then creates 2 logical volumes, one for the guest disk and one for swap. The pacemaker resource will be named resXen4.
The new guest hostname and IP are entirely up to you.
The procedure should take less than 20 minutes, the bulk of the time being spent in the xen-create-image section.
- Create a new 24GB logical volume xen_lvX on the vg_xen volume group on both nodes:
~# lvcreate --size 24G --name xen_lvX vg_xen
  Logical volume "xen_lvX" created
- Add a drbd resource rX that will live on top of the new logical volume.
DO NOT COPY AN ALREADY EXISTING RESOURCE CONFIG FILE IN PLACE IN /etc/drbd.d, like cp /etc/drbd.d/r0.res /etc/drbd.d/new.res. DO NOT DO THIS!! This will completely confuse pacemaker as it continually monitors the drbd resources: DRBD will detect a conflict between the resources and in the best of cases one node will be STONITHed and the cluster won't restart automatically until the stonithed node is back online with the drbd resources resynced (I dunno why). Rather, create it from scratch, or make a copy of an existing resource file in a temporary directory and, when absolutely sure that the resource config is kosher, copy it back into the drbd resource config dir.
- Make sure the resource file rX.res is identical on both nodes. Also double-check that the resource is properly referencing the right drbd and backing devices in the top 3 lines of the /etc/drbd.d/rX.res resource definition file:
resource rX {
    device /dev/drbdX;
    disk   /dev/vg_xen/xen_lvX;
- Finally make sure the port number for the replication link is not in use in any other resource file:
    on puck   { address 10.0.0.1:7790; }
    on helena { address 10.0.0.2:7790; }
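For reference, a complete rX.res pieced together from the two fragments above might look like this sketch (the port must be unique per resource; meta-disk internal is an assumption here, adapt devices and addresses to the local setup):
(:source:)
resource rX {
    device    /dev/drbdX;
    disk      /dev/vg_xen/xen_lvX;
    meta-disk internal;        # assumption: internal metadata

    on puck {
        address 10.0.0.1:7790;
    }
    on helena {
        address 10.0.0.2:7790;
    }
}
(:sourceend:)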
- Initialize the drbd block device. Do the following on both nodes:
~# dd if=/dev/zero of=/dev/vg_xen/xen_lvX bs=512 count=1024
~# drbdadm create-md rX
~# drbdadm attach rX
~# drbdadm syncer rX
~# drbdadm connect rX
drbdadm attach/syncer/connect can be combined into just one command: drbdadm up rX.
- Finally, perform the next step on only one of the nodes:
~# drbdadm -- --overwrite-data-of-peer primary rX
- Create a physical volume on the drbd block device /dev/drbdX on the node that has the primary role:
~# pvcreate /dev/drbdX
- Create a Volume Group on the new Physical Volume:
~# vgcreate vgX /dev/drbdX
- Create 2 logical volumes that will hold the guest's disk (20GB) and swap (4GB). Note that you might have to check the number of logical extents available as there might be round-off errors.
~# lvcreate -l 5120 -n xenX-disk vgX
~# lvcreate -l 1023 -n xenX-swap vgX
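To see how many extents are actually free before picking the -l values above, query the VG first:
~# vgdisplay vgX | grep Free    # "Free  PE / Size" line shows the available extents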
- Create the new guest. Don't create the swap file (--noswap below); we'll do it manually later. Note that one has to specify the option --image-dev=/dev/vgX/xenX-disk. This is necessary as otherwise the xen-create-image command will try to create a logical volume for the disk image (by default its name will be $hostname-disk) in the volume group specified by --lvm=xenX-disk. But since we already did it (and there are no more free extents available in the VG) the command will exit with an error message.
~# xen-create-image --hostname=XXX.bic.mni.mcgill.ca --ip=132.206.178.XXX --arch=amd64 \ --role=udev --lvm=xenX-disk --noswap --image-dev=/dev/vgX/xenX-disk --output=/etc/xen
- Verify that the guest config file is indeed in /etc/xen.
- Boot the new guest with the xm-tools and test that all is ok:
~# xm create -c /etc/xen/XXX.bic.mni.mcgill.ca.cfg
- Once satisfied that the guest is healthy, shut it down:
~# xm shutdown XXX.bic.mni.mcgill.ca
- Add the swap and update the guest config file:
~# mkswap /dev/vgX/xenX-swap
- Make the guest config file refer to the new swap file just created. Copy it to the peer and restart the guest, log in, add the swap to the fstab and tell the guest to enable it with swapon -a:
disk = [
        'phy:/dev/vgX/xenX-disk,xvda2,w',
        'phy:/dev/vgX/xenX-swap,xvda1,w',
      ]
- Verify that the guest can migrate to the peer node. On the primary node running the guest:
~# xm shutdown XXX.bic.mni.mcgill.ca
~# vgchange -a n vgX
~# drbdadm secondary rX
On the peer node (still in secondary drbd state), promote drbd to the primary state, make the VG available and start the guest. Don't forget to copy the guest config file over to the peer!
~# drbdadm primary rX
~# vgchange -a y vgX
~# xm create -c /etc/xen/XXX.bic.mni.mcgill.ca.cfg
Once satisfied that the guest is healthy, shut it down, turn off access to the LV and shut down the drbd block device. From now on, pacemaker will take over. Make absolutely sure that the drbd resource is down and stopped on both nodes before having pacemaker manage the new guest.
~# xm shutdown XXX.bic.mni.mcgill.ca
~# vgchange -a n vgX
~# drbdadm down rX
- Update the pacemaker CIB using the crm shell. Make sure to replace the hostname and config file for the guest in the Xen resource primitive definition resXenX below.
(:source:)
~# crm configure
primitive resDRBDrX ocf:linbit:drbd params drbd_resource="rX" op start interval="0" timeout="240s" op stop interval="0" timeout="100s" op monitor interval="30s" role="Master" timeout="240s" op monitor interval="31s" role="Slave" timeout="240s" meta migration-threshold="3" failure-timeout="120s"
primitive resLVMX ocf:heartbeat:LVM params volgrpname="vgX" op monitor interval="10s" timeout="60s" op start interval="0" timeout="60s" op stop interval="0" timeout="60s" meta migration-threshold="3" failure-timeout="60s"
primitive resXenX ocf:heartbeat:Xen params xmfile="/etc/xen/XXX.bic.mni.mcgill.ca.cfg" name="XXX.bic.mni.mcgill.ca" op monitor interval="20s" timeout="60s" op start interval="0" timeout="90s" op stop interval="0" timeout="60s" op migrate_from interval="0" timeout="120s" op migrate_to interval="0" timeout="120s" meta migration-threshold="3" failure-timeout="120s"
ms msDRBDrX resDRBDrX meta notify="true" interleave="true"
location locPingXenX resXenX rule -inf: not_defined pingd or pingd lte 0
colocation colLVMX-on-DRBDMasterX inf: resLVMX msDRBDrX:Master
colocation colXenX-on-LVMX inf: resXenX resLVMX
order ordDRBDrX-before-LVMX inf: msDRBDrX:promote resLVMX:start
order ordLVMX-before-XenX inf: resLVMX resXenX
verify
commit
quit
(:sourceend:)
Cluster and Node Upgrades
There are 3 ways to upgrade the cluster stack:
- complete cluster shutdown and upgrade.
- rolling upgrade, doing one node at a time.
- disconnect and reattach, by leaving the resources un-managed during the upgrade.
I’ll just discuss the 2nd variant.
Cluster Rolling Upgrade
In this case one goes from node to node, one at a time, and performs the following sequence of actions (a command-level sketch follows the list):
- Put the node in standby mode. Some resources might migrate.
- Shut down the cluster stack. Stonith might shoot the node in the head if enabled; restart if necessary.
- Upgrade Pacemaker and maybe also the cluster stack (Corosync). Might as well update/upgrade the underlying OS while at it.
- Restart the cluster stack.
- Put the node back in online mode.
- Repeat for every other node.
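On Debian, the per-node sequence might boil down to something like this (the package manager commands are assumptions for this Squeeze-era stack):
puck:~# crm node standby puck
puck:~# /etc/init.d/corosync stop
puck:~# aptitude update && aptitude safe-upgrade    # pacemaker, corosync, kernel, etc.
puck:~# /etc/init.d/corosync start
puck:~# crm_mon -1                                  # wait until the node has rejoined cleanly
puck:~# crm node online puck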
Local Guests Setup
Name | CPU | RAM | Weight | Cap | Service | IP |
Dom0 | 2* | 2G** | 512 | 0 | pacemaker | 132.206.178.60,61 |
matsya | 2 | 2G | 256 | 0 | munin nagios ntp2 | 132.206.178.240 |
kurma | 2 | 2G | 256 | 0 | smtp | 132.206.178.241 |
varaha | * | * | * | 0 | | 132.206.178.242 |
narasimha | 1 | 1G | 256 | 0 | ntp1 ypmaster | 132.206.178.243 |
shadow | 1 | 1G | 256 | 0 | bind ntp3 | 132.206.178.7 |
* pinned cpus
** non-ballooning memory
20141025. varaha has been decommissioned as there seems to be a memory problem on helena whereby only 10GB of memory is currently available, precluding the migration of all the DomUs to helena.
The credit scheduler assigns each domain a weight, and optionally a cap. The weight indicates the relative CPU allocation of a domain — if the CPU is scarce, a domain with a weight of 512 will receive twice as much CPU time as a domain with a weight of 256 (the default.) The cap sets an absolute limit on the amount of time a domain can receive, expressed in hundredths of a CPU (note that this number can exceed 100 on multiprocessor hosts.)
The scheduler transforms the weight into a credit allocation for each VCPU, using a separate accounting thread. As a VCPU runs, it consumes credits. Once the VCPU runs out of credits, it only runs when other, more thrifty VCPUs have finished executing. Periodically, the accounting thread goes through and gives everybody more credits.
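The current weight and cap of a domain can be inspected (and changed) at runtime with xm; for example, for the Dom0 (domain id 0):
~# xm sched-credit -d 0            # show current weight/cap of Domain-0
~# xm sched-credit -d 0 -w 512     # set the weight, as done in rc.local below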
In the present setup, the Dom0 credit scheduling weight is set at 512, twice as much as the guest default value of 256, and since this is not a configurable setting it is set at boot time in the privileged domain (via /etc/rc.local, below). The file /proc/xen/capabilities only exists on a host running Xen.
(:source:)
#!/bin/sh -e
#
# rc.local
#
# This script is executed at the end of each multiuser runlevel.
# Make sure that the script will "exit 0" on success or any other
# value on error.
#
# In order to enable or disable this script just change the execution
# bits.
#
# By default this script does nothing.

# For munin ip_ plugin stats
iptables -A INPUT  -d 132.206.178.61
iptables -A OUTPUT -s 132.206.178.61
iptables -A INPUT  -d 10.0.0.1
iptables -A OUTPUT -s 10.0.0.1
iptables -A INPUT  -d 192.168.1.18
iptables -A OUTPUT -s 192.168.1.18

# Xen credit scheduling weight for the Dom0 (twice the default guest value).
# Only do this if we are the privileged (control) domain.
if [ -e /proc/xen/capabilities ]; then
    xm sched-credit -d 0 -w 512
fi

exit 0
(:sourceend:)
Xen Guest Specific Local Configuration
Some guests have to be locally tuned so that the services they provide are accessible. So here I list the specific locally tuned stuff.
Narasimha and NIS
In order for this guest to serve NIS to the Andrew Reader Cluster (hidden behind the 192.168.86.xxx private network) some network legwork has to be performed.
In /etc/network/interfaces a network interface on 192.168.86.xxx has to be enabled:
(:source:)
# This file describes the network interfaces available on your system
# and how to activate them. For more information, see interfaces(5).

# The loopback network interface
auto lo
iface lo inet loopback

# The primary network interface
auto eth0
iface eth0 inet static
        address 132.206.178.243
        gateway 132.206.178.1
        netmask 255.255.255.0
        broadcast 132.206.178.255
#       post-up ethtool -K eth0 tx off
# The commented out line above will disable TCP checksumming which
# might resolve problems for some users. It is disabled by default.

# eth0:0 - 192.168.86.xxx to allow access to Andrew Reader cluster.
auto eth0:0
iface eth0:0 inet static
        address 192.168.86.243
        netmask 255.255.255.0
(:sourceend:)
Also, a network route has to be created using iptables. In /etc/rc.local:
(:source:)
#!/bin/sh -e
#
# rc.local
#
# This script is executed at the end of each multiuser runlevel.
# Make sure that the script will "exit 0" on success or any other
# value on error.
#
# In order to enable or disable this script just change the execution
# bits.
#
# By default this script does nothing.

# eth0:0 - 192.168.86.xxx to allow access to Andrew Reader cluster.
iptables -A INPUT  -d 192.168.86.243/24
iptables -A OUTPUT -d 192.168.86.243/24
exit 0
(:sourceend:)