Highly Available iSCSI Storage with SCST, Pacemaker, DRBD and OCFS2 - Part 2

22 minute read, Mar 02, 2016

This is a continuation of the Highly Available iSCSI Storage with SCST, Pacemaker, DRBD and OCFS2 series. We have set up the HA backing iSCSI storage, and now we are going to set up HA shared storage on the client side.

iSCSI Client (Initiator) Servers Setup

What we have so far is a block device that we can access from our clients via iSCSI over an IP network. However, iSCSI is a stateful protocol that does not provide any state persistence on restart or state sharing between different sessions. This means the data written to the iSCSI device from one client is not visible to another client connected to the same device via its own iSCSI session; that client would need to close its current session and reconnect to the target to see the data written by the other client. This in turn means we need an additional layer on top of the iSCSI target that keeps the data consistent across clients in real time. This can be achieved with a cluster-aware file system such as GFS2 or OCFS2, which provides safe cluster-wide file locking.

We start by installing the client packages on both servers (drbd01 and drbd02), which are running Ubuntu 14.04:

# aptitude install -y open-iscsi open-iscsi-utils multipath-tools

First we are going to set up multipathing. This enables us to mitigate the effect of a network card failure on the client side by providing two, or more, different network paths to the same target. An illustration of this can be seen in the picture below.

iSCSI Client

Discover the targets and login to the LUN:

root@drbd01:~# iscsiadm -m discovery -t st -p 192.168.0.180
192.168.0.180:3260,1 iqn.2016-02.local.virtual:virtual.vg1
10.20.1.180:3260,1 iqn.2016-02.local.virtual:virtual.vg1

We log in to both of them:

root@drbd01:~# iscsiadm -m node -T iqn.2016-02.local.virtual:virtual.vg1 -p 192.168.0.180 --login
Logging in to [iface: default, target: iqn.2016-02.local.virtual:virtual.vg1, portal: 192.168.0.180,3260] (multiple)
Login to [iface: default, target: iqn.2016-02.local.virtual:virtual.vg1, portal: 192.168.0.180,3260] successful.

root@drbd01:~# iscsiadm -m node -T iqn.2016-02.local.virtual:virtual.vg1 -p 10.20.1.180 --login
Logging in to [iface: default, target: iqn.2016-02.local.virtual:virtual.vg1, portal: 10.20.1.180,3260] (multiple)
Login to [iface: default, target: iqn.2016-02.local.virtual:virtual.vg1, portal: 10.20.1.180,3260] successful.

root@drbd01:~# iscsiadm -m node -P 1
Target: iqn.2016-02.local.virtual:virtual.vg1
    Portal: 192.168.0.180:3260,1
        Iface Name: default
    Portal: 10.20.1.180:3260,1
        Iface Name: default

We can check the iSCSI sessions:

root@drbd01:~# iscsiadm -m session -P 1
Target: iqn.2016-02.local.virtual:virtual.vg1
    Current Portal: 192.168.0.180:3260,1
    Persistent Portal: 192.168.0.180:3260,1
        **********
        Interface:
        **********
        Iface Name: default
        Iface Transport: tcp
        Iface Initiatorname: iqn.1993-08.org.debian:01:8d74927a5fe7
        Iface IPaddress: 192.168.0.176
        Iface HWaddress: <empty>
        Iface Netdev: <empty>
        SID: 1
        iSCSI Connection State: LOGGED IN
        iSCSI Session State: LOGGED_IN
        Internal iscsid Session State: NO CHANGE
    Current Portal: 10.20.1.180:3260,1
    Persistent Portal: 10.20.1.180:3260,1
        **********
        Interface:
        **********
        Iface Name: default
        Iface Transport: tcp
        Iface Initiatorname: iqn.1993-08.org.debian:01:8d74927a5fe7
        Iface IPaddress: 10.20.1.16
        Iface HWaddress: <empty>
        Iface Netdev: <empty>
        SID: 2
        iSCSI Connection State: LOGGED IN
        iSCSI Session State: LOGGED_IN
        Internal iscsid Session State: NO CHANGE

and we can see that a new block device has been created for each target we logged in to (/dev/sdc and /dev/sdd):

root@drbd01:~# fdisk -l /dev/sdc
Disk /dev/sdc: 21.5 GB, 21470642176 bytes
64 heads, 32 sectors/track, 20476 cylinders, total 41934848 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 524288 bytes
Disk identifier: 0xda6e926c
Disk /dev/sdc doesn't contain a valid partition table
 
root@drbd01:~# fdisk -l /dev/sdd
Disk /dev/sdd: 21.5 GB, 21470642176 bytes
64 heads, 32 sectors/track, 20476 cylinders, total 41934848 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 524288 bytes
Disk identifier: 0xda6e926c
Disk /dev/sdd doesn't contain a valid partition table
 
root@drbd01:~# lsscsi
[1:0:0:0]    cd/dvd  QEMU     QEMU DVD-ROM     1.4.  /dev/sr0
[2:0:0:0]    disk    QEMU     QEMU HARDDISK    1.4.  /dev/sda
[2:0:1:0]    disk    QEMU     QEMU HARDDISK    1.4.  /dev/sdb
[3:0:0:0]    disk    SCST_FIO VDISK-LUN01       311  /dev/sdc
[4:0:0:0]    disk    SCST_FIO VDISK-LUN01       311  /dev/sdd

On the server side the sessions state can be seen as follows:

[root@centos01 ~]# scstadmin -list_sessions
 
Collecting current configuration: done.
 
Driver/Target: iscsi/iqn.2016-02.local.virtual:virtual.vg1
 
    Session: iqn.1993-08.org.debian:01:f0dda8483515
 
    Attribute                     Value                                       Writable      KEY
    -------------------------------------------------------------------------------------------
    DataDigest                    None                                        Yes           No
    FirstBurstLength              65536                                       Yes           No
    HeaderDigest                  None                                        Yes           No
    ImmediateData                 Yes                                         Yes           No
    InitialR2T                    No                                          Yes           No
    MaxBurstLength                1048576                                     Yes           No
    MaxOutstandingR2T             1                                           Yes           No
    MaxRecvDataSegmentLength      1048576                                     Yes           No
    MaxXmitDataSegmentLength      262144                                      Yes           No
    active_commands               0                                           Yes           No
    bidi_cmd_count                0                                           Yes           No
    bidi_io_count_kb              0                                           Yes           No
    bidi_unaligned_cmd_count      0                                           Yes           No
    commands                      0                                           Yes           No
    force_close                   <n/a>                                       Yes           No
    initiator_name                iqn.1993-08.org.debian:01:f0dda8483515      Yes           No
    none_cmd_count                1                                           Yes           No
    read_cmd_count                69466                                       Yes           No
    read_io_count_kb              8467233                                     Yes           No
    read_unaligned_cmd_count      2787                                        Yes           No
    reinstating                   0                                           Yes           No
    sid                           10000013d0200                               Yes           No
    thread_pid                    5003 5004 5005 5006 5007 5008 5009 5010     Yes           No
    unknown_cmd_count             0                                           Yes           No
    write_cmd_count               5201                                        Yes           No
    write_io_count_kb             759565                                      Yes           No
    write_unaligned_cmd_count     2122                                        Yes           No
 
    Session: iqn.1993-08.org.debian:01:f0dda8483515_1
 
    Attribute                     Value                                       Writable      KEY
    -------------------------------------------------------------------------------------------
    DataDigest                    None                                        Yes           No
    FirstBurstLength              65536                                       Yes           No
    HeaderDigest                  None                                        Yes           No
    ImmediateData                 Yes                                         Yes           No
    InitialR2T                    No                                          Yes           No
    MaxBurstLength                1048576                                     Yes           No
    MaxOutstandingR2T             1                                           Yes           No
    MaxRecvDataSegmentLength      1048576                                     Yes           No
    MaxXmitDataSegmentLength      262144                                      Yes           No
    active_commands               0                                           Yes           No
    bidi_cmd_count                0                                           Yes           No
    bidi_io_count_kb              0                                           Yes           No
    bidi_unaligned_cmd_count      0                                           Yes           No
    commands                      0                                           Yes           No
    force_close                   <n/a>                                       Yes           No
    initiator_name                iqn.1993-08.org.debian:01:f0dda8483515      Yes           No
    none_cmd_count                1                                           Yes           No
    read_cmd_count                68719                                       Yes           No
    read_io_count_kb              8434073                                     Yes           No
    read_unaligned_cmd_count      2543                                        Yes           No
    reinstating                   0                                           Yes           No
    sid                           40000023d0200                               Yes           No
    thread_pid                    5003 5004 5005 5006 5007 5008 5009 5010     Yes           No
    unknown_cmd_count             0                                           Yes           No
    write_cmd_count               5051                                        Yes           No
    write_io_count_kb             803872                                      Yes           No
    write_unaligned_cmd_count     1873                                        Yes           No
 
    Session: iqn.1993-08.org.debian:01:8d74927a5fe7
 
    Attribute                     Value                                       Writable      KEY
    -------------------------------------------------------------------------------------------
    DataDigest                    None                                        Yes           No
    FirstBurstLength              65536                                       Yes           No
    HeaderDigest                  None                                        Yes           No
    ImmediateData                 Yes                                         Yes           No
    InitialR2T                    No                                          Yes           No
    MaxBurstLength                1048576                                     Yes           No
    MaxOutstandingR2T             1                                           Yes           No
    MaxRecvDataSegmentLength      1048576                                     Yes           No
    MaxXmitDataSegmentLength      262144                                      Yes           No
    active_commands               0                                           Yes           No
    bidi_cmd_count                0                                           Yes           No
    bidi_io_count_kb              0                                           Yes           No
    bidi_unaligned_cmd_count      0                                           Yes           No
    commands                      0                                           Yes           No
    force_close                   <n/a>                                       Yes           No
    initiator_name                iqn.1993-08.org.debian:01:8d74927a5fe7      Yes           No
    none_cmd_count                1                                           Yes           No
    read_cmd_count                93712                                       Yes           No
    read_io_count_kb              12397667                                    Yes           No
    read_unaligned_cmd_count      2476                                        Yes           No
    reinstating                   0                                           Yes           No
    sid                           20000013d0200                               Yes           No
    thread_pid                    5003 5004 5005 5006 5007 5008 5009 5010     Yes           No
    unknown_cmd_count             0                                           Yes           No
    write_cmd_count               31189                                       Yes           No
    write_io_count_kb             10058311                                    Yes           No
    write_unaligned_cmd_count     1831                                        Yes           No
 
    Session: iqn.1993-08.org.debian:01:8d74927a5fe7_1
 
    Attribute                     Value                                       Writable      KEY
    -------------------------------------------------------------------------------------------
    DataDigest                    None                                        Yes           No
    FirstBurstLength              65536                                       Yes           No
    HeaderDigest                  None                                        Yes           No
    ImmediateData                 Yes                                         Yes           No
    InitialR2T                    No                                          Yes           No
    MaxBurstLength                1048576                                     Yes           No
    MaxOutstandingR2T             1                                           Yes           No
    MaxRecvDataSegmentLength      1048576                                     Yes           No
    MaxXmitDataSegmentLength      262144                                      Yes           No
    active_commands               0                                           Yes           No
    bidi_cmd_count                0                                           Yes           No
    bidi_io_count_kb              0                                           Yes           No
    bidi_unaligned_cmd_count      0                                           Yes           No
    commands                      0                                           Yes           No
    force_close                   <n/a>                                       Yes           No
    initiator_name                iqn.1993-08.org.debian:01:8d74927a5fe7      Yes           No
    none_cmd_count                1                                           Yes           No
    read_cmd_count                93665                                       Yes           No
    read_io_count_kb              12370128                                    Yes           No
    read_unaligned_cmd_count      2617                                        Yes           No
    reinstating                   0                                           Yes           No
    sid                           30000023d0200                               Yes           No
    thread_pid                    5003 5004 5005 5006 5007 5008 5009 5010     Yes           No
    unknown_cmd_count             0                                           Yes           No
    write_cmd_count               30986                                       Yes           No
    write_io_count_kb             10179922                                    Yes           No
    write_unaligned_cmd_count     1964                                        Yes           No

All done.

Multipathing

Then we create the main Multipath configuration file /etc/multipath.conf:

defaults {
    user_friendly_names    yes
    # Use 'mpathn' names for multipath devices
 
    path_grouping_policy    multibus
    # Place all paths in one priority group
 
    path_checker    readsector0
    # Method to determine the state of a path
 
    polling_interval    3
    # How often (in seconds) to poll state of paths
 
    path_selector    "round-robin 0"
    # Algorithm to determine what path to use for next I/O operation
 
    failback    immediate
    # Failback to highest priority path group with active paths
 
    features    "0"
    no_path_retry    1
}
blacklist {
    wwid    0QEMU_QEMU_HARDDISK_drive-scsi0
    wwid    0QEMU_QEMU_HARDDISK_drive-scsi1
    devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
    devnode "^(hd|xvd|vd)[a-z]*"
    devnode "ofsctl"
    devnode "^asm/*"
}
multipaths {
  multipath {
    wwid    23238363932313833
    # alias here can be anything descriptive for your LUN
    alias    mylun
  }
}

We find the WWID (World Wide Identifier) of the disks as follows:

root@drbd01:~# /lib/udev/scsi_id --whitelisted --device=/dev/sda
0QEMU    QEMU HARDDISK   drive-scsi0
 
root@drbd01:~# /lib/udev/scsi_id --whitelisted --device=/dev/sdb
0QEMU    QEMU HARDDISK   drive-scsi1
 
root@drbd01:~# /lib/udev/scsi_id --whitelisted --device=/dev/sdc
23238363932313833
 
root@drbd01:~# /lib/udev/scsi_id --whitelisted --device=/dev/sdd
23238363932313833
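
With more devices, a small loop around the same scsi_id call (a convenience sketch; adjust the device glob if your disks are named differently) prints the WWID of every SCSI disk at once:

# for dev in /dev/sd?; do echo -n "$dev: "; /lib/udev/scsi_id --whitelisted --device=$dev; done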

The sda and sdb devices are the system disks the VM is running off, so we only want the iSCSI devices considered by multipath, which explains the blacklist in the config above. Now, after restarting the multipathd daemon:

root@drbd01:~# service multipath-tools restart

we can see:

root@drbd01:~# multipath -v2
Feb 26 14:14:12 | sdc: rport id not found
Feb 26 14:14:12 | sdd: rport id not found
create: mylun (23238363932313833) undef SCST_FIO,VDISK-LUN01    
size=20G features='0' hwhandler='0' wp=undef
`-+- policy='round-robin 0' prio=1 status=undef
  |- 3:0:0:0 sdc 8:32 undef ready running
  `- 4:0:0:0 sdd 8:48 undef ready running
 
root@drbd01:~# multipath -ll
mylun (23238363932313833) dm-0 SCST_FIO,VDISK-LUN01    
size=20G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  |- 3:0:0:0 sdc 8:32 active ready running
  `- 4:0:0:0 sdd 8:48 active ready running

The multipath tool has also created its own mapper device:

root@drbd01:~# ls -l /dev/mapper/mylun
lrwxrwxrwx 1 root root 7 Feb 26 14:14 /dev/mapper/mylun -> ../dm-0

Now that we have multipath set up, we want to reduce the failover timeout, which is 120 seconds by default:

root@drbd01:~# iscsiadm -m node -T iqn.2016-02.local.virtual:virtual.vg1 | grep node.session.timeo.replacement_timeout
node.session.timeo.replacement_timeout = 120
node.session.timeo.replacement_timeout = 120

so that we get faster failover upon failure detection:

root@drbd01:~# iscsiadm -m node -T iqn.2016-02.local.virtual:virtual.vg1 -o update -n node.session.timeo.replacement_timeout -v 10
root@drbd01:~# iscsiadm -m node -T iqn.2016-02.local.virtual:virtual.vg1 | grep node.session.timeo.replacement_timeout
node.session.timeo.replacement_timeout = 10
node.session.timeo.replacement_timeout = 10
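
The same timeout can also be set as the default for any targets discovered later in /etc/iscsi/iscsid.conf (a sketch; the -o update command above is still needed for node records that already exist):

...
node.session.timeo.replacement_timeout = 10
...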

We also set the initiator to connect to the target automatically on reboot:

root@drbd01:~# iscsiadm -m node -T iqn.2016-02.local.virtual:virtual.vg1 -o update -n node.startup -v automatic
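
These node settings are applied when a session is established, so for sessions that are already logged in we would log out and back in for the new timeout to take effect (a hedged example; it briefly drops the paths, so do it before putting any data on the device):

# iscsiadm -m node -T iqn.2016-02.local.virtual:virtual.vg1 --logout
# iscsiadm -m node -T iqn.2016-02.local.virtual:virtual.vg1 --login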

Testing path failover

To test, we first create a file system on top of the multipath device and mount it:

root@drbd01:~# mkfs.ext4 /dev/mapper/mylun
root@drbd01:~# mkdir /share
root@drbd01:~# mount /dev/mapper/mylun /share -o _netdev
root@drbd01:~# cat /proc/mounts | grep mylun
/dev/mapper/mylun /share ext4 rw,relatime,stripe=128,data=ordered 0 0

Then we create the following test script multipath_test.sh:

#!/bin/bash
interval=1
while true; do
    ts=`date "+%Y.%m.%d-%H:%M:%S"`
    echo $ts > /share/file-${ts}
    echo "/share/file-${ts}...waiting $interval second(s)"
    sleep $interval
done 

that will keep creating files in the mount point in a loop. Now we start the script in one terminal and monitor the multipath state in another:

root@drbd01:~# multipath -ll
mylun (23238363932313833) dm-0 SCST_FIO,VDISK-LUN01    
size=20G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  |- 3:0:0:0 sdc 8:32 active ready running
  `- 4:0:0:0 sdd 8:48 active ready running

We bring down one of the multipath interfaces:

root@drbd01:~# ifdown eth2

and check the status again:

root@drbd01:~# multipath -ll
mylun (23238363932313833) dm-0 SCST_FIO,VDISK-LUN01    
size=20G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  |- 3:0:0:0 sdc 8:32 active ready  running
  `- 4:0:0:0 sdd 8:48 failed faulty running

We can see the path is in a failed state, but that did not affect the script at all:

root@drbd01:~# bash multipath_test.sh
/share/file-2016.02.26-14:41:23...waiting 1 second(s)
/share/file-2016.02.26-14:41:24...waiting 1 second(s)
/share/file-2016.02.26-14:41:25...waiting 1 second(s)
/share/file-2016.02.26-14:41:26...waiting 1 second(s)
...
/share/file-2016.02.26-14:43:29...waiting 1 second(s)
/share/file-2016.02.26-14:43:30...waiting 1 second(s)
/share/file-2016.02.26-14:43:31...waiting 1 second(s)
/share/file-2016.02.26-14:43:32...waiting 1 second(s)
/share/file-2016.02.26-14:43:33...waiting 1 second(s)
^C
root@drbd01:~#

It kept running and created 130 files:

root@drbd01:~# ls -ltrh /share/file-2016.02.26-14* | wc -l
130

over the 130 seconds it was running. Now we bring eth2 back up:

root@drbd01:~# ifup eth2
 
root@drbd01:~# multipath -ll
mylun (23238363932313833) dm-0 SCST_FIO,VDISK-LUN01    
size=20G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=0 status=active
  |- 3:0:0:0 sdc 8:32 active ready running
  `- 4:0:0:0 sdd 8:48 active undef running

At the end, we can set the iSCSI client sessions to start automatically in /etc/iscsi/iscsid.conf:

...
#node.startup = manual
node.startup = automatic
...

and update the run levels:

root@drbd01:~# update-rc.d -f open-iscsi remove
root@drbd01:~# update-rc.d open-iscsi start 20 2 3 4 5 . stop 20 0 1 6 .
root@drbd01:~# update-rc.d open-iscsi enable

For the target login:

root@drbd01:~# iscsiadm -m node -T iqn.2016-02.local.virtual:virtual.vg1 -o update -n node.startup -v automatic
root@drbd01:~# iscsiadm -m node -T iqn.2016-02.local.virtual:virtual.vg1 | grep node.startup
node.startup = automatic
node.startup = automatic

and then, if we want the file system to be mounted automatically, we add it to the /etc/fstab file:

...
/dev/mapper/mylun    /share    ext4    _netdev,noatime    0 0

Of course, in our case we are not going to do that since we’ll have Pacemaker take care of this.

Corosync and Pacemaker

Install the cluster stack packages on both servers (drbd01 and drbd02):

# aptitude install -y heartbeat pacemaker corosync fence-agents openais cluster-glue resource-agents openipmi ipmitool

First we generate the Corosync authentication key. To ensure we have enough entropy (since this is running inside a VM), we install haveged first, run corosync-keygen, and then copy the key over to the other server:

root@drbd02:~# aptitude install haveged
root@drbd02:~# service haveged start
root@drbd02:~# corosync-keygen -l
root@drbd02:~# scp /etc/corosync/authkey drbd01:/etc/corosync/authkey

Then we configure Corosync with two rings, as usual, in /etc/corosync/corosync.conf:

totem {
    version: 2
 
    # How long before declaring a token lost (ms)
    token: 3000
 
    # How many token retransmits before forming a new configuration
    token_retransmits_before_loss_const: 10
 
    # How long to wait for join messages in the membership protocol (ms)
    join: 60
 
    # How long to wait for consensus to be achieved before starting a new round of membership configuration (ms)
    consensus: 3600
 
    # Turn off the virtual synchrony filter
    vsftype: none
 
    # Number of messages that may be sent by one processor on receipt of the token
    max_messages: 20
 
    # Stagger sending the node join messages by 1..send_join ms
    send_join: 45
 
    # Limit generated nodeids to 31-bits (positive signed integers)
    clear_node_high_bit: yes
 
    # Disable encryption
    secauth: off
 
    # How many threads to use for encryption/decryption
    threads: 0
 
    # Optionally assign a fixed node id (integer)
    # nodeid: 1234
 
    # Cluster name, needed for DLM or DLM won't start
    cluster_name: iscsi
 
    # This specifies the mode of redundant ring, which may be none, active, or passive.
    rrp_mode: active
 
    interface {
          ringnumber: 0
          bindnetaddr: 10.10.1.19
          mcastaddr: 226.94.1.1
          mcastport: 5404
    }
    interface {
            ringnumber: 1
            bindnetaddr: 192.168.0.177
            mcastaddr: 226.94.41.1
            mcastport: 5405
    }
    transport: udpu
}
nodelist {
    node {
        ring0_addr: 10.10.1.17
        ring1_addr: 192.168.0.176
        nodeid: 1
    }
    node {
        ring0_addr: 10.10.1.19
        ring1_addr: 192.168.0.177
        nodeid: 2
    }
}
quorum {
    provider: corosync_votequorum
    two_node: 1
}
amf {
    mode: disabled
}
service {
     # Load the Pacemaker Cluster Resource Manager
     # if 0: start pacemaker
     # if 1: don't start pacemaker
     ver:       1
     name:      pacemaker
}
aisexec {
        user:   root
        group:  root
}
logging {
        fileline: off
        to_stderr: yes
        to_logfile: no
        to_syslog: yes
    syslog_facility: daemon
        debug: off
        timestamp: on
        logger_subsys {
                subsys: QUORUM
                debug: off
                tags: enter|leave|trace1|trace2|trace3|trace4|trace6
        }
} 

Enable the service /etc/default/corosync:

# start corosync at boot [yes|no]
START=yes

and start it up:

root@drbd02:~# service corosync start
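
Before moving on, we can check that both rings are healthy and that the two nodes see each other (a quick sanity check; the exact output varies with the corosync version):

# corosync-cfgtool -s
# corosync-quorumtool -s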

Make sure it starts on reboot:

root@drbd02:~# update-rc.d corosync defaults
 System start/stop links for /etc/init.d/corosync already exist.

root@drbd02:~# update-rc.d corosync enable
update-rc.d: warning:  start runlevel arguments (none) do not match corosync Default-Start values (2 3 4 5)
update-rc.d: warning:  stop runlevel arguments (none) do not match corosync Default-Stop values (0 1 6)
 Enabling system startup links for /etc/init.d/corosync ...
 Removing any system startup links for /etc/init.d/corosync ...
   /etc/rc0.d/K01corosync
   /etc/rc1.d/K01corosync
   /etc/rc2.d/S19corosync
   /etc/rc3.d/S19corosync
   /etc/rc4.d/S19corosync
   /etc/rc5.d/S19corosync
   /etc/rc6.d/K01corosync
 Adding system startup for /etc/init.d/corosync ...
   /etc/rc0.d/K01corosync -> ../init.d/corosync
   /etc/rc1.d/K01corosync -> ../init.d/corosync
   /etc/rc6.d/K01corosync -> ../init.d/corosync
   /etc/rc2.d/S19corosync -> ../init.d/corosync
   /etc/rc3.d/S19corosync -> ../init.d/corosync
   /etc/rc4.d/S19corosync -> ../init.d/corosync
   /etc/rc5.d/S19corosync -> ../init.d/corosync

Then we start Pacemaker and check the status:

root@drbd02:~# service pacemaker start
 
root@drbd02:~# crm status
Last updated: Mon Feb 29 15:08:16 2016
Last change: Mon Feb 29 13:50:28 2016 via cibadmin on drbd01
Stack: corosync
Current DC: drbd01 (1) - partition with quorum
Version: 1.1.10-42f2063
2 Nodes configured
6 Resources configured
 
Online: [ drbd01 drbd02 ]

We make sure Pacemaker starts after open-iscsi and multipath-tools, which are at S20:

root@drbd02:~# update-rc.d -f pacemaker remove
root@drbd02:~# update-rc.d pacemaker start 50 1 2 3 4 5 . stop 01 0 6 .
root@drbd02:~# update-rc.d pacemaker enable

OCFS2

Now, this part was really painful to set up due to the completely broken OCFS2 cluster stack in Ubuntu 14.04. Install the needed packages on both nodes:

# aptitude install -y ocfs2-tools ocfs2-tools-pacemaker dlm

and we disable all these services from start-up since they are going to be under cluster control:

root@drbd02:~# update-rc.d dlm disable
root@drbd02:~# update-rc.d ocfs2 disable
root@drbd02:~# update-rc.d o2cb disable

For the DLM daemon dlm_controld to start we must have set the cluster_name parameter in totem for Corosync as shown above.

Unfortunately, the DLM init script has a bug too, so we have to create a new one. Back up the existing one first:

root@drbd02:~# mv /etc/init.d/dlm /etc/init.d/dlm.default

and replace it with the dlm init script I have made available for download.

Then we can finally create the DLM resource in Pacemaker and let it manage it:

# crm configure
primitive p_controld ocf:pacemaker:controld \
    op monitor interval="60" timeout="60" \
    op start interval="0" timeout="90" \
    op stop interval="0" timeout="100" \
    params daemon="dlm_controld" \
    meta target-role="Started"
commit
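
Once Pacemaker has started the resource, we can inspect the DLM state with dlm_tool (a hedged check; these commands only return useful data while dlm_controld is running):

# dlm_tool status
# dlm_tool ls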

Next we create the O2CB cluster config in /etc/ocfs2/cluster.conf:

cluster:
    node_count = 2
    name = iscsi
 
node:
    ip_port = 7777
    ip_address = 10.10.1.17
    number = 0
    name = drbd01
    cluster = iscsi
  
node:
    ip_port = 7777
    ip_address = 10.10.1.19
    number = 1
    name = drbd02
    cluster = iscsi 

We can always change these parameters later, add a new node, etc. Enable the service, similarly to what we did for DLM, by editing the settings as shown below in the /etc/default/o2cb file:

...
O2CB_ENABLED=true
O2CB_BOOTCLUSTER=iscsi
... 
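
The same file also holds the O2CB timing tunables whose values show up later in the service o2cb status output; a sketch with the stock defaults (only change these if you know you need to):

...
O2CB_HEARTBEAT_THRESHOLD=31
O2CB_IDLE_TIMEOUT_MS=30000
O2CB_KEEPALIVE_DELAY_MS=2000
O2CB_RECONNECT_DELAY_MS=2000
...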

Now we need to add the Pacemaker resources for the cluster management. As mentioned, the OCFS2 cluster stack in Ubuntu 14.04 is broken, so the usual resource definition in Pacemaker does not work:

primitive p_o2cb ocf:pacemaker:o2cb \
    op monitor interval="60" timeout="60" \
    op start interval="0" timeout="90" \
    op stop interval="0" timeout="100" \
    params stack="pcmk" daemon_timeout="10"

due to a bug in the O2CB OCF agent. Because of that we have to use the LSB startup script in Pacemaker like this:

primitive p_o2cb lsb:o2cb \
    op monitor interval="60" timeout="60" \
    op start interval="0" timeout="90" \
    op stop interval="0" timeout="100" \
    meta target-role="Started"

The second bug is in the OCF Filesystem agent, which only checks whether the cluster type is cman, so on Ubuntu with Corosync the agent fails. To fix this second bug, we edit /usr/lib/ocf/resource.d/heartbeat/Filesystem and find this line:

...
        if [ "X$HA_cluster_type" = "Xcman" ]; then
...

and replace cman with corosync:

...
        if [ "X$HA_cluster_type" = "Xcorosync" ]; then
...
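
The edit can also be applied non-interactively on both nodes, for example with sed (a sketch; back up the agent first since a package upgrade may overwrite the change):

# cp /usr/lib/ocf/resource.d/heartbeat/Filesystem /usr/lib/ocf/resource.d/heartbeat/Filesystem.orig
# sed -i 's/"Xcman"/"Xcorosync"/' /usr/lib/ocf/resource.d/heartbeat/Filesystem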

See the related bug reports for more details.

So, at the end we create the following resources:

primitive p_o2cb lsb:o2cb \
    op monitor interval="60" timeout="60" \
    op start interval="0" timeout="90" \
    op stop interval="0" timeout="100" \
    meta target-role="Started"
primitive p_fs_ocfs2 ocf:heartbeat:Filesystem \
    params device="/dev/mapper/mylun" directory="/share" fstype="ocfs2" options="_netdev,noatime,rw,acl,user_xattr" \
    op monitor interval="20" timeout="40" \
    op start interval="0" timeout="60" \
    op stop interval="0" timeout="60" \
    meta target-role="Stopped"
commit

Notice that we create the OCFS2 file system resource as Stopped since we still haven't created the file system. We have to do it this way because the primitive needs to be under Pacemaker control when the file system gets created, but it should not be running yet. Actually, since we are not using the Pacemaker OCF agent for O2CB here, this does not strictly matter, but it is a good habit anyway.

We now create the file system on the iSCSI multipath device we created before:

root@drbd01:~# mkfs.ocfs2 -b 4K -C 32K -N 4 -L ISCSI /dev/mapper/mylun
mkfs.ocfs2 1.6.4
Cluster stack: classic o2cb
Label: ISCSI
Features: sparse backup-super unwritten inline-data strict-journal-super xattr
Block size: 4096 (12 bits)
Cluster size: 32768 (15 bits)
Volume size: 21470642176 (655232 clusters) (5241856 blocks)
Cluster groups: 21 (tail covers 10112 clusters, rest cover 32256 clusters)
Extent allocator size: 8388608 (2 groups)
Journal size: 134217728
Node slots: 4
Creating bitmaps: done
Initializing superblock: done
Writing system files: done
Writing superblock: done
Writing backup superblock: 3 block(s)
Formatting Journals: done
Growing extent allocator: done
Formatting slot map: done
Formatting quota files: done
Writing lost+found: done
mkfs.ocfs2 successful
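
With the file system in place, we can tell Pacemaker to start the Filesystem resource we created as Stopped above (a hedged example; the resource name must match the primitive defined earlier):

# crm resource start p_fs_ocfs2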

Finally, we create the colocation, clone and order resources so the services run on both nodes, and the final configuration looks like this:

root@drbd02:~# crm configure show
node $id="1" drbd01
node $id="2" drbd02
primitive p_controld ocf:pacemaker:controld \
    op monitor interval="60" timeout="60" \
    op start interval="0" timeout="90" \
    op stop interval="0" timeout="100" \
    params daemon="dlm_controld"
primitive p_fs_ocfs2 ocf:heartbeat:Filesystem \
    params device="/dev/mapper/mylun" directory="/share" fstype="ocfs2" options="_netdev,noatime,rw,acl,user_xattr" \
    op monitor interval="20" timeout="40" \
    op start interval="0" timeout="60" \
    op stop interval="0" timeout="60" \
    meta is-managed="true"
primitive p_o2cb lsb:o2cb \
    op monitor interval="60" timeout="60" \
    op start interval="0" timeout="90" \
    op stop interval="0" timeout="100"
clone cl_dlm p_controld \
    meta globally-unique="false" interleave="true"
clone cl_fs_ocfs2 p_fs_ocfs2 \
    meta globally-unique="false" interleave="true" ordered="true"
clone cl_o2cb p_o2cb \
    meta globally-unique="false" interleave="true"
colocation cl_fs_o2cb inf: cl_fs_ocfs2 cl_o2cb
colocation cl_o2cb_dlm inf: cl_o2cb cl_dlm
order o_dlm_o2cb inf: cl_dlm:start cl_o2cb:start
order o_o2cb_ocfs2 inf: cl_o2cb cl_fs_ocfs2
property $id="cib-bootstrap-options" \
    dc-version="1.1.10-42f2063" \
    cluster-infrastructure="corosync" \
    stonith-enabled="false" \
    no-quorum-policy="ignore" \
    last-lrm-refresh="1456740232"

and crm status shows all is up and running on both nodes:

root@drbd01:~# crm status
Last updated: Mon Feb 29 21:44:50 2016
Last change: Mon Feb 29 21:41:53 2016 via crmd on drbd01
Stack: corosync
Current DC: drbd01 (1) - partition with quorum
Version: 1.1.10-42f2063
2 Nodes configured
6 Resources configured
 
Online: [ drbd01 drbd02 ]
 
 Clone Set: cl_dlm [p_controld]
     Started: [ drbd01 drbd02 ]
 Clone Set: cl_o2cb [p_o2cb]
     Started: [ drbd01 drbd02 ]
 Clone Set: cl_fs_ocfs2 [p_fs_ocfs2]
     Started: [ drbd01 drbd02 ]

and we can see the mount point on both nodes:

root@drbd02:~# cat /proc/mounts | grep share
/dev/mapper/mylun /share ocfs2 rw,noatime,_netdev,heartbeat=local,nointr,data=ordered,errors=remount-ro,atime_quantum=60,coherency=full,user_xattr,acl 0 0

We can further check the OCFS2 services state as well:

root@drbd01:~# service o2cb status
Driver for "configfs": Loaded
Filesystem "configfs": Mounted
Stack glue driver: Loaded
Stack plugin "o2cb": Loaded
Driver for "ocfs2_dlmfs": Loaded
Filesystem "ocfs2_dlmfs": Mounted
Checking O2CB cluster iscsi: Online
Heartbeat dead threshold = 31
  Network idle timeout: 30000
  Network keepalive delay: 2000
  Network reconnect delay: 2000
Checking O2CB heartbeat: Active
 
root@drbd01:~# service ocfs2 status
Active OCFS2 mountpoints:  /share
 
root@drbd01:~# mounted.ocfs2 -f
Device                FS     Nodes
/dev/sdc              ocfs2  drbd01, drbd02
/dev/mapper/mylun     ocfs2  drbd01, drbd02
/dev/sdd              ocfs2  drbd01, drbd02

In case of errors in crm, we can clean up the resources:

# crm_resource -r all --cleanup

or restart the pacemaker service if needed.

We can now test the clustered file system by creating a test file on one of the nodes and checking if we can see it on the other node:

root@drbd02:~# echo rteergreg > /share/test
 
root@drbd01:~# ls -l /share/
total 0
drwxr-xr-x 2 root root 3896 Feb 29 12:37 lost+found
-rw-r--r-- 1 root root   10 Feb 29 13:06 test

So we created a file on one node and can see it on the second one.

Fail-over Testing

I did some testing to check what happens on the clients when the backend storage fails over. I modified the earlier test script slightly, as fail_test.sh, to append the current time stamp to a file on the share:

#!/bin/bash
interval=1
while true; do
    ts=`date "+%Y.%m.%d-%H:%M:%S"`
    echo $ts >> /share/file-${HOSTNAME}.log
    echo "/share/file-${ts}...waiting $interval second(s)"
    sleep $interval
done

and ran it on both clients drbd01 and drbd02. Then while the script was writing to a file on the share I rebooted the storage master:

[root@centos01 ~]# crm status
Last updated: Wed Mar  2 11:46:03 2016
Last change: Wed Mar  2 11:42:12 2016
Stack: classic openais (with plugin)
Current DC: centos02 - partition with quorum
Version: 1.1.11-97629de
2 Nodes configured, 2 expected votes
12 Resources configured
 
Online: [ centos01 centos02 ]
 
Full list of resources:
 
 Master/Slave Set: ms_drbd_vg1 [p_drbd_vg1]
     Masters: [ centos02 ]
     Slaves: [ centos01 ]
 Resource Group: g_vg1
     p_lvm_vg1    (ocf::heartbeat:LVM):    Started centos02
     p_target_vg1    (ocf::scst:SCSTTarget):    Started centos02
     p_lu_vg1_lun1    (ocf::scst:SCSTLun):    Started centos02
     p_ip_vg1    (ocf::heartbeat:IPaddr2):    Started centos02
     p_ip_vg1_2    (ocf::heartbeat:IPaddr2):    Started centos02
     p_portblock_vg1    (ocf::heartbeat:portblock):    Started centos02
     p_portblock_vg1_unblock    (ocf::heartbeat:portblock):    Started centos02
     p_portblock_vg1_2    (ocf::heartbeat:portblock):    Started centos02
     p_portblock_vg1_2_unblock    (ocf::heartbeat:portblock):    Started centos02
     p_email_admin    (ocf::heartbeat:MailTo):    Started centos02

We reboot the centos02 server:

[root@centos02 ~]# reboot

and could see the cluster detecting the failure and moving the resources to the still running centos01 node:

[root@centos01 ~]# crm status
Last updated: Wed Mar  2 11:46:15 2016
Last change: Wed Mar  2 11:42:12 2016
Stack: classic openais (with plugin)
Current DC: centos02 - partition with quorum
Version: 1.1.11-97629de
2 Nodes configured, 2 expected votes
12 Resources configured
 
Online: [ centos01 centos02 ]
 
Full list of resources:
 
 Master/Slave Set: ms_drbd_vg1 [p_drbd_vg1]
     Masters: [ centos02 ]
     Slaves: [ centos01 ]
 Resource Group: g_vg1
     p_lvm_vg1    (ocf::heartbeat:LVM):    Started centos02
     p_target_vg1    (ocf::scst:SCSTTarget):    Started centos02
     p_lu_vg1_lun1    (ocf::scst:SCSTLun):    Stopped
     p_ip_vg1    (ocf::heartbeat:IPaddr2):    Stopped
     p_ip_vg1_2    (ocf::heartbeat:IPaddr2):    Stopped
     p_portblock_vg1    (ocf::heartbeat:portblock):    Stopped
     p_portblock_vg1_unblock    (ocf::heartbeat:portblock):    Stopped
     p_portblock_vg1_2    (ocf::heartbeat:portblock):    Stopped
     p_portblock_vg1_2_unblock    (ocf::heartbeat:portblock):    Stopped
     p_email_admin    (ocf::heartbeat:MailTo):    Stopped 

and after a short time:

[root@centos01 ~]# crm status
Last updated: Wed Mar  2 11:46:40 2016
Last change: Wed Mar  2 11:46:18 2016
Stack: classic openais (with plugin)
Current DC: centos01 - partition WITHOUT quorum
Version: 1.1.11-97629de
2 Nodes configured, 2 expected votes
12 Resources configured
 
Online: [ centos01 ]
OFFLINE: [ centos02 ]
 
Full list of resources:
 
 Master/Slave Set: ms_drbd_vg1 [p_drbd_vg1]
     Masters: [ centos01 ]
     Stopped: [ centos02 ]
 Resource Group: g_vg1
     p_lvm_vg1    (ocf::heartbeat:LVM):    Started centos01
     p_target_vg1    (ocf::scst:SCSTTarget):    Started centos01
     p_lu_vg1_lun1    (ocf::scst:SCSTLun):    Started centos01
     p_ip_vg1    (ocf::heartbeat:IPaddr2):    Started centos01
     p_ip_vg1_2    (ocf::heartbeat:IPaddr2):    Started centos01
     p_portblock_vg1    (ocf::heartbeat:portblock):    Started centos01
     p_portblock_vg1_unblock    (ocf::heartbeat:portblock):    Started centos01
     p_portblock_vg1_2    (ocf::heartbeat:portblock):    Started centos01
     p_portblock_vg1_2_unblock    (ocf::heartbeat:portblock):    Started centos01
     p_email_admin    (ocf::heartbeat:MailTo):    Started centos01

so the whole transition took around 6 seconds. I then checked the files on the clients and could see the same: I/O was suspended for 6 seconds, and after that the script went on writing to the file it had opened:

root@drbd01:~# more /share/file-drbd01.log
2016.03.02-11:45:44
2016.03.02-11:45:45
2016.03.02-11:45:46
.
.
.
2016.03.02-11:46:01
2016.03.02-11:46:02
2016.03.02-11:46:03
2016.03.02-11:46:09
2016.03.02-11:46:10
2016.03.02-11:46:11
2016.03.02-11:46:12

Load Testing

Just some simple dd tests, with the system and disk caches cleared, to get an idea of the storage speed and limitations.

root@drbd01:~# echo 3 > /proc/sys/vm/drop_caches
root@drbd01:~# dd if=/dev/zero of=/share/test.img bs=1024K count=1500 oflag=direct conv=fsync && sync;sync
1500+0 records in
1500+0 records out
1572864000 bytes (1.6 GB) copied, 80.8872 s, 19.4 MB/s
 
root@drbd01:~# echo 3 > /proc/sys/vm/drop_caches
root@drbd01:~# time cat /share/test.img > /dev/null
real    0m36.464s
user    0m0.023s
sys    0m1.762s
 
root@drbd01:~# echo 3 > /proc/sys/vm/drop_caches
root@drbd01:~# dd if=/share/test.img of=/dev/null iflag=nocache oflag=nocache,sync
3072000+0 records in
3072000+0 records out
1572864000 bytes (1.6 GB) copied, 33.1996 s, 47.4 MB/s

Then, trying to simulate a real-life workload, I started 10 simultaneous processes on both servers:

root@drbd01:~# for i in $(seq 1 10); do { dd if=/share/test.img of=/dev/null iflag=nocache oflag=nocache,sync &}; done
root@drbd02:~# for i in $(seq 1 10); do { dd if=/share/test2.img of=/dev/null iflag=nocache oflag=nocache,sync &}; done

each reading one of the big files. When finished, the processes reported a throughput of 7.4 MB/s in this case.

[1]   Done                    dd if=/share/test.img of=/dev/null iflag=nocache oflag=nocache,sync
[2]   Done                    dd if=/share/test.img of=/dev/null iflag=nocache oflag=nocache,sync
[3]   Done                    dd if=/share/test.img of=/dev/null iflag=nocache oflag=nocache,sync
[4]   Done                    dd if=/share/test.img of=/dev/null iflag=nocache oflag=nocache,sync
[5]   Done                    dd if=/share/test.img of=/dev/null iflag=nocache oflag=nocache,sync
[6]   Done                    dd if=/share/test.img of=/dev/null iflag=nocache oflag=nocache,sync
[7]   Done                    dd if=/share/test.img of=/dev/null iflag=nocache oflag=nocache,sync
[8]   Done                    dd if=/share/test.img of=/dev/null iflag=nocache oflag=nocache,sync
[9]-  Done                    dd if=/share/test.img of=/dev/null iflag=nocache oflag=nocache,sync
[10]+  Done                    dd if=/share/test.img of=/dev/null iflag=nocache oflag=nocache,sync
 
3072000+0 records out
1572864000 bytes (1.6 GB) copied, 212.11 s, 7.4 MB/s
3072000+0 records in
3072000+0 records out

Keep in mind, though, that the whole stack is running on 4 nested VMs when interpreting these results.

This article is Part 2 in a 2-Part Series Highly Available iSCSI Storage with SCST, Pacemaker, DRBD and OCFS2.
