ZFS storage with OmniOS and iSCSI

21 minute read, Aug 29, 2016

The following setup of iSCSI shared storage on a cluster of OmniOS servers was later used as ZFS over iSCSI storage in Proxmox PVE; see Adding ZFS over iSCSI shared storage to Proxmox. It was inspired by the excellent work of Saso Kiselkov and his stmf-ha project; please see the References section at the bottom of this page for details.

OmniOS is an open source continuation of OpenSolaris (discontinued by Oracle when it acquired Sun Microsystems back in 2010) that builds on the IllumOS project, the OpenSolaris reincarnation. ZFS and iSCSI, or COMSTAR (Common Multiprotocol SCSI Target), have been part of Solaris for a very long time, bringing performance and stability to the storage solution.

For the setup I’m using two VMs, omnios01 and omnios02, connected via two networks: the public 192.168.0.0/24 network and a private 10.10.1.0/24 one configured on the hypervisor.

OmniOS installation and initial setup

Download the current stable OmniOS ISO and launch a VM in Proxmox. Start it up and install, accepting the defaults.
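
The VM can be created from the Proxmox GUI or with qm; the sketch below shows roughly what I would use, but the VM ID, memory, bridges, storage name and ISO file name are all illustrative and need adjusting to your environment (the three extra 10GB SATA disks used later for the pool can be added here or attached afterwards):

root@proxmox01:~# qm create 141 --name omnios01 --memory 4096 --sockets 1 --cores 2 \
      --net0 e1000,bridge=vmbr0 --net1 e1000,bridge=vmbr1 \
      --sata0 local-lvm:16 --sata1 local-lvm:10 --sata2 local-lvm:10 --sata3 local-lvm:10 \
      --cdrom local:iso/omnios.iso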

Change the default GRUB boot timeout from 30 to 5 seconds:

root@omnios01:/root# vi /rpool/boot/grub/menu.lst
...
timeout 5
...

Try telling OmniOS we have 2 virtual CPUs:

root@omnios01:/root# eeprom boot-ncpus=2
root@omnios01:/root# psrinfo -vp
The physical processor has 1 virtual processor (0)
  x86 (GenuineIntel F61 family 15 model 6 step 1 clock 1900 MHz)
        Common KVM processor

psrinfo still reports a single virtual processor, even though the VM actually has 1 CPU (socket) with 2 cores.
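
The boot-ncpus property is only read at boot time, so the second core should show up in psrinfo after the next reboot; in the meantime we can at least confirm the value was stored:

root@omnios01:/root# eeprom boot-ncpus      # should print boot-ncpus=2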

Then configure networking:

root@omnios01:/root# ipadm create-if e1000g0
root@omnios01:/root# ipadm create-addr -T static -a local=192.168.0.141/24 e1000g0/v4
root@omnios01:/root# route -p add default 192.168.0.1
root@omnios01:/root# echo 'nameserver 192.168.0.1' >> /etc/resolv.conf
root@omnios01:/root# cp /etc/nsswitch.dns /etc/nsswitch.conf
root@omnios01:/root# ipadm show-addr
ADDROBJ           TYPE     STATE        ADDR
lo0/v4            static   ok           127.0.0.1/8
e1000g0/v4        static   ok           192.168.0.141/24
lo0/v6            static   ok           ::1/128

Secondary interface:

root@omnios01:/root# ipadm create-if e1000g1
root@omnios01:/root# ipadm create-addr -T dhcp e1000g1/dhcp
root@omnios01:/root# ipadm show-addr
ADDROBJ           TYPE     STATE        ADDR
lo0/v4            static   ok           127.0.0.1/8
e1000g0/v4        static   ok           192.168.0.141/24
e1000g1/dhcp      dhcp     ok           10.10.1.13/24
lo0/v6            static   ok           ::1/128

If we want to enable jumbo frames and we have a switch that supports it:

root@omnios01:/root# dladm set-linkprop -p mtu=9000 e1000g0
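
To double-check the link property took effect we can run:

root@omnios01:/root# dladm show-linkprop -p mtu e1000g0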

Configure the hosts file:

  • on omnios01

    127.0.0.1       omnios01
    10.10.1.12      omnios02
    
  • on omnios02

    127.0.0.1       omnios02
    10.10.1.13      omnios01
    

Configure SSH to allow both SSH key and password login for the root user:

root@omnios01:/root# cat /etc/ssh/sshd_config | grep -v ^# | grep .
HostKey /etc/ssh/ssh_host_rsa_key
HostKey /etc/ssh/ssh_host_ecdsa_key
HostKey /etc/ssh/ssh_host_ed25519_key
PermitRootLogin yes
StrictModes yes
RSAAuthentication yes
PubkeyAuthentication yes
AuthorizedKeysFile      .ssh/authorized_keys
HostbasedAuthentication no
IgnoreRhosts yes
PasswordAuthentication yes 
PermitEmptyPasswords no
ChallengeResponseAuthentication no
GSSAPIAuthentication no 
UsePAM yes 
PrintMotd no
TCPKeepAlive yes
UseDNS no
Subsystem       sftp    /usr/libexec/sftp-server
AllowUsers root

and restart ssh service:

root@omnios01:/root# svcadm restart svc:/network/ssh:default

Next check if the STMF service is running:

root@omnios01:/root# svcs -l stmf
fmri         svc:/system/stmf:default
name         STMF
enabled      true
state        online
next_state   none
state_time   25 August 2016 05:06:30 AM UTC
logfile      /var/svc/log/system-stmf:default.log
restarter    svc:/system/svc/restarter:default
dependency   require_all/none svc:/system/filesystem/local:default (online)

and if not, enable it:

root@omnios01:/root# svcadm enable stmf

Then enable the COMSTAR iSCSI target service from the GUI or console:

root@omnios01:/root# svcadm enable -r svc:/network/iscsi/target:default
root@omnios01:/root# svcs -l iscsi/target
fmri         svc:/network/iscsi/target:default
name         iscsi target
enabled      true
state        online
next_state   none
state_time   25 August 2016 05:06:31 AM UTC
logfile      /var/svc/log/network-iscsi-target:default.log
restarter    svc:/system/svc/restarter:default
dependency   require_any/error svc:/milestone/network (online)
dependency   require_all/none svc:/system/stmf:default (online)

If the services are missing, we need to install the storage-server package:

# pkg install group/feature/storage-server
# svcadm enable stmf

The following 3 SATA disks (they have to be on the SATA bus for the VMs, not sure why), in addition to the root disk, have been attached to each VM:

root@omnios01:/root# format
Searching for disks...done

AVAILABLE DISK SELECTIONS:
       0. c2t0d0 <QEMU-HARDDISK-1.4.2 cyl 2085 alt 2 hd 255 sec 63>
          /pci@0,0/pci1af4,1100@7/disk@0,0
       1. c2t1d0 <QEMU-HARDDISK-1.4.2-10.00GB>
          /pci@0,0/pci1af4,1100@7/disk@1,0
       2. c2t2d0 <QEMU-HARDDISK-1.4.2-10.00GB>
          /pci@0,0/pci1af4,1100@7/disk@2,0
       3. c2t3d0 <QEMU-HARDDISK-1.4.2-10.00GB>
          /pci@0,0/pci1af4,1100@7/disk@3,0
Specify disk (enter its number): ^C
root@omnios01:/root#

They will be used to create a new ZFS pool named pool1 from these 3 x 10GB disks using RAIDZ1, which I will then use in my ZFS over iSCSI setup in the PVE cluster.

iSCSI HA

HA packages and stmf-ha setup

Install the pre-built HA packages (Heartbeat, Cluster Glue, Pacemaker, OCF agents) from the bundle created by Saso Kiselkov at http://zfs-create.blogspot.com.au:

root@omnios01:/root# wget http://37.153.99.61/HA.tar.bz2
root@omnios01:/root# tar -xjvf HA.tar.bz2
root@omnios01:/root# cd HA/prebuilt_packages
root@omnios01:/root# gunzip *.gz
root@omnios01:/root# for PKG in *.pkg ; do pkgadd -d $PKG ; done
root@omnios01:/root# vi ~/.profile 
[...]
export PYTHONPATH=/opt/ha/lib/python2.6/site-packages
export PATH=/opt/ha/bin:/opt/ha/sbin:$PATH
export OCF_ROOT=/opt/ha/lib/ocf
export OCF_AGENTS=/opt/ha/lib/ocf/resource.d/heartbeat

root@omnios01:/root# pkg install ipmitool
root@omnios01:/root# pkg install git
root@omnios01:/root# git clone https://github.com/skiselkov/stmf-ha.git
Cloning into 'stmf-ha'...
remote: Counting objects: 72, done.
remote: Total 72 (delta 0), reused 0 (delta 0), pack-reused 72
Unpacking objects: 100% (72/72), done.
Checking connectivity... done.

root@omnios01:/root# cp stmf-ha/heartbeat/ZFS /opt/ha/lib/ocf/resource.d/heartbeat/
root@omnios01:/root# chmod +x /opt/ha/lib/ocf/resource.d/heartbeat/ZFS
root@omnios01:/root# perl -pi -e 's/#DEBUG=0/DEBUG=1/' /opt/ha/lib/ocf/resource.d/heartbeat/ZFS
root@omnios01:/root# mkdir -p /opt/ha/lib/ocf/lib/heartbeat/helpers
root@omnios01:/root# cp stmf-ha/heartbeat/zfs-helper /opt/ha/lib/ocf/lib/heartbeat/helpers/
root@omnios01:/root# chmod +x /opt/ha/lib/ocf/lib/heartbeat/helpers/zfs-helper
root@omnios01:/root# cp stmf-ha/stmf-ha /usr/sbin/
root@omnios01:/root# chmod +x /usr/sbin/stmf-ha
root@omnios01:/root# cp stmf-ha/manpages/stmf-ha.1m /usr/share/man/man1m/
root@omnios01:/root# man stmf-ha

Fix the annoying ps command error for crm:

root@omnios01:/root# perl -pi -e 's#ps -e -o pid,command#ps -e -o pid,comm#' /opt/ha/lib/python2.6/site-packages/crm/utils.py

Fix the IPaddr OCF agent by getting the patched one from Vincenzo’s site; see Use pacemaker and corosync on Illumos (OmniOS) to run a HA active/passive cluster for details:

root@omnios01:/root# cp /opt/ha/lib/ocf/resource.d/heartbeat/IPaddr /opt/ha/lib/ocf/resource.d/heartbeat/IPaddr.default
root@omnios01:/root# wget -O /opt/ha/lib/ocf/resource.d/heartbeat/IPaddr https://gist.githubusercontent.com/vincepii/6763170efa5050d2d73d/raw/bfc0e7df7dda9c673b4e0888240581f7963ff1b6/IPaddr
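
A quick sanity check that the replacement agent is executable and at least emits its metadata (setting OCF_ROOT explicitly in case the ~/.profile exports are not loaded in the current shell):

root@omnios01:/root# chmod +x /opt/ha/lib/ocf/resource.d/heartbeat/IPaddr
root@omnios01:/root# OCF_ROOT=/opt/ha/lib/ocf /opt/ha/lib/ocf/resource.d/heartbeat/IPaddr meta-data | head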

Configure HeartBeat

Create the config file; we can edit the example from the project site.

Based on Saso’s config from the git repo (which assumes a serial link between the nodes for the heartbeat) I ended up with the following /opt/ha/etc/ha.d/ha.cf config file:

# Master Heartbeat configuration file
# This file must be identical on all cluster nodes

# GLOBAL OPTIONS
use_logd        yes             # Logging done in separate process to
                                # prevent blocking on disk I/O
baud            38400           # Run the serial link at 38.4 kbaud
realtime        on              # Enable real-time scheduling and lock
                                # heartbeat into memory to prevent its
                                # pages from ever being swapped out

apiauth cl_status gid=haclient uid=hacluster

# NODE LIST SETUP
# Node names depend on the machine's host name. To protect against
# accidental joins from nodes that are part of other zfsstor clusters
# we do not allow autojoins (plus we use shared-secret authentication).
node            omnios01
node            omnios02
autojoin        none
auto_failback   off

# COMMUNICATION CHANNEL SETUP
#mcast   e1000g0    239.51.12.1 694 1 0     # management network
#mcast   e1000g1    239.51.12.1 694 1 0     # dedicated NIC between nodes
mcast   e1000g0    239.0.0.43 694 1 0
bcast   e1000g1    # dedicated NIC between nodes

# STONITH/FENCING IN CASE OF REAL NODES
# Use ipmi to check power status and reboot nodes
#stonith_host    omnios01 external/ipmi omnios02 192.168.0.141 <ipmi_admin_username> <ipmi_admin_password> lan
#stonith_host    omnios02 external/ipmi omnios01 192.168.0.142 <ipmi_admin_username> <ipmi_admin_password> lan

# NODE FAILURE DETECTION
keepalive       2       # Heartbeats every 2 seconds
warntime        5       # Start issuing warnings after 5 seconds
deadtime        15      # After 15 seconds, a node is considered dead
initdead        60      # Hold off declaring nodes dead for 60 seconds
                        # after Heartbeat startup.

# Enable the Pacemaker CRM
crm                     on
#compression             bz2
#traditional_compression yes

To find the list of available STONITH devices run:

root@omnios02:/root# stonith -L
apcmaster
apcmastersnmp
apcsmart
baytech
cyclades
drac3
external/drac5
external/dracmc-telnet
external/hetzner
external/hmchttp
external/ibmrsa
external/ibmrsa-telnet
external/ipmi
external/ippower9258
external/kdumpcheck
external/libvirt
external/nut
external/rackpdu
external/riloe
external/sbd
external/ssh
external/vcenter
external/vmware
external/xen0
external/xen0-ha
ibmhmc
meatware
null
nw_rpc100s
rcd_serial
rps10
ssh
suicide
wti_mpc
wti_nps
root@omnios02:/root#

and add yours to the configuration if you have one.

Create the authentication file:

root@omnios01:/root# (echo -ne "auth 1\n1 sha1 "; openssl rand -rand /dev/random -hex 16 2> /dev/null) > /opt/ha/etc/ha.d/authkeys
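
Heartbeat refuses to start if the authkeys file is readable by others, so tighten its permissions and copy both files over to the second node (the ha.cf comments above note it must be identical on all cluster nodes):

root@omnios01:/root# chmod 600 /opt/ha/etc/ha.d/authkeys
root@omnios01:/root# scp /opt/ha/etc/ha.d/ha.cf /opt/ha/etc/ha.d/authkeys omnios02:/opt/ha/etc/ha.d/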

Grant sudo access to the hacluster user (on both nodes):

root@omnios01:/root/HA# visudo
[...]
hacluster    ALL=(ALL) NOPASSWD: ALL

Create the logd config file:

root@omnios01:/root/HA# cat /opt/ha/etc/logd.cf
#       File to write debug messages to
#       Default: /var/log/ha-debug
debugfile /var/log/ha-debug

#
#
#       File to write other messages to
#       Default: /var/log/ha-log
logfile        /var/log/ha-log

#
#
#       Octal file permission to create the log files with
#       Default: 0644
logmode        0640


#
#
#       Facility to use for syslog()/logger 
#   (set to 'none' to disable syslog logging)
#       Default: daemon
logfacility    daemon


#       Entity to be shown at beginning of a message
#       generated by the logging daemon itself
#       Default: "logd"
#entity logd


#       Entity to be shown at beginning of _every_ message
#       passed to syslog (not to log files).
#
#       Intended for easier filtering, or safe blacklisting.
#       You can filter on logfacility and this prefix.
#
#       Message format changes like this:
#       -Nov 18 11:30:31 soda logtest: [21366]: info: total message dropped: 0
#       +Nov 18 11:30:31 soda common-prefix: logtest[21366]: info: total message dropped: 0
#
#       Default: none (disabled)
#syslogprefix linux-ha


#       Do we register to apphbd
#       Default: no
#useapphbd no

#       There are two processes running for logging daemon
#               1. parent process which reads messages from all client channels 
#               and writes them to the child process 
#  
#               2. the child process which reads messages from the parent process through IPC
#               and writes them to syslog/disk


#       set the send queue length from the parent process to the child process
#
#sendqlen 256 

#       set the recv queue length in child process
#
#recvqlen 256

and enable the service:

root@omnios01:/root/HA# svcadm enable ha_logd

Finally start Heartbeat manually on both servers, omnios01 and omnios02, and check the status:

root@omnios02:/root# /opt/ha/lib/heartbeat/heartbeat
heartbeat[3153]: 2016/08/26_06:30:47 info: Enabling logging daemon 
heartbeat[3153]: 2016/08/26_06:30:47 info: logfile and debug file are those specified in logd config file (default /etc/logd.cf)
heartbeat[3153]: 2016/08/26_06:30:47 info: Pacemaker support: on
heartbeat[3153]: 2016/08/26_06:30:47 info: **************************
heartbeat[3153]: 2016/08/26_06:30:47 info: Configuration validated. Starting heartbeat 3.0.5
root@omnios02:/root# 

and verify the cluster state:

root@omnios02:/root# crm status
============
Last updated: Fri Aug 26 06:30:54 2016
Stack: Heartbeat
Current DC: omnios02 (641f06f8-65a9-44fd-80f4-96b87e9c4062) - partition with quorum
Version: 1.0.11-6e010d6b0d49a6b929d17c0114e9d2d934dc8e04
2 Nodes configured, unknown expected votes
0 Resources configured.
============

Online: [ omnios01 omnios02 ]

root@omnios02:/root#

After we confirm it is working fine, we can kill the process started above and enable the service:

root@omnios01:/root# svcadm enable heartbeat
root@omnios01:/root# svcs -a | grep heart
online          7:48:21 svc:/application/cluster/heartbeat:default

Next we set some parameters for a 2-node cluster, i.e. disable quorum and STONITH since this is running in VMs:

root@omnios01:/root# crm configure property no-quorum-policy=ignore
root@omnios01:/root# crm configure property stonith-enabled="false"
root@omnios01:/root# crm configure property stonith-action=poweroff

and set some values for resource stickiness (the default of zero means resources will move immediately) and migration threshold (with the default of none, Pacemaker will keep retrying forever on the same node):

root@omnios01:/root# crm configure rsc_defaults resource-stickiness=100
root@omnios01:/root# crm configure rsc_defaults migration-threshold=3
 
root@omnios01:/root# crm configure show
node $id="11dc182d-5096-cd7c-acc6-eb3b3493f314" omnios01
node $id="641f06f8-65a9-44fd-80f4-96b87e9c4062" omnios02
property $id="cib-bootstrap-options" \
        dc-version="1.0.11-6e010d6b0d49a6b929d17c0114e9d2d934dc8e04" \
        cluster-infrastructure="Heartbeat" \
        no-quorum-policy="ignore" \
        stonith-enabled="false" \
        last-lrm-refresh="1472435153" \
        stonith-action="poweroff"
rsc_defaults $id="rsc-options" \
        resource-stickiness="100" \
        migration-threshold="3"
root@omnios01:/root# 

Create the first resource, the cluster VIP address:

root@omnios01:/root# crm configure
crm(live)configure# primitive p_pool1_VIP ocf:heartbeat:IPaddr \
>         params ip="10.10.1.205" cidr_netmask="24" nic="e1000g1" \
>         op monitor interval="10s" \
>         meta target-role="Started"
crm(live)configure# verify
crm(live)configure# commit
crm(live)configure# exit

and check the status again:

root@omnios01:/root# crm status
============
Last updated: Fri Aug 26 10:56:23 2016
Stack: Heartbeat
Current DC: omnios02 (641f06f8-65a9-44fd-80f4-96b87e9c4062) - partition with quorum
Version: 1.0.11-6e010d6b0d49a6b929d17c0114e9d2d934dc8e04
2 Nodes configured, unknown expected votes
1 Resources configured.
============

Online: [ omnios01 omnios02 ]

 p_pool1_VIP    (ocf::heartbeat:IPaddr):        Started omnios01
root@omnios01:/root#

and if we check the addresses on the server we can see the VIP:

root@omnios01:/root# ipadm show-addr
ADDROBJ           TYPE     STATE        ADDR
lo0/v4            static   ok           127.0.0.1/8
e1000g0/v4        static   ok           192.168.0.141/24
e1000g1/cr        static   ok           10.10.1.205/24
lo0/v6            static   ok           ::1/128

In case we want to preserve the primary IP of the e1000g1 interface instead of overwriting it with the VIP, we can create a VNIC and use that for the VIP:

root@omnios01:/root# dladm create-vnic -l e1000g1 VIP1
root@omnios01:/root# dladm show-link
LINK        CLASS     MTU    STATE    BRIDGE     OVER
e1000g0     phys      1500   up       --         --
e1000g1     phys      1500   up       --         --
VIP1        vnic      1500   up       --         e1000g1

crm(live)configure# primitive p_pool1_VIP ocf:heartbeat:IPaddr \
         params ip="10.10.1.205" cidr_netmask="24" nic="VIP1" \
         op monitor interval="10s" \
         meta target-role="Started"
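
Since p_pool1_VIP already exists from the earlier step, the nic parameter has to be changed on the existing resource rather than defined twice; one way to do it (a sketch, the exact crm commands may differ between shell versions):

root@omnios01:/root# crm resource stop p_pool1_VIP
root@omnios01:/root# crm configure edit      # change nic="e1000g1" to nic="VIP1" for p_pool1_VIP
root@omnios01:/root# crm resource start p_pool1_VIP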

root@omnios01:/root# crm configure show
node $id="11dc182d-5096-cd7c-acc6-eb3b3493f314" omnios01 \
        attributes standby="off" online="on"
node $id="641f06f8-65a9-44fd-80f4-96b87e9c4062" omnios02
primitive p_pool1_VIP ocf:heartbeat:IPaddr \
        params ip="10.10.1.205" cidr_netmask="24" nic="VIP1" \
        op monitor interval="10s" \
        meta target-role="Started"
property $id="cib-bootstrap-options" \
        dc-version="1.0.11-6e010d6b0d49a6b929d17c0114e9d2d934dc8e04" \
        cluster-infrastructure="Heartbeat" \
        no-quorum-policy="ignore" \
        stonith-enabled="false" \
        last-lrm-refresh="1472435153" \
        stonith-action="poweroff"
root@omnios01:/root# 

root@omnios01:/root# ipadm show-addr
ADDROBJ           TYPE     STATE        ADDR
lo0/v4            static   ok           127.0.0.1/8
e1000g0/v4        static   ok           192.168.0.141/24
e1000g1/dhcp      dhcp     ok           10.10.1.13/24
VIP1/cr           static   ok           10.10.1.205/24
lo0/v6            static   ok           ::1/128

which is the way I ended up doing it.

Now we can create our ZFS pool:

root@omnios01:/root# zpool create -m /pool1 -o autoexpand=on -o autoreplace=on -o cachefile=none pool1 raidz c2t1d0 c2t2d0 c2t3d0
root@omnios01:/root# zpool status pool1
  pool: pool1
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        pool1       ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            c2t1d0  ONLINE       0     0     0
            c2t2d0  ONLINE       0     0     0
            c2t3d0  ONLINE       0     0     0

errors: No known data errors
root@omnios01:/root#

and set some parameters like lz4 compression:

root@omnios01:/root# zpool set feature@lz4_compress=enabled pool1
root@omnios01:/root# zfs set compression=lz4 pool1
root@omnios01:/root# zfs set atime=off pool1
root@omnios01:/root# zfs list pool1
NAME    USED  AVAIL  REFER  MOUNTPOINT
pool1  5.33G  13.9G  28.0K  /pool1

after that we have the following state:

root@omnios01:/root# zfs get all pool1
NAME   PROPERTY              VALUE                  SOURCE
pool1  type                  filesystem             -
pool1  creation              Mon Aug 29  5:56 2016  -
pool1  used                  5.33G                  -
pool1  available             13.9G                  -
pool1  referenced            28.0K                  -
pool1  compressratio         1.12x                  -
pool1  mounted               yes                    -
pool1  quota                 none                   default
pool1  reservation           none                   default
pool1  recordsize            128K                   default
pool1  mountpoint            /pool1                 local
pool1  sharenfs              off                    default
pool1  checksum              on                     default
pool1  compression           lz4                    local
pool1  atime                 off                    local
pool1  devices               on                     default
pool1  exec                  on                     default
pool1  setuid                on                     default
pool1  readonly              off                    default
pool1  zoned                 off                    default
pool1  snapdir               hidden                 default
pool1  aclmode               discard                default
pool1  aclinherit            restricted             default
pool1  canmount              on                     default
pool1  xattr                 on                     default
pool1  copies                1                      default
pool1  version               5                      -
pool1  utf8only              off                    -
pool1  normalization         none                   -
pool1  casesensitivity       sensitive              -
pool1  vscan                 off                    default
pool1  nbmand                off                    default
pool1  sharesmb              off                    default
pool1  refquota              none                   default
pool1  refreservation        none                   default
pool1  primarycache          all                    default
pool1  secondarycache        all                    default
pool1  usedbysnapshots       0                      -
pool1  usedbydataset         28.0K                  -
pool1  usedbychildren        5.33G                  -
pool1  usedbyrefreservation  0                      -
pool1  logbias               latency                default
pool1  dedup                 off                    default
pool1  mlslabel              none                   default
pool1  sync                  standard               default
pool1  refcompressratio      1.00x                  -
pool1  written               28.0K                  -
pool1  logicalused           6.00G                  -
pool1  logicalreferenced     13.5K                  -
pool1  filesystem_limit      none                   default
pool1  snapshot_limit        none                   default
pool1  filesystem_count      none                   default
pool1  snapshot_count        none                   default
pool1  redundant_metadata    all                    default
root@omnios01:/root# 

root@omnios01:/root# zfs mount
rpool/ROOT/omnios               /
rpool/export                    /export
rpool/export/home               /export/home
rpool                           /rpool
pool1                           /pool1

root@omnios01:/root# mount | grep pool1
/pool1 on pool1 read/write/setuid/devices/nonbmand/exec/xattr/atime/dev=42d0012 on Mon Aug 29 05:56:18 2016

The next step is to copy over the stmf-ha config file so Pacemaker can take control of the COMSTAR resources:

root@omnios01:/root# cp stmf-ha/samples/stmf-ha-sample.conf /pool1/stmf-ha.conf

Now we can create the resource in pacemaker:

primitive p_zfs_pool1 ocf:heartbeat:ZFS \
  params pool="pool1" \
  op start timeout="90" \
  op stop timeout="90"
colocation col_pool1_with_VIP inf: p_zfs_pool1 p_pool1_VIP
order o_pool1_before_VIP inf: p_zfs_pool1 p_pool1_VIP
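
These can be entered in an interactive crm configure session, just like the VIP primitive earlier, followed by verify and commit; a sketch:

root@omnios01:/root# crm configure
crm(live)configure# primitive p_zfs_pool1 ocf:heartbeat:ZFS \
>         params pool="pool1" \
>         op start timeout="90" \
>         op stop timeout="90"
crm(live)configure# colocation col_pool1_with_VIP inf: p_zfs_pool1 p_pool1_VIP
crm(live)configure# order o_pool1_before_VIP inf: p_zfs_pool1 p_pool1_VIP
crm(live)configure# verify
crm(live)configure# commit
crm(live)configure# exit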

After committing the changes we need to start the resource on the node we created the pool on, in this case omnios01:

root@omnios01:/root# crm resource start p_zfs_pool1

after which we can see:

root@omnios01:/root# crm status
============
Last updated: Mon Aug 29 03:39:39 2016
Stack: Heartbeat
Current DC: omnios02 (641f06f8-65a9-44fd-80f4-96b87e9c4062) - partition with quorum
Version: 1.0.11-6e010d6b0d49a6b929d17c0114e9d2d934dc8e04
2 Nodes configured, unknown expected votes
2 Resources configured.
============

Online: [ omnios01 omnios02 ]

 p_pool1_VIP    (ocf::heartbeat:IPaddr):        Started omnios01
 p_zfs_pool1    (ocf::heartbeat:ZFS):   Started omnios01
root@omnios01:/root# 

Now we can create a ZFS over iSCSI storage resource in Proxmox using the VIP address as the portal. I created a VM with ID 109 in Proxmox, which resulted in the pool1/vm-109-disk-1 zvol being created on the OmniOS cluster.
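
For reference, the resulting entry in /etc/pve/storage.cfg on the Proxmox side looks roughly like the sketch below; the storage ID and blocksize are just examples, see the linked Proxmox article for the full setup:

zfs: omnios-pool1
        pool pool1
        portal 10.10.1.205
        target iqn.2010-08.org.illumos:stmf-ha:pool1
        iscsiprovider comstar
        blocksize 4k
        content images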

The last step is enabling compression on the VM root device after it has been created so we can benefit from this feature:

root@omnios01:/root# zfs set compression=lz4 pool1/vm-109-disk-1
root@omnios01:/root# zfs get all pool1/vm-109-disk-1
NAME                 PROPERTY                  VALUE                             SOURCE
pool1/vm-109-disk-1  type                      volume                            -
pool1/vm-109-disk-1  creation                  Mon Aug 29  6:13 2016             -
pool1/vm-109-disk-1  used                      5.33G                             -
pool1/vm-109-disk-1  available                 13.9G                             -
pool1/vm-109-disk-1  referenced                5.33G                             -
pool1/vm-109-disk-1  compressratio             1.12x                             -
pool1/vm-109-disk-1  reservation               none                              default
pool1/vm-109-disk-1  volsize                   6G                                local
pool1/vm-109-disk-1  volblocksize              64K                               -
pool1/vm-109-disk-1  checksum                  on                                default
pool1/vm-109-disk-1  compression               lz4                               local
pool1/vm-109-disk-1  readonly                  off                               default
pool1/vm-109-disk-1  copies                    1                                 default
pool1/vm-109-disk-1  refreservation            none                              default
pool1/vm-109-disk-1  primarycache              all                               default
pool1/vm-109-disk-1  secondarycache            all                               default
pool1/vm-109-disk-1  usedbysnapshots           0                                 -
pool1/vm-109-disk-1  usedbydataset             5.33G                             -
pool1/vm-109-disk-1  usedbychildren            0                                 -
pool1/vm-109-disk-1  usedbyrefreservation      0                                 -
pool1/vm-109-disk-1  logbias                   latency                           default
pool1/vm-109-disk-1  dedup                     off                               default
pool1/vm-109-disk-1  mlslabel                  none                              default
pool1/vm-109-disk-1  sync                      standard                          default
pool1/vm-109-disk-1  refcompressratio          1.12x                             -
pool1/vm-109-disk-1  written                   5.33G                             -
pool1/vm-109-disk-1  logicalused               6.00G                             -
pool1/vm-109-disk-1  logicalreferenced         6.00G                             -
pool1/vm-109-disk-1  snapshot_limit            none                              default
pool1/vm-109-disk-1  snapshot_count            none                              default
pool1/vm-109-disk-1  redundant_metadata        all                               default
pool1/vm-109-disk-1  org.illumos.stmf-ha:lun   1                                 local
pool1/vm-109-disk-1  org.illumos.stmf-ha:guid  600144F721dca2888ba402e411ee3af1  local
root@omnios01:/root# 

Get the I/O stats for the pool:

root@omnios02:/root# zpool iostat -v
               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
pool1        216K  29.7G      0      0      0      1
  raidz1     216K  29.7G      0      0      0      1
    c2t1d0      -      -      0      0      6      5
    c2t2d0      -      -      0      0      5      5
    c2t3d0      -      -      0      0      5      5
----------  -----  -----  -----  -----  -----  -----
rpool       5.74G  10.1G      0      3    123  23.9K
  c2t0d0s0  5.74G  10.1G      0      3    123  23.9K
----------  -----  -----  -----  -----  -----  -----

We can also see a COMSTAR target has been created:

root@omnios01:/root# itadm list-target -v
TARGET NAME                                                  STATE    SESSIONS 
iqn.2010-08.org.illumos:stmf-ha:pool1                        online   0        
        alias:                  -
        auth:                   none (defaults)
        targetchapuser:         -
        targetchapsecret:       unset
        tpg-tags:               default

and the LUN for the Proxmox VM:

root@omnios01:/root# sbdadm list-lu
Found 1 LU(s)
              GUID                    DATA SIZE           SOURCE
--------------------------------  -------------------  ----------------
600144f721dca2888ba402e411ee3af1  6442450944           /dev/zvol/rdsk/pool1/vm-109-disk-1

root@omnios01:/root# stmfadm list-lu -v
LU Name: 600144F721DCA2888BA402E411EE3AF1
    Operational Status: Online
    Provider Name     : sbd
    Alias             : /dev/zvol/rdsk/pool1/vm-109-disk-1
    View Entry Count  : 1
    Data File         : /dev/zvol/rdsk/pool1/vm-109-disk-1
    Meta File         : not set
    Size              : 6442450944
    Block Size        : 512
    Management URL    : not set
    Vendor ID         : SUN     
    Product ID        : COMSTAR         
    Serial Num        : not set
    Write Protect     : Disabled
    Writeback Cache   : Disabled
    Access State      : Active

root@omnios01:/root# zfs list -rH -t volume pool1 
pool1/vm-109-disk-1     3.87G   15.3G   3.87G   -

Install napp-it ZFS appliance (optional)

In this case we don’t really need napp-it; we just need to launch two OmniOS instances and install and configure the HA stack. Napp-it can help though with managing snapshots, clones, backups, rollbacks etc., for which having a web GUI helps a lot.

root@omnios01:/root# wget -O - www.napp-it.org/nappit | perl

and then connect to the web UI at http://serverip:81 when finished. Reboot after installing napp-it, then update napp-it (menu About -> Update) or run:

root@omnios01:/root# pkg update

Moving the resources from one node to another manually

We put the node the resource is running on into standby mode:

root@omnios01:/root# crm node attribute omnios01 set standby on

root@omnios01:/root# crm status
============
Last updated: Mon Aug 29 02:10:59 2016
Stack: Heartbeat
Current DC: omnios02 (641f06f8-65a9-44fd-80f4-96b87e9c4062) - partition with quorum
Version: 1.0.11-6e010d6b0d49a6b929d17c0114e9d2d934dc8e04
2 Nodes configured, unknown expected votes
1 Resources configured.
============

Node omnios01 (11dc182d-5096-cd7c-acc6-eb3b3493f314): standby
Online: [ omnios02 ]

root@omnios01:/root# crm status
============
Last updated: Mon Aug 29 02:11:03 2016
Stack: Heartbeat
Current DC: omnios02 (641f06f8-65a9-44fd-80f4-96b87e9c4062) - partition with quorum
Version: 1.0.11-6e010d6b0d49a6b929d17c0114e9d2d934dc8e04
2 Nodes configured, unknown expected votes
1 Resources configured.
============

Node omnios01 (11dc182d-5096-cd7c-acc6-eb3b3493f314): standby
Online: [ omnios02 ]

 p_pool1_VIP    (ocf::heartbeat:IPaddr):        Started omnios02

and after a couple of seconds we can see the VIP has moved to omnios02:

root@omnios02:/root# ipadm show-addr
ADDROBJ           TYPE     STATE        ADDR
lo0/v4            static   ok           127.0.0.1/8
e1000g0/v4        static   ok           192.168.0.142/24
e1000g1/dhcp      dhcp     ok           10.10.1.12/24
VIP1/cr           static   ok           10.10.1.205/24
lo0/v6            static   ok           ::1/128

Another test with all resources created:

root@omnios01:/root# crm status
============
Last updated: Tue Aug 30 07:10:25 2016
Stack: Heartbeat
Current DC: omnios02 (641f06f8-65a9-44fd-80f4-96b87e9c4062) - partition with quorum
Version: 1.0.11-6e010d6b0d49a6b929d17c0114e9d2d934dc8e04
2 Nodes configured, unknown expected votes
2 Resources configured.
============

Online: [ omnios01 omnios02 ]

 p_pool1_VIP    (ocf::heartbeat:IPaddr):        Started omnios01
 p_zfs_pool1    (ocf::heartbeat:ZFS):   Started omnios01
root@omnios01:/root#

root@omnios01:/root# crm node attribute omnios01 set standby on

root@omnios01:/root# crm status
============
Last updated: Tue Aug 30 07:14:22 2016
Stack: Heartbeat
Current DC: omnios02 (641f06f8-65a9-44fd-80f4-96b87e9c4062) - partition with quorum
Version: 1.0.11-6e010d6b0d49a6b929d17c0114e9d2d934dc8e04
2 Nodes configured, unknown expected votes
2 Resources configured.
============

Node omnios01 (11dc182d-5096-cd7c-acc6-eb3b3493f314): standby
Online: [ omnios02 ]

 p_pool1_VIP    (ocf::heartbeat:IPaddr):        Started omnios02
 p_zfs_pool1    (ocf::heartbeat:ZFS):   Started omnios02
root@omnios01:/root#

To bring the node online again we run:

root@omnios01:/root# crm node attribute omnios01 set standby off

Then we can check the status again:

root@omnios01:/root# crm status
============
Last updated: Mon Aug 29 02:14:42 2016
Stack: Heartbeat
Current DC: omnios02 (641f06f8-65a9-44fd-80f4-96b87e9c4062) - partition with quorum
Version: 1.0.11-6e010d6b0d49a6b929d17c0114e9d2d934dc8e04
2 Nodes configured, unknown expected votes
1 Resources configured.
============

Online: [ omnios01 omnios02 ]

root@omnios01:/root# crm status
============
Last updated: Mon Aug 29 02:14:48 2016
Stack: Heartbeat
Current DC: omnios02 (641f06f8-65a9-44fd-80f4-96b87e9c4062) - partition with quorum
Version: 1.0.11-6e010d6b0d49a6b929d17c0114e9d2d934dc8e04
2 Nodes configured, unknown expected votes
1 Resources configured.
============

Online: [ omnios01 omnios02 ]

 p_pool1_VIP    (ocf::heartbeat:IPaddr):        Started omnios01

and after a couple of seconds we can see that omnios01 is back online and the VIP has moved back to omnios01. After setting resource-stickiness=100 though, the resources will stay on omnios02.

Please note that I’m NOT using shared storage for the cluster, hence the ZFS resource failover can NOT work.

Repeat the same and create pool2 on omnios02

Create another VNIC on both nodes:

root@omnios02:/root# dladm create-vnic -l e1000g1 VIP2
root@omnios02:/root# dladm show-link
LINK        CLASS     MTU    STATE    BRIDGE     OVER
e1000g0     phys      1500   up       --         --
e1000g1     phys      1500   up       --         --
VIP1        vnic      1500   up       --         e1000g1
VIP2        vnic      1500   up       --         e1000g1

Create the pool on omnios02:

root@omnios02:/root# zpool create -f -m /pool2 -o autoexpand=on -o autoreplace=on -o cachefile=none pool2 raidz c2t1d0 c2t2d0 c2t3d0
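
and presumably set the same properties on it as we did for pool1 (lz4 compression, atime off):

root@omnios02:/root# zpool set feature@lz4_compress=enabled pool2
root@omnios02:/root# zfs set compression=lz4 pool2
root@omnios02:/root# zfs set atime=off pool2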

Configure Pacemaker:

primitive p_pool2_VIP ocf:heartbeat:IPaddr \
         params ip="10.10.1.206" cidr_netmask="24" nic="VIP2" \
         op monitor interval="10s" \
         meta target-role="Started"
primitive p_zfs_pool2 ocf:heartbeat:ZFS \
  params pool="pool2" \
  op start timeout="90" \
  op stop timeout="90"
colocation col_pool2_with_VIP inf: p_zfs_pool2 p_pool2_VIP
order o_pool2_before_VIP inf: p_zfs_pool2 p_pool2_VIP

The result:

root@omnios02:/root# crm status
============
Last updated: Wed Feb 22 04:37:07 2017
Stack: Heartbeat
Current DC: omnios02 (641f06f8-65a9-44fd-80f4-96b87e9c4062) - partition with quorum
Version: 1.0.11-6e010d6b0d49a6b929d17c0114e9d2d934dc8e04
2 Nodes configured, unknown expected votes
4 Resources configured.
============

Online: [ omnios01 omnios02 ]

 p_pool1_VIP    (ocf::heartbeat:IPaddr):        Started omnios01
 p_zfs_pool1    (ocf::heartbeat:ZFS):   Started omnios01
 p_pool2_VIP    (ocf::heartbeat:IPaddr):        Started omnios02
 p_zfs_pool2    (ocf::heartbeat:ZFS):   Started omnios02

Check for the second VIP on omnios02:

root@omnios02:/root# ipadm show-addr
ADDROBJ           TYPE     STATE        ADDR
lo0/v4            static   ok           127.0.0.1/8
e1000g0/v4        static   ok           192.168.0.142/24
e1000g1/dhcp      dhcp     ok           10.10.1.13/24
VIP2/cr           static   ok           10.10.1.206/24
lo0/v6            static   ok           ::1/128

After a reboot of the omnios02 node the COMSTAR target was also created:

root@omnios02:/root# itadm list-target -v
TARGET NAME                                                  STATE    SESSIONS 
iqn.2010-08.org.illumos:stmf-ha:pool2                        online   0        
        alias:                  -
        auth:                   none (defaults)
        targetchapuser:         -
        targetchapsecret:       unset
        tpg-tags:               default 

Now we can use this pool for ZFS over iSCSI in Proxmox too. In a real-world scenario, where both head nodes are connected to, let’s say, 2 x JBOD SAS enclosures for full redundancy, when one of the head nodes goes down its hosted pool will be migrated to the other head with no impact on the clients apart from the short pause during failover and VIP migration.

Just a note regarding Proxmox: to use these pools we need to generate an SSH key for passwordless access from the Proxmox nodes to the OmniOS nodes, for example:

root@proxmox01:/etc/pve/priv/zfs# ssh-keygen -t rsa -b 2048 -f 10.10.1.206_id_rsa -N ''

and add the /etc/pve/priv/zfs/10.10.1.206_id_rsa.pub key to the authorized_keys file of the root user on the OmniOS servers.
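
Something along these lines works for getting the key over; one key per portal IP (i.e. per VIP), and the paths on the OmniOS side assume root’s home directory is /root:

root@proxmox01:/etc/pve/priv/zfs# cat 10.10.1.206_id_rsa.pub | \
      ssh root@10.10.1.206 'mkdir -p /root/.ssh && cat >> /root/.ssh/authorized_keys'
root@proxmox01:/etc/pve/priv/zfs# ssh -i 10.10.1.206_id_rsa root@10.10.1.206 zpool list pool2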

References

APPENDIX

Finally, some commands related to ZFS and COMSTAR that I find useful.

COMSTAR COMMANDS

Install COMSTAR

# pkg install group/feature/storage-server
# svcadm enable stmf
# svcadm enable -r svc:/network/iscsi/target:default
# itadm create-target iqn.2010-09.org.napp-it:tgt1
# itadm list-target -v
# stmfadm offline-target iqn.2010-09.org.napp-it:tgt1
# itadm delete-target iqn.2010-09.org.napp-it:tgt1

TPG (Target Portal Group)

# itadm create-tpg TPGA 10.10.1.205 10.20.1.205
# itadm list-tpg -v
# itadm modify-target -t TPGA,TPGB iqn.2010-09.org.napp-it:tgt1

LUN

# zpool create sanpool mirror c2t3d0 c2t4d0 
# zfs create -V 10g sanpool/vol1
# stmfadm create-lu /dev/zvol/rdsk/sanpool/vol1
# stmfadm list-lu -v

e.g. 
root@omnios01:/root# stmfadm list-lu -v
LU Name: 600144F721DCA2888BA402E411EE3AF1
    Operational Status: Online
    Provider Name     : sbd
    Alias             : /dev/zvol/rdsk/pool1/vm-109-disk-1
    View Entry Count  : 1
    Data File         : /dev/zvol/rdsk/pool1/vm-109-disk-1
    Meta File         : not set
    Size              : 6442450944
    Block Size        : 512
    Management URL    : not set
    Vendor ID         : SUN     
    Product ID        : COMSTAR         
    Serial Num        : not set
    Write Protect     : Disabled
    Writeback Cache   : Disabled
    Access State      : Active

TG (Target Group)

# stmfadm create-tg targets-0
# stmfadm add-tg-member -g targets-0 iqn.2010-09.org.napp-it:tgt1

HG (Host Group)

# stmfadm create-hg host-a
# stmfadm add-hg-member -g host-a <space delimited WWN(s)/IQN(s) of the initiator device(s) (iSCSI, HBA etc.)>
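
For example, with an iSCSI initiator (the IQN below is just a placeholder for your initiator’s name):

# stmfadm create-hg host-a
# stmfadm add-hg-member -g host-a iqn.1993-08.org.debian:01:abcd1234
# stmfadm list-hg -v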

LUN access rights via views

LUN is available to all:

# stmfadm add-view 600144F721DCA2888BA402E411EE3AF1
# stmfadm list-view -l 600144F721DCA2888BA402E411EE3AF1

LUN is available to a specific host group only:

# stmfadm add-view -h host-a 600144F721DCA2888BA402E411EE3AF1

ZFS COMMANDS

Tunable ZFS parameters, most of these can be set in /etc/system:

  # echo "::zfs_params" | mdb -k

Some settings and mostly statistics on ARC usage:

  # echo "::arc" | mdb -k

Solaris memory allocation; “Kernel” memory includes ARC:

  # echo "::memstat" | mdb -k

Stats of VDEV prefetch - how many (metadata) sectors were used from low-level prefetch caches:

  # kstat -p zfs:0:vdev_cache_stats

Disable ZFS prefetch dynamically:

  # echo zfs_prefetch_disable/W0t1 | mdb -kw

Revert to default:

  # echo zfs_prefetch_disable/W0t0 | mdb -kw

Or set the following parameter in the /etc/system file to make it persistent:

  set zfs:zfs_prefetch_disable = 1

Limiting ARC cache size (to 32GB in this case) in /etc/system file:

  set zfs:zfs_arc_max = 32212254720

Add a device as a ZIL/SLOG; e.g. c4t1d0 can be added as a ZFS log device:

  # zpool add pool1 log c4t1d0

If 2 F40 flash modules are available, you can add mirrored log devices:

  # zpool add pool1 log mirror c4t1d0 c4t2d0

Available F20 DOMs or F5100 FMODs can be added as cache (L2ARC) devices for reads:

  # zpool add pool1 cache c4t3d0

You can’t mirror cache devices; multiple cache devices will be striped together:

  # zpool add pool1 cache c4t3d0 c4t4d0

Check the health of all pools:

  # zpool status -x
