Ubuntu 14.04/16.04 LTS and ZFS


Post by dedwards » Mon Jun 08, 2015 7:37 am

Prerequisites for installing ZFS on Ubuntu 14.04 LTS or 16.04 LTS (quick check commands follow the list):

1. 64-bit capable CPU
2. ECC RAM
3. 64-Bit Ubuntu 14.04 LTS or 16.04 LTS installation


Ubuntu 14.04 LTS Install ZFS packages

The commands below first refresh the package lists, then install python-software-properties, which provides the apt-add-repository command and makes it much simpler to safely add PPAs to our repository list. We then add the zfs-native PPA, refresh the package lists again, and install the ubuntu-zfs package itself.

Code: Select all

sudo apt-get update
sudo apt-get install python-software-properties
sudo apt-add-repository ppa:zfs-native/stable
sudo apt-get update
sudo apt-get install ubuntu-zfs
Load the ZFS module:

Code: Select all

sudo modprobe zfs
Ubuntu 16.04 LTS Install ZFS packages

Ubuntu 16.04 LTS comes with built-in support for ZFS, so it's just a matter of installing the packages and loading the module:

Code: Select all

sudo apt-get install zfsutils-linux zfs-initramfs
sudo modprobe zfs
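To confirm the ZFS module is loaded:

Code: Select all

lsmod | grep zfs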
Create Zpool

First, get a listing of all the disks in the system with the following command:

Code: Select all

sudo fdisk -l | more
You should get a listing like below:

Code: Select all

Disk /dev/sda: 4000.8 GB, 4000787030016 bytes
255 heads, 63 sectors/track, 486401 cylinders, total 7814037168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x00000000


Disk /dev/sdb: 4000.8 GB, 4000787030016 bytes
255 heads, 63 sectors/track, 486401 cylinders, total 7814037168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x00000000


Disk /dev/sdc: 4000.8 GB, 4000787030016 bytes
255 heads, 63 sectors/track, 486401 cylinders, total 7814037168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x00000000

Disk /dev/sdd: 4000.8 GB, 4000787030016 bytes
255 heads, 63 sectors/track, 486401 cylinders, total 7814037168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x00000000

Disk /dev/sde: 4000.8 GB, 4000787030016 bytes
255 heads, 63 sectors/track, 486401 cylinders, total 7814037168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x00000000


Disk /dev/sdf: 4000.8 GB, 4000787030016 bytes
255 heads, 63 sectors/track, 486401 cylinders, total 7814037168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x00000000


Disk /dev/sdg: 32.0 GB, 32017047552 bytes
255 heads, 63 sectors/track, 3892 cylinders, total 62533296 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x0005ae32
In my particular case, I will be using all of the 4000.8 GB drives in my zpool, so I will be using the following devices:

Code: Select all

/dev/sda
/dev/sdb
/dev/sdc
/dev/sdd
/dev/sde
/dev/sdf
I will not be using the 32 GB /dev/sdg since it's my boot drive. Now that we have the device names, let's get a listing of all the drives in the system by WWN ID. This is the preferred way to reference drives in your zpool, because the /dev/sdX assignment can change between boots. Additionally, the WWN ID is usually printed on the drive label itself, so if you ever have to replace a drive you will know exactly which one it is:

Code: Select all

ls -l /dev/disk/by-id
You should get a listing like below:

Code: Select all

lrwxrwxrwx 1 root root  9 Jun  8 10:48 ata-TOSHIBA_MD04ACA400_15O8K1NGFSBA -> ../../sda
lrwxrwxrwx 1 root root  9 Jun  8 10:48 ata-TOSHIBA_MD04ACA400_15PDKBNIFSAA -> ../../sde
lrwxrwxrwx 1 root root  9 Jun  8 10:48 ata-TOSHIBA_MD04ACA400_15Q1KCFMFSAA -> ../../sdf
lrwxrwxrwx 1 root root  9 Jun  8 10:48 ata-TOSHIBA_MD04ACA400_15Q2KETKFSAA -> ../../sdc
lrwxrwxrwx 1 root root  9 Jun  8 10:48 ata-TOSHIBA_MD04ACA400_15Q2KETLFSAA -> ../../sdd
lrwxrwxrwx 1 root root  9 Jun  8 10:48 ata-TOSHIBA_MD04ACA400_15Q3KFGKFSAA -> ../../sdb
lrwxrwxrwx 1 root root  9 Jun  8 10:48 ata-TSSTcorp_DVD+_-RW_TS-H653B -> ../../sr0
lrwxrwxrwx 1 root root  9 Jun  8 10:48 ata-V4-CT032V4SSD2_200118513 -> ../../sdg
lrwxrwxrwx 1 root root 10 Jun  8 10:48 ata-V4-CT032V4SSD2_200118513-part1 -> ../../sdg1
lrwxrwxrwx 1 root root 10 Jun  8 10:48 ata-V4-CT032V4SSD2_200118513-part2 -> ../../sdg2
lrwxrwxrwx 1 root root 10 Jun  8 10:48 ata-V4-CT032V4SSD2_200118513-part5 -> ../../sdg5
lrwxrwxrwx 1 root root  9 Jun  8 10:48 wwn-0x500003960b704511 -> ../../sdf
lrwxrwxrwx 1 root root  9 Jun  8 10:48 wwn-0x500003960b784775 -> ../../sdc
lrwxrwxrwx 1 root root  9 Jun  8 10:48 wwn-0x500003960b784776 -> ../../sdd
lrwxrwxrwx 1 root root  9 Jun  8 10:48 wwn-0x500003960b804868 -> ../../sdb
lrwxrwxrwx 1 root root  9 Jun  8 10:48 wwn-0x500003960ba809f1 -> ../../sda
lrwxrwxrwx 1 root root  9 Jun  8 10:48 wwn-0x500003960bd03569 -> ../../sde
lrwxrwxrwx 1 root root  9 Jun  8 10:48 wwn-0x500a07560bed90f1 -> ../../sdg
lrwxrwxrwx 1 root root 10 Jun  8 10:48 wwn-0x500a07560bed90f1-part1 -> ../../sdg1
lrwxrwxrwx 1 root root 10 Jun  8 10:48 wwn-0x500a07560bed90f1-part2 -> ../../sdg2
lrwxrwxrwx 1 root root 10 Jun  8 10:48 wwn-0x500a07560bed90f1-part5 -> ../../sdg5
Now we match each device name to its corresponding WWN ID (a scripted way to produce this mapping is shown after the list). In my particular case, I will be using the following WWN IDs:

sda --> wwn-0x500003960ba809f1
sdb --> wwn-0x500003960b804868
sdc --> wwn-0x500003960b784775
sdd --> wwn-0x500003960b784776
sde --> wwn-0x500003960bd03569
sdf --> wwn-0x500003960b704511
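
A quick sketch that prints the same mapping automatically (it assumes whole-disk wwn-* links like the listing above and skips partition entries):

Code: Select all

# Print "sdX --> wwn-..." for every whole-disk WWN link
for link in /dev/disk/by-id/wwn-*; do
    case "$link" in *-part*) continue ;; esac   # skip partition entries
    echo "$(basename "$(readlink -f "$link")") --> $(basename "$link")"
done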

We will be creating a RAID6 ZFS pool. I prefer RAID6 over RAID5 since it has more resiliency than RAID5: it can withstand two drive failures before the array goes down. Just for reference, the following RAID levels can be created (example create commands for each level follow the list):

RAID0
RAID1 (mirror)
RAID5 (raidz)
RAID6 (raidz2)
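
For reference, a sketch of the create syntax for each level. The pool name tank and the wwn-AAAA style device paths below are placeholders, not real devices:

Code: Select all

# RAID0 (simple stripe, no redundancy)
sudo zpool create tank /dev/disk/by-id/wwn-AAAA /dev/disk/by-id/wwn-BBBB

# RAID1 (mirror)
sudo zpool create tank mirror /dev/disk/by-id/wwn-AAAA /dev/disk/by-id/wwn-BBBB

# RAID5 (raidz, single parity)
sudo zpool create tank raidz /dev/disk/by-id/wwn-AAAA /dev/disk/by-id/wwn-BBBB /dev/disk/by-id/wwn-CCCC

# RAID6 (raidz2, double parity)
sudo zpool create tank raidz2 /dev/disk/by-id/wwn-AAAA /dev/disk/by-id/wwn-BBBB /dev/disk/by-id/wwn-CCCC /dev/disk/by-id/wwn-DDDD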

Let's create the RAID6 (raidz2) ZFS pool named array1, using 4 KiB sectors (-o ashift=12) instead of the default 512 bytes:

Code: Select all

sudo zpool create -o ashift=12 -f array1 raidz2 /dev/disk/by-id/wwn-0x500003960ba809f1 /dev/disk/by-id/wwn-0x500003960b804868 /dev/disk/by-id/wwn-0x500003960b784775 /dev/disk/by-id/wwn-0x500003960b784776 /dev/disk/by-id/wwn-0x500003960bd03569 /dev/disk/by-id/wwn-0x500003960b704511
Check the newly created zpool:

Code: Select all

sudo zpool status
should output the following:

Code: Select all

pool: array1
 state: ONLINE
  scan: none requested
config:

        NAME                        STATE     READ WRITE CKSUM
        array1                      ONLINE       0     0     0
          raidz2-0                  ONLINE       0     0     0
            wwn-0x500003960ba809f1  ONLINE       0     0     0
            wwn-0x500003960b804868  ONLINE       0     0     0
            wwn-0x500003960b784775  ONLINE       0     0     0
            wwn-0x500003960b784776  ONLINE       0     0     0
            wwn-0x500003960bd03569  ONLINE       0     0     0
            wwn-0x500003960b704511  ONLINE       0     0     0
Show the zpool listing:

Code: Select all

sudo zpool list
This will output the raw capacity of the zpool, NOT the usable capacity, since two drives' worth of space is taken up by parity:

Code: Select all

NAME     SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
array1  21.8T   756K  21.7T         -     0%     0%  1.00x  ONLINE  -

Running the zfs list command will output the usable capacity of the zpool:

Code: Select all

sudo zfs list

NAME     USED  AVAIL  REFER  MOUNTPOINT
array1   480K  14.3T   192K  /array1
Disable the ZIL (ZFS Intent Log) or disable sync writes

This cannot be stressed enough: if you intend to turn off the ZIL, you absolutely must have a UPS that will gracefully shut down your server before the battery runs out. If you don't, you risk losing or corrupting data.

If you intend to use your ZFS pool to store virtual machines or databases, you should not turn off the ZIL; instead, use an SSD as a SLOG device to boost performance (explained below).

If you intend to use your ZFS pool for NFS, which issues sync writes by default, then you may want to turn the ZIL off. What if you want to store virtual machines on NFS? Then simply set the async flag on your NFS export (example below).
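
For example, a hypothetical /etc/exports entry exporting the pool asynchronously (the network range is a placeholder):

Code: Select all

# /etc/exports -- the async option acknowledges NFS writes before they reach
# stable storage; apply the change afterwards with: sudo exportfs -ra
/array1  192.168.1.0/24(rw,async,no_subtree_check)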

Disable synchronous write handling (the ZIL) for the pool with the following command:

Code: Select all

sudo zfs set sync=disabled array1
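To confirm the setting took effect:

Code: Select all

sudo zfs get sync array1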

Create ZFS Dataset (Filesystem)

Unlike traditional disks and volume managers, space in ZFS is not preallocated. With traditional file systems, after all of the space is partitioned and assigned, there is no way to add an additional file system without adding a new disk. With ZFS, new file systems can be created at any time. Each dataset has properties including features like compression, deduplication, caching, and quotas, as well as other useful properties like readonly, case sensitivity, network file sharing, and a mount point. Datasets can be nested inside each other, and child datasets will inherit properties from their parents. Each dataset can be administered, delegated, replicated, snapshotted, jailed, and destroyed as a unit.
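
A short sketch of nesting and inheritance, using hypothetical dataset names:

Code: Select all

# Child datasets inherit properties from their parent
sudo zfs create array1/backups
sudo zfs set compression=lz4 array1/backups
sudo zfs create array1/backups/home          # inherits compression=lz4
sudo zfs get -r compression array1/backups   # shows the inherited value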

Let's create some datasets on the newly created zfs pool. In ZFS, filesystems look like folders under the zfs pool. We could simply create folders, but then we would lose the ability to create snapshots or set properties such as compression, deduplication, quotas etc.

In my particular case, I need part of the ZFS pool for an iSCSI target, so I'm going to create an iscsi dataset:

Code: Select all

sudo zfs create array1/iscsi
Running df -h will output the following. Notice the array1/iscsi dataset that was created:

Code: Select all

Filesystem      Size  Used Avail Use% Mounted on
/dev/sdg1        14G  2.3G   11G  18% /
none            4.0K     0  4.0K   0% /sys/fs/cgroup
udev            7.9G  4.0K  7.9G   1% /dev
tmpfs           1.6G  684K  1.6G   1% /run
none            5.0M     0  5.0M   0% /run/lock
none            7.9G     0  7.9G   0% /run/shm
none            100M     0  100M   0% /run/user
array1           15T  128K   15T   1% /array1
array1/iscsi     15T  128K   15T   1% /array1/iscsi
You can create as many datasets as you need in the ZFS pool and set properties.

If I want to enable compression on the newly created dataset, I would issue the following command:

Code: Select all

sudo zfs set compression=on array1/iscsi
To turn off compression use the following command:

Code: Select all

sudo zfs set compression=off array1/iscsi
Important note: simply setting compression=on defaults the compression algorithm to lzjb. It's recommended to use the lz4 algorithm instead, which is easily set by issuing the following command:

Code: Select all

sudo zfs set compression=lz4 array1/iscsi
If I wanted to set a quota, I would issue the following command:

Code: Select all

sudo zfs set quota=200G array1/iscsi
To remove the quota, use the following command:

Code: Select all

sudo zfs set quota=none array1/iscsi
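To verify the properties you have just set (zfs get accepts a comma-separated list of properties):

Code: Select all

sudo zfs get compression,compressratio,quota array1/iscsi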
If you want to destroy the dataset, issue the following command:

Code: Select all

sudo zfs destroy array1/iscsi
Create ZFS Volume dataset

A volume is a special type of dataset. Rather than being mounted as a file system, it is exposed as a block device under /dev/zdX, where X is the number of the volume dataset starting at 0 for the first one. For example, the first volume dataset will be /dev/zd0, the second /dev/zd1, and so on (ZFS on Linux also creates a friendlier symlink under /dev/zvol/<pool>/<dataset>). This allows the volume to be used for other file systems, to back the disks of a virtual machine, or to be exported using protocols like iSCSI.

A volume can be formatted with any file system, or used without a file system to store raw data. To the user, a volume appears to be a regular disk. Putting ordinary file systems on these zvols provides features that ordinary disks or file systems do not normally have. For example, using the compression property on a 250 MB volume allows creation of a compressed FAT file system.

The command below creates a sparse (-s) 1 TB volume under the array1 ZFS pool:

Code: Select all

sudo zfs create -s -V 1T array1/iscsivol
The same commands to set compression, quotas, etc. apply to volume datasets just as they do to regular file system datasets.
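
As a quick sketch, formatting the new volume with ext4 and mounting it (the /dev/zvol/array1/iscsivol path assumes the ZFS on Linux zvol naming mentioned above; /mnt/iscsivol is an arbitrary mount point):

Code: Select all

# The volume shows up as a block device; format and mount it like any disk
sudo mkfs.ext4 /dev/zvol/array1/iscsivol
sudo mkdir -p /mnt/iscsivol
sudo mount /dev/zvol/array1/iscsivol /mnt/iscsivol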

Add Cache Drive to Zpool

If you happen to have an SSD drive, you can utilize it as a cache drive (L2ARC) for your zpool. The idea is that data read from the SSD has significantly faster access times than data read from spinning disks. So, for instance, if you add a 250 GB SSD, up to 250 GB of the most frequently accessed data will be kept in the cache. Note that the L2ARC is purely a read cache: losing the cache device does not cause data loss, although its contents do not survive a reboot, so the cache has to warm up again after each restart.
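
You can keep an eye on how the cache is being used through the ARC statistics that ZFS on Linux exposes (the /proc path below is the usual location; verify it on your system):

Code: Select all

# L2ARC size, hits and misses from the ZFS kernel statistics
grep -E '^l2_(size|hits|misses)' /proc/spl/kstat/zfs/arcstats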

Identify your SSD's WWN ID as described above. Assuming the WWN ID for your SSD is wwn-0x50025388500f8522, we'll add it to our previously created zpool like below:

Code: Select all

sudo zpool add -f array1 cache /dev/disk/by-id/wwn-0x50025388500f8522
Check the Zpool status:

Code: Select all

sudo zpool status

Code: Select all

pool: array1
 state: ONLINE
  scan: none requested
config:

        NAME                        STATE     READ WRITE CKSUM
        array1                      ONLINE       0     0     0
          raidz2-0                  ONLINE       0     0     0
            wwn-0x500003960ba809f1  ONLINE       0     0     0
            wwn-0x500003960b804868  ONLINE       0     0     0
            wwn-0x500003960b784775  ONLINE       0     0     0
            wwn-0x500003960b784776  ONLINE       0     0     0
            wwn-0x500003960bd03569  ONLINE       0     0     0
            wwn-0x500003960b704511  ONLINE       0     0     0
        cache
          wwn-0x50025388500f8522    ONLINE       0     0     0
As you can see the drive has been added as cache.

Add Log Drives (ZIL) to Zpool

ZIL (ZFS Intent Log) drives, often called SLOG devices, can be added to a ZFS pool to speed up synchronous writes at any ZFS RAID level. The intent-log records for synchronous writes are committed to a very fast SSD first, which lets ZFS acknowledge the writes immediately; when the physical spindles have a moment, the data is flushed to the spinning media and the process starts over. We have observed significant performance increases by adding ZIL drives to our ZFS configuration. One thing to keep in mind is that the ZIL device should be mirrored. If it is not mirrored and the drive being used as the ZIL fails, the pool falls back to keeping the intent log on the spinning disks, severely hampering performance. Alternatively, you can always remove the failed drive and add another one as a ZIL drive.

If the ZIL drive fails at the same time as a crash or power loss, you can lose the last few seconds of written data. If that's acceptable to you, then a mirror is not necessary. If you are going to be storing MISSION CRITICAL data where even a few seconds of lost data will cost significant sums of money, adding ZIL drives in a mirror configuration is a MUST!

If you are going to be using two SSD drives in mirror mode, identify the SSD drives by wwn ID as described above and then add them to your array in mirror mode like below:

Code: Select all

sudo zpool add -f array1 log mirror /dev/disk/by-id/wwn-0x50025388500f8668 /dev/disk/by-id/wwn-0x50025388500ffg12
If you are going to be using a single SSD drive, identify it by WWN ID as described above and then add it to your array like below:

Code: Select all

sudo zpool add -f array1 log /dev/disk/by-id/wwn-0x50025388500f5af8
Check the Zpool status:

Code: Select all

sudo zpool status

Code: Select all

pool: array1
 state: ONLINE
  scan: none requested
config:

        NAME                        STATE     READ WRITE CKSUM
        array1                      ONLINE       0     0     0
          raidz2-0                  ONLINE       0     0     0
            wwn-0x500003960ba809f1  ONLINE       0     0     0
            wwn-0x500003960b804868  ONLINE       0     0     0
            wwn-0x500003960b784775  ONLINE       0     0     0
            wwn-0x500003960b784776  ONLINE       0     0     0
            wwn-0x500003960bd03569  ONLINE       0     0     0
            wwn-0x500003960b704511  ONLINE       0     0     0
        logs
          wwn-0x50025388500f5af8    ONLINE       0     0     0
        cache
          wwn-0x50025388500f8522    ONLINE       0     0     0
As you can see, it has been added as a log (ZIL) device.
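
If you ever need to remove a cache or log device again, zpool remove accepts the device name as shown in zpool status (the WWN IDs below are the ones added above):

Code: Select all

sudo zpool remove array1 wwn-0x50025388500f8522   # removes the cache device
sudo zpool remove array1 wwn-0x50025388500f5af8   # removes the log device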

Destroy ZFS zpool

If you want to destroy your zpool, issue the following command, which forces the destruction:

Code: Select all

sudo zpool destroy -f array1
Issues with zpool status on Ubuntu 14.04 after reboot

An issue I've run into: after a reboot, if you do a zpool status, your zpool may show the device names instead of the device IDs, like below:

Code: Select all

sudo zpool status

Code: Select all

  pool: array1
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        array1      ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sdb     ONLINE       0     0     0
            sdd     ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            sdc     ONLINE       0     0     0
            sdg     ONLINE       0     0     0
          mirror-2  ONLINE       0     0     0
            sde     ONLINE       0     0     0
            sdi     ONLINE       0     0     0
        logs
          sdh       ONLINE       0     0     0
        cache
          sdf       ONLINE       0     0     0

errors: No known data errors
This only seems to be a cosmetic issue because issuing the zdb command shows the device IDs like it's supposed to:

Code: Select all

sudo zdb

Code: Select all

array1:
    version: 5000
    name: 'array1'
    state: 0
    txg: 200
    pool_guid: 12136950353410592998
    errata: 0
    hostid: 2831217162
    hostname: 'nas3'
    vdev_children: 4
    vdev_tree:
        type: 'root'
        id: 0
        guid: 12136950353410592998
        children[0]:
            type: 'mirror'
            id: 0
            guid: 7548278309220334221
            metaslab_array: 39
            metaslab_shift: 35
            ashift: 12
            asize: 4000771997696
            is_log: 0
            create_txg: 4
            children[0]:
                type: 'disk'
                id: 0
                guid: 2562845451665823060
                path: '/dev/disk/by-id/ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E5N8KZ7N-part1'
                whole_disk: 1
                create_txg: 4
            children[1]:
                type: 'disk'
                id: 1
                guid: 291777340882840666
                path: '/dev/disk/by-id/ata-WDC_WD40EFRX-68WT0N0_WD-WCC4EDYVYU2J-part1'
                whole_disk: 1
                create_txg: 4
        children[1]:
            type: 'mirror'
            id: 1
            guid: 8578547322301695916
            metaslab_array: 37
            metaslab_shift: 35
            ashift: 12
            asize: 4000771997696
            is_log: 0
            create_txg: 4
            children[0]:
                type: 'disk'
                id: 0
                guid: 2041375668167635066
                path: '/dev/disk/by-id/ata-WDC_WD40EFRX-68WT0N0_WD-WCC4ENPN3V47-part1'
                whole_disk: 1
                create_txg: 4
            children[1]:
                type: 'disk'
                id: 1
                guid: 15162795176142751617
                path: '/dev/disk/by-id/ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E6YRLCHH-part1'
                whole_disk: 1
                create_txg: 4
        children[2]:
            type: 'mirror'
            id: 2
            guid: 302043060234775242
            metaslab_array: 35
            metaslab_shift: 35
            ashift: 12
            asize: 4000771997696
            is_log: 0
            create_txg: 4
            children[0]:
                type: 'disk'
                id: 0
                guid: 5285723468079384932
                path: '/dev/disk/by-id/ata-WDC_WD40EFRX-68WT0N0_WD-WCC4EDYVY6HT-part1'
                whole_disk: 1
                create_txg: 4
            children[1]:
                type: 'disk'
                id: 1
                guid: 5203540854438335529
                path: '/dev/disk/by-id/ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E1VUAYCJ-part1'
                whole_disk: 1
                create_txg: 4
        children[3]:
            type: 'disk'
            id: 3
            guid: 1510858814325079212
            path: '/dev/disk/by-id/ata-Samsung_SSD_840_EVO_250GB_S1DDNEAF407950E-part1'
            whole_disk: 1
            metaslab_array: 49
            metaslab_shift: 31
            ashift: 13
            asize: 250045005824
            is_log: 1
            create_txg: 62
    features_for_read:
        com.delphix:hole_birth
        com.delphix:embedded_data
You MAY be able to fix the issue by issuing the following commands:

Code: Select all

sudo zpool export array1
sudo zpool import -d /dev/disk/by-id/ array1
sudo zpool set cachefile= array1
sudo update-initramfs -k all -u
Reboot the machine and do a zpool status. Again, this is only a cosmetic issue and it shouldn't affect anything.

Zpool Status Failure Notifications

The following script runs every hour to check the status of every zpool, and it will notify you if a zpool encounters a problem such as a failed drive.

First, install the sendemail package if it is not already installed:

Code: Select all

apt-get install sendemail
Next, create a script in /etc/cron.hourly/ named zpoolstatus

Code: Select all

vi /etc/cron.hourly/zpoolstatus
Paste the following, adjust the email addresses and SMTP server, and save the file:

Code: Select all

#!/bin/bash
# Hourly zpool health check -- emails a report if any pool is not healthy
TO=to@domain.tld
CC=from@domain.tld
FROM=from@domain.tld
SMTPSERVER=server.domain.tld

# 'zpool status -x' prints "all pools are healthy" when nothing is wrong
if ! /sbin/zpool status -x | grep -q 'all pools are healthy'; then
        # Build a small report: date, hostname and the problem pool(s)
        /bin/date > /tmp/zfs.stat
        echo >> /tmp/zfs.stat
        /bin/hostname >> /tmp/zfs.stat
        echo >> /tmp/zfs.stat
        /sbin/zpool status -x >> /tmp/zfs.stat
        /usr/bin/sendemail -f $FROM -t $TO -cc $CC -u "ZFS Disk failure in server: `hostname`" -m "ZFS Disk failure in server: `hostname`. Please see attachment for details" -s $SMTPSERVER -a /tmp/zfs.stat
fi
Make the file executable:

Code: Select all

chmod +x /etc/cron.hourly/zpoolstatus
Verify that the file will run every hour by running the following command:

Code: Select all

sudo run-parts --report --test /etc/cron.hourly
It should give you the following output. If the output is blank, check your script again, and make sure the script does not have a .sh extension or run-parts will not run it:

Code: Select all

/etc/cron.hourly/zpoolstatus

That's it!