How to rebuild a ZFS pool
I recently discovered that my server’s storage pool had much more capacity than it should, because I created a stripe pool across two disks instead of a mirror as I had originally intended.
Unfortunately, the only way to correct this is to destroy the pool and start over. With ZFS, it’s actually not as painful as it sounds, and with ZFS’s built-in checksumming and scrubbing features you can help ensure that your data transfers without a hitch.
I’ll admit, I’ve had to rebuild or recreate pools a few times now for various reasons. It almost feels like a rite of passage as a sysadmin.
Here are some tips to help you do this process safely and thoroughly. Please consider this a starting point for your own project… there might be some areas where you need to adapt these steps to better suit your individual circumstances. And hey, this is meant to be a friendly guide of tips and suggestions from one storage geek to another. So, be careful… and don’t sue me if you nuke your pool!
What you’ll need:
A backup storage device with enough capacity to store the entire pool’s data.
In my case, I had two spare drives lying around that were the same capacity as my pool, so I made a new temporary ZFS mirror pool using them.
A working UPS.
Don’t tell me you’re running your system solely off the mains now, are ya?
A second copy of your data wouldn’t hurt!
I did an rsync of all the actual files in my pool to another external drive, for yet one more layer of redundancy in case something messed up. No such thing as being too careful here!
Before you begin:
I assume you have intermediate knowledge of ZFS.
This guide doesn’t cover the deeper nuances of ZFS exhaustively. For instance, I assume you understand what tank is.
I’ll do my best to walk you through the process below. Before you dig deep, you might want to consider reading FreeBSD Mastery: ZFS by Michael W Lucas, or at a bare minimum the ZFS section of the FreeBSD Handbook.
Decide on how many snapshots you want to keep.
I decided to not save all of my pool’s daily snapshots, and instead transfer a single point-in-time migration snapshot. Decide if you want to retain your daily snapshots or if this single snapshot is sufficient.
The transfer will take lots of time.
It took about 5½ hours to transfer about 3.1 TB of data out of the pool, and another 5½ hours to transfer it back. Make sure that when you do this process you aren’t rushed and have ample time to do the job slowly and carefully.
ECC RAM helped give me confidence in the transfer job.
During the transfer, htop reported that the anonymous ARC cache was being used heavily to shuttle data around. I could hear the drive activity and confirm that there was a lot of read pre-fetching going on, and that when the copy commands ended the ARC took a few moments to “empty out” before the target drive stopped working.
I know there’s a lot of debate out there about ECC RAM not being necessary for ZFS, but in this case with so much data flying back and forth through RAM, having ECC helped reassure me that random bit flips wouldn’t happen and introduce issues. (I’m sure a ZFS expert could weigh in here and reassure me more about checksum validations that might happen during data transport… If you are one, please contact me!)
I avoided using USB drives and went direct SATA → SATA for speed and stability.
I suppose there’s nothing wrong with using an external USB drive if you must, but if this is your primary storage pool then at least plug that drive into a UPS! Your goal is to minimize external issues and safeguard your data during this delicate process.
This pool only had datasets, but volumes work basically the same way.
If you need to export data from volumes, the process is mostly the same. The instructions below focus on datasets only, but you could easily adapt this process to include volumes.
Here’s the process
Stop services and cronjobs
First, stop anything running on the server that might allow access to the source pool. This includes file shares, FTP servers, external SSH connections by people other than you, etc.
Also, consider stopping any cronjobs on the system. Since this process may take several hours, you don’t want your nightly backup job or a weekly scrub kicking off in the middle of things.
Collect data
Let’s collect some data that will be handy to have as we work. Save the results of these commands somewhere.
Start out by exporting a list of all datasets, their sizes, and mountpoints.
zfs list -r tank
Export a list of just the dataset names to help build scripts to go over all the datasets. Also useful to have if you enjoy making checklists!
zfs list -o name -r tank
Save the result of these commands too. They’ll help provide context for the disk layout and mountpoint names, permissions, etc.
- lsblk
- ls -lha of the pool root directory
- zpool status tank
- Properties for the pool root (zfs get all tank) and all datasets (run this command for each dataset: zfs get all tank/dataset)
Pipe that output to text files and keep that info for reference.
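If you'd rather not run each of those by hand, here is a minimal sketch of a helper that saves the ZFS-side info into text files. The function name `save_pool_info`, the output directory, and the file names are all my own placeholders, not anything ZFS requires. It only covers the `zfs` and `zpool` commands above; `lsblk` and `ls -lha` are left for you to capture separately.

```shell
#!/bin/sh
# Sketch only: save reference info about a pool into text files.
# Usage: save_pool_info POOL OUTDIR   (both names are illustrative)
save_pool_info() {
    pool=$1
    outdir=$2
    mkdir -p "$outdir" || return 1

    # Human-readable overview: datasets, sizes, mountpoints, and pool layout.
    zfs list -r "$pool" > "$outdir/zfs-list.txt" || return 1
    zpool status "$pool" > "$outdir/zpool-status.txt" || return 1

    # -H drops the header, so this file holds one dataset name per line.
    zfs list -H -o name -r "$pool" > "$outdir/datasets.txt" || return 1

    # One properties file per dataset; "/" becomes "_" in the file name.
    while IFS= read -r ds; do
        fname=$(printf '%s' "$ds" | tr '/' '_')
        zfs get all "$ds" > "$outdir/props-$fname.txt" || return 1
    done < "$outdir/datasets.txt"
}
```

Something like `save_pool_info tank "$HOME/pool-notes"` would then leave you with a folder of text files to refer back to. Keep that folder somewhere other than the pool you are about to destroy.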
Set up the new temporary storage
Making another ZFS storage pool and dataset is best because the ZFS exports we’ll do below will almost certainly exceed the maximum filesize of many filesystems.
When building the new temporary storage pool, be sure to label the disks with gpart labels so that you can be absolutely sure about which disks you’re addressing. (DO NOT rely on /dev/ labels like ada0 to address your disks, as those labels may change across reboots.)
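As a rough illustration of the labeling idea on FreeBSD, the sketch below only prints the commands for two blank disks instead of running them, so you can read them over before pasting anything. Everything specific here is an assumption of mine: the function name, the `tmp-` label prefix, the pool name `temporary`, and the `/mnt/temporary` mountpoint. It also assumes the disks have no existing partition table.

```shell
#!/bin/sh
# Sketch only: PRINT (do not run) the commands to label two blank disks
# with GPT labels and build a temporary mirror pool from those labels.
# Usage: print_temp_pool_plan DISK1 DISK2   e.g. print_temp_pool_plan ada2 ada3
print_temp_pool_plan() {
    disk1=$1
    disk2=$2
    for d in "$disk1" "$disk2"; do
        # Create a GPT partition table, then one labeled ZFS partition.
        echo "gpart create -s gpt $d"
        echo "gpart add -t freebsd-zfs -l tmp-$d $d"
    done
    # Address the disks by GPT label (gpt/...), not by adaX device name.
    echo "zpool create -m /mnt/temporary temporary mirror gpt/tmp-$disk1 gpt/tmp-$disk2"
}
```

Printing first is a deliberate choice: a typo in a device name here can wipe the wrong disk, so it is worth comparing the printed names against your disk list before running anything.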
Create snapshots
Now it’s time to create snapshots of the datasets to capture the data as it exists at this point in time. For each dataset, run:
zfs snapshot tank/dataset@YYYY-MM-DD-MIGRATION
You can adapt that snapshot naming scheme to something meaningful to you. I like using the YYYY-MM-DD format because it sorts nicely, and then use a name in ALL CAPS so this important snapshot stands out when you’re looking through a list of hundreds of snapshots.
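With more than a handful of datasets, a small loop saves typing. This is a sketch under a couple of assumptions: a text file with one dataset name per line (for example from `zfs list -H -o name -r tank`), and a helper name, `snapshot_all`, that I made up for illustration.

```shell
#!/bin/sh
# Sketch only: snapshot every dataset named in a list file.
# Usage: snapshot_all LISTFILE SNAPNAME
snapshot_all() {
    listfile=$1
    snap=$2
    while IFS= read -r ds; do
        [ -n "$ds" ] || continue            # skip blank lines
        zfs snapshot "$ds@$snap" || return 1  # stop at the first failure
        echo "created $ds@$snap"
    done < "$listfile"
}
```

You might call it as `snapshot_all datasets.txt "$(date +%Y-%m-%d)-MIGRATION"`. If you want every dataset under the pool anyway, `zfs snapshot -r tank@NAME` snapshots the pool root and all its descendants in one go.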
Transfer snapshots to the temporary storage
PRO TIP: Are you SSH’ing in to the box you’re working on? If so, from here on out you should be running your commands in a tmux session so that the jobs you’re running are protected from an accidental network interruption or your terminal going to sleep.
Each snapshot now needs to be sent to the temporary storage drive. For each dataset, you’ll want to run:
zfs send tank/dataset@YYYY-MM-DD-MIGRATION > /mnt/temporary/dataset.zfs
As this process may take hours to run, consider batching up all of those commands in a shell script. This will allow you to leave the house, go to sleep, etc. while the job runs dutifully in the background.
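Such a batch script could look roughly like the sketch below. The helper name `send_all`, the list file, and the file naming scheme (slashes in the dataset name turned into underscores) are my assumptions. Note that this naming would collide if you had both `tank/a_b` and `tank/a/b`, so adjust it if your names look like that.

```shell
#!/bin/sh
# Sketch only: send one snapshot per dataset to a file in DESTDIR.
# Usage: send_all LISTFILE SNAPNAME DESTDIR
send_all() {
    listfile=$1
    snap=$2
    destdir=$3
    while IFS= read -r ds; do
        [ -n "$ds" ] || continue
        # tank/home -> DESTDIR/tank_home.zfs
        out="$destdir/$(printf '%s' "$ds" | tr '/' '_').zfs"
        echo "sending $ds@$snap -> $out"
        if ! zfs send "$ds@$snap" > "$out"; then
            echo "FAILED: $ds@$snap" >&2
            return 1                        # stop so the failure is not buried
        fi
    done < "$listfile"
}
```

For example: `send_all datasets.txt 2024-01-01-MIGRATION /mnt/temporary`, where the snapshot name is whatever you used in the previous step. Stopping at the first failure is intentional; with a multi-hour job, a single error scrolling past unnoticed is the thing to avoid.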
Verify everything was copied
Use the list of datasets you wrote down earlier in the first steps of this process. Did you transfer all of them to the temporary storage location? Did you see any errors during the zfs send jobs? Do you want to send a second copy of the data somewhere, just in case? You could scrub the temporary location if you really want, but that’s probably overkill…
The point is, take your time here. Double- and triple-check everything, because we’re about to destroy the source pool. There’s no going back after that!
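One cheap check you can automate is confirming that every dataset on your list has a non-empty stream file in the temporary location. The sketch below assumes a list file with one dataset name per line and stream files named with slashes replaced by underscores plus a `.zfs` suffix. Both conventions, and the name `check_streams`, are mine. It does not validate the contents of the streams, only that they exist.

```shell
#!/bin/sh
# Sketch only: report which datasets have a non-empty stream file.
# Usage: check_streams LISTFILE DESTDIR
# Returns 0 only if every dataset has a file with size > 0.
check_streams() {
    listfile=$1
    destdir=$2
    missing=0
    while IFS= read -r ds; do
        [ -n "$ds" ] || continue
        f="$destdir/$(printf '%s' "$ds" | tr '/' '_').zfs"
        if [ -s "$f" ]; then
            echo "ok       $ds"
        else
            echo "MISSING  $ds ($f)"
            missing=$((missing + 1))
        fi
    done < "$listfile"
    [ "$missing" -eq 0 ]
}
```

Running `check_streams datasets.txt /mnt/temporary && echo "all present"` gives a quick yes/no before moving on to the destructive part.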
Destroy and rebuild the pool
First, you’ll want to export the source pool, to tell ZFS to stop using it.
zpool export tank
If that command gives you any guff about the pool being in use, make sure you’re not in a working directory inside the pool, that all services that might be using the pool are stopped, etc. Sometimes the pool also just needs a moment to free up before it can be exported, so try again after a minute.
If you’ve passed entire disks to ZFS, you’ll need to clear their labels to dissociate the disks from the pool.
Run this command on each drive, being very careful to supply the proper /dev/ names (or, try using GPT labels, etc.).
zpool labelclear /dev/x
zpool labelclear /dev/y
Once the labels are cleared from the drives, you’re ready to create the new pool. Follow the steps appropriate to the pool setup you want.
In my case, I wanted a mirror of two disks, so I ran:
zpool create tank mirror /dev/label/x /dev/label/y
Before you proceed, check the status of the newly-created pool.
zpool status tank
Were you trying to make a mirror? Be sure right now, right this very instant, that the output looks like this:
  pool: tank
 state: ONLINE
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0  <--
            ada0    ONLINE       0     0     0
            ada1    ONLINE       0     0     0
Make sure that line marked with the <-- arrow is there, otherwise you have a stripe! This is what bit me earlier… it’s very easy to miss.
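Since this is so easy to miss by eye, you could also let the shell look for the mirror line. This is a small sketch: the function name `pool_has_mirror` is hypothetical, and it simply searches the `zpool status` output for an indented `mirror-N` entry, so treat it as a second opinion rather than a replacement for reading the output yourself.

```shell
#!/bin/sh
# Sketch only: succeed if `zpool status POOL` lists a mirror vdev.
# Usage: pool_has_mirror POOL
pool_has_mirror() {
    # Mirror vdevs show up as an indented "mirror-0", "mirror-1", ... line.
    zpool status "$1" | grep -Eq '^[[:space:]]+mirror-[0-9]+[[:space:]]'
}
```

For example: `pool_has_mirror tank && echo "mirror vdev found" || echo "NO mirror vdev: this looks like a stripe!"`.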
After doing a sanity check on the pool architecture, set up any pool-specific settings, like: mountpoint, ACL settings, and other ZFS properties you want to configure at the pool level.
Transfer snapshots from the temporary storage
Now it’s time to repopulate the new pool with the snapshots you sent to the temporary storage. You’ll need to run this command for each dataset, so consider making a shell script, and make sure you’re executing it from within tmux if you’re working remotely.
cat /mnt/temporary/dataset.zfs | zfs receive tank/dataset
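A looped version of that command might look like the sketch below. It assumes a list file with one dataset name per line, parents listed before children (which is how `zfs list` orders them), and stream files named with slashes replaced by underscores plus `.zfs`. The name `receive_all` and those conventions are mine. It skips the pool's root dataset, because that already exists on a fresh pool and a plain `zfs receive` into it would fail; if you need data from the root dataset, handle it separately.

```shell
#!/bin/sh
# Sketch only: receive one stream file per dataset into the new pool.
# Usage: receive_all LISTFILE SRCDIR
receive_all() {
    listfile=$1
    srcdir=$2
    while IFS= read -r ds; do
        [ -n "$ds" ] || continue
        case "$ds" in
            */*) ;;                          # child dataset: go ahead
            *)   echo "skipping pool root $ds (already exists)" >&2
                 continue ;;
        esac
        in="$srcdir/$(printf '%s' "$ds" | tr '/' '_').zfs"
        echo "receiving $in -> $ds"
        # Feed the stream file to zfs receive on stdin.
        zfs receive "$ds" < "$in" || { echo "FAILED: $ds" >&2; return 1; }
    done < "$listfile"
}
```

For example: `receive_all datasets.txt /mnt/temporary`. As with sending, it stops at the first failure so nothing gets silently skipped.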
Once all the data is transferred, you’ll want to check and set:
- Dataset properties
- Mountpoints
- Owners, groups, and permissions
You can use the information exported in the first few steps to ensure everything is set up as you had it.
Verify the pool data
At this point, I ran a scrub on the new pool to make sure everything was in order. Scrubbing my pool took 4½ hours, so this is another opportunity to let the system run unattended until it’s done.
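A scrub is started with `zpool scrub tank` and runs in the background. If you'd like a script to block until it finishes, here is a sketch that polls `zpool status`. The function name `wait_for_scrub` is made up, and the loop assumes the status output contains the phrase "scrub in progress" while the scrub is running, which is worth confirming on your system first.

```shell
#!/bin/sh
# Sketch only: wait for a running scrub to finish, then show the result.
# Usage: wait_for_scrub POOL [SECONDS_BETWEEN_CHECKS]
wait_for_scrub() {
    pool=$1
    interval=${2:-60}
    # Assumption: status shows "scrub in progress" while scrubbing.
    while zpool status "$pool" | grep -q 'scrub in progress'; do
        sleep "$interval"
    done
    zpool status "$pool"
}
```

For example: `zpool scrub tank && wait_for_scrub tank 300`, then read the final status for any reported errors.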
Re-start services, cronjobs, and the server
Make sure you restart any services you shut off and any cronjobs you commented out earlier.
Once that’s done, give the system a reboot. It’s not absolutely necessary, but it helps validate that you have everything in order so you can prove that the next boot will succeed.
Remove the temporary storage drives and hang on to them for a bit until you are sure everything’s working.
All that’s left is to test your system, make sure the various services are working, and then you’re done – one more pool rebuild is now under your belt!