
RAID6 Mismatch Count Issue

Last updated on November 12, 2019

Before reading, please understand that this article is not meant to be a step-by-step guide on configuring a RAID Array. These are simply my general notes and experiences tackling various technical issues in my Home Lab. The goal is for any/all interested persons to take away proper troubleshooting processes and problem solving skills from my experience. This article is not to be used as a literal guide, as technology changes and not every computer system is identical.

– Michael Giesen

Background

I recently stood up and configured a new server for my Home Lab. I wanted a way to level up my DevOps / Linux skills outside of my day job. One of the very first projects I wanted to tackle was creating my own NextCloud server, and finally freeing myself from the grips of Google.

I really liked the goal of setting up a Home Lab to wean off of big-box solutions like Google, Microsoft, etc. and also to reclaim ownership of my data. This was partly politically motivated, but mostly was a sort of thought exercise to prove that ‘anyone’ could take back control of their own data.

I settled on RAID6 because it keeps two sets of parity distributed across the drives. This means that of the 6 drives I’ve installed in my server, any two can fail and I won’t lose any data. In fact, when I replace the failed drives with new ones, the system will automatically rebuild the “lost” data onto them.

Initial Configuration

With a total of six 2 TB drives installed and a fresh copy of CentOS 7 running, I was ready to begin. The first step was to install the mdadm tool to configure the RAID6 Array.

sudo yum install mdadm
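
A quick, optional sanity check (not from my original notes) is to confirm the version of mdadm that yum pulled in:

mdadm --version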

After installing the tool I verified that all of the installed drives were registering correctly by running:

lsblk

From the output, I was able to verify that disks /dev/sda through /dev/sdf were all present and unpartitioned. The next step was to create a partition on each drive in the RAID6 Array; in this case, we’ll be partitioning all 6 drives. To do this, we’ll be using fdisk.

sudo fdisk /dev/sda

Once the fdisk menu was displayed, I entered ‘n’ to create a new partition, chose ‘p’ for a primary partition, and accepted the default parameters provided by the tool. I then entered ‘t’ to change the partition type, ‘L’ to see the list of available types, and selected ‘fd’, which is listed as Linux raid autodetect. Finally, I typed ‘w’ to write these changes to the disk and create the Linux RAID partition.

I repeated this process for disks /dev/sda through /dev/sdf (6 total). After each disk had the applicable partition, I ran lsblk again to verify the partitions were registered with the Operating System.
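
As an aside, if you’d rather not step through the interactive fdisk prompts six times, the same partitioning can be scripted. Here’s a rough sketch using parted instead of fdisk (not what I actually ran), assuming the six RAID disks really are /dev/sda through /dev/sdf. Note that parted -s rewrites the partition table without prompting, so only point it at the empty RAID disks:

for disk in /dev/sd{a..f}; do sudo parted -s "$disk" mklabel msdos mkpart primary 0% 100% set 1 raid on; done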

The next step was to create the actual RAID device and specify the RAID level we wanted (RAID6 in this case). I did that using the following command:

sudo mdadm --create /dev/md0 --level=6 --raid-devices=6 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1

After the array was created I ran the command below, which does two things:

  1. Provides a real-time report on the RAID Array’s sync status
  2. Confirms that the RAID Array has been set up and is functioning by syncing the disks

Here’s that command:

sudo mdadm --detail /dev/md0

We can clearly see from the output that all 6 drives are Active and that they’re being synced.
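
If you want a lighter-weight way to watch the initial sync tick along in real time, the kernel also exposes the array status in /proc/mdstat (an optional extra, not part of my original steps):

watch -n 5 cat /proc/mdstat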

OK – so the array is created; now I need to create a FileSystem for the RAID Array to utilize by running the following command:

sudo mkfs.ext4 /dev/md0

Now I need to mount the created FileSystem under /mnt/raid6 and verify it worked using these commands:

sudo mkdir /mnt/raid6

sudo mount /dev/md0 /mnt/raid6

ls -l /mnt/raid6
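
Another optional sanity check at this point is to confirm the usable capacity of the new filesystem, which for six 2 TB drives in RAID6 should come out to roughly the size of four of them:

df -h /mnt/raid6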

The last item that needs to be done is to add an entry to /etc/fstab to automount the device anytime there is a restart or power cycle.

sudo vi /etc/fstab

I added the following entry:

/dev/md0 /mnt/raid6 ext4 defaults 0 0
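
Before trusting that entry to a reboot, it’s worth testing it (an extra step I’d recommend, though it wasn’t in my original notes). Unmount the array and let fstab mount it back:

sudo umount /mnt/raid6

sudo mount -a

df -h /mnt/raid6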

It’s worth noting that by default RAID Arrays don’t have a config file; you need to create and save one manually. I was able to do this with the following commands (the redirect needs to run as root, hence the sh -c wrapper):

sudo sh -c 'mdadm --detail --scan --verbose >> /etc/mdadm.conf'

sudo mdadm --detail /dev/md0
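
On CentOS 7 it’s also commonly recommended, though I’m not certain it was strictly necessary here, to rebuild the initramfs after saving /etc/mdadm.conf so the array is assembled cleanly at boot:

sudo dracut -f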

Error Mismatches Reported

By a sheer stroke of luck, after installing and configuring the RAID6 Array I used my momentum and installed NetData on the new server. I’ve come to really love the software; it’s 100% free and open source and provides real-time server monitoring.

You can literally install it with a single command line:

bash <(curl -Ss https://my-netdata.io/kickstart.sh)

Once I had NetData configured for my system, I was immediately notified with a ‘warning’ that I had a high Mismatch Count on the newly created RAID6 Array. Not good.
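
As far as I can tell, NetData reads this number straight out of sysfs, so you can confirm the alert for yourself without the dashboard:

cat /sys/block/md0/md/mismatch_cnt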

Remember, we configured this as a RAID6 Array, which means that any data written to /mnt/raid6 is striped across all 6 disks along with two sets of parity blocks. When the array runs a check, it recomputes the parity from the data blocks in each stripe and compares it to the parity actually stored on disk; every block where the two disagree adds to the ‘mismatch count’. A non-zero mismatch count can mean that one or more drives are starting to fail, or that the data being written to the disks is being corrupted in some way, ultimately resulting in needing to replace one or more drives.

After a wave of panic and confusion washed over me, I started the troubleshooting process. Nothing really made sense, since these were all brand new 2 TB drives; they weren’t refurbished or used in any way before installation.

I had a theory. Part of what a RAID Array does is check itself for consistency. During a check, the system walks the entire array stripe by stripe, reading the data blocks from each disk, recomputing the parity, and comparing it to the parity that was actually written.

I theorized that since these were brand new drives with literally no data on them, the check had nothing meaningful to compare across the unused space, and so the system began to report thousands and thousands of mismatch errors that didn’t correspond to any real data. To give some context, when System Admins see a mismatch count above zero there is reason to panic! Hence the aforementioned wave of panic and confusion.

Solution

First and foremost, I wouldn’t have been able to find a solution as quickly as I did if it weren’t for a blog post Will Storey wrote documenting an eerily similar issue when he set up his RAID Array. It’s people like him, willingly sharing what they’ve learned, who help people like me succeed. Thanks Will.

To prove my theory that the newly created RAID6 Array was falsely reporting mismatch errors because the drives were fresh, I decided to ‘zero out’ the entire array.

The rationale: if I created a file that consumed all of the remaining disk space (about 7.6 TB), every stripe in the array would be written with real data, which would let the system correctly verify the state of the array. The first thing I did was start filling /mnt/raid6 with the following command (the redirect has to run as root, since the mount point is owned by root):

sudo sh -c 'cat /dev/zero > /mnt/raid6/bigfile.txt'
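
An equivalent way to do the same thing (a variation I didn’t actually run) is dd with a larger block size, which is usually a bit faster than cat’s small default reads; either way, the command stops on its own once the filesystem is full:

sudo dd if=/dev/zero of=/mnt/raid6/bigfile.txt bs=1M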

Since this literally fills 7.6 TB with zeros, I let it run overnight; the whole process took about 18 hours to complete. Once the RAID6 Array was full I deleted the file with the following command:

sudo rm -rf /mnt/raid6/bigfile.txt

Next, I manually initiated a ‘check’ (scrub) of the RAID6 Array. Note that the write into /sys has to happen as root, which is why the echo is piped through sudo tee instead of just prefixed with sudo:

echo check | sudo tee /sys/block/md0/md/sync_action
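
The check itself takes a while on an array this size. One way to tell when it’s finished (an extra step I’d suggest, not something from my original notes) is to watch the array status and wait for sync_action to report ‘idle’ again:

cat /proc/mdstat

cat /sys/block/md0/md/sync_action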

Once the check was complete, it was time to test my theory and read the mismatch count to see how many mismatches existed on the system:

cat /sys/block/md0/md/mismatch_cnt

ZERO MISMATCHES!

Conclusion

Though the experience of seeing and solving this issue was a good one, it’s mildly concerning that mdadm isn’t intelligent enough to tell the difference between harmless mismatches on a freshly built array and real data problems.

That being said, if you’re suddenly seeing an extremely high mismatch count, especially after a recent configuration, I would save some money and investigate before blindly replacing drives.
