When I started at EPS, the Notorious Roger Mack was working to deploy our first EPS Hyper-V Cluster. This cluster utilized Windows Server 2012 (Win 2012) and consisted of three new Lenovo Servers and two (not-so-new) SuperMicro Servers. The Lenovo boxes acted as hosts and the SuperMicros acted as network drives. Together, these five boxes housed all our local systems at EPS, save our Firewall, Zone Controller, and a few smaller systems.
The cluster worked flawlessly until about 2020. In fact, we could still be utilizing it today if we wanted to, but there were three main problems: outdated operating system (OS), hard drive read/write speed, and network speed. To upgrade our cluster, all three of these issues needed to be addressed. This was no easy task.
In 2020, something big happened, and no, I’m not talking about COVID itself. The pandemic disrupted the supply chain and caused a huge hardware shortage. To upgrade the cluster properly, I needed to replace the older SuperMicro boxes and upgrade our network, but getting the necessary hardware was nearly impossible.
Plus, there was the issue of the OS. Win 2012 was already a few years old when the first cluster was built. Typically, server-side OSes last about eight years, which meant we were due for an upgrade around 2020. Microsoft released Server 2016 four years after Server 2012, and unexpectedly, Server 2019 followed shortly after. On the surface, 2016 and 2019 seemed identical, but 2019 was much faster, more secure, and designed for hybrid infrastructures. So, I aimed to upgrade from 2012 to 2019. If only it were that simple.
There was no direct upgrade path from 2012 to 2019; we had to go through 2016 first, doubling the upgrade time. Plus, this wouldn’t solve the hard drive and network speed issues. Considering the complex OS upgrade and hardware shortage, I decided to dismantle the cluster instead of rebuilding it in 2020.
Dismantling a cluster was new to me, so I hit Google. Most searches returned how to set up or manage a cluster, but not how to dismantle one. I had to learn the right terminology: it’s not a VM, it’s a Role; it’s not a Host, it’s a Node; it’s not a network drive, it’s a Cluster Shared Volume (CSV). Once I got the vocabulary down, I moved all the Roles to one node and evicted the other two nodes from the cluster. I reformatted those two nodes with Win 2019 and stand-alone Hyper-V and upgraded each with six new 2-terabyte (TB) drives, providing 6 TB of storage after RAID. I then removed all Roles from the last node, moved each one (now a standard VM) to the upgraded servers, and reformatted the last node, upgrading its drives and removing all traces of the cluster.
The VMs were much faster thanks to the improved read/write speeds, but backups still crawled because of the older SuperMicros and a network capped at 1 gigabit (gig) per second. A full backup took almost a day, which wasn’t feasible for daily backups.
To resolve this, I upgraded the network to 10 gig, which required new network interface cards (NICs) on all three Lenovo boxes and a new network switch for backup traffic. The SuperMicros already had 10 gig NICs. By the end of 2021, the new switch was in place, handling all our backup traffic.
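For a rough sense of why the link speed mattered so much, here is a minimal Python sketch of the transfer-time arithmetic. The 6 TB backup size and the 60% efficiency factor are assumptions chosen for illustration, not figures from this post; real throughput also depends on disk speed and backup software overhead.

```python
def transfer_hours(data_tb: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Hours needed to move data_tb (decimal terabytes) over a link_gbps link.

    efficiency is the fraction of line rate actually achieved (1.0 = ideal).
    """
    bits = data_tb * 1e12 * 8                  # terabytes -> bits
    usable_bps = link_gbps * 1e9 * efficiency  # gigabits/s -> usable bits/s
    return bits / usable_bps / 3600            # seconds -> hours


# Hypothetical backup size; the real total is not stated in the post.
BACKUP_TB = 6.0

for gbps in (1, 10):
    ideal = transfer_hours(BACKUP_TB, gbps)
    realistic = transfer_hours(BACKUP_TB, gbps, efficiency=0.6)
    print(f"{gbps:>2} Gb/s: {ideal:.1f} h at line rate, {realistic:.1f} h at 60% efficiency")
```

Whatever the real numbers are, the ratio holds: on a link-bound backup, moving from 1 gig to 10 gig cuts the transfer time by roughly a factor of ten.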
Despite these improvements, some vulnerabilities persisted. We had backups of every VM but no backup of those backups. Ideally, you should have multiple backups in different locations. In 2022, I ordered a new Synology Network-Attached Storage (NAS) device with six 16 TB HDDs, providing over 64 TB of storage after RAID. This new NAS handled local backups, versioning, and cloud backups, adding redundancy.
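The storage figures in this story (6 TB from six 2 TB drives, and 64 TB from six 16 TB drives) come down to simple RAID arithmetic. The sketch below is a minimal Python illustration; the post does not say which RAID levels were used, so the levels here are assumptions, and the function name `usable_tb` is hypothetical.

```python
def usable_tb(drives: int, size_tb: float, level: str) -> float:
    """Nominal usable capacity of an array of equal-sized drives.

    Ignores filesystem overhead and the TB/TiB difference, so a real
    array will report somewhat less than this.
    """
    if level == "raid5":      # one drive's worth of parity
        if drives < 3:
            raise ValueError("RAID 5 needs at least 3 drives")
        return (drives - 1) * size_tb
    if level == "raid6":      # two drives' worth of parity
        if drives < 4:
            raise ValueError("RAID 6 needs at least 4 drives")
        return (drives - 2) * size_tb
    if level == "raid10":     # mirrored pairs, striped together
        if drives < 4 or drives % 2:
            raise ValueError("RAID 10 needs an even number of drives, at least 4")
        return drives / 2 * size_tb
    raise ValueError(f"unsupported RAID level: {level}")


# Drive counts and sizes are from the post; the RAID levels are guesses.
for name, count, size in (("host array", 6, 2), ("backup NAS", 6, 16)):
    for level in ("raid5", "raid6", "raid10"):
        print(f"{name}: {count} x {size} TB in {level} -> {usable_tb(count, size, level):g} TB")
```

Under these assumptions, 6 TB from six 2 TB drives would match RAID 10, and 64 TB from six 16 TB drives would match RAID 6 or a comparable two-drive-redundancy layout.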
While everything was working as planned, we still wanted to easily update both hosts and VMs. Shutting down every VM on a host just to update it was not ideal, especially with most VMs in high use during the day. So, it was time to get the Hyper-V Cluster back up.
To rebuild the cluster, I only needed one more device: another Synology NAS, but without hard drives. I already had eighteen 2 TB drives from when I dismantled the first cluster. These SSDs were more than capable of being used in a NAS environment, which saved money.
Rebuilding the cluster was relatively easy by then. I knew everything I needed about maintaining, dismantling, and building a Hyper-V Cluster. Despite a few mistakes, we now have an up-to-date, fully functional Hyper-V Cluster running on the three Lenovo Servers as hosts, a Synology NAS as the CSVs, and another Synology NAS for backup storage. I’m even still using one of the old SuperMicros as a replication server.