I recently got involved in a fun project. It involves building out a site-redundant infrastructure.
We have two sites (and a small third site to be used as a quorum). The datacenter network will be built on Nexus switches, the compute uses UCS servers and some NetApp arrays provide storage.
Now, while this sounds like the regular stack deployed by loads of companies all around the world, we don't have bottomless pockets, so there are a few challenges here.
- The network will be deployed with EVPN+VXLAN for stretching L2 between the sites. Yes…we can talk about how bad L2 stretching is until our faces are red – but we have multiple applications deployed in another environment that need to be migrated and not all of them have the luxury of being able to run in a redundant manner.
- Each site will only have two switches. Here is where the fun starts. Have you tried to find an EVPN+VXLAN design from Cisco that is not based on spine-leaf :-)?
- We have dual 100G interconnects.
- The third site (used as a quorum for the DC) has two Catalyst switches, and one dark fiber connection to each site.
- Since the third site is the quorum (i.e., the storage in the datacenters uses it to determine which site survives if one of the datacenters fails), routing needs to recover quickly after a failure.
So – we have the basic network requirements down.
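As a rough illustration of why the quorum path matters, here is a minimal Python sketch of a generic two-site-plus-quorum decision. It is a simplified model under stated assumptions, not how the NetApp mediator actually decides: the `Reachability` class, the `surviving_sites` function, and the tie-break in favour of site 1 are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Reachability:
    """Which paths are currently up (hypothetical model, not a vendor API)."""
    site1_to_site2: bool
    site1_to_quorum: bool
    site2_to_quorum: bool


def surviving_sites(links: Reachability) -> set[str]:
    """Return the sites allowed to keep serving the stretched volumes."""
    if links.site1_to_site2:
        # Interconnect is healthy: both sites stay active and the
        # quorum does not need to be consulted.
        return {"site1", "site2"}

    # Interconnect is down (or the peer site is dead): only a site that
    # can still reach the quorum may continue, to avoid split-brain.
    if links.site1_to_quorum and links.site2_to_quorum:
        # Both sites are alive but isolated from each other. The quorum
        # must pick exactly one; preferring site1 is an assumption here.
        return {"site1"}
    if links.site1_to_quorum:
        return {"site1"}
    if links.site2_to_quorum:
        return {"site2"}

    # Nobody reaches the quorum: neither site can safely continue.
    return set()
```

The last branch is the reason routing to site 3 has to converge quickly: in this model, if the path to the quorum is still down when the decision is made, neither site is allowed to continue.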
The compute part has fairly basic requirements:
- We need to have three virtualization clusters in the datacenters.
- One cluster lives only in site 1 (dev/test, plus production workloads that run on multiple servers where the application provides redundancy instead of depending on hypervisor redundancy).
- One cluster that is stretched between the sites for generic servers (production).
- One cluster that is stretched between the sites for a certain application that has specific license requirements (test/staging/production; it can also run certain other applications if this cluster has spare compute capacity).
- Site 3 will have a single server running a hypervisor, hosting a management application and a quorum application. There is redundant power at site 3 (and the datacenters).
- I am aware that the hypervisor is non-redundant here, but the hardware crashing AND one of the datacenters going down at the same time is a risk we are willing to accept.
Alright – we have the basic compute requirements down as well. Understand that we are limited by a budget, so we have to make compromises in some places – not all physical servers can run all workloads due to software licensing, etc.
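The cluster layout above can be written down as a small Python sketch that shows which clusters keep running at the hypervisor level if a whole datacenter is lost. The cluster names and the `clusters_still_running` helper are made up for illustration, and the model only looks at site placement. It ignores licensing, capacity, and application-level redundancy.

```python
# Hypothetical cluster names mapped to the datacenter sites they span.
CLUSTERS = {
    "site1-only": {"site1"},
    "stretched-generic": {"site1", "site2"},
    "stretched-licensed": {"site1", "site2"},
}


def clusters_still_running(failed_site: str) -> list[str]:
    """Return clusters that still have at least one site left after a site failure."""
    return sorted(
        name
        for name, sites in CLUSTERS.items()
        if sites - {failed_site}  # non-empty set means a surviving site remains
    )
```

For example, `clusters_still_running("site1")` leaves only the two stretched clusters, which is why anything placed on the site 1 cluster has to tolerate that site being down.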
The storage, again, is pretty simple:
- The storage needs to provide a certain amount of IOPS/throughput.
- The storage needs to support stretched volumes between two sites with automatic recovery using a mediator/quorum over IP.
- The storage needs to provide a certain amount of capacity.
- The storage needs to support iSCSI, and have a clear roadmap for NVMe/TCP (if not already implemented).
- The storage solution must already have an installed base here in Iceland. We have a limited amount of resources, so we need local support available if we have a crisis situation.
- We currently have some CIFS shares running on an array at a partner site, so either the storage will provide the shares as well, or we need to migrate these to Windows File Servers.
Alright – we have our basic requirements down for everything. I’m leaving out a lot of details here (software used, backup system design, etc.) on purpose, and I am fully aware of that! 🙂
Most of the hardware will be here early next week so we can start the deployment.
Hopefully we can start configuring the network in our lab no later than next Tuesday. The plan is to finish the basic configuration there (multi-site EVPN-VXLAN configuration, connectivity to the third site), and then go on and rack the equipment in the datacenters. We can get temporary internet connectivity through the third site while I deploy the hypervisors and storage. Then we can go ahead and connect the new datacenter network with our current networks and start to migrate the services between the sites.
In the past I have set up similar environments, but EVPN-VXLAN is somewhat new to me, so I have had to spend some time learning it. I need to do that anyway, since I am going to be involved in a large project later this year, so I am very happy to get a head start in this project.
I’m hoping I can share a few more details of this implementation during the implementation phase – this is a fun project and I’m happy I was able to take part in it!