In my spare time I support the IT infrastructure for a small company located in the eastern part of Iceland. Late last year I decided it was time to refresh the infrastructure, so I spent some time figuring out the best way to do it.
We wanted the system hosted locally because they have lost connectivity in the past, so moving everything into a cloud-hosted environment wasn’t an option this time (although I expect that we will move away from the on-premises setup at the next refresh). And since they are located so far away from where I live, I wanted to move away from the single-server setup that had been in place for a long time. I might have gone a little overboard designing the environment for such a small business, but the results have exceeded my expectations.
We did a cost analysis of the current setup and calculated the cost of the new infrastructure over five years. It came to about the same as hosting the main business application with the application provider for three years. Had we gone that way, we would still have had to buy some infrastructure to host basic monitoring tools and supporting applications for the network environment, whereas now we can host those on the new environment as well.
I ended up going with the following specifications for the hardware and network infrastructure:
Network:
- 2x Mikrotik CCR2004-1G-12S+2XS (Core routers)
- 2x Mikrotik CRS518-16XS-2XQ-RM (Core switches)
Servers:
- 3x Supermicro CSE-116AC10-R706WB3 chassis, each with the following specs:
  - Supermicro MBD-H12SSW-INR-B motherboard
  - AMD EPYC 7313 16C CPU
  - 128GB RAM
  - 2x Samsung PM9A1 512GB boot disks
  - 3x Micron 7450 Pro 1.92TB NVMe disks for Ceph
  - Supermicro AOC-S25GC-I4S-O (4x 25G Ethernet adapter)
  - Dual PSU
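To put the storage numbers in perspective, here is a rough back-of-the-envelope sketch of how much usable Ceph capacity this hardware gives. The node, disk count and disk size come from the list above. The replica count of 3 and the 0.85 fill threshold are assumptions on my part (they are the usual Ceph defaults, not something stated here), so treat the result as an estimate only.

```python
# Rough usable-capacity estimate for the Ceph disks listed above.
NODES = 3
OSDS_PER_NODE = 3        # one OSD per NVMe disk
OSD_SIZE_TB = 1.92

REPLICA_COUNT = 3        # assumption: replicated pool with size=3
NEARFULL_RATIO = 0.85    # assumption: default Ceph "nearfull" warning threshold


def raw_capacity_tb(nodes: int, osds_per_node: int, osd_size_tb: float) -> float:
    """Total raw capacity across all OSDs in the cluster."""
    return nodes * osds_per_node * osd_size_tb


def usable_capacity_tb(raw_tb: float, replicas: int, fill_ratio: float = 1.0) -> float:
    """Capacity left after replication, optionally capped at a fill ratio."""
    return raw_tb / replicas * fill_ratio


raw = raw_capacity_tb(NODES, OSDS_PER_NODE, OSD_SIZE_TB)
usable = usable_capacity_tb(raw, REPLICA_COUNT)
practical = usable_capacity_tb(raw, REPLICA_COUNT, NEARFULL_RATIO)

print(f"raw capacity:       {raw:.2f} TB")
print(f"after replication:  {usable:.2f} TB")
print(f"before nearfull:    {practical:.2f} TB")
```

Under those assumptions the nine disks give a bit over 17 TB raw, a little under 6 TB after replication, and just under 5 TB before Ceph would start warning about fill level.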
The hypervisor I chose was Proxmox 7.4 (the latest release when I set up the environment). For backups I use Proxmox Backup Server, running on an old HPE ProLiant ML350 Gen8 server.
For the network I decided to set up the Mikrotik CRS518 switches as an MLAG pair. The MLAG functionality in RouterOS isn’t perfect: connectivity is lost for about a minute if one of the switches goes down, because the system ID of the switches doesn’t stay static like it does on enterprise switches, but I am sure that Mikrotik will fix that in a later release. Each router then has a 25G link to each switch, set up in an LACP configuration. I created new VLANs for all networks (workstations, servers, infrastructure) and set up VRRP between the core routers for each VLAN for redundancy. A simple access list configured for the infrastructure VLANs limits access to the infrastructure. I have thought about adding a firewall running on the virtualization cluster, but I haven’t set one up yet.
The main access switch has a 1G link to each core switch, configured as an LACP port-channel.
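For readers who haven’t used VRRP, the per-VLAN redundancy between the two core routers boils down to a simple election: the live router with the highest priority answers for the virtual gateway address, and if priorities tie the higher interface address wins. Below is a minimal sketch of that rule. The router names, priorities and addresses are made up for illustration and are not my actual configuration.

```python
from dataclasses import dataclass, replace
from ipaddress import IPv4Address
from typing import Iterable, Optional


@dataclass(frozen=True)
class VrrpRouter:
    name: str               # hypothetical router name
    priority: int           # higher priority wins the election
    address: IPv4Address    # interface address, used as the tie-breaker
    alive: bool = True


def elect_master(routers: Iterable[VrrpRouter]) -> Optional[VrrpRouter]:
    """Return the live router with the highest (priority, address), or None."""
    candidates = [r for r in routers if r.alive]
    if not candidates:
        return None
    return max(candidates, key=lambda r: (r.priority, r.address))


# Hypothetical values for one VLAN
core1 = VrrpRouter("core1", priority=200, address=IPv4Address("10.0.10.2"))
core2 = VrrpRouter("core2", priority=100, address=IPv4Address("10.0.10.3"))

print(elect_master([core1, core2]).name)                        # normal operation
print(elect_master([replace(core1, alive=False), core2]).name)  # core1 is down
```

This only models the election outcome. Real VRRP also involves advertisement timers and preemption settings, which are left out here.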
Backups are kept locally but also replicated to Tuxis, which provides a PBS instance that you can replicate your backups to. That means we can restore files or VMs quickly from the local copy when we need data in the short term, and if we have a disaster we still have a copy of the data with longer retention at Tuxis. If you feel like it, you can also host your own PBS instance anywhere and store your backups there. It is amazing how easy PBS is to manage compared to some of the backup solutions I’ve seen in the past!
For Internet redundancy we have connections from different providers. The main Internet connection is fronted by a Fortigate 40F firewall, which advertises the default route to the core routers through OSPF. Then we have a Mikrotik L41G-2AXD&FG621-EA 4G router with a connection through a different provider. The core routers have a static route towards it with a higher distance, so it is only used as a backup. This has proven to be a very stable setup for Internet connectivity so far.
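The failover between the OSPF-learned default route and the static backup route can be sketched as a lowest-distance-wins selection among the routes that are currently usable. The gateway names and the distance of 200 for the static route are illustrative assumptions. 110 is the usual default distance for OSPF routes in RouterOS, but I haven’t listed my exact values here.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Route:
    gateway: str        # hypothetical next-hop label
    source: str         # how the route was learned
    distance: int       # lower distance is preferred
    usable: bool = True # False once the route is withdrawn or the gateway is down


def active_default_route(routes: Iterable[Route]) -> Optional[Route]:
    """Pick the usable default route with the lowest distance, if any."""
    candidates = [r for r in routes if r.usable]
    return min(candidates, key=lambda r: r.distance, default=None)


primary = Route("fortigate-40f", source="ospf", distance=110)
backup = Route("lte-router", source="static", distance=200)  # assumed distance

print(active_default_route([primary, backup]).gateway)  # primary link is up

primary.usable = False  # e.g. the OSPF default route is withdrawn
print(active_default_route([primary, backup]).gateway)  # traffic moves to 4G
```

The point of the design is that the backup route is always installed but never wins while the OSPF route is present, so failover needs no manual intervention.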
Here is a high level drawing of the infrastructure:
So far the performance has been great. My biggest worry was that Ceph performance would be bad enough that I would have to refactor everything and use ZFS with replication instead. Very limited testing has shown about 1-2 GB/s in writes and 2+ GB/s in reads. Each node has only 3 OSDs (I thought about partitioning the disks and using two OSDs per disk, but after my initial testing I was more than happy with the performance), so things are kept as simple as possible.
There is an old APC SmartUPS 1000 in place that can run the environment for about 7 minutes before it loses power. So far we have only had a single incident where all of the hosts lost power (the power can be somewhat unstable in the area). During the bootup process there was an issue where two out of three hosts didn’t detect at least one of the three NVMe disks for Ceph, so the servers didn’t have the minimum number of OSDs to start up and I had to manually restart the hosts to get the disks to appear again. This seems to be a bug in the Supermicro BIOS, but I have since upgraded to a newer version and so far I haven’t seen it again (I had already seen it during the setup phase, so I wasn’t all that worried when we had the issue). If we see this over and over again I will consider adding a PCIe adapter to handle the NVMe disks.
For the money, I think this environment is great, and with the exception of the issue with the NVMe disks and the MLAG issue with the Mikrotik switches, I could not be happier with the result. I rarely have to touch the environment. As of now I still patch the Proxmox hosts and the Mikrotik infrastructure manually. All of the server patching has been automated and I don’t think I will need to touch that any time soon.
All of the environment is monitored by CheckMK, which is running in a container on a Linux VM. CheckMK monitors the virtual guests, the Proxmox infrastructure, Ceph, hardware and the network infrastructure.
Lastly, I have been playing around with Security Onion to monitor the environment for security events, but I am still in the evaluation phase. It looks good as an open source product and seems to have most of the features I would want for such a small environment. The only thing I feel I would want to add is Qualys + Kenna for vulnerability scanning, covering both OS updates and third-party applications.