Yet another infrastructure project

I recently got involved in a fun project. It involves building out a site-redundant infrastructure.

We have two sites (and a small third site to be used as a quorum). The datacenter network will be built on Nexus switches, the compute uses UCS servers and some NetApp arrays provide storage.

Now – while this sounds like the regular stack deployed by loads of companies all around the world, we don't have bottomless pockets, so there are a few challenges here.

  1. The network will be deployed with EVPN+VXLAN for stretching L2 between the sites. Yes…we can talk about how bad L2 stretching is until our faces are red – but we have multiple applications deployed in another environment that need to be migrated and not all of them have the luxury of being able to run in a redundant manner.
  2. Each site will only have two switches. Here is where the fun starts. Have you tried to find an EVPN+VXLAN design from Cisco that is not based on spine-leaf :-)?
  3. We have dual 100G interconnects.
  4. The third site (used as a quorum for the DC) has two Catalyst switches, and one dark fiber connection to each site.
  5. Since the third site is the quorum (i.e., the storage in the datacenters uses it to determine which site survives if one of the datacenters fails), routing needs to recover quickly in case of a failure.

So – we have the basic network requirements down.
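To make requirement 1 a bit more concrete, here is a rough sketch of what stretching a VLAN over VXLAN looks like on NX-OS. The VLAN/VNI numbers and the loopback are placeholders, and a real deployment also needs the underlay and the BGP EVPN address family configured – treat this as an illustration, not the actual design:

```
feature nv overlay
feature vn-segment-vlan-based
nv overlay evpn

! Map a stretched VLAN to a VXLAN VNI (numbers are made up)
vlan 100
  vn-segment 10100

! The NVE interface handles the VXLAN encapsulation,
! with BGP EVPN providing host reachability
interface nve1
  no shutdown
  host-reachability protocol bgp
  source-interface loopback1
  member vni 10100
    ingress-replication protocol bgp
```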

The compute part has fairly basic requirements:

  1. We need to have three virtualization clusters in the datacenters.
    • One cluster only lives in site 1 (dev/test, plus production workloads where the application itself provides redundancy across multiple servers instead of depending on hypervisor redundancy).
    • One cluster that is stretched between the sites for generic servers (production).
    • One cluster that is stretched between the sites for a certain application that has specific license requirements (test/staging/production – it can also run certain other applications if we have spare compute capacity in this cluster).
  2. There will be a single hypervisor server in site 3 running a management application and a quorum application. There is redundant power at site 3 (and at the datacenters).
    • I am aware that the hypervisor is non-redundant here, but the chance of the hardware crashing AND one of the datacenters going down at the same time is a risk we are willing to accept.

Alright – we have the basic compute requirements down as well. Understand that we are limited by a budget, so we have to make compromises in some places – not all physical servers can run all workloads due to software licensing, etc.

The storage – again – is pretty simple:

  1. The storage needs to provide a certain amount of IOPS/throughput.
  2. The storage needs to support stretched volumes between two sites with automatic recovery using a mediator/quorum over IP.
  3. The storage needs to provide a certain amount of capacity.
  4. The storage needs to support iSCSI, and have a clear roadmap for NVMe/TCP (if not already implemented).
  5. The storage solution must already have an install base here in Iceland – we have limited resources, so we need to be able to get local support in case of a crisis situation.
  6. We currently have some CIFS shares running on an array at a partner site, so either the new storage will provide the shares as well, or we need to migrate them to Windows file servers.

Alright – we have our basic requirements down for everything. I’m leaving out a lot of details here (software used, backup system design, etc) on purpose, I am fully aware of that! 🙂

Most of the hardware will be here early next week so we can start the deployment.

Hopefully we can start configuring the network in our lab no later than next Tuesday. The plan is to finish the basic configuration there (multi-site EVPN-VXLAN configuration, connectivity to the third site), and then go on and rack the equipment in the datacenters. We can get temporary internet connectivity through the third site while I deploy the hypervisors and storage. Then we can go ahead and connect the new datacenter network with our current networks and start to migrate the services between the sites.

In the past I have set up similar environments, but EVPN-VXLAN is somewhat new to me, so I have had to spend some time learning it. I need to do that anyway since I am going to be involved in a large project later this year, so I am very happy to get a head start with this project.

I’m hoping I can share a bit more detail during the implementation phase – this is a fun project and I’m happy I was able to take part in it!

A tale of a small business environment refresh – part 1

During my spare time I support the IT infrastructure for a small company located in the eastern part of Iceland. Late last year I decided it was time to refresh the infrastructure, so I spent some time figuring out the best way to do it.

We wanted to have the system hosted locally since they have, in the past, lost connectivity, so moving everything into a cloud-hosted environment wasn’t an option this time (although I expect that the next time we do a refresh we will move away from the on-premises setup). And since they are located far away from where I live, I wanted to move away from the single-server setup that had been in place for a long time. I might have gone a little overboard designing the environment for such a small business, but the results have exceeded my expectations.

We did a cost analysis of the current setup and calculated the cost of the new infrastructure over 5 years. It came to about the same as hosting the main business application with the application provider for three years. Had we gone that way, we would still have had to buy some infrastructure to host basic monitoring tools and supporting applications for the network environment – instead we can host those on the new environment as well.

I ended up going with the following specifications for the hardware and network infrastructure:

Network:

  • 2x Mikrotik CCR2004-1G-12S+2XS (Core routers)
  • 2x Mikrotik CRS518-16XS-2XQ-RM (Core switches)

Servers:

  • 3x SuperMicro CSE-116AC10-R706WB3 chassis, each with the following specs
    • Supermicro MBD-H12SSW-INR-B motherboard
    • AMD EPYC 7313 16C CPU
    • 128GB RAM
    • 2x Samsung PM9A1 512GB boot disks
    • 3x Micron 7450 Pro 1.92TB NVMe disks for Ceph
    • Supermicro AOC-S25GC-I4S-O (4x 25G ethernet adapter)
    • Dual PSU

The hypervisor I decided to use was Proxmox 7.4 (the latest release when I set up the environment). For backups I used Proxmox Backup Server, running on an old HPE ProLiant ML350 Gen8 server.

For the network I decided to set up the Mikrotik CRS518s in an MLAG configuration. The MLAG functionality in RouterOS isn’t perfect – connectivity is lost for about a minute if one of the switches goes down, since the system ID of the switch pair doesn’t stay static like it does on enterprise switches – but I am sure Mikrotik will fix that in a later release. The routers each have a 25G link to each switch, set up as a LACP bond. I created new VLANs for all networks (workstations, servers, infrastructure) and set up VRRP between the core routers on each VLAN for redundancy. A simple access list on the infrastructure VLANs limits access to the infrastructure. I have thought about adding a firewall running on the virtualization cluster, but I haven’t set one up yet.
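For reference, the MLAG and VRRP parts boil down to something like this in RouterOS 7. The interface names, bridge name, IDs and addresses are placeholders, so treat this as a sketch rather than a copy-paste config:

```
# On each CRS518: bond the peer links and mark the bond as the MLAG peer port
/interface bonding
add name=bond-peer slaves=qsfp28-1-1,qsfp28-1-2 mode=802.3ad
add name=bond-core1 slaves=sfp28-1 mode=802.3ad

/interface bridge
set bridge1 vlan-filtering=yes mlag-peer-port=bond-peer

# Downstream LACP bonds get a matching mlag-id on both switches
/interface bridge port
add bridge=bridge1 interface=bond-core1 mlag-id=10

# On the CCR2004 routers: VRRP per VLAN for the gateway address
/interface vrrp
add name=vrrp-servers interface=vlan20-servers vrid=20 priority=200
/ip address
add address=10.0.20.2/24 interface=vlan20-servers
add address=10.0.20.1/32 interface=vrrp-servers
```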

The main access switch has a 1G link to each core switch, configured as a LACP port-channel.

Backups are kept locally but also replicated to Tuxis, which provides a hosted PBS instance you can replicate your backups to. That way we can restore files/VMs quickly from the local copy when we need data in the short term, and if we have a disaster we still have a copy with longer retention at Tuxis. If you feel like it, you can also host your own PBS instance anywhere and store your backups there – it is amazing how easy PBS is to manage compared to some of the backup solutions I’ve seen in the past!

For Internet redundancy we have connections from two different providers. The main Internet connection is fronted by a Fortigate 40F firewall, which advertises the default route to the core routers through OSPF. Then we have a Mikrotik L41G-2AXD&FG621-EA 4G router on a different provider, configured on the core routers as a static default route with a higher administrative distance, which acts as a backup. This has proven to be a very stable setup for Internet connectivity so far.
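The failover logic relies purely on route preference – something along these lines on the core routers (the gateway address is made up):

```
# OSPF routes from the Fortigate (distance 110) win while it is up.
# The 4G path is a floating static default that only takes over on failure.
/ip route
add dst-address=0.0.0.0/0 gateway=192.168.200.1 distance=200 check-gateway=ping
```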

Here is a high level drawing of the infrastructure:

So far the performance has been great. My biggest worry was that Ceph performance would be bad enough that I would have to refactor everything and use ZFS with replication instead. Very limited testing has shown about 1-2GB/s in writes and 2+GB/s in reads. Each node has only 3 OSDs (I thought about partitioning the disks and using two OSDs per disk, but after my initial testing I was more than happy with the performance), so things are kept as simple as possible.
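For anyone wanting to repeat the test, the quick-and-dirty approach I know of is rados bench against a scratch pool. This obviously needs a live cluster, and the pool name and PG count here are just examples:

```
# Create a throwaway pool, benchmark writes, then sequential reads
ceph osd pool create bench 32
rados bench -p bench 30 write --no-cleanup
rados bench -p bench 30 seq

# Clean up the benchmark objects and the pool
rados -p bench cleanup
ceph osd pool delete bench bench --yes-i-really-really-mean-it
```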

There is an old APC Smart-UPS 1000 in place that can run the environment for about 7 minutes before it loses power. So far we have only had a single incident where all of the hosts lost power (the power can be somewhat unstable in the area). During the boot process two out of three hosts didn’t detect at least one of their three NVMe disks for Ceph, so the cluster didn’t have the minimum number of OSDs, and I had to manually restart the hosts to get the disks to appear again. This seems to be a bug in the SuperMicro BIOS; since upgrading to a newer version I haven’t seen it again (I had already seen it during the setup phase, so I wasn’t all that worried when we had the issue). If it keeps happening I will consider adding a PCIe adapter to handle the NVMe disks.

For the money, I think this environment is great. With the exception of the NVMe disk issue and the MLAG issue on the Mikrotik switches, I could not be happier with the result. I rarely have to touch the environment; as of now I still patch the Proxmox hosts and the Mikrotik infrastructure manually, but all of the server patching has been automated and I don’t think I will need to touch that any time soon.

The whole environment is monitored by CheckMK, which runs in a container on a Linux VM. CheckMK monitors the virtual guests, the Proxmox infrastructure, Ceph, the hardware and the network infrastructure.
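Running the CheckMK Raw edition in a container is roughly a one-liner, along these lines – the image tag, ports and volume name here are assumptions, so check the current Checkmk docs before using it:

```
# Persist sites in a named volume; the web UI ends up on port 8080
docker run -dit --name monitoring -p 8080:5000 \
  -v monitoring:/omd/sites \
  --restart always \
  checkmk/check-mk-raw:2.1.0-latest
```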

Lastly, I have been playing around with Security Onion to monitor the environment for security events, but I am still in the evaluation phase. It looks good as an open-source product and seems to have most of the features I would want for such a small environment – the only thing I would want to add is Qualys + Kenna for vulnerability scanning of both OS updates and third-party applications.

RC!

Last year I happened to stumble upon some videos of electric RC cars. I watched a couple of them and started reminiscing about the time back in ~2004 when a friend and I ordered a couple of Traxxas Revos with the TRX 3.3 engine. It was a blast, but the bad thing about living on this lovely island we call Iceland is that the temperature here is pretty low and the engine settings often needed adjusting, so we spent the better part of most sessions tuning before we could start bashing.

But watching those videos of electric cars was pretty interesting, as you could just charge up and start bashing right away! I started researching and ended up ordering a Traxxas Maxx v2. I had a blast bashing it a couple of times, but a little later I got the chance to get my hands on a Traxxas TRX4 crawler (Bronco 2021 body). Now….this is where things started to get interesting. I took it along when the family went on a hike and man….I realized I had so much more fun crawling than I did bashing.

So – since then I have built a Vanquish Phoenix VS4-10, and I just finished an Axial SCX10 Pro build (well, I still need to do some work on it, but it is in drivable condition). Building those kits has been the best hobby I can think of. I’ve searched for a hobby to spend my free time on for a long, long time and I think I have finally found it.

There are two issues though. Number 1 – this hobby is a money dump!…..however I am pretty sure it is a lot cheaper than if I had gone into hobbies like fly fishing or hunting. Number 2 – living on an island in the middle of nowhere with a population of ~400,000 means that access to crawler parts not made by Traxxas is pretty much non-existent. So I need to order pretty much everything except original TRX4 parts from abroad. And even if I can find cheap stuff, shipping plus duty fees always adds a premium to every part, so I have to think carefully before making any orders.

But nevertheless I am lucky enough that there are stores in Germany and Asia that ship things pretty cheaply (and some even ship pretty fast, thank god for cheap FedEx shipping!), so as long as I put together a sizeable order it won’t be anything crazy. I am going to create a page here sharing the stores I primarily use along with a list of my parts, just for fun, in case it helps anyone in a similar situation.

Yet another summer is coming to an end…

First post in a long time!

Yet another summer is coming to an end. Work starts again tomorrow and things get back to normal.

During the summer I realized I needed to rebuild a small SMB environment for a friend, and I decided on Mikrotik for the networking (switches, routers), SuperMicro for the servers and Proxmox for the virtualization layer. I’m going to document the process here and find out the good, the bad and the ugly about those three vendors. Can’t wait to get started – I expect to have the equipment in my hands in the next ~4 weeks or so.

This is going to be a somewhat over-designed environment, but I am excited to see how these vendors stack up against the enterprise vendors I work with most of the time.

On-premise Kubernetes

For the better part of the year I have been playing around with Kubernetes on-premises. When I started testing random solutions I didn’t realize what a can of worms I had just opened! ……Don’t get me wrong – the whole Kubernetes ecosystem is extremely fun to “play” in.

But after trying multiple solutions a colleague of mine pointed me to a project called Rancher. This project is pretty cool!

Rancher makes the installation extremely easy (yes yes, I sound like a salesperson), but this was the most straightforward product I had seen (and yes, I have seen a few) in this space.

Out of the box the project offers multi-cluster management, support for AKS, EKS and other managed solutions, as well as an on-premises installation using either RancherOS (a custom Linux distro for running Kubernetes) or roll-your-own VMs/bare-metal instances (using, for example, CentOS). It can integrate with vSphere to spin up instances…..and it has decent Active Directory integration for authentication/authorization.

Rancher is deployed on a dedicated Kubernetes cluster (if it is set up for HA) that should just be used for Rancher. Then you can go ahead and add your own clusters from AKS/EKS or on-premises. It is a nice single pane of glass for operating your Kubernetes clusters. If you have environments all over the place, it can help you gain better control of them as well as offer a single place to interact with for things like deployments.
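If you want to try it, the documented install path is a Helm chart on an existing cluster – roughly like this (the hostname is a placeholder, and cert-manager needs to be installed first unless you bring your own certificates):

```
# Add the Rancher chart repo and install into its own namespace
helm repo add rancher-latest https://releases.rancher.com/server-charts/latest
kubectl create namespace cattle-system
helm install rancher rancher-latest/rancher \
  --namespace cattle-system \
  --set hostname=rancher.example.com
```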

While I won’t go into details (the documentation simply speaks for itself) I recommend you take a look at this project if you plan to start using Kubernetes for your organization, or even just to play with your own stuff.

And the best part? The project is fully open source. Rancher is also working on a persistent storage solution (Longhorn), and they offer professional services/support if you need some help along the way.

They also have a mini Kubernetes distro called K3s – a (very) small Kubernetes distribution that you can run on pretty much anything that boots Linux and manage in the same way.
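Getting a single-node K3s instance up is about as simple as it gets – the documented install is a one-liner (it needs root and network access, and starts K3s as a systemd service):

```
curl -sfL https://get.k3s.io | sh -

# K3s bundles kubectl, so you can check the node right away
sudo k3s kubectl get nodes
```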

Simply put, this is an amazing project! 🙂

Openconnect and GlobalProtect VPN!

Hi!

Just tried the GlobalProtect support in OpenConnect 8 (8.02 in Fedora 29).

Very simplified version:

sudo openconnect --protocol=gp your.vpn.gw.com

Worked like a treat! Hopefully I can stop using the official Linux client now.

Now – hopefully NetworkManager-openconnect adds support for connecting to GlobalProtect VPNs soon! 🙂

Bgrds,
Finnur

Palo Alto GlobalProtect on Fedora

After spending some serious time trying to get GlobalProtect 4.1.2 to work on Fedora 28 (and probably 27 earlier this year) I finally managed to get it working. It is almost embarrassing how easy it was…

  1. Replace /etc/redhat-release and /etc/os-release with the contents from RHEL 7 or CentOS 7
  2. Profit.

Yep….it’s sucky….but at least it shows that this works. Maybe it is possible to modify some file that lists supported operating systems……I will have to look into that later on.
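The file swap itself can be sketched like this – I’m staging the fake contents in /tmp here, and the actual copy into /etc (commented out) needs root. The exact strings are just what a CentOS 7 box reports, adjust as needed:

```shell
# Stage CentOS-7-style release files (contents as reported by a CentOS 7 box)
cat > /tmp/os-release <<'EOF'
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
EOF

echo "CentOS Linux release 7.9.2009 (Core)" > /tmp/redhat-release

# Back up the originals and swap the files in (requires root):
# sudo cp /etc/os-release /etc/os-release.bak
# sudo cp /etc/redhat-release /etc/redhat-release.bak
# sudo cp /tmp/os-release /etc/os-release
# sudo cp /tmp/redhat-release /etc/redhat-release
```

Remember to restore the backups if an OS upgrade ever needs the real files.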

Always read the release notes….and the supported OS lists…..and the error logs. Even better if you do it all in the same evening to puzzle this amazing solution together……

FYI: The error I was getting was: Error: Gateway my.gateway.hostname: The server certificate is invalid. Please contact your IT administrator.

Cisco UCS: vHBA bandwidth

Coming from an FC background, I never really understood how Cisco UCS vHBAs are configured in regard to bandwidth.

Finally I got it spelled out for me like I was five…..IT IS JUST AN ETHERNET PORT (yes yes…I knew that, but I really thought there was more magic involved). IT JUST SYNCS AT THE SAME SPEED AS THE FEX PORT IT IS CONNECTED TO. That means if you have a blade with a VIC, a 6332 FI and a 2304 FEX, and you do not have the port expander, it will be configured as a 20Gbit port (2x10Gbit) with a single flow maxing out at 10Gbit/s (to the FI – not taking into account breakout speeds from the FI to your Ethernet and storage networks). If you have the port expander for the VIC, you get native 40Gbit/s with the 2304 FEX and the 6332 FI (a single flow can reach 40Gbit/s from the FEX to the FI).

I had a real duh! moment there.

Now it is out there! Hopefully this can help some poor soul. I googled my life away for a couple of days and did not find a real answer.

Bgrds,
Finnur

Trying out an iPad Pro 12.9″ for sketching and drawing….and it is awesome!

I recently got my hands on an iPad Pro 12.9″, which I wanted to use for drawing sketches when I am working on issues or designs, since I always seem to need to visualize stuff when I am working (and attach the sketches to OneNote). My desk often looks like there has been a mass murder of post-its and notebooks. Those end up in the trash, and I end up having to hammer down a drawing in Gliffy (or I don’t…..which isn’t exactly a good thing, since I often would like to remember later on what I was sketching).

So – I got an iPad Pro 12.9″, an Apple Smart Keyboard and an Apple Pencil.

Now I guess I have to admit I might have laughed hysterically at those crazy people buying into the iPad Pro + Apple Pencil hype. More than once. Probably more than three times, even….I have owned an iPad in the past (I think it was the iPad 3), but I mostly used it for watching TV episodes before going to sleep. And Skype, a couple of times.

So I spent last night setting everything up. Sat down in my La-Z-Boy and got down to business – enrolling the iPad in our MDM, installing the apps I normally use on my laptop (SSH client, RDP client, Outlook, Word, Excel, PowerPoint, VPN client, etc.). Played around with the pencil.

My SO was sitting in her chair with her Surface Pro 4, doing her nightly surfing. She has been using the SP4 for two years if I remember correctly. She loves that thing. Probably more than me. But less than her cats.

After playing with the iPad for about an hour I start to realize how cool this device actually is. And how useful it is (I had my doubts it would actually be this useful). Start mumbling something about how awesome it is to be able to sketch with the pencil (which is better than any stylus I have tried before). Keep using the iPad. Keep mumbling about how awesome it is. She suddenly looks at me and says: “I told you multiple times – having a tablet with a pencil is extremely useful”. While I have thought about getting a Surface, I never acted on it – they are pretty expensive here in the land of ice and snow.

I guess I have to eat my words. The iPad Pro is actually one of the most useful devices I have used for work-related computing. And I can even use it for more than I thought I could do on an iPad – a lot of researching, hammering at those pesky SSH terminals, and replying to emails. I might even stop taking my laptop to meetings. And let me tell you – before last night I would never have told you I would replace my laptop for any task.

Don’t get me wrong – this device will not replace my laptop. But I will probably use the laptop less than before.

Hopefully I can pair a Bluetooth mouse with it and use it with my RDP client. I haven’t tried that yet. If it works, I don’t think I will take my laptop with me on short weekend trips.

I just have to say it. Apple might have created a market of devices we don’t really need – but the iPad Pro is one brilliant device. While iOS is a little limited as a desktop OS, it has a couple of things going for it that Linux and Windows can’t keep up with. The battery life is awesome (at least on this thing, and so was the battery life on my old iPhone 6s Plus). I am still on the first charge on this bad boy, and I probably have 8 hours of screen-on time already.

My Lenovo T460s laptop chews through the battery in just 3-4 hours. And that thing is just a year old. But I can probably blame Chrome + Extensions for that 🙂

Bgrds,
Finnur

UniFi Network kit – awesome stuff!

Recently I got fed up with yet another router (with integrated wireless) provided by my ISP. In the last 5-6 years I have gone through about 6 of them, with a horrible experience on each and every one. Well…..to be fair – they all worked as expected as routers, but the wireless function was just a joke. Most of them were different versions of Thomson routers provided by Siminn (my ISP at the time). Before all this I had always used a Linux-based router (and even OpenBSD and FreeBSD at some point!) along with a standalone access point, but due to limited time and other things going on I didn’t feel like building yet another one (and my second trusty WRT54G had just died), so I just started using the router from my ISP to get things going again.

I moved to another apartment in November and got yet another router from my ISP (but this time I finally got a fiber connection!). Again – the wireless signal was horrible in some of the rooms, so I started checking out new equipment.

My friends have been raving about the stuff from Ubiquiti – the EdgeMax routers and the UniFi APs. So I decided to try it out.

This stuff is brilliant! Luckily I have an old Linux box hooked up in a corner where I can run the UniFi controller software for the wireless access point. The configuration is very simple, and the pricing of the APs (and the routers) is a joke. I got a single UniFi AP-AC-LR and it just rocks.

Then I configured the EdgeRouter….for a kit that only costs about $99, this thing is just awesome. I’m late to the game, but this little dude can route 1 Mpps, which you don’t normally find in such a small box (or at least not when it was released). Guess I won’t have to worry about that on my 100Mbit connection 😉

The GUI is pretty self-explanatory, and if you have ever worked with Cisco/Juniper kit you will find your way around the CLI quickly as well.

So – for around ~$200 I finally have a home network that I don’t have to worry about!

For the next phase I am thinking about getting an NGFW…..still debating whether I will go with a small Fortigate, Juniper, Palo Alto or just a UniFi Security Gateway – it would be awesome to be able to inspect SSL!

 

Bgrds,
Finnur