Switching to K3s

I have switched to K3s; MicroK8s was not cutting it...

Posted by Muddy on March 11, 2023

I want to start this post by saying: I LOVE MicroK8s at work (we use it for easy, on-prem deployments), but for home use I have had NO luck with it. Over the past few days I have spent a fair amount of time spinning up a new K3s cluster, after MicroK8s decided to stop working on me for the fifth time this year.

I get it, homelabbing is not meant to be easy; it is a learning opportunity. What I have learned in this case is that MicroK8s is not good for my workloads. It runs significantly slower than K3s, it feels bloated, and it does not survive node reboots, even in HA mode.

I have used K3s in the past, and ultimately I chose to go with MicroK8s, as I was using it at work and figured it was a good opportunity to learn more about it and how it scales. This is a decision I now regret, as I have had nothing but issues with it.

That is not to say that MicroK8s is not good in general; I just think that, given my setup (both hardware and software), it is not the right fit for me.

What is K3s?

K3s is made by Rancher Labs, who have been a big name in the container space for some time now. They make some great software, such as Longhorn (K8s-native block storage), Rancher (a manager for K8s and Docker), and RancherOS (a Docker-focused Linux distro).

K3s is a lightweight K8s distribution that supports a highly available embedded etcd datastore. This means your main etcd database is replicated to all "master" nodes, which makes it fairly fault tolerant.
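
If you want to see which nodes are actually carrying a copy of the datastore, the role labels on the nodes tell you. Below is a minimal sketch using the official kubernetes Python client; the exact role label K3s applies to embedded-etcd members is an assumption here, so verify it against your own cluster rather than taking the comment at face value.

    from kubernetes import client, config

    # Minimal sketch: list every node and the roles it advertises.
    # Assumes a working kubeconfig and the official client (pip install kubernetes).
    # On my K3s servers the embedded-etcd members show up with an "etcd" role
    # label, but check your own cluster before relying on that name.
    config.load_kube_config()
    v1 = client.CoreV1Api()

    for node in v1.list_node().items:
        labels = node.metadata.labels or {}
        roles = sorted(
            key.split("/", 1)[1]
            for key in labels
            if key.startswith("node-role.kubernetes.io/")
        )
        print(f"{node.metadata.name}: {', '.join(roles) or 'worker'}")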

Ok, so why K3s over MicroK8s?

As mentioned above, K3s offers an HA etcd datastore that is replicated to each of its "master" nodes. Where this differs from MicroK8s is that MicroK8s uses a database called Dqlite, which is known to be very temperamental; in fact, K3s used to offer Dqlite as well and has since moved to etcd for stability reasons.

While both are embedded (the database runs on each node and is replicated), there seems to be a major difference in how they handle failures. For example, I patch my nodes and Proxmox servers monthly, which typically requires a reboot. To do this safely, I have to drain the nodes one at a time, making sure at least three stay up (needed for HA). On Wednesday this week (it's Friday), that is exactly what I did.
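
For what it's worth, the drain-one-at-a-time routine is simple enough to script. Here is a rough sketch of what mine looks like, driving plain kubectl from Python; the node names are placeholders, and it assumes kubectl is on PATH and pointed at the cluster.

    import subprocess

    # Rough sketch of the monthly patch routine: drain a node, patch and reboot
    # it out of band, then uncordon it before moving on to the next one.
    NODES = ["k3s-node-01", "k3s-node-02", "k3s-node-03"]  # placeholder names

    def drain(node: str) -> None:
        # --ignore-daemonsets / --delete-emptydir-data are the usual flags needed
        # to clear everything (Longhorn, monitoring, etc.) off the node.
        subprocess.run(
            ["kubectl", "drain", node, "--ignore-daemonsets", "--delete-emptydir-data"],
            check=True,
        )

    def uncordon(node: str) -> None:
        subprocess.run(["kubectl", "uncordon", node], check=True)

    for node in NODES:
        drain(node)
        input(f"{node} drained. Patch and reboot it, then press Enter to uncordon: ")
        uncordon(node)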

However, this time when the nodes came back up, Longhorn volumes suddenly stopped mounting, pods stopped coming up, and everything ground to a halt. I figured, "oh, I guess it just needs time to figure itself out." That was not the case: I left it overnight and it still had not fixed itself. At that point I decided to try draining and rebooting the remaining nodes, which only seemed to make things worse.

After spending hours troubleshooting, I decided to just cut ties with MicroK8s and give K3s a try.

How did I set everything up?

For K3s, I used Ubuntu 22.04 nodes with just NFS installed and a separate 1 TB volume for Longhorn data.

Once the nodes were all provisioned in Proxmox, I used Techno Tim's Ansible playbook to spin up my HA K3s cluster. It was rather straightforward and I had everything running in no time. Just follow his video here:

The playbook also installs kube-vip (which makes the K8s API highly available) and MetalLB (a virtual-IP load balancer for K8s). With these two pieces, I can set one IP for the API and one IP for the Ingress controller.
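
To give an idea of what that Ingress IP hand-off looks like, here is a sketch of the kind of address pool you give MetalLB. The exact resources depend on the MetalLB version the playbook installs (newer releases use the IPAddressPool and L2Advertisement CRDs shown here; older ones used a ConfigMap), and the address range and names below are placeholders for my home network, not what the playbook generates verbatim.

    import yaml  # pip install pyyaml

    # Sketch of a MetalLB address pool plus an L2 advertisement for it.
    # The range and resource names are made-up placeholders.
    ip_pool = {
        "apiVersion": "metallb.io/v1beta1",
        "kind": "IPAddressPool",
        "metadata": {"name": "default-pool", "namespace": "metallb-system"},
        "spec": {"addresses": ["192.168.1.240-192.168.1.250"]},
    }

    l2_advert = {
        "apiVersion": "metallb.io/v1beta1",
        "kind": "L2Advertisement",
        "metadata": {"name": "default-l2", "namespace": "metallb-system"},
        "spec": {"ipAddressPools": ["default-pool"]},
    }

    # Print both manifests as one multi-document YAML stream for kubectl apply -f -
    print(yaml.dump_all([ip_pool, l2_advert], sort_keys=False))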

I opted to use Nginx Ingress (as this is what I use at work and I am comfortable with it). I also opted to install Longhorn again, which allowed me to easily carry over my backups and restore all my volumes.

Once I had my GitLab instance set up, it was just a matter of running my pipelines and everything was back!

Conclusion

All in all, both products have their benefits, but I will be using K3s for the time being, as it seems to meet my needs perfectly. Time will tell! (If you are seeing this, it is working :) )