TL;DR: I talk about my experiences learning Kubernetes at a tiny company. These are just some personal war stories and rants. I’m not going to get technical here. Reading time: under 10 minutes.
I have been trying to use Kubernetes in a bare metal cluster for over a year now. Why is this so hard? How is anyone using this in production, let alone massive production deployments that are reliable? How do you get good at this thing that is so amazingly complicated? Now, I actually have some answers; but it still just amazes me how deep this rabbit hole goes.
You may have noticed the first problem right away: a bare metal cluster. And, if you’ve played with this tech, you know what that means: you’re on your own. The most common way to use Kubernetes is to rent it from a cloud provider like Google or Amazon: 1) they take a bunch of the lower-level tech off your plate, like load balancing for off-net traffic, or even backups, and 2) you get support from them. If something goes south, you have a number you can call, especially if you’re a big shop spending lots of money. If you’re big enough, you might have sales engineers on-site who can jump on your problem, or consult on projects. Your people don’t have to be experts in the tech; they just have to know how to handle vendors and use their monitoring to understand when things are going wrong. Bare metal, however, means no cloud; you are rolling your own in a deep sense. You’re responsible for the whole stack: hardware, networking, OS, cluster deployment, cluster admin. You need some very heavy tech-mages on your side to wield this kind of black magic. And, since I’m a team of one, I’m that wizard… And, if I’m honest, I’m punching way above my weight here. Now, I signed up for this kind of thing intentionally. I do this stuff because I really like the tech and I want to be close to the machine. I’m totally OK working for a shop of just under 50 employees because it gives me the freedom to learn all kinds of stuff; but it can also be daunting coming at something like Kubernetes. And that brings us to the next challenge after bare metal: training.
My company is so small and my boss so overworked that there’s no budget for official training and no time for mentoring or even paired design. I have to proactively figure out my own training regime; and let’s all admit something here and now: that’s very freaking hard. Most people are really no good at teaching themselves complicated skills, or pushing themselves through a six-month program of their own design. Don’t get me wrong, there are some out there who are great at this, but they’re the exception. When you find proactive, self-reliant people who are great at training themselves, you have found gold. Never lose that kind of talent! Like I said, I’m passionate about tech and like to have private projects, or read about computing history — I really love understanding the context of a system and how it came together — but it’s still really challenging to stay on task, or, and this is the most important thing, to break everything down into smaller, doable chunks. It’s soooo hard! Of course I know how to do all this logically, but keeping up the motivation, staying focused, and pushing through the anxiety to try new things that might crash spectacularly? It’s difficult overcoming The Fear, or the helplessness of standing in front of a complex system you don’t understand.
And the killer on top of this is Kubernetes’s rate of change as a project. Kubernetes ships a new minor release roughly every four months. When you’re looking for books, you really have to pay attention to how old they are. And the cutting edge moves so fast that it’s really difficult to find any training materials at all. Take the Gateway API! For years Kubernetes used something called Ingress to deal with traffic coming into and out of the cluster. There are a lot of clusters out there still using Ingress, but the feature is now frozen and won’t get any more updates; the Gateway API is the new hotness. Good luck figuring out how to get it working! I ended up changing our Container Network Interface (CNI) from Flannel to Calico and then to Cilium…and let me tell you about Cilium! No, I won’t actually rant about that, but, man, was that a learning curve. I see that there are respected certs for Kubernetes; but is the curriculum as dynamic as the tech, or are they just testing on much easier use cases than the one on my plate? How can a Certified Kubernetes Administrator (CKA) cert be worth anything? Or maybe the shops using this tech have bigger teams devoted to it…like, bigger than one guy.
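For anyone who hasn’t seen the two side by side, here’s a rough sketch of the shift, assuming a cluster with a Gateway API implementation installed. The hostname, Service name, and `gatewayClassName` below are made up for illustration; the resource kinds and API groups are from the upstream APIs:

```yaml
# Old style: one Ingress resource does everything (networking.k8s.io/v1, now frozen).
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web
spec:
  rules:
  - host: app.example.internal     # hypothetical hostname
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: web              # hypothetical Service
            port:
              number: 80
---
# New style: the Gateway API splits infrastructure (Gateway) from routing (HTTPRoute).
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: main-gateway
spec:
  gatewayClassName: cilium         # depends on your implementation
  listeners:
  - name: http
    protocol: HTTP
    port: 80
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: web-route
spec:
  parentRefs:
  - name: main-gateway             # attach this route to the Gateway above
  hostnames:
  - app.example.internal
  rules:
  - backendRefs:
    - name: web                    # same hypothetical Service as before
      port: 80
```

The split is the point: cluster operators own the Gateway, app teams own their HTTPRoutes, and neither steps on the other. It’s cleaner on paper; actually getting an implementation wired up on bare metal is the part nobody writes the book about.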
I’m thinking about doing Kelsey Hightower’s “Kubernetes the Hard Way” project, just to dig into the guts of this thing. I did Linux From Scratch back in 2008; that’s where you get all the source code for a VERY basic install of Linux, chroot into an isolated part of the OS, and use GCC to compile each component piece by piece; and, while it’s compiling, you can read about the component and what it does. Now, I admit this was a pretty streamlined process by the time I got to it. You worked from a LiveCD with all the source on it, you had a lot of documentation, and there was even an online forum where you could get community “support”: you know? Angry people who will start by telling you how stupid you are, and might get around to helping you out…maybe…if they don’t send you a “let me Google that for you” link. I worked for a guy in 2007 who did this back when the project was a stack of floppies; freaking respect! So, maybe this Hightower project would be a good way to get under the hood. And, honestly, I think there’s a lot of value in that. When I’m learning something, I want to peel off as much abstraction as I can so I know what the tools are doing for me; it’s the only hope of troubleshooting when the tools fail. It’s one of the reasons I still see a lot of value in retro computing. Why not do the same thing here?
And lucky for me, I already have a pretty good background in infrastructure. I understand how networking and DNS work; I’ve been involved in a lot of different networking projects over the years. In fact, I’m pretty primed to get really good at Kubernetes, since I understand a lot of the concepts it provides in virtual form. It really IS an incredible technology, and I think I know exactly where it came from.
You see, developers and sysadmin people have always hated each other. The devs always want to try something new; they’ve read about some crazy new thing and they push to have it installed in their environments. They, themselves, are involved in writing brand new software: brand new tools that solve some problem…or make a new one. And on the other side, you have the sysadmins who are tasked with keeping an infrastructure running and usable. If it’s running, don’t touch it! This point was always hit home for me at Speakeasy and then at T-Mobile when the Christmas freeze came; around the first week of December, no more changes were allowed in the network. You would go into the Network Operations Center (NOC) and just see green across the board. Not an outage in sight, and the network humming, because no one could touch anything. No changes. So devs want change, and admins don’t. And their jobs are measured by these metrics. If an admin brings down his production network, it can be, in the words of a past trainer I had at T-Mobile, “a career changing event”. So, frustrated with having to work with admins who drag their feet on everything, I suspect that Kubernetes came around just to get rid of them, to write them out of the picture. The entire infra has been bundled into a virtual system, and you never need to order a server, or ask for a network change and then wait a month, hoping that at the end you got what you asked for. Once you get the hang of Kubernetes, it IS really powerful. You can do a bunch of really cool stuff without opening a single ticket. But, as a guy who’s done a fair amount of sysadmin work, it’s just crazy some of the things the vendors want you to do. Calico, for instance, has a calicoctl tool that is a binary, not even at version 1.0, and you have to install it in /usr/local/bin as root; and they distribute this thing from GitHub. Really? We’re moving this fast, are we?
I have some spare time over the next few weeks, and I think I’m going to use it to break through the wall I’ve been stuck behind. Maybe finally put together a real six-month plan to master this thing. There’s a pretty good course through the Linux Foundation using minikube, which is great for sandboxing. Finally finish a few more books. I’ll see what I can do. I still just can’t believe, though, that people can use this in production. What must that cost on AWS or Google?! More money than my little boutique has, that’s for sure! And our use case is just a little dev environment, running no customer services and not accessible from the Internet. It doesn’t get any simpler than that. A lot of the challenges come from turning off the default safeties, because everything assumes access from the interwebs.
Anyway, we’ll see how far I get before I have to go back into the office.