KubeCon made it obvious: AI wasn’t a side track, it was the track.

AI is a core workload now, and that's unlikely to change anytime soon.

And for SREs, DevOps, and platform engineers, that means understanding how to deploy, scale, and operate models is becoming a day-to-day responsibility.

I’ve been intentionally learning this from the infrastructure up, using Amazon EKS as both an enterprise-realistic platform and a lab environment where what I learn will transfer directly to production work.

This will be a learn, iterate, and build process. The end goal is to deploy a Chat UI and have an API to use with coding agents while keeping costs affordable.

Cost-effective learning is driving this work, since running GPUs on EKS/AWS can get expensive for an individual. This is professional development, so I'm not burning through an enterprise budget to do it. The good news: everything is deployed via Terraform and GitLab CI, and the environment can be destroyed when not in use.

What I’m Exploring

My focus is serving LLMs on Kubernetes: not training, serving. That means vLLM, llm-d, and KServe as the orchestration layer.

KServe feels like the right fit because it's batteries-included, GPU-aware, and production-oriented. It aligns with how enterprises already think about model serving.
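
Part two will get into the actual configuration, but as a teaser, here's a rough sketch of what serving a model through KServe can look like using its Hugging Face runtime (which runs vLLM under the hood). The name, model ID, and resource numbers are illustrative placeholders, not my final setup.

# Sketch only: a minimal KServe InferenceService using the Hugging Face
# serving runtime (vLLM backend). Name, model, and resources are illustrative.
kubectl apply -f - <<'EOF'
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: chat-model                            # placeholder name
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface                     # KServe Hugging Face runtime
      args:
        - --model_name=chat-model
        - --model_id=Qwen/Qwen2.5-7B-Instruct # illustrative model choice
      resources:
        requests:
          nvidia.com/gpu: "1"
        limits:
          nvidia.com/gpu: "1"
EOF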

Resources that made sense so far:

How I’m Deploying EKS

Terraform, but intentionally deconstructed. No opinionated "magic" modules: explicit control over networking, node pools, and Karpenter. It's slower than just spinning up a cluster, but the learning sticks. (The repo is on GitLab if you want to follow along.) Once development is complete, a GitLab CI pipeline will handle deployment, which makes scheduled Terraform apply and Terraform destroy workflows possible.
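
The scheduled jobs aren't fancy. Stripped of the CI syntax, each one effectively runs:

terraform init
terraform apply -auto-approve     # scheduled job to bring the lab up
terraform destroy -auto-approve   # scheduled job to tear it down when I'm done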

The Wall I Hit: GPU Spot Quotas

To keep costs sane, I'm using GPU Spot instances via Karpenter. Consider this: a g6.xlarge Spot instance costs about $0.3541 per hour. That's not bad at all for an NVIDIA L4 with 24 GB of GPU memory. More details can be found here.
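
For context, the Karpenter side of this is a NodePool that only asks for Spot capacity and pins the instance type. Here's a trimmed-down sketch; the name, limit, and EC2NodeClass reference are placeholders rather than straight from the repo.

# Sketch of a Karpenter v1 NodePool for g6.xlarge Spot capacity.
# Names and limits are illustrative; assumes an EC2NodeClass named "default".
kubectl apply -f - <<'EOF'
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-spot                              # placeholder name
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]                    # Spot only, to keep costs down
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["g6.xlarge"]               # one L4 GPU, 24 GB of GPU memory
  limits:
    nvidia.com/gpu: "1"                       # cap the pool at a single GPU
EOF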

Checked quotas and got this:

aws service-quotas list-service-quotas \
  --service-code ec2 \
  | jq '.Quotas[] | select(.QuotaName | test("Spot") and test("GPU|G |Accelerated"))'

Result:

QuotaName: All G and VT Spot Instance Requests
Value: 0.0

Zero GPU Spot vCPUs allowed. No amount of correct Kubernetes, Karpenter, or GPU Operator configuration works until that account-level gate opens.
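
Once you know the quota code (L-3819A6DF, the same one we'll request an increase on below), you can also check it directly instead of filtering the full list:

aws service-quotas get-service-quota \
  --service-code ec2 \
  --quota-code L-3819A6DF \
  --query 'Quota.{Name:QuotaName,Value:Value}'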

Why does AWS do this? It's a guardrail against crypto mining and accidental runaway spend; GPU instances can get very expensive.

We’ll need to request an increase:

aws service-quotas request-service-quota-increase \
  --service-code ec2 \
  --quota-code L-3819A6DF \
  --desired-value 32

Status: PENDING. After 24 hours, it moved to Status: CASE_OPENED.
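
If you'd rather poll from the CLI than refresh the console, the request history for that quota shows the same status:

aws service-quotas list-requested-service-quota-change-history-by-quota \
  --service-code ec2 \
  --quota-code L-3819A6DF \
  --query 'RequestedQuotas[*].{Status:Status,Desired:DesiredValue}'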

As you can see, it's not instantaneous; someone likely has to approve the quota increase. Part two of this post will dive into the nitty-gritty of configuring KServe. I'm planning to include more low-level technical detail, possibly multi-model serving, and setting up chatbots and agents.

Just overall building.

I figured others might not know about these quotas, so it seemed worth sharing as part one.

So We Wait

This is a good reminder that cloud AI work is infrastructure plus quotas plus market reality. Spot GPUs aren’t opt-in by default. Learning this stuff cheaply still requires planning around AWS guardrails.

Once the quota clears, we’ll actually run vLLM + KServe on EKS with GPUs. That’s when the real work starts.

My goal is to deploy Open WebUI as a chat interface for the models and to expose API endpoints that can be used with coding agents like Claude Code or OpenCode. The best part: this workload is still cheaper than owning a GPU at home. I can run it for hours for less than the cost of a meal at Chick-fil-A. You don't want to miss part two.