Kubernetes has become the de facto standard for container orchestration, but running it in production is very different from following a tutorial. After deploying and managing Kubernetes clusters for over 50 production workloads, here are the lessons we've learned the hard way.
Resource Requests and Limits — Every container should have CPU and memory requests and limits defined. Without requests, the scheduler can't make intelligent placement decisions. Without limits, a single misbehaving pod can take down an entire node. We set requests based on average usage and limits at 2-3x the request.
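As a rough illustration of the request/limit ratio described above, a Deployment might look like the sketch below. The name `example-api`, the image, and the specific numbers are placeholders rather than values from a real workload; the requests stand in for measured average usage, and the limits are set at 2x.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-api                 # hypothetical workload name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: example-api
  template:
    metadata:
      labels:
        app: example-api
    spec:
      containers:
        - name: api
          image: registry.example.com/example-api:1.0.0   # placeholder image
          resources:
            requests:
              cpu: 250m             # assumed average CPU usage
              memory: 256Mi         # assumed average memory usage
            limits:
              cpu: 500m             # 2x the CPU request
              memory: 512Mi         # 2x the memory request
```

The scheduler places pods using the requests. The limits are enforced at runtime: a container that exceeds its CPU limit is throttled, and one that exceeds its memory limit is OOM-killed.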
Pod Disruption Budgets — PDBs ensure that voluntary disruptions (like node upgrades or scaling events) don't take down too many pods simultaneously. For any service that requires high availability, we define PDBs that guarantee at least N-1 pods remain available during disruptions.
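One way to express the "at least N-1 pods" guarantee is `maxUnavailable: 1`, which holds regardless of the replica count. This is a minimal sketch; the name and the `app: example-api` label are hypothetical and would need to match the labels on your own pods.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-api-pdb             # hypothetical name
spec:
  maxUnavailable: 1                 # at most one pod evicted at a time, so N-1 stay up
  selector:
    matchLabels:
      app: example-api              # must match the pod labels of the protected workload
```

A PDB only constrains voluntary evictions such as node drains. It does not protect against involuntary failures like a node crash.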
Network Policies — By default, every pod in a Kubernetes cluster can communicate with every other pod. This is a significant security risk. We implement network policies that restrict traffic to only the necessary paths, following the principle of least privilege.
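A common least-privilege pattern is a default-deny policy per namespace plus explicit allow rules. The sketch below assumes a hypothetical `example` namespace, a frontend labeled `app: example-frontend`, and an API listening on port 8080; none of these names come from a real cluster.

```yaml
# Deny all ingress and egress for every pod in the namespace.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: example                # hypothetical namespace
spec:
  podSelector: {}                   # empty selector matches all pods in the namespace
  policyTypes:
    - Ingress
    - Egress
---
# Allow only the frontend to reach the API on its service port.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-api
  namespace: example
spec:
  podSelector:
    matchLabels:
      app: example-api              # hypothetical label on the API pods
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: example-frontend # hypothetical label on the frontend pods
      ports:
        - protocol: TCP
          port: 8080                # assumed API port
```

Two caveats apply. Network policies are only enforced if the cluster's CNI plugin supports them. Denying all egress also blocks DNS lookups, so a real setup typically needs an additional rule allowing traffic to the cluster DNS service.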
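Secrets Management — By default, Kubernetes Secrets are only base64-encoded, not encrypted: anyone with read access to the Secret (or to etcd, unless encryption at rest is enabled) can recover the value. For production workloads, we keep secrets in an external store such as AWS Secrets Manager or HashiCorp Vault and sync them into Kubernetes with the External Secrets Operator. Secrets are never committed to version control.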
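With the External Secrets Operator, the sync is declared as an `ExternalSecret` resource. The following is a sketch only: the store name `aws-secrets-manager`, the remote key `prod/example-api/db`, and the namespace are hypothetical, and the `apiVersion` may differ depending on the operator version installed.

```yaml
apiVersion: external-secrets.io/v1beta1   # may be a different version on your operator release
kind: ExternalSecret
metadata:
  name: example-api-db
  namespace: example                # hypothetical namespace
spec:
  refreshInterval: 1h               # how often to re-read the external store
  secretStoreRef:
    name: aws-secrets-manager       # hypothetical ClusterSecretStore, defined separately
    kind: ClusterSecretStore
  target:
    name: example-api-db            # Kubernetes Secret the operator creates and updates
  data:
    - secretKey: password           # key inside the resulting Kubernetes Secret
      remoteRef:
        key: prod/example-api/db    # hypothetical secret name in the external store
        property: password          # JSON field within that secret
```

Only this manifest, which contains references rather than values, needs to live in version control.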
Horizontal Pod Autoscaling — HPA is essential for handling variable load. We scale on both CPU utilization and custom metrics such as queue depth or request latency. We also set stabilization windows (the HPA's form of cooldown) so that replica counts don't thrash during traffic spikes.
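A sketch of such a configuration using the `autoscaling/v2` API is shown below. The workload name, the thresholds, and the metric `queue_messages_ready` are assumptions for illustration; an external or custom metric like this only works if a metrics adapter exposes it to the cluster.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-worker              # hypothetical workload
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-worker
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70    # assumed target, as a percentage of the CPU request
    - type: External
      external:
        metric:
          name: queue_messages_ready  # hypothetical metric served by a metrics adapter
        target:
          type: AverageValue
          averageValue: "30"        # assumed target of 30 queued messages per replica
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0     # react to spikes immediately
    scaleDown:
      stabilizationWindowSeconds: 300   # wait 5 minutes before removing replicas
      policies:
        - type: Percent
          value: 50                 # remove at most half the replicas per minute
          periodSeconds: 60
```

When several metrics are listed, the HPA computes a desired replica count for each and uses the largest. The asymmetric windows let the workload scale up quickly and scale down cautiously.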
Cluster Autoscaling — Node-level autoscaling ensures your cluster can grow and shrink based on demand. We configure cluster autoscaler with appropriate min/max node counts and use spot/preemptible instances for non-critical workloads to reduce costs by up to 70%.
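For the open-source Cluster Autoscaler, min/max node counts are typically passed as `--nodes=<min>:<max>:<group>` flags. The fragment below is an excerpt from the container spec of a hypothetical autoscaler Deployment on AWS; the node group names, the counts, and the image tag placeholder are assumptions, and managed offerings often configure the same limits through their own APIs instead.

```yaml
# Excerpt from the pod spec of a cluster-autoscaler Deployment (not a complete manifest).
containers:
  - name: cluster-autoscaler
    image: registry.k8s.io/autoscaling/cluster-autoscaler:<version>   # pick the tag matching your cluster version
    command:
      - ./cluster-autoscaler
      - --cloud-provider=aws
      - --nodes=3:10:example-on-demand-nodes   # min 3, max 10 (hypothetical group name)
      - --nodes=0:20:example-spot-nodes        # spot group may scale to zero (hypothetical group name)
      - --expander=least-waste                 # prefer the group that leaves the least idle capacity
      - --scale-down-unneeded-time=10m         # node must be underused this long before removal
```

Non-critical workloads can then be steered onto the spot group with node selectors and tolerations, using whatever labels and taints your node groups carry.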
Observability Stack — We deploy a standardized observability stack on every cluster: Prometheus and Grafana for metrics, Loki for log aggregation, Jaeger for distributed tracing, and Alertmanager for notifications. This gives teams full visibility into their applications without relying on vendor-specific tools.
Disaster Recovery — We practice disaster recovery regularly. This includes etcd backups, cluster state snapshots, and documented runbooks for common failure scenarios. Every team knows how to recover from a node failure, a control plane issue, or a complete cluster loss.
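On self-managed clusters, the etcd backup can be scheduled as a CronJob that runs `etcdctl snapshot save` on a control plane node. The sketch below assumes a kubeadm-style layout (certificates under `/etc/kubernetes/pki/etcd`, etcd listening on `127.0.0.1:2379`); the backup directory, schedule, and image tag are placeholders. Managed control planes generally do not expose etcd, so this does not apply there.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup
  namespace: kube-system
spec:
  schedule: "0 */6 * * *"           # every 6 hours (assumed frequency)
  jobTemplate:
    spec:
      template:
        spec:
          hostNetwork: true         # reach etcd on the node's loopback address
          restartPolicy: OnFailure
          nodeSelector:
            node-role.kubernetes.io/control-plane: ""
          tolerations:
            - key: node-role.kubernetes.io/control-plane
              operator: Exists
              effect: NoSchedule
          containers:
            - name: etcdctl
              image: registry.k8s.io/etcd:<version>   # use the tag matching your etcd version
              command:
                - etcdctl
                - --endpoints=https://127.0.0.1:2379
                - --cacert=/etc/kubernetes/pki/etcd/ca.crt
                - --cert=/etc/kubernetes/pki/etcd/server.crt
                - --key=/etc/kubernetes/pki/etcd/server.key
                - snapshot
                - save
                - /backup/etcd-snapshot.db            # overwritten on each run
              volumeMounts:
                - name: etcd-certs
                  mountPath: /etc/kubernetes/pki/etcd
                  readOnly: true
                - name: backup
                  mountPath: /backup
          volumes:
            - name: etcd-certs
              hostPath:
                path: /etc/kubernetes/pki/etcd
            - name: backup
              hostPath:
                path: /var/backups/etcd               # hypothetical directory on the node
                type: DirectoryOrCreate
```

A snapshot stored on the control plane node itself will not survive losing that node, so a real setup would also copy it to off-cluster storage and periodically test a restore from it.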
Kubernetes is a powerful platform, but it requires expertise to run well. If you're planning a Kubernetes deployment or struggling with an existing one, our cloud engineering team can help you build a production-grade platform.