DevOps

Kubernetes Best Practices for Production

Tigran Khachatryan
January 6, 2025
7 min read

Running Kubernetes in production is fundamentally different from running it in development or testing environments. Production clusters must handle real traffic, maintain high availability, protect sensitive data, and scale efficiently while managing costs. In this comprehensive guide, we'll explore the battle-tested best practices that separate production-grade Kubernetes deployments from hobby projects.

Understanding Production Requirements

Before diving into specific practices, it's important to understand what makes production environments unique. Production systems must maintain strict SLAs, handle unpredictable traffic patterns, ensure data security and compliance, and provide comprehensive observability for debugging and optimization.

The complexity of Kubernetes means there are many ways to configure your cluster, but only certain configurations will meet the reliability and security standards required for production workloads.

Resource Management

Proper resource management is the foundation of a stable Kubernetes cluster. Without it, you'll experience unpredictable performance, cascading failures, and difficulty troubleshooting issues.

Setting Resource Requests and Limits

Every container should define both resource requests and limits. Requests tell the scheduler how much CPU and memory the container needs, while limits cap the maximum resources it can consume.

Resource Requests:

  • Guarantee minimum resources for your application
  • Used by the scheduler to place pods on appropriate nodes
  • Should match your application's baseline resource usage
  • Too low = potential starvation, too high = wasted resources

Resource Limits:

  • Prevent resource hogging and protect other workloads
  • CPU limits throttle, memory limits cause OOM kills
  • Set limits 1.5-2x higher than requests for burstable workloads
  • For critical services, keep limits close to requests for QoS
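As a sketch, a container spec following these guidelines might look like this (the names, image, and values are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: api-server            # illustrative name
spec:
  containers:
    - name: api
      image: example.com/api:1.4.2   # placeholder image
      resources:
        requests:
          cpu: "250m"         # baseline usage; used by the scheduler
          memory: "256Mi"
        limits:
          cpu: "500m"         # ~2x the request for a burstable workload
          memory: "512Mi"     # exceeding this triggers an OOM kill
```

Because requests and limits differ here, this pod would land in the Burstable QoS class; setting them equal would make it Guaranteed.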

Quality of Service Classes

Kubernetes assigns QoS classes based on your resource configuration:

  • Guaranteed: Requests equal limits - highest priority, best for critical services
  • Burstable: Requests < limits - good for variable workloads
  • BestEffort: No requests or limits - lowest priority, killed first under pressure

Resource Quotas and Limit Ranges

Use namespace-level resource quotas to prevent any single team or application from consuming all cluster resources. Implement limit ranges to enforce minimum and maximum resource specifications for containers.
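A minimal sketch of both objects for a hypothetical team-a namespace (all values are illustrative):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "10"        # total CPU all pods may request
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    pods: "50"                # cap on pod count in the namespace
---
apiVersion: v1
kind: LimitRange
metadata:
  name: container-limits
  namespace: team-a
spec:
  limits:
    - type: Container
      defaultRequest:         # applied when a container sets no requests
        cpu: "100m"
        memory: 128Mi
      default:                # applied when a container sets no limits
        cpu: "500m"
        memory: 512Mi
      max:                    # hard ceiling per container
        cpu: "2"
        memory: 2Gi
```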

Security Hardening

Security in Kubernetes is multi-layered, spanning the infrastructure, cluster configuration, container images, and application code. A breach at any layer can compromise your entire system.

Role-Based Access Control (RBAC)

RBAC is essential for controlling who can do what in your cluster. Follow the principle of least privilege - grant only the minimum permissions necessary for each user or service account.

Best practices for RBAC:

  • Never use cluster-admin binding for regular users or applications
  • Create role bindings at the namespace level when possible
  • Use service accounts for pod-to-API-server communication
  • Regularly audit RBAC permissions with tools like rbac-lookup
  • Implement separate roles for different environments (dev, staging, prod)
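For example, a narrowly scoped Role and RoleBinding might grant a CI service account read-only access to Deployments in one namespace (the names here are assumptions):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: deployment-reader
  namespace: prod
rules:
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "watch"]   # read-only, no mutation verbs
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ci-deployment-reader
  namespace: prod
subjects:
  - kind: ServiceAccount
    name: ci-bot                      # hypothetical service account
    namespace: prod
roleRef:
  kind: Role
  name: deployment-reader
  apiGroup: rbac.authorization.k8s.io
```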

Pod Security Standards

Pod Security Standards replace Pod Security Policies, which were deprecated and removed in Kubernetes 1.25, and define three levels of security controls:

  • Privileged: Unrestricted policy, use only for trusted system workloads
  • Baseline: Prevents known privilege escalations, good default for most workloads
  • Restricted: Heavily restricted, follows current pod hardening best practices
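These levels are enforced per namespace via labels read by the built-in Pod Security admission controller. A sketch that enforces the restricted profile on a production namespace:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: prod                                      # illustrative namespace
  labels:
    pod-security.kubernetes.io/enforce: restricted   # reject violating pods
    pod-security.kubernetes.io/audit: restricted     # record violations
    pod-security.kubernetes.io/warn: restricted      # warn clients on apply
```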

Image Security

Container images are a common attack vector. Implement these practices to secure your image supply chain:

  • Scan images for vulnerabilities with tools like Trivy, Clair, or Snyk
  • Use minimal base images (Alpine, distroless) to reduce attack surface
  • Never run containers as root - use USER directive in Dockerfile
  • Sign images and use admission controllers to verify signatures
  • Pull images from private registries with image pull secrets
  • Regularly update base images to patch security vulnerabilities

Network Policies

By default, all pods in a Kubernetes cluster can communicate with each other. Network policies let you implement micro-segmentation, restricting traffic based on pod labels and namespaces.

Network policy best practices:

  • Start with a default deny-all policy for each namespace
  • Explicitly allow only required traffic between services
  • Restrict ingress and egress traffic separately
  • Use namespace selectors to isolate environments
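As a sketch, a default deny-all policy plus one explicit allow rule (the app labels and port are assumptions):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: prod
spec:
  podSelector: {}            # empty selector matches every pod
  policyTypes:
    - Ingress
    - Egress
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-api
  namespace: prod
spec:
  podSelector:
    matchLabels:
      app: api               # illustrative label
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend  # only frontend pods may connect
      ports:
        - protocol: TCP
          port: 8080
```

Note that network policies require a CNI plugin that enforces them, such as Calico or Cilium.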

Secrets Management

Never store secrets in ConfigMaps or environment variables. Use Kubernetes Secrets with encryption at rest enabled, or better yet, integrate with external secret management solutions like HashiCorp Vault, AWS Secrets Manager, or Azure Key Vault.
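For illustration, a pod can consume a Secret as a read-only mounted volume rather than as environment variables; this sketch assumes a Secret named db-credentials already exists in the namespace, ideally synced from an external manager rather than committed to Git:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app                          # illustrative name
spec:
  containers:
    - name: app
      image: example.com/app:1.0.0   # placeholder image
      volumeMounts:
        - name: db-credentials
          mountPath: /etc/secrets    # one file per Secret key appears here
          readOnly: true
  volumes:
    - name: db-credentials
      secret:
        secretName: db-credentials   # assumed to exist out of band
```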

High Availability

High availability means your application remains accessible even when individual components fail. Kubernetes provides several mechanisms to achieve this, but they must be configured correctly.

Multi-Replica Deployments

Run at least 2-3 replicas of each critical service. This ensures your application stays available during rolling updates, node failures, or pod evictions.

Considerations for replicas:

  • Set replica count based on expected traffic and failure tolerance
  • Use HorizontalPodAutoscaler to scale replicas automatically
  • Ensure replicas are spread across availability zones
  • Account for replica overhead in resource planning
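A minimal HorizontalPodAutoscaler sketch targeting a hypothetical Deployment, scaling on average CPU utilization (all names and thresholds are illustrative):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
  namespace: prod
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api                    # hypothetical Deployment
  minReplicas: 3                 # floor for availability
  maxReplicas: 10                # ceiling for cost control
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out above 70% of requests
```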

Pod Disruption Budgets

Pod Disruption Budgets (PDBs) ensure that a minimum number of replicas remain available during voluntary disruptions like node drains or cluster upgrades.

Configure PDBs to specify either:

  • minAvailable: Minimum pods that must remain available (e.g., 2)
  • maxUnavailable: Maximum pods that can be unavailable (e.g., 1)
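A minimal PDB sketch using minAvailable (the name and label are assumptions):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
  namespace: prod
spec:
  minAvailable: 2          # node drains wait if this would be violated
  selector:
    matchLabels:
      app: api             # illustrative label
```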

Health Checks

Kubernetes offers three types of health probes that work together to ensure your application runs smoothly:

Liveness Probes:

  • Determine if a container is running properly
  • Failed liveness probes trigger container restarts
  • Check for deadlocks, infinite loops, or corrupted state

Readiness Probes:

  • Determine if a container is ready to accept traffic
  • Failed readiness probes remove the pod from service endpoints
  • Use during startup or when depending on external services

Startup Probes:

  • Handle slow-starting containers without affecting runtime checks
  • Disable liveness and readiness checks until startup succeeds
  • Essential for legacy applications with long initialization times
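The three probes above can be sketched together in a container spec; the paths, port, and timings are illustrative assumptions:

```yaml
# Fragment of a pod's container list.
containers:
  - name: app
    image: example.com/app:2.1.0   # placeholder image
    ports:
      - containerPort: 8080
    startupProbe:
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10
      failureThreshold: 30         # up to 30 x 10s = 5 min to start
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10
      failureThreshold: 3          # restart after ~30s of failures
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
      periodSeconds: 5             # failing removes pod from endpoints
```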

Pod Topology Spread

Use topology spread constraints to distribute pods evenly across availability zones, nodes, or other topology domains. This prevents all replicas from running on the same node or zone, which would create a single point of failure.
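As an illustrative fragment of a pod template spec (the app label is an assumption):

```yaml
# Fragment of a pod template's spec.
topologySpreadConstraints:
  - maxSkew: 1                                 # zones may differ by at most one pod
    topologyKey: topology.kubernetes.io/zone   # spread across zones
    whenUnsatisfiable: ScheduleAnyway          # use DoNotSchedule for a hard rule
    labelSelector:
      matchLabels:
        app: api                               # illustrative label
```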

Deployment Strategies

How you deploy updates significantly impacts availability and risk during releases.

Rolling Updates

The default strategy that gradually replaces old pods with new ones. Configure maxUnavailable and maxSurge to control update speed and resource usage.
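A sketch of a Deployment configured for a conservative rolling update (names and image are placeholders):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api                  # illustrative name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1      # at most one old pod down at a time
      maxSurge: 1            # at most one extra pod during the rollout
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: example.com/api:1.4.3   # placeholder image
```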

Blue-Green Deployments

Run two identical production environments (blue and green). Deploy to the inactive environment, test thoroughly, then switch traffic. Provides instant rollback capability.

Canary Deployments

Gradually roll out changes to a small subset of users before full deployment. Monitor metrics closely and rollback if issues are detected. Use service mesh tools like Istio or Linkerd for fine-grained traffic control.

Monitoring and Observability

You cannot effectively operate what you cannot observe. Comprehensive monitoring is non-negotiable in production Kubernetes environments.

Metrics Collection

Implement the Prometheus stack for metrics collection and visualization:

  • Prometheus: Time-series database and scraper
  • Grafana: Visualization and dashboards
  • AlertManager: Alert routing and grouping
  • kube-state-metrics: Cluster-level metrics
  • node-exporter: Node-level hardware metrics

Key Metrics to Monitor

Focus on these critical metrics for cluster and application health:

  • Cluster metrics: Node CPU, memory, disk usage; pod count; node status
  • Application metrics: Request rate, error rate, latency (RED method)
  • Resource metrics: Container CPU/memory usage vs limits; pending pods
  • Custom business metrics: Orders processed, users active, etc.

Centralized Logging

Aggregate logs from all containers and nodes into a centralized system for easy searching and correlation. Popular options include:

  • ELK/EFK Stack (Elasticsearch, Fluentd/Logstash, Kibana)
  • Loki with Grafana (lightweight, cost-effective)
  • CloudWatch, Stackdriver, or other cloud-native solutions

Distributed Tracing

Implement distributed tracing to understand request flows across microservices. Tools like Jaeger, Zipkin, or cloud-provider solutions show you exactly where time is spent and where failures occur.

Cluster Configuration Best Practices

Node Management

  • Use node pools with different instance types for different workload types
  • Enable cluster autoscaler to automatically adjust node count
  • Regularly update node images and Kubernetes versions
  • Use taints and tolerations to dedicate nodes for specific workloads
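For example, dedicating nodes to a workload type can be sketched as a toleration plus node selector in the pod spec; the taint key, value, and label here are assumptions, and the taint itself would be applied to the nodes out of band:

```yaml
# Fragment of a pod spec; assumes nodes tainted with workload=gpu:NoSchedule
# and labeled workload=gpu.
tolerations:
  - key: "workload"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"   # lets this pod land on the tainted nodes
nodeSelector:
  workload: gpu            # and steers it onto them
```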

Namespace Organization

  • Separate environments (dev, staging, prod) into different clusters
  • Use namespaces to isolate teams or applications within a cluster
  • Apply resource quotas and network policies at namespace level
  • Use naming conventions for easy identification

Configuration Management

  • Use GitOps practices - store all configs in Git
  • Leverage tools like ArgoCD or Flux for automated deployments
  • Use Helm charts or Kustomize for templating and reusability
  • Never apply configs manually in production

Backup and Disaster Recovery

Disaster recovery planning is often overlooked until it's too late.

What to Backup

  • etcd cluster state (contains all Kubernetes resources)
  • Persistent volume data
  • Configuration repositories (your GitOps source of truth)
  • Secrets and certificates

Backup Tools

  • Velero for cluster-level backups and migrations
  • etcd snapshot for control plane backups
  • Volume snapshot capabilities provided by cloud providers

Recovery Testing

Regularly test your recovery procedures. A backup that can't be restored is worthless. Practice restoring in non-production environments and document the recovery process.

Cost Optimization

Kubernetes can become expensive if not managed properly. Implement these practices to control costs:

  • Right-size your resource requests - don't over-provision
  • Use cluster autoscaler to scale nodes based on demand
  • Leverage spot/preemptible instances for non-critical workloads
  • Implement horizontal pod autoscaling to match capacity to load
  • Use tools like Kubecost to track and attribute spending
  • Clean up unused resources - old volumes, completed jobs, etc.

Common Production Pitfalls

Learn from common mistakes that lead to production incidents:

Pitfall 1: No Resource Limits

Problem: A single misbehaving pod consumes all node resources
Solution: Always set resource requests and limits

Pitfall 2: Missing Health Checks

Problem: Broken pods continue receiving traffic
Solution: Implement liveness and readiness probes

Pitfall 3: Single Replica Services

Problem: Downtime during updates or failures
Solution: Run at least 2-3 replicas with PDBs

Pitfall 4: No Monitoring

Problem: Issues discovered by users, not operators
Solution: Implement comprehensive monitoring and alerting

Pitfall 5: Storing Secrets in Code

Problem: Credentials exposed in version control
Solution: Use Kubernetes Secrets or external secret managers

Conclusion

Running Kubernetes in production successfully requires attention to detail across many domains - resource management, security, availability, monitoring, and cost optimization. The practices outlined in this guide represent lessons learned from thousands of production deployments across organizations of all sizes.

Start by implementing the fundamentals: resource limits, health checks, RBAC, and monitoring. Then progressively add more advanced practices like network policies, pod disruption budgets, and GitOps workflows as your experience grows.

Remember that production excellence is a journey, not a destination. Continuously measure, learn, and improve your Kubernetes operations. Stay updated with Kubernetes releases, engage with the community, and always prioritize reliability and security over complexity.

#Kubernetes #DevOps #Production #BestPractices

About Tigran Khachatryan

Tigran Khachatryan is a senior software engineer at d3vly with over 8 years of experience in DevOps. Passionate about sharing knowledge and helping developers build better software.