DevOps

Kubernetes Best Practices for Production

Tigran Khachatryan
January 6, 2025
7 min read

Running Kubernetes in production is fundamentally different from running it in development or testing environments. Production clusters must handle real traffic, maintain high availability, protect sensitive data, and scale efficiently while managing costs. In this comprehensive guide, we'll explore the battle-tested best practices that separate production-grade Kubernetes deployments from hobby projects.

Understanding Production Requirements

Before diving into specific practices, it's important to understand what makes production environments unique. Production systems must maintain strict SLAs, handle unpredictable traffic patterns, ensure data security and compliance, and provide comprehensive observability for debugging and optimization.

The complexity of Kubernetes means there are many ways to configure your cluster, but only certain configurations will meet the reliability and security standards required for production workloads.

Resource Management

Proper resource management is the foundation of a stable Kubernetes cluster. Without it, you'll experience unpredictable performance, cascading failures, and difficulty troubleshooting issues.

Setting Resource Requests and Limits

Every container should define both resource requests and limits. Requests tell the scheduler how much CPU and memory the container needs, while limits cap the maximum resources it can consume.

Resource Requests:

  • Guarantee minimum resources for your application
  • Used by the scheduler to place pods on appropriate nodes
  • Should match your application's baseline resource usage
  • Too low = potential starvation, too high = wasted resources

Resource Limits:

  • Prevent resource hogging and protect other workloads
  • CPU limits throttle, memory limits cause OOM kills
  • Set limits 1.5-2x higher than requests for burstable workloads
  • For critical services, keep limits close to requests for QoS
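As a sketch, a container spec following these guidelines might look like this (the names, image, and values are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: api-server            # illustrative name
spec:
  containers:
    - name: api
      image: example.com/api:1.4.2   # placeholder image
      resources:
        requests:
          cpu: "250m"         # baseline usage; used by the scheduler
          memory: "256Mi"
        limits:
          cpu: "500m"         # ~2x the request for a burstable workload
          memory: "512Mi"     # exceeding this triggers an OOM kill
```

Because requests and limits differ here, this pod would land in the Burstable QoS class; setting them equal would make it Guaranteed.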

Quality of Service Classes

Kubernetes assigns QoS classes based on your resource configuration:

  • Guaranteed: Requests equal limits - highest priority, best for critical services
  • Burstable: Requests < limits - good for variable workloads
  • BestEffort: No requests or limits - lowest priority, killed first under pressure

Resource Quotas and Limit Ranges

Use namespace-level resource quotas to prevent any single team or application from consuming all cluster resources. Implement limit ranges to enforce minimum and maximum resource specifications for containers.
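A minimal sketch of both objects for a hypothetical team-a namespace (all values are illustrative):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "10"        # total CPU all pods may request
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    pods: "50"                # cap on pod count in the namespace
---
apiVersion: v1
kind: LimitRange
metadata:
  name: container-limits
  namespace: team-a
spec:
  limits:
    - type: Container
      defaultRequest:         # applied when a container sets no requests
        cpu: "100m"
        memory: 128Mi
      default:                # applied when a container sets no limits
        cpu: "500m"
        memory: 512Mi
      max:                    # hard ceiling per container
        cpu: "2"
        memory: 2Gi
```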

Security Hardening

Security in Kubernetes is multi-layered, spanning the infrastructure, cluster configuration, container images, and application code. A breach at any layer can compromise your entire system.

Role-Based Access Control (RBAC)

RBAC is essential for controlling who can do what in your cluster. Follow the principle of least privilege - grant only the minimum permissions necessary for each user or service account.

Best practices for RBAC:

  • Never use cluster-admin binding for regular users or applications
  • Create role bindings at the namespace level when possible
  • Use service accounts for pod-to-API-server communication
  • Regularly audit RBAC permissions with tools like rbac-lookup
  • Implement separate roles for different environments (dev, staging, prod)
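For example, a narrowly scoped Role and RoleBinding might grant a CI service account read-only access to Deployments in one namespace (the names here are assumptions):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: deployment-reader
  namespace: prod
rules:
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "watch"]   # read-only, no mutation verbs
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ci-deployment-reader
  namespace: prod
subjects:
  - kind: ServiceAccount
    name: ci-bot                      # hypothetical service account
    namespace: prod
roleRef:
  kind: Role
  name: deployment-reader
  apiGroup: rbac.authorization.k8s.io
```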

Pod Security Standards

Pod Security Standards replace Pod Security Policies, which were deprecated and removed in Kubernetes 1.25, and define three levels of security controls:

  • Privileged: Unrestricted policy, use only for trusted system workloads
  • Baseline: Prevents known privilege escalations, good default for most workloads
  • Restricted: Heavily restricted, follows current pod hardening best practices
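These levels are enforced per namespace via labels read by the built-in Pod Security admission controller. A sketch that enforces the restricted profile on a production namespace:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: prod                                      # illustrative namespace
  labels:
    pod-security.kubernetes.io/enforce: restricted   # reject violating pods
    pod-security.kubernetes.io/audit: restricted     # record violations
    pod-security.kubernetes.io/warn: restricted      # warn clients on apply
```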

Image Security

Container images are a common attack vector. Implement these practices to secure your image supply chain:

  • Scan images for vulnerabilities with tools like Trivy, Clair, or Snyk
  • Use minimal base images (Alpine, distroless) to reduce attack surface
  • Never run containers as root - use USER directive in Dockerfile
  • Sign images and use admission controllers to verify signatures
  • Pull images from private registries with image pull secrets
  • Regularly update base images to patch security vulnerabilities

Network Policies

By default, all pods in a Kubernetes cluster can communicate with each other. Network policies let you implement micro-segmentation, restricting traffic based on pod labels and namespaces.

Network policy best practices:

  • Start with a default deny-all policy for each namespace
  • Explicitly allow only required traffic between services
  • Restrict ingress and egress traffic separately
  • Use namespace selectors to isolate environments
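As a sketch, a default deny-all policy plus one explicit allow rule (the app labels and port are assumptions):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: prod
spec:
  podSelector: {}            # empty selector matches every pod
  policyTypes:
    - Ingress
    - Egress
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-api
  namespace: prod
spec:
  podSelector:
    matchLabels:
      app: api               # illustrative label
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend  # only frontend pods may connect
      ports:
        - protocol: TCP
          port: 8080
```

Note that network policies require a CNI plugin that enforces them, such as Calico or Cilium.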

Secrets Management

Never store secrets in ConfigMaps or environment variables. Use Kubernetes Secrets with encryption at rest enabled, or better yet, integrate with external secret management solutions like HashiCorp Vault, AWS Secrets Manager, or Azure Key Vault.
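For illustration, a pod can consume a Secret as a read-only mounted volume rather than as environment variables; this sketch assumes a Secret named db-credentials already exists in the namespace, ideally synced from an external manager rather than committed to Git:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app                          # illustrative name
spec:
  containers:
    - name: app
      image: example.com/app:1.0.0   # placeholder image
      volumeMounts:
        - name: db-credentials
          mountPath: /etc/secrets    # one file per Secret key appears here
          readOnly: true
  volumes:
    - name: db-credentials
      secret:
        secretName: db-credentials   # assumed to exist out of band
```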

High Availability

High availability means your application remains accessible even when individual components fail. Kubernetes provides several mechanisms to achieve this, but they must be configured correctly.

Multi-Replica Deployments

Run at least 2-3 replicas of each critical service. This ensures your application stays available during rolling updates, node failures, or pod evictions.

Considerations for replicas:

  • Set replica count based on expected traffic and failure tolerance
  • Use HorizontalPodAutoscaler to scale replicas automatically
  • Ensure replicas are spread across availability zones
  • Account for replica overhead in resource planning
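A minimal HorizontalPodAutoscaler sketch targeting a hypothetical Deployment, scaling on average CPU utilization (all names and thresholds are illustrative):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
  namespace: prod
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api                    # hypothetical Deployment
  minReplicas: 3                 # floor for availability
  maxReplicas: 10                # ceiling for cost control
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out above 70% of requests
```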

Pod Disruption Budgets

Pod Disruption Budgets (PDBs) ensure that a minimum number of replicas remain available during voluntary disruptions like node drains or cluster upgrades.

Configure PDBs to specify either:

  • minAvailable: Minimum pods that must remain available (e.g., 2)
  • maxUnavailable: Maximum pods that can be unavailable (e.g., 1)
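A minimal PDB sketch using minAvailable (the name and label are assumptions):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
  namespace: prod
spec:
  minAvailable: 2          # node drains wait if this would be violated
  selector:
    matchLabels:
      app: api             # illustrative label
```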

Health Checks

Kubernetes offers three types of health probes that work together to ensure your application runs smoothly:

Liveness Probes:

  • Determine if a container is running properly
  • Failed liveness probes trigger container restarts
  • Check for deadlocks, infinite loops, or corrupted state

Readiness Probes:

  • Determine if a container is ready to accept traffic
  • Failed readiness probes remove the pod from service endpoints
  • Use during startup or when depending on external services

Startup Probes:

  • Handle slow-starting containers without affecting runtime checks
  • Disable liveness and readiness checks until startup succeeds
  • Essential for legacy applications with long initialization times
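The three probes above can be sketched together in a container spec; the paths, port, and timings are illustrative assumptions:

```yaml
# Fragment of a pod's container list.
containers:
  - name: app
    image: example.com/app:2.1.0   # placeholder image
    ports:
      - containerPort: 8080
    startupProbe:
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10
      failureThreshold: 30         # up to 30 x 10s = 5 min to start
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10
      failureThreshold: 3          # restart after ~30s of failures
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
      periodSeconds: 5             # failing removes pod from endpoints
```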

Pod Topology Spread

Use topology spread constraints to distribute pods evenly across availability zones, nodes, or other topology domains. This prevents all replicas from running on the same node or zone, which would create a single point of failure.
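As an illustrative fragment of a pod template spec (the app label is an assumption):

```yaml
# Fragment of a pod template's spec.
topologySpreadConstraints:
  - maxSkew: 1                                 # zones may differ by at most one pod
    topologyKey: topology.kubernetes.io/zone   # spread across zones
    whenUnsatisfiable: ScheduleAnyway          # use DoNotSchedule for a hard rule
    labelSelector:
      matchLabels:
        app: api                               # illustrative label
```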

Deployment Strategies

How you deploy updates significantly impacts availability and risk during releases.

Rolling Updates

The default strategy that gradually replaces old pods with new ones. Configure maxUnavailable and maxSurge to control update speed and resource usage.
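A sketch of a Deployment configured for a conservative rolling update (names and image are placeholders):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api                  # illustrative name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1      # at most one old pod down at a time
      maxSurge: 1            # at most one extra pod during the rollout
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: example.com/api:1.4.3   # placeholder image
```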

Blue-Green Deployments

Run two identical production environments (blue and green). Deploy to the inactive environment, test thoroughly, then switch traffic. Provides instant rollback capability.

Canary Deployments

Gradually roll out changes to a small subset of users before full deployment. Monitor metrics closely and rollback if issues are detected. Use service mesh tools like Istio or Linkerd for fine-grained traffic control.

Monitoring and Observability

You cannot effectively operate what you cannot observe. Comprehensive monitoring is non-negotiable in production Kubernetes environments.

Metrics Collection

Implement the Prometheus stack for metrics collection and visualization:

  • Prometheus: Time-series database and scraper
  • Grafana: Visualization and dashboards
  • AlertManager: Alert routing and grouping
  • kube-state-metrics: Cluster-level metrics
  • node-exporter: Node-level hardware metrics

Key Metrics to Monitor

Focus on these critical metrics for cluster and application health:

  • Cluster metrics: Node CPU, memory, disk usage; pod count; node status
  • Application metrics: Request rate, error rate, latency (RED method)
  • Resource metrics: Container CPU/memory usage vs limits; pending pods
  • Custom business metrics: Orders processed, users active, etc.

Centralized Logging

Aggregate logs from all containers and nodes into a centralized system for easy searching and correlation. Popular options include:

  • ELK/EFK Stack (Elasticsearch, Fluentd/Logstash, Kibana)
  • Loki with Grafana (lightweight, cost-effective)
  • CloudWatch, Stackdriver, or other cloud-native solutions

Distributed Tracing

Implement distributed tracing to understand request flows across microservices. Tools like Jaeger, Zipkin, or cloud-provider solutions show you exactly where time is spent and where failures occur.

Cluster Configuration Best Practices

Node Management

  • Use node pools with different instance types for different workload types
  • Enable cluster autoscaler to automatically adjust node count
  • Regularly update node images and Kubernetes versions
  • Use taints and tolerations to dedicate nodes for specific workloads
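For example, dedicating nodes to a workload type can be sketched as a toleration plus node selector in the pod spec; the taint key, value, and label here are assumptions, and the taint itself would be applied to the nodes out of band:

```yaml
# Fragment of a pod spec; assumes nodes tainted with workload=gpu:NoSchedule
# and labeled workload=gpu.
tolerations:
  - key: "workload"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"   # lets this pod land on the tainted nodes
nodeSelector:
  workload: gpu            # and steers it onto them
```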

Namespace Organization

  • Separate environments (dev, staging, prod) into different clusters
  • Use namespaces to isolate teams or applications within a cluster
  • Apply resource quotas and network policies at namespace level
  • Use naming conventions for easy identification

Configuration Management

  • Use GitOps practices - store all configs in Git
  • Leverage tools like ArgoCD or Flux for automated deployments
  • Use Helm charts or Kustomize for templating and reusability
  • Never apply configs manually in production

Backup and Disaster Recovery

Disaster recovery planning is often overlooked until it's too late.

What to Backup

  • etcd cluster state (contains all Kubernetes resources)
  • Persistent volume data
  • Configuration repositories (your GitOps source of truth)
  • Secrets and certificates

Backup Tools

  • Velero for cluster-level backups and migrations
  • etcd snapshot for control plane backups
  • Volume snapshot capabilities provided by cloud providers

Recovery Testing

Regularly test your recovery procedures. A backup that can't be restored is worthless. Practice restoring in non-production environments and document the recovery process.

Cost Optimization

Kubernetes can become expensive if not managed properly. Implement these practices to control costs:

  • Right-size your resource requests - don't over-provision
  • Use cluster autoscaler to scale nodes based on demand
  • Leverage spot/preemptible instances for non-critical workloads
  • Implement horizontal pod autoscaling to match capacity to load
  • Use tools like Kubecost to track and attribute spending
  • Clean up unused resources - old volumes, completed jobs, etc.

Common Production Pitfalls

Learn from common mistakes that lead to production incidents:

Pitfall 1: No Resource Limits

Problem: A single misbehaving pod consumes all node resources
Solution: Always set resource requests and limits

Pitfall 2: Missing Health Checks

Problem: Broken pods continue receiving traffic
Solution: Implement liveness and readiness probes

Pitfall 3: Single Replica Services

Problem: Downtime during updates or failures
Solution: Run at least 2-3 replicas with PDBs

Pitfall 4: No Monitoring

Problem: Issues discovered by users, not operators
Solution: Implement comprehensive monitoring and alerting

Pitfall 5: Storing Secrets in Code

Problem: Credentials exposed in version control
Solution: Use Kubernetes Secrets or external secret managers

Conclusion

Running Kubernetes in production successfully requires attention to detail across many domains - resource management, security, availability, monitoring, and cost optimization. The practices outlined in this guide represent lessons learned from thousands of production deployments across organizations of all sizes.

Start by implementing the fundamentals: resource limits, health checks, RBAC, and monitoring. Then progressively add more advanced practices like network policies, pod disruption budgets, and GitOps workflows as your experience grows.

Remember that production excellence is a journey, not a destination. Continuously measure, learn, and improve your Kubernetes operations. Stay updated with Kubernetes releases, engage with the community, and always prioritize reliability and security over complexity.

#Kubernetes #DevOps #Production #BestPractices

About Tigran Khachatryan

Tigran Khachatryan is a senior software engineer at d3vly with over 8 years of experience in DevOps. Passionate about sharing knowledge and helping developers build better software.