Common Configuration
Reserve priority and space for Kubernetes resources
Kubelet can be instructed to reserve a certain amount of resources for the system and for Kubernetes components (kubelet itself and Docker etc).
Reserved resources are subtracted from the node's allocatable resources. This improves scheduling and makes resource allocation/usage more transparent.
You can explore how to reserve resources in the official documentation.
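As a rough sketch, reservations can be expressed in a kubelet configuration file. The amounts below are illustrative assumptions, not recommendations; suitable values depend on node size and workload.

```yaml
# KubeletConfiguration fragment; all amounts are assumptions for illustration.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Reserved for Kubernetes components (kubelet, container runtime)
kubeReserved:
  cpu: 500m
  memory: 1Gi
# Reserved for operating system daemons
systemReserved:
  cpu: 500m
  memory: 1Gi
# Hard eviction threshold; also subtracted from the node's allocatable memory
evictionHard:
  memory.available: 500Mi
```

With these values, a node's allocatable memory would be its capacity minus 2.5Gi.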
Handling long lived connections
https://itnext.io/on-grpc-load-balancing-683257c5b7b3
https://kubernetes.io/blog/2018/11/07/grpc-load-balancing-on-kubernetes-without-tears/
Building container images
Use only base images from trusted image providers
Minimise image sizes/build "scratch" images
Image tags are immutable
If the content of an image changes, the image tag must change too.
Common strategies are to use the Git commit hash or the CI build ID as part of the image tag. Can be combined with semantic versioning.
For example: v1.0.1-bfeda01f
Avoid the latest tag
Deploying applications
Use an Ingress for routing traffic to your app
Even for simple applications.
Requires installation of an Ingress controller.
Set PodDisruptionBudgets for your applications
TODO: integrate with current "Set pod disruption budgets" in "Fault tolerance" (application development)
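A minimal PodDisruptionBudget might look like the following sketch. The name, label, and minAvailable value are hypothetical and would need to match your own workload.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb            # hypothetical name
spec:
  # Keep at least 2 Pods running during voluntary disruptions (e.g. node drains)
  minAvailable: 2
  selector:
    matchLabels:
      app.kubernetes.io/name: my-app   # hypothetical label
```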
All services that don't need to be accessed from outside the cluster should be ClusterIP
Secure Ingress endpoints with TLS
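The sketch below shows an Ingress that terminates TLS and routes to a ClusterIP Service. The hostname, the Secret and Service names, and the nginx ingress class are assumptions for illustration; the TLS Secret must already exist in the same namespace.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app
spec:
  ingressClassName: nginx          # assumes an NGINX Ingress controller is installed
  tls:
    - hosts:
        - app.example.com
      secretName: my-app-tls       # Secret of type kubernetes.io/tls
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-app       # a ClusterIP Service
                port:
                  number: 80
```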
User management
Use an external identity system for user management
For example, Azure Active Directory, AWS IAM.
Users authenticate with a bearer token, and the API server validates the token with the external service.
Additional services
Run your own container registry
TODO: is this a best practice?
If you use Helm, run your own Helm chart registry
Chartmuseum, Artifactory
Pod networking
TODO: integrate with current "Network policies"
Use NetworkPolicy to restrict the communication between Pods
Create a deny-all network policy in all namespaces
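A deny-all policy can be sketched as follows. The namespace name is a placeholder, and the policy has to be created once per namespace. It only takes effect if the cluster's network plugin enforces NetworkPolicy.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: my-namespace   # placeholder; repeat for every namespace
spec:
  podSelector: {}           # empty selector matches all Pods in the namespace
  policyTypes:
    - Ingress
    - Egress
  # No ingress or egress rules are listed, so all traffic is denied
```

More specific policies can then allow the traffic each application actually needs.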
Avoid inter-Pod communication across namespaces
Pod security
TODO: integrate with current "Pod security policies" (governance)
Use PodSecurityPolicy to enforce security features in all Pods
- Using PodSecurityPolicies requires the PodSecurityPolicy admission controller to be enabled, but in most Kubernetes deployments it is not enabled. As soon as the PodSecurityPolicy admission controller is enabled, you need appropriate PodSecurityPolicy resources to allow any Pods to be created.
- You also need to grant "use" access to the created PodSecurityPolicies to the service account of the workload or the controller of the workload (you can use the system:serviceaccounts group, which comprises all service accounts in the cluster).
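Granting "use" access could look like the sketch below. It assumes a PodSecurityPolicy named restricted already exists; that name and the role names are hypothetical.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: psp-restricted-user
rules:
  - apiGroups: ["policy"]
    resources: ["podsecuritypolicies"]
    resourceNames: ["restricted"]   # hypothetical PodSecurityPolicy name
    verbs: ["use"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: psp-restricted-all-serviceaccounts
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: psp-restricted-user
subjects:
  # Built-in group containing every service account in the cluster
  - apiGroup: rbac.authorization.k8s.io
    kind: Group
    name: system:serviceaccounts
```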
General policies
TODO: integrate with current "Custom policies" (governance)
Use Open Policy Agent (OPA) and Gatekeeper to enforce custom policies on all resources
Only allow compliant Kubernetes resources (of any kind) to be applied to the cluster (compliant with the defined policies).
- Open Policy Agent (OPA): policy engine
- Gatekeeper
- Validating admission control webhook
- Kubernetes Operator for installing, configuring and managing Open Policy Agent policies
Example policies that can be implemented with Gatekeeper:
- Services must not be exposed publicly on the internet
- Allow containers only from trusted container registries
- All containers must have resource limits
- Ingress hostnames must not overlap
- Ingresses must use only HTTPS
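As an illustration of the "trusted container registries" policy, here is a sketch of a Gatekeeper ConstraintTemplate and a matching constraint. The kind K8sAllowedRepos, the Rego rule, and the registry prefix are assumptions written for this example; it checks only spec.containers of Pods (not init containers) and assumes Gatekeeper is already installed.

```yaml
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8sallowedrepos
spec:
  crd:
    spec:
      names:
        kind: K8sAllowedRepos
      validation:
        openAPIV3Schema:
          type: object
          properties:
            repos:
              type: array
              items:
                type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8sallowedrepos

        # Report a violation for every container whose image
        # does not start with one of the allowed prefixes.
        violation[{"msg": msg}] {
          container := input.review.object.spec.containers[_]
          not image_allowed(container.image)
          msg := sprintf("container <%v> uses image <%v> from an untrusted registry", [container.name, container.image])
        }

        image_allowed(image) {
          startswith(image, input.parameters.repos[_])
        }
---
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sAllowedRepos
metadata:
  name: trusted-registries-only
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
  parameters:
    repos:
      - "registry.example.com/"   # hypothetical trusted registry prefix
```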
Stateful applications
Avoid managing state in Kubernetes, if possible
Use storage services outside the cluster (e.g. Amazon DynamoDB, Amazon S3).
Create a default StorageClass named "default"
Because this name is often the default in Helm charts.
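A default StorageClass could be declared as in this sketch. The provisioner and parameters assume a cluster on AWS with the EBS CSI driver installed; replace them with whatever your platform provides.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: default
  annotations:
    # Marks this class as the cluster-wide default
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: ebs.csi.aws.com      # assumption: AWS EBS CSI driver
parameters:
  type: gp3                       # assumption: EBS volume type
volumeBindingMode: WaitForFirstConsumer
```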
Prefer operators for running stateful applications rather than managing them yourself
Many stateful applications (e.g. databases) have operators that make it easier to run and manage them reliably.
Admission controllers
Use the officially recommended set of admission controllers
Recommended set of admission controllers to be enabled: NamespaceLifecycle, LimitRanger, ServiceAccount, DefaultStorageClass, DefaultTolerationSeconds, MutatingAdmissionWebhook, ValidatingAdmissionWebhook, Priority, ResourceQuota, PodSecurityPolicy
If you use multiple mutating admission webhooks, make sure that they don't modify the same fields of a resource
They would undo each other's actions. Furthermore, the order in which admission webhooks run is undefined.
If you create a mutating admission webhook, also create a validating admission webhook that validates the mutations
Restrict the requests sent to an admission webhook to the minimum
Scope admission webhooks to specific namespaces
Using the namespaceSelector field in the MutatingWebhookConfiguration/ValidatingWebhookConfiguration resource.
Always exclude the kube-system namespace from the scope of a custom admission webhook
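The following sketch combines these points: the webhook receives only create and update requests for Deployments, and kube-system is excluded through namespaceSelector. The webhook and Service names are hypothetical, and the kubernetes.io/metadata.name label is assumed to be present on namespaces (recent Kubernetes versions set it automatically).

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: my-policy-webhook                  # hypothetical
webhooks:
  - name: validate.my-policy.example.com   # hypothetical
    admissionReviewVersions: ["v1"]
    sideEffects: None
    clientConfig:
      service:
        name: my-policy-webhook            # hypothetical Service serving the webhook
        namespace: webhooks
        path: /validate
      # caBundle omitted here; it must contain the CA that signed the webhook's certificate
    # Send only the requests the webhook really needs to see
    rules:
      - apiGroups: ["apps"]
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["deployments"]
    # Skip the kube-system namespace
    namespaceSelector:
      matchExpressions:
        - key: kubernetes.io/metadata.name
          operator: NotIn
          values: ["kube-system"]
```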
Restrict RBAC rules for creating admission webhook configurations
Restrict "create" on MutatingWebhookConfiguration/ValidatingWebhookConfiguration resources.
Managing YAML resource manifests
If you deploy to multiple environments, use a templating system
Do not manually maintain copies of the same YAML files for the different environments. Use a templating system like Helm, kustomize, Kapitan, ytt.
Works by defining the common structure of the YAML files once and defining values to substitute for each environment.
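With kustomize, for instance, a per-environment overlay might look like this sketch. The directory layout, the Deployment name, and the replica count are assumptions; the shared manifests would live in a base directory referenced by the overlay.

```yaml
# overlays/production/kustomization.yaml (hypothetical path)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base            # common manifests shared by all environments
namePrefix: prod-
patches:
  # Environment-specific value: run more replicas in production
  - target:
      kind: Deployment
      name: my-app        # hypothetical Deployment defined in the base
    patch: |-
      - op: replace
        path: /spec/replicas
        value: 5
```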
Logging
TODO: integrate with current "Logging" (application development) and "Logging setup" (cluster configuration).
Applications write logs to stdout or stderr rather than to files
This allows for the use of node agent based log aggregation systems rather than sidecar container based ones.
Everything a container writes to stdout or stderr is saved by the container runtime at a known location. This means that an agent running on each worker node can collect the logs of all the containers on that node and send them to the central log store.
If the container writes the logs to a file instead, the log file is local to the container and can't be accessed by a node agent. In this case, each Pod would require a sidecar container that collects these logs and sends them to the central log store.
A node agent (for example, deployed as a DaemonSet) is more efficient and easier to manage than a sidecar container for every Pod.
Use a log aggregation system
Cluster components and applications log to their containers, which means that logs are distributed all across the cluster. This makes it difficult to inspect the logs (possibly of multiple components) to troubleshoot a problem.
A log aggregation system collects all these logs and saves them at a central place in the cluster so that you can inspect them at a single place.
Some log aggregation tools include: EFK stack (Elasticsearch, Fluentd, Kibana), Loki, DataDog, Sumo Logic, Sysdig, GCP Stackdriver, Azure Monitor, AWS CloudWatch
Use a managed log aggregation system rather than a self-hosted one (if possible)
The operational overhead of running your own log aggregation system can become quite large, because you need to take care of things like persistent storage, backups, archival, log rotation, etc.
With a hosted log aggregation solution, all these things are taken care of for you and there are fewer things that can go wrong.
Some hosted log aggregation systems include: DataDog, Sumo Logic, Sysdig, GCP Stackdriver, Azure Monitor, AWS CloudWatch
Define a log retention and archival strategy
How long to retain logs in the log store and what to do with them afterwards?
As a general rule, retaining logs for 30-45 days in the log store is reasonable. After that, if the logs should still be available, you can move them to a cost-efficient archiving storage (e.g. Amazon Glacier).
Take into account any regulatory and internal compliance requirements in your organisation.
Collect logs from both cluster components and applications
Your applications are not the only components that produce logs in your cluster — all the cluster components do as well, and you should collect their logs too.
Here are some of the cluster components whose logs you should also have in your log aggregation system:
- All nodes: kubelet, container runtime
- Master nodes: API server, scheduler, controller manager
- (Kubernetes auditing (all requests to the API server))
Monitoring
Set up extensive monitoring in your cluster
TODO: what kind of advice to give in this item?
Monitoring is the collection and aggregation of measurements from different components of your cluster. It is extremely important for gaining insight into the internals of your cluster, assessing its health, and detecting and troubleshooting problems, and even preventing them before they occur.
Monitoring consists of two parts:
- Components make measurements and expose them as metrics
- The monitoring system periodically collects these metrics
Some monitoring systems include:
- Self-hosted: Prometheus
- Managed: DataDog, Sumo Logic, Sysdig, Google Stackdriver, Azure Monitor, Azure Monitor for Containers, AWS CloudWatch, AWS Container Insights
What should you monitor?
Exactly what metrics to collect depends on the components in your cluster (i.e. what metrics they expose).
There are some general guidelines as to the types of metrics to collect:
- Infrastructure (e.g. nodes): USE metrics (Utilization, Saturation, Errors)
- Applications: RED metrics (Rate, Errors, Duration)
If you use a self-hosted monitoring system, run it in a dedicated management cluster
You could run the monitoring system in the production cluster itself. However, if there is a problem with the cluster, the monitoring system might be affected as well (and you might need the monitoring system to troubleshoot the problem with the cluster).
To avoid this, you can create a dedicated cluster that only runs management tools (such as the monitoring system) for all your other clusters.
Alerting
Only alert on events that require immediate human intervention
Omit alerts that don't require a human to take action right away. Too many unimportant alerts cause "alert fatigue" and cause important alerts to be ignored.
Focus on alerts that affect your service level objectives (SLOs)
The core of your alerts should be on incidents that negatively affect the service level that you promised to your customers.
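If you use Prometheus, an SLO-oriented alert could be sketched as the rule file below. The metric name http_requests_total, its status label, the 1% threshold, and the severity label are assumptions; they depend on how your applications are instrumented and on your actual SLOs.

```yaml
groups:
  - name: slo-alerts
    rules:
      - alert: HighErrorRate
        # Fraction of requests answered with a 5xx status over the last 5 minutes
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
          sum(rate(http_requests_total[5m])) > 0.01
        for: 10m                  # fire only if the condition persists
        labels:
          severity: page          # hypothetical label used for alert routing
        annotations:
          summary: More than 1% of requests are failing
```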
Automate remediation of all non-critical alerts
All incidents that don't merit an alert (because they don't require immediate human intervention, or they don't affect the customer experience) should be handled automatically (invest in automation).
Configuration and secrets
TODO: integrate with current "Configuration and secrets"
Separate all configuration from the application code
Configuration should be maintained outside the application code.
This has several benefits. First, changing the configuration does not require recompiling the application. Second, the configuration can be updated when the application is running. Third, the same code can be used in different environments.
In Kubernetes, the configuration can be saved in ConfigMaps, which can then be mounted into containers as volumes or passed in as environment variables.
Save only non-sensitive configuration in ConfigMaps. For sensitive information (such as credentials), use the Secret resource.
Save non-critical configuration in ConfigMaps and critical configuration in Secrets
Secrets are similar to ConfigMaps but have some special semantics to protect their content (e.g. content of Secrets is not displayed in some kubectl outputs)
Mount Secrets as volumes, not environment variables
The content of Secret resources should be mounted into containers as volumes rather than passed in as environment variables.
This prevents the secret values from appearing in the command that was used to start the container, which may be inspected by individuals who shouldn't have access to the secret values.
- Injected environment variables are always present and may become artifacts in logs for the entire system.
- Secrets should be mounted as a volume rather than injected as environment variables. That way, they are available only to the process/container that needs them, not to the whole Pod.
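A minimal sketch of mounting a Secret as a volume follows. The Secret name, image, and mount path are hypothetical; each key of the Secret appears as a file under the mount path.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  containers:
    - name: app
      image: registry.example.com/my-app:v1.0.1-bfeda01f   # hypothetical image
      volumeMounts:
        - name: db-credentials
          mountPath: /etc/secrets/db    # the app reads its credentials from files here
          readOnly: true
  volumes:
    - name: db-credentials
      secret:
        secretName: db-credentials      # hypothetical Secret in the same namespace
```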
See tweet
Use PodPresets to automatically mount ConfigMaps or Secrets into containers
Include version in ConfigMap and Secret to ensure configuration changes are reloaded
For example, instead of naming a ConfigMap just config, name it config-v1. Whenever you update the configuration in the ConfigMap, also update the version (for example, config-v2). Then, update all references to the ConfigMap (for example, in a Deployment) to the new version (e.g. config-v2).
This causes all Pods to be restarted, which ensures that the new configuration is indeed loaded into the Pods, regardless of whether the ConfigMap is mounted as a volume or passed in as environment variables, and, in the former case, regardless of whether the app watches the configuration file for changes.
You can automate this process with a CD system.
If you don't delete the previous versions of the ConfigMap (e.g. config-v1), you can easily roll back to a previous configuration.
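The versioning scheme can be sketched like this. The configuration key, the image, and the Deployment name are hypothetical. Because the ConfigMap name is part of the Pod template, changing it from config-v1 to config-v2 triggers a rolling update.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-v2          # new version; config-v1 is kept for rollbacks
data:
  LOG_LEVEL: info          # hypothetical setting
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app.kubernetes.io/name: my-app
  template:
    metadata:
      labels:
        app.kubernetes.io/name: my-app
    spec:
      containers:
        - name: app
          image: registry.example.com/my-app:v1.0.1-bfeda01f   # hypothetical image
          envFrom:
            - configMapRef:
                name: config-v2   # was config-v1; this edit changes the Pod template
```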
Role-based access control (RBAC)
TODO: integrate with current "Role-Based Access Control (RBAC) policies" (governance)
Follow the "least privilege" principle for RBAC roles
Don't use "catch all" service accounts for Pods
If a Pod needs to access the Kubernetes API, tailor an RBAC role that allows exactly those operations that the Pod has to do (and nothing more), assign it to a new service account, and assign this service account to the Pod.
Don't use an existing service account for the Pod that might have an associated role with more permissions than the Pod needs.
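As a sketch, suppose a Pod only needs to read the Pods in its own namespace. The names and the chosen permissions below are hypothetical.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: pod-reader
  namespace: my-namespace
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: my-namespace
rules:
  # Exactly the operations the Pod needs, nothing more
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-reader
  namespace: my-namespace
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: pod-reader
subjects:
  - kind: ServiceAccount
    name: pod-reader
    namespace: my-namespace
```

The Pod then selects this identity by setting spec.serviceAccountName to pod-reader.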
Labelling resources
TODO: integrate with current "Tagging resources"
All resources have a common set of recommended labels
What?
The Kubernetes documentation recommends a set of common labels to be applied to all resource objects.
Why?
Having a set of common labels allows third-party tools to interoperate with the resources in your cluster. Furthermore, a common set of labels facilitates and standardises manual management of the resources.
How?
The recommended labels are:
- app.kubernetes.io/name: the name of the application
- app.kubernetes.io/instance: a unique name identifying the instance of an application
- app.kubernetes.io/version: the current version of the application (e.g., a semantic version, revision hash, etc.)
- app.kubernetes.io/component: the component within the architecture
- app.kubernetes.io/part-of: the name of a higher level application this one is part of
- app.kubernetes.io/managed-by: the tool being used to manage the operation of an application
Here is an example of applying these labels to a StatefulSet resource:
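A minimal sketch of such a manifest, showing only the metadata. The application (a MySQL database that is part of a WordPress installation managed by Helm) and all label values are chosen purely for illustration.

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mysql-abcxzy
  labels:
    app.kubernetes.io/name: mysql
    app.kubernetes.io/instance: mysql-abcxzy
    app.kubernetes.io/version: "5.7.21"
    app.kubernetes.io/component: database
    app.kubernetes.io/part-of: wordpress
    app.kubernetes.io/managed-by: helm
# spec omitted for brevity
```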
Note that depending on the application, only some of these labels may be defined (for example, app.kubernetes.io/name and app.kubernetes.io/instance), but they should be applied to all resource objects.
Use independent versions for containers, Pods, and apps
E.g. if Pod specification changes, update only the Pod and app version, but not the container image version, etc.
Maintain a "release" for top-level resources
Resources that comprise an entire app (e.g. Deployment, StatefulSet) should have a release label.
This label should change every time the application is deployed (the release should be updated even if the version of the app stays the same).
Resource management
TODO: integrate with current "Resource utilisation"
Define resource requests for all containers
What?
You can set a resource request (CPU and memory) for each container of a Pod in the Pod specification.
Why?
The request for a resource is the minimum amount of resource the container needs for running. It influences the scheduling decision of the Kubernetes scheduler (the scheduler schedules a pod only to a node that has enough free resources to accommodate the requests of all the containers of the pod).
By default, no resource requests are set, which means that the scheduler schedules a pod to any node, no matter how many free resources it has.
Setting requests for memory and CPU for each container of a Pod ensures that the Pod is able to run properly on the node it is scheduled to.
How?
Set the pod.spec.containers[].resources.requests field of a Pod specification:
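A minimal sketch, with a hypothetical image and request values that would need to be derived from measuring the actual application:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  containers:
    - name: app
      image: registry.example.com/my-app:v1.0.1-bfeda01f   # hypothetical image
      resources:
        requests:
          cpu: 250m        # a quarter of a CPU core
          memory: 256Mi
```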
Define resource limits according to the desired Quality of Service (QoS) class
The limit for a resource is the maximum amount of the resource that the container is allowed to use (you can think of it as an upper cap for resource usage bursts). What happens when a container reaches the limit depends on the type of resource:
- For memory, the container is killed and restarted
- For CPU, the container is throttled (TODO: what does "throttled" mean exactly?)
By default, no resource limits are set, which means that there's no limit on how many resources a Pod may use on a node.
Setting resource limits prevents Pods from monopolising all the resources on a node.
Given a value for the resource request, the value for the resource limit determines the Quality of Service (QoS) class of the Pod:
- Best-effort: no requests and limits, lowest priority, Pod is killed first
- Burstable: limits are higher than requests, middle-priority, killed if no Best-effort Pods exist
- Guaranteed: limits equal requests, highest priority, killed only if no Best-effort and Burstable Pods exist
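For example, the following sketch would place a Pod in the Guaranteed class, because every container sets CPU and memory limits equal to its requests. The image and values are hypothetical. Raising the limits above the requests would make the Pod Burstable instead.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  containers:
    - name: app
      image: registry.example.com/my-app:v1.0.1-bfeda01f   # hypothetical image
      resources:
        requests:
          cpu: 500m
          memory: 512Mi
        limits:
          cpu: 500m        # equal to the request
          memory: 512Mi    # equal to the request
```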
Don't define CPU limits
TODO: how does this affect the QoS class? Where's the source that this is a best practice?
Create a LimitRange for each namespace
Create a ResourceQuota for each namespace
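A LimitRange supplies default requests and limits for containers that don't declare their own, while a ResourceQuota caps the total for the namespace. The sketch below shows both; the namespace name and all amounts are assumptions.

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: container-defaults
  namespace: my-namespace
spec:
  limits:
    - type: Container
      defaultRequest:      # applied when a container sets no requests
        cpu: 100m
        memory: 128Mi
      default:             # applied when a container sets no limits
        cpu: 500m
        memory: 256Mi
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: namespace-quota
  namespace: my-namespace
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
    pods: "20"
```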
Use PriorityClass to define the order in which Pods are scheduled and evicted
Health checks
TODO: integrate with current "Health checks"
Set initialDelaySeconds to an appropriate value for all liveness and readiness probes
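In the sketch below, the endpoints, port, and delays are hypothetical. The liveness delay should be long enough to cover the application's start-up time, otherwise the container may be restarted before it has finished starting.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  containers:
    - name: app
      image: registry.example.com/my-app:v1.0.1-bfeda01f   # hypothetical image
      ports:
        - containerPort: 8080
      readinessProbe:
        httpGet:
          path: /ready       # hypothetical endpoint
          port: 8080
        initialDelaySeconds: 5
        periodSeconds: 10
      livenessProbe:
        httpGet:
          path: /healthz     # hypothetical endpoint
          port: 8080
        initialDelaySeconds: 30
        periodSeconds: 10
```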
Advanced scheduling
Use Pod Affinity if certain Pods should be colocated
Use Pod Anti-affinity if certain Pods should not be colocated
Use Node Affinity or Node Selector if a Pod should run only on a subset of nodes
Use Taints and Tolerations to reserve nodes for certain Pods
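Several of these mechanisms can be combined in one Pod specification. The sketch below assumes nodes that are labelled node-pool=batch and tainted with dedicated=batch:NoSchedule; these labels and taints are hypothetical. The anti-affinity rule keeps replicas of the same app on different nodes.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app
  labels:
    app.kubernetes.io/name: my-app
spec:
  # Run only on nodes carrying this label
  nodeSelector:
    node-pool: batch
  # Tolerate the taint that reserves those nodes
  tolerations:
    - key: dedicated
      operator: Equal
      value: batch
      effect: NoSchedule
  affinity:
    # Never schedule two Pods of this app on the same node
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app.kubernetes.io/name: my-app
          topologyKey: kubernetes.io/hostname
  containers:
    - name: app
      image: registry.example.com/my-app:v1.0.1-bfeda01f   # hypothetical image
```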
Application lifecycle
TODO: integrate with current "Graceful shutdown" (rename it "Application lifecycle")
Container listens for SIGTERM signal and shuts down gracefully
Container implements preStop lifecycle hook to shut down gracefully
This is an alternative to listening for the SIGTERM signal.
Set the termination grace period if gracefully shutting down takes very long
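A sketch of a preStop hook combined with a longer grace period follows. The sleep command and the 60 seconds are assumptions; a real hook would run whatever the application needs in order to drain. The time spent in the preStop hook counts towards the grace period.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  # Time allowed for preStop plus SIGTERM handling before SIGKILL (default is 30)
  terminationGracePeriodSeconds: 60
  containers:
    - name: app
      image: registry.example.com/my-app:v1.0.1-bfeda01f   # hypothetical image
      lifecycle:
        preStop:
          exec:
            # Hypothetical: wait for in-flight requests before SIGTERM is sent
            command: ["/bin/sh", "-c", "sleep 10"]
```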
Container implements postStart lifecycle hook to do any start-up tasks
Only use the exec method for the postStart lifecycle hook
Amazon Web Services (AWS)
Use kube2iam to integrate Kubernetes permissions with AWS IAM
As a best practice, I would recommend using kube2iam. It allows you to control which IAM roles a pod is allowed to assume based on namespace, and it also lets the pod see itself as having the role rather than requiring it to assume such a role.
In our setup we use kube2iam and assign nodes their IAM role - this IAM role is allowed to assume other roles. With kube2iam, we assign the role through a pod annotation, so the pod sees itself as having that role. Additionally we use namespace restrictions to control access to which roles a pod is allowed to assume.
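The setup described above might be sketched as follows, based on the annotations documented in the kube2iam README. The role ARN, account ID, and names are hypothetical, and the namespace restriction is assumed to apply only when kube2iam itself runs with namespace restrictions enabled.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: my-namespace
  annotations:
    # Roles that Pods in this namespace are allowed to assume
    iam.amazonaws.com/allowed-roles: |
      ["arn:aws:iam::123456789012:role/my-app-role"]
---
apiVersion: v1
kind: Pod
metadata:
  name: my-app
  namespace: my-namespace
  annotations:
    # Role the Pod sees itself as having
    iam.amazonaws.com/role: arn:aws:iam::123456789012:role/my-app-role
spec:
  containers:
    - name: app
      image: registry.example.com/my-app:v1.0.1-bfeda01f   # hypothetical image
```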
See #7