<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Kubernetes Blog</title>
    <link>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/</link>
    <description>The Kubernetes blog is used by the project to communicate new features, community reports, and any news that might be relevant to the Kubernetes community.</description>
    <generator>Hugo -- gohugo.io</generator>
    <language>en</language>
    <image>
      <url>https://raw.githubusercontent.com/kubernetes/kubernetes/master/logo/logo.png</url>
      <title>The Kubernetes project logo</title>
      <link>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/</link>
    </image>
    
    <atom:link href="https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/feed.xml" rel="self" type="application/rss+xml" />
    
    
    <item>
      <title>Spotlight on SIG Architecture: API Governance</title>
      <link>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2026/02/12/sig-architecture-api-spotlight/</link>
      <pubDate>Thu, 12 Feb 2026 00:00:00 +0000</pubDate>
      
      <guid>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2026/02/12/sig-architecture-api-spotlight/</guid>
      <description>
        
        
&lt;p&gt;&lt;em&gt;This is the fifth interview in a SIG Architecture Spotlight series covering its different
subprojects. This time we cover &lt;a href=&#34;https://github.com/kubernetes/community/blob/master/sig-architecture/README.md#architecture-and-api-governance-1&#34;&gt;SIG Architecture: API
Governance&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;In this SIG Architecture spotlight we talked with &lt;a href=&#34;https://github.com/liggitt&#34;&gt;Jordan Liggitt&lt;/a&gt;, lead
of the API Governance sub-project.&lt;/p&gt;
&lt;h2 id=&#34;introduction&#34;&gt;Introduction&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;FM: Hello Jordan, thank you for your availability. Tell us a bit about yourself, your role and how
you got involved in Kubernetes.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;JL&lt;/strong&gt;: My name is Jordan Liggitt. I&#39;m a Christian, husband, father of four, software engineer at
&lt;a href=&#34;https://about.google/&#34;&gt;Google&lt;/a&gt; by day, and &lt;a href=&#34;https://www.youtube.com/watch?v=UDdr-VIWQwo&#34;&gt;amateur musician&lt;/a&gt; by stealth. I was born in Texas (and still
like to claim it as my point of origin), but I&#39;ve lived in North Carolina for most of my life.&lt;/p&gt;
&lt;p&gt;I&#39;ve been working on Kubernetes since 2014. At that time, I was working on authentication and
authorization at Red Hat, and my very first pull request to Kubernetes attempted to &lt;a href=&#34;https://github.com/kubernetes/kubernetes/pull/2328&#34;&gt;add an OAuth
server&lt;/a&gt; to the Kubernetes API server. It never
exited work-in-progress status. I ended up going with a different approach that layered on top of
the core Kubernetes API server in a different project (spoiler alert: this is foreshadowing), and I
closed it without merging six months later.&lt;/p&gt;
&lt;p&gt;Undeterred by that start, I stayed involved, helped build Kubernetes authentication and
authorization capabilities, and got involved in the definition and evolution of the core Kubernetes
APIs, from early beta versions like &lt;code&gt;v1beta3&lt;/code&gt; to &lt;code&gt;v1&lt;/code&gt;. I got tagged as an API reviewer in 2016 based on
those contributions, and was added as an API approver in 2017.&lt;/p&gt;
&lt;p&gt;Today, I help lead the API Governance and code organization subprojects for SIG Architecture, and I
am a tech lead for SIG Auth.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;FM: And when did you get specifically involved in the API Governance project?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;JL&lt;/strong&gt;: Around 2019.&lt;/p&gt;
&lt;h2 id=&#34;goals-and-scope-of-api-governance&#34;&gt;Goals and scope of API Governance&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;FM: How would you describe the main goals and areas of intervention of the subproject?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;JL&lt;/strong&gt;: The surface area includes all the various APIs Kubernetes has, and there are APIs that people do not
always realize are APIs: command-line flags, configuration files, how binaries are run, how they
talk to back-end components like the container runtime, and how they persist data. People often
think of &amp;quot;the API&amp;quot; as only the &lt;a href=&#34;https://kubernetes.io/docs/reference/using-api/&#34;&gt;REST API&lt;/a&gt;... that
is the biggest and most obvious one, and the one with the largest audience, but all of these other
surfaces are also APIs. Their audiences are narrower, so there is more flexibility there, but they
still require consideration.&lt;/p&gt;
&lt;p&gt;The goals are to be stable while still enabling innovation. Stability is easy if you never change
anything, but that contradicts the goal of evolution and growth. So we balance &amp;quot;be stable&amp;quot; with
&amp;quot;allow change&amp;quot;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;FM: Speaking of changes, in terms of ensuring consistency and quality (which is clearly one of the
reasons this project exists), what are the specific quality gates in the lifecycle of a Kubernetes
change? Does API Governance get involved during the release cycle, prior to it through guidelines,
or somewhere in between? At what points do you ensure the intended role is fulfilled?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;JL&lt;/strong&gt;: We have &lt;a href=&#34;https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md&#34;&gt;guidelines and
conventions&lt;/a&gt;,
both for APIs in general and for how to change an API. These are living documents that we update as
we encounter new scenarios. They are long and dense, so we also support them with involvement at
either the design stage or the implementation stage.&lt;/p&gt;
&lt;p&gt;Sometimes, due to bandwidth constraints, teams move ahead with design work without feedback from &lt;a href=&#34;https://github.com/kubernetes/community/blob/master/sig-architecture/api-review-process.md&#34;&gt;API Review&lt;/a&gt;. That’s fine, but it means the API review happens when implementation begins,
and there may be substantial feedback. So we get involved when a new API is created or an existing
API is changed, either at design or implementation.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;FM: Is this during the Kubernetes Enhancement Proposal (KEP) process? Since KEPs are mandatory for
enhancements, I assume part of the work intersects with API Governance?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;JL&lt;/strong&gt;: It can. &lt;a href=&#34;https://github.com/kubernetes/enhancements/blob/master/keps/README.md&#34;&gt;KEPs&lt;/a&gt; vary
in how detailed they are. Some include literal API definitions. When they do, we can perform an API
review at the design stage. Then implementation becomes a matter of checking fidelity to the design.&lt;/p&gt;
&lt;p&gt;Getting involved early is ideal. But some KEPs are conceptual and leave details to the
implementation. That’s not wrong; it just means the implementation will be more exploratory. Then
API Review gets involved later, possibly recommending structural changes.&lt;/p&gt;
&lt;p&gt;There’s a trade-off regardless: detailed design upfront versus iterative discovery during
implementation. People and teams work differently, and we’re flexible and happy to consult early or
at implementation time.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;FM: This reminds me of what Fred Brooks wrote in &amp;quot;The Mythical Man-Month&amp;quot; about conceptual
integrity being central to product quality... No matter how you structure the process, there must be
a point where someone looks at what is coming and ensures conceptual integrity. Kubernetes uses APIs
everywhere -- externally and internally -- so API Governance is critical to maintaining that
integrity. How is this captured?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;JL&lt;/strong&gt;: Yes, the conventions document captures patterns we’ve learned over time: what to do in
various situations. We also have automated linters and checks to ensure correctness around patterns
like spec/status semantics. These automated tools help catch issues even when humans miss them.&lt;/p&gt;
&lt;p&gt;As new scenarios arise -- and they do constantly -- we think through how to approach them and fold
the results back into our documentation and tools. Sometimes it takes a few attempts before we
settle on an approach that works well.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;FM: Exactly. Each new interaction improves the guidelines.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;JL&lt;/strong&gt;: Right. And sometimes the first approach turns out to be wrong. It may take two or three
iterations before we land on something robust.&lt;/p&gt;
&lt;h2 id=&#34;the-impact-of-custom-resource-definitions&#34;&gt;The impact of Custom Resource Definitions&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;FM: Is there any particular change, episode, or domain that stands out as especially noteworthy,
complex, or interesting in your experience?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;JL&lt;/strong&gt;: The watershed moment was &lt;a href=&#34;https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/&#34;&gt;Custom Resources&lt;/a&gt;.
Prior to that, every API was handcrafted by us and fully reviewed. There were inconsistencies, but
we understood and controlled every type and field.&lt;/p&gt;
&lt;p&gt;When Custom Resources arrived, anyone could define anything. The first version did not even require
a schema. That made it extremely powerful -- it enabled change immediately -- but it left us playing
catch-up on stability and consistency.&lt;/p&gt;
&lt;p&gt;When Custom Resources graduated to General Availability (GA), schemas became required, but escape
hatches still existed for backward compatibility. Since then, we’ve been working on giving CRD
authors validation capabilities comparable to built-ins. Built-in validation rules for CRDs have
only just reached GA in the last few releases.&lt;/p&gt;
&lt;p&gt;So CRDs opened the &amp;quot;anything is possible&amp;quot; era. Built-in validation rules are the second major
milestone: bringing consistency back.&lt;/p&gt;
&lt;p&gt;The three major themes have been defining schemas, validating data, and handling pre-existing
invalid data. With ratcheting validation (allowing data to improve without breaking existing
objects), we can now guide CRD authors toward conventions without breaking the world.&lt;/p&gt;
&lt;h2 id=&#34;api-governance-in-context&#34;&gt;API Governance in context&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;FM: How does API Governance relate to SIG Architecture and API Machinery?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;JL&lt;/strong&gt;: &lt;a href=&#34;https://github.com/kubernetes/apimachinery&#34;&gt;API Machinery&lt;/a&gt; provides the actual code and
tools that people build APIs on. They don’t review APIs for storage, networking, scheduling, etc.&lt;/p&gt;
&lt;p&gt;SIG Architecture sets the overall system direction and works with API Machinery to ensure the system
supports that direction. API Governance works with other SIGs building on that foundation to define
conventions and patterns, ensuring consistent use of what API Machinery provides.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;FM: Thank you. That clarifies the flow. Going back to &lt;a href=&#34;https://kubernetes.io/releases/release/&#34;&gt;release cycles&lt;/a&gt;: do release phases -- enhancements freeze, code
freeze -- change your workload? Or is API Governance mostly continuous?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;JL&lt;/strong&gt;: We get involved in two places: design and implementation. Design involvement increases
before enhancements freeze; implementation involvement increases before code freeze. However, many
efforts span multiple releases, so there is always some design and implementation happening, even
for work targeting future releases. Between those intense periods, we often have time for
long-term design work.&lt;/p&gt;
&lt;p&gt;An anti-pattern we see is teams thinking about a large feature for months and then presenting it
three weeks before enhancements freeze, saying, &amp;quot;Here is the design, please review.&amp;quot; For big changes
with API impact, it’s much better to involve API Governance early.&lt;/p&gt;
&lt;p&gt;And there are good times in the cycle for this -- between freezes -- when people have bandwidth.
That’s when long-term review work fits best.&lt;/p&gt;
&lt;h2 id=&#34;getting-involved&#34;&gt;Getting involved&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;FM: Clearly. Now, regarding team dynamics and new contributors: how can someone get involved in
API Governance? What should they focus on?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;JL&lt;/strong&gt;: It’s usually best to follow a specific change rather than trying to learn everything at
once. Pick a small API change, perhaps one someone else is making or one you want to make, and
observe the full process: design, implementation, review.&lt;/p&gt;
&lt;p&gt;High-bandwidth review -- live discussion over video -- is often very effective. If you’re making or
following a change, ask whether there’s a time to go over the design or PR together. Observing those
discussions is extremely instructive.&lt;/p&gt;
&lt;p&gt;Start with a small change. Then move to a bigger one. Then maybe a new API. That builds
understanding of conventions as they are applied in practice.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;FM: Excellent. Any final comments, or anything we missed?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;JL&lt;/strong&gt;: Yes... the reason we care so much about compatibility and stability is for our users. It’s
easy for contributors to see those requirements as painful obstacles preventing cleanup or requiring
tedious work... but users integrated with our system, and we made a promise to them: we want them to
trust that we won’t break that contract. So even when it requires more work, moves slower, or
involves duplication, we choose stability.&lt;/p&gt;
&lt;p&gt;We are not trying to be obstructive; we are trying to make life good for users.&lt;/p&gt;
&lt;p&gt;A lot of our questions focus on the future: you want to do something now... how will you evolve it
later without breaking it? We assume we will know more in the future, and we want the design to
leave room for that.&lt;/p&gt;
&lt;p&gt;We also assume we will make mistakes. The question then is: how do we leave ourselves avenues to
improve while keeping compatibility promises?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;FM: Exactly. Jordan, thank you, I think we’ve covered everything. This has been an insightful view
into the API Governance project and its role in the wider Kubernetes project.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;JL&lt;/strong&gt;: Thank you.&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>Introducing Node Readiness Controller</title>
      <link>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2026/02/03/introducing-node-readiness-controller/</link>
      <pubDate>Tue, 03 Feb 2026 10:00:00 +0800</pubDate>
      
      <guid>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2026/02/03/introducing-node-readiness-controller/</guid>
      <description>
        
        
        &lt;img style=&#34;float: right; display: inline-block; margin-left: 2em; max-width: 15em;&#34; src=&#34;./node-readiness-controller-logo.svg&#34; alt=&#34;Logo for node readiness controller&#34; /&gt;
&lt;p&gt;In the standard Kubernetes model, a node’s suitability for workloads hinges on a single binary &amp;quot;Ready&amp;quot; condition. However, in modern Kubernetes environments, nodes require complex infrastructure dependencies—such as network agents, storage drivers, GPU firmware, or custom health checks—to be fully operational before they can reliably host pods.&lt;/p&gt;
&lt;p&gt;Today, on behalf of the Kubernetes project, I am announcing the &lt;a href=&#34;https://node-readiness-controller.sigs.k8s.io/&#34;&gt;Node Readiness Controller&lt;/a&gt;.
This project introduces a declarative system for managing node taints, extending the readiness guardrails during node bootstrapping beyond standard conditions.
By dynamically managing taints based on custom health signals, the controller ensures that workloads are only placed on nodes that meet all infrastructure-specific requirements.&lt;/p&gt;
&lt;h2 id=&#34;why-the-node-readiness-controller&#34;&gt;Why the Node Readiness Controller?&lt;/h2&gt;
&lt;p&gt;Core Kubernetes Node &amp;quot;Ready&amp;quot; status is often insufficient for clusters with sophisticated bootstrapping requirements. Operators frequently struggle to ensure that specific DaemonSets or local services are healthy before a node enters the scheduling pool.&lt;/p&gt;
&lt;p&gt;The Node Readiness Controller fills this gap by allowing operators to define custom scheduling gates tailored to specific node groups. This enables you to enforce
distinct readiness requirements across heterogeneous clusters, ensuring, for example, that GPU-equipped nodes only accept pods once specialized drivers are verified,
while general-purpose nodes follow a standard path.&lt;/p&gt;
&lt;p&gt;It provides three primary advantages:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Custom Readiness Definitions&lt;/strong&gt;: Define what &lt;em&gt;ready&lt;/em&gt; means for your specific platform.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Automated Taint Management&lt;/strong&gt;: The controller automatically applies or removes node taints based on condition status, preventing pods from landing on unready infrastructure.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Declarative Node Bootstrapping&lt;/strong&gt;: Manage multi-step node initialization reliably, with clear observability into the bootstrapping process.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;core-concepts-and-features&#34;&gt;Core concepts and features&lt;/h2&gt;
&lt;p&gt;The controller centers around the NodeReadinessRule (NRR) API, which allows you to define declarative &lt;em&gt;gates&lt;/em&gt; for your nodes.&lt;/p&gt;
&lt;h3 id=&#34;flexible-enforcement-modes&#34;&gt;Flexible enforcement modes&lt;/h3&gt;
&lt;p&gt;The controller supports two distinct operational modes:&lt;/p&gt;
&lt;dl&gt;
&lt;dt&gt;Continuous enforcement&lt;/dt&gt;
&lt;dd&gt;Actively maintains the readiness guarantee throughout the node’s entire lifecycle. If a critical dependency (like a device driver) fails later, the node is immediately tainted to prevent new scheduling.&lt;/dd&gt;
&lt;dt&gt;Bootstrap-only enforcement&lt;/dt&gt;
&lt;dd&gt;Specifically for one-time initialization steps, such as pre-pulling heavy images or hardware provisioning. Once conditions are met, the controller marks the bootstrap as complete and stops monitoring that specific rule for the node.&lt;/dd&gt;
&lt;/dl&gt;
&lt;h3 id=&#34;condition-reporting&#34;&gt;Condition reporting&lt;/h3&gt;
&lt;p&gt;The controller reacts to Node Conditions rather than performing health checks itself. This decoupled design allows it to integrate seamlessly with existing ecosystem tools as well as custom solutions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://github.com/kubernetes/node-problem-detector&#34;&gt;Node Problem Detector&lt;/a&gt; (NPD)&lt;/strong&gt;: Use existing NPD setups and custom scripts to report node health.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Readiness Condition Reporter&lt;/strong&gt;: A lightweight agent provided by the project that can be deployed to periodically check local HTTP endpoints and patch node conditions accordingly.&lt;/li&gt;
&lt;/ul&gt;
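As a rough sketch of the second approach, a reporter agent could publish a custom condition by patching the node's status. The condition type below matches the CNI example later in this post; the reason and message values are purely illustrative:

```yaml
status:
  conditions:
    - type: cniplugin.example.net/NetworkReady
      status: 'True'
      reason: AgentHealthy
      message: CNI agent responded on its local health endpoint
      lastHeartbeatTime: '2026-02-03T01:58:00Z'
      lastTransitionTime: '2026-02-03T01:58:00Z'
```

A NodeReadinessRule can then gate scheduling on that condition without knowing anything about how the health check itself is performed.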
&lt;h3 id=&#34;operational-safety-with-dry-run&#34;&gt;Operational safety with dry run&lt;/h3&gt;
&lt;p&gt;Deploying new readiness rules across a fleet carries inherent risk. To mitigate this, &lt;em&gt;dry run&lt;/em&gt; mode allows operators to first simulate impact on the cluster.
In this mode, the controller logs intended actions and updates the rule&#39;s status to show affected nodes without applying actual taints, enabling safe validation before enforcement.&lt;/p&gt;
&lt;h2 id=&#34;example-cni-bootstrapping&#34;&gt;Example: CNI bootstrapping&lt;/h2&gt;
&lt;p&gt;The following NodeReadinessRule ensures a node remains unschedulable until its CNI agent is functional. The controller monitors a custom &lt;code&gt;cniplugin.example.net/NetworkReady&lt;/code&gt; condition and only removes the &lt;code&gt;readiness.k8s.io/acme.com/network-unavailable&lt;/code&gt; taint once the status is True.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;readiness.node.x-k8s.io/v1alpha1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;NodeReadinessRule&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;metadata&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;network-readiness-rule&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;spec&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;conditions&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;type&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;cniplugin.example.net/NetworkReady&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;requiredStatus&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;True&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;taint&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;key&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;readiness.k8s.io/acme.com/network-unavailable&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;effect&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;NoSchedule&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;value&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;pending&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;enforcementMode&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;bootstrap-only&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;nodeSelector&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;matchLabels&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;node-role.kubernetes.io/worker&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;strong&gt;Demo&lt;/strong&gt;:&lt;/p&gt;


    
    &lt;div style=&#34;position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;&#34;&gt;
      &lt;iframe allow=&#34;accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share&#34; allowfullscreen=&#34;allowfullscreen&#34; loading=&#34;eager&#34; referrerpolicy=&#34;strict-origin-when-cross-origin&#34; src=&#34;https://www.youtube.com/embed/hohIIEXlNpo?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0&#34; style=&#34;position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;&#34; title=&#34;Node Readiness Controller Demo&#34;
      &gt;&lt;/iframe&gt;
    &lt;/div&gt;

&lt;h2 id=&#34;getting-involved&#34;&gt;Getting involved&lt;/h2&gt;
&lt;p&gt;The Node Readiness Controller is just getting started, with our &lt;a href=&#34;https://github.com/kubernetes-sigs/node-readiness-controller/releases/tag/v0.1.1&#34;&gt;initial releases&lt;/a&gt; out, and we are seeking community feedback to refine the roadmap. Following our productive Unconference discussions at KubeCon NA 2025, we are excited to continue the conversation in person.&lt;/p&gt;
&lt;p&gt;Join us at KubeCon + CloudNativeCon Europe 2026 for our maintainer track session: &lt;em&gt;&lt;a href=&#34;https://sched.co/2EF6E&#34;&gt;Addressing Non-Deterministic Scheduling: Introducing the Node Readiness Controller&lt;/a&gt;&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;In the meantime, you can contribute or track our progress here:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;GitHub: &lt;a href=&#34;https://sigs.k8s.io/node-readiness-controller&#34;&gt;https://sigs.k8s.io/node-readiness-controller&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Slack: Join the conversation in &lt;a href=&#34;https://kubernetes.slack.com/messages/sig-node-readiness-controller&#34;&gt;#sig-node-readiness-controller&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Documentation: &lt;a href=&#34;https://node-readiness-controller.sigs.k8s.io/&#34;&gt;Getting Started&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

      </description>
    </item>
    
    <item>
      <title>New Conversion from cgroup v1 CPU Shares to v2 CPU Weight</title>
      <link>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2026/01/30/new-cgroup-v1-to-v2-cpu-conversion-formula/</link>
      <pubDate>Fri, 30 Jan 2026 08:00:00 -0800</pubDate>
      
      <guid>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2026/01/30/new-cgroup-v1-to-v2-cpu-conversion-formula/</guid>
      <description>
        
        
        &lt;p&gt;I&#39;m excited to announce the implementation of an improved conversion formula
from cgroup v1 CPU shares to cgroup v2 CPU weight. This enhancement addresses
critical issues with CPU priority allocation for Kubernetes workloads when
running on systems with cgroup v2.&lt;/p&gt;
&lt;h2 id=&#34;background&#34;&gt;Background&lt;/h2&gt;
&lt;p&gt;Kubernetes was originally designed with cgroup v1 in mind, where CPU shares
were defined simply by assigning the container&#39;s CPU requests in millicpu
form.&lt;/p&gt;
&lt;p&gt;For example, a container requesting 1 CPU (1000m) would get &lt;code&gt;cpu.shares = 1024&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Over time, cgroup v1 has been replaced by its successor,
cgroup v2. In cgroup v2, the concept of CPU shares (which ranges over
[2, 262144], or 2¹ to 2¹⁸) was replaced with CPU weight (which ranges over
[1, 10000], or 10⁰ to 10⁴).&lt;/p&gt;
&lt;p&gt;With the transition to cgroup v2,
&lt;a href=&#34;https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2254-cgroup-v2&#34;&gt;KEP-2254&lt;/a&gt;
introduced a conversion formula to map cgroup v1 CPU shares to cgroup v2 CPU
weight. The conversion formula was defined as: &lt;code&gt;cpu.weight = (1 + ((cpu.shares - 2) * 9999) / 262142)&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;This formula linearly maps values from [2¹, 2¹⁸] to [10⁰, 10⁴].&lt;/p&gt;
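As an informal sketch (not the actual kubelet code), the linear formula and its effect on typical requests can be written out in a few lines of Python:

```python
def v1_shares_to_v2_weight(shares):
    # KEP-2254 linear mapping: [2, 262144] maps onto [1, 10000].
    return 1 + ((shares - 2) * 9999) // 262142

print(v1_shares_to_v2_weight(2))       # 1 (minimum)
print(v1_shares_to_v2_weight(262144))  # 10000 (maximum)
print(v1_shares_to_v2_weight(1024))    # 39, for a 1 CPU (1000m) request
print(v1_shares_to_v2_weight(102))     # 4, for a 100m request
```

The endpoints map cleanly, but as the last two lines show, common request sizes land far below the cgroup v2 default weight of 100.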
&lt;p&gt;&lt;img alt=&#34;Linear conversion formula&#34; src=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2026/01/30/new-cgroup-v1-to-v2-cpu-conversion-formula/linear-conversion.png&#34; title=&#34;formula&#34;&gt;&lt;/p&gt;
&lt;p&gt;While this approach is simple, the linear mapping introduces a few significant
problems, impacting both performance and configuration granularity.&lt;/p&gt;
&lt;h2 id=&#34;problems-with-previous-conversion-formula&#34;&gt;Problems with previous conversion formula&lt;/h2&gt;
&lt;p&gt;The previous conversion formula creates two major issues:&lt;/p&gt;
&lt;h3 id=&#34;1-reduced-priority-against-non-kubernetes-workloads&#34;&gt;1. Reduced priority against non-Kubernetes workloads&lt;/h3&gt;
&lt;p&gt;In cgroup v1, the default value for CPU shares is &lt;code&gt;1024&lt;/code&gt;, meaning a container
requesting 1 CPU has equal priority with system processes that live outside
of Kubernetes&#39; scope.
However, in cgroup v2, the default CPU weight is &lt;code&gt;100&lt;/code&gt;, but the previous
formula converts 1 CPU (1000m) to a weight of only &lt;code&gt;≈39&lt;/code&gt; - less than 40% of the
default.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Container requesting 1 CPU (1000m)&lt;/li&gt;
&lt;li&gt;cgroup v1: &lt;code&gt;cpu.shares = 1024&lt;/code&gt; (equal to default)&lt;/li&gt;
&lt;li&gt;cgroup v2 (previous formula): &lt;code&gt;cpu.weight = 39&lt;/code&gt; (much lower than the default 100)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This means that after moving to cgroup v2, Kubernetes (or OCI) workloads would
effectively lose CPU priority relative to non-Kubernetes processes. The
problem can be severe for setups with many system daemons that run
outside of Kubernetes&#39; scope and expect Kubernetes workloads to have
priority, especially under resource starvation.&lt;/p&gt;
&lt;h3 id=&#34;2-unmanageable-granularity&#34;&gt;2. Unmanageable granularity&lt;/h3&gt;
&lt;p&gt;The linear formula produces very low values for small CPU requests,
limiting the ability to create sub-cgroups within containers for
fine-grained resource distribution (which will possibly be much easier moving
forward, see &lt;a href=&#34;https://github.com/kubernetes/enhancements/issues/5474&#34;&gt;KEP #5474&lt;/a&gt; for more info).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Container requesting 100m CPU&lt;/li&gt;
&lt;li&gt;cgroup v1: &lt;code&gt;cpu.shares = 102&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;cgroup v2 (linear formula): &lt;code&gt;cpu.weight = 4&lt;/code&gt; (too low for sub-cgroup
configuration)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;With cgroup v1, a 100m CPU request translated to 102 CPU shares, which was
enough to create sub-cgroups inside the main container and assign
fine-grained CPU priorities to different groups of processes. With cgroup v2,
however, a weight of 4 is very hard to distribute between sub-cgroups because
it is not granular enough.&lt;/p&gt;
&lt;p&gt;With plans to allow &lt;a href=&#34;https://github.com/kubernetes/enhancements/issues/5474&#34;&gt;writable cgroups for unprivileged containers&lt;/a&gt;,
this becomes even
more relevant.&lt;/p&gt;
&lt;h2 id=&#34;new-conversion-formula&#34;&gt;New conversion formula&lt;/h2&gt;
&lt;h3 id=&#34;description&#34;&gt;Description&lt;/h3&gt;
&lt;p&gt;The new formula is more complex, but it does a much better job of mapping
between cgroup v1 CPU shares and cgroup v2 CPU weight:&lt;/p&gt;

&lt;div class=&#34;math&#34;&gt;$$cpu.weight = \lceil 10^{(L^{2}/612 + 125L/612 - 7/34)} \rceil, \text{ where: } L = \log_2(cpu.shares)$$&lt;/div&gt;&lt;p&gt;The exponent is a quadratic function of &lt;em&gt;L&lt;/em&gt;, chosen so that the mapping passes through the following points:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;(2, 1): The minimum values for both ranges.&lt;/li&gt;
&lt;li&gt;(1024, 100): The default values for both ranges.&lt;/li&gt;
&lt;li&gt;(262144, 10000): The maximum values for both ranges.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Visually, the new function looks as follows:&lt;/p&gt;
&lt;p&gt;&lt;img alt=&#34;Plot of the new conversion formula across the full input range&#34; src=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2026/01/30/new-cgroup-v1-to-v2-cpu-conversion-formula/new-conversion-formula.png&#34;&gt;&lt;/p&gt;
&lt;p&gt;And if you zoom in to the important part:&lt;/p&gt;
&lt;p&gt;&lt;img alt=&#34;Zoomed-in plot of the new conversion formula&#34; src=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2026/01/30/new-cgroup-v1-to-v2-cpu-conversion-formula/new-conversion-formula-zoom.png&#34;&gt;&lt;/p&gt;
&lt;p&gt;The new formula is &amp;quot;close to linear&amp;quot;, yet it is carefully designed so
that the curve passes through the three important points above.&lt;/p&gt;
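&lt;p&gt;The quadratic mapping can be sketched in Go as follows (the function name &lt;code&gt;newCPUWeight&lt;/code&gt; is illustrative; consult the runc or crun sources for the exact implementation, which may differ in rounding details):&lt;/p&gt;

```go
package main

import (
	"fmt"
	"math"
)

// newCPUWeight sketches the quadratic cgroup v1 -> v2 conversion:
// cpu.weight = ceil(10^(L²/612 + 125L/612 - 7/34)), where L = log2(cpu.shares).
// Illustrative only; not the actual runc/crun code.
func newCPUWeight(shares uint64) uint64 {
	l := math.Log2(float64(shares))
	exp := (l*l+125*l)/612 - 7.0/34.0
	return uint64(math.Ceil(math.Pow(10, exp)))
}

func main() {
	fmt.Println(newCPUWeight(102))  // 100m CPU request -> 17
	fmt.Println(newCPUWeight(1024)) // 1 CPU request -> close to the default of 100
}
```

&lt;p&gt;Note how a 100m request now maps to a weight of 17 instead of 4, and a 1 CPU request lands near cgroup v2&#39;s default weight of 100.&lt;/p&gt;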
&lt;h3 id=&#34;how-it-solves-the-problems&#34;&gt;How it solves the problems&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Better priority alignment:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A container requesting 1 CPU (1024m) now gets &lt;code&gt;cpu.weight = 102&lt;/code&gt;, close
to cgroup v2&#39;s default of 100.
This restores the intended priority relationship between Kubernetes
workloads and system processes.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Improved granularity:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A container requesting 100m CPU now gets &lt;code&gt;cpu.weight = 17&lt;/code&gt; (see
&lt;a href=&#34;https://go.dev/play/p/sLlAfCg54Eg&#34;&gt;this playground example&lt;/a&gt;),
enabling finer-grained resource distribution within containers.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id=&#34;adoption-and-integration&#34;&gt;Adoption and integration&lt;/h2&gt;
&lt;p&gt;This change was implemented at the OCI runtime layer, not in Kubernetes
itself; adoption of the new conversion formula therefore depends solely on
the OCI runtime version in use.&lt;/p&gt;
&lt;p&gt;For example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;runc: The new formula is enabled from version &lt;a href=&#34;https://github.com/opencontainers/runc/releases/tag/v1.3.2&#34;&gt;1.3.2&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;crun: The new formula is enabled from version &lt;a href=&#34;https://github.com/containers/crun/releases/tag/1.23&#34;&gt;1.23&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;impact-on-existing-deployments&#34;&gt;Impact on existing deployments&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Important:&lt;/strong&gt; Consumers that depend on the older linear conversion formula may be affected.
Applications or monitoring tools that directly calculate expected CPU weight values from the
previous formula may need updates to account for the new quadratic conversion.
This is particularly relevant for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Custom resource management tools that predict CPU weight values.&lt;/li&gt;
&lt;li&gt;Monitoring systems that validate or expect specific weight values.&lt;/li&gt;
&lt;li&gt;Applications that programmatically set or verify CPU weight values.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The Kubernetes project recommends testing the new conversion formula in non-production
environments before upgrading OCI runtimes to ensure compatibility with existing tooling.&lt;/p&gt;
&lt;h2 id=&#34;where-can-i-learn-more&#34;&gt;Where can I learn more?&lt;/h2&gt;
&lt;p&gt;For those interested in this enhancement:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/kubernetes/kubernetes/issues/131216&#34;&gt;Kubernetes GitHub Issue #131216&lt;/a&gt; - Detailed technical
analysis and examples, including discussions and reasoning for choosing the
above formula.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2254-cgroup-v2&#34;&gt;KEP-2254: cgroup v2&lt;/a&gt; -
Original cgroup v2 implementation in Kubernetes.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/&#34;&gt;Kubernetes cgroup documentation&lt;/a&gt; -
Current resource management guidance.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;how-do-i-get-involved&#34;&gt;How do I get involved?&lt;/h2&gt;
&lt;p&gt;For those interested in getting involved with Kubernetes node-level
features, join the &lt;a href=&#34;https://github.com/kubernetes/community/tree/master/sig-node&#34;&gt;Kubernetes Node Special Interest Group&lt;/a&gt;.
We always welcome new contributors and diverse perspectives on resource management
challenges.&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>Ingress NGINX: Statement from the Kubernetes Steering and Security Response Committees</title>
      <link>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2026/01/29/ingress-nginx-statement/</link>
      <pubDate>Thu, 29 Jan 2026 00:00:00 +0000</pubDate>
      
      <guid>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2026/01/29/ingress-nginx-statement/</guid>
      <description>
        
        
        &lt;p&gt;&lt;strong&gt;In March 2026, Kubernetes will retire Ingress NGINX, a piece of critical infrastructure for about half of cloud native environments.&lt;/strong&gt; The retirement of Ingress NGINX was &lt;a href=&#34;https://kubernetes.io/blog/2025/11/11/ingress-nginx-retirement/&#34;&gt;announced&lt;/a&gt; for March 2026, after years of &lt;a href=&#34;https://groups.google.com/a/kubernetes.io/g/dev/c/rxtrKvT_Q8E/m/6_ej0c1ZBAAJ&#34;&gt;public warnings&lt;/a&gt; that the project was in dire need of contributors and maintainers. There will be no more releases for bug fixes, security patches, or any updates of any kind after the project is retired. This cannot be ignored, brushed off, or left until the last minute to address. We cannot overstate the severity of this situation or the importance of beginning migration to alternatives like &lt;a href=&#34;https://gateway-api.sigs.k8s.io/guides/getting-started/&#34;&gt;Gateway API&lt;/a&gt; or one of the many &lt;a href=&#34;https://kubernetes.io/docs/concepts/services-networking/ingress-controllers/&#34;&gt;third-party Ingress controllers&lt;/a&gt; immediately.&lt;/p&gt;
&lt;p&gt;To be abundantly clear: choosing to remain with Ingress NGINX after its retirement leaves you and your users vulnerable to attack. None of the available alternatives are direct drop-in replacements. This will require planning and engineering time. Half of you will be affected. You have two months left to prepare.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Existing deployments will continue to work, so unless you proactively check, you may not know you are affected until you are compromised.&lt;/strong&gt; In most cases, you can check to find out whether or not you rely on Ingress NGINX by running &lt;code&gt;kubectl get pods --all-namespaces --selector app.kubernetes.io/name=ingress-nginx&lt;/code&gt; with cluster administrator permissions.&lt;/p&gt;
&lt;p&gt;Despite its broad appeal and widespread use by companies of all sizes, and repeated calls for help from the maintainers, the Ingress NGINX project never received the contributors it so desperately needed. According to internal Datadog research, about 50% of cloud native environments currently rely on this tool, and yet for the last several years, it has been maintained solely by one or two people working in their free time. Without sufficient staffing to maintain the tool to a standard both ourselves and our users would consider secure, the responsible choice is to wind it down and refocus efforts on modern alternatives like &lt;a href=&#34;https://gateway-api.sigs.k8s.io/guides/getting-started/&#34;&gt;Gateway API&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;We did not make this decision lightly; as inconvenient as it is now, doing so is necessary for the safety of all users and the ecosystem as a whole. Unfortunately, the flexibility Ingress NGINX was designed with, that was once a boon, has become a burden that cannot be resolved. With the technical debt that has piled up, and fundamental design decisions that exacerbate security flaws, it is no longer reasonable or even possible to continue maintaining the tool even if resources did materialize.&lt;/p&gt;
&lt;p&gt;We issue this statement together to reinforce the scale of this change and the potential for serious risk to a significant percentage of Kubernetes users if this issue is ignored. It is imperative that you check your clusters now. If you are reliant on Ingress NGINX, you must begin planning for migration.&lt;/p&gt;
&lt;p&gt;Thank you,&lt;/p&gt;
&lt;p&gt;Kubernetes Steering Committee&lt;/p&gt;
&lt;p&gt;Kubernetes Security Response Committee&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>Experimenting with Gateway API using kind</title>
      <link>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2026/01/28/experimenting-gateway-api-with-kind/</link>
      <pubDate>Wed, 28 Jan 2026 00:00:00 +0000</pubDate>
      
      <guid>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2026/01/28/experimenting-gateway-api-with-kind/</guid>
      <description>
        
        
        &lt;p&gt;This document will guide you through setting up a local experimental environment with &lt;a href=&#34;https://gateway-api.sigs.k8s.io/&#34;&gt;Gateway API&lt;/a&gt; on &lt;a href=&#34;https://kind.sigs.k8s.io/&#34;&gt;kind&lt;/a&gt;. This setup is designed for learning and testing. It helps you understand Gateway API concepts without production complexity.&lt;/p&gt;
&lt;div class=&#34;alert alert-caution&#34; role=&#34;alert&#34;&gt;&lt;h4 class=&#34;alert-heading&#34;&gt;Caution:&lt;/h4&gt;This is an experimental learning setup and should not be used in production. The components used in this document are not suited for production usage.
Once you&#39;re ready to deploy Gateway API in a production environment,
select an &lt;a href=&#34;https://gateway-api.sigs.k8s.io/implementations/&#34;&gt;implementation&lt;/a&gt; that suits your needs.&lt;/div&gt;

&lt;h2 id=&#34;overview&#34;&gt;Overview&lt;/h2&gt;
&lt;p&gt;In this guide, you will:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Set up a local Kubernetes cluster using kind (Kubernetes in Docker)&lt;/li&gt;
&lt;li&gt;Deploy &lt;a href=&#34;https://github.com/kubernetes-sigs/cloud-provider-kind&#34;&gt;cloud-provider-kind&lt;/a&gt;, which provides both LoadBalancer Services and a Gateway API controller&lt;/li&gt;
&lt;li&gt;Create a Gateway and HTTPRoute to route traffic to a demo application&lt;/li&gt;
&lt;li&gt;Test your Gateway API configuration locally&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This setup is ideal for learning, development, and experimentation with Gateway API concepts.&lt;/p&gt;
&lt;h2 id=&#34;prerequisites&#34;&gt;Prerequisites&lt;/h2&gt;
&lt;p&gt;Before you begin, ensure you have the following installed on your local machine:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://docs.docker.com/get-docker/&#34;&gt;Docker&lt;/a&gt;&lt;/strong&gt; - Required to run kind and cloud-provider-kind&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://kubernetes.io/docs/tasks/tools/&#34;&gt;kubectl&lt;/a&gt;&lt;/strong&gt; - The Kubernetes command-line tool&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://kind.sigs.k8s.io/docs/user/quick-start/#installation&#34;&gt;kind&lt;/a&gt;&lt;/strong&gt; - Kubernetes in Docker&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://curl.se/&#34;&gt;curl&lt;/a&gt;&lt;/strong&gt; - Required to test the routes&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;create-a-kind-cluster&#34;&gt;Create a kind cluster&lt;/h3&gt;
&lt;p&gt;Create a new kind cluster by running:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-shell&#34; data-lang=&#34;shell&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;kind create cluster
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This will create a single-node Kubernetes cluster running in a Docker container.&lt;/p&gt;
&lt;h3 id=&#34;install-cloud-provider-kind&#34;&gt;Install cloud-provider-kind&lt;/h3&gt;
&lt;p&gt;Next, you need &lt;a href=&#34;https://github.com/kubernetes-sigs/cloud-provider-kind/&#34;&gt;cloud-provider-kind&lt;/a&gt;, which provides two key components for this setup:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A LoadBalancer controller that assigns addresses to LoadBalancer-type Services&lt;/li&gt;
&lt;li&gt;A Gateway API controller that implements the Gateway API specification&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;It also automatically installs the Gateway API Custom Resource Definitions (CRDs) in your cluster.&lt;/p&gt;
&lt;p&gt;Run cloud-provider-kind as a Docker container on the same host where you created the kind cluster:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-shell&#34; data-lang=&#34;shell&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b8860b&#34;&gt;VERSION&lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;$(&lt;/span&gt;basename &lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;$(&lt;/span&gt;curl -s -L -o /dev/null -w &lt;span style=&#34;color:#b44&#34;&gt;&amp;#39;%{url_effective}&amp;#39;&lt;/span&gt; https://github.com/kubernetes-sigs/cloud-provider-kind/releases/latest&lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;))&lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;docker run -d --name cloud-provider-kind --rm --network host -v /var/run/docker.sock:/var/run/docker.sock registry.k8s.io/cloud-provider-kind/cloud-controller-manager:&lt;span style=&#34;color:#b68;font-weight:bold&#34;&gt;${&lt;/span&gt;&lt;span style=&#34;color:#b8860b&#34;&gt;VERSION&lt;/span&gt;&lt;span style=&#34;color:#b68;font-weight:bold&#34;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; On some systems, you may need elevated privileges to access the Docker socket.&lt;/p&gt;
&lt;p&gt;Verify that cloud-provider-kind is running:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-shell&#34; data-lang=&#34;shell&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;docker ps --filter &lt;span style=&#34;color:#b8860b&#34;&gt;name&lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;=&lt;/span&gt;cloud-provider-kind
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;You should see the container listed and in a running state. You can also check the logs:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-shell&#34; data-lang=&#34;shell&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;docker logs cloud-provider-kind
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id=&#34;experimenting-with-gateway-api&#34;&gt;Experimenting with Gateway API&lt;/h2&gt;
&lt;p&gt;Now that your cluster is set up, you can start experimenting with Gateway API resources.&lt;/p&gt;
&lt;p&gt;cloud-provider-kind automatically provisions a GatewayClass called &lt;code&gt;cloud-provider-kind&lt;/code&gt;. You&#39;ll use this class to create your Gateway.&lt;/p&gt;
&lt;p&gt;It is worth noting that while kind is not a cloud provider, the project is named &lt;code&gt;cloud-provider-kind&lt;/code&gt; because it provides features that simulate a cloud-enabled environment.&lt;/p&gt;
&lt;h3 id=&#34;deploy-a-gateway&#34;&gt;Deploy a Gateway&lt;/h3&gt;
&lt;p&gt;The following manifest will:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Create a new namespace called &lt;code&gt;gateway-infra&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Deploy a Gateway that listens on port 80&lt;/li&gt;
&lt;li&gt;Accept HTTPRoutes with hostnames matching the &lt;code&gt;*.exampledomain.example&lt;/code&gt; pattern&lt;/li&gt;
&lt;li&gt;Allow routes from any namespace to attach to the Gateway.
&lt;strong&gt;Note&lt;/strong&gt;: In real clusters, prefer Same or Selector values on the &lt;a href=&#34;https://gateway-api.sigs.k8s.io/reference/spec/#fromnamespaces&#34;&gt;&lt;code&gt;allowedRoutes&lt;/code&gt; namespace selector&lt;/a&gt; field to limit attachments.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Apply the following manifest:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#00f;font-weight:bold&#34;&gt;---&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;v1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Namespace&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;metadata&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;gateway-infra&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#00f;font-weight:bold&#34;&gt;---&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;gateway.networking.k8s.io/v1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Gateway&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;metadata&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;gateway&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;namespace&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;gateway-infra&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;spec&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;gatewayClassName&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;cloud-provider-kind&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;listeners&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;default&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;hostname&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;*.exampledomain.example&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;port&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;80&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;protocol&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;HTTP&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;allowedRoutes&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;namespaces&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;from&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;All&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Then verify that your Gateway is properly programmed and has an address assigned:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-shell&#34; data-lang=&#34;shell&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;kubectl get gateway -n gateway-infra gateway
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Expected output:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;NAME      CLASS                 ADDRESS      PROGRAMMED   AGE
gateway   cloud-provider-kind   172.18.0.3   True         5m6s
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;The PROGRAMMED column should show True, and the ADDRESS field should contain an IP address.&lt;/p&gt;
&lt;h3 id=&#34;deploy-a-demo-application&#34;&gt;Deploy a demo application&lt;/h3&gt;
&lt;p&gt;Next, deploy a simple echo application that will help you test your Gateway configuration. This application:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Listens on port 3000&lt;/li&gt;
&lt;li&gt;Echoes back request details including path, headers, and environment variables&lt;/li&gt;
&lt;li&gt;Runs in a namespace called &lt;code&gt;demo&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Apply the following manifest:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;v1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Namespace&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;metadata&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;demo&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#00f;font-weight:bold&#34;&gt;---&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;v1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Service&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;metadata&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;labels&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;app.kubernetes.io/name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;echo&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;echo&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;namespace&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;demo&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;spec&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;ports&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;http&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;port&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;3000&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;protocol&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;TCP&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;targetPort&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;3000&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;selector&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;app.kubernetes.io/name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;echo&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;type&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;ClusterIP&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#00f;font-weight:bold&#34;&gt;---&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;apps/v1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Deployment&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;metadata&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;labels&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;app.kubernetes.io/name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;echo&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;echo&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;namespace&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;demo&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;spec&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;selector&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;matchLabels&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;app.kubernetes.io/name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;echo&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;template&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;metadata&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;labels&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;app.kubernetes.io/name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;echo&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;spec&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;containers&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;env&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;POD_NAME&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;          &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;valueFrom&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;            &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;fieldRef&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;              &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;v1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;              &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;fieldPath&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;metadata.name&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;NAMESPACE&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;          &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;valueFrom&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;            &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;fieldRef&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;              &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;v1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;              &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;fieldPath&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;metadata.namespace&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;image&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;registry.k8s.io/gateway-api/echo-basic:v20251204-v1.4.1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;echo-basic&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h3 id=&#34;create-an-httproute&#34;&gt;Create an HTTPRoute&lt;/h3&gt;
&lt;p&gt;Now create an HTTPRoute to route traffic from your Gateway to the echo application.
This HTTPRoute will:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Respond to requests for the hostname &lt;code&gt;some.exampledomain.example&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Route traffic to the echo application&lt;/li&gt;
&lt;li&gt;Attach to the Gateway in the &lt;code&gt;gateway-infra&lt;/code&gt; namespace&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Apply the following manifest:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;gateway.networking.k8s.io/v1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;HTTPRoute&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;metadata&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;echo&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;namespace&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;demo&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;spec&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;parentRefs&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;gateway&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;namespace&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;gateway-infra&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;hostnames&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;[&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;some.exampledomain.example&amp;#34;&lt;/span&gt;]&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;rules&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;matches&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;path&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;type&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;PathPrefix&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;value&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;/&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;backendRefs&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;echo&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;port&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;3000&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h3 id=&#34;test-your-route&#34;&gt;Test your route&lt;/h3&gt;
&lt;p&gt;The final step is to test your route using curl. You&#39;ll make a request to the Gateway&#39;s IP address with the hostname &lt;code&gt;some.exampledomain.example&lt;/code&gt;. The commands below assume a POSIX shell and may need to be adjusted for your environment:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-shell&#34; data-lang=&#34;shell&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b8860b&#34;&gt;GW_ADDR&lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;$(&lt;/span&gt;kubectl get gateway -n gateway-infra gateway -o &lt;span style=&#34;color:#b8860b&#34;&gt;jsonpath&lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#39;{.status.addresses[0].value}&amp;#39;&lt;/span&gt;&lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;curl --resolve some.exampledomain.example:80:&lt;span style=&#34;color:#b68;font-weight:bold&#34;&gt;${&lt;/span&gt;&lt;span style=&#34;color:#b8860b&#34;&gt;GW_ADDR&lt;/span&gt;&lt;span style=&#34;color:#b68;font-weight:bold&#34;&gt;}&lt;/span&gt; http://some.exampledomain.example
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;You should receive a JSON response similar to this:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-json&#34; data-lang=&#34;json&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;{
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt; &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;&amp;#34;path&amp;#34;&lt;/span&gt;: &lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;/&amp;#34;&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt; &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;&amp;#34;host&amp;#34;&lt;/span&gt;: &lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;some.exampledomain.example&amp;#34;&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt; &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;&amp;#34;method&amp;#34;&lt;/span&gt;: &lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;GET&amp;#34;&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt; &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;&amp;#34;proto&amp;#34;&lt;/span&gt;: &lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;HTTP/1.1&amp;#34;&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt; &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;&amp;#34;headers&amp;#34;&lt;/span&gt;: {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;&amp;#34;Accept&amp;#34;&lt;/span&gt;: [
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;   &lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;*/*&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  ],
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;&amp;#34;User-Agent&amp;#34;&lt;/span&gt;: [
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;   &lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;curl/8.15.0&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  ]
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt; },
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt; &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;&amp;#34;namespace&amp;#34;&lt;/span&gt;: &lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;demo&amp;#34;&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt; &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;&amp;#34;ingress&amp;#34;&lt;/span&gt;: &lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;&amp;#34;&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt; &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;&amp;#34;service&amp;#34;&lt;/span&gt;: &lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;&amp;#34;&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt; &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;&amp;#34;pod&amp;#34;&lt;/span&gt;: &lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;echo-dc48d7cf8-vs2df&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;If you see this response, congratulations! Your Gateway API setup is working correctly.&lt;/p&gt;
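&lt;p&gt;As an optional extra check, you can confirm that hostname matching is enforced: a request for a hostname that no HTTPRoute claims should be rejected by the Gateway (typically with an HTTP 404). The hostname &lt;code&gt;other.exampledomain.example&lt;/code&gt; below is an arbitrary example:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-shell&#34; data-lang=&#34;shell&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;curl -i --resolve other.exampledomain.example:80:${GW_ADDR} http://other.exampledomain.example
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;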
&lt;h2 id=&#34;troubleshooting&#34;&gt;Troubleshooting&lt;/h2&gt;
&lt;p&gt;If something isn&#39;t working as expected, you can troubleshoot by checking the status of your resources.&lt;/p&gt;
&lt;h3 id=&#34;check-the-gateway-status&#34;&gt;Check the Gateway status&lt;/h3&gt;
&lt;p&gt;First, inspect your Gateway resource:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-shell&#34; data-lang=&#34;shell&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;kubectl get gateway -n gateway-infra gateway -o yaml
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Look at the &lt;code&gt;status&lt;/code&gt; section for conditions. Your Gateway should have:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Accepted: True&lt;/code&gt; - The Gateway was accepted by the controller&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Programmed: True&lt;/code&gt; - The Gateway was successfully configured&lt;/li&gt;
&lt;li&gt;&lt;code&gt;.status.addresses&lt;/code&gt; populated with an IP address&lt;/li&gt;
&lt;/ul&gt;
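&lt;p&gt;If you prefer a shorter check than reading the full YAML, the &lt;code&gt;Programmed&lt;/code&gt; condition can also be read directly with a &lt;code&gt;jsonpath&lt;/code&gt; query (shown here as one possible approach; it should print &lt;code&gt;True&lt;/code&gt; for a working Gateway):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-shell&#34; data-lang=&#34;shell&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;kubectl get gateway -n gateway-infra gateway -o jsonpath=&amp;#39;{.status.conditions[?(@.type==&amp;#34;Programmed&amp;#34;)].status}&amp;#39;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;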
&lt;h3 id=&#34;check-the-httproute-status&#34;&gt;Check the HTTPRoute status&lt;/h3&gt;
&lt;p&gt;Next, inspect your HTTPRoute:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-shell&#34; data-lang=&#34;shell&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;kubectl get httproute -n demo &lt;span style=&#34;color:#a2f&#34;&gt;echo&lt;/span&gt; -o yaml
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Check the &lt;code&gt;status.parents&lt;/code&gt; section for conditions. Common issues include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;ResolvedRefs&lt;/code&gt; set to &lt;code&gt;False&lt;/code&gt; with reason &lt;code&gt;BackendNotFound&lt;/code&gt;: the backend Service doesn&#39;t exist or is referenced by the wrong name&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Accepted&lt;/code&gt; set to &lt;code&gt;False&lt;/code&gt;: the route couldn&#39;t attach to the Gateway (check namespace permissions or hostname matching)&lt;/li&gt;
&lt;/ul&gt;
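&lt;p&gt;To view just the route&#39;s conditions without the rest of the YAML, one option is a &lt;code&gt;jsonpath&lt;/code&gt; query against the first parent entry:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-shell&#34; data-lang=&#34;shell&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;kubectl get httproute -n demo echo -o jsonpath=&amp;#39;{.status.parents[0].conditions}&amp;#39;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;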
&lt;p&gt;Example error when a backend is not found:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;status&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;parents&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;conditions&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;lastTransitionTime&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;2026-01-19T17:13:35Z&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;message&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;backend not found&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;observedGeneration&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;2&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;reason&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;BackendNotFound&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;status&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;False&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;type&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;ResolvedRefs&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;controllerName&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;kind.sigs.k8s.io/gateway-controller&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h3 id=&#34;check-controller-logs&#34;&gt;Check controller logs&lt;/h3&gt;
&lt;p&gt;If the resource statuses don&#39;t reveal the issue, check the cloud-provider-kind logs:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-shell&#34; data-lang=&#34;shell&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;docker logs -f cloud-provider-kind
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This will show detailed logs from both the LoadBalancer and Gateway API controllers.&lt;/p&gt;
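&lt;p&gt;Because these logs can be verbose, it may help to limit the output to recent entries while you reproduce the problem, for example:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-shell&#34; data-lang=&#34;shell&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;docker logs --since 5m cloud-provider-kind
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;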
&lt;h2 id=&#34;cleanup&#34;&gt;Cleanup&lt;/h2&gt;
&lt;p&gt;When you&#39;re finished with your experiments, you can clean up the resources:&lt;/p&gt;
&lt;h3 id=&#34;remove-kubernetes-resources&#34;&gt;Remove Kubernetes resources&lt;/h3&gt;
&lt;p&gt;Delete the namespaces (this will remove all resources within them):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-shell&#34; data-lang=&#34;shell&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;kubectl delete namespace gateway-infra
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;kubectl delete namespace demo
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h3 id=&#34;stop-cloud-provider-kind&#34;&gt;Stop cloud-provider-kind&lt;/h3&gt;
&lt;p&gt;Stop and remove the cloud-provider-kind container:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-shell&#34; data-lang=&#34;shell&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;docker stop cloud-provider-kind
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Because the container was started with the &lt;code&gt;--rm&lt;/code&gt; flag, it will be automatically removed when stopped.&lt;/p&gt;
&lt;h3 id=&#34;delete-the-kind-cluster&#34;&gt;Delete the kind cluster&lt;/h3&gt;
&lt;p&gt;Finally, delete the kind cluster:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-shell&#34; data-lang=&#34;shell&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;kind delete cluster
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id=&#34;next-steps&#34;&gt;Next steps&lt;/h2&gt;
&lt;p&gt;Now that you&#39;ve experimented with Gateway API locally, you&#39;re ready to explore production-ready implementations:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Production Deployments&lt;/strong&gt;: Review the &lt;a href=&#34;https://gateway-api.sigs.k8s.io/implementations/&#34;&gt;Gateway API implementations&lt;/a&gt; to find a controller that matches your production requirements&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Learn More&lt;/strong&gt;: Explore the &lt;a href=&#34;https://gateway-api.sigs.k8s.io/&#34;&gt;Gateway API documentation&lt;/a&gt; to learn about advanced features like TLS, traffic splitting, and header manipulation&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Advanced Routing&lt;/strong&gt;: Experiment with path-based routing, header matching, request mirroring and other features following &lt;a href=&#34;https://gateway-api.sigs.k8s.io/guides/getting-started/&#34;&gt;Gateway API user guides&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
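To give a flavor of the advanced routing mentioned above, here is a minimal sketch of an HTTPRoute that combines path-based routing with header matching. All names here (the Gateway, Service, and namespaces) are hypothetical placeholders, not resources created in this tutorial; consult the Gateway API guides for complete examples.

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: demo-route            # hypothetical name
  namespace: demo
spec:
  parentRefs:
  - name: demo-gateway        # hypothetical Gateway
    namespace: gateway-infra
  rules:
  - matches:
    # Match requests under /v2 that also carry the x-canary header
    - path:
        type: PathPrefix
        value: /v2
      headers:
      - name: x-canary
        value: "true"
    backendRefs:
    - name: demo-v2           # hypothetical Service
      port: 8080
```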
&lt;h3 id=&#34;a-final-word-of-caution&#34;&gt;A final word of caution&lt;/h3&gt;
&lt;p&gt;This &lt;em&gt;kind&lt;/em&gt; setup is for development and learning only.
Always use a production-grade Gateway API implementation for real workloads.&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>Cluster API v1.12: Introducing In-place Updates and Chained Upgrades</title>
      <link>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2026/01/27/cluster-api-v1-12-release/</link>
      <pubDate>Tue, 27 Jan 2026 08:00:00 -0800</pubDate>
      
      <guid>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2026/01/27/cluster-api-v1-12-release/</guid>
      <description>
        
        
        &lt;p&gt;&lt;a href=&#34;https://cluster-api.sigs.k8s.io/&#34;&gt;Cluster API&lt;/a&gt; brings declarative management to Kubernetes cluster lifecycle, allowing users and platform teams to define the desired state of clusters and rely on controllers to continuously reconcile toward it.&lt;/p&gt;
&lt;p&gt;Similar to how you can use StatefulSets or Deployments in Kubernetes to manage a group of Pods, in Cluster API you can use KubeadmControlPlane to manage a set of control plane Machines, or you can use MachineDeployments to manage a group of worker Nodes.&lt;/p&gt;
&lt;p&gt;The &lt;a href=&#34;https://github.com/kubernetes-sigs/cluster-api/releases/tag/v1.12.0&#34;&gt;Cluster API v1.12.0&lt;/a&gt; release expands what is possible in Cluster API, reducing friction in common lifecycle operations by introducing in-place updates and chained upgrades.&lt;/p&gt;
&lt;h2 id=&#34;emphasis-on-simplicity-and-usability&#34;&gt;Emphasis on simplicity and usability&lt;/h2&gt;
&lt;p&gt;With v1.12.0, the Cluster API project demonstrates once again that this community can deliver significant innovation while minimizing the impact on Cluster API users.&lt;/p&gt;
&lt;p&gt;What does this mean in practice?&lt;/p&gt;
&lt;p&gt;Users simply have to change the &lt;a href=&#34;https://cluster-api.sigs.k8s.io/user/concepts#cluster&#34;&gt;Cluster&lt;/a&gt; or the &lt;a href=&#34;https://cluster-api.sigs.k8s.io/user/concepts#machine&#34;&gt;Machine&lt;/a&gt; spec (just as with previous Cluster API releases), and Cluster API will automatically trigger in-place updates or chained upgrades when possible and advisable.&lt;/p&gt;
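For a cluster using managed topologies, for example, triggering an upgrade can be as small as bumping a single field. This is a minimal sketch with hypothetical names; verify the &lt;code&gt;apiVersion&lt;/code&gt; against the CRDs installed by your Cluster API release.

```yaml
apiVersion: cluster.x-k8s.io/v1beta1  # verify against your installed CRDs
kind: Cluster
metadata:
  name: my-cluster            # hypothetical name
  namespace: default
spec:
  topology:
    class: my-cluster-class   # hypothetical ClusterClass
    version: v1.35.0          # changing only this field triggers the upgrade
```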
&lt;h2 id=&#34;in-place-updates&#34;&gt;In-place Updates&lt;/h2&gt;
&lt;p&gt;Like Kubernetes does for Pods in Deployments, when the &lt;a href=&#34;https://cluster-api.sigs.k8s.io/user/concepts#machine&#34;&gt;Machine&lt;/a&gt; spec changes, Cluster API performs rollouts by creating a new Machine and deleting the old one.&lt;/p&gt;
&lt;p&gt;This approach, inspired by the principle of immutable infrastructure, has a set of considerable advantages:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;It is predictable, consistent, and easy to explain and reason about, for users and engineers alike.&lt;/li&gt;
&lt;li&gt;It is simple to implement, because it relies on only two core primitives: create and delete.&lt;/li&gt;
&lt;li&gt;The implementation does not depend on Machine-specific choices, such as the OS, the bootstrap mechanism, and so on.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;As a result, Machine rollouts drastically reduce the number of variables to be considered when managing the lifecycle of a host server that is hosting Nodes.&lt;/p&gt;
&lt;p&gt;However, while the advantages of immutability are not in question, both Kubernetes and Cluster API are on a similar journey, introducing changes that let users minimize workload disruption whenever possible.&lt;/p&gt;
&lt;p&gt;Over time, Cluster API has introduced several improvements to immutable rollouts, including:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Support for &lt;a href=&#34;https://github.com/kubernetes-sigs/cluster-api/blob/main/docs/proposals/20221003-In-place-propagation-of-Kubernetes-objects-only-changes.md&#34;&gt;in-place propagation of changes affecting Kubernetes resources only&lt;/a&gt;, thus avoiding unnecessary rollouts&lt;/li&gt;
&lt;li&gt;A way to &lt;a href=&#34;https://github.com/kubernetes-sigs/cluster-api/pull/10223&#34;&gt;Taint outdated nodes with PreferNoSchedule&lt;/a&gt;, thus reducing Pod churn by optimizing how Pods are rescheduled during rollouts.&lt;/li&gt;
&lt;li&gt;Support for the delete-first rollout strategy, thus making it easier to do immutable rollouts on bare metal or in environments with constrained resources.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The new &lt;a href=&#34;https://github.com/kubernetes-sigs/cluster-api/blob/main/docs/proposals/20240807-in-place-updates.md&#34;&gt;in-place update&lt;/a&gt; feature in Cluster API is the next step in this journey.&lt;/p&gt;
&lt;p&gt;With the v1.12.0 release, Cluster API introduces support for &lt;a href=&#34;https://cluster-api.sigs.k8s.io/tasks/experimental-features/runtime-sdk/implement-in-place-update-hooks&#34;&gt;update extensions&lt;/a&gt; allowing users to make changes on existing machines in-place, without deleting and re-creating the Machines.&lt;/p&gt;
&lt;p&gt;Both KubeadmControlPlane and MachineDeployments support in-place updates based on the new update extension, significantly expanding the boundary of what is possible in Cluster API.&lt;/p&gt;
&lt;p&gt;How do in-place updates work?&lt;/p&gt;
&lt;p&gt;The simplest way to explain it is that once the user triggers an update by changing the desired state of Machines, Cluster API chooses the best tool to achieve that state.&lt;/p&gt;
&lt;p&gt;What&#39;s new is that Cluster API can now choose between immutable rollouts and in-place update extensions to perform the required changes.&lt;/p&gt;


&lt;figure&gt;
    &lt;img src=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2026/01/27/cluster-api-v1-12-release/in-place.svg&#34;
         alt=&#34;In-place updates in Cluster API&#34;/&gt; 
&lt;/figure&gt;
&lt;p&gt;Importantly, this is not immutable rollouts vs in-place updates; Cluster API considers both valid options and selects the most appropriate mechanism for a given change.&lt;/p&gt;
&lt;p&gt;From the perspective of the Cluster API maintainers, in-place updates are most useful for making changes that don&#39;t otherwise require a node drain or pod restart; for example: changing user credentials for the Machine. On the other hand, when the workload will be disrupted anyway, just do a rollout.&lt;/p&gt;
&lt;p&gt;Nevertheless, Cluster API remains true to its extensible nature: everyone can create their own update extension and decide when and how to use in-place updates, trading off some of the benefits of immutable rollouts.&lt;/p&gt;
&lt;p&gt;For a deep dive into this feature, make sure to attend the session &lt;a href=&#34;https://kccnceu2026.sched.com/event/2CW1r/in-place-updates-with-cluster-api-the-sweet-spot-between-immutable-and-mutable-infrastructure-fabrizio-pandini-stefan-buringer-broadcom?iframe=yes&amp;w=100%&amp;sidebar=yes&amp;bg=no&#34;&gt;In-place Updates with Cluster API: The Sweet Spot Between Immutable and Mutable Infrastructure&lt;/a&gt; at KubeCon EU in Amsterdam!&lt;/p&gt;
&lt;h2 id=&#34;chained-upgrades&#34;&gt;Chained Upgrades&lt;/h2&gt;
&lt;p&gt;&lt;a href=&#34;https://cluster-api.sigs.k8s.io/tasks/experimental-features/cluster-class/&#34;&gt;ClusterClass&lt;/a&gt; and managed topologies in Cluster API jointly provide a powerful and effective framework that acts as a building block for many platforms offering Kubernetes-as-a-Service.&lt;/p&gt;
&lt;p&gt;Now, with v1.12.0, this feature takes another important step forward by allowing users to upgrade by more than one Kubernetes minor version in a single operation, commonly referred to as a &lt;a href=&#34;https://github.com/kubernetes-sigs/cluster-api/blob/main/docs/proposals/20250513-chained-and-efficient-upgrades-for-clusters-with-managed-topologies.md&#34;&gt;chained upgrade&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This allows users to declare a target Kubernetes version and let Cluster API safely orchestrate the required intermediate steps, rather than manually managing each minor upgrade.&lt;/p&gt;
&lt;p&gt;The simplest way to explain how chained upgrades work is that once the user triggers an upgrade by changing the desired version for a Cluster, Cluster API computes an upgrade plan and then starts executing it. Rather than (for example) updating the Cluster to v1.33.0, then to v1.34.0, and then to v1.35.0, checking on progress at each step, a chained upgrade lets you go directly to v1.35.0.&lt;/p&gt;
&lt;p&gt;Executing an upgrade plan means upgrading control plane and worker machines in a strictly controlled order, repeating this process as many times as needed to reach the desired state. Cluster API is now capable of managing this for you.&lt;/p&gt;
&lt;p&gt;Cluster API takes care of optimizing and minimizing the upgrade steps for worker machines, and in fact worker machines will skip upgrades to intermediate Kubernetes minor releases whenever allowed by the Kubernetes version skew policies.&lt;/p&gt;
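The worker-optimization idea can be illustrated with a small sketch. This is not Cluster API's actual planning code, which lives in the controllers; it only models the reasoning for a hypothetical v1.33 → v1.36 chained upgrade, assuming kubelets may lag the API server by up to three minor versions per the Kubernetes version skew policy.

```shell
#!/usr/bin/env bash
# Illustrative sketch only: the control plane moves one minor at a time,
# while workers skip intermediate minors as long as they stay within the
# allowed skew (assumed here to be three minors, per the skew policy).
set -euo pipefail

cp=33; workers=33; target=36; max_skew=3
plan=()
while [ "$cp" -lt "$target" ]; do
  cp=$((cp + 1))
  plan+=("control-plane->v1.$cp")
  # Move workers only when forced by the skew limit, or once the
  # control plane has reached the target version.
  if [ $((cp - workers)) -ge "$max_skew" ] || [ "$cp" -eq "$target" ]; then
    workers=$cp
    plan+=("workers->v1.$cp")
  fi
done
printf '%s\n' "${plan[@]}"
```

Under these assumptions the plan upgrades the control plane three times but the workers only once, directly to v1.36.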


&lt;figure&gt;
    &lt;img src=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2026/01/27/cluster-api-v1-12-release/chained.svg&#34;
         alt=&#34;Chained upgrades in Cluster API&#34;/&gt; 
&lt;/figure&gt;
&lt;p&gt;Also in this case extensibility is at the core of this feature, and &lt;a href=&#34;https://cluster-api.sigs.k8s.io/tasks/experimental-features/runtime-sdk/implement-upgrade-plan-hooks&#34;&gt;upgrade plan runtime extensions&lt;/a&gt; can be used to influence how the upgrade plan is computed; similarly, &lt;a href=&#34;https://cluster-api.sigs.k8s.io/tasks/experimental-features/runtime-sdk/implement-lifecycle-hooks&#34;&gt;lifecycle hooks&lt;/a&gt; can be used to automate other tasks that must be performed during an upgrade, e.g. upgrading an addon after the control plane update completed.&lt;/p&gt;
&lt;p&gt;From our perspective, chained upgrades are most useful for users who struggle to keep up with Kubernetes minor releases, for example because they want to upgrade only once per year, jumping three versions at a time (n-3 → n). But be warned: the fact that you can now easily upgrade by more than one minor version is no excuse to skip patching your clusters frequently!&lt;/p&gt;
&lt;h2 id=&#34;release-team&#34;&gt;Release team&lt;/h2&gt;
&lt;p&gt;I would like to thank all the contributors, the maintainers, and all the engineers that volunteered for the release team.&lt;/p&gt;
&lt;p&gt;The reliability and predictability of Cluster API releases, one of the qualities our users appreciate most, is only possible with the support, commitment, and hard work of its community.&lt;/p&gt;
&lt;p&gt;Kudos to the entire Cluster API community for the v1.12.0 release and all the great releases delivered in 2025!
If you are interested in getting involved, learn about the
&lt;a href=&#34;https://cluster-api.sigs.k8s.io/contributing&#34;&gt;Cluster API contributing guidelines&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;what-s-next&#34;&gt;What’s next?&lt;/h2&gt;
&lt;p&gt;If you read the &lt;a href=&#34;https://cluster-api.sigs.k8s.io/user/manifesto&#34;&gt;Cluster API manifesto&lt;/a&gt;, you can see how the Cluster API subproject claims the right to remain unfinished, recognizing the need to continuously evolve, improve, and adapt to the changing needs of Cluster API’s users and the broader Cloud Native ecosystem.&lt;/p&gt;
&lt;p&gt;As Kubernetes itself continues to evolve, the Cluster API subproject will keep advancing alongside it, focusing on safer upgrades, reduced disruption, and stronger building blocks for platforms managing Kubernetes at scale.&lt;/p&gt;
&lt;p&gt;Innovation remains at the heart of Cluster API; stay tuned for an exciting 2026!&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;Useful links:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://cluster-api.sigs.k8s.io/&#34;&gt;Cluster API&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/kubernetes-sigs/cluster-api/releases/tag/v1.12.0&#34;&gt;Cluster API v1.12.0 release&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/kubernetes-sigs/cluster-api/blob/main/docs/proposals/20240807-in-place-updates.md&#34;&gt;In-place update proposal&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/kubernetes-sigs/cluster-api/blob/main/docs/proposals/20250513-chained-and-efficient-upgrades-for-clusters-with-managed-topologies.md&#34;&gt;Chained upgrade proposal&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

      </description>
    </item>
    
    <item>
      <title>Headlamp in 2025: Project Highlights</title>
      <link>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2026/01/22/headlamp-in-2025-project-highlights/</link>
      <pubDate>Thu, 22 Jan 2026 10:00:00 +0800</pubDate>
      
      <guid>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2026/01/22/headlamp-in-2025-project-highlights/</guid>
      <description>
        
        
        &lt;p&gt;&lt;em&gt;This announcement is a recap from a post originally &lt;a href=&#34;https://headlamp.dev/blog/2025/11/13/headlamp-in-2025&#34;&gt;published&lt;/a&gt; on the Headlamp blog.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://headlamp.dev/&#34;&gt;Headlamp&lt;/a&gt; has come a long way in 2025. The project has continued to grow – reaching more teams across platforms, powering new workflows and integrations through plugins, and seeing increased collaboration from the broader community.&lt;/p&gt;
&lt;p&gt;We wanted to take a moment to share a few updates and highlight how Headlamp has evolved over the past year.&lt;/p&gt;
&lt;h2 id=&#34;updates&#34;&gt;Updates&lt;/h2&gt;
&lt;h3 id=&#34;joining-kubernetes-sig-ui&#34;&gt;Joining Kubernetes SIG UI&lt;/h3&gt;
&lt;p&gt;This year marked a big milestone for the project: Headlamp is now officially part of Kubernetes &lt;a href=&#34;https://github.com/kubernetes/community/blob/master/sig-ui/README.md&#34;&gt;SIG UI&lt;/a&gt;. This move brings roadmap and design discussions even closer to the core Kubernetes community and reinforces Headlamp’s role as a modern, extensible UI for the project.&lt;/p&gt;


    
    &lt;div style=&#34;position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;&#34;&gt;
      &lt;iframe allow=&#34;accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share&#34; allowfullscreen=&#34;allowfullscreen&#34; loading=&#34;eager&#34; referrerpolicy=&#34;strict-origin-when-cross-origin&#34; src=&#34;https://www.youtube.com/embed/Q5xkeoj6JiA?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0&#34; style=&#34;position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;&#34; title=&#34;YouTube video&#34;
      &gt;&lt;/iframe&gt;
    &lt;/div&gt;

&lt;p&gt;As part of that, we’ve also been sharing more about making Kubernetes approachable for a wider audience, including an &lt;a href=&#34;https://www.youtube.com/watch?v=VFOSyKVOPxs&#34;&gt;appearance on Enlightening with Whitney Lee&lt;/a&gt; and a &lt;a href=&#34;https://www.youtube.com/watch?v=Q7cbT2UBfE0&#34;&gt;talk at KCD New York 2025&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id=&#34;linux-foundation-mentorship&#34;&gt;Linux Foundation mentorship&lt;/h3&gt;
&lt;p&gt;This year, we were excited to work with several students through the Linux Foundation’s Mentorship program, and our mentees have already left a visible mark on Headlamp:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/adwait-godbole&#34;&gt;&lt;strong&gt;Adwait Godbole&lt;/strong&gt;&lt;/a&gt; built the KEDA plugin, adding a UI in Headlamp to view and manage KEDA resources like ScaledObjects and ScaledJobs.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/DhairyaMajmudar&#34;&gt;&lt;strong&gt;Dhairya Majmudar&lt;/strong&gt;&lt;/a&gt; set up an OpenTelemetry-based observability stack for Headlamp, wiring up metrics, logs, and traces so the project is easier to monitor and debug.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.linkedin.com/in/aishwarya-ghatole-506745231/&#34;&gt;&lt;strong&gt;Aishwarya Ghatole&lt;/strong&gt;&lt;/a&gt; led a UX audit of Headlamp plugins, identifying usability issues and proposing design improvements and personas for plugin users.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/SinghaAnirban005&#34;&gt;&lt;strong&gt;Anirban Singha&lt;/strong&gt;&lt;/a&gt; developed the Karpenter plugin, giving Headlamp a focused view into Karpenter autoscaling resources and decisions.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/useradityaa&#34;&gt;&lt;strong&gt;Aditya Chaudhary&lt;/strong&gt;&lt;/a&gt; improved Gateway API support, so you can see networking relationships on the resource map, and added support for many of the new Gateway API resources.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/Faakhir30&#34;&gt;&lt;strong&gt;Faakhir Zahid&lt;/strong&gt;&lt;/a&gt; completed a way to easily &lt;a href=&#34;https://headlamp.dev/docs/latest/installation/in-cluster/#plugin-management&#34;&gt;manage plugin installation&lt;/a&gt; with Headlamp deployed in clusters.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/upsaurav12&#34;&gt;&lt;strong&gt;Saurav Upadhyay&lt;/strong&gt;&lt;/a&gt; worked on backend caching for Kubernetes API calls, reducing load on the API server and improving performance in Headlamp.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;new-changes&#34;&gt;New changes&lt;/h2&gt;
&lt;h3 id=&#34;multi-cluster-view&#34;&gt;Multi-cluster view&lt;/h3&gt;
&lt;p&gt;Managing multiple clusters is challenging: teams often switch between tools and lose context when trying to see what runs where. Headlamp solves this by giving you a single view to compare clusters side-by-side. This makes it easier to understand workloads across environments and reduces the time spent hunting for resources.&lt;/p&gt;


&lt;figure&gt;
    &lt;img src=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2026/01/22/headlamp-in-2025-project-highlights/multi-cluster-view.png&#34;
         alt=&#34;Multi-cluster view&#34;/&gt; &lt;figcaption&gt;
            &lt;p&gt;View of multi-cluster workloads&lt;/p&gt;
        &lt;/figcaption&gt;
&lt;/figure&gt;
&lt;h3 id=&#34;projects&#34;&gt;Projects&lt;/h3&gt;
&lt;p&gt;Kubernetes apps often span multiple namespaces and resource types, which makes troubleshooting feel like piecing together a puzzle. We’ve added &lt;strong&gt;Projects&lt;/strong&gt; to give you an application-centric view that groups related resources across multiple namespaces – and even clusters. This allows you to reduce sprawl, troubleshoot faster, and collaborate without digging through YAML or cluster-wide lists.&lt;/p&gt;


&lt;figure&gt;
    &lt;img src=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2026/01/22/headlamp-in-2025-project-highlights/projects-feature.png&#34;
         alt=&#34;Projects feature&#34;/&gt; &lt;figcaption&gt;
            &lt;p&gt;View of the new Projects feature&lt;/p&gt;
        &lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;Changes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;New “Projects” feature for grouping namespaces into app- or team-centric projects&lt;/li&gt;
&lt;li&gt;Extensible Projects details view that plugins can customize with their own tabs and actions&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;navigation-and-activities&#34;&gt;Navigation and Activities&lt;/h3&gt;
&lt;p&gt;Day-to-day ops in Kubernetes often means juggling logs, terminals, YAML, and dashboards across clusters. We redesigned Headlamp’s navigation to treat these as first-class “activities” you can keep open and come back to, instead of one-off views you lose as soon as you click away.&lt;/p&gt;


&lt;figure&gt;
    &lt;img src=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2026/01/22/headlamp-in-2025-project-highlights/new-task-bar.png&#34;
         alt=&#34;New task bar&#34;/&gt; &lt;figcaption&gt;
            &lt;p&gt;View of the new task bar&lt;/p&gt;
        &lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;Changes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A new task bar/activities model lets you pin logs, exec sessions, and details as ongoing activities&lt;/li&gt;
&lt;li&gt;An activity overview with a “Close all” action and cluster information&lt;/li&gt;
&lt;li&gt;Multi-select and global filters in tables&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Thanks to &lt;a href=&#34;https://github.com/farodin91&#34;&gt;Jan Jansen&lt;/a&gt; and &lt;a href=&#34;https://github.com/useradityaa&#34;&gt;Aditya Chaudhary&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id=&#34;search-and-map&#34;&gt;Search and map&lt;/h3&gt;
&lt;p&gt;When something breaks in production, the first two questions are usually “where is it?” and “what is it connected to?” We’ve upgraded both search and the map view so you can get from a high-level symptom to the right set of objects much faster.&lt;/p&gt;


&lt;figure&gt;
    &lt;img src=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2026/01/22/headlamp-in-2025-project-highlights/advanced-search.png&#34;
         alt=&#34;Advanced search&#34;/&gt; &lt;figcaption&gt;
            &lt;p&gt;View of the new Advanced Search feature&lt;/p&gt;
        &lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;Changes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;An Advanced search view that supports rich, expression-based queries over Kubernetes objects&lt;/li&gt;
&lt;li&gt;Improved global search that understands labels and multiple search items, and can even update your current namespace based on what you find&lt;/li&gt;
&lt;li&gt;EndpointSlice support in the Network section&lt;/li&gt;
&lt;li&gt;A richer map view that now includes Custom Resources and Gateway API objects&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Thanks to &lt;a href=&#34;https://github.com/faebr&#34;&gt;Fabian&lt;/a&gt;, &lt;a href=&#34;https://github.com/alexandernorth&#34;&gt;Alexander North&lt;/a&gt;, and &lt;a href=&#34;https://github.com/victormarcolino&#34;&gt;Victor Marcolino&lt;/a&gt; from Swisscom, and also to &lt;a href=&#34;https://github.com/useradityaa&#34;&gt;Aditya Chaudhary&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id=&#34;oidc-and-authentication&#34;&gt;OIDC and authentication&lt;/h3&gt;
&lt;p&gt;We’ve put real work into making OIDC setup clearer and more resilient, especially for in-cluster deployments.&lt;/p&gt;


&lt;figure&gt;
    &lt;img src=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2026/01/22/headlamp-in-2025-project-highlights/user-info.png&#34;
         alt=&#34;User info&#34;/&gt; &lt;figcaption&gt;
            &lt;p&gt;View of user information for OIDC clusters&lt;/p&gt;
        &lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;Changes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;User information displayed in the top bar for OIDC-authenticated users&lt;/li&gt;
&lt;li&gt;PKCE support for more secure authentication flows, as well as hardened token refresh handling&lt;/li&gt;
&lt;li&gt;Documentation for authenticating with the access token via &lt;code&gt;-oidc-use-access-token=true&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Improved support for public OIDC clients like AKS and EKS&lt;/li&gt;
&lt;li&gt;New guide for setting up Headlamp &lt;a href=&#34;https://headlamp.dev/docs/latest/installation/in-cluster/aks-cluster-oauth/&#34;&gt;on AKS with Azure Entra-ID using OAuth2Proxy&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Thanks to &lt;a href=&#34;https://github.com/daviddob&#34;&gt;David Dobmeier&lt;/a&gt; and &lt;a href=&#34;https://github.com/HarshSrivastava275&#34;&gt;Harsh Srivastava&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id=&#34;app-catalog-and-helm&#34;&gt;App Catalog and Helm&lt;/h3&gt;
&lt;p&gt;We’ve broadened how you deploy and source apps via Headlamp, specifically supporting vanilla Helm repos.&lt;/p&gt;
&lt;p&gt;Changes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A more capable Helm chart with optional backend TLS termination, PodDisruptionBudgets, custom pod labels, and more&lt;/li&gt;
&lt;li&gt;Improved formatting and added missing access token arg in the Helm chart&lt;/li&gt;
&lt;li&gt;New in-cluster Helm support with an &lt;code&gt;--enable-helm&lt;/code&gt; flag and a service proxy&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Thanks to &lt;a href=&#34;https://github.com/shahvrushali22&#34;&gt;Vrushali Shah&lt;/a&gt; and &lt;a href=&#34;https://github.com/muraliinformal&#34;&gt;Murali Annamneni&lt;/a&gt; from Oracle, and also to &lt;a href=&#34;https://github.com/jcpunk&#34;&gt;Pat Riehecky&lt;/a&gt;, &lt;a href=&#34;https://github.com/jda258&#34;&gt;Joshua Akers&lt;/a&gt;, &lt;a href=&#34;https://github.com/rstribrn&#34;&gt;Rostislav Stříbrný&lt;/a&gt;, &lt;a href=&#34;https://github.com/rickliujh&#34;&gt;Rick L&lt;/a&gt;, and &lt;a href=&#34;https://github.com/vnea&#34;&gt;Victor&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id=&#34;performance-accessibility-and-ux&#34;&gt;Performance, accessibility, and UX&lt;/h3&gt;
&lt;p&gt;Finally, we’ve spent a lot of time on the things you notice every day but don’t always make headlines: startup time, list views, log viewers, accessibility, and small network UX details. A continuous accessibility self-audit has also helped us identify key issues and make Headlamp easier for everyone to use.&lt;/p&gt;


&lt;figure&gt;
    &lt;img src=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2026/01/22/headlamp-in-2025-project-highlights/learn-section.png&#34;
         alt=&#34;Learn section&#34;/&gt; &lt;figcaption&gt;
            &lt;p&gt;View of the Learn section in docs&lt;/p&gt;
        &lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;Changes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Significant desktop improvements, with up to 60% faster app loads and much quicker dev-mode reloads for contributors&lt;/li&gt;
&lt;li&gt;Numerous table and log viewer refinements: persistent sort order, consistent row actions, copy-name buttons, better tooltips, and more forgiving log inputs&lt;/li&gt;
&lt;li&gt;Accessibility and localization improvements, including fixes for zoom-related layout issues, better color contrast, improved screen reader support, and expanded language coverage&lt;/li&gt;
&lt;li&gt;More control over resources, with live pod CPU/memory metrics, richer pod details, and inline editing for secrets and CRD fields&lt;/li&gt;
&lt;li&gt;A refreshed documentation and plugin onboarding experience, including a “Learn” section and plugin showcase&lt;/li&gt;
&lt;li&gt;A more complete NetworkPolicy UI and network-related polish&lt;/li&gt;
&lt;li&gt;Nightly builds available for early testing&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Thanks to &lt;a href=&#34;https://github.com/jaehanbyun&#34;&gt;Jaehan Byun&lt;/a&gt; and &lt;a href=&#34;https://github.com/farodin91&#34;&gt;Jan Jansen&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;plugins-and-extensibility&#34;&gt;Plugins and extensibility&lt;/h2&gt;
&lt;p&gt;Discovering plugins is simpler now – no more hopping between Artifact Hub and assorted GitHub repos. Browse our dedicated &lt;a href=&#34;https://headlamp.dev/plugins&#34;&gt;Plugins page&lt;/a&gt; for a curated catalog of Headlamp-endorsed plugins, along with a showcase of featured plugins.&lt;/p&gt;


&lt;figure&gt;
    &lt;img src=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2026/01/22/headlamp-in-2025-project-highlights/plugins-page.png&#34;
         alt=&#34;Plugins page&#34;/&gt; &lt;figcaption&gt;
            &lt;p&gt;View of the Plugins showcase&lt;/p&gt;
        &lt;/figcaption&gt;
&lt;/figure&gt;
&lt;h3 id=&#34;headlamp-ai-assistant&#34;&gt;Headlamp AI Assistant&lt;/h3&gt;
&lt;p&gt;Managing Kubernetes often means memorizing commands and juggling tools. Headlamp’s new AI Assistant changes this by adding a natural-language interface built into the UI. Now, instead of typing &lt;code&gt;kubectl&lt;/code&gt; or digging through YAML, you can ask, “Is my app healthy?” or “Show logs for this deployment,” and get answers in context, speeding up troubleshooting and smoothing onboarding for new users. Learn more about it &lt;a href=&#34;https://headlamp.dev/blog/2025/08/07/introducing-the-headlamp-ai-assistant/&#34;&gt;here&lt;/a&gt;.&lt;/p&gt;


    
    &lt;div style=&#34;position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;&#34;&gt;
      &lt;iframe allow=&#34;accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share&#34; allowfullscreen=&#34;allowfullscreen&#34; loading=&#34;eager&#34; referrerpolicy=&#34;strict-origin-when-cross-origin&#34; src=&#34;https://www.youtube.com/embed/GzXkUuCTcd4?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0&#34; style=&#34;position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;&#34; title=&#34;YouTube video&#34;
      &gt;&lt;/iframe&gt;
    &lt;/div&gt;

&lt;h3 id=&#34;new-plugins-additions&#34;&gt;New plugin additions&lt;/h3&gt;
&lt;p&gt;Alongside the new AI Assistant, we’ve been growing Headlamp’s plugin ecosystem so you can bring more of your workflows into a single UI, with integrations like Minikube, Karpenter, and more.&lt;/p&gt;
&lt;p&gt;Highlights from the latest plugin releases:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Minikube plugin, providing a locally stored single-node Minikube cluster&lt;/li&gt;
&lt;li&gt;Karpenter plugin, with support for Azure Node Auto-Provisioning (NAP)&lt;/li&gt;
&lt;li&gt;KEDA plugin, which you can learn more about &lt;a href=&#34;https://headlamp.dev/blog/2025/07/25/enabling-event-driven-autoscaling-with-the-new-keda-plugin-for-headlamp/&#34;&gt;here&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Community-maintained plugins for &lt;a href=&#34;https://github.com/sozercan/gatekeeper-headlamp-plugin&#34;&gt;Gatekeeper&lt;/a&gt; and &lt;a href=&#34;https://github.com/kaito-project/headlamp-kaito&#34;&gt;KAITO&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Thanks to &lt;a href=&#34;https://github.com/shahvrushali22&#34;&gt;Vrushali Shah&lt;/a&gt; and &lt;a href=&#34;https://github.com/muraliinformal&#34;&gt;Murali Annamneni&lt;/a&gt; from Oracle, and also to &lt;a href=&#34;https://github.com/SinghaAnirban005&#34;&gt;Anirban Singha&lt;/a&gt;, &lt;a href=&#34;https://github.com/adwait-godbole&#34;&gt;Adwait Godbole&lt;/a&gt;, &lt;a href=&#34;https://github.com/sozercan&#34;&gt;Sertaç Özercan&lt;/a&gt;, &lt;a href=&#34;https://github.com/chewong&#34;&gt;Ernest Wong&lt;/a&gt;, and &lt;a href=&#34;https://github.com/chloe608&#34;&gt;Chloe Lim&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id=&#34;other-plugins-updates&#34;&gt;Other plugin updates&lt;/h3&gt;
&lt;p&gt;Alongside new additions, we’ve also spent time refining plugins that many of you already use, focusing on smoother workflows and better integration with the core UI.&lt;/p&gt;


&lt;figure&gt;
    &lt;img src=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2026/01/22/headlamp-in-2025-project-highlights/backstage-plugin.png&#34;
         alt=&#34;Backstage plugin&#34;/&gt; &lt;figcaption&gt;
            &lt;p&gt;View of the Backstage plugin&lt;/p&gt;
        &lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;Changes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Flux plugin&lt;/strong&gt;: Updated for Flux v2.7, with support for newer CRDs and navigation fixes so it works smoothly on recent clusters&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;App Catalog&lt;/strong&gt;: Now supports Helm repos in addition to Artifact Hub, can run in-cluster via /serviceproxy, and shows both current and latest app versions&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Plugin Catalog&lt;/strong&gt;: Improved card layout and accessibility, plus dependency and Storybook test updates&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Backstage plugin&lt;/strong&gt;: Dependency and build updates, more info &lt;a href=&#34;https://headlamp.dev/blog/2025/11/05/strengthening-backstage-and-headlamp-integration/&#34;&gt;here&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;plugin-development&#34;&gt;Plugin development&lt;/h3&gt;
&lt;p&gt;We’ve focused on making it faster and clearer to build, test, and ship Headlamp plugins, backed by improved documentation and lighter tooling.&lt;/p&gt;


&lt;figure&gt;
    &lt;img src=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2026/01/22/headlamp-in-2025-project-highlights/plugin-development.png&#34;
         alt=&#34;Plugin development&#34;/&gt; &lt;figcaption&gt;
            &lt;p&gt;View of the Plugin Development guide&lt;/p&gt;
        &lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;Changes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;New and expanded guides for &lt;a href=&#34;https://headlamp.dev/docs/latest/development/architecture#plugins&#34;&gt;plugin architecture&lt;/a&gt; and &lt;a href=&#34;https://headlamp.dev/docs/latest/development/plugins/getting-started&#34;&gt;development&lt;/a&gt;, including how to publish and ship plugins&lt;/li&gt;
&lt;li&gt;Added &lt;a href=&#34;https://headlamp.dev/docs/latest/development/plugins/i18n&#34;&gt;i18n support documentation&lt;/a&gt; so plugins can be translated and localized&lt;/li&gt;
&lt;li&gt;Added example plugins: &lt;a href=&#34;https://github.com/kubernetes-sigs/headlamp/tree/main/plugins/examples/ui-panels&#34;&gt;ui-panels&lt;/a&gt;, &lt;a href=&#34;https://github.com/kubernetes-sigs/headlamp/tree/main/plugins/examples/resource-charts&#34;&gt;resource-charts&lt;/a&gt;, &lt;a href=&#34;https://github.com/kubernetes-sigs/headlamp/tree/main/plugins/examples/custom-theme&#34;&gt;custom-theme&lt;/a&gt;, and &lt;a href=&#34;https://github.com/kubernetes-sigs/headlamp/tree/main/plugins/examples/projects&#34;&gt;projects&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Improved type checking for Headlamp APIs, restored Storybook support for component testing, and reduced dependencies for faster installs and fewer updates&lt;/li&gt;
&lt;li&gt;Documented plugin install locations, UI signifiers in Plugin Settings, and labels that differentiate shipped, UI-installed, and dev-mode plugins&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;security-upgrades&#34;&gt;Security upgrades&lt;/h2&gt;
&lt;p&gt;We&#39;ve also been investing in keeping Headlamp secure – both by tightening how authentication works and by staying on top of upstream vulnerabilities and tooling.&lt;/p&gt;
&lt;p&gt;Updates:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;We&#39;ve been keeping up with security updates, regularly updating dependencies and addressing upstream security issues.&lt;/li&gt;
&lt;li&gt;We tightened the Helm chart&#39;s default security context and fixed a regression that broke the plugin manager.&lt;/li&gt;
&lt;li&gt;We&#39;ve improved OIDC security with PKCE support, helping unblock more secure and standards-compliant OIDC setups when deploying Headlamp in-cluster.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;conclusion&#34;&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Thank you to everyone who has contributed to Headlamp this year – whether through pull requests, plugins, or simply sharing how you&#39;re using the project. Seeing the different ways teams are adopting and extending the project is a big part of what keeps us moving forward. If your organization uses Headlamp, consider adding it to our &lt;a href=&#34;https://github.com/kubernetes-sigs/headlamp/blob/main/ADOPTERS.md&#34;&gt;adopters list&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;If you haven&#39;t tried Headlamp recently, all these updates are available today. Check out the latest Headlamp release, explore the new views, plugins, and docs, and share your feedback with us on Slack or GitHub – your feedback helps shape where Headlamp goes next.&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>Announcing the Checkpoint/Restore Working Group</title>
      <link>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2026/01/21/introducing-checkpoint-restore-wg/</link>
      <pubDate>Wed, 21 Jan 2026 10:00:00 -0800</pubDate>
      
      <guid>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2026/01/21/introducing-checkpoint-restore-wg/</guid>
      <description>
        
        
        &lt;p&gt;The community around Kubernetes includes a number of Special Interest Groups (SIGs) and Working Groups (WGs) facilitating discussions on important topics between interested contributors. Today we would like to announce the new &lt;a href=&#34;https://github.com/kubernetes/community/tree/master/wg-checkpoint-restore&#34;&gt;Kubernetes Checkpoint Restore WG&lt;/a&gt; focusing on the integration of Checkpoint/Restore functionality into Kubernetes.&lt;/p&gt;
&lt;h2 id=&#34;motivation-and-use-cases&#34;&gt;Motivation and use cases&lt;/h2&gt;
&lt;p&gt;There are several high-level scenarios discussed in the working group:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Optimizing resource utilization for interactive workloads, such as Jupyter notebooks and AI chatbots&lt;/li&gt;
&lt;li&gt;Accelerating startup of applications with long initialization times, including Java applications and &lt;a href=&#34;https://doi.org/10.1145/3731599.3767354&#34;&gt;LLM inference services&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Using periodic checkpointing to enable fault-tolerance for long-running workloads, such as distributed model training&lt;/li&gt;
&lt;li&gt;Providing &lt;a href=&#34;https://doi.org/10.1007/978-3-032-10507-3_3&#34;&gt;interruption-aware scheduling&lt;/a&gt; with transparent checkpoint/restore, allowing lower-priority Pods to be preempted while preserving the runtime state of applications&lt;/li&gt;
&lt;li&gt;Facilitating Pod migration across nodes for load balancing and maintenance, without disrupting workloads&lt;/li&gt;
&lt;li&gt;Enabling forensic checkpointing to investigate and analyze security incidents such as cyberattacks, data breaches, and unauthorized access&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Across these scenarios, the goal is to help facilitate discussions of ideas between the Kubernetes community and the growing Checkpoint/Restore in Userspace (CRIU) ecosystem. The CRIU community includes several projects that support these use cases, including:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/checkpoint-restore/criu&#34;&gt;CRIU&lt;/a&gt; - A tool for checkpointing and restoring running applications and containers&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/checkpoint-restore/checkpointctl&#34;&gt;checkpointctl&lt;/a&gt; - A tool for in-depth analysis of container checkpoints&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/checkpoint-restore/criu-coordinator&#34;&gt;criu-coordinator&lt;/a&gt; - A tool for coordinated checkpoint/restore of distributed applications with CRIU&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/checkpoint-restore/checkpoint-restore-operator&#34;&gt;checkpoint-restore-operator&lt;/a&gt; - A Kubernetes operator for managing checkpoints&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;More information about the checkpoint/restore integration with Kubernetes is also available &lt;a href=&#34;https://criu.org/Kubernetes&#34;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;related-events&#34;&gt;Related events&lt;/h2&gt;
&lt;p&gt;Following our presentation about &lt;a href=&#34;https://sched.co/1tx7i&#34;&gt;transparent checkpointing&lt;/a&gt; at KubeCon EU 2025, we are excited to welcome you to our &lt;a href=&#34;https://sched.co/2CW6P&#34;&gt;panel discussion&lt;/a&gt; and &lt;a href=&#34;https://sched.co/2CW7Z&#34;&gt;AI + ML session&lt;/a&gt; at KubeCon + CloudNativeCon Europe 2026.&lt;/p&gt;
&lt;h2 id=&#34;connect-with-us&#34;&gt;Connect with us&lt;/h2&gt;
&lt;p&gt;If you are interested in contributing to Kubernetes or CRIU, there are several ways to participate:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Join our meeting every second Thursday at 17:00 UTC via the Zoom link in our &lt;a href=&#34;https://docs.google.com/document/d/1ZMtHBibXfTw4cQerM4O4DJonzVs3W7Hp2K5ml6pTufs/edit&#34;&gt;meeting notes&lt;/a&gt;; recordings of our prior meetings are available &lt;a href=&#34;https://www.youtube.com/playlist?list=PL69nYSiGNLP1P7F40IMVL3NsNiIm5AGos&#34;&gt;here&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Chat with us on the &lt;a href=&#34;http://slack.k8s.io/&#34;&gt;Kubernetes Slack&lt;/a&gt;: &lt;a href=&#34;https://kubernetes.slack.com/messages/wg-checkpoint-restore&#34;&gt;#wg-checkpoint-restore&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Email us at the &lt;a href=&#34;https://groups.google.com/a/kubernetes.io/g/wg-checkpoint-restore&#34;&gt;wg-checkpoint-restore mailing list&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

      </description>
    </item>
    
    <item>
      <title>Uniform API server access using clientcmd</title>
      <link>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2026/01/19/clientcmd-apiserver-access/</link>
      <pubDate>Mon, 19 Jan 2026 10:00:00 -0800</pubDate>
      
      <guid>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2026/01/19/clientcmd-apiserver-access/</guid>
      <description>
        
        
        &lt;p&gt;If you&#39;ve ever wanted to develop a command line client for a Kubernetes API,
especially if you&#39;ve considered making your client usable as a &lt;code&gt;kubectl&lt;/code&gt; plugin,
you might have wondered how to make your client feel familiar to users of &lt;code&gt;kubectl&lt;/code&gt;.
A quick glance at the output of &lt;code&gt;kubectl options&lt;/code&gt; might put a damper on that:
&amp;quot;Am I really supposed to implement all those options?&amp;quot;&lt;/p&gt;
&lt;p&gt;Fear not, others have done a lot of the work involved for you.
In fact, the Kubernetes project provides two libraries to help you handle
&lt;code&gt;kubectl&lt;/code&gt;-style command line arguments in Go programs:
&lt;a href=&#34;https://pkg.go.dev/k8s.io/client-go/tools/clientcmd&#34;&gt;&lt;code&gt;clientcmd&lt;/code&gt;&lt;/a&gt; and
&lt;a href=&#34;https://pkg.go.dev/k8s.io/cli-runtime&#34;&gt;&lt;code&gt;cli-runtime&lt;/code&gt;&lt;/a&gt;
(which uses &lt;code&gt;clientcmd&lt;/code&gt;).
This article will show how to use the former.&lt;/p&gt;
&lt;h2 id=&#34;general-philosophy&#34;&gt;General philosophy&lt;/h2&gt;
&lt;p&gt;As might be expected since it&#39;s part of &lt;code&gt;client-go&lt;/code&gt;,
&lt;code&gt;clientcmd&lt;/code&gt;&#39;s ultimate purpose is to provide an instance of
&lt;a href=&#34;https://pkg.go.dev/k8s.io/client-go/rest#Config&#34;&gt;&lt;code&gt;restclient.Config&lt;/code&gt;&lt;/a&gt;
that can issue requests to an API server.&lt;/p&gt;
&lt;p&gt;It follows &lt;code&gt;kubectl&lt;/code&gt; semantics:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;defaults are taken from &lt;code&gt;~/.kube/config&lt;/code&gt; or equivalent;&lt;/li&gt;
&lt;li&gt;files can be specified using the &lt;code&gt;KUBECONFIG&lt;/code&gt; environment variable;&lt;/li&gt;
&lt;li&gt;all of the above settings can be further overridden using command line arguments.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;It doesn&#39;t set up a &lt;code&gt;--kubeconfig&lt;/code&gt; command line argument,
which you might want to do to align with &lt;code&gt;kubectl&lt;/code&gt;;
you&#39;ll see how to do this
in the &lt;a href=&#34;#bind-the-flags&#34;&gt;&amp;quot;Bind the flags&amp;quot;&lt;/a&gt; section.&lt;/p&gt;
&lt;h2 id=&#34;available-features&#34;&gt;Available features&lt;/h2&gt;
&lt;p&gt;&lt;code&gt;clientcmd&lt;/code&gt; allows programs to handle&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;kubeconfig&lt;/code&gt; selection (using &lt;code&gt;KUBECONFIG&lt;/code&gt;);&lt;/li&gt;
&lt;li&gt;context selection;&lt;/li&gt;
&lt;li&gt;namespace selection;&lt;/li&gt;
&lt;li&gt;client certificates and private keys;&lt;/li&gt;
&lt;li&gt;user impersonation;&lt;/li&gt;
&lt;li&gt;HTTP Basic authentication (username/password).&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;configuration-merging&#34;&gt;Configuration merging&lt;/h2&gt;
&lt;p&gt;In various scenarios, &lt;code&gt;clientcmd&lt;/code&gt; supports &lt;em&gt;merging&lt;/em&gt; configuration settings:
&lt;code&gt;KUBECONFIG&lt;/code&gt; can specify multiple files whose contents are combined.
This can be confusing, because settings are merged in different directions
depending on how they are implemented.
If a setting is defined in a map, the first definition wins
and subsequent definitions are ignored.
If a setting is not defined in a map, the last definition wins.&lt;/p&gt;
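&lt;p&gt;The two directions can be illustrated with a small model. This is a deliberately simplified sketch of the rules just described, not &lt;code&gt;clientcmd&lt;/code&gt;&#39;s implementation; the hypothetical &lt;code&gt;fileConfig&lt;/code&gt; type stands in for a loaded &lt;code&gt;kubeconfig&lt;/code&gt; file.&lt;/p&gt;

```go
package main

import "fmt"

// fileConfig is a toy stand-in for one loaded kubeconfig file:
// Clusters is a map-style setting, CurrentContext a plain scalar.
type fileConfig struct {
	Clusters       map[string]string // e.g. cluster name -> server URL
	CurrentContext string
}

// merge applies the directions described above: map entries keep the
// first definition seen; scalar settings keep the last definition seen.
func merge(files []fileConfig) fileConfig {
	out := fileConfig{Clusters: map[string]string{}}
	for _, f := range files {
		for name, server := range f.Clusters {
			if _, exists := out.Clusters[name]; !exists { // first wins
				out.Clusters[name] = server
			}
		}
		if f.CurrentContext != "" { // last (non-empty) wins
			out.CurrentContext = f.CurrentContext
		}
	}
	return out
}

func main() {
	merged := merge([]fileConfig{
		{Clusters: map[string]string{"dev": "https://a"}, CurrentContext: "dev"},
		{Clusters: map[string]string{"dev": "https://b", "prod": "https://c"}, CurrentContext: "prod"},
	})
	// "dev" keeps the first server; CurrentContext keeps the last value.
	fmt.Println(merged.Clusters["dev"], merged.CurrentContext)
}
```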
&lt;p&gt;When settings are retrieved using &lt;code&gt;KUBECONFIG&lt;/code&gt;,
missing files result in warnings only.
If the user explicitly specifies a path (in &lt;code&gt;--kubeconfig&lt;/code&gt; style),
there must be a corresponding file.&lt;/p&gt;
&lt;p&gt;If &lt;code&gt;KUBECONFIG&lt;/code&gt; isn&#39;t defined,
the default configuration file, &lt;code&gt;~/.kube/config&lt;/code&gt;, is used instead,
if present.&lt;/p&gt;
&lt;h3 id=&#34;overall-process&#34;&gt;Overall process&lt;/h3&gt;
&lt;p&gt;The general usage pattern is succinctly expressed in
the &lt;a href=&#34;https://pkg.go.dev/k8s.io/client-go/tools/clientcmd&#34;&gt;&lt;code&gt;clientcmd&lt;/code&gt;&lt;/a&gt; package documentation:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-go&#34; data-lang=&#34;go&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;loadingRules &lt;span style=&#34;color:#666&#34;&gt;:=&lt;/span&gt; clientcmd.&lt;span style=&#34;color:#00a000&#34;&gt;NewDefaultClientConfigLoadingRules&lt;/span&gt;()
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;// if you want to change the loading rules (which files in which order), you can do so here
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;configOverrides &lt;span style=&#34;color:#666&#34;&gt;:=&lt;/span&gt; &lt;span style=&#34;color:#666&#34;&gt;&amp;amp;&lt;/span&gt;clientcmd.ConfigOverrides{}
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;// if you want to change override values or bind them to flags, there are methods to help you
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;kubeConfig &lt;span style=&#34;color:#666&#34;&gt;:=&lt;/span&gt; clientcmd.&lt;span style=&#34;color:#00a000&#34;&gt;NewNonInteractiveDeferredLoadingClientConfig&lt;/span&gt;(loadingRules, configOverrides)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;config, err &lt;span style=&#34;color:#666&#34;&gt;:=&lt;/span&gt; kubeConfig.&lt;span style=&#34;color:#00a000&#34;&gt;ClientConfig&lt;/span&gt;()
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;if&lt;/span&gt; err &lt;span style=&#34;color:#666&#34;&gt;!=&lt;/span&gt; &lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;nil&lt;/span&gt; {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;	&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;// Do something
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;&lt;/span&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;client, err &lt;span style=&#34;color:#666&#34;&gt;:=&lt;/span&gt; kubernetes.&lt;span style=&#34;color:#00a000&#34;&gt;NewForConfig&lt;/span&gt;(config)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;// ...
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;In the context of this article, there are six steps:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href=&#34;#configure-the-loading-rules&#34;&gt;Configure the loading rules&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#configure-the-overrides&#34;&gt;Configure the overrides&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#build-a-set-of-flags&#34;&gt;Build a set of flags&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#bind-the-flags&#34;&gt;Bind the flags&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#build-the-merged-configuration&#34;&gt;Build the merged configuration&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#obtain-an-api-client&#34;&gt;Obtain an API client&lt;/a&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;configure-the-loading-rules&#34;&gt;Configure the loading rules&lt;/h3&gt;
&lt;p&gt;&lt;code&gt;clientcmd.NewDefaultClientConfigLoadingRules()&lt;/code&gt; builds loading rules which will use either the contents of the &lt;code&gt;KUBECONFIG&lt;/code&gt; environment variable,
or the default configuration file name (&lt;code&gt;~/.kube/config&lt;/code&gt;).
In addition, if the default configuration file is used,
it is able to migrate settings from the (very) old default configuration file
(&lt;code&gt;~/.kube/.kubeconfig&lt;/code&gt;).&lt;/p&gt;
&lt;p&gt;You can build your own &lt;code&gt;ClientConfigLoadingRules&lt;/code&gt;,
but in most cases the defaults are fine.&lt;/p&gt;
&lt;h3 id=&#34;configure-the-overrides&#34;&gt;Configure the overrides&lt;/h3&gt;
&lt;p&gt;&lt;code&gt;clientcmd.ConfigOverrides&lt;/code&gt; is a &lt;code&gt;struct&lt;/code&gt; storing overrides that are applied on top of the settings loaded using the loading rules.
In the context of this article,
its primary purpose is to store values obtained from command line arguments.
These are handled using the &lt;a href=&#34;https://github.com/spf13/pflag&#34;&gt;pflag&lt;/a&gt; library,
which is a drop-in replacement for Go&#39;s &lt;a href=&#34;https://pkg.go.dev/flag&#34;&gt;&lt;code&gt;flag&lt;/code&gt;&lt;/a&gt; package,
adding support for double-hyphen arguments with long names.&lt;/p&gt;
&lt;p&gt;In most cases there&#39;s nothing to set in the overrides;
I will only bind them to flags.&lt;/p&gt;
&lt;h3 id=&#34;build-a-set-of-flags&#34;&gt;Build a set of flags&lt;/h3&gt;
&lt;p&gt;In this context, a flag is a representation of a command line argument,
specifying its long name (such as &lt;code&gt;--namespace&lt;/code&gt;),
its short name if any (such as &lt;code&gt;-n&lt;/code&gt;),
its default value,
and a description shown in the usage information.
Flags are stored in instances of
the &lt;a href=&#34;https://pkg.go.dev/k8s.io/client-go/tools/clientcmd#FlagInfo&#34;&gt;&lt;code&gt;FlagInfo&lt;/code&gt;&lt;/a&gt; struct.&lt;/p&gt;
&lt;p&gt;Three sets of flags are available,
representing the following command line arguments:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;authentication arguments (certificates, tokens, impersonation, username/password);&lt;/li&gt;
&lt;li&gt;cluster arguments (API server, certificate authority, TLS configuration, proxy, compression);&lt;/li&gt;
&lt;li&gt;context arguments (cluster name, &lt;code&gt;kubeconfig&lt;/code&gt; user name, namespace).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The recommended selection includes all three, plus a named context selection argument and a timeout argument.&lt;/p&gt;
&lt;p&gt;These are all available using the &lt;code&gt;Recommended…Flags&lt;/code&gt; functions.
The functions take a prefix, which is prepended to all the argument long names.&lt;/p&gt;
&lt;p&gt;So calling
&lt;a href=&#34;https://pkg.go.dev/k8s.io/client-go/tools/clientcmd#RecommendedConfigOverrideFlags&#34;&gt;&lt;code&gt;clientcmd.RecommendedConfigOverrideFlags(&amp;quot;&amp;quot;)&lt;/code&gt;&lt;/a&gt;
results in command line arguments such as &lt;code&gt;--context&lt;/code&gt;, &lt;code&gt;--namespace&lt;/code&gt;, and so on.
The &lt;code&gt;--timeout&lt;/code&gt; argument is given a default value of 0,
and the &lt;code&gt;--namespace&lt;/code&gt; argument has a corresponding short variant, &lt;code&gt;-n&lt;/code&gt;.
Adding a prefix, such as &lt;code&gt;&amp;quot;from-&amp;quot;&lt;/code&gt;, results in command line arguments such as
&lt;code&gt;--from-context&lt;/code&gt;, &lt;code&gt;--from-namespace&lt;/code&gt;, etc.
This might not seem particularly useful on commands involving a single API server,
but they come in handy when multiple API servers are involved,
such as in multi-cluster scenarios.&lt;/p&gt;
&lt;p&gt;There&#39;s a potential gotcha here: prefixes don&#39;t modify the short name,
so &lt;code&gt;--namespace&lt;/code&gt; needs some care if multiple prefixes are used:
only one of the prefixes can be associated with the &lt;code&gt;-n&lt;/code&gt; short name.
You&#39;ll have to clear the short names associated with the other prefixes&#39;
&lt;code&gt;--namespace&lt;/code&gt;, or perhaps with all prefixes if there&#39;s no sensible
&lt;code&gt;-n&lt;/code&gt; association.
Short names can be cleared as follows:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-go&#34; data-lang=&#34;go&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;kflags &lt;span style=&#34;color:#666&#34;&gt;:=&lt;/span&gt; clientcmd.&lt;span style=&#34;color:#00a000&#34;&gt;RecommendedConfigOverrideFlags&lt;/span&gt;(prefix)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;kflags.ContextOverrideFlags.Namespace.ShortName = &lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;In a similar fashion, flags can be disabled entirely by clearing their long name:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-go&#34; data-lang=&#34;go&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;kflags.ContextOverrideFlags.Namespace.LongName = &lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h3 id=&#34;bind-the-flags&#34;&gt;Bind the flags&lt;/h3&gt;
&lt;p&gt;Once a set of flags has been defined,
it can be used to bind command line arguments to overrides using
&lt;a href=&#34;https://pkg.go.dev/k8s.io/client-go/tools/clientcmd#BindOverrideFlags&#34;&gt;&lt;code&gt;clientcmd.BindOverrideFlags&lt;/code&gt;&lt;/a&gt;.
This requires a
&lt;a href=&#34;https://pkg.go.dev/github.com/spf13/pflag&#34;&gt;&lt;code&gt;pflag&lt;/code&gt;&lt;/a&gt; &lt;code&gt;FlagSet&lt;/code&gt;
rather than one from Go&#39;s &lt;code&gt;flag&lt;/code&gt; package.&lt;/p&gt;
&lt;p&gt;If you also want to bind &lt;code&gt;--kubeconfig&lt;/code&gt;, you should do so now,
by binding &lt;code&gt;ExplicitPath&lt;/code&gt; in the loading rules:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-go&#34; data-lang=&#34;go&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;flags.&lt;span style=&#34;color:#00a000&#34;&gt;StringVarP&lt;/span&gt;(&lt;span style=&#34;color:#666&#34;&gt;&amp;amp;&lt;/span&gt;loadingRules.ExplicitPath, &lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;kubeconfig&amp;#34;&lt;/span&gt;, &lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;&amp;#34;&lt;/span&gt;, &lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;&amp;#34;&lt;/span&gt;, &lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;absolute path(s) to the kubeconfig file(s)&amp;#34;&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h3 id=&#34;build-the-merged-configuration&#34;&gt;Build the merged configuration&lt;/h3&gt;
&lt;p&gt;Two functions are available to build a merged configuration:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://pkg.go.dev/k8s.io/client-go/tools/clientcmd#NewInteractiveDeferredLoadingClientConfig&#34;&gt;&lt;code&gt;clientcmd.NewInteractiveDeferredLoadingClientConfig&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://pkg.go.dev/k8s.io/client-go/tools/clientcmd#NewNonInteractiveDeferredLoadingClientConfig&#34;&gt;&lt;code&gt;clientcmd.NewNonInteractiveDeferredLoadingClientConfig&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;As the names suggest, the difference between the two is that the first
can ask for authentication information interactively,
using a provided reader,
whereas the second only operates on the information given to it by the caller.&lt;/p&gt;
&lt;p&gt;The &amp;quot;deferred&amp;quot; in these function names refers to the fact that
the final configuration will be determined as late as possible.
This means that these functions can be called before the command line arguments are parsed,
and the resulting configuration will use whatever values have been parsed
by the time it&#39;s actually constructed.&lt;/p&gt;
&lt;h3 id=&#34;obtain-an-api-client&#34;&gt;Obtain an API client&lt;/h3&gt;
&lt;p&gt;The merged configuration is returned as a
&lt;a href=&#34;https://pkg.go.dev/k8s.io/client-go/tools/clientcmd#ClientConfig&#34;&gt;&lt;code&gt;ClientConfig&lt;/code&gt;&lt;/a&gt; instance.
Calling its &lt;code&gt;ClientConfig()&lt;/code&gt; method yields a &lt;code&gt;restclient.Config&lt;/code&gt;, from which an API client can then be built.&lt;/p&gt;
&lt;p&gt;If no configuration is given
(&lt;code&gt;KUBECONFIG&lt;/code&gt; is empty or points to non-existent files,
&lt;code&gt;~/.kube/config&lt;/code&gt; doesn&#39;t exist,
and no configuration is given using command line arguments),
the default setup will return an obscure error referring to &lt;code&gt;KUBERNETES_MASTER&lt;/code&gt;.
This is legacy behaviour;
several attempts have been made to get rid of it,
but it is preserved for the &lt;code&gt;--local&lt;/code&gt; and &lt;code&gt;--dry-run&lt;/code&gt; command line arguments of &lt;code&gt;kubectl&lt;/code&gt;.
You should check for &amp;quot;empty configuration&amp;quot; errors by calling &lt;code&gt;clientcmd.IsEmptyConfig()&lt;/code&gt;
and provide a more explicit error message.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;Namespace()&lt;/code&gt; method is also useful:
it returns the namespace that should be used.
It also indicates whether the namespace was overridden by the user
(using &lt;code&gt;--namespace&lt;/code&gt;).&lt;/p&gt;
&lt;h2 id=&#34;full-example&#34;&gt;Full example&lt;/h2&gt;
&lt;p&gt;Here&#39;s a complete example.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-go&#34; data-lang=&#34;go&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;package&lt;/span&gt; main
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;import&lt;/span&gt; (
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;	&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;context&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;	&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;fmt&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;	&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;os&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;	&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;github.com/spf13/pflag&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;	v1 &lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;k8s.io/apimachinery/pkg/apis/meta/v1&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;	&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;k8s.io/client-go/kubernetes&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;	&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;k8s.io/client-go/tools/clientcmd&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;func&lt;/span&gt; &lt;span style=&#34;color:#00a000&#34;&gt;main&lt;/span&gt;() {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;	&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;// Loading rules, no configuration
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;&lt;/span&gt;	loadingRules &lt;span style=&#34;color:#666&#34;&gt;:=&lt;/span&gt; clientcmd.&lt;span style=&#34;color:#00a000&#34;&gt;NewDefaultClientConfigLoadingRules&lt;/span&gt;()
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;	&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;// Overrides and flag (command line argument) setup
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;&lt;/span&gt;	configOverrides &lt;span style=&#34;color:#666&#34;&gt;:=&lt;/span&gt; &lt;span style=&#34;color:#666&#34;&gt;&amp;amp;&lt;/span&gt;clientcmd.ConfigOverrides{}
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;	flags &lt;span style=&#34;color:#666&#34;&gt;:=&lt;/span&gt; pflag.&lt;span style=&#34;color:#00a000&#34;&gt;NewFlagSet&lt;/span&gt;(&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;clientcmddemo&amp;#34;&lt;/span&gt;, pflag.ExitOnError)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;	clientcmd.&lt;span style=&#34;color:#00a000&#34;&gt;BindOverrideFlags&lt;/span&gt;(configOverrides, flags,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;		clientcmd.&lt;span style=&#34;color:#00a000&#34;&gt;RecommendedConfigOverrideFlags&lt;/span&gt;(&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;&amp;#34;&lt;/span&gt;))
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;	flags.&lt;span style=&#34;color:#00a000&#34;&gt;StringVarP&lt;/span&gt;(&lt;span style=&#34;color:#666&#34;&gt;&amp;amp;&lt;/span&gt;loadingRules.ExplicitPath, &lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;kubeconfig&amp;#34;&lt;/span&gt;, &lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;&amp;#34;&lt;/span&gt;, &lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;&amp;#34;&lt;/span&gt;, &lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;absolute path(s) to the kubeconfig file(s)&amp;#34;&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;	flags.&lt;span style=&#34;color:#00a000&#34;&gt;Parse&lt;/span&gt;(os.Args[1:])
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;	&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;// Client construction
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;&lt;/span&gt;	kubeConfig &lt;span style=&#34;color:#666&#34;&gt;:=&lt;/span&gt; clientcmd.&lt;span style=&#34;color:#00a000&#34;&gt;NewNonInteractiveDeferredLoadingClientConfig&lt;/span&gt;(loadingRules, configOverrides)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;	config, err &lt;span style=&#34;color:#666&#34;&gt;:=&lt;/span&gt; kubeConfig.&lt;span style=&#34;color:#00a000&#34;&gt;ClientConfig&lt;/span&gt;()
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;	&lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;if&lt;/span&gt; err &lt;span style=&#34;color:#666&#34;&gt;!=&lt;/span&gt; &lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;nil&lt;/span&gt; {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;		&lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;if&lt;/span&gt; clientcmd.&lt;span style=&#34;color:#00a000&#34;&gt;IsEmptyConfig&lt;/span&gt;(err) {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;			&lt;span style=&#34;color:#a2f&#34;&gt;panic&lt;/span&gt;(&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;Please provide a configuration pointing to the Kubernetes API server&amp;#34;&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;		}
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;		&lt;span style=&#34;color:#a2f&#34;&gt;panic&lt;/span&gt;(err)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;	}
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;	client, err &lt;span style=&#34;color:#666&#34;&gt;:=&lt;/span&gt; kubernetes.&lt;span style=&#34;color:#00a000&#34;&gt;NewForConfig&lt;/span&gt;(config)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;	&lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;if&lt;/span&gt; err &lt;span style=&#34;color:#666&#34;&gt;!=&lt;/span&gt; &lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;nil&lt;/span&gt; {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;		&lt;span style=&#34;color:#a2f&#34;&gt;panic&lt;/span&gt;(err)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;	}
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;	&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;// How to find out what namespace to use
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;&lt;/span&gt;	namespace, overridden, err &lt;span style=&#34;color:#666&#34;&gt;:=&lt;/span&gt; kubeConfig.&lt;span style=&#34;color:#00a000&#34;&gt;Namespace&lt;/span&gt;()
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;	&lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;if&lt;/span&gt; err &lt;span style=&#34;color:#666&#34;&gt;!=&lt;/span&gt; &lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;nil&lt;/span&gt; {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;		&lt;span style=&#34;color:#a2f&#34;&gt;panic&lt;/span&gt;(err)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;	}
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;	fmt.&lt;span style=&#34;color:#00a000&#34;&gt;Printf&lt;/span&gt;(&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;Chosen namespace: %s; overridden: %t\n&amp;#34;&lt;/span&gt;, namespace, overridden)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;	&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;// Let&amp;#39;s use the client
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;&lt;/span&gt;	nodeList, err &lt;span style=&#34;color:#666&#34;&gt;:=&lt;/span&gt; client.&lt;span style=&#34;color:#00a000&#34;&gt;CoreV1&lt;/span&gt;().&lt;span style=&#34;color:#00a000&#34;&gt;Nodes&lt;/span&gt;().&lt;span style=&#34;color:#00a000&#34;&gt;List&lt;/span&gt;(context.&lt;span style=&#34;color:#00a000&#34;&gt;TODO&lt;/span&gt;(), v1.ListOptions{})
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;	&lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;if&lt;/span&gt; err &lt;span style=&#34;color:#666&#34;&gt;!=&lt;/span&gt; &lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;nil&lt;/span&gt; {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;		&lt;span style=&#34;color:#a2f&#34;&gt;panic&lt;/span&gt;(err)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;	}
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;	&lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;for&lt;/span&gt; _, node &lt;span style=&#34;color:#666&#34;&gt;:=&lt;/span&gt; &lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;range&lt;/span&gt; nodeList.Items {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;		fmt.&lt;span style=&#34;color:#00a000&#34;&gt;Println&lt;/span&gt;(node.Name)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;	}
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Happy coding, and thank you for your interest in implementing tools with
familiar usage patterns!&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>Kubernetes v1.35: Restricting executables invoked by kubeconfigs via exec plugin allowList added to kuberc</title>
      <link>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2026/01/09/kubernetes-v1-35-kuberc-credential-plugin-allowlist/</link>
      <pubDate>Fri, 09 Jan 2026 10:30:00 -0800</pubDate>
      
      <guid>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2026/01/09/kubernetes-v1-35-kuberc-credential-plugin-allowlist/</guid>
      <description>
        
        
        &lt;p&gt;Did you know that &lt;code&gt;kubectl&lt;/code&gt; can run arbitrary executables, including shell
scripts, with the full privileges of the invoking user, and without your
knowledge? Whenever you download or auto-generate a &lt;code&gt;kubeconfig&lt;/code&gt;, the
&lt;code&gt;users[n].exec.command&lt;/code&gt; field can specify an executable to fetch credentials on
your behalf. Don&#39;t get me wrong, this is an incredible feature that allows you
to authenticate to the cluster with external identity providers. Nevertheless,
you probably see the problem: Do you know exactly what executables your &lt;code&gt;kubeconfig&lt;/code&gt;
is running on your system? Do you trust the pipeline that generated your &lt;code&gt;kubeconfig&lt;/code&gt;?
If there has been a supply-chain attack on the code that generates the kubeconfig,
or if the generating pipeline has been compromised, an attacker might well be
doing unsavory things to your machine by tricking your &lt;code&gt;kubeconfig&lt;/code&gt; into running
arbitrary code.&lt;/p&gt;
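&lt;p&gt;For context, here is a minimal sketch of what such an exec stanza looks like in a &lt;code&gt;kubeconfig&lt;/code&gt;; the plugin name &lt;code&gt;cloudco-login&lt;/code&gt;, its argument, and the user name are hypothetical:&lt;/p&gt;

```yaml
# Hypothetical kubeconfig excerpt: kubectl runs the "cloudco-login"
# executable to fetch credentials for this user.
users:
- name: demo-user
  user:
    exec:
      apiVersion: client.authentication.k8s.io/v1
      command: cloudco-login
      args:
      - --token-only
      interactiveMode: Never
```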
&lt;p&gt;To give the user more control over what gets run on their system, &lt;a href=&#34;https://git.k8s.io/community/sig-auth&#34;&gt;SIG-Auth&lt;/a&gt; and &lt;a href=&#34;https://git.k8s.io/community/sig-cli&#34;&gt;SIG-CLI&lt;/a&gt; added the credential plugin policy and allowlist as a beta feature to
Kubernetes 1.35. This is available to all clients using the &lt;code&gt;client-go&lt;/code&gt; library,
by filling out the &lt;a href=&#34;https://github.com/kubernetes/client-go/blob/master/tools/clientcmd/api/types.go#L290&#34;&gt;ExecProvider.PluginPolicy&lt;/a&gt; struct on a REST config. To
broaden the impact of this change, Kubernetes v1.35 also lets you manage this without
writing a line of application code. You can configure &lt;code&gt;kubectl&lt;/code&gt; to enforce
the policy and allowlist by adding two fields to the &lt;code&gt;kuberc&lt;/code&gt; configuration
file: &lt;code&gt;credentialPluginPolicy&lt;/code&gt; and &lt;code&gt;credentialPluginAllowlist&lt;/code&gt;. Adding one or
both of these fields restricts which credential plugins &lt;code&gt;kubectl&lt;/code&gt; is allowed to execute.&lt;/p&gt;
&lt;h2 id=&#34;how-it-works&#34;&gt;How it works&lt;/h2&gt;
&lt;p&gt;A full description of this functionality is available in our &lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/reference/kubectl/kuberc/&#34;&gt;official documentation&lt;/a&gt; for kuberc,
but this blog post will give a brief overview of the new security knobs. The new
features are in beta and available without using any feature gates.&lt;/p&gt;
&lt;p&gt;The following example is the simplest one: simply don&#39;t specify the new fields.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;kubectl.config.k8s.io/v1beta1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Preference&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This will keep &lt;code&gt;kubectl&lt;/code&gt; acting as it always has, and all plugins will be
allowed.&lt;/p&gt;
&lt;p&gt;The next example is functionally identical, but it is more explicit and
therefore preferred if it&#39;s actually what you want:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;kubectl.config.k8s.io/v1beta1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Preference&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;credentialPluginPolicy&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;AllowAll&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;If you &lt;em&gt;don&#39;t know&lt;/em&gt; whether or not you&#39;re using exec credential plugins, try
setting your policy to &lt;code&gt;DenyAll&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;kubectl.config.k8s.io/v1beta1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Preference&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;credentialPluginPolicy&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;DenyAll&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;If you &lt;em&gt;are&lt;/em&gt; using credential plugins, you&#39;ll quickly find out what &lt;code&gt;kubectl&lt;/code&gt; is
trying to execute. You&#39;ll get an error like the following.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Unable to connect to the server: getting credentials: plugin &amp;quot;cloudco-login&amp;quot; not allowed: policy set to &amp;quot;DenyAll&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;If there is insufficient information for you to debug the issue, increase the
logging verbosity when you run your next command.  For example:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# increase or decrease verbosity if the issue is still unclear&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;kubectl get pods -v=&lt;span style=&#34;color:#666&#34;&gt;5&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h3 id=&#34;selectively-allowing-plugins&#34;&gt;Selectively allowing plugins&lt;/h3&gt;
&lt;p&gt;What if you need the &lt;code&gt;cloudco-login&lt;/code&gt; plugin to do your daily work? That is why
there&#39;s a third option for your policy, &lt;code&gt;Allowlist&lt;/code&gt;. To allow a specific plugin,
set the policy and add the &lt;code&gt;credentialPluginAllowlist&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;kubectl.config.k8s.io/v1beta1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Preference&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;credentialPluginPolicy&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Allowlist&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;credentialPluginAllowlist&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;/usr/local/bin/cloudco-login&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;get-identity&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Notice that there are two entries in the allowlist. One is specified by
full path; the other, &lt;code&gt;get-identity&lt;/code&gt;, is just a basename. When you
specify only a basename, the full path is looked up using
&lt;code&gt;exec.LookPath&lt;/code&gt;, which does not expand globs or handle wildcards.
Both forms (basename and full path) are acceptable, but the full path is
preferable because it narrows the scope of allowed binaries even further.&lt;/p&gt;
&lt;h3 id=&#34;future-enhancements&#34;&gt;Future enhancements&lt;/h3&gt;
&lt;p&gt;Currently, an allowlist entry has only one field, &lt;code&gt;name&lt;/code&gt;. In the future, we
(Kubernetes SIG CLI) want to see other requirements added. One idea that seems
useful is checksum verification whereby, for example, a binary would only be allowed
to run if it has the sha256 sum
&lt;code&gt;b9a3fad00d848ff31960c44ebb5f8b92032dc085020f857c98e32a5d5900ff9c&lt;/code&gt; &lt;strong&gt;and&lt;/strong&gt;
exists at the path &lt;code&gt;/usr/bin/cloudco-login&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Another possibility is allowing only binaries that have been signed by one of a
set of trusted signing keys.&lt;/p&gt;
&lt;h2 id=&#34;get-involved&#34;&gt;Get involved&lt;/h2&gt;
&lt;p&gt;The credential plugin policy is still under development and we are very interested
in your feedback. We&#39;d love to hear what you like about it and what problems
you&#39;d like to see it solve. Or, if you have the cycles to contribute one of the
above enhancements, that would be a great way to get started contributing to
Kubernetes. Feel free to join the discussion on Slack:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://kubernetes.slack.com/archives/C2GL57FJ4&#34;&gt;#sig-cli&lt;/a&gt;,&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://kubernetes.slack.com/archives/C0EN96KUY&#34;&gt;#sig-auth&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

      </description>
    </item>
    
    <item>
      <title>Kubernetes v1.35: Mutable PersistentVolume Node Affinity (alpha)</title>
      <link>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2026/01/08/kubernetes-v1-35-mutable-pv-nodeaffinity/</link>
      <pubDate>Thu, 08 Jan 2026 10:30:00 -0800</pubDate>
      
      <guid>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2026/01/08/kubernetes-v1-35-mutable-pv-nodeaffinity/</guid>
      <description>
        
        
        &lt;p&gt;The PersistentVolume &lt;a href=&#34;https://kubernetes.io/docs/concepts/storage/persistent-volumes/#node-affinity&#34;&gt;node affinity&lt;/a&gt; API
dates back to Kubernetes v1.10.
It is widely used to express that volumes may not be equally accessible by all nodes in the cluster.
This field was previously immutable; as of Kubernetes v1.35 it is mutable (alpha).
This change opens the door to more flexible online volume management.&lt;/p&gt;
&lt;h2 id=&#34;why-make-node-affinity-mutable&#34;&gt;Why make node affinity mutable?&lt;/h2&gt;
&lt;p&gt;This raises an obvious question: why make node affinity mutable now?
While stateless workloads like Deployments can be changed freely
and the changes will be rolled out automatically by re-creating every Pod,
PersistentVolumes (PVs) are stateful and cannot be re-created easily without losing data.&lt;/p&gt;
&lt;p&gt;However, storage providers evolve and storage requirements change.
Most notably, multiple providers are offering regional disks now.
Some of them even support live migration from zonal to regional disks, without disrupting the workloads.
This change can be expressed through the
&lt;a href=&#34;https://kubernetes.io/docs/concepts/storage/volume-attributes-classes/&#34;&gt;VolumeAttributesClass&lt;/a&gt; API,
which recently graduated to GA in Kubernetes v1.34.
However, even if the volume is migrated to regional storage,
Kubernetes still prevents scheduling Pods to other zones because of the node affinity recorded in the PV object.
In this case, you may want to change the PV node affinity from:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;spec&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;nodeAffinity&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;required&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;nodeSelectorTerms&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;matchExpressions&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;key&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;topology.kubernetes.io/zone&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;          &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;operator&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;In&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;          &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;values&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;          &lt;/span&gt;- us-east1-b&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;to:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;spec&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;nodeAffinity&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;required&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;nodeSelectorTerms&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;matchExpressions&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;key&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;topology.kubernetes.io/region&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;          &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;operator&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;In&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;          &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;values&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;          &lt;/span&gt;- us-east1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;As another example, providers sometimes offer new generations of disks,
and new disks cannot always be attached to older nodes in the cluster.
This accessibility constraint can also be expressed through PV node affinity,
which ensures Pods are scheduled to compatible nodes.
But after the disk is upgraded, new Pods using it could still be scheduled to older nodes.
To prevent this, you may want to change the PV node affinity from:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;spec&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;nodeAffinity&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;required&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;nodeSelectorTerms&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;matchExpressions&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;key&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;provider.com/disktype.gen1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;          &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;operator&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;In&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;          &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;values&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;          &lt;/span&gt;- available&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;to:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;spec&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;nodeAffinity&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;required&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;nodeSelectorTerms&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;matchExpressions&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;key&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;provider.com/disktype.gen2&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;          &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;operator&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;In&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;          &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;values&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;          &lt;/span&gt;- available&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;So the field is mutable now: a first step towards more flexible online volume management.
While this is a simple change that removes one validation from the API server,
there is still a long way to go to integrate it well with the Kubernetes ecosystem.&lt;/p&gt;
&lt;h2 id=&#34;try-it-out&#34;&gt;Try it out&lt;/h2&gt;
&lt;p&gt;This feature is for you if you are a Kubernetes cluster administrator,
your storage provider supports online updates that you want to take advantage of,
and those updates can affect which nodes are able to access the volume.&lt;/p&gt;
&lt;p&gt;Note that changing PV node affinity alone will not actually change the accessibility of the underlying volume.
Before using this feature,
you must first update the underlying volume in the storage provider,
and understand which nodes can access the volume after the update.
You can then enable this feature and keep the PV node affinity in sync.&lt;/p&gt;
&lt;p&gt;Currently, this feature is in alpha state.
It is disabled by default, and may be subject to change.
To try it out, enable the &lt;code&gt;MutablePVNodeAffinity&lt;/code&gt; feature gate on the kube-apiserver; you can then edit the PV &lt;code&gt;spec.nodeAffinity&lt;/code&gt; field.
Typically only administrators can edit PVs, so make sure you have the right RBAC permissions.&lt;/p&gt;
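&lt;p&gt;As an illustrative sketch (the PV name and label key here are hypothetical, and the right values depend on your storage provider), you could update the affinity with a merge patch once the feature gate is enabled:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-shell&#34;&gt;# Replaces spec.nodeAffinity on an existing PV; requires the
# MutablePVNodeAffinity feature gate and RBAC permission to update PVs.
kubectl patch pv example-pv --type=merge -p &#39;
spec:
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: provider.com/disktype.gen2
          operator: In
          values:
          - available
&#39;
&lt;/code&gt;&lt;/pre&gt;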
&lt;h3 id=&#34;race-condition-between-updating-and-scheduling&#34;&gt;Race condition between updating and scheduling&lt;/h3&gt;
&lt;p&gt;Only a few factors outside of a Pod can affect the scheduling decision, and PV node affinity is one of them.
Relaxing node affinity so that more nodes can access the volume is fine,
but there is a race condition when you tighten node affinity:
the scheduler may still see the old PV in its cache,
so there is a small window where it may place a Pod on a node that can no longer access the volume.
In this case, the Pod will be stuck in the &lt;code&gt;ContainerCreating&lt;/code&gt; state.&lt;/p&gt;
&lt;p&gt;One mitigation currently under discussion is for the kubelet to fail Pod startup if the PersistentVolume’s node affinity is violated.
This has not landed yet.
So if you are trying this out now, please watch subsequent Pods that use the updated PV,
and make sure they are scheduled onto nodes that can access the volume.
If you update a PV and immediately start new Pods in a script, it may not work as intended.&lt;/p&gt;
&lt;h2 id=&#34;future-integration-with-csi-container-storage-interface&#34;&gt;Future integration with CSI (Container Storage Interface)&lt;/h2&gt;
&lt;p&gt;Currently, it is up to the cluster administrator to modify both the PV&#39;s node affinity and the underlying volume in the storage provider.
But manual operations are error-prone and time-consuming.
We would prefer to eventually integrate this with VolumeAttributesClass,
so that an unprivileged user can modify their PersistentVolumeClaim (PVC) to trigger storage-side updates,
with the PV node affinity updated automatically when appropriate, without the cluster administrator&#39;s intervention.&lt;/p&gt;
&lt;h2 id=&#34;we-welcome-your-feedback-from-users-and-storage-driver-developers&#34;&gt;We welcome feedback from users and storage driver developers&lt;/h2&gt;
&lt;p&gt;As noted earlier, this is only a first step.&lt;/p&gt;
&lt;p&gt;If you are a Kubernetes user,
we would like to learn how you use (or will use) PV node affinity.
Is it beneficial to update it online in your case?&lt;/p&gt;
&lt;p&gt;If you are a CSI driver developer,
would you be willing to implement this feature? How would you like the API to look?&lt;/p&gt;
&lt;p&gt;Please provide your feedback via:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Slack channel &lt;a href=&#34;https://kubernetes.slack.com/messages/sig-storage&#34;&gt;#sig-storage&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Mailing list &lt;a href=&#34;https://groups.google.com/a/kubernetes.io/g/sig-storage&#34;&gt;kubernetes-sig-storage&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;The KEP issue &lt;a href=&#34;https://kep.k8s.io/5381&#34;&gt;Mutable PersistentVolume Node Affinity&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For any inquiries or specific questions related to this feature, please reach out to the &lt;a href=&#34;https://github.com/kubernetes/community/tree/master/sig-storage&#34;&gt;SIG Storage community&lt;/a&gt;.&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>Kubernetes v1.35: A Better Way to Pass Service Account Tokens to CSI Drivers</title>
      <link>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2026/01/07/kubernetes-v1-35-csi-sa-tokens-secrets-field-beta/</link>
      <pubDate>Wed, 07 Jan 2026 10:30:00 -0800</pubDate>
      
      <guid>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2026/01/07/kubernetes-v1-35-csi-sa-tokens-secrets-field-beta/</guid>
      <description>
        
        
        &lt;p&gt;If you maintain a CSI driver that uses service account tokens,
Kubernetes v1.35 brings a refinement you&#39;ll want to know about.
Since the introduction of the &lt;a href=&#34;https://kubernetes-csi.github.io/docs/token-requests.html&#34;&gt;TokenRequests feature&lt;/a&gt;,
service account tokens requested by CSI drivers have been passed to them through the &lt;code&gt;volume_context&lt;/code&gt; field.
While this has worked, it&#39;s not the ideal place for sensitive information,
and we&#39;ve seen instances where tokens were accidentally logged in CSI drivers.&lt;/p&gt;
&lt;p&gt;Kubernetes v1.35 introduces a beta solution to address this:
&lt;em&gt;CSI Driver Opt-in for Service Account Tokens via Secrets Field&lt;/em&gt;.
This allows CSI drivers to receive service account tokens
through the &lt;code&gt;secrets&lt;/code&gt; field in &lt;code&gt;NodePublishVolumeRequest&lt;/code&gt;,
which is the appropriate place for sensitive data in the CSI specification.&lt;/p&gt;
&lt;h2 id=&#34;understanding-the-existing-approach&#34;&gt;Understanding the existing approach&lt;/h2&gt;
&lt;p&gt;When CSI drivers use the &lt;a href=&#34;https://kubernetes-csi.github.io/docs/token-requests.html&#34;&gt;TokenRequests feature&lt;/a&gt;,
they can request service account tokens for workload identity
by configuring the &lt;code&gt;TokenRequests&lt;/code&gt; field in the CSIDriver spec.
These tokens are passed to drivers as part of the volume attributes map,
using the key &lt;code&gt;csi.storage.k8s.io/serviceAccount.tokens&lt;/code&gt;.&lt;/p&gt;
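&lt;p&gt;For illustration, the value under that key is a JSON-serialized map from audience to the requested token (the audience name and timestamp here are placeholders):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-json&#34;&gt;{
  &amp;#34;example.com&amp;#34;: {
    &amp;#34;token&amp;#34;: &amp;#34;&amp;lt;service-account-token&amp;gt;&amp;#34;,
    &amp;#34;expirationTimestamp&amp;#34;: &amp;#34;2026-01-07T12:00:00Z&amp;#34;
  }
}
&lt;/code&gt;&lt;/pre&gt;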
&lt;p&gt;The &lt;code&gt;volume_context&lt;/code&gt; field works, but it&#39;s not designed for sensitive data.
Because of this, there are a few challenges:&lt;/p&gt;
&lt;p&gt;First, the &lt;a href=&#34;https://github.com/kubernetes-csi/csi-lib-utils/tree/master/protosanitizer&#34;&gt;&lt;code&gt;protosanitizer&lt;/code&gt;&lt;/a&gt; tool that CSI drivers use doesn&#39;t treat volume context as sensitive,
so service account tokens can end up in logs when gRPC requests are logged.
This happened with &lt;a href=&#34;https://github.com/kubernetes-sigs/secrets-store-csi-driver/security/advisories/GHSA-g82w-58jf-gcxx&#34;&gt;CVE-2023-2878&lt;/a&gt; in the Secrets Store CSI Driver
and &lt;a href=&#34;https://github.com/kubernetes/kubernetes/issues/124759&#34;&gt;CVE-2024-3744&lt;/a&gt; in the Azure File CSI Driver.&lt;/p&gt;
&lt;p&gt;Second, each CSI driver that wants to avoid this issue needs to implement its own sanitization logic,
which leads to inconsistency across drivers.&lt;/p&gt;
&lt;p&gt;The CSI specification already has a &lt;code&gt;secrets&lt;/code&gt; field in &lt;code&gt;NodePublishVolumeRequest&lt;/code&gt;
that&#39;s designed exactly for this kind of sensitive information.
The challenge is that we can&#39;t just change where we put the tokens
without breaking existing CSI drivers that expect them in volume context.&lt;/p&gt;
&lt;h2 id=&#34;how-the-opt-in-mechanism-works&#34;&gt;How the opt-in mechanism works&lt;/h2&gt;
&lt;p&gt;Kubernetes v1.35 introduces an opt-in mechanism that lets CSI drivers choose
how they receive service account tokens.
This way, existing drivers continue working as they do today,
and drivers can move to the more appropriate secrets field when they&#39;re ready.&lt;/p&gt;
&lt;p&gt;CSI drivers can set a new field in their CSIDriver spec:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;#&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# CAUTION: this is an example configuration.&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;#          Do not use this for your own cluster!&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;#&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;storage.k8s.io/v1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;CSIDriver&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;metadata&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;example-csi-driver&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;spec&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# ... existing fields ...&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;tokenRequests&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;audience&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;example.com&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;expirationSeconds&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;3600&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# New field for opting into secrets delivery&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;serviceAccountTokenInSecrets&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;true&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# defaults to false&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The behavior depends on the &lt;code&gt;serviceAccountTokenInSecrets&lt;/code&gt; field:&lt;/p&gt;
&lt;p&gt;When set to &lt;code&gt;false&lt;/code&gt; (the default), tokens are placed in &lt;code&gt;VolumeContext&lt;/code&gt; with the key &lt;code&gt;csi.storage.k8s.io/serviceAccount.tokens&lt;/code&gt;, just like today.
When set to &lt;code&gt;true&lt;/code&gt;, tokens are placed only in the &lt;code&gt;Secrets&lt;/code&gt; field with the same key.&lt;/p&gt;
&lt;h2 id=&#34;about-the-beta-release&#34;&gt;About the beta release&lt;/h2&gt;
&lt;p&gt;The &lt;code&gt;CSIServiceAccountTokenSecrets&lt;/code&gt; feature gate is enabled by default
on both kubelet and kube-apiserver.
Since the &lt;code&gt;serviceAccountTokenInSecrets&lt;/code&gt; field defaults to &lt;code&gt;false&lt;/code&gt;,
enabling the feature gate doesn&#39;t change any existing behavior.
All drivers continue receiving tokens via volume context unless they explicitly opt in.
This is why we felt comfortable starting at beta rather than alpha.&lt;/p&gt;
&lt;h2 id=&#34;guide-for-csi-driver-authors&#34;&gt;Guide for CSI driver authors&lt;/h2&gt;
&lt;p&gt;If you maintain a CSI driver that uses service account tokens, here&#39;s how to adopt this feature.&lt;/p&gt;
&lt;h3 id=&#34;adding-fallback-logic&#34;&gt;Adding fallback logic&lt;/h3&gt;
&lt;p&gt;First, update your driver code to check both locations for tokens.
This makes your driver compatible with both the old and new approaches:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-go&#34; data-lang=&#34;go&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;const&lt;/span&gt; serviceAccountTokenKey = &lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;csi.storage.k8s.io/serviceAccount.tokens&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;func&lt;/span&gt; &lt;span style=&#34;color:#00a000&#34;&gt;getServiceAccountTokens&lt;/span&gt;(req &lt;span style=&#34;color:#666&#34;&gt;*&lt;/span&gt;csi.NodePublishVolumeRequest) (&lt;span style=&#34;color:#0b0;font-weight:bold&#34;&gt;string&lt;/span&gt;, &lt;span style=&#34;color:#0b0;font-weight:bold&#34;&gt;error&lt;/span&gt;) {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#080;font-style:italic&#34;&gt;// Check secrets field first (new behavior when driver opts in)
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;&lt;/span&gt;    &lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;if&lt;/span&gt; tokens, ok &lt;span style=&#34;color:#666&#34;&gt;:=&lt;/span&gt; req.Secrets[serviceAccountTokenKey]; ok {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        &lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;return&lt;/span&gt; tokens, &lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;nil&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    }
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#080;font-style:italic&#34;&gt;// Fall back to volume context (existing behavior)
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;&lt;/span&gt;    &lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;if&lt;/span&gt; tokens, ok &lt;span style=&#34;color:#666&#34;&gt;:=&lt;/span&gt; req.VolumeContext[serviceAccountTokenKey]; ok {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        &lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;return&lt;/span&gt; tokens, &lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;nil&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    }
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;return&lt;/span&gt; &lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;&amp;#34;&lt;/span&gt;, fmt.&lt;span style=&#34;color:#00a000&#34;&gt;Errorf&lt;/span&gt;(&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;service account tokens not found&amp;#34;&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This fallback logic is backward compatible and safe to ship in any driver version,
even before clusters upgrade to v1.35.&lt;/p&gt;
&lt;h3 id=&#34;rollout-sequence&#34;&gt;Rollout sequence&lt;/h3&gt;
&lt;p&gt;CSI driver authors need to follow a specific sequence when adopting this feature to avoid breaking existing volumes.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Driver preparation&lt;/strong&gt; (can happen anytime)&lt;/p&gt;
&lt;p&gt;You can start preparing your driver right away by adding the fallback logic that checks both the secrets field and volume context for tokens.
As noted above, this change is safe to ship in any driver version.
We encourage you to add it early, cut releases, and even backport to maintenance branches where feasible.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Cluster upgrade and feature enablement&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Once your driver has the fallback logic deployed, here&#39;s the safe rollout order for enabling the feature in a cluster:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Complete the kube-apiserver upgrade to 1.35 or later&lt;/li&gt;
&lt;li&gt;Complete kubelet upgrade to 1.35 or later on all nodes&lt;/li&gt;
&lt;li&gt;Ensure CSI driver version with fallback logic is deployed (if not already done in preparation phase)&lt;/li&gt;
&lt;li&gt;Fully complete CSI driver DaemonSet rollout across all nodes&lt;/li&gt;
&lt;li&gt;Update your CSIDriver manifest to set &lt;code&gt;serviceAccountTokenInSecrets: true&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
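&lt;p&gt;Steps 4 and 5 above could look something like this (the driver and DaemonSet names are hypothetical):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-shell&#34;&gt;# Wait until every node runs the driver version that has the fallback logic
kubectl -n kube-system rollout status daemonset/example-csi-driver

# Only then opt the driver into secrets delivery
kubectl patch csidriver example-csi-driver --type=merge \
  -p &#39;{&amp;#34;spec&amp;#34;:{&amp;#34;serviceAccountTokenInSecrets&amp;#34;:true}}&#39;
&lt;/code&gt;&lt;/pre&gt;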
&lt;h3 id=&#34;important-constraints&#34;&gt;Important constraints&lt;/h3&gt;
&lt;p&gt;The most important thing to remember is timing.
If your CSI driver DaemonSet and CSIDriver object are in the same manifest or Helm chart,
you need two separate updates.
Deploy the new driver version with fallback logic first,
wait for the DaemonSet rollout to complete,
then update the CSIDriver spec to set &lt;code&gt;serviceAccountTokenInSecrets: true&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Also, don&#39;t update the CSIDriver before all driver pods have rolled out.
If you do, volume mounts will fail on nodes still running the old driver version,
since those pods only check volume context.&lt;/p&gt;
&lt;h2 id=&#34;why-this-matters&#34;&gt;Why this matters&lt;/h2&gt;
&lt;p&gt;Adopting this feature helps in a few ways:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;It eliminates the risk of accidentally logging service account tokens as part of volume context in gRPC requests&lt;/li&gt;
&lt;li&gt;It uses the CSI specification&#39;s designated field for sensitive data, which feels right&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;protosanitizer&lt;/code&gt; tool automatically handles the secrets field correctly, so you don&#39;t need driver-specific workarounds&lt;/li&gt;
&lt;li&gt;It&#39;s opt-in, so you can migrate at your own pace without breaking existing deployments&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;call-to-action&#34;&gt;Call to action&lt;/h2&gt;
&lt;p&gt;We (Kubernetes SIG Storage) encourage CSI driver authors to adopt this feature and provide feedback
on the migration experience.
If you have thoughts on the API design or run into any issues during adoption,
please reach out to us on the
&lt;a href=&#34;https://kubernetes.slack.com/archives/C8EJ01Z46&#34;&gt;#csi&lt;/a&gt; channel on Kubernetes Slack
(for an invitation, visit &lt;a href=&#34;https://slack.k8s.io/&#34;&gt;https://slack.k8s.io/&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;You can follow along on
&lt;a href=&#34;https://kep.k8s.io/5538&#34;&gt;KEP-5538&lt;/a&gt;
to track progress across the coming Kubernetes releases.&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>Kubernetes v1.35: Extended Toleration Operators to Support Numeric Comparisons (Alpha)</title>
      <link>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2026/01/05/kubernetes-v1-35-numeric-toleration-operators/</link>
      <pubDate>Mon, 05 Jan 2026 10:30:00 -0800</pubDate>
      
      <guid>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2026/01/05/kubernetes-v1-35-numeric-toleration-operators/</guid>
      <description>
        
        
        &lt;p&gt;Many production Kubernetes clusters blend on-demand (higher-SLA) and spot/preemptible (lower-SLA) nodes to optimize costs while maintaining reliability for critical workloads. Platform teams need a safe default that keeps most workloads away from risky capacity, while allowing specific workloads to opt-in with explicit thresholds like &amp;quot;I can tolerate nodes with failure probability up to 5%&amp;quot;.&lt;/p&gt;
&lt;p&gt;Today, Kubernetes taints and tolerations can match exact values or check for existence, but they can&#39;t compare numeric thresholds. You&#39;d need to create discrete taint categories, use external admission controllers, or accept less-than-optimal placement decisions.&lt;/p&gt;
&lt;p&gt;In Kubernetes v1.35, we&#39;re introducing &lt;strong&gt;Extended Toleration Operators&lt;/strong&gt; as an alpha feature. This enhancement adds &lt;code&gt;Gt&lt;/code&gt; (Greater Than) and &lt;code&gt;Lt&lt;/code&gt; (Less Than) operators to &lt;code&gt;spec.tolerations&lt;/code&gt;, enabling threshold-based scheduling decisions that unlock new possibilities for SLA-based placement, cost optimization, and performance-aware workload distribution.&lt;/p&gt;
&lt;h2 id=&#34;the-evolution-of-tolerations&#34;&gt;The evolution of tolerations&lt;/h2&gt;
&lt;p&gt;Historically, Kubernetes supported two primary toleration operators:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;Equal&lt;/code&gt;&lt;/strong&gt;: The toleration matches a taint if the key and value are exactly equal&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;Exists&lt;/code&gt;&lt;/strong&gt;: The toleration matches a taint if the key exists, regardless of value&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;While these worked well for categorical scenarios, they fell short for numeric comparisons. Starting with v1.35, we are closing this gap.&lt;/p&gt;
&lt;p&gt;Consider these real-world scenarios:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;SLA requirements&lt;/strong&gt;: Schedule high-availability workloads only on nodes with failure probability below a certain threshold&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cost optimization&lt;/strong&gt;: Allow cost-sensitive batch jobs to run on cheaper nodes that exceed a specific cost-per-hour value&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Performance guarantees&lt;/strong&gt;: Ensure latency-sensitive applications run only on nodes with disk IOPS or network bandwidth above minimum thresholds&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Without numeric comparison operators, cluster operators have had to resort to workarounds like creating multiple discrete taint values or using external admission controllers, neither of which scale well or provide the flexibility needed for dynamic threshold-based scheduling.&lt;/p&gt;
&lt;h2 id=&#34;why-extend-tolerations-instead-of-using-nodeaffinity&#34;&gt;Why extend tolerations instead of using NodeAffinity?&lt;/h2&gt;
&lt;p&gt;You might wonder: NodeAffinity already supports numeric comparison operators, so why extend tolerations? While NodeAffinity is powerful for expressing pod preferences, taints and tolerations provide critical operational benefits:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Policy orientation&lt;/strong&gt;: NodeAffinity is per-pod, requiring every workload to explicitly opt-out of risky nodes. Taints invert control—nodes declare their risk level, and only pods with matching tolerations may land there. This provides a safer default; most pods stay away from spot/preemptible nodes unless they explicitly opt-in.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Eviction semantics&lt;/strong&gt;: NodeAffinity has no eviction capability. Taints support the &lt;code&gt;NoExecute&lt;/code&gt; effect with &lt;code&gt;tolerationSeconds&lt;/code&gt;, enabling operators to drain and evict pods when a node&#39;s SLA degrades or spot instances receive termination notices.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Operational ergonomics&lt;/strong&gt;: Centralized, node-side policy is consistent with other safety taints like disk-pressure and memory-pressure, making cluster management more intuitive.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This enhancement preserves the well-understood safety model of taints and tolerations while enabling threshold-based placement for SLA-aware scheduling.&lt;/p&gt;
&lt;h2 id=&#34;introducing-gt-and-lt-operators&#34;&gt;Introducing Gt and Lt operators&lt;/h2&gt;
&lt;p&gt;Kubernetes v1.35 introduces two new operators for tolerations:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;Gt&lt;/code&gt; (Greater Than)&lt;/strong&gt;: The toleration matches if the taint&#39;s numeric value is greater than the toleration&#39;s value&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;Lt&lt;/code&gt; (Less Than)&lt;/strong&gt;: The toleration matches if the taint&#39;s numeric value is less than the toleration&#39;s value&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;When a pod tolerates a taint with &lt;code&gt;Lt&lt;/code&gt;, it&#39;s saying &amp;quot;I can tolerate nodes where this metric is &lt;em&gt;less than&lt;/em&gt; my threshold&amp;quot;. Since tolerations allow scheduling, the pod can run on nodes where the taint value is below the toleration value. Think of it as: &amp;quot;I tolerate nodes that stay below my maximum acceptable value&amp;quot;.&lt;/p&gt;
&lt;p&gt;These operators work with numeric taint values and enable the scheduler to make sophisticated placement decisions based on continuous metrics rather than discrete categories.&lt;/p&gt;
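&lt;p&gt;As a quick sketch of the syntax (the Pod and container names are hypothetical; the taint key matches the spot-node example below), a Pod that can tolerate failure probabilities below 5% would declare:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-yaml&#34;&gt;apiVersion: v1
kind: Pod
metadata:
  name: batch-job
spec:
  tolerations:
  # Matches taints whose numeric value is less than 5
  - key: failure-probability
    operator: Lt
    value: &amp;#34;5&amp;#34;
    effect: NoExecute
  containers:
  - name: worker
    image: registry.k8s.io/pause:3.9
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A node tainted with &lt;code&gt;failure-probability=15&lt;/code&gt; would not be tolerated by this Pod, so it stays off the riskier capacity.&lt;/p&gt;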

&lt;div class=&#34;alert alert-info&#34; role=&#34;alert&#34;&gt;&lt;h4 class=&#34;alert-heading&#34;&gt;Note:&lt;/h4&gt;&lt;p&gt;Numeric values for &lt;code&gt;Gt&lt;/code&gt; and &lt;code&gt;Lt&lt;/code&gt; operators must be positive 64-bit integers without leading zeros. For example, &lt;code&gt;&amp;quot;100&amp;quot;&lt;/code&gt; is valid, but &lt;code&gt;&amp;quot;0100&amp;quot;&lt;/code&gt; (with leading zero) and &lt;code&gt;&amp;quot;0&amp;quot;&lt;/code&gt; (zero value) are not permitted.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;Gt&lt;/code&gt; and &lt;code&gt;Lt&lt;/code&gt; operators work with all taint effects: &lt;code&gt;NoSchedule&lt;/code&gt;, &lt;code&gt;NoExecute&lt;/code&gt;, and &lt;code&gt;PreferNoSchedule&lt;/code&gt;.&lt;/p&gt;
&lt;/div&gt;

&lt;h2 id=&#34;use-cases-and-examples&#34;&gt;Use cases and examples&lt;/h2&gt;
&lt;p&gt;Let&#39;s explore how Extended Toleration Operators solve real-world scheduling challenges.&lt;/p&gt;
&lt;h3 id=&#34;example-1-spot-instance-protection-with-sla-thresholds&#34;&gt;Example 1: Spot instance protection with SLA thresholds&lt;/h3&gt;
&lt;p&gt;Many clusters mix on-demand and spot/preemptible nodes to optimize costs. Spot nodes offer significant savings but have higher failure rates. You want most workloads to avoid spot nodes by default, while allowing specific workloads to opt-in with clear SLA boundaries.&lt;/p&gt;
&lt;p&gt;First, taint spot nodes with their failure probability (for example, 15% annual failure rate):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;v1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Node&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;metadata&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;spot-node-1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;spec&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;taints&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;key&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;failure-probability&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;value&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;15&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;effect&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;NoExecute&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;On-demand nodes have much lower failure rates:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;v1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Node&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;metadata&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;ondemand-node-1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;spec&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;taints&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;key&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;failure-probability&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;value&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;2&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;effect&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;NoExecute&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Critical workloads can specify strict SLA requirements:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;v1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Pod&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;metadata&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;payment-processor&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;spec&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;tolerations&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;key&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;failure-probability&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;operator&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;Lt&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;value&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;5&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;effect&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;NoExecute&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;tolerationSeconds&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;30&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;containers&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;app&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;image&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;payment-app:v1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This pod will &lt;strong&gt;only&lt;/strong&gt; tolerate nodes whose &lt;code&gt;failure-probability&lt;/code&gt; taint value is less than 5, so it can run on &lt;code&gt;ondemand-node-1&lt;/code&gt; (2%) but not on &lt;code&gt;spot-node-1&lt;/code&gt; (15%). The &lt;code&gt;NoExecute&lt;/code&gt; effect combined with &lt;code&gt;tolerationSeconds: 30&lt;/code&gt; means that if a node&#39;s SLA degrades (for example, the cloud provider raises the taint value), the pod gets 30 seconds to terminate gracefully before it is evicted.&lt;/p&gt;
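&lt;p&gt;As a sketch of how that degradation might be signaled in practice (using the node name from the example above), a cluster administrator or a cloud controller could overwrite the taint value with &lt;code&gt;kubectl&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-shell&#34;&gt;# Raise the failure-probability taint; --overwrite replaces the existing value.
# Pods whose Lt toleration no longer matches begin their tolerationSeconds countdown.
kubectl taint nodes spot-node-1 failure-probability=30:NoExecute --overwrite
&lt;/code&gt;&lt;/pre&gt;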
&lt;p&gt;Meanwhile, a fault-tolerant batch job can explicitly opt in to spot instances:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;v1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Pod&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;metadata&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;batch-job&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;spec&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;tolerations&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;key&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;failure-probability&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;operator&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;Lt&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;value&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;20&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;effect&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;NoExecute&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;containers&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;worker&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;image&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;batch-worker:v1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This batch job tolerates nodes with a &lt;code&gt;failure-probability&lt;/code&gt; below 20%, so it can run on both on-demand and spot nodes, maximizing cost savings while accepting higher risk.&lt;/p&gt;
&lt;h3 id=&#34;example-2-ai-workload-placement-with-gpu-tiers&#34;&gt;Example 2: AI workload placement with GPU tiers&lt;/h3&gt;
&lt;p&gt;AI and machine learning workloads often have specific hardware requirements. With Extended Toleration Operators, you can create GPU node tiers and ensure workloads land on appropriately powered hardware.&lt;/p&gt;
&lt;p&gt;Taint GPU nodes with their compute capability score:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;v1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Node&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;metadata&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;gpu-node-a100&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;spec&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;taints&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;key&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;gpu-compute-score&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;value&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;1000&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;effect&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;NoSchedule&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#00f;font-weight:bold&#34;&gt;---&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;v1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Node&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;metadata&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;gpu-node-t4&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;spec&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;taints&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;key&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;gpu-compute-score&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;value&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;500&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;effect&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;NoSchedule&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;A heavy training workload can require high-performance GPUs:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;v1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Pod&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;metadata&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;model-training&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;spec&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;tolerations&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;key&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;gpu-compute-score&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;operator&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;Gt&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;value&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;800&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;effect&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;NoSchedule&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;containers&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;trainer&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;image&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;ml-trainer:v1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;resources&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;limits&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;nvidia.com/gpu&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;1&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This ensures the training pod only schedules on nodes with compute scores greater than 800 (like the A100 node), preventing placement on lower-tier GPUs that would slow down training.&lt;/p&gt;
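&lt;p&gt;To confirm which tier a GPU node advertises (assuming the node names used in this example), you can inspect its taints directly:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-shell&#34;&gt;# Print the taints on each GPU node to verify its gpu-compute-score tier.
kubectl get node gpu-node-a100 -o jsonpath=&#39;{.spec.taints}&#39;
kubectl get node gpu-node-t4 -o jsonpath=&#39;{.spec.taints}&#39;
&lt;/code&gt;&lt;/pre&gt;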
&lt;p&gt;Meanwhile, inference workloads with less demanding requirements can run on any GPU node whose compute score is above 400 (both the T4 and A100 tiers in this example):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;v1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Pod&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;metadata&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;model-inference&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;spec&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;tolerations&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;key&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;gpu-compute-score&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;operator&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;Gt&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;value&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;400&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;effect&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;NoSchedule&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;containers&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;inference&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;image&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;ml-inference:v1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;resources&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;limits&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;nvidia.com/gpu&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;1&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h3 id=&#34;example-3-cost-optimized-workload-placement&#34;&gt;Example 3: Cost-optimized workload placement&lt;/h3&gt;
&lt;p&gt;For batch processing or non-critical workloads, you might want to minimize costs by running on cheaper nodes, even if they have lower performance characteristics.&lt;/p&gt;
&lt;p&gt;Nodes can be tainted with their cost rating:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;spec&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;taints&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;key&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;cost-per-hour&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;value&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;50&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;effect&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;NoSchedule&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;A cost-sensitive batch job can express its tolerance for expensive nodes:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;tolerations&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;key&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;cost-per-hour&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;operator&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;Lt&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;value&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;100&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;effect&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;NoSchedule&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This batch job will schedule on nodes costing less than $100/hour but avoid more expensive nodes. Combined with Kubernetes scheduling priorities, this enables sophisticated cost-tiering strategies where critical workloads get premium nodes while batch workloads efficiently use budget-friendly resources.&lt;/p&gt;
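&lt;p&gt;To complete a tiering setup, a critical workload that should run on any node regardless of price can use the long-standing &lt;code&gt;Exists&lt;/code&gt; operator, which tolerates the taint whatever its value (a sketch; the taint key follows the example above):&lt;/p&gt;

```yaml
# Critical workload: tolerate cost-per-hour taints of any value, so premium
# and budget nodes are both eligible.
tolerations:
- key: "cost-per-hour"
  operator: "Exists"
  effect: "NoSchedule"
```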
&lt;h3 id=&#34;example-4-performance-based-placement&#34;&gt;Example 4: Performance-based placement&lt;/h3&gt;
&lt;p&gt;Storage-intensive applications often require minimum disk performance guarantees. With Extended Toleration Operators, you can enforce these requirements at the scheduling level.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;tolerations&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;key&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;disk-iops&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;operator&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;Gt&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;value&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;3000&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;effect&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;NoSchedule&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This toleration ensures the pod only schedules on nodes where &lt;code&gt;disk-iops&lt;/code&gt; exceeds 3000. The &lt;code&gt;Gt&lt;/code&gt; operator means &amp;quot;I need nodes that are greater than this minimum&amp;quot;.&lt;/p&gt;
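&lt;p&gt;The match itself is a plain integer comparison between the taint&#39;s value and the toleration&#39;s value. As a shell sketch of the semantics described above (illustrative values, not scheduler code):&lt;/p&gt;

```shell
# Node taint: disk-iops=5000. Pod toleration: operator "Gt", value "3000".
# The toleration matches when the taint's value is greater than the
# toleration's value, so this pod tolerates the taint.
taint_value=5000
toleration_value=3000
if [ "$taint_value" -gt "$toleration_value" ]; then
  echo "toleration matches: taint is tolerated"
else
  echo "no match: pod is repelled by the taint"
fi
```

&lt;p&gt;With the &lt;code&gt;Lt&lt;/code&gt; operator the comparison simply flips: the toleration matches when the taint&#39;s value is less than the toleration&#39;s value.&lt;/p&gt;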
&lt;h2 id=&#34;how-to-use-this-feature&#34;&gt;How to use this feature&lt;/h2&gt;
&lt;p&gt;Extended Toleration Operators is an &lt;strong&gt;alpha feature&lt;/strong&gt; in Kubernetes v1.35. To try it out:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Enable the feature gate&lt;/strong&gt; on both your API server and scheduler:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;--feature-gates&lt;span style=&#34;color:#666&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#b8860b&#34;&gt;TaintTolerationComparisonOperators&lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#a2f&#34;&gt;true&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Taint your nodes&lt;/strong&gt; with numeric values representing the metrics relevant to your scheduling needs:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  kubectl taint nodes node-1 failure-probability&lt;span style=&#34;color:#666&#34;&gt;=&lt;/span&gt;5:NoSchedule
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  kubectl taint nodes node-2 disk-iops&lt;span style=&#34;color:#666&#34;&gt;=&lt;/span&gt;5000:NoSchedule
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Use the new operators&lt;/strong&gt; in your pod specifications:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;spec&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;tolerations&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;key&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;failure-probability&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;operator&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;Lt&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;value&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;1&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;effect&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;NoSchedule&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class=&#34;alert alert-info&#34; role=&#34;alert&#34;&gt;&lt;h4 class=&#34;alert-heading&#34;&gt;Note:&lt;/h4&gt;As an alpha feature, Extended Toleration Operators may change in future releases and should be used with caution in production environments. Always test thoroughly in non-production clusters first.&lt;/div&gt;

&lt;h2 id=&#34;what-s-next&#34;&gt;What&#39;s next?&lt;/h2&gt;
&lt;p&gt;This alpha release is just the beginning. As we gather feedback from the community, we plan to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Add support for &lt;a href=&#34;https://github.com/kubernetes/enhancements/issues/5500&#34;&gt;CEL (Common Expression Language) expressions&lt;/a&gt; in tolerations and node affinity for even more flexible scheduling logic, including semantic versioning comparisons&lt;/li&gt;
&lt;li&gt;Improve integration with cluster autoscaling for threshold-aware capacity planning&lt;/li&gt;
&lt;li&gt;Graduate the feature to beta and eventually GA with production-ready stability&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We&#39;re particularly interested in hearing about your use cases! Do you have scenarios where threshold-based scheduling would solve problems? Are there additional operators or capabilities you&#39;d like to see?&lt;/p&gt;
&lt;h2 id=&#34;getting-involved&#34;&gt;Getting involved&lt;/h2&gt;
&lt;p&gt;This feature is driven by the &lt;a href=&#34;https://github.com/kubernetes/community/tree/master/sig-scheduling&#34;&gt;SIG Scheduling&lt;/a&gt; community. Please join us to connect with the community and share your ideas and feedback around this feature and beyond.&lt;/p&gt;
&lt;p&gt;You can reach the maintainers of this feature at:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Slack: &lt;a href=&#34;https://kubernetes.slack.com/messages/sig-scheduling&#34;&gt;#sig-scheduling&lt;/a&gt; on Kubernetes Slack&lt;/li&gt;
&lt;li&gt;Mailing list: &lt;a href=&#34;https://groups.google.com/g/kubernetes-sig-scheduling&#34;&gt;kubernetes-sig-scheduling@googlegroups.com&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For questions or specific inquiries related to Extended Toleration Operators, please reach out to the SIG Scheduling community. We look forward to hearing from you!&lt;/p&gt;
&lt;h2 id=&#34;how-can-i-learn-more&#34;&gt;How can I learn more?&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/concepts/scheduling-eviction/taint-and-toleration/&#34;&gt;Taints and Tolerations&lt;/a&gt; for understanding the fundamentals&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/concepts/scheduling-eviction/taint-and-toleration/#numeric-comparison-operators&#34;&gt;Numeric comparison operators&lt;/a&gt; for details on using &lt;code&gt;Gt&lt;/code&gt; and &lt;code&gt;Lt&lt;/code&gt; operators&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://kep.k8s.io/5471&#34;&gt;KEP-5471: Extended Toleration Operators for Threshold-Based Placement&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

      </description>
    </item>
    
    <item>
      <title>Kubernetes v1.35: New level of efficiency with in-place Pod restart</title>
      <link>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2026/01/02/kubernetes-v1-35-restart-all-containers/</link>
      <pubDate>Fri, 02 Jan 2026 10:30:00 -0800</pubDate>
      
      <guid>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2026/01/02/kubernetes-v1-35-restart-all-containers/</guid>
      <description>
        
        
        &lt;p&gt;The release of Kubernetes 1.35 introduces a powerful new feature that provides a much-requested capability: the ability to trigger a full, in-place restart of a Pod. This feature, &lt;em&gt;Restart All Containers&lt;/em&gt; (alpha in 1.35), offers an efficient way to reset a Pod&#39;s state compared to the resource-intensive approach of deleting and recreating the entire Pod. It is especially useful for AI/ML workloads, allowing application developers to concentrate on their core training logic while offloading complex failure-handling and recovery mechanisms to sidecars and declarative Kubernetes configuration. With &lt;code&gt;RestartAllContainers&lt;/code&gt; and other planned enhancements, Kubernetes continues to add building blocks for creating flexible, robust, and efficient platforms for AI/ML workloads.&lt;/p&gt;
&lt;p&gt;This new functionality is available by enabling the &lt;code&gt;RestartAllContainersOnContainerExits&lt;/code&gt; feature gate. This alpha feature extends the &lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/concepts/workloads/pods/pod-lifecycle/#container-restart-rules&#34;&gt;&lt;em&gt;Container Restart Rules&lt;/em&gt; feature&lt;/a&gt;, which graduated to beta in Kubernetes 1.35.&lt;/p&gt;
&lt;h2 id=&#34;the-problem-when-a-single-container-restart-isn-t-enough-and-recreating-pods-is-too-costly&#34;&gt;The problem: when a single container restart isn&#39;t enough and recreating pods is too costly&lt;/h2&gt;
&lt;p&gt;Kubernetes has long supported restart policies at the Pod level (&lt;code&gt;restartPolicy&lt;/code&gt;) and, more recently, at the &lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/08/29/kubernetes-v1-34-per-container-restart-policy/&#34;&gt;individual container level&lt;/a&gt;. These policies are great for handling crashes in a single, isolated process. However, many modern applications have more complex inter-container dependencies. For instance:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;An &lt;strong&gt;init container&lt;/strong&gt; prepares the environment by mounting a volume or generating a configuration file. If the main application container corrupts this environment, simply restarting that one container is not enough. The entire initialization process needs to run again.&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;watcher sidecar&lt;/strong&gt; monitors system health. If it detects an unrecoverable but retriable error state, it must trigger a restart of the main application container from a clean slate.&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;sidecar&lt;/strong&gt; that manages a remote resource fails. Even if the sidecar restarts on its own, the main container may be stuck trying to access an outdated or broken connection.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In all these cases, the desired action is not to restart a single container, but all of them. Previously, the only way to achieve this was to delete the Pod and have a controller (like a Job or ReplicaSet) create a new one. This process is slow and expensive, involving the scheduler, node resource allocation and re-initialization of networking and storage.&lt;/p&gt;
&lt;p&gt;This inefficiency becomes even worse for large-scale AI/ML workloads (&amp;gt;= 1,000 Nodes with one Pod per Node). A common requirement for these synchronous workloads is that when a failure occurs (such as a Node crash), all Pods in the fleet must be recreated to reset their state before training can resume, even though most of them were not directly affected by the failure. Deleting, creating, and scheduling thousands of Pods simultaneously creates a massive bottleneck. The overhead of handling such a failure is estimated to cost &lt;a href=&#34;https://docs.google.com/document/d/16zexVooHKPc80F4dVtUjDYK9DOpkVPRNfSv0zRtfFpk/edit?tab=t.0#bookmark=id.qwqcnzf96avw&#34;&gt;$100,000 per month in wasted resources&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Handling these failures for AI/ML training jobs requires a complex integration touching both the training framework and Kubernetes, and such integrations are often fragile and toilsome. This feature introduces a Kubernetes-native solution, improving system robustness and allowing application developers to concentrate on their core training logic.&lt;/p&gt;
&lt;p&gt;Another major benefit of restarting Pods in place is that keeping Pods on their assigned Nodes allows for further optimizations. For example, one can implement node-level caching tied to a specific Pod identity, something that is impossible when Pods are unnecessarily being recreated on different Nodes.&lt;/p&gt;
&lt;h2 id=&#34;introducing-the-restartallcontainers-action&#34;&gt;Introducing the &lt;code&gt;RestartAllContainers&lt;/code&gt; action&lt;/h2&gt;
&lt;p&gt;To address this, Kubernetes v1.35 adds a new action to the container restart rules: &lt;code&gt;RestartAllContainers&lt;/code&gt;. When a container exits in a way that matches a rule with this action, the kubelet initiates a fast, &lt;strong&gt;in-place&lt;/strong&gt; restart of the Pod.&lt;/p&gt;
&lt;p&gt;This in-place restart is highly efficient because it preserves the Pod&#39;s most important resources:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The Pod&#39;s UID, IP address and network namespace.&lt;/li&gt;
&lt;li&gt;The Pod&#39;s sandbox and any attached devices.&lt;/li&gt;
&lt;li&gt;All volumes, including &lt;code&gt;emptyDir&lt;/code&gt; and mounted volumes from PVCs.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;After terminating all running containers, the Pod&#39;s startup sequence is re-executed from the very beginning. This means all &lt;strong&gt;init containers&lt;/strong&gt; are run again in order, followed by the sidecar and regular containers, ensuring a completely fresh start in a known-good environment. With the exception of ephemeral containers (which are terminated), all other containers—including those that previously succeeded or failed—will be restarted, regardless of their individual restart policies.&lt;/p&gt;
&lt;h2 id=&#34;use-cases&#34;&gt;Use cases&lt;/h2&gt;
&lt;h3 id=&#34;1-efficient-restarts-for-ml-batch-jobs&#34;&gt;1. Efficient restarts for ML/Batch jobs&lt;/h3&gt;
&lt;p&gt;For ML training jobs, &lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/07/03/navigating-failures-in-pods-with-devices/#roadmap-for-failure-modes-container-code-failed&#34;&gt;rescheduling a worker Pod on failure&lt;/a&gt; is a costly operation that wastes valuable compute resources. On a 1,000-node training cluster, rescheduling overhead can waste &lt;a href=&#34;https://docs.google.com/document/d/16zexVooHKPc80F4dVtUjDYK9DOpkVPRNfSv0zRtfFpk/edit?tab=t.0#bookmark=id.qwqcnzf96avw&#34;&gt;over $100,000 in compute resources monthly&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;With the &lt;code&gt;RestartAllContainers&lt;/code&gt; action, you can address this with a much faster, hybrid recovery strategy: recreate only the &amp;quot;bad&amp;quot; Pods (e.g., those on unhealthy Nodes) while triggering &lt;code&gt;RestartAllContainers&lt;/code&gt; for the remaining healthy Pods. Benchmarks show this reduces the recovery overhead &lt;a href=&#34;https://docs.google.com/document/d/16zexVooHKPc80F4dVtUjDYK9DOpkVPRNfSv0zRtfFpk/edit?tab=t.0#bookmark=id.cwkee8kar0i5&#34;&gt;from minutes to a few seconds&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;With in-place restarts, a watcher sidecar can monitor the main training process. If it encounters a specific, retriable error, the watcher can exit with a designated code to trigger a fast reset of the worker Pod, allowing it to restart from the last checkpoint without involving the Job controller. This capability is now natively supported by Kubernetes.&lt;/p&gt;
&lt;p&gt;Read more details about future development and JobSet features at &lt;a href=&#34;https://github.com/kubernetes-sigs/jobset/issues/467&#34;&gt;KEP-467 JobSet in-place restart&lt;/a&gt;.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;v1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Pod&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;metadata&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;ml-worker-pod&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;spec&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;restartPolicy&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Never&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;initContainers&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# This init container will re-run on every in-place restart&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;setup-environment&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;image&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;my-repo/setup-worker:1.0&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;watcher-sidecar&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;image&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;my-repo/watcher:1.0&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;restartPolicy&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Always&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;restartPolicyRules&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;action&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;RestartAllContainers&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;exitCodes&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;operator&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;In&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# A specific exit code from the watcher triggers a full pod restart&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;values&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;[&lt;span style=&#34;color:#666&#34;&gt;88&lt;/span&gt;]&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;containers&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;main-application&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;image&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;my-repo/training-app:1.0&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h3 id=&#34;2-re-running-init-containers-for-a-clean-state&#34;&gt;2. Re-running init containers for a clean state&lt;/h3&gt;
&lt;p&gt;Imagine a scenario where an init container is responsible for fetching credentials or setting up a shared volume. If the main application fails in a way that corrupts this shared state, you need the &lt;a href=&#34;https://github.com/kubernetes/enhancements/issues/3676&#34;&gt;init container to rerun&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;By configuring the main application to exit with a specific code upon detecting such a corruption, you can trigger the &lt;code&gt;RestartAllContainers&lt;/code&gt; action, guaranteeing that the init container provides a clean setup before the application restarts.&lt;/p&gt;
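&lt;p&gt;A minimal sketch of that wiring, assuming the application exits with a designated code (&lt;code&gt;42&lt;/code&gt; here, purely illustrative) when it detects corrupted shared state:&lt;/p&gt;

```yaml
containers:
- name: main-application
  image: my-repo/training-app:1.0  # illustrative image
  restartPolicyRules:
  - action: RestartAllContainers
    exitCodes:
      operator: In
      values: [42]  # hypothetical "shared state corrupted" exit code
```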
&lt;h3 id=&#34;3-handling-high-rate-of-similar-tasks-execution&#34;&gt;3. Handling high rate of similar tasks execution&lt;/h3&gt;
&lt;p&gt;Some tasks are best represented as the execution of a single Pod, where each task requires a clean environment. The task might be a game session backend or the processing of a queue item. When the task rate is high, running the full cycle of Pod creation, scheduling, and initialization for every task is simply too expensive, especially when tasks are short. The ability to restart all containers from scratch enables a Kubernetes-native way to handle this scenario without custom solutions or frameworks.&lt;/p&gt;
&lt;h2 id=&#34;how-to-use-it&#34;&gt;How to use it&lt;/h2&gt;
&lt;p&gt;To try this feature, you must enable the &lt;code&gt;RestartAllContainersOnContainerExits&lt;/code&gt; feature gate on your Kubernetes cluster components (API server and kubelet) running Kubernetes v1.35+. This alpha feature extends the &lt;code&gt;ContainerRestartRules&lt;/code&gt; feature, which graduated to beta in v1.35 and is enabled by default.&lt;/p&gt;
&lt;p&gt;Once enabled, you can add &lt;code&gt;restartPolicyRules&lt;/code&gt; to any container (init, sidecar, or regular) and use the &lt;code&gt;RestartAllContainers&lt;/code&gt; action.&lt;/p&gt;
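&lt;p&gt;As a minimal sketch of what this could look like (the names, image, and exit code &lt;code&gt;42&lt;/code&gt; below are illustrative, and the exact field requirements follow the &lt;code&gt;ContainerRestartRules&lt;/code&gt; documentation):&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: training-worker
spec:
  restartPolicy: Never
  initContainers:
  - name: fetch-credentials
    image: my-repo/setup:1.0
  containers:
  - name: app
    image: my-repo/training-app:1.0
    restartPolicy: Never
    restartPolicyRules:
    # Exit code 42 signals corrupted shared state; restart everything,
    # including init containers, so the app starts from a clean setup.
    - action: RestartAllContainers
      exitCodes:
        operator: In
        values: [42]
&lt;/code&gt;&lt;/pre&gt;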
&lt;p&gt;The feature is designed to be easily usable on existing apps. However, if an application does not follow some best practices, it may cause issues for the application or for observability tooling. When enabling the feature, make sure that all containers are reentrant and that external tooling is prepared for init containers to re-run. Also, when restarting all containers, the kubelet does not run &lt;code&gt;preStop&lt;/code&gt; hooks. This means containers must be designed to handle abrupt termination without relying on &lt;code&gt;preStop&lt;/code&gt; hooks for graceful shutdown.&lt;/p&gt;
&lt;h2 id=&#34;observing-the-restart&#34;&gt;Observing the restart&lt;/h2&gt;
&lt;p&gt;To make this process observable, a new Pod condition, &lt;code&gt;AllContainersRestarting&lt;/code&gt;, is added to the Pod&#39;s status. When a restart is triggered, this condition becomes &lt;code&gt;True&lt;/code&gt;; it reverts to &lt;code&gt;False&lt;/code&gt; once all containers have terminated and the Pod is ready to start its lifecycle anew. This provides a clear signal to users and other cluster components about the Pod&#39;s state.&lt;/p&gt;
&lt;p&gt;All containers restarted by this action will have their restart count incremented in the container status.&lt;/p&gt;
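&lt;p&gt;One way to observe both signals from the command line (the Pod name &lt;code&gt;training-worker&lt;/code&gt; is a placeholder) is with &lt;code&gt;kubectl&lt;/code&gt; JSONPath queries:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# Show the AllContainersRestarting condition, if present
kubectl get pod training-worker \
  -o jsonpath=&#39;{.status.conditions[?(@.type==&#34;AllContainersRestarting&#34;)]}&#39;

# Show per-container restart counts
kubectl get pod training-worker \
  -o jsonpath=&#39;{range .status.containerStatuses[*]}{.name}: {.restartCount}{&#34;\n&#34;}{end}&#39;
&lt;/code&gt;&lt;/pre&gt;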
&lt;h2 id=&#34;learn-more&#34;&gt;Learn more&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Read the official documentation on &lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/concepts/workloads/pods/pod-lifecycle/#restart-all-containers&#34;&gt;Pod Lifecycle&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Read the detailed proposal in the &lt;a href=&#34;https://kep.k8s.io/5532&#34;&gt;KEP-5532: Restart All Containers on Container Exits&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Read the proposal for JobSet in-place restart in &lt;a href=&#34;https://github.com/kubernetes-sigs/jobset/issues/467&#34;&gt;JobSet issue #467&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;we-want-your-feedback&#34;&gt;We want your feedback!&lt;/h2&gt;
&lt;p&gt;As an alpha feature, &lt;code&gt;RestartAllContainers&lt;/code&gt; is ready for you to experiment with, and any use cases and feedback are welcome. This feature is driven by the &lt;a href=&#34;https://github.com/kubernetes/community/blob/master/sig-node/README.md&#34;&gt;SIG Node&lt;/a&gt; community. If you are interested in getting involved, sharing your thoughts, or contributing, please join us!&lt;/p&gt;
&lt;p&gt;You can reach SIG Node through:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Slack: &lt;a href=&#34;https://kubernetes.slack.com/messages/sig-node&#34;&gt;#sig-node&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://groups.google.com/forum/#!forum/kubernetes-sig-node&#34;&gt;Mailing list&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

      </description>
    </item>
    
    <item>
      <title>Kubernetes 1.35: Enhanced Debugging with Versioned z-pages APIs</title>
      <link>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/12/31/kubernetes-v1-35-structured-zpages/</link>
      <pubDate>Wed, 31 Dec 2025 10:30:00 -0800</pubDate>
      
      <guid>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/12/31/kubernetes-v1-35-structured-zpages/</guid>
      <description>
        
        
        &lt;p&gt;Debugging Kubernetes control plane components can be challenging, especially when you need to quickly understand the runtime state of a component or verify its configuration. With Kubernetes 1.35, we&#39;re enhancing the z-pages debugging endpoints with structured, machine-parseable responses that make it easier to build tooling and automate troubleshooting workflows.&lt;/p&gt;
&lt;h2 id=&#34;what-are-z-pages&#34;&gt;What are z-pages?&lt;/h2&gt;
&lt;p&gt;z-pages are special debugging endpoints exposed by Kubernetes control plane components. Introduced as an alpha feature in Kubernetes 1.32, these endpoints provide runtime diagnostics for components like &lt;code&gt;kube-apiserver&lt;/code&gt;, &lt;code&gt;kube-controller-manager&lt;/code&gt;, &lt;code&gt;kube-scheduler&lt;/code&gt;, &lt;code&gt;kubelet&lt;/code&gt;, and &lt;code&gt;kube-proxy&lt;/code&gt;. The name &amp;quot;z-pages&amp;quot; comes from the convention of using &lt;code&gt;/*z&lt;/code&gt; paths for debugging endpoints.&lt;/p&gt;
&lt;p&gt;Currently, Kubernetes supports two primary z-page endpoints:&lt;/p&gt;
&lt;dl&gt;
&lt;dt&gt;&lt;code&gt;/statusz&lt;/code&gt;&lt;/dt&gt;
&lt;dd&gt;Displays high-level component information including version information, start time, uptime, and available debug paths&lt;/dd&gt;
&lt;dt&gt;&lt;code&gt;/flagz&lt;/code&gt;&lt;/dt&gt;
&lt;dd&gt;Shows all command-line arguments and their values used to start the component (with confidential values redacted for security)&lt;/dd&gt;
&lt;/dl&gt;
&lt;p&gt;These endpoints are valuable for human operators who need to quickly inspect component state, but until now, they only returned plain text output that was difficult to parse programmatically.&lt;/p&gt;
&lt;h2 id=&#34;what-s-new-in-kubernetes-1-35&#34;&gt;What&#39;s new in Kubernetes 1.35?&lt;/h2&gt;
&lt;p&gt;Kubernetes 1.35 introduces structured, versioned responses for both &lt;code&gt;/statusz&lt;/code&gt; and &lt;code&gt;/flagz&lt;/code&gt; endpoints. This enhancement maintains backward compatibility with the existing plain text format while adding support for machine-readable JSON responses.&lt;/p&gt;
&lt;h3 id=&#34;backward-compatible-design&#34;&gt;Backward compatible design&lt;/h3&gt;
&lt;p&gt;The new structured responses are opt-in. Without specifying an &lt;code&gt;Accept&lt;/code&gt; header, the endpoints continue to return the familiar plain text format:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;$ curl --cert /etc/kubernetes/pki/apiserver-kubelet-client.crt \
  --key /etc/kubernetes/pki/apiserver-kubelet-client.key \
  --cacert /etc/kubernetes/pki/ca.crt \
  https://localhost:6443/statusz

kube-apiserver statusz
Warning: This endpoint is not meant to be machine parseable, has no formatting compatibility guarantees and is for debugging purposes only.

Started: Wed Oct 16 21:03:43 UTC 2024
Up: 0 hr 00 min 16 sec
Go version: go1.23.2
Binary version: 1.35.0-alpha.0.1595
Emulation version: 1.35
Paths: /healthz /livez /metrics /readyz /statusz /version
&lt;/code&gt;&lt;/pre&gt;&lt;h3 id=&#34;structured-json-responses&#34;&gt;Structured JSON responses&lt;/h3&gt;
&lt;p&gt;To receive a structured response, include the appropriate &lt;code&gt;Accept&lt;/code&gt; header:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Accept: application/json;v=v1alpha1;g=config.k8s.io;as=Statusz
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;This returns a versioned JSON response:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-json&#34; data-lang=&#34;json&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;{
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;&amp;#34;kind&amp;#34;&lt;/span&gt;: &lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;Statusz&amp;#34;&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;&amp;#34;apiVersion&amp;#34;&lt;/span&gt;: &lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;config.k8s.io/v1alpha1&amp;#34;&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;&amp;#34;metadata&amp;#34;&lt;/span&gt;: {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;&amp;#34;name&amp;#34;&lt;/span&gt;: &lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;kube-apiserver&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  },
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;&amp;#34;startTime&amp;#34;&lt;/span&gt;: &lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;2025-10-29T00:30:01Z&amp;#34;&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;&amp;#34;uptimeSeconds&amp;#34;&lt;/span&gt;: &lt;span style=&#34;color:#666&#34;&gt;856&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;&amp;#34;goVersion&amp;#34;&lt;/span&gt;: &lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;go1.23.2&amp;#34;&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;&amp;#34;binaryVersion&amp;#34;&lt;/span&gt;: &lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;1.35.0&amp;#34;&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;&amp;#34;emulationVersion&amp;#34;&lt;/span&gt;: &lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;1.35&amp;#34;&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;&amp;#34;paths&amp;#34;&lt;/span&gt;: [
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;/healthz&amp;#34;&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;/livez&amp;#34;&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;/metrics&amp;#34;&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;/readyz&amp;#34;&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;/statusz&amp;#34;&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;/version&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  ]
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Similarly, &lt;code&gt;/flagz&lt;/code&gt; supports structured responses with the header:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Accept: application/json;v=v1alpha1;g=config.k8s.io;as=Flagz
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Example response:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-json&#34; data-lang=&#34;json&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;{
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;&amp;#34;kind&amp;#34;&lt;/span&gt;: &lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;Flagz&amp;#34;&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;&amp;#34;apiVersion&amp;#34;&lt;/span&gt;: &lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;config.k8s.io/v1alpha1&amp;#34;&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;&amp;#34;metadata&amp;#34;&lt;/span&gt;: {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;&amp;#34;name&amp;#34;&lt;/span&gt;: &lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;kube-apiserver&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  },
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;&amp;#34;flags&amp;#34;&lt;/span&gt;: {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;&amp;#34;advertise-address&amp;#34;&lt;/span&gt;: &lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;192.168.8.4&amp;#34;&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;&amp;#34;allow-privileged&amp;#34;&lt;/span&gt;: &lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;true&amp;#34;&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;&amp;#34;authorization-mode&amp;#34;&lt;/span&gt;: &lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;[Node,RBAC]&amp;#34;&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;&amp;#34;enable-priority-and-fairness&amp;#34;&lt;/span&gt;: &lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;true&amp;#34;&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;&amp;#34;profiling&amp;#34;&lt;/span&gt;: &lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;true&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  }
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id=&#34;why-structured-responses-matter&#34;&gt;Why structured responses matter&lt;/h2&gt;
&lt;p&gt;The addition of structured responses opens up several new possibilities:&lt;/p&gt;
&lt;h3 id=&#34;1-automated-health-checks-and-monitoring&#34;&gt;1. &lt;strong&gt;Automated health checks and monitoring&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Instead of parsing plain text, monitoring tools can now easily extract specific fields. For example, you can programmatically check whether a component is running with an unexpected emulation version or verify that critical flags are set correctly.&lt;/p&gt;
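&lt;p&gt;As a sketch of such a check (reusing the certificate paths from the examples above; the expected version is an assumption for illustration):&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# Alert if the apiserver emulates a version other than the one we expect
expected=&#34;1.35&#34;
actual=$(curl -s \
  --cert /etc/kubernetes/pki/apiserver-kubelet-client.crt \
  --key /etc/kubernetes/pki/apiserver-kubelet-client.key \
  --cacert /etc/kubernetes/pki/ca.crt \
  -H &#34;Accept: application/json;v=v1alpha1;g=config.k8s.io;as=Statusz&#34; \
  https://localhost:6443/statusz | jq -r .emulationVersion)
[ &#34;$actual&#34; = &#34;$expected&#34; ] || echo &#34;unexpected emulation version: $actual&#34;
&lt;/code&gt;&lt;/pre&gt;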
&lt;h3 id=&#34;2-better-debugging-tools&#34;&gt;2. &lt;strong&gt;Better debugging tools&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Developers can build sophisticated debugging tools that compare configurations across multiple components or track configuration drift over time. The structured format makes it trivial to &lt;code&gt;diff&lt;/code&gt; configurations or validate that components are running with expected settings.&lt;/p&gt;
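&lt;p&gt;For example, configuration drift between two API server instances (the hostnames below are placeholders) could be surfaced with a small shell helper:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;fetch_flags() {
  curl -s \
    --cert /etc/kubernetes/pki/apiserver-kubelet-client.crt \
    --key /etc/kubernetes/pki/apiserver-kubelet-client.key \
    --cacert /etc/kubernetes/pki/ca.crt \
    -H &#34;Accept: application/json;v=v1alpha1;g=config.k8s.io;as=Flagz&#34; \
    &#34;https://$1:6443/flagz&#34; | jq -S .flags   # -S sorts keys for a stable diff
}
diff &lt;(fetch_flags apiserver-1.example.com) &lt;(fetch_flags apiserver-2.example.com)
&lt;/code&gt;&lt;/pre&gt;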
&lt;h3 id=&#34;3-api-versioning-and-stability&#34;&gt;3. &lt;strong&gt;API versioning and stability&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;By introducing versioned APIs (starting with &lt;code&gt;v1alpha1&lt;/code&gt;), we provide a clear path to stability. As the feature matures, we&#39;ll introduce &lt;code&gt;v1beta1&lt;/code&gt; and eventually &lt;code&gt;v1&lt;/code&gt;, giving you confidence that your tooling won&#39;t break with future Kubernetes releases.&lt;/p&gt;
&lt;h2 id=&#34;how-to-use-structured-z-pages&#34;&gt;How to use structured z-pages&lt;/h2&gt;
&lt;h3 id=&#34;prerequisites&#34;&gt;Prerequisites&lt;/h3&gt;
&lt;p&gt;Both endpoints require feature gates to be enabled:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;/statusz&lt;/code&gt;: Enable the &lt;code&gt;ComponentStatusz&lt;/code&gt; feature gate&lt;/li&gt;
&lt;li&gt;&lt;code&gt;/flagz&lt;/code&gt;: Enable the &lt;code&gt;ComponentFlagz&lt;/code&gt; feature gate&lt;/li&gt;
&lt;/ul&gt;
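&lt;p&gt;Feature gates are set through each component&#39;s &lt;code&gt;--feature-gates&lt;/code&gt; flag; for example, to enable both on the API server, add the following to its existing flags:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;--feature-gates=ComponentStatusz=true,ComponentFlagz=true
&lt;/code&gt;&lt;/pre&gt;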
&lt;h3 id=&#34;example-getting-structured-responses&#34;&gt;Example: Getting structured responses&lt;/h3&gt;
&lt;p&gt;Here&#39;s an example using &lt;code&gt;curl&lt;/code&gt; to retrieve structured JSON responses from the kube-apiserver:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# Get structured statusz response&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;curl &lt;span style=&#34;color:#b62;font-weight:bold&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b62;font-weight:bold&#34;&gt;&lt;/span&gt;  --cert /etc/kubernetes/pki/apiserver-kubelet-client.crt &lt;span style=&#34;color:#b62;font-weight:bold&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b62;font-weight:bold&#34;&gt;&lt;/span&gt;  --key /etc/kubernetes/pki/apiserver-kubelet-client.key &lt;span style=&#34;color:#b62;font-weight:bold&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b62;font-weight:bold&#34;&gt;&lt;/span&gt;  --cacert /etc/kubernetes/pki/ca.crt &lt;span style=&#34;color:#b62;font-weight:bold&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b62;font-weight:bold&#34;&gt;&lt;/span&gt;  -H &lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;Accept: application/json;v=v1alpha1;g=config.k8s.io;as=Statusz&amp;#34;&lt;/span&gt; &lt;span style=&#34;color:#b62;font-weight:bold&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b62;font-weight:bold&#34;&gt;&lt;/span&gt;  https://localhost:6443/statusz | jq .
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# Get structured flagz response&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;curl &lt;span style=&#34;color:#b62;font-weight:bold&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b62;font-weight:bold&#34;&gt;&lt;/span&gt;  --cert /etc/kubernetes/pki/apiserver-kubelet-client.crt &lt;span style=&#34;color:#b62;font-weight:bold&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b62;font-weight:bold&#34;&gt;&lt;/span&gt;  --key /etc/kubernetes/pki/apiserver-kubelet-client.key &lt;span style=&#34;color:#b62;font-weight:bold&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b62;font-weight:bold&#34;&gt;&lt;/span&gt;  --cacert /etc/kubernetes/pki/ca.crt &lt;span style=&#34;color:#b62;font-weight:bold&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b62;font-weight:bold&#34;&gt;&lt;/span&gt;  -H &lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;Accept: application/json;v=v1alpha1;g=config.k8s.io;as=Flagz&amp;#34;&lt;/span&gt; &lt;span style=&#34;color:#b62;font-weight:bold&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b62;font-weight:bold&#34;&gt;&lt;/span&gt;  https://localhost:6443/flagz | jq .
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;div class=&#34;alert alert-info&#34; role=&#34;alert&#34;&gt;&lt;h4 class=&#34;alert-heading&#34;&gt;Note:&lt;/h4&gt;The examples above use client certificate authentication and verify the server&#39;s certificate using &lt;code&gt;--cacert&lt;/code&gt;.
If you need to bypass certificate verification in a test environment, you can use &lt;code&gt;--insecure&lt;/code&gt; (or &lt;code&gt;-k&lt;/code&gt;),
but this should never be done in production as it makes you vulnerable to man-in-the-middle attacks.&lt;/div&gt;

&lt;h2 id=&#34;important-considerations&#34;&gt;Important considerations&lt;/h2&gt;
&lt;h3 id=&#34;alpha-feature-status&#34;&gt;Alpha feature status&lt;/h3&gt;
&lt;p&gt;The structured z-page responses are an &lt;strong&gt;alpha&lt;/strong&gt; feature in Kubernetes 1.35. This means:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The API format may change in future releases&lt;/li&gt;
&lt;li&gt;These endpoints are intended for debugging, not production automation&lt;/li&gt;
&lt;li&gt;You should avoid relying on them for critical monitoring workflows until they reach beta or stable status&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;security-and-access-control&#34;&gt;Security and access control&lt;/h3&gt;
&lt;p&gt;z-pages expose internal component information and require proper access controls. Here are the key security considerations:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Authorization&lt;/strong&gt;: Access to z-page endpoints is restricted to members of the &lt;code&gt;system:monitoring&lt;/code&gt; group, which follows the same authorization model as other debugging endpoints like &lt;code&gt;/healthz&lt;/code&gt;, &lt;code&gt;/livez&lt;/code&gt;, and &lt;code&gt;/readyz&lt;/code&gt;. This ensures that only authorized users and service accounts can access debugging information. If your cluster uses RBAC, you can manage access by granting appropriate permissions to this group.&lt;/p&gt;
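&lt;p&gt;For example, one way to grant a debugging user access is a &lt;code&gt;ClusterRoleBinding&lt;/code&gt; to the built-in &lt;code&gt;system:monitoring&lt;/code&gt; role (the binding and user names below are illustrative; verify that the role in your cluster covers the z-page paths):&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: zpages-debug-access
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:monitoring
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: User
  name: ops-debugger
&lt;/code&gt;&lt;/pre&gt;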
&lt;p&gt;&lt;strong&gt;Authentication&lt;/strong&gt;: The authentication requirements for these endpoints depend on your cluster&#39;s configuration. Unless anonymous authentication is enabled for your cluster, you typically need to use authentication mechanisms (such as client certificates) to access these endpoints.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Information disclosure&lt;/strong&gt;: These endpoints reveal configuration details about your cluster components, including:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Component versions and build information&lt;/li&gt;
&lt;li&gt;All command-line arguments and their values (with confidential values redacted)&lt;/li&gt;
&lt;li&gt;Available debug endpoints&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Only grant access to trusted operators and debugging tools. Avoid exposing these endpoints to unauthorized users or automated systems that don&#39;t require this level of access.&lt;/p&gt;
&lt;h3 id=&#34;future-evolution&#34;&gt;Future evolution&lt;/h3&gt;
&lt;p&gt;As the feature matures, we (Kubernetes SIG Instrumentation) expect to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Introduce &lt;code&gt;v1beta1&lt;/code&gt; and eventually &lt;code&gt;v1&lt;/code&gt; versions of the API&lt;/li&gt;
&lt;li&gt;Gather community feedback on the response schema&lt;/li&gt;
&lt;li&gt;Potentially add additional z-page endpoints based on user needs&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;try-it-out&#34;&gt;Try it out&lt;/h2&gt;
&lt;p&gt;We encourage you to experiment with structured z-pages in a test environment:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Enable the &lt;code&gt;ComponentStatusz&lt;/code&gt; and &lt;code&gt;ComponentFlagz&lt;/code&gt; feature gates on your control plane components&lt;/li&gt;
&lt;li&gt;Try querying the endpoints with both plain text and structured formats&lt;/li&gt;
&lt;li&gt;Build a simple tool or script that uses the structured data&lt;/li&gt;
&lt;li&gt;Share your feedback with the community&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id=&#34;learn-more&#34;&gt;Learn more&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/reference/instrumentation/zpages/&#34;&gt;z-pages documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/kubernetes/enhancements/blob/master/keps/sig-instrumentation/4827-component-statusz/README.md&#34;&gt;KEP-4827: Component Statusz&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/kubernetes/enhancements/blob/master/keps/sig-instrumentation/4828-component-flagz/README.md&#34;&gt;KEP-4828: Component Flagz&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Join the discussion in the &lt;a href=&#34;https://kubernetes.slack.com/archives/C20HH14P7&#34;&gt;#sig-instrumentation&lt;/a&gt; channel on Kubernetes Slack&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;get-involved&#34;&gt;Get involved&lt;/h2&gt;
&lt;p&gt;We&#39;d love to hear your feedback! The structured z-pages feature is designed to make Kubernetes easier to debug and monitor. Whether you&#39;re building internal tooling, contributing to open source projects, or just exploring the feature, your input helps shape the future of Kubernetes observability.&lt;/p&gt;
&lt;p&gt;If you have questions, suggestions, or run into issues, please reach out to SIG Instrumentation. You can find us on Slack or at our regular &lt;a href=&#34;https://github.com/kubernetes/community/tree/master/sig-instrumentation&#34;&gt;community meetings&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Happy debugging!&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>Kubernetes v1.35: Watch Based Route Reconciliation in the Cloud Controller Manager</title>
      <link>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/12/30/kubernetes-v1-35-watch-based-route-reconciliation-in-ccm/</link>
      <pubDate>Tue, 30 Dec 2025 10:30:00 -0800</pubDate>
      
      <guid>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/12/30/kubernetes-v1-35-watch-based-route-reconciliation-in-ccm/</guid>
      <description>
        
        
        &lt;p&gt;Up to and including Kubernetes v1.34, the route controller in Cloud Controller Manager (CCM)
implementations built using the &lt;a href=&#34;https://github.com/kubernetes/cloud-provider&#34;&gt;k8s.io/cloud-provider&lt;/a&gt; library reconciles
routes at a fixed interval. This causes unnecessary API requests to the cloud provider when
there are no changes to routes. Other controllers implemented through the same library already
use watch-based mechanisms, leveraging informers to avoid unnecessary API calls. A new feature gate
is being introduced in v1.35 to allow changing the behavior of the route controller to use watch-based informers.&lt;/p&gt;
&lt;h2 id=&#34;what-s-new&#34;&gt;What&#39;s new?&lt;/h2&gt;
&lt;p&gt;The feature gate &lt;code&gt;CloudControllerManagerWatchBasedRoutesReconciliation&lt;/code&gt; has been
introduced to &lt;a href=&#34;https://github.com/kubernetes/cloud-provider&#34;&gt;k8s.io/cloud-provider&lt;/a&gt; in alpha stage by &lt;a href=&#34;https://github.com/kubernetes/community/blob/master/sig-cloud-provider/README.md&#34;&gt;SIG Cloud Provider&lt;/a&gt;.
To enable this feature, pass &lt;code&gt;--feature-gates=CloudControllerManagerWatchBasedRoutesReconciliation=true&lt;/code&gt;
to the CCM implementation you are using.&lt;/p&gt;
&lt;h2 id=&#34;about-the-feature-gate&#34;&gt;About the feature gate&lt;/h2&gt;
&lt;p&gt;This feature gate will trigger the route reconciliation loop whenever a node is
added or deleted, or whenever its &lt;code&gt;.spec.podCIDRs&lt;/code&gt; or &lt;code&gt;.status.addresses&lt;/code&gt; fields are updated.&lt;/p&gt;
&lt;p&gt;An additional reconcile is performed at a random interval between 12 and 24 hours,
chosen at the controller&#39;s start time.&lt;/p&gt;
&lt;p&gt;This feature gate does not modify the logic within the reconciliation loop.
Therefore, users of a CCM implementation should not experience significant
changes to their existing route configurations.&lt;/p&gt;
&lt;h2 id=&#34;how-can-i-learn-more&#34;&gt;How can I learn more?&lt;/h2&gt;
&lt;p&gt;For more details, refer to the &lt;a href=&#34;https://kep.k8s.io/5237&#34;&gt;KEP-5237&lt;/a&gt;.&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>Kubernetes v1.35: Introducing Workload Aware Scheduling</title>
      <link>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/12/29/kubernetes-v1-35-introducing-workload-aware-scheduling/</link>
      <pubDate>Mon, 29 Dec 2025 10:30:00 -0800</pubDate>
      
      <guid>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/12/29/kubernetes-v1-35-introducing-workload-aware-scheduling/</guid>
      <description>
        
        
        &lt;p&gt;Scheduling large workloads is a much more complex and fragile operation than scheduling a single Pod,
as it often requires considering all Pods together instead of scheduling each one independently.
For example, when scheduling a machine learning batch job, you often need to place each worker strategically,
such as on the same rack, to make the entire process as efficient as possible.
At the same time, the Pods that are part of such a workload are very often identical
from the scheduling perspective, which fundamentally changes how this process should look.&lt;/p&gt;
&lt;p&gt;There are many custom schedulers adapted to perform workload scheduling efficiently,
but considering how common and important workload scheduling is to Kubernetes users,
especially in the AI era with the growing number of use cases,
it is high time to make workloads a first-class citizen for &lt;code&gt;kube-scheduler&lt;/code&gt; and support them natively.&lt;/p&gt;
&lt;h2 id=&#34;workload-aware-scheduling&#34;&gt;Workload aware scheduling&lt;/h2&gt;
&lt;p&gt;The recent 1.35 release of Kubernetes delivered the first tranche of &lt;em&gt;workload aware scheduling&lt;/em&gt; improvements.
These are part of a wider effort to improve the scheduling and management of workloads.
The effort will span many SIGs and releases, gradually expanding the capabilities of the system
toward the north star goal: seamless workload scheduling and management in Kubernetes,
including, but not limited to, preemption and autoscaling.&lt;/p&gt;
&lt;p&gt;Kubernetes v1.35 introduces the Workload API, which you can use to describe the desired shape
as well as the scheduling-oriented requirements of a workload. It comes with an initial implementation
of &lt;em&gt;gang scheduling&lt;/em&gt; that instructs the &lt;code&gt;kube-scheduler&lt;/code&gt; to schedule gang Pods in an &lt;em&gt;all-or-nothing&lt;/em&gt; fashion.
Finally, we improved the scheduling of identical Pods (which typically make up a gang) to speed up the process,
thanks to the &lt;em&gt;opportunistic batching&lt;/em&gt; feature.&lt;/p&gt;
&lt;h2 id=&#34;workload-api&#34;&gt;Workload API&lt;/h2&gt;
&lt;p&gt;The new Workload API resource is part of the &lt;code&gt;scheduling.k8s.io/v1alpha1&lt;/code&gt;
&lt;a class=&#39;glossary-tooltip&#39; title=&#39;A set of related paths in the Kubernetes API.&#39; data-toggle=&#39;tooltip&#39; data-placement=&#39;top&#39; href=&#39;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/concepts/overview/kubernetes-api/#api-groups-and-versioning&#39; target=&#39;_blank&#39; aria-label=&#39;API group&#39;&gt;API group&lt;/a&gt;.
This resource acts as a structured, machine-readable definition of the scheduling requirements
of a multi-Pod application. While user-facing workloads like Jobs define what to run, the Workload resource
determines how a group of Pods should be scheduled and how its placement should be managed
throughout its lifecycle.&lt;/p&gt;
&lt;p&gt;A Workload allows you to define a group of Pods and apply a scheduling policy to them.
Here is what a gang scheduling configuration looks like. You can define a &lt;code&gt;podGroup&lt;/code&gt; named &lt;code&gt;workers&lt;/code&gt;
and apply the &lt;code&gt;gang&lt;/code&gt; policy with a &lt;code&gt;minCount&lt;/code&gt; of 4.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;scheduling.k8s.io/v1alpha1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Workload&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;metadata&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;training-job-workload&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;namespace&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;some-ns&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;spec&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;podGroups&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;workers&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;policy&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;gang&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# The gang is schedulable only if 4 pods can run at once&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;minCount&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;4&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;When you create your Pods, you link them to this Workload using the new &lt;code&gt;workloadRef&lt;/code&gt; field:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;v1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Pod&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;metadata&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;worker-0&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;namespace&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;some-ns&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;spec&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;workloadRef&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;training-job-workload&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;podGroup&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;workers&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;...&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id=&#34;how-gang-scheduling-works&#34;&gt;How gang scheduling works&lt;/h2&gt;
&lt;p&gt;The &lt;code&gt;gang&lt;/code&gt; policy enforces &lt;em&gt;all-or-nothing&lt;/em&gt; placement. Without gang scheduling,
a Job might be partially scheduled, consuming resources without being able to run,
which wastes resources and can lead to deadlocks.&lt;/p&gt;
&lt;p&gt;When you create Pods that are part of a gang-scheduled pod group, the scheduler&#39;s &lt;code&gt;GangScheduling&lt;/code&gt;
plugin manages the lifecycle independently for each pod group (or replica key):&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;When you create your Pods (or a controller creates them for you),
the scheduler blocks them from scheduling until:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The referenced Workload object is created.&lt;/li&gt;
&lt;li&gt;The referenced pod group exists in a Workload.&lt;/li&gt;
&lt;li&gt;The number of pending Pods in that group meets your &lt;code&gt;minCount&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Once enough Pods arrive, the scheduler tries to place them. However,
instead of binding them to nodes immediately, the Pods wait at a &lt;code&gt;Permit&lt;/code&gt; gate.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The scheduler checks if it has found valid assignments for the entire group (at least the &lt;code&gt;minCount&lt;/code&gt;).&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;If there is room for the group, the gate opens, and all Pods are bound to nodes.&lt;/li&gt;
&lt;li&gt;If only a subset of the group&#39;s Pods is successfully scheduled within the timeout (set to 5 minutes),
the scheduler rejects &lt;strong&gt;all&lt;/strong&gt; of the Pods in the group.
They go back to the queue, freeing up the reserved resources for other workloads.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;We&#39;d like to point out that while this is a first implementation, the Kubernetes project firmly
intends to improve and expand the gang scheduling algorithm in future releases.
Benefits we hope to deliver include a single-cycle scheduling phase for a whole gang,
workload-level preemption, and more, moving towards the north star goal.&lt;/p&gt;
&lt;h2 id=&#34;opportunistic-batching&#34;&gt;Opportunistic batching&lt;/h2&gt;
&lt;p&gt;In addition to explicit gang scheduling, v1.35 introduces &lt;em&gt;opportunistic batching&lt;/em&gt;.
This is a Beta feature that improves scheduling latency for identical Pods.&lt;/p&gt;
&lt;p&gt;Unlike gang scheduling, this feature does not require the Workload API
or any explicit opt-in on the user&#39;s part. It works opportunistically within the scheduler
by identifying Pods that have identical scheduling requirements (container images, resource requests,
affinities, etc.). When the scheduler processes a Pod, it can reuse the feasibility calculations
for subsequent identical Pods in the queue, significantly speeding up the process.&lt;/p&gt;
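&lt;p&gt;For example, a plain (non-indexed) Job stamps all of its Pods out of a single template, so they typically share identical scheduling requirements and can be batched automatically. A minimal sketch; the name, counts, and resource requests are illustrative:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;apiVersion: batch/v1
kind: Job
metadata:
  name: batch-example # illustrative name
spec:
  completions: 10
  parallelism: 10
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: registry.k8s.io/e2e-test-images/agnhost:2.45
        command:
        - sh
        - -c
        - sleep 10
        # Identical requests across all Pods let the scheduler
        # reuse its feasibility calculations
        resources:
          requests:
            cpu: 500m
            memory: 512Mi
&lt;/code&gt;&lt;/pre&gt;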
&lt;p&gt;Most users will benefit from this optimization automatically, without taking any special steps,
provided their Pods meet the following criteria.&lt;/p&gt;
&lt;h3 id=&#34;restrictions&#34;&gt;Restrictions&lt;/h3&gt;
&lt;p&gt;Opportunistic batching works under specific conditions. All fields used by the &lt;code&gt;kube-scheduler&lt;/code&gt;
to find a placement must be identical between Pods. Additionally, using some features
disables the batching mechanism for those Pods to ensure correctness.&lt;/p&gt;
&lt;p&gt;Note that you may need to review your &lt;code&gt;kube-scheduler&lt;/code&gt; configuration
to ensure it is not implicitly disabling batching for your workloads.&lt;/p&gt;
&lt;p&gt;See the &lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/concepts/scheduling-eviction/scheduler-perf-tuning/#enabling-opportunistic-batching&#34;&gt;docs&lt;/a&gt; for more details about restrictions.&lt;/p&gt;
&lt;h2 id=&#34;the-north-star-vision&#34;&gt;The north star vision&lt;/h2&gt;
&lt;p&gt;The project has a broad ambition to deliver workload aware scheduling.
These new APIs and scheduling enhancements are just the first steps.
In the near future, the effort aims to tackle:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Introducing a workload scheduling phase&lt;/li&gt;
&lt;li&gt;Improved support for multi-node &lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/concepts/scheduling-eviction/dynamic-resource-allocation/&#34;&gt;DRA&lt;/a&gt;
and topology aware scheduling&lt;/li&gt;
&lt;li&gt;Workload-level preemption&lt;/li&gt;
&lt;li&gt;Improved integration between scheduling and autoscaling&lt;/li&gt;
&lt;li&gt;Improved interaction with external workload schedulers&lt;/li&gt;
&lt;li&gt;Managing placement of workloads throughout their entire lifecycle&lt;/li&gt;
&lt;li&gt;Multi-workload scheduling simulations&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And more. The priority and implementation order of these focus areas
are subject to change. Stay tuned for further updates.&lt;/p&gt;
&lt;h2 id=&#34;getting-started&#34;&gt;Getting started&lt;/h2&gt;
&lt;p&gt;To try the workload aware scheduling improvements:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Workload API: Enable the
&lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/reference/command-line-tools-reference/feature-gates/#GenericWorkload&#34;&gt;&lt;code&gt;GenericWorkload&lt;/code&gt;&lt;/a&gt;
feature gate on both &lt;code&gt;kube-apiserver&lt;/code&gt; and &lt;code&gt;kube-scheduler&lt;/code&gt;, and ensure the &lt;code&gt;scheduling.k8s.io/v1alpha1&lt;/code&gt;
&lt;a class=&#39;glossary-tooltip&#39; title=&#39;A set of related paths in the Kubernetes API.&#39; data-toggle=&#39;tooltip&#39; data-placement=&#39;top&#39; href=&#39;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/concepts/overview/kubernetes-api/#api-groups-and-versioning&#39; target=&#39;_blank&#39; aria-label=&#39;API group&#39;&gt;API group&lt;/a&gt; is enabled.&lt;/li&gt;
&lt;li&gt;Gang scheduling: Enable the
&lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/reference/command-line-tools-reference/feature-gates/#GangScheduling&#34;&gt;&lt;code&gt;GangScheduling&lt;/code&gt;&lt;/a&gt;
feature gate on &lt;code&gt;kube-scheduler&lt;/code&gt; (requires the Workload API to be enabled).&lt;/li&gt;
&lt;li&gt;Opportunistic batching: As a Beta feature, it is enabled by default in v1.35.
You can disable it using the
&lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/reference/command-line-tools-reference/feature-gates/#OpportunisticBatching&#34;&gt;&lt;code&gt;OpportunisticBatching&lt;/code&gt;&lt;/a&gt;
feature gate on &lt;code&gt;kube-scheduler&lt;/code&gt; if needed.&lt;/li&gt;
&lt;/ul&gt;
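&lt;p&gt;If you run the control plane components directly, the corresponding flags could look like this (a sketch; the exact invocation depends on how your cluster is deployed):&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code class=&#34;language-shell&#34; data-lang=&#34;shell&#34;&gt;# Enable the Workload API and its API group on the API server
kube-apiserver --feature-gates=GenericWorkload=true \
  --runtime-config=scheduling.k8s.io/v1alpha1=true

# Enable the Workload API and gang scheduling on the scheduler
kube-scheduler --feature-gates=GenericWorkload=true,GangScheduling=true
&lt;/code&gt;&lt;/pre&gt;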
&lt;p&gt;We encourage you to try out workload aware scheduling in your test clusters
and share your experiences to help shape the future of Kubernetes scheduling.
You can send your feedback by:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Reaching out via &lt;a href=&#34;https://kubernetes.slack.com/archives/C09TP78DV&#34;&gt;Slack (#sig-scheduling)&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Commenting on the &lt;a href=&#34;https://github.com/kubernetes/kubernetes/issues/132192&#34;&gt;workload aware scheduling tracking issue&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Filing a new &lt;a href=&#34;https://github.com/kubernetes/enhancements/issues&#34;&gt;issue&lt;/a&gt; in the Kubernetes repository.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;learn-more&#34;&gt;Learn more&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Read the KEPs for
&lt;a href=&#34;https://github.com/kubernetes/enhancements/tree/master/keps/sig-scheduling/4671-gang-scheduling&#34;&gt;Workload API and gang scheduling&lt;/a&gt; and
&lt;a href=&#34;https://github.com/kubernetes/enhancements/tree/master/keps/sig-scheduling/5598-opportunistic-batching&#34;&gt;Opportunistic batching&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Track the &lt;a href=&#34;https://github.com/kubernetes/kubernetes/issues/132192&#34;&gt;Workload aware scheduling issue&lt;/a&gt;
for recent updates.&lt;/li&gt;
&lt;/ul&gt;

      </description>
    </item>
    
    <item>
      <title>Kubernetes v1.35: Fine-grained Supplemental Groups Control Graduates to GA</title>
      <link>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/12/23/kubernetes-v1-35-fine-grained-supplementalgroups-control-ga/</link>
      <pubDate>Tue, 23 Dec 2025 10:30:00 -0800</pubDate>
      
      <guid>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/12/23/kubernetes-v1-35-fine-grained-supplementalgroups-control-ga/</guid>
      <description>
        
        
        &lt;p&gt;On behalf of Kubernetes SIG Node, we are pleased to announce the graduation of &lt;em&gt;fine-grained supplemental groups control&lt;/em&gt; to General Availability (GA) in Kubernetes v1.35!&lt;/p&gt;
&lt;p&gt;The new Pod field, &lt;code&gt;supplementalGroupsPolicy&lt;/code&gt;, was introduced as an opt-in alpha feature in Kubernetes v1.31 and graduated to beta in v1.33.
Now, the feature is generally available.
This feature gives you more precise control over supplemental groups in Linux containers, which can strengthen your security posture, particularly when accessing volumes.
It also enhances the transparency of UID/GID details in containers, offering improved security oversight.&lt;/p&gt;
&lt;p&gt;If you are planning to upgrade your cluster from v1.32 or an earlier version, please be aware that some behavioral breaking changes were introduced in beta (v1.33).
For more details, see the &lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/05/06/kubernetes-v1-33-fine-grained-supplementalgroups-control-beta/#the-behavioral-changes-introduced-in-beta&#34;&gt;behavioral changes introduced in beta&lt;/a&gt; and
the &lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/05/06/kubernetes-v1-33-fine-grained-supplementalgroups-control-beta/#upgrade-consideration&#34;&gt;upgrade considerations&lt;/a&gt; sections of the previous blog for graduation to beta.&lt;/p&gt;
&lt;h2 id=&#34;motivation-implicit-group-memberships-defined-in-etc-group-in-the-container-image&#34;&gt;Motivation: Implicit group memberships defined in &lt;code&gt;/etc/group&lt;/code&gt; in the container image&lt;/h2&gt;
&lt;p&gt;Although the majority of Kubernetes cluster admins and users may not be aware of this,
Kubernetes by default &lt;em&gt;merges&lt;/em&gt; group information from the Pod with information defined in &lt;code&gt;/etc/group&lt;/code&gt; in the container image.&lt;/p&gt;
&lt;p&gt;Here&#39;s an example: a Pod manifest that specifies &lt;code&gt;spec.securityContext.runAsUser: 1000&lt;/code&gt;, &lt;code&gt;spec.securityContext.runAsGroup: 3000&lt;/code&gt; and &lt;code&gt;spec.securityContext.supplementalGroups: [4000]&lt;/code&gt; as part of the Pod&#39;s security context.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;v1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Pod&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;metadata&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;implicit-groups-example&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;spec&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;securityContext&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;runAsUser&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;1000&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;runAsGroup&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;3000&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;supplementalGroups&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;[&lt;span style=&#34;color:#666&#34;&gt;4000&lt;/span&gt;]&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;containers&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;example-container&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;image&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;registry.k8s.io/e2e-test-images/agnhost:2.45&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;command&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;[&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;sh&amp;#34;&lt;/span&gt;,&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;-c&amp;#34;&lt;/span&gt;,&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;sleep 1h&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;]&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;securityContext&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;allowPrivilegeEscalation&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;false&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;What is the result of the &lt;code&gt;id&lt;/code&gt; command in the &lt;code&gt;example-container&lt;/code&gt; container? The output should be similar to this:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code class=&#34;language-none&#34; data-lang=&#34;none&#34;&gt;uid=1000 gid=3000 groups=3000,4000,50000
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Where does group ID &lt;code&gt;50000&lt;/code&gt; in the supplementary groups (the &lt;code&gt;groups&lt;/code&gt; field) come from, even though &lt;code&gt;50000&lt;/code&gt; is not defined in the Pod&#39;s manifest at all? The answer is the &lt;code&gt;/etc/group&lt;/code&gt; file in the container image.&lt;/p&gt;
&lt;p&gt;Checking the contents of &lt;code&gt;/etc/group&lt;/code&gt; in the container image reveals something like the following:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code class=&#34;language-none&#34; data-lang=&#34;none&#34;&gt;user-defined-in-image:x:1000:
group-defined-in-image:x:50000:user-defined-in-image
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;The last entry shows that the container&#39;s primary user &lt;code&gt;1000&lt;/code&gt; belongs to the group &lt;code&gt;50000&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Thus, the group membership defined in &lt;code&gt;/etc/group&lt;/code&gt; in the container image for the container&#39;s primary user is &lt;em&gt;implicitly&lt;/em&gt; merged into the information from the Pod. Please note that this was a design decision the current CRI implementations inherited from Docker, and the community never really reconsidered it until now.&lt;/p&gt;
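&lt;p&gt;Such an entry usually originates from the image build. A hypothetical Dockerfile fragment that could produce the &lt;code&gt;/etc/group&lt;/code&gt; contents above (the group and user names and IDs mirror the example; the exact commands depend on the base image):&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code class=&#34;language-dockerfile&#34; data-lang=&#34;dockerfile&#34;&gt;# Create a group with GID 50000 and a user with UID 1000 who belongs to it
RUN groupadd -g 50000 group-defined-in-image &amp;&amp; \
    useradd -u 1000 -G group-defined-in-image user-defined-in-image
&lt;/code&gt;&lt;/pre&gt;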
&lt;h3 id=&#34;what-s-wrong-with-it&#34;&gt;What&#39;s wrong with it?&lt;/h3&gt;
&lt;p&gt;The &lt;em&gt;implicitly&lt;/em&gt; merged group information from &lt;code&gt;/etc/group&lt;/code&gt; in the container image poses a security risk. These implicit GIDs can&#39;t be detected or validated by policy engines because there&#39;s no record of them in the Pod manifest. This can lead to unexpected access control issues, particularly when accessing volumes (see &lt;a href=&#34;https://issue.k8s.io/112879&#34;&gt;kubernetes/kubernetes#112879&lt;/a&gt; for details) because file permission is controlled by UID/GIDs in Linux.&lt;/p&gt;
&lt;h2 id=&#34;fine-grained-supplemental-groups-control-in-a-pod-supplementarygroupspolicy&#34;&gt;Fine-grained supplemental groups control in a Pod: &lt;code&gt;supplementalGroupsPolicy&lt;/code&gt;&lt;/h2&gt;
&lt;p&gt;To tackle this problem, a Pod&#39;s &lt;code&gt;.spec.securityContext&lt;/code&gt; now includes a &lt;code&gt;supplementalGroupsPolicy&lt;/code&gt; field.&lt;/p&gt;
&lt;p&gt;This field lets you control how Kubernetes calculates the supplementary groups for container processes within a Pod. The available policies are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Merge&lt;/em&gt;: The group membership defined in &lt;code&gt;/etc/group&lt;/code&gt; for the container&#39;s primary user will be merged. This is the default policy, preserving the previous behavior for backward compatibility.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Strict&lt;/em&gt;: Only the group IDs specified in &lt;code&gt;fsGroup&lt;/code&gt;, &lt;code&gt;supplementalGroups&lt;/code&gt;, or &lt;code&gt;runAsGroup&lt;/code&gt; are attached as supplementary groups to the container processes. Group memberships defined in &lt;code&gt;/etc/group&lt;/code&gt; for the container&#39;s primary user are ignored.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I&#39;ll explain how the &lt;code&gt;Strict&lt;/code&gt; policy works. The following Pod manifest specifies &lt;code&gt;supplementalGroupsPolicy: Strict&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;v1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Pod&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;metadata&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;strict-supplementalgroups-policy-example&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;spec&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;securityContext&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;runAsUser&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;1000&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;runAsGroup&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;3000&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;supplementalGroups&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;[&lt;span style=&#34;color:#666&#34;&gt;4000&lt;/span&gt;]&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;supplementalGroupsPolicy&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Strict&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;containers&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;example-container&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;image&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;registry.k8s.io/e2e-test-images/agnhost:2.45&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;command&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;[&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;sh&amp;#34;&lt;/span&gt;,&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;-c&amp;#34;&lt;/span&gt;,&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;sleep 1h&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;]&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;securityContext&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;allowPrivilegeEscalation&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;false&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The output of the &lt;code&gt;id&lt;/code&gt; command in the &lt;code&gt;example-container&lt;/code&gt; container should be similar to this:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code class=&#34;language-none&#34; data-lang=&#34;none&#34;&gt;uid=1000 gid=3000 groups=3000,4000
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;You can see that the &lt;code&gt;Strict&lt;/code&gt; policy excludes group &lt;code&gt;50000&lt;/code&gt; from &lt;code&gt;groups&lt;/code&gt;!&lt;/p&gt;
&lt;p&gt;Thus, ensuring &lt;code&gt;supplementalGroupsPolicy: Strict&lt;/code&gt; (enforced by some policy mechanism) helps prevent implicit supplementary groups from being attached to a Pod.&lt;/p&gt;
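&lt;p&gt;To make the difference between the two policies concrete, here is a small, self-contained shell sketch of how the attached groups are computed under each policy. This is illustrative only (it is not kubelet code); the group IDs mirror the example Pod above, and group &lt;code&gt;50000&lt;/code&gt; stands in for a group implicitly defined inside the image:&lt;/p&gt;

```shell
# Illustrative sketch only -- not kubelet code. Group IDs mirror the
# example Pod; 50000 is a hypothetical group defined inside the image.
image_groups="50000"       # implicitly defined in the image's /etc/group
spec_groups="3000 4000"    # runAsGroup + supplementalGroups in the Pod spec
# Merge policy: the union of both sources is attached to the process
merge=$(printf '%s\n' $spec_groups $image_groups | sort -n | xargs)
# Strict policy: only the groups declared in the Pod spec are attached
strict=$(printf '%s\n' $spec_groups | sort -n | xargs)
echo "Merge:  $merge"
echo "Strict: $strict"
```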

&lt;div class=&#34;alert alert-info&#34; role=&#34;alert&#34;&gt;&lt;h4 class=&#34;alert-heading&#34;&gt;Note:&lt;/h4&gt;&lt;p&gt;A container with sufficient privileges can change its process identity.
The &lt;code&gt;supplementalGroupsPolicy&lt;/code&gt; field only affects the initial process identity.&lt;/p&gt;
&lt;p&gt;Read on for more details.&lt;/p&gt;
&lt;/div&gt;

&lt;h2 id=&#34;attached-process-identity-in-pod-status&#34;&gt;Attached process identity in Pod status&lt;/h2&gt;
&lt;p&gt;This feature also exposes the process identity attached to the first process in each container
via the &lt;code&gt;.status.containerStatuses[].user.linux&lt;/code&gt; field, which is helpful for checking whether implicit group IDs are attached.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#00f;font-weight:bold&#34;&gt;...&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;status&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;containerStatuses&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;ctr&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;user&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;linux&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;gid&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;3000&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;supplementalGroups&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;- &lt;span style=&#34;color:#666&#34;&gt;3000&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;- &lt;span style=&#34;color:#666&#34;&gt;4000&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;uid&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;1000&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#00f;font-weight:bold&#34;&gt;...&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
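&lt;p&gt;With this status field, spotting implicit groups amounts to a set difference between what the Pod spec declares and what the status reports. Here is an illustrative shell sketch (the group IDs are made up for the example):&lt;/p&gt;

```shell
# Illustrative sketch only: compare groups declared in the Pod spec with
# those reported in .status.containerStatuses[].user.linux.supplementalGroups
declared="3000 4000"        # runAsGroup + supplementalGroups in the spec
attached="3000 4000 50000"  # as reported in the Pod status
implicit=""
for g in $attached; do
  case " $declared " in
    *" $g "*) ;;                       # declared in the spec: fine
    *) implicit="$implicit $g" ;;      # not declared: came from the image
  esac
done
implicit="${implicit# }"
echo "implicit groups: $implicit"
```

&lt;p&gt;In a real cluster you would read the attached groups from the Pod status (for example with &lt;code&gt;kubectl get pod -o yaml&lt;/code&gt;) instead of hard-coding them.&lt;/p&gt;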
&lt;div class=&#34;alert alert-info&#34; role=&#34;alert&#34;&gt;&lt;h4 class=&#34;alert-heading&#34;&gt;Note:&lt;/h4&gt;&lt;p&gt;Please note that the value of the &lt;code&gt;status.containerStatuses[].user.linux&lt;/code&gt; field is the &lt;em&gt;initially attached&lt;/em&gt;
process identity of the first process in the container. If the container has sufficient privilege
to call system calls related to process identity (e.g. &lt;a href=&#34;https://man7.org/linux/man-pages/man2/setuid.2.html&#34;&gt;&lt;code&gt;setuid(2)&lt;/code&gt;&lt;/a&gt;, &lt;a href=&#34;https://man7.org/linux/man-pages/man2/setgid.2.html&#34;&gt;&lt;code&gt;setgid(2)&lt;/code&gt;&lt;/a&gt; or &lt;a href=&#34;https://man7.org/linux/man-pages/man2/setgroups.2.html&#34;&gt;&lt;code&gt;setgroups(2)&lt;/code&gt;&lt;/a&gt;, etc.), the container process can change its identity. Thus, the &lt;em&gt;actual&lt;/em&gt; process identity may change at runtime.&lt;/p&gt;
&lt;p&gt;There are several ways to restrict these permissions in containers. We suggest the following simple measures:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;setting &lt;code&gt;privileged: false&lt;/code&gt; and &lt;code&gt;allowPrivilegeEscalation: false&lt;/code&gt; in your container&#39;s &lt;code&gt;securityContext&lt;/code&gt;, or&lt;/li&gt;
&lt;li&gt;conforming your Pod to the &lt;a href=&#34;https://kubernetes.io/docs/concepts/security/pod-security-standards/#restricted&#34;&gt;&lt;code&gt;Restricted&lt;/code&gt; policy of the Pod Security Standards&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Also, the kubelet has no visibility into NRI plugins or the container runtime&#39;s internal workings. A cluster administrator configuring nodes, or a highly privileged workload running with local administrator permissions, may change the supplemental groups of any Pod. However, this is outside the scope of Kubernetes control and should not be a concern on security-hardened nodes.&lt;/p&gt;
&lt;/div&gt;

&lt;h2 id=&#34;strict-policy-requires-up-to-date-container-runtimes&#34;&gt;&lt;code&gt;Strict&lt;/code&gt; policy requires up-to-date container runtimes&lt;/h2&gt;
&lt;p&gt;The high-level container runtime (e.g. containerd, CRI-O) plays a key role in calculating the supplementary group IDs
that will be attached to containers. Thus, &lt;code&gt;supplementalGroupsPolicy: Strict&lt;/code&gt; requires a CRI runtime that supports this feature.
The old behavior (&lt;code&gt;supplementalGroupsPolicy: Merge&lt;/code&gt;) works even with a CRI runtime that does not support this feature,
because that policy is fully backward compatible.&lt;/p&gt;
&lt;p&gt;Here are some CRI runtimes that support this feature, and the versions you need
to be running:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;containerd: v2.0 or later&lt;/li&gt;
&lt;li&gt;CRI-O: v1.31 or later&lt;/li&gt;
&lt;/ul&gt;
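&lt;p&gt;If you automate node audits, a simple major-version comparison is enough for containerd (CRI-O would need a minor-version comparison against v1.31). The sketch below is illustrative only, with made-up version strings:&lt;/p&gt;

```shell
# Illustrative sketch only: check whether a containerd version is new
# enough (v2.0 or later) for supplementalGroupsPolicy support
result=""
for v in 2.1.3 1.7.27; do            # made-up version strings
  if [ "${v%%.*}" -ge 2 ]; then
    result="$result $v:supported"
  else
    result="$result $v:unsupported"
  fi
done
result="${result# }"
echo "$result"
```

&lt;p&gt;In practice, checking the Node&#39;s &lt;code&gt;.status.features.supplementalGroupsPolicy&lt;/code&gt; field shown below is simpler and runtime-agnostic.&lt;/p&gt;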
&lt;p&gt;You can check whether the feature is supported on a node via the Node&#39;s &lt;code&gt;.status.features.supplementalGroupsPolicy&lt;/code&gt; field. Please note that this field is different from the &lt;code&gt;status.declaredFeatures&lt;/code&gt; field introduced in &lt;a href=&#34;https://github.com/kubernetes/enhancements/issues/5328&#34;&gt;KEP-5328: Node Declared Features (formerly Node Capabilities)&lt;/a&gt;.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;v1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Node&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#00f;font-weight:bold&#34;&gt;...&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;status&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;features&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;supplementalGroupsPolicy&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;true&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;As container runtime support for this feature becomes universal, various security policies may start enforcing the more secure &lt;code&gt;Strict&lt;/code&gt; behavior. It is best practice to make sure your Pods are ready for this enforcement, with all supplemental groups transparently declared in the Pod spec rather than in images.&lt;/p&gt;
&lt;h2 id=&#34;getting-involved&#34;&gt;Getting involved&lt;/h2&gt;
&lt;p&gt;This enhancement was driven by the &lt;a href=&#34;https://github.com/kubernetes/community/tree/master/sig-node&#34;&gt;SIG Node&lt;/a&gt; community.
Please join us to connect with the community and share your ideas and feedback around the above feature and
beyond. We look forward to hearing from you!&lt;/p&gt;
&lt;h2 id=&#34;how-can-i-learn-more&#34;&gt;How can I learn more?&lt;/h2&gt;
&lt;!-- https://github.com/kubernetes/website/pull/46920 --&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/tasks/configure-pod-container/security-context/&#34;&gt;Configure a Security Context for a Pod or Container&lt;/a&gt;
for the further details of &lt;code&gt;supplementalGroupsPolicy&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/kubernetes/enhancements/issues/3619&#34;&gt;KEP-3619: Fine-grained SupplementalGroups control&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

      </description>
    </item>
    
    <item>
      <title>Kubernetes v1.35: Kubelet Configuration Drop-in Directory Graduates to GA</title>
      <link>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/12/22/kubernetes-v1-35-kubelet-config-drop-in-directory-ga/</link>
      <pubDate>Mon, 22 Dec 2025 10:30:00 -0800</pubDate>
      
      <guid>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/12/22/kubernetes-v1-35-kubelet-config-drop-in-directory-ga/</guid>
      <description>
        
        
        &lt;p&gt;With the recent v1.35 release of Kubernetes, support for a kubelet configuration drop-in directory is generally available.
The newly stable feature simplifies the management of kubelet configuration across large, heterogeneous clusters.&lt;/p&gt;
&lt;p&gt;With v1.35, the kubelet command line argument &lt;code&gt;--config-dir&lt;/code&gt; is production-ready and fully supported,
allowing you to specify a directory containing kubelet configuration drop-in files.
All files in that directory will be automatically merged with your main kubelet configuration.
This allows cluster administrators to maintain a cohesive &lt;em&gt;base configuration&lt;/em&gt; for kubelets while enabling targeted customizations for different node groups or use cases, and without complex tooling or manual configuration management.&lt;/p&gt;
&lt;h2 id=&#34;the-problem-managing-kubelet-configuration-at-scale&#34;&gt;The problem: managing kubelet configuration at scale&lt;/h2&gt;
&lt;p&gt;As Kubernetes clusters grow larger and more complex, they often include heterogeneous node pools with different hardware capabilities, workload requirements, and operational constraints. This diversity necessitates different kubelet configurations across node groups—yet managing these varied configurations at scale becomes increasingly challenging. Several pain points emerge:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Configuration drift&lt;/strong&gt;: Different nodes may have slightly different configurations, leading to inconsistent behavior&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Node group customization&lt;/strong&gt;: GPU nodes, edge nodes, and standard compute nodes often require different kubelet settings&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Operational overhead&lt;/strong&gt;: Maintaining separate, complete configuration files for each node type is error-prone and difficult to audit&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Change management&lt;/strong&gt;: Rolling out configuration changes across heterogeneous node pools requires careful coordination&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Before this support was added to Kubernetes, cluster administrators had to choose between using a single monolithic configuration file for all nodes,
manually maintaining multiple complete configuration files, or relying on separate tooling. Each approach had its own drawbacks.
This graduation to stable gives cluster administrators a fully supported fourth way to solve that challenge.&lt;/p&gt;
&lt;h2 id=&#34;example-use-cases&#34;&gt;Example use cases&lt;/h2&gt;
&lt;h3 id=&#34;managing-heterogeneous-node-pools&#34;&gt;Managing heterogeneous node pools&lt;/h3&gt;
&lt;p&gt;Consider a cluster with multiple node types: standard compute nodes, high-capacity nodes (such as those with GPUs or large amounts of memory), and edge nodes with specialized requirements.&lt;/p&gt;
&lt;h4 id=&#34;base-configuration&#34;&gt;Base configuration&lt;/h4&gt;
&lt;p&gt;File: &lt;code&gt;00-base.conf&lt;/code&gt;&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;kubelet.config.k8s.io/v1beta1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;KubeletConfiguration&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;clusterDNS&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;10.96.0.10&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;clusterDomain&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;cluster.local&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h4 id=&#34;high-capacity-node-override&#34;&gt;High-capacity node override&lt;/h4&gt;
&lt;p&gt;File: &lt;code&gt;50-high-capacity-nodes.conf&lt;/code&gt;&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;kubelet.config.k8s.io/v1beta1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;KubeletConfiguration&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;maxPods&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;50&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;systemReserved&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;memory&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;4Gi&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;cpu&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;1000m&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h4 id=&#34;edge-node-override&#34;&gt;Edge node override&lt;/h4&gt;
&lt;p&gt;File: &lt;code&gt;50-edge-nodes.conf&lt;/code&gt; (edge compute typically has lower capacity)&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;kubelet.config.k8s.io/v1beta1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;KubeletConfiguration&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;evictionHard&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;memory.available&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;500Mi&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;nodefs.available&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;5%&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;With this structure, high-capacity nodes apply both the base configuration and the capacity-specific overrides, while edge nodes apply the base configuration with edge-specific settings.&lt;/p&gt;
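&lt;p&gt;Because drop-in files are merged in sorted order, the numeric prefixes above determine which settings win on conflict. A quick shell sketch (the directory and file names are illustrative):&lt;/p&gt;

```shell
# Illustrative sketch only: drop-in files are processed in alphanumeric
# order, so 50-high-capacity-nodes.conf overrides overlapping keys
# in 00-base.conf. Directory and file names are made up.
dir=$(mktemp -d)
touch "$dir/00-base.conf" "$dir/50-high-capacity-nodes.conf"
order=$(ls "$dir" | xargs)   # ls sorts the names, mirroring the merge order
echo "$order"
rm -r "$dir"
```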
&lt;h3 id=&#34;gradual-configuration-rollouts&#34;&gt;Gradual configuration rollouts&lt;/h3&gt;
&lt;p&gt;When rolling out configuration changes, you can:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Add a new drop-in file with a high numeric prefix (e.g., &lt;code&gt;99-new-feature.conf&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Test the changes on a subset of nodes&lt;/li&gt;
&lt;li&gt;Gradually roll out to more nodes&lt;/li&gt;
&lt;li&gt;Once stable, merge changes into the base configuration&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id=&#34;viewing-the-merged-configuration&#34;&gt;Viewing the merged configuration&lt;/h2&gt;
&lt;p&gt;Since configuration is now spread across multiple files, you can inspect the final merged configuration using the kubelet&#39;s &lt;code&gt;/configz&lt;/code&gt; endpoint:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# Start kubectl proxy&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;kubectl proxy
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# In another terminal, fetch the merged configuration&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# Change the &amp;#39;&amp;lt;node-name&amp;gt;&amp;#39; placeholder before running the curl command&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;curl -X GET http://127.0.0.1:8001/api/v1/nodes/&amp;lt;node-name&amp;gt;/proxy/configz | jq .
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This shows the actual configuration the kubelet is using after all merging has been applied.
The merged configuration also includes any configuration settings that were specified via kubelet command-line arguments.&lt;/p&gt;
&lt;p&gt;For detailed setup instructions, configuration examples, and merging behavior, see the official documentation:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/tasks/administer-cluster/kubelet-config-file/#kubelet-conf-d&#34;&gt;Set Kubelet Parameters Via A Configuration File&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/reference/node/kubelet-config-directory-merging/&#34;&gt;Kubelet Configuration Directory Merging&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;good-practices&#34;&gt;Good practices&lt;/h2&gt;
&lt;p&gt;When using the kubelet configuration drop-in directory:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Test configurations incrementally&lt;/strong&gt;: Always test new drop-in configurations on a subset of nodes before rolling out cluster-wide to minimize risk&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Version control your drop-ins&lt;/strong&gt;: Store your drop-in configuration files in version control (or the configuration source from which these are generated) alongside your infrastructure as code to track changes and enable easy rollbacks&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Use numeric prefixes for predictable ordering&lt;/strong&gt;: Name files with numeric prefixes (e.g., &lt;code&gt;00-&lt;/code&gt;, &lt;code&gt;50-&lt;/code&gt;, &lt;code&gt;90-&lt;/code&gt;) to explicitly control merge order and make the configuration layering obvious to other administrators&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Be mindful of temporary files&lt;/strong&gt;: Some text editors automatically create backup files (such as &lt;code&gt;.bak&lt;/code&gt;, &lt;code&gt;.swp&lt;/code&gt;, or files with &lt;code&gt;~&lt;/code&gt; suffix) in the same directory when editing. Ensure these temporary or backup files are not left in the configuration directory, as they may be processed by the kubelet&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
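&lt;p&gt;For the last point, a small audit script can catch stray editor backup files before a kubelet restart. This sketch only illustrates the idea; the directory and file names are made up:&lt;/p&gt;

```shell
# Illustrative sketch only: list files in a drop-in directory that are
# not deliberate *.conf drop-ins (editor backup/swap files and the like)
dir=$(mktemp -d)
touch "$dir/00-base.conf" "$dir/50-edge.conf" "$dir/50-edge.conf.bak" "$dir/.50-edge.conf.swp"
stray=$(cd "$dir"; find . -type f ! -name '*.conf' | LC_ALL=C sort | xargs)
echo "$stray"
rm -r "$dir"
```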
&lt;h2 id=&#34;acknowledgments&#34;&gt;Acknowledgments&lt;/h2&gt;
&lt;p&gt;This feature was developed through the collaborative efforts of &lt;a href=&#34;https://github.com/kubernetes/community/tree/master/sig-node&#34;&gt;SIG Node&lt;/a&gt;. Special thanks to all contributors who helped design, implement, test, and document this feature across its journey from alpha in v1.28, through beta in v1.30, to GA in v1.35.&lt;/p&gt;
&lt;p&gt;To provide feedback on this feature, join the &lt;a href=&#34;https://github.com/kubernetes/community/tree/master/sig-node&#34;&gt;Kubernetes Node Special Interest Group&lt;/a&gt;, participate in discussions on the &lt;a href=&#34;http://slack.k8s.io/&#34;&gt;public Slack channel&lt;/a&gt; (#sig-node), or file an issue on &lt;a href=&#34;https://github.com/kubernetes/kubernetes/issues&#34;&gt;GitHub&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;get-involved&#34;&gt;Get involved&lt;/h2&gt;
&lt;p&gt;If you have feedback or questions about kubelet configuration management, or want to share your experience using this feature, join the discussion:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/kubernetes/community/tree/master/sig-node&#34;&gt;SIG Node community page&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;http://slack.k8s.io/&#34;&gt;Kubernetes Slack&lt;/a&gt; in the #sig-node channel&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://groups.google.com/forum/#!forum/kubernetes-sig-node&#34;&gt;SIG Node mailing list&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;SIG Node would love to hear about your experiences using this feature in production!&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>Avoiding Zombie Cluster Members When Upgrading to etcd v3.6</title>
      <link>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/12/21/preventing-etcd-zombies/</link>
      <pubDate>Sun, 21 Dec 2025 00:00:00 +0000</pubDate>
      
      <guid>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/12/21/preventing-etcd-zombies/</guid>
      <description>
        
        
        &lt;p&gt;&lt;em&gt;This article is a mirror of an &lt;a href=&#34;https://etcd.io/blog/2025/zombie_members_upgrade/&#34;&gt;original&lt;/a&gt; that was recently published to the official etcd blog&lt;/em&gt;.
The &lt;a href=&#34;https://etcd.io/blog/2025/zombie_members_upgrade/#key-takeaway&#34;&gt;key takeaway&lt;/a&gt;?
Always upgrade to etcd v3.5.26 or later before moving to v3.6. This ensures your cluster is automatically repaired, and avoids zombie members.&lt;/p&gt;
&lt;h2 id=&#34;issue-summary&#34;&gt;Issue summary&lt;/h2&gt;
&lt;p&gt;Recently, the etcd community addressed an issue that may appear when users &lt;a href=&#34;https://etcd.io/docs/v3.6/upgrades/upgrade_3_6/&#34;&gt;upgrade from v3.5 to v3.6&lt;/a&gt;. This bug can cause the cluster to report &amp;quot;zombie members&amp;quot;: etcd nodes that were removed from the cluster some time ago but re-appear and rejoin database consensus. The etcd cluster is then inoperable until these zombie members are removed.&lt;/p&gt;
&lt;p&gt;In etcd v3.5 and earlier, the v2store was the source of truth for membership data, even though the v3store was also present. As a part of our &lt;a href=&#34;https://github.com/etcd-io/etcd/issues/12913&#34;&gt;v2store deprecation plan&lt;/a&gt;, in v3.6 the v3store is the source of truth for cluster membership. Through a &lt;a href=&#34;https://github.com/etcd-io/etcd/issues/20967&#34;&gt;bug report&lt;/a&gt; we found out that, in some older clusters, v2store and v3store could become inconsistent.  This inconsistency manifests after upgrading as seeing old, removed &amp;quot;zombie&amp;quot; cluster members re-appearing in the cluster.&lt;/p&gt;
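&lt;p&gt;Conceptually, a zombie is a member ID that lingers in the v3store after it was removed from the (previously authoritative) v2store. Here is a small shell sketch of that set difference; it is illustrative only, with made-up member IDs:&lt;/p&gt;

```shell
# Illustrative sketch only -- not etcd code; member IDs are made up.
v3store_members="8e9e05c52164694d 91bc3c398fb3c146 fd422379fda50e48"
v2store_members="8e9e05c52164694d 91bc3c398fb3c146"  # authoritative in v3.5
zombies=""
for m in $v3store_members; do
  case " $v2store_members " in
    *" $m "*) ;;                     # present in both stores: consistent
    *) zombies="$zombies $m" ;;      # only in v3store: a zombie candidate
  esac
done
zombies="${zombies# }"
echo "zombie members: $zombies"
```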
&lt;h2 id=&#34;the-fix-and-upgrade-path&#34;&gt;The fix and upgrade path&lt;/h2&gt;
&lt;p&gt;We’ve added a &lt;a href=&#34;https://github.com/etcd-io/etcd/pull/20995&#34;&gt;mechanism in etcd v3.5.26&lt;/a&gt; to automatically sync v3store from v2store, ensuring that affected clusters are repaired before upgrading to 3.6.x.&lt;/p&gt;
&lt;p&gt;To support the many users currently upgrading to 3.6, we have provided the following safe upgrade path:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Upgrade your cluster to &lt;a href=&#34;https://github.com/etcd-io/etcd/releases/tag/v3.5.26&#34;&gt;v3.5.26&lt;/a&gt; or later.&lt;/li&gt;
&lt;li&gt;Wait and confirm that all members are healthy after the upgrade.&lt;/li&gt;
&lt;li&gt;Upgrade to v3.6.&lt;/li&gt;
&lt;/ol&gt;
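&lt;p&gt;For step 2, the health check can be sketched with standard &lt;code&gt;etcdctl&lt;/code&gt; commands. The endpoint URLs below are placeholders; adjust them (and any TLS flags) for your cluster:&lt;/p&gt;

```shell
# Confirm every member is reachable and healthy after rolling out v3.5.26.
etcdctl --endpoints=https://etcd-1:2379,https://etcd-2:2379,https://etcd-3:2379 \
  endpoint health --cluster

# List the members and verify that no unexpected (zombie) entries appear.
etcdctl --endpoints=https://etcd-1:2379 member list -w table

# Check the reported server version on each member before moving to v3.6.
etcdctl --endpoints=https://etcd-1:2379 endpoint status -w table
```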
&lt;p&gt;There is no safe workaround for users who cannot update to v3.5.26. If v3.5.26 is not available from your packaging source or vendor, you should delay upgrading to v3.6 until it is.&lt;/p&gt;
&lt;h2 id=&#34;additional-technical-detail&#34;&gt;Additional technical detail&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Information below is offered for reference only.  Users can follow the safe upgrade path without knowledge of the following details.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;This issue is encountered with clusters that have been running in production on etcd v3.5.25 or earlier.  It is a side effect of adding and removing members from the cluster, or recovering the cluster from failure.  This means that the issue is more likely the older the etcd cluster is, but it cannot be ruled out for any user regardless of the age of the cluster.&lt;/p&gt;
&lt;p&gt;etcd maintainers, working with issue reporters, have found three possible triggers for the issue based on symptoms and an analysis of etcd code and logs:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Bug in &lt;code&gt;etcdctl snapshot restore&lt;/code&gt; (v3.4 and older versions)&lt;/strong&gt;: When restoring a snapshot using &lt;code&gt;etcdctl snapshot restore&lt;/code&gt;, etcdctl was supposed to remove existing members before adding the new ones. In v3.4, due to a bug, old members were not removed, resulting in zombie members. Refer to the &lt;a href=&#34;https://github.com/etcd-io/etcd/issues/20967#issuecomment-3618010356&#34;&gt;comment on etcdctl&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;--force-new-cluster&lt;/code&gt; in v3.5 and earlier versions&lt;/strong&gt;: In rare cases, forcibly creating a new single-member cluster did not fully remove old members, leaving zombies. The issue was &lt;a href=&#34;https://github.com/etcd-io/etcd/pull/20339&#34;&gt;resolved&lt;/a&gt; in v3.5.22. Please refer to &lt;a href=&#34;https://github.com/etcd-io/raft/pull/300&#34;&gt;this PR&lt;/a&gt; in the Raft project for detailed technical information.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;--unsafe-no-sync&lt;/code&gt; enabled&lt;/strong&gt;: If &lt;code&gt;--unsafe-no-sync&lt;/code&gt; is enabled, in rare cases etcd might persist a membership change to v3store but crash before writing it to the WAL, causing inconsistency between v2store and v3store. This is a problem for single-member clusters. For multi-member clusters, forcibly creating a new single-member cluster from the crashed node’s data may lead to zombie members.&lt;/li&gt;
&lt;/ol&gt;


&lt;div class=&#34;alert alert-info&#34; role=&#34;alert&#34;&gt;
&lt;h4 class=&#34;alert-heading&#34;&gt;Note&lt;/h4&gt;

    &lt;code&gt;--unsafe-no-sync&lt;/code&gt; is generally not recommended, as it may break the guarantees given by the consensus protocol.

&lt;/div&gt;

&lt;p&gt;Importantly, there may be other triggers for v2store and v3store membership data becoming inconsistent that we have not yet found.  This means that you cannot assume that you are safe just because you have not performed any of the three actions above.
Once users are upgraded to etcd v3.6, v3store becomes the source of membership data, and further inconsistency is not possible.&lt;/p&gt;
&lt;p&gt;Advanced users who want to verify the consistency between v2store and v3store can follow the steps described in this &lt;a href=&#34;https://github.com/etcd-io/etcd/issues/20967#issuecomment-3590609775&#34;&gt;comment&lt;/a&gt;.   This check is not required to fix the issue, nor does SIG etcd recommend bypassing the v3.5.26 update regardless of the results of the check.&lt;/p&gt;
&lt;h2 id=&#34;key-takeaway&#34;&gt;Key takeaway&lt;/h2&gt;
&lt;p&gt;Always upgrade to &lt;a href=&#34;https://github.com/etcd-io/etcd/releases/tag/v3.5.26&#34;&gt;v3.5.26&lt;/a&gt; or later before moving to v3.6. This ensures your cluster is automatically repaired and avoids zombie members.&lt;/p&gt;
&lt;h2 id=&#34;acknowledgements&#34;&gt;Acknowledgements&lt;/h2&gt;
&lt;p&gt;We would like to thank &lt;a href=&#34;https://github.com/thechristschn&#34;&gt;Christian Baumann&lt;/a&gt; for reporting this long-standing upgrade issue. His report and follow-up work helped bring the issue to our attention so that we could investigate and resolve it upstream.&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>Kubernetes 1.35: In-Place Pod Resize Graduates to Stable</title>
      <link>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/12/19/kubernetes-v1-35-in-place-pod-resize-ga/</link>
      <pubDate>Fri, 19 Dec 2025 10:30:00 -0800</pubDate>
      
      <guid>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/12/19/kubernetes-v1-35-in-place-pod-resize-ga/</guid>
      <description>
        
        
&lt;p&gt;This release marks a major step: more than six years after its initial conception,
the &lt;strong&gt;In-Place Pod Resize&lt;/strong&gt; feature (also known as In-Place Pod Vertical Scaling), first introduced as
alpha in Kubernetes v1.27 and promoted to beta in v1.33, is now &lt;strong&gt;stable (GA)&lt;/strong&gt; in Kubernetes
v1.35!&lt;/p&gt;
&lt;p&gt;This graduation is a major milestone for improving resource efficiency and flexibility for workloads
running on Kubernetes.&lt;/p&gt;
&lt;h2 id=&#34;what-is-in-place-pod-resize&#34;&gt;What is in-place Pod Resize?&lt;/h2&gt;
&lt;p&gt;In the past, the CPU and memory resources allocated to a container in a Pod were immutable. This meant changing
them required deleting and recreating the entire Pod. For stateful services, batch jobs, or latency-sensitive
workloads, this was an incredibly disruptive operation.&lt;/p&gt;
&lt;p&gt;In-Place Pod Resize makes CPU and memory requests and limits mutable, allowing you to adjust these resources
within a running Pod, often without requiring a container restart.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Key concepts:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Desired Resources:&lt;/strong&gt; A container&#39;s &lt;code&gt;spec.containers[*].resources&lt;/code&gt; field now represents the desired
resources. For CPU and memory, these fields are now mutable.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Actual Resources:&lt;/strong&gt; The &lt;code&gt;status.containerStatuses[*].resources&lt;/code&gt; field reflects the resources currently
configured for a running container.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Triggering a Resize:&lt;/strong&gt; You can request a resize by updating the desired &lt;code&gt;requests&lt;/code&gt;
and &lt;code&gt;limits&lt;/code&gt; in the Pod&#39;s specification by utilizing the new &lt;code&gt;resize&lt;/code&gt; subresource.&lt;/li&gt;
&lt;/ul&gt;
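&lt;p&gt;As a sketch, a resize can be requested through the &lt;code&gt;resize&lt;/code&gt; subresource with &lt;code&gt;kubectl patch&lt;/code&gt;. The Pod and container names below are placeholders:&lt;/p&gt;

```shell
# Request new CPU values via the resize subresource (kubectl v1.32 or later).
kubectl patch pod my-app --subresource resize --patch \
  '{"spec":{"containers":[{"name":"app","resources":{"requests":{"cpu":"800m"},"limits":{"cpu":"800m"}}}]}}'

# Compare the desired spec with the resources actually configured
# for the running container.
kubectl get pod my-app -o jsonpath='{.spec.containers[0].resources}'
kubectl get pod my-app -o jsonpath='{.status.containerStatuses[0].resources}'
```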
&lt;h2 id=&#34;how-can-i-start-using-in-place-pod-resize&#34;&gt;How can I start using in-place Pod Resize?&lt;/h2&gt;
&lt;p&gt;Detailed usage instructions and examples are provided in the official documentation:
&lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/tasks/configure-pod-container/resize-container-resources/&#34;&gt;Resize CPU and Memory Resources assigned to Containers&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;how-does-this-help-me&#34;&gt;How does this help me?&lt;/h2&gt;
&lt;p&gt;In-place Pod Resize is a foundational building block that unlocks seamless, vertical autoscaling and
improvements to workload efficiency.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Resources adjusted without disruption&lt;/strong&gt; Workloads sensitive to latency or restarts can have their resources
modified in-place without downtime or loss of state.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;More powerful autoscaling&lt;/strong&gt; Autoscalers are now empowered to adjust resources and with less
impact. For example, Vertical Pod Autoscaler (VPA)&#39;s &lt;code&gt;InPlaceOrRecreate&lt;/code&gt; update mode, which leverages this
feature, has graduated to beta. This allows resources to be adjusted automatically and seamlessly based on
usage with minimal disruption.
&lt;ul&gt;
&lt;li&gt;See &lt;a href=&#34;https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler/enhancements/4016-in-place-updates-support&#34;&gt;AEP-4016&lt;/a&gt; for more details.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Address transient resource needs&lt;/strong&gt; Workloads that temporarily need more resources can be adjusted quickly. This enables features like the CPU Startup Boost (&lt;a href=&#34;https://github.com/kubernetes/autoscaler/pull/7863&#34;&gt;AEP-7862&lt;/a&gt;) where applications can request more CPU during startup and then automatically scale back down.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Here are a few examples of some use cases:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A game server that needs to adjust its size as the player count shifts.&lt;/li&gt;
&lt;li&gt;A pre-warmed worker that can be shrunk while unused and inflated on the first request.&lt;/li&gt;
&lt;li&gt;A service that scales dynamically with load for efficient bin-packing.&lt;/li&gt;
&lt;li&gt;A runtime that needs extra resources for JIT compilation at startup.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;changes-between-beta-1-33-and-stable-1-35&#34;&gt;Changes between beta (1.33) and stable (1.35)&lt;/h2&gt;
&lt;p&gt;Since the initial beta in v1.33, development effort has primarily been around stabilizing the feature and
improving its usability based on community feedback. Here are the primary changes for the stable release:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Memory limit decrease&lt;/strong&gt; Decreasing memory limits was previously prohibited. This restriction has been
lifted, and memory limit decreases are now permitted. The Kubelet attempts to prevent OOM-kills by allowing the
resize only if the current memory usage is below the new desired limit. However, this check is best-effort and
not guaranteed.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Prioritized resizes&lt;/strong&gt; If a node doesn&#39;t have enough room to accept all resize requests, &lt;em&gt;Deferred&lt;/em&gt; resizes
are reattempted based on the following priority:
&lt;ul&gt;
&lt;li&gt;PriorityClass&lt;/li&gt;
&lt;li&gt;QoS class&lt;/li&gt;
&lt;li&gt;Duration &lt;em&gt;Deferred&lt;/em&gt;, with older requests prioritized first.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Pod Level Resources (Alpha)&lt;/strong&gt; Support for in-place Pod Resize with Pod Level Resources has been introduced
behind its own feature gate, which is alpha in v1.35.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Increased observability&lt;/strong&gt;: There are now new Kubelet metrics and Pod events specifically associated with
In-Place Pod Resize to help users track and debug resource changes.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;what-s-next&#34;&gt;What&#39;s next?&lt;/h2&gt;
&lt;p&gt;The graduation of In-Place Pod Resize to stable opens the door for powerful integrations across the Kubernetes
ecosystem. There are several areas for further improvement that are currently planned.&lt;/p&gt;
&lt;h3 id=&#34;integration-with-autoscalers-and-other-projects&#34;&gt;Integration with autoscalers and other projects&lt;/h3&gt;
&lt;p&gt;There are planned integrations with several autoscalers and other projects to improve workload efficiency at a larger scale. Some projects under discussion:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;VPA CPU startup boost (&lt;a href=&#34;https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler/enhancements/7862-cpu-startup-boost&#34;&gt;AEP-7862&lt;/a&gt;): Allows applications to request more CPU at startup and scale back down after a specific period of time.&lt;/li&gt;
&lt;li&gt;VPA Support for in-place updates (&lt;a href=&#34;https://github.com/kubernetes/autoscaler/tree/455d29039bf6b1eb9f784f498f28769a8698bc21/vertical-pod-autoscaler/enhancements/4016-in-place-updates-support&#34;&gt;AEP-4016&lt;/a&gt;): VPA support for &lt;code&gt;InPlaceOrRecreate&lt;/code&gt; has recently graduated to beta, with the eventual goal being to graduate
the feature to stable. Support for &lt;code&gt;InPlace&lt;/code&gt; mode is still being worked on; see &lt;a href=&#34;https://github.com/kubernetes/autoscaler/pull/8818&#34;&gt;this pull request&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Ray autoscaler: Plans to leverage In-Place Pod Resize to improve workload efficiency. See &lt;a href=&#34;https://cloud.google.com/blog/products/containers-kubernetes/ray-on-gke-new-features-for-ai-scheduling-and-scaling&#34;&gt;this Google Cloud blog post&lt;/a&gt; for more details.&lt;/li&gt;
&lt;li&gt;Agent-sandbox &amp;quot;Soft-Pause&amp;quot;: Investigating leveraging in-place Pod Resize for improved latency. See the &lt;a href=&#34;https://github.com/kubernetes-sigs/agent-sandbox/issues/103&#34;&gt;GitHub issue&lt;/a&gt; for more details.&lt;/li&gt;
&lt;li&gt;Runtime support: Java and Python runtimes do not support resizing memory without a restart. There is an open
conversation with the Java developers; see &lt;a href=&#34;https://bugs.openjdk.org/browse/JDK-8359211&#34;&gt;the bug&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you have a project that could benefit from integration with in-place pod resize, please reach out using
the channels listed in the feedback section!&lt;/p&gt;
&lt;h3 id=&#34;feature-expansion&#34;&gt;Feature expansion&lt;/h3&gt;
&lt;p&gt;Today, In-Place Pod Resize is prohibited when used in combination with: swap, the static CPU Manager, and the
static Memory Manager. Additionally, resources other than CPU and memory are still immutable. Expanding the set
of supported features and resources is under consideration as more feedback about community needs comes in.&lt;/p&gt;
&lt;p&gt;There are also plans to support workload preemption; if there is not enough room on the node for the resize of
a high priority pod, the goal is to enable policies to automatically evict a lower-priority pod or upsize
the node.&lt;/p&gt;
&lt;h3 id=&#34;improved-stability&#34;&gt;Improved stability&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Resolve kubelet-scheduler race conditions&lt;/strong&gt; There are known race conditions between the kubelet and
scheduler with regards to in-place pod resize. Work is underway to resolve these issues over the next few releases. See the &lt;a href=&#34;https://github.com/kubernetes/kubernetes/issues/126891&#34;&gt;issue&lt;/a&gt; for more details.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Safer memory limit decrease&lt;/strong&gt; The Kubelet&#39;s best-effort check for OOM-kill prevention can be made even
safer by moving the memory usage check into the container runtime itself. See the &lt;a href=&#34;https://github.com/kubernetes/kubernetes/issues/135670&#34;&gt;issue&lt;/a&gt; for more details.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;providing-feedback&#34;&gt;Providing feedback&lt;/h2&gt;
&lt;p&gt;We are looking to build further on this foundational feature, so please share your feedback on how to improve and
extend it. You can share your feedback through GitHub issues, mailing lists, or Slack channels
related to the Kubernetes &lt;a href=&#34;https://kubernetes.slack.com/archives/C0BP8PW9G&#34;&gt;#sig-node&lt;/a&gt; and &lt;a href=&#34;https://kubernetes.slack.com/archives/C09R1LV8S&#34;&gt;#sig-autoscaling&lt;/a&gt; communities.&lt;/p&gt;
&lt;p&gt;Thank you to everyone who contributed to making this long-awaited feature a reality!&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>Kubernetes v1.35: Job Managed By Goes GA</title>
      <link>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/12/18/kubernetes-v1-35-job-managedby-for-jobs-goes-ga/</link>
      <pubDate>Thu, 18 Dec 2025 10:30:00 -0800</pubDate>
      
      <guid>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/12/18/kubernetes-v1-35-job-managedby-for-jobs-goes-ga/</guid>
      <description>
        
        
        &lt;p&gt;In Kubernetes v1.35, the ability to specify an external Job controller (through &lt;code&gt;.spec.managedBy&lt;/code&gt;) graduates to General Availability.&lt;/p&gt;
&lt;p&gt;This feature allows external controllers to take full responsibility for Job reconciliation, unlocking powerful scheduling patterns like multi-cluster dispatching with &lt;a href=&#34;https://kueue.sigs.k8s.io/docs/concepts/multikueue/&#34;&gt;MultiKueue&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;why-delegate-job-reconciliation&#34;&gt;Why delegate Job reconciliation?&lt;/h2&gt;
&lt;p&gt;The primary motivation for this feature is to support multi-cluster batch scheduling architectures, such as MultiKueue.&lt;/p&gt;
&lt;p&gt;The MultiKueue architecture distinguishes between a Management Cluster and a pool of Worker Clusters:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The Management Cluster is responsible for dispatching Jobs but not executing them. It needs to accept Job objects to track status, but it skips the creation and execution of Pods.&lt;/li&gt;
&lt;li&gt;The Worker Clusters receive the dispatched Jobs and execute the actual Pods.&lt;/li&gt;
&lt;li&gt;Users usually interact with the Management Cluster. Because the status is automatically propagated back, they can observe the Job&#39;s progress &amp;quot;live&amp;quot; without accessing the Worker Clusters.&lt;/li&gt;
&lt;li&gt;In the Worker Clusters, the dispatched Jobs run as regular Jobs managed by the built-in Job controller, with no &lt;code&gt;.spec.managedBy&lt;/code&gt; set.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By using &lt;code&gt;.spec.managedBy&lt;/code&gt;, the MultiKueue controller on the Management Cluster can take over the reconciliation of a Job. It copies the status from the &amp;quot;mirror&amp;quot; Job running on the Worker Cluster back to the Management Cluster.&lt;/p&gt;
&lt;p&gt;Why not just disable the Job controller? While one could theoretically achieve this by disabling the built-in Job controller entirely, this is often impossible or impractical for two reasons:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Managed Control Planes: In many cloud environments, the Kubernetes control plane is locked, and users cannot modify controller manager flags.&lt;/li&gt;
&lt;li&gt;Hybrid Cluster Role: Users often need a &amp;quot;hybrid&amp;quot; mode where the Management Cluster dispatches some heavy workloads to remote clusters but still executes smaller or control-plane-related Jobs in the Management Cluster. &lt;code&gt;.spec.managedBy&lt;/code&gt; allows this granularity on a per-Job basis.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id=&#34;how-spec-managedby-works&#34;&gt;How &lt;code&gt;.spec.managedBy&lt;/code&gt; works&lt;/h2&gt;
&lt;p&gt;The &lt;code&gt;.spec.managedBy&lt;/code&gt; field indicates which controller is responsible for the Job. Specifically, there are two modes of operation:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Standard&lt;/strong&gt;: if unset or set to the reserved value &lt;code&gt;kubernetes.io/job-controller&lt;/code&gt;, the built-in Job controller reconciles the Job as usual (standard behavior).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Delegation&lt;/strong&gt;: If set to any other value, the built-in Job controller skips reconciliation entirely for that Job.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;To prevent orphaned Pods or resource leaks, this field is immutable. You cannot transfer a running Job from one controller to another.&lt;/p&gt;
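&lt;p&gt;In manifest form, delegation looks like the following sketch. The controller name is a placeholder; any value other than the reserved one triggers delegation:&lt;/p&gt;

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: sample-job
spec:
  # Any value other than kubernetes.io/job-controller makes the
  # built-in Job controller skip this Job entirely, so the named
  # external controller must handle reconciliation and status.
  managedBy: example.com/custom-job-controller
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: main
        image: busybox
        command: ["sleep", "5"]
```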
&lt;p&gt;If you are looking into implementing an external controller, be aware that your controller needs to conform to the definitions of the &lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/reference/kubernetes-api/workload-resources/job-v1/&#34;&gt;Job API&lt;/a&gt;.
To enforce this conformance, a significant part of the effort went into introducing extensive Job status validation rules.
See the &lt;a href=&#34;#how-can-you-learn-more&#34;&gt;How can you learn more?&lt;/a&gt; section for more details.&lt;/p&gt;
&lt;h2 id=&#34;ecosystem-adoption&#34;&gt;Ecosystem adoption&lt;/h2&gt;
&lt;p&gt;The &lt;code&gt;.spec.managedBy&lt;/code&gt; field is rapidly becoming the standard interface for delegating control in the Kubernetes batch ecosystem.&lt;/p&gt;
&lt;p&gt;Various custom workload controllers are adding this field (or an equivalent) to allow MultiKueue to take over their reconciliation and orchestrate them across clusters:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/kubernetes-sigs/jobset&#34;&gt;JobSet&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.kubeflow.org/docs/components/training/&#34;&gt;Kubeflow Trainer&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.ray.io/en/latest/cluster/kubernetes/&#34;&gt;KubeRay&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://project-codeflare.github.io/appwrapper/&#34;&gt;AppWrapper&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://tekton.dev/docs/&#34;&gt;Tekton Pipelines&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;While it is possible to use &lt;code&gt;.spec.managedBy&lt;/code&gt; to implement a custom Job controller from scratch, we haven&#39;t observed that yet. The feature is specifically designed to support delegation patterns, like MultiKueue, without reinventing the wheel.&lt;/p&gt;
&lt;h2 id=&#34;how-can-you-learn-more&#34;&gt;How can you learn more?&lt;/h2&gt;
&lt;p&gt;If you want to dig deeper:&lt;/p&gt;
&lt;p&gt;Read the user-facing documentation for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/concepts/workloads/controllers/job/&#34;&gt;Jobs&lt;/a&gt;,&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/concepts/workloads/controllers/job/#delegation-of-managing-a-job-object-to-external-controller&#34;&gt;Delegation of managing a Job object to an external controller&lt;/a&gt;, and&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://kueue.sigs.k8s.io/docs/concepts/multikueue/&#34;&gt;MultiKueue&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Deep dive into the design history:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The Kubernetes Enhancement Proposal (KEP) &lt;a href=&#34;https://github.com/kubernetes/enhancements/issues/4368&#34;&gt;Job&#39;s managed-by mechanism&lt;/a&gt; including introduction of the extensive &lt;a href=&#34;https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/4368-support-managed-by-for-batch-jobs#job-status-validation&#34;&gt;Job status validation rules&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;The Kueue KEP for &lt;a href=&#34;https://github.com/kubernetes-sigs/kueue/tree/main/keps/693-multikueue&#34;&gt;MultiKueue&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Explore how MultiKueue uses &lt;code&gt;.spec.managedBy&lt;/code&gt; in practice in the task guide for &lt;a href=&#34;https://kueue.sigs.k8s.io/docs/tasks/run/multikueue/job/&#34;&gt;running Jobs across clusters&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;acknowledgments&#34;&gt;Acknowledgments&lt;/h2&gt;
&lt;p&gt;As with any Kubernetes feature, a lot of people helped shape this one through design discussions, reviews, test runs,
and bug reports.&lt;/p&gt;
&lt;p&gt;We would like to thank, in particular:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/soltysh&#34;&gt;Maciej Szulik&lt;/a&gt; - for guidance, mentorship, and reviews.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/atiratree&#34;&gt;Filip Křepinský&lt;/a&gt; - for guidance, mentorship, and reviews.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;get-involved&#34;&gt;Get involved&lt;/h2&gt;
&lt;p&gt;This work was sponsored by the Kubernetes
&lt;a href=&#34;https://github.com/kubernetes/community/tree/master/wg-batch&#34;&gt;Batch Working Group&lt;/a&gt;
in close collaboration with the
&lt;a href=&#34;https://github.com/kubernetes/community/tree/master/sig-apps&#34;&gt;SIG Apps&lt;/a&gt;,
and with strong input from the
&lt;a href=&#34;https://github.com/kubernetes/community/tree/master/sig-scheduling&#34;&gt;SIG Scheduling&lt;/a&gt; community.&lt;/p&gt;
&lt;p&gt;If you are interested in batch scheduling, multi-cluster solutions, or further improving the Job API:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Join us in the Batch WG and SIG Apps meetings.&lt;/li&gt;
&lt;li&gt;Subscribe to the &lt;a href=&#34;https://kubernetes.slack.com/messages/wg-batch&#34;&gt;WG Batch Slack channel&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

      </description>
    </item>
    
    <item>
      <title>Kubernetes v1.35: Timbernetes (The World Tree Release)</title>
      <link>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/12/17/kubernetes-v1-35-release/</link>
      <pubDate>Wed, 17 Dec 2025 10:30:00 -0800</pubDate>
      
      <guid>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/12/17/kubernetes-v1-35-release/</guid>
      <description>
        
        
        &lt;p&gt;&lt;strong&gt;Editors&lt;/strong&gt;: Aakanksha Bhende, Arujjwal Negi, Chad M. Crowell, Graziano Casto, Swathi Rao&lt;/p&gt;
&lt;p&gt;Similar to previous releases, Kubernetes v1.35 introduces new stable, beta, and alpha features. The consistent delivery of high-quality releases underscores the strength of our development cycle and the vibrant support from our community.&lt;/p&gt;
&lt;p&gt;This release consists of 60 enhancements, including 17 stable, 19 beta, and 22 alpha features.&lt;/p&gt;
&lt;p&gt;There are also some &lt;a href=&#34;#deprecations-removals-and-community-updates&#34;&gt;deprecations and removals&lt;/a&gt; in this release; make sure to read about those.&lt;/p&gt;
&lt;h2 id=&#34;release-theme-and-logo&#34;&gt;Release theme and logo&lt;/h2&gt;


&lt;figure class=&#34;release-logo &#34;&gt;
    &lt;img src=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/12/17/kubernetes-v1-35-release/k8s-v1.35.png&#34;
         alt=&#34;Kubernetes v1.35 Timbernetes logo: a storybook hex badge with a glowing world tree whose branches cradle Earth and a white Kubernetes wheel; three cheerful squirrels stand below—a wizard in a plum robe holding an LGTM scroll, a warrior with an axe and blue Kubernetes shield, and a lantern-carrying rogue in a navy cloak—on green grass above a gold ribbon reading World Tree Release, backed by soft mountains and cloud-swept sky&#34;/&gt; 
&lt;/figure&gt;
&lt;p&gt;2025 began in the shimmer of Octarine: The Color of Magic (v1.33) and rode the gusts Of Wind &amp;amp; Will (v1.34). We close the year with our hands on the World Tree, inspired by Yggdrasil, the tree of life that binds many realms. Like any great tree, Kubernetes grows ring by ring and release by release, shaped by the care of a global community.&lt;/p&gt;
&lt;p&gt;At its center sits the Kubernetes wheel wrapped around the Earth, grounded by the resilient maintainers, contributors and users who keep showing up. Between day jobs, life changes, and steady open-source stewardship, they prune old APIs, graft new features and keep one of the world’s largest open source projects healthy.&lt;/p&gt;
&lt;p&gt;Three squirrels guard the tree: a wizard holding the LGTM scroll for reviewers, a warrior with an axe and Kubernetes shield for the release crews who cut new branches, and a rogue with a lantern for the triagers who bring light to dark issue queues.&lt;/p&gt;
&lt;p&gt;Together, they stand in for a much larger adventuring party. Kubernetes v1.35 adds another growth ring to the World Tree, a fresh cut shaped by many hands, many paths and a community whose branches reach higher as its roots grow deeper.&lt;/p&gt;
&lt;h2 id=&#34;spotlight-on-key-updates&#34;&gt;Spotlight on key updates&lt;/h2&gt;
&lt;p&gt;Kubernetes v1.35 is packed with new features and improvements. Here are a few select updates the Release Team would like to highlight!&lt;/p&gt;
&lt;h3 id=&#34;stable-in-place-update-of-pod-resources&#34;&gt;Stable: In-place update of Pod resources&lt;/h3&gt;
&lt;p&gt;Kubernetes has graduated in-place updates for Pod resources to General Availability (GA).
This feature allows users to adjust CPU and memory resources without restarting Pods or Containers. Previously, such modifications required recreating Pods, which could disrupt workloads, particularly for stateful or batch applications: in earlier Kubernetes releases, the resource settings (requests and limits) of existing Pods were immutable. The new in-place functionality allows for smoother, nondisruptive vertical scaling, improves efficiency, and can also simplify development.&lt;/p&gt;
&lt;p&gt;This work was done as part of &lt;a href=&#34;https://kep.k8s.io/1287&#34;&gt;KEP #1287&lt;/a&gt; led by SIG Node.&lt;/p&gt;
&lt;h3 id=&#34;beta-pod-certificates-for-workload-identity-and-security&#34;&gt;Beta: Pod certificates for workload identity and security&lt;/h3&gt;
&lt;p&gt;Previously, delivering certificates to pods required external controllers (cert-manager, SPIFFE/SPIRE), CRD orchestration, and Secret management, with rotation handled by sidecars or init containers. Kubernetes v1.35 enables native workload identity with automated certificate rotation, drastically simplifying service mesh and zero-trust architectures.&lt;/p&gt;
&lt;p&gt;Now, the &lt;code&gt;kubelet&lt;/code&gt; generates keys, requests certificates via PodCertificateRequest, and writes credential bundles directly to the Pod&#39;s filesystem. The &lt;code&gt;kube-apiserver&lt;/code&gt; enforces node restriction at admission time, eliminating the most common pitfall for third-party signers: accidentally violating node isolation boundaries. This enables pure mTLS flows with no bearer tokens in the issuance path.&lt;/p&gt;
&lt;p&gt;This work was done as part of &lt;a href=&#34;https://kep.k8s.io/4317&#34;&gt;KEP #4317&lt;/a&gt; led by SIG Auth.&lt;/p&gt;
&lt;h3 id=&#34;alpha-node-declared-features-before-scheduling&#34;&gt;Alpha: Node declared features before scheduling&lt;/h3&gt;
&lt;p&gt;When control planes enable new features but nodes lag behind (permitted by Kubernetes skew policy), the scheduler can place pods requiring those features onto incompatible older nodes.
The node declared features framework allows nodes to declare the Kubernetes features they support. With the new alpha feature enabled, a Node publishes this information to the control plane via a new &lt;code&gt;.status.declaredFeatures&lt;/code&gt; field. The &lt;code&gt;kube-scheduler&lt;/code&gt;, admission controllers, and third-party components can then use these declarations. For example, you can enforce scheduling and API validation constraints to ensure that Pods run only on compatible nodes.&lt;/p&gt;
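&lt;p&gt;With the alpha feature enabled, the declarations can be inspected directly on the Node object. Note that this is a sketch: &lt;code&gt;NODE_NAME&lt;/code&gt; is a placeholder, and the output shape may change while the API is alpha:&lt;/p&gt;

```shell
# Print the features a node has declared in its status.
kubectl get node NODE_NAME -o jsonpath='{.status.declaredFeatures}'
```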
&lt;p&gt;This work was done as part of &lt;a href=&#34;https://kep.k8s.io/5328&#34;&gt;KEP #5328&lt;/a&gt; led by SIG Node.&lt;/p&gt;
&lt;h2 id=&#34;features-graduating-to-stable&#34;&gt;Features graduating to Stable&lt;/h2&gt;
&lt;p&gt;&lt;em&gt;This is a selection of some of the improvements that are now stable following the v1.35 release.&lt;/em&gt;&lt;/p&gt;
&lt;h3 id=&#34;prefersamenode-traffic-distribution&#34;&gt;PreferSameNode traffic distribution&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;trafficDistribution&lt;/code&gt; field for Services has been updated to provide more explicit control over traffic routing. A new option, &lt;code&gt;PreferSameNode&lt;/code&gt;, has been introduced to let Services prioritize endpoints on the local node when available, falling back to remote endpoints otherwise.&lt;/p&gt;
&lt;p&gt;Simultaneously, the existing &lt;code&gt;PreferClose&lt;/code&gt; option has been renamed to &lt;code&gt;PreferSameZone&lt;/code&gt;. This change makes the API self-explanatory by explicitly indicating that traffic is preferred within the current availability zone. While &lt;code&gt;PreferClose&lt;/code&gt; is preserved for backward compatibility, &lt;code&gt;PreferSameZone&lt;/code&gt; is now the standard for zonal routing, ensuring that both node-level and zone-level preferences are clearly distinguished.&lt;/p&gt;
&lt;p&gt;This work was done as part of &lt;a href=&#34;https://kep.k8s.io/3015&#34;&gt;KEP #3015&lt;/a&gt; led by SIG Network.&lt;/p&gt;
&lt;h3 id=&#34;job-api-managed-by-mechanism&#34;&gt;Job API managed-by mechanism&lt;/h3&gt;
&lt;p&gt;The Job API now includes a &lt;code&gt;managedBy&lt;/code&gt; field that allows an external controller to handle Job status synchronization. This feature, which graduates to stable in Kubernetes v1.35, is primarily driven by &lt;a href=&#34;https://github.com/kubernetes-sigs/kueue/tree/main/keps/693-multikueue&#34;&gt;MultiKueue&lt;/a&gt;, a multi-cluster dispatching system where a Job created in a management cluster is mirrored and executed in a worker cluster, with status updates propagated back. To enable this workflow, the built-in Job controller must not act on a particular Job resource so that the Kueue controller can manage status updates instead.&lt;/p&gt;
&lt;p&gt;The goal is to allow clean delegation of Job synchronization to another controller. It does not aim to pass custom parameters to that controller or modify CronJob concurrency policies.&lt;/p&gt;
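&lt;p&gt;A Job delegates status synchronization by setting the field at creation time; the controller name below is the one used by MultiKueue, while the image is illustrative:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;apiVersion: batch/v1
kind: Job
metadata:
  name: sample-training-job
spec:
  managedBy: kueue.x-k8s.io/multikueue
  template:
    spec:
      containers:
      - name: worker
        image: registry.example/trainer:v1
      restartPolicy: Never
&lt;/code&gt;&lt;/pre&gt;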
&lt;p&gt;This work was done as part of &lt;a href=&#34;https://kep.k8s.io/4368&#34;&gt;KEP #4368&lt;/a&gt; led by SIG Apps.&lt;/p&gt;
&lt;h3 id=&#34;reliable-pod-update-tracking-with-metadata-generation&#34;&gt;Reliable Pod update tracking with &lt;code&gt;.metadata.generation&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;Historically, the Pod API lacked the &lt;code&gt;metadata.generation&lt;/code&gt; field found in other Kubernetes objects such as Deployments.
Because of this omission, controllers and users had no reliable way to verify whether the &lt;code&gt;kubelet&lt;/code&gt; had actually processed the latest changes to a Pod&#39;s specification. This ambiguity was particularly problematic for features like &lt;a href=&#34;#stable-in-place-update-of-pod-resources&#34;&gt;In-Place Pod Vertical Scaling&lt;/a&gt;, where it was difficult to know exactly when a resource resize request had been enacted.&lt;/p&gt;
&lt;p&gt;Kubernetes v1.33 added the &lt;code&gt;.metadata.generation&lt;/code&gt; field for Pods, as an alpha feature. That field is now stable in the v1.35 Pod API, which means that every time a Pod&#39;s &lt;code&gt;spec&lt;/code&gt; is updated, the &lt;code&gt;.metadata.generation&lt;/code&gt; value is incremented. As part of this improvement, the Pod API also gained a &lt;code&gt;.status.observedGeneration&lt;/code&gt; field, which reports the generation that the &lt;code&gt;kubelet&lt;/code&gt; has successfully seen and processed. Each Pod condition also contains its own &lt;code&gt;observedGeneration&lt;/code&gt; field that clients can set and observe.&lt;/p&gt;
&lt;p&gt;Because this feature has graduated to stable in v1.35, it is available for all workloads.&lt;/p&gt;
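&lt;p&gt;Conceptually, a client compares the two fields to tell whether the &lt;code&gt;kubelet&lt;/code&gt; has caught up with the latest spec change; the values below are illustrative:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;metadata:
  generation: 3          # incremented on each spec update
status:
  observedGeneration: 3  # the kubelet has processed generation 3
&lt;/code&gt;&lt;/pre&gt;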
&lt;p&gt;This work was done as part of &lt;a href=&#34;https://kep.k8s.io/5067&#34;&gt;KEP #5067&lt;/a&gt; led by SIG Node.&lt;/p&gt;
&lt;h3 id=&#34;configurable-numa-node-limit-for-topology-manager&#34;&gt;Configurable NUMA node limit for topology manager&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/concepts/policy/node-resource-managers/&#34;&gt;topology manager&lt;/a&gt; historically used a hard-coded limit of 8 for the maximum number of NUMA nodes it can support, preventing state explosion during affinity calculation. (There&#39;s an important detail here; a &lt;em&gt;NUMA node&lt;/em&gt; is not the same as a Node in the Kubernetes API.) This limit on the number of NUMA nodes prevented Kubernetes from fully utilizing modern high-end servers, which increasingly feature CPU architectures with more than 8 NUMA nodes.&lt;/p&gt;
&lt;p&gt;Kubernetes v1.31 introduced a new, &lt;strong&gt;beta&lt;/strong&gt; &lt;code&gt;max-allowable-numa-nodes&lt;/code&gt; option to the topology manager policy configuration. In Kubernetes v1.35, that option is stable. Cluster administrators who enable it can use servers with more than 8 NUMA nodes.&lt;/p&gt;
&lt;p&gt;Although the configuration option is stable, the Kubernetes community is aware of the poor performance for large NUMA hosts, and there is a &lt;a href=&#34;https://kep.k8s.io/5726&#34;&gt;proposed enhancement&lt;/a&gt; (KEP-5726) that aims to improve on it. You can learn more about this by reading &lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/tasks/administer-cluster/topology-manager/&#34;&gt;Control Topology Management Policies on a node&lt;/a&gt;.&lt;/p&gt;
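&lt;p&gt;Cluster administrators set the option through the kubelet configuration; a minimal sketch (the limit value is illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
topologyManagerPolicy: single-numa-node
topologyManagerPolicyOptions:
  max-allowable-numa-nodes: &#34;16&#34;
&lt;/code&gt;&lt;/pre&gt;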
&lt;p&gt;This work was done as part of &lt;a href=&#34;https://kep.k8s.io/4622&#34;&gt;KEP #4622&lt;/a&gt; led by SIG Node.&lt;/p&gt;
&lt;h2 id=&#34;new-features-in-beta&#34;&gt;New features in Beta&lt;/h2&gt;
&lt;p&gt;&lt;em&gt;This is a selection of some of the improvements that are now beta following the v1.35 release.&lt;/em&gt;&lt;/p&gt;
&lt;h3 id=&#34;expose-node-topology-labels-via-downward-api&#34;&gt;Expose node topology labels via Downward API&lt;/h3&gt;
&lt;p&gt;Accessing node topology information, such as region and zone, from within a Pod has typically required querying the Kubernetes API server. While functional, this approach creates complexity and security risks by necessitating broad RBAC permissions or sidecar containers just to retrieve infrastructure metadata. Kubernetes v1.35 promotes the capability to expose node topology labels directly via the Downward API to beta.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;kubelet&lt;/code&gt; can now inject standard topology labels, such as &lt;code&gt;topology.kubernetes.io/zone&lt;/code&gt; and &lt;code&gt;topology.kubernetes.io/region&lt;/code&gt;, into Pods as environment variables or projected volume files. The primary benefit is a safer and more efficient way for workloads to be topology-aware. This allows applications to natively adapt to their availability zone or region without dependencies on the API server, strengthening security by upholding the principle of least privilege and simplifying cluster configuration.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Kubernetes now injects available topology labels onto every Pod so that they can be used as inputs to the &lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/concepts/workloads/pods/downward-api/&#34;&gt;downward API&lt;/a&gt;. With the v1.35 upgrade, most cluster administrators will see several new labels added to each Pod; this is expected as part of the design.&lt;/p&gt;
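&lt;p&gt;For example, a Pod can read the zone of its node from the injected label via the downward API (the variable name and image are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: topology-aware-app
spec:
  containers:
  - name: app
    image: registry.example/app:v1
    env:
    - name: NODE_ZONE
      valueFrom:
        fieldRef:
          fieldPath: metadata.labels[&#39;topology.kubernetes.io/zone&#39;]
&lt;/code&gt;&lt;/pre&gt;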
&lt;p&gt;This work was done as part of &lt;a href=&#34;https://kep.k8s.io/4742&#34;&gt;KEP #4742&lt;/a&gt; led by SIG Node.&lt;/p&gt;
&lt;h3 id=&#34;native-support-for-storage-version-migration&#34;&gt;Native support for storage version migration&lt;/h3&gt;
&lt;p&gt;In Kubernetes v1.35, the native support for storage version migration graduates to beta and is enabled by default. This move integrates the migration logic directly into the core Kubernetes control plane (&amp;quot;in-tree&amp;quot;), eliminating the dependency on external tools.&lt;/p&gt;
&lt;p&gt;Historically, administrators relied on manual &amp;quot;read/write loops&amp;quot;—often piping &lt;code&gt;kubectl get&lt;/code&gt; into &lt;code&gt;kubectl replace&lt;/code&gt;—to update schemas or re-encrypt data at rest. This method was inefficient and prone to conflicts, especially for large resources like Secrets. With this release, the built-in controller automatically handles update conflicts and consistency tokens, providing a safe, streamlined, and reliable way to ensure stored data remains current with minimal operational overhead.&lt;/p&gt;
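&lt;p&gt;Migrations are requested declaratively. A minimal sketch for re-encrypting Secrets might look like the following; the exact API version available may differ in your cluster:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;apiVersion: storagemigration.k8s.io/v1alpha1
kind: StorageVersionMigration
metadata:
  name: secrets-migration
spec:
  resource:
    group: &#34;&#34;
    version: v1
    resource: secrets
&lt;/code&gt;&lt;/pre&gt;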
&lt;p&gt;This work was done as part of &lt;a href=&#34;https://kep.k8s.io/4192&#34;&gt;KEP #4192&lt;/a&gt; led by SIG API Machinery.&lt;/p&gt;
&lt;h3 id=&#34;mutable-volume-attach-limits&#34;&gt;Mutable Volume attach limits&lt;/h3&gt;
&lt;p&gt;A CSI (Container Storage Interface) driver is a Kubernetes plugin that provides a consistent way for storage systems to be exposed to containerized workloads. The &lt;code&gt;CSINode&lt;/code&gt; object records details about all CSI drivers installed on a node. However, a mismatch can arise between the reported and actual attachment capacity on nodes. When volume slots are consumed after a CSI driver starts up, the &lt;code&gt;kube-scheduler&lt;/code&gt; may assign stateful pods to nodes without sufficient capacity, leaving those pods stuck in a &lt;code&gt;ContainerCreating&lt;/code&gt; state.&lt;/p&gt;
&lt;p&gt;Kubernetes v1.35 makes &lt;code&gt;CSINode.spec.drivers[*].allocatable.count&lt;/code&gt; mutable so that a node’s available volume attachment capacity can be updated dynamically. It also allows CSI drivers to control how frequently the &lt;code&gt;allocatable.count&lt;/code&gt; value is updated on all nodes by introducing a configurable refresh interval, defined through the &lt;code&gt;CSIDriver&lt;/code&gt; object. Additionally, it automatically updates &lt;code&gt;CSINode.spec.drivers[*].allocatable.count&lt;/code&gt; on detecting a failure in volume attachment due to insufficient capacity. Although this feature graduated to beta in v1.34 with the feature flag &lt;code&gt;MutableCSINodeAllocatableCount&lt;/code&gt; disabled by default, it remains in beta for v1.35 to allow time for feedback, but the feature flag is enabled by default.&lt;/p&gt;
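&lt;p&gt;A CSI driver opts into periodic capacity refresh through its &lt;code&gt;CSIDriver&lt;/code&gt; object; this sketch assumes the &lt;code&gt;nodeAllocatableUpdatePeriodSeconds&lt;/code&gt; field name from KEP #4876, and the driver name is illustrative:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  name: ebs.csi.example.com
spec:
  nodeAllocatableUpdatePeriodSeconds: 60
&lt;/code&gt;&lt;/pre&gt;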
&lt;p&gt;This work was done as part of &lt;a href=&#34;https://kep.k8s.io/4876&#34;&gt;KEP #4876&lt;/a&gt; led by SIG Storage.&lt;/p&gt;
&lt;h3 id=&#34;opportunistic-batching&#34;&gt;Opportunistic batching&lt;/h3&gt;
&lt;p&gt;Historically, the Kubernetes scheduler processes pods sequentially with time complexity of &lt;code&gt;O(num pods × num nodes)&lt;/code&gt;, which can result in redundant computation for compatible pods. This KEP introduces an opportunistic batching mechanism that aims to improve performance by identifying such compatible Pods via &lt;code&gt;Pod scheduling signature&lt;/code&gt; and batching them together, allowing shared filtering and scoring results across them.&lt;/p&gt;
&lt;p&gt;The pod scheduling signature ensures that two pods with the same signature are “the same” from a scheduling perspective. It takes into account not only the pod and node attributes, but also the other pods in the system and global data about the pod placement. This means that any pod with the given signature will get the same scores/feasibility results from any arbitrary set of nodes.&lt;/p&gt;
&lt;p&gt;The batching mechanism consists of two operations that can be invoked whenever needed: &lt;em&gt;create&lt;/em&gt; and &lt;em&gt;nominate&lt;/em&gt;. Create builds a new set of batch information from the scheduling results of Pods that have a valid signature. Nominate uses the batch information from create to set the nominated node name for a new Pod whose signature matches the canonical Pod&#39;s signature.&lt;/p&gt;
&lt;p&gt;This work was done as part of &lt;a href=&#34;https://kep.k8s.io/5598&#34;&gt;KEP #5598&lt;/a&gt; led by SIG Scheduling.&lt;/p&gt;
&lt;h3 id=&#34;maxunavailable-for-statefulsets&#34;&gt;&lt;code&gt;maxUnavailable&lt;/code&gt; for StatefulSets&lt;/h3&gt;
&lt;p&gt;A StatefulSet runs a group of Pods and maintains a sticky identity for each of those Pods. This is critical for stateful workloads requiring stable network identifiers or persistent storage. When a StatefulSet&#39;s &lt;code&gt;.spec.updateStrategy.type&lt;/code&gt; is set to &lt;code&gt;RollingUpdate&lt;/code&gt;, the StatefulSet controller will delete and recreate each Pod in the StatefulSet. It will proceed in the same order as Pod termination (from the largest ordinal to the smallest), updating each Pod one at a time.&lt;/p&gt;
&lt;p&gt;Kubernetes v1.24 added a new &lt;strong&gt;alpha&lt;/strong&gt; field to a StatefulSet&#39;s &lt;code&gt;rollingUpdate&lt;/code&gt; configuration settings, called &lt;code&gt;maxUnavailable&lt;/code&gt;. That field wasn&#39;t part of the Kubernetes API unless your cluster administrator explicitly opted in.
In Kubernetes v1.35 that field is beta and is available by default. You can use it to define the maximum number of Pods that can be unavailable during an update. This setting is most effective in combination with &lt;code&gt;.spec.podManagementPolicy&lt;/code&gt; set to &lt;code&gt;Parallel&lt;/code&gt;. You can set &lt;code&gt;maxUnavailable&lt;/code&gt; as either a positive number (example: 2) or a percentage of the desired number of Pods (example: 10%). If this field is not specified, it defaults to 1, maintaining the previous behavior of only updating one Pod at a time. This improvement allows stateful applications (that can tolerate more than one Pod being down) to finish updating faster.&lt;/p&gt;
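&lt;p&gt;For example, a StatefulSet that tolerates two Pods being down during an update (names and image are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  replicas: 5
  serviceName: web
  podManagementPolicy: Parallel
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 2
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: registry.example/web:v2
&lt;/code&gt;&lt;/pre&gt;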
&lt;p&gt;This work was done as part of &lt;a href=&#34;https://kep.k8s.io/961&#34;&gt;KEP #961&lt;/a&gt; led by SIG Apps.&lt;/p&gt;
&lt;h3 id=&#34;configurable-credential-plugin-policy-in-kuberc&#34;&gt;Configurable credential plugin policy in &lt;code&gt;kuberc&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;The optional &lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/reference/kubectl/kuberc/&#34;&gt;&lt;code&gt;kuberc&lt;/code&gt; file&lt;/a&gt; is a way to separate server configurations and cluster credentials from user preferences without disrupting already running CI pipelines with unexpected outputs.&lt;/p&gt;
&lt;p&gt;As part of the v1.35 release, &lt;code&gt;kuberc&lt;/code&gt; gains additional functionality that lets users configure credential plugin policy. This change introduces two fields: &lt;code&gt;credentialPluginPolicy&lt;/code&gt;, which allows or denies all plugins, and &lt;code&gt;credentialPluginAllowlist&lt;/code&gt;, which specifies a list of allowed plugins.&lt;/p&gt;
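&lt;p&gt;A &lt;code&gt;kuberc&lt;/code&gt; preference file using the new fields might look like the sketch below; the policy value and allowlist entry shape shown here are hypothetical, so check the &lt;code&gt;kuberc&lt;/code&gt; reference for the exact accepted values:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;apiVersion: kubectl.config.k8s.io/v1beta1
kind: Preference
credentialPluginPolicy: Allowlist
credentialPluginAllowlist:
- gke-gcloud-auth-plugin
&lt;/code&gt;&lt;/pre&gt;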
&lt;p&gt;This work was done as part of &lt;a href=&#34;https://kep.k8s.io/3104&#34;&gt;KEP #3104&lt;/a&gt; as a cooperation between SIG Auth and SIG CLI.&lt;/p&gt;
&lt;h3 id=&#34;kyaml&#34;&gt;KYAML&lt;/h3&gt;
&lt;p&gt;YAML is a human-readable data serialization format. In Kubernetes, YAML files are used to define and configure resources, such as Pods, Services, and Deployments. However, complex YAML is difficult to read. YAML&#39;s significant whitespace requires careful attention to indentation and nesting, while its optional string-quoting can lead to unexpected type coercion (see: the &amp;quot;Norway Bug&amp;quot;). While JSON is an alternative, it lacks support for comments and has strict requirements for trailing commas and quoted keys.&lt;/p&gt;
&lt;p&gt;KYAML is a safer and less ambiguous subset of YAML designed specifically for Kubernetes. Introduced as an opt-in alpha feature in v1.34, this feature graduated to beta in Kubernetes v1.35 and has been enabled by default. It can be disabled by setting the environment variable &lt;code&gt;KUBECTL_KYAML=false&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;KYAML addresses challenges pertaining to both YAML and JSON. All KYAML files are also valid YAML files. This means you can write KYAML and pass it as an input to any version of kubectl. This also means that you don’t need to write in strict KYAML for the input to be parsed.&lt;/p&gt;
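&lt;p&gt;For example, a small ConfigMap written in KYAML. The explicit braces and always-quoted strings make the structure unambiguous, and quoting &lt;code&gt;&amp;quot;NO&amp;quot;&lt;/code&gt; avoids the type coercion that would otherwise read it as a boolean:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;{
  apiVersion: &#34;v1&#34;,
  kind: &#34;ConfigMap&#34;,
  metadata: { name: &#34;demo&#34; },
  data: {
    country: &#34;NO&#34;,
  },
}
&lt;/code&gt;&lt;/pre&gt;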
&lt;p&gt;This work was done as part of &lt;a href=&#34;https://kep.k8s.io/5295&#34;&gt;KEP #5295&lt;/a&gt; led by SIG CLI.&lt;/p&gt;
&lt;h3 id=&#34;configurable-tolerance-for-horizontalpodautoscalers&#34;&gt;Configurable tolerance for HorizontalPodAutoscalers&lt;/h3&gt;
&lt;p&gt;The Horizontal Pod Autoscaler (HPA) has historically relied on a fixed, global 10% tolerance for scaling actions. A drawback of this hardcoded value was that workloads requiring high sensitivity, such as those needing to scale on a 5% load increase, were often blocked from scaling, while others might oscillate unnecessarily.&lt;/p&gt;
&lt;p&gt;With Kubernetes v1.35, the configurable tolerance feature graduates to beta and is enabled by default. This enhancement allows users to define a custom tolerance window on a per-resource basis within the HPA &lt;code&gt;behavior&lt;/code&gt; field. By setting a specific tolerance (e.g., lowering it to 0.05 for 5%), operators gain precise control over autoscaling sensitivity, ensuring that critical workloads react quickly to small metric changes, without requiring cluster-wide configuration adjustments.&lt;/p&gt;
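&lt;p&gt;For example, an HPA that reacts to a 5% metric change when scaling up; the field placement is a sketch based on KEP #4951, and the target names are illustrative:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sensitive-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: sensitive-app
  minReplicas: 2
  maxReplicas: 20
  behavior:
    scaleUp:
      tolerance: 0.05
&lt;/code&gt;&lt;/pre&gt;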
&lt;p&gt;This work was done as part of &lt;a href=&#34;https://kep.k8s.io/4951&#34;&gt;KEP #4951&lt;/a&gt; led by SIG Autoscaling.&lt;/p&gt;
&lt;h3 id=&#34;support-for-user-namespaces-in-pods&#34;&gt;Support for user namespaces in Pods&lt;/h3&gt;
&lt;p&gt;Kubernetes supports user namespaces, allowing pods to run with isolated user and group ID mappings instead of sharing host IDs. This means containers can operate as root internally while actually being mapped to an unprivileged user on the host, reducing the risk of privilege escalation in the event of a compromise. The feature improves pod-level security and makes it safer to run workloads that need root inside the container. Over time, support has expanded to both stateless and stateful Pods through ID-mapped mounts.&lt;/p&gt;
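&lt;p&gt;A Pod opts into a user namespace by setting &lt;code&gt;hostUsers&lt;/code&gt; to false (the name and image are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: userns-demo
spec:
  hostUsers: false
  containers:
  - name: app
    image: registry.example/app:v1
&lt;/code&gt;&lt;/pre&gt;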
&lt;p&gt;This work was done as part of &lt;a href=&#34;https://kep.k8s.io/127&#34;&gt;KEP #127&lt;/a&gt; led by SIG Node.&lt;/p&gt;
&lt;h3 id=&#34;volumesource-oci-artifact-and-or-image&#34;&gt;VolumeSource: OCI artifact and/or image&lt;/h3&gt;
&lt;p&gt;When creating a Pod, you often need to provide data, binaries, or configuration files for your containers. Traditionally, this meant building the content into the main container image or using a custom init container to download and unpack files into an &lt;code&gt;emptyDir&lt;/code&gt; volume. Both these approaches are still valid. Kubernetes v1.31 added support for the &lt;code&gt;image&lt;/code&gt; volume type allowing Pods to declaratively pull and unpack OCI container image artifacts into a volume. This lets you package and deliver data-only artifacts such as configs, binaries, or machine learning models using standard OCI registry tools.&lt;/p&gt;
&lt;p&gt;With this feature, you can fully separate your data from your container image and remove the need for extra init containers or startup scripts. The image volume type has been in beta since v1.33 and is enabled by default in v1.35. Please note that using this feature requires a compatible container runtime, such as containerd v2.1 or later.&lt;/p&gt;
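&lt;p&gt;For example, mounting a model artifact from an OCI registry into a Pod (the references and paths are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: model-server
spec:
  containers:
  - name: server
    image: registry.example/server:v1
    volumeMounts:
    - name: models
      mountPath: /models
      readOnly: true
  volumes:
  - name: models
    image:
      reference: registry.example/models:v1
      pullPolicy: IfNotPresent
&lt;/code&gt;&lt;/pre&gt;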
&lt;p&gt;This work was done as part of &lt;a href=&#34;https://kep.k8s.io/4639&#34;&gt;KEP #4639&lt;/a&gt; led by SIG Node.&lt;/p&gt;
&lt;h3 id=&#34;enforced-kubelet-credential-verification-for-cached-images&#34;&gt;Enforced &lt;code&gt;kubelet&lt;/code&gt; credential verification for cached images&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;imagePullPolicy: IfNotPresent&lt;/code&gt; setting currently allows a Pod to use a container image that is already cached on a node, even if the Pod itself does not possess the credentials to pull that image. A drawback of this behavior is that it creates a security vulnerability in multi-tenant clusters: if a Pod with valid credentials pulls a sensitive private image to a node, a subsequent unauthorized Pod on the same node can access that image simply by relying on the local cache.&lt;/p&gt;
&lt;p&gt;This KEP introduces a mechanism where the &lt;code&gt;kubelet&lt;/code&gt; enforces credential verification for cached images. Before allowing a Pod to use a locally cached image, the &lt;code&gt;kubelet&lt;/code&gt; checks if the Pod has the valid credentials to pull it. This ensures that only authorized workloads can use private images, regardless of whether they are already present on the node, significantly hardening the security posture for shared clusters.&lt;/p&gt;
&lt;p&gt;In Kubernetes v1.35, this feature has graduated to beta and is enabled by default. Users can still disable it by setting the &lt;code&gt;KubeletEnsureSecretPulledImages&lt;/code&gt; feature gate to false. Additionally, the &lt;code&gt;imagePullCredentialsVerificationPolicy&lt;/code&gt; flag allows operators to configure the desired security level, ranging from a mode that prioritizes backward compatibility to a strict enforcement mode that offers maximum security.&lt;/p&gt;
&lt;p&gt;This work was done as part of &lt;a href=&#34;https://kep.k8s.io/2535&#34;&gt;KEP #2535&lt;/a&gt; led by SIG Node.&lt;/p&gt;
&lt;h3 id=&#34;fine-grained-container-restart-rules&#34;&gt;Fine-grained Container restart rules&lt;/h3&gt;
&lt;p&gt;Historically, the &lt;code&gt;restartPolicy&lt;/code&gt; field was defined strictly at the Pod level, forcing the same behavior on all containers within a Pod. A drawback of this global setting was the lack of granularity for complex workloads, such as AI/ML training jobs. These often required &lt;code&gt;restartPolicy: Never&lt;/code&gt; for the Pod to manage job completion, yet individual containers would benefit from in-place restarts for specific, retriable errors (like network glitches or GPU init failures).&lt;/p&gt;
&lt;p&gt;Kubernetes v1.35 addresses this by enabling &lt;code&gt;restartPolicy&lt;/code&gt; and &lt;code&gt;restartPolicyRules&lt;/code&gt; within the container API itself. This allows users to define restart strategies for individual regular and init containers that operate independently of the Pod&#39;s overall policy. For example, a container can now be configured to restart automatically only if it exits with a specific error code, avoiding the expensive overhead of rescheduling the entire Pod for a transient failure.&lt;/p&gt;
&lt;p&gt;In this release, the feature has graduated to beta and is enabled by default. Users can immediately leverage &lt;code&gt;restartPolicyRules&lt;/code&gt; in their container specifications to optimize recovery times and resource utilization for long-running workloads, without altering the broader lifecycle logic of their Pods.&lt;/p&gt;
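&lt;p&gt;A container that restarts in place only on a specific retriable exit code might be declared as follows; the rule shape is a sketch based on KEP #5307, and the exit code and image are illustrative:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: trainer
spec:
  restartPolicy: Never
  containers:
  - name: train
    image: registry.example/trainer:v1
    restartPolicy: Never
    restartPolicyRules:
    - action: Restart
      exitCodes:
        operator: In
        values: [42]
&lt;/code&gt;&lt;/pre&gt;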
&lt;p&gt;This work was done as part of &lt;a href=&#34;https://kep.k8s.io/5307&#34;&gt;KEP #5307&lt;/a&gt; led by SIG Node.&lt;/p&gt;
&lt;h3 id=&#34;csi-driver-opt-in-for-service-account-tokens-via-secrets-field&#34;&gt;CSI driver opt-in for service account tokens via secrets field&lt;/h3&gt;
&lt;p&gt;Providing ServiceAccount tokens to Container Storage Interface (CSI) drivers has traditionally relied on injecting them into the &lt;code&gt;volume_context&lt;/code&gt; field. This approach presents a significant security risk because &lt;code&gt;volume_context&lt;/code&gt; is intended for non-sensitive configuration data and is frequently logged in plain text by drivers and debugging tools, potentially leaking credentials.&lt;/p&gt;
&lt;p&gt;Kubernetes v1.35 introduces an opt-in mechanism for CSI drivers to receive ServiceAccount tokens via the dedicated secrets field in the NodePublishVolume request. Drivers can now enable this behavior by setting the &lt;code&gt;serviceAccountTokenInSecrets&lt;/code&gt; field to true in their CSIDriver object, instructing the &lt;code&gt;kubelet&lt;/code&gt; to populate the token securely.&lt;/p&gt;
&lt;p&gt;The primary benefit is the prevention of accidental credential exposure in logs and error messages. This change ensures that sensitive workload identities are handled via the appropriate secure channels, aligning with best practices for secret management while maintaining backward compatibility for existing drivers.&lt;/p&gt;
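&lt;p&gt;A driver opts in through its &lt;code&gt;CSIDriver&lt;/code&gt; object, alongside its existing token request configuration; the driver name and audience below are illustrative:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  name: secrets-store.csi.example.com
spec:
  tokenRequests:
  - audience: csi.example.com
  serviceAccountTokenInSecrets: true
&lt;/code&gt;&lt;/pre&gt;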
&lt;p&gt;This work was done as part of &lt;a href=&#34;https://kep.k8s.io/5538&#34;&gt;KEP #5538&lt;/a&gt; led by SIG Auth in cooperation with SIG Storage.&lt;/p&gt;
&lt;h3 id=&#34;deployment-status-count-of-terminating-replicas&#34;&gt;Deployment status: count of terminating replicas&lt;/h3&gt;
&lt;p&gt;Historically, the Deployment status provided details on available and updated replicas but lacked explicit visibility into Pods that were in the process of shutting down. A drawback of this omission was that users and controllers could not easily distinguish between a stable Deployment and one that still had Pods executing cleanup tasks or adhering to long grace periods.&lt;/p&gt;
&lt;p&gt;Kubernetes v1.35 promotes the &lt;code&gt;terminatingReplicas&lt;/code&gt; field within the Deployment status to beta. This field provides a count of Pods that have a deletion timestamp set but have not yet been removed from the system. This feature is a foundational step in a larger initiative to improve how Deployments handle Pod replacement, laying the groundwork for future policies regarding when to create new Pods during a rollout.&lt;/p&gt;
&lt;p&gt;The primary benefit is improved observability for lifecycle management tools and operators. By exposing the number of terminating Pods, external systems can now make more informed decisions such as waiting for a complete shutdown before proceeding with subsequent tasks without needing to manually query and filter individual Pod lists.&lt;/p&gt;
&lt;p&gt;This work was done as part of &lt;a href=&#34;https://kep.k8s.io/3973&#34;&gt;KEP #3973&lt;/a&gt; led by SIG Apps.&lt;/p&gt;
&lt;h2 id=&#34;new-features-in-alpha&#34;&gt;New features in Alpha&lt;/h2&gt;
&lt;p&gt;&lt;em&gt;This is a selection of some of the improvements that are now alpha following the v1.35 release.&lt;/em&gt;&lt;/p&gt;
&lt;h3 id=&#34;gang-scheduling-support-in-kubernetes&#34;&gt;Gang scheduling support in Kubernetes&lt;/h3&gt;
&lt;p&gt;Scheduling interdependent workloads, such as AI/ML training jobs or HPC simulations, has traditionally been challenging because the default Kubernetes scheduler places Pods individually. This often leads to partial scheduling where some Pods start while others wait indefinitely for resources, resulting in deadlocks and wasted cluster capacity.&lt;/p&gt;
&lt;p&gt;Kubernetes v1.35 introduces native support for so-called &amp;quot;gang scheduling&amp;quot; via the new Workload API and PodGroup concept. This feature implements an &amp;quot;all-or-nothing&amp;quot; scheduling strategy, ensuring that a defined group of Pods is scheduled only if the cluster has sufficient resources to accommodate the entire group simultaneously.&lt;/p&gt;
&lt;p&gt;The primary benefit is improved reliability and efficiency for batch and parallel workloads. By preventing partial deployments, it eliminates resource deadlocks and ensures that expensive cluster capacity is utilized only when a complete job can run, significantly optimizing the orchestration of large-scale data processing tasks.&lt;/p&gt;
&lt;p&gt;This work was done as part of &lt;a href=&#34;https://kep.k8s.io/4671&#34;&gt;KEP #4671&lt;/a&gt; led by SIG Scheduling.&lt;/p&gt;
&lt;h3 id=&#34;constrained-impersonation&#34;&gt;Constrained impersonation&lt;/h3&gt;
&lt;p&gt;Historically, the &lt;code&gt;impersonate&lt;/code&gt; verb in Kubernetes RBAC functioned on an all-or-nothing basis: once a user was authorized to impersonate a target identity, they gained all associated permissions. A drawback of this broad authorization was that it violated the principle of least privilege, preventing administrators from restricting impersonators to specific actions or resources.&lt;/p&gt;
&lt;p&gt;Kubernetes v1.35 introduces a new alpha feature, constrained impersonation, which adds a secondary authorization check to the impersonation flow. When enabled via the &lt;code&gt;ConstrainedImpersonation&lt;/code&gt; feature gate, the API server verifies not only the basic &lt;code&gt;impersonate&lt;/code&gt; permission but also checks if the impersonator is authorized for the specific action using new verb prefixes (e.g., &lt;code&gt;impersonate-on:&amp;lt;mode&amp;gt;:&amp;lt;verb&amp;gt;&lt;/code&gt;). This allows administrators to define fine-grained policies—such as permitting a support engineer to impersonate a cluster admin solely to view logs, without granting full administrative access.&lt;/p&gt;
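&lt;p&gt;An RBAC rule using the new verb prefix might look like the sketch below; the mode and verb combination shown is illustrative, so consult the KEP for the exact grammar:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: impersonate-log-reader
rules:
- apiGroups: [&#34;&#34;]
  resources: [&#34;pods/log&#34;]
  verbs: [&#34;impersonate-on:serviceaccount:get&#34;]
&lt;/code&gt;&lt;/pre&gt;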
&lt;p&gt;This work was done as part of &lt;a href=&#34;https://kep.k8s.io/5284&#34;&gt;KEP #5284&lt;/a&gt; led by SIG Auth.&lt;/p&gt;
&lt;h3 id=&#34;flagz-for-kubernetes-components&#34;&gt;Flagz for Kubernetes components&lt;/h3&gt;
&lt;p&gt;Verifying the runtime configuration of Kubernetes components, such as the API server or &lt;code&gt;kubelet&lt;/code&gt;, has traditionally required privileged access to the host node or process arguments. To address this, the &lt;code&gt;/flagz&lt;/code&gt; endpoint was introduced to expose command-line options via HTTP. However, its output was initially limited to plain text, making it difficult for automated tools to parse and validate configurations reliably.&lt;/p&gt;
&lt;p&gt;In Kubernetes v1.35, the &lt;code&gt;/flagz&lt;/code&gt; endpoint has been enhanced to support structured, machine-readable JSON output. Authorized users can now request a versioned JSON response using standard HTTP content negotiation, while the original plain text format remains available for human inspection. This update significantly improves observability and compliance workflows, allowing external systems to programmatically audit component configurations without fragile text parsing or direct infrastructure access.&lt;/p&gt;
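&lt;p&gt;Conceptually, an authorized client selects the JSON form through the standard &lt;code&gt;Accept&lt;/code&gt; header; a request sketch (the response schema is versioned and not shown here):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;GET /flagz HTTP/1.1
Accept: application/json
&lt;/code&gt;&lt;/pre&gt;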
&lt;p&gt;This work was done as part of &lt;a href=&#34;https://kep.k8s.io/4828&#34;&gt;KEP #4828&lt;/a&gt; led by SIG Instrumentation.&lt;/p&gt;
&lt;h3 id=&#34;statusz-for-kubernetes-components&#34;&gt;Statusz for Kubernetes components&lt;/h3&gt;
&lt;p&gt;Troubleshooting Kubernetes components like the &lt;code&gt;kube-apiserver&lt;/code&gt; or &lt;code&gt;kubelet&lt;/code&gt; has traditionally involved parsing unstructured logs or text output, which is brittle and difficult to automate. While a basic &lt;code&gt;/statusz&lt;/code&gt; endpoint existed previously, it lacked a standardized, machine-readable format, limiting its utility for external monitoring systems.&lt;/p&gt;
&lt;p&gt;In Kubernetes v1.35, the &lt;code&gt;/statusz&lt;/code&gt; endpoint has been enhanced to support structured, machine-readable JSON output. Authorized users can now request this format using standard HTTP content negotiation to retrieve precise status data—such as version information and health indicators—without relying on fragile text parsing. This improvement provides a reliable, consistent interface for automated debugging and observability tools across all core components.&lt;/p&gt;
&lt;p&gt;This work was done as part of &lt;a href=&#34;https://kep.k8s.io/4827&#34;&gt;KEP #4827&lt;/a&gt; led by SIG Instrumentation.&lt;/p&gt;
&lt;h3 id=&#34;ccm-watch-based-route-controller-reconciliation-using-informers&#34;&gt;CCM: watch-based route controller reconciliation using informers&lt;/h3&gt;
&lt;p&gt;Managing network routes within cloud environments has traditionally relied on the Cloud Controller Manager (CCM) periodically polling the cloud provider&#39;s API to verify and update route tables. This fixed-interval reconciliation approach can be inefficient, often generating a high volume of unnecessary API calls and introducing latency between a node state change and the corresponding route update.&lt;/p&gt;
&lt;p&gt;For the Kubernetes v1.35 release, the cloud-controller-manager library introduces a watch-based reconciliation strategy for the route controller. Instead of relying on a timer, the controller now utilizes informers to watch for specific Node events, such as additions, deletions, or relevant field updates, and triggers route synchronization only when a change actually occurs.&lt;/p&gt;
&lt;p&gt;The primary benefit is a significant reduction in cloud provider API usage, which lowers the risk of hitting rate limits and reduces operational overhead. Additionally, this event-driven model improves the responsiveness of the cluster&#39;s networking layer by ensuring that route tables are updated immediately following changes in cluster topology.&lt;/p&gt;
&lt;p&gt;This work was done as part of &lt;a href=&#34;https://kep.k8s.io/5237&#34;&gt;KEP #5237&lt;/a&gt; led by SIG Cloud Provider.&lt;/p&gt;
&lt;h3 id=&#34;extended-toleration-operators-for-threshold-based-placement&#34;&gt;Extended toleration operators for threshold-based placement&lt;/h3&gt;
&lt;p&gt;Kubernetes v1.35 introduces SLA-aware scheduling by enabling workloads to express reliability requirements. The feature adds numeric comparison operators to tolerations, allowing pods to match or avoid nodes based on SLA-oriented taints such as service guarantees or fault-domain quality.&lt;/p&gt;
&lt;p&gt;The primary benefit is more precise placement. Critical workloads can demand higher-SLA nodes, while lower-priority workloads can opt into lower-SLA ones. This improves utilization and reduces cost without compromising reliability.&lt;/p&gt;
&lt;p&gt;This work was done as part of &lt;a href=&#34;https://kep.k8s.io/5471&#34;&gt;KEP #5471&lt;/a&gt; led by SIG Scheduling.&lt;/p&gt;
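&lt;p&gt;As an illustration, a toleration using a numeric comparison operator might look like the sketch below; the operator name, taint key, and values are illustrative, so consult KEP #5471 for the exact API:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;tolerations:
- key: &amp;quot;example.com/sla-tier&amp;quot;   # hypothetical SLA-oriented taint key
  operator: &amp;quot;Gt&amp;quot;                 # numeric comparison (illustrative name)
  value: &amp;quot;99&amp;quot;                    # tolerate nodes whose taint value exceeds 99
  effect: &amp;quot;NoSchedule&amp;quot;
&lt;/code&gt;&lt;/pre&gt;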
&lt;h3 id=&#34;mutable-container-resources-when-job-is-suspended&#34;&gt;Mutable container resources when Job is suspended&lt;/h3&gt;
&lt;p&gt;Running batch workloads often involves trial and error with resource limits. Previously, the Job specification was immutable, meaning that if a Job failed due to an Out of Memory (OOM) error or insufficient CPU, the user could not simply adjust the resources; they had to delete the Job and create a new one, losing the execution history and status.&lt;/p&gt;
&lt;p&gt;Kubernetes v1.35 introduces the capability to update resource requests and limits for Jobs that are in a suspended state. Enabled via the &lt;code&gt;MutablePodResourcesForSuspendedJobs&lt;/code&gt; feature gate, this enhancement allows users to pause a failing Job, modify its Pod template with appropriate resource values, and then resume execution with the corrected configuration.&lt;/p&gt;
&lt;p&gt;The primary benefit is a smoother recovery workflow for misconfigured jobs. By allowing in-place corrections during suspension, users can resolve resource bottlenecks without disrupting the Job&#39;s lifecycle identity or losing track of its completion status, significantly improving the developer experience for batch processing.&lt;/p&gt;
&lt;p&gt;This work was done as part of &lt;a href=&#34;https://kep.k8s.io/5440&#34;&gt;KEP #5440&lt;/a&gt; led by SIG Apps.&lt;/p&gt;
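&lt;p&gt;Assuming the &lt;code&gt;MutablePodResourcesForSuspendedJobs&lt;/code&gt; feature gate is enabled, the recovery workflow might look like this sketch (the Job name, container index, and values are placeholders):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# 1. Suspend the failing Job
kubectl patch job my-job --type=merge -p '{&amp;quot;spec&amp;quot;:{&amp;quot;suspend&amp;quot;:true}}'

# 2. Raise the memory limit of the first container in the Pod template
kubectl patch job my-job --type=json \
  -p='[{&amp;quot;op&amp;quot;:&amp;quot;replace&amp;quot;,&amp;quot;path&amp;quot;:&amp;quot;/spec/template/spec/containers/0/resources/limits/memory&amp;quot;,&amp;quot;value&amp;quot;:&amp;quot;4Gi&amp;quot;}]'

# 3. Resume the Job with the corrected configuration
kubectl patch job my-job --type=merge -p '{&amp;quot;spec&amp;quot;:{&amp;quot;suspend&amp;quot;:false}}'
&lt;/code&gt;&lt;/pre&gt;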
&lt;h2 id=&#34;other-notable-changes&#34;&gt;Other notable changes&lt;/h2&gt;
&lt;h3 id=&#34;continued-innovation-in-dynamic-resource-allocation-dra&#34;&gt;Continued innovation in Dynamic Resource Allocation (DRA)&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&#34;https://kep.k8s.io/4381&#34;&gt;core functionality&lt;/a&gt; graduated to stable in v1.34, though it could still be disabled; in v1.35 it is always enabled. Several alpha features have also been significantly improved and are ready for testing. We encourage users to provide feedback on these capabilities to help clear the path for their promotion to beta in upcoming releases.&lt;/p&gt;
&lt;h4 id=&#34;extended-resource-requests-via-dra&#34;&gt;Extended Resource Requests via DRA&lt;/h4&gt;
&lt;p&gt;Several functional gaps compared to Extended Resource requests via Device Plugins were addressed, for example, scoring and the reuse of devices in init containers.&lt;/p&gt;
&lt;h4 id=&#34;device-taints-and-tolerations&#34;&gt;Device Taints and Tolerations&lt;/h4&gt;
&lt;p&gt;The new &amp;quot;None&amp;quot; effect can be used to report a problem without immediately affecting scheduling or running Pods. DeviceTaintRule now also provides status information about an ongoing eviction. Together, these enable a &amp;quot;dry run&amp;quot; before actually evicting Pods:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Create DeviceTaintRule with &amp;quot;effect: None&amp;quot;.&lt;/li&gt;
&lt;li&gt;Check the status to see how many pods would be evicted.&lt;/li&gt;
&lt;li&gt;Replace &amp;quot;effect: None&amp;quot; with &amp;quot;effect: NoExecute&amp;quot;.&lt;/li&gt;
&lt;/ul&gt;
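&lt;p&gt;A dry-run DeviceTaintRule might look like the following sketch; the API version and selector fields are illustrative, so check the current DRA documentation for the exact schema:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;apiVersion: resource.k8s.io/v1alpha3   # illustrative; use the version your cluster serves
kind: DeviceTaintRule
metadata:
  name: gpu-dry-run
spec:
  deviceSelector:                      # which devices to taint (fields illustrative)
    driver: gpu.example.com
  taint:
    key: example.com/unhealthy
    effect: None                       # report only; switch to NoExecute to evict
&lt;/code&gt;&lt;/pre&gt;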
&lt;h4 id=&#34;partitionable-devices&#34;&gt;Partitionable Devices&lt;/h4&gt;
&lt;p&gt;Devices that belong to the same partitionable device may now be defined in different ResourceSlices.
You can read more in the &lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#partitionable-devices&#34;&gt;official documentation&lt;/a&gt;.&lt;/p&gt;
&lt;h4 id=&#34;consumable-capacity-device-binding-conditions&#34;&gt;Consumable Capacity, Device Binding Conditions&lt;/h4&gt;
&lt;p&gt;Several bugs were fixed and more tests were added.
You can learn more about &lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#consumable-capacity&#34;&gt;Consumable Capacity&lt;/a&gt; and &lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#device-binding-conditions&#34;&gt;Binding Conditions&lt;/a&gt; in the official documentation.&lt;/p&gt;
&lt;h3 id=&#34;comparable-resource-version-semantics&#34;&gt;Comparable resource version semantics&lt;/h3&gt;
&lt;p&gt;Kubernetes v1.35 changes the way that clients are allowed to interpret &lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/reference/using-api/api-concepts/#resource-versions&#34;&gt;resource versions&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Before v1.35, the only supported comparison that clients could make was a check for string equality: two identical resource version strings referred to the same version of an object. Clients could also provide a resource version to the API server and ask the control plane to do internal comparisons, such as streaming all events since a particular resource version.&lt;/p&gt;
&lt;p&gt;In v1.35, all in-tree resource versions meet a new, stricter definition: the values are a special form of decimal number. Because these values can be ordered, clients can now compare two different resource versions themselves.
For example, a client reconnecting after a crash can detect whether it missed updates while it was offline, as distinct from the case where updates occurred but none were lost.&lt;/p&gt;
&lt;p&gt;This change in semantics enables other important use cases such as &lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/tasks/manage-kubernetes-objects/storage-version-migration/&#34;&gt;storage version migration&lt;/a&gt;, performance improvements to &lt;em&gt;informers&lt;/em&gt; (a client helper concept), and controller reliability. All of those cases require knowing whether one resource version is newer than another.&lt;/p&gt;
&lt;p&gt;This work was done as part of &lt;a href=&#34;https://kep.k8s.io/5504&#34;&gt;KEP #5504&lt;/a&gt; led by SIG API Machinery.&lt;/p&gt;
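&lt;p&gt;Under the new semantics, a client could compare two resource versions numerically, along the lines of this Go sketch (valid only for in-tree resources that honor the v1.35 guarantee):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// Sketch: decide whether newRV is strictly newer than oldRV.
a, errA := strconv.ParseUint(oldRV, 10, 64)
b, errB := strconv.ParseUint(newRV, 10, 64)
if errA == nil &amp;amp;&amp;amp; errB == nil &amp;amp;&amp;amp; b &amp;gt; a {
	// newRV reflects a later state than oldRV.
}
&lt;/code&gt;&lt;/pre&gt;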
&lt;h2 id=&#34;graduations-deprecations-and-removals-in-v1-35&#34;&gt;Graduations, deprecations, and removals in v1.35&lt;/h2&gt;
&lt;h3 id=&#34;graduations-to-stable&#34;&gt;Graduations to stable&lt;/h3&gt;
&lt;p&gt;This section lists all the features that graduated to stable (also known as &lt;em&gt;general availability&lt;/em&gt;). For a full list of updates, including new features and graduations from alpha to beta, see the release notes.&lt;/p&gt;
&lt;p&gt;This release includes a total of 14 enhancements promoted to stable:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://kep.k8s.io/4540&#34;&gt;Add CPUManager policy option to restrict reservedSystemCPUs to system daemons and interrupt processing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://kep.k8s.io/5067&#34;&gt;Pod Generation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://kep.k8s.io/5468&#34;&gt;Invariant Testing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://kep.k8s.io/1287&#34;&gt;In-Place Update of Pod Resources&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://kep.k8s.io/3619&#34;&gt;Fine-grained SupplementalGroups control&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://kep.k8s.io/3983&#34;&gt;Add support for a drop-in kubelet configuration directory&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://kep.k8s.io/5589&#34;&gt;Remove gogo protobuf dependency for Kubernetes API types&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://kep.k8s.io/4210&#34;&gt;kubelet image GC after a maximum age&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://kep.k8s.io/3673&#34;&gt;Kubelet limit of Parallel Image Pulls&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://kep.k8s.io/4622&#34;&gt;Add a TopologyManager policy option for MaxAllowableNUMANodes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://kep.k8s.io/859&#34;&gt;Include kubectl command metadata in http request headers&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://kep.k8s.io/3015&#34;&gt;PreferSameNode Traffic Distribution (formerly PreferLocal traffic policy / Node-level topology)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://kep.k8s.io/4368&#34;&gt;Job API managed-by mechanism&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://kep.k8s.io/4006&#34;&gt;Transition from SPDY to WebSockets&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;deprecations-removals-and-community-updates&#34;&gt;Deprecations, removals and community updates&lt;/h3&gt;
&lt;p&gt;As Kubernetes develops and matures, features may be deprecated, removed, or replaced with better
ones to improve the project&#39;s overall health. See the Kubernetes
&lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/reference/using-api/deprecation-policy/&#34;&gt;deprecation and removal policy&lt;/a&gt; for more details on this process. Kubernetes v1.35 includes several deprecations and removals.&lt;/p&gt;
&lt;h4 id=&#34;ingress-nginx-retirement&#34;&gt;Ingress NGINX retirement&lt;/h4&gt;
&lt;p&gt;For years, the Ingress NGINX controller has been a popular choice for routing traffic into Kubernetes clusters. It was flexible, widely adopted, and served as the standard entry point for countless applications.&lt;/p&gt;
&lt;p&gt;However, maintaining the project has become unsustainable. With a severe shortage of maintainers and mounting technical debt, the community recently made the difficult decision to retire it. This isn&#39;t strictly part of the v1.35 release, but it&#39;s such an important change that we wanted to highlight it here.&lt;/p&gt;
&lt;p&gt;Consequently, the Kubernetes project announced that Ingress NGINX will receive only best-effort maintenance until &lt;strong&gt;March 2026&lt;/strong&gt;. After this date, it will be archived with no further updates. The recommended path forward is to migrate to the &lt;a href=&#34;https://gateway-api.sigs.k8s.io/&#34;&gt;Gateway API&lt;/a&gt;, which offers a more modern, secure, and extensible standard for traffic management.&lt;/p&gt;
&lt;p&gt;You can find more in the &lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/11/11/ingress-nginx-retirement/&#34;&gt;official blog post&lt;/a&gt;.&lt;/p&gt;
&lt;h4 id=&#34;removal-of-cgroup-v1-support&#34;&gt;Removal of cgroup v1 support&lt;/h4&gt;
&lt;p&gt;When it comes to managing resources on Linux nodes, Kubernetes has historically relied on cgroups (control groups). While the original cgroup v1 was functional, it was often inconsistent and limited. That is why Kubernetes introduced support for cgroup v2 back in v1.25, offering a much cleaner, unified hierarchy and better resource isolation.&lt;/p&gt;
&lt;p&gt;Because cgroup v2 is now the modern standard, Kubernetes is ready to retire the legacy cgroup v1 support in v1.35. This is an important notice for cluster administrators: if you are still running nodes on older Linux distributions that don&#39;t support cgroup v2, your &lt;code&gt;kubelet&lt;/code&gt; will fail to start. To avoid downtime, you will need to migrate those nodes to systems where cgroup v2 is enabled.&lt;/p&gt;
&lt;p&gt;To learn more, read &lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/concepts/architecture/cgroups/&#34;&gt;about cgroup v2&lt;/a&gt;;&lt;br&gt;
you can also track the switchover work via &lt;a href=&#34;https://kep.k8s.io/5573&#34;&gt;KEP-5573: Remove cgroup v1 support&lt;/a&gt;.&lt;/p&gt;
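&lt;p&gt;To check which cgroup version a node currently uses, you can inspect the filesystem type of the cgroup mount:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Prints &amp;quot;cgroup2fs&amp;quot; on cgroup v2 nodes, &amp;quot;tmpfs&amp;quot; on cgroup v1 nodes
stat -fc %T /sys/fs/cgroup/
&lt;/code&gt;&lt;/pre&gt;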
&lt;h4 id=&#34;deprecation-of-ipvs-mode-in-kube-proxy&#34;&gt;Deprecation of ipvs mode in kube-proxy&lt;/h4&gt;
&lt;p&gt;Years ago, Kubernetes adopted the &lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/reference/networking/virtual-ips/#proxy-mode-ipvs&#34;&gt;&lt;code&gt;ipvs&lt;/code&gt;&lt;/a&gt; mode in &lt;code&gt;kube-proxy&lt;/code&gt; to provide faster load balancing than the standard &lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/reference/networking/virtual-ips/#proxy-mode-iptables&#34;&gt;&lt;code&gt;iptables&lt;/code&gt;&lt;/a&gt;. While it offered a performance boost, keeping it in sync with evolving networking requirements created too much technical debt and complexity.&lt;/p&gt;
&lt;p&gt;Because of this maintenance burden, Kubernetes v1.35 deprecates &lt;code&gt;ipvs&lt;/code&gt; mode. Although the mode remains available in this release, &lt;code&gt;kube-proxy&lt;/code&gt; will now emit a warning on startup when configured to use it. The goal is to streamline the codebase and focus on modern standards. For Linux nodes, you should begin transitioning to &lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/reference/networking/virtual-ips/#proxy-mode-nftables&#34;&gt;&lt;code&gt;nftables&lt;/code&gt;&lt;/a&gt;, which is now the recommended replacement.&lt;/p&gt;
&lt;p&gt;You can find more in &lt;a href=&#34;https://kep.k8s.io/5495&#34;&gt;KEP-5495: Deprecate ipvs mode in kube-proxy&lt;/a&gt;.&lt;/p&gt;
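&lt;p&gt;Switching a node&#39;s &lt;code&gt;kube-proxy&lt;/code&gt; to &lt;code&gt;nftables&lt;/code&gt; is a one-line change in its configuration file:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: &amp;quot;nftables&amp;quot;
&lt;/code&gt;&lt;/pre&gt;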
&lt;h4 id=&#34;final-call-for-containerd-v1-x&#34;&gt;Final call for containerd v1.X&lt;/h4&gt;
&lt;p&gt;While Kubernetes v1.35 still supports containerd 1.7 and other LTS releases, this is the final version with such support. The SIG Node community has designated v1.35 as the last release to support the containerd v1.X series.&lt;/p&gt;
&lt;p&gt;This serves as an important reminder: before upgrading to the next Kubernetes version, you must switch to containerd 2.0 or later. To help identify which nodes need attention, you can monitor the &lt;code&gt;kubelet_cri_losing_support&lt;/code&gt; metric within your cluster.&lt;/p&gt;
&lt;p&gt;You can find more in the &lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/09/12/kubernetes-v1-34-cri-cgroup-driver-lookup-now-ga/#announcement-kubernetes-is-deprecating-containerd-v1-y-support&#34;&gt;official blog post&lt;/a&gt; or in &lt;a href=&#34;https://kep.k8s.io/4033&#34;&gt;KEP-4033: Discover cgroup driver from CRI&lt;/a&gt;.&lt;/p&gt;
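&lt;p&gt;If you scrape kubelet metrics with Prometheus, a query along these lines can surface affected nodes; the grouping label is illustrative and may differ in your setup:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Non-zero samples indicate nodes whose container runtime is losing support
sum by (instance) (kubelet_cri_losing_support)
&lt;/code&gt;&lt;/pre&gt;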
&lt;h4 id=&#34;improved-pod-stability-during-kubelet-restarts&#34;&gt;Improved Pod stability during &lt;code&gt;kubelet&lt;/code&gt; restarts&lt;/h4&gt;
&lt;p&gt;Previously, restarting the &lt;code&gt;kubelet&lt;/code&gt; service often caused a temporary disruption in Pod status. During a restart, the kubelet would reset container states, causing healthy Pods to be marked as &lt;code&gt;NotReady&lt;/code&gt; and removed from load balancers, even if the application itself was still running correctly.&lt;/p&gt;
&lt;p&gt;To address this reliability issue, this behavior has been corrected to ensure seamless node maintenance. The &lt;code&gt;kubelet&lt;/code&gt; now properly restores the state of existing containers from the runtime upon startup. This ensures that your workloads remain &lt;code&gt;Ready&lt;/code&gt; and traffic continues to flow uninterrupted during &lt;code&gt;kubelet&lt;/code&gt; restarts or upgrades.&lt;/p&gt;
&lt;p&gt;You can find more in &lt;a href=&#34;https://kep.k8s.io/4781&#34;&gt;KEP-4781: Fix inconsistent container ready state after kubelet restart&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;release-notes&#34;&gt;Release notes&lt;/h2&gt;
&lt;p&gt;Check out the full details of the Kubernetes v1.35 release in our &lt;a href=&#34;https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.35.md&#34;&gt;release notes&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;availability&#34;&gt;Availability&lt;/h2&gt;
&lt;p&gt;Kubernetes v1.35 is available for download on &lt;a href=&#34;https://github.com/kubernetes/kubernetes/releases/tag/v1.35.0&#34;&gt;GitHub&lt;/a&gt; or on the &lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/releases/download/&#34;&gt;Kubernetes download page&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;To get started with Kubernetes, check out these &lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/tutorials/&#34;&gt;interactive tutorials&lt;/a&gt; or run local Kubernetes clusters using &lt;a href=&#34;https://minikube.sigs.k8s.io/&#34;&gt;minikube&lt;/a&gt;. You can also easily install v1.35 using &lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/setup/production-environment/tools/kubeadm/create-cluster-kubeadm/&#34;&gt;kubeadm&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;release-team&#34;&gt;Release team&lt;/h2&gt;
&lt;p&gt;Kubernetes is only possible with the support, commitment, and hard work of its community. Each release team is made up of dedicated community volunteers who work together to build the many pieces that make up the Kubernetes releases you rely on. This requires the specialized skills of people from all corners of our community, from the code itself to its documentation and project management.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://github.com/cncf/memorials/blob/main/han-kang.md&#34;&gt;We honor the memory of Han Kang&lt;/a&gt;, a long-time contributor and respected engineer whose technical excellence and infectious enthusiasm left a lasting impact on the Kubernetes community. Han was a significant force within SIG Instrumentation and SIG API Machinery, earning a &lt;a href=&#34;https://www.kubernetes.dev/community/awards/2021/&#34;&gt;2021 Kubernetes Contributor Award&lt;/a&gt; for his critical work and sustained commitment to the project&#39;s core stability. Beyond his technical contributions, Han was deeply admired for his generosity as a mentor and his passion for building connections among people. He was known for &amp;quot;opening doors&amp;quot; for others, whether guiding new contributors through their first pull requests or supporting colleagues with patience and kindness. Han’s legacy lives on through the engineers he inspired, the robust systems he helped build, and the warm, collaborative spirit he fostered within the cloud native ecosystem.&lt;/p&gt;
&lt;p&gt;We would like to thank the entire &lt;a href=&#34;https://github.com/kubernetes/sig-release/blob/master/releases/release-1.35/release-team.md&#34;&gt;Release Team&lt;/a&gt; for the hours spent hard at work to deliver the Kubernetes v1.35 release to our community. The Release Team&#39;s membership ranges from first-time shadows to returning team leads with experience forged over several release cycles. We are incredibly grateful to our Release Lead, &lt;a href=&#34;https://github.com/drewhagen&#34;&gt;Drew Hagen&lt;/a&gt;, whose hands-on guidance and vibrant energy not only navigated us through complex challenges but also fueled the community spirit behind this successful release.&lt;/p&gt;
&lt;h2 id=&#34;project-velocity&#34;&gt;Project velocity&lt;/h2&gt;
&lt;p&gt;The CNCF K8s &lt;a href=&#34;https://k8s.devstats.cncf.io/d/11/companies-contributing-in-repository-groups?orgId=1&amp;var-period=m&amp;var-repogroup_name=All&#34;&gt;DevStats&lt;/a&gt; project aggregates a number of interesting data points related to the velocity of Kubernetes and various sub-projects. This includes everything from individual contributions to the number of companies that are contributing and is an illustration of the depth and breadth of effort that goes into evolving this ecosystem.&lt;/p&gt;
&lt;p&gt;During the v1.35 release cycle, which spanned 14 weeks from 15th September 2025 to 17th December 2025, Kubernetes received contributions from as many as 85 different companies and 419 individuals. In the wider cloud native ecosystem, the figure goes up to 281 companies, counting 1769 total contributors.&lt;/p&gt;
&lt;p&gt;Note that &amp;quot;contribution&amp;quot; counts when someone makes a commit, code review, comment, creates an issue or PR, reviews a PR (including blogs and documentation) or comments on issues and PRs.&lt;br&gt;
If you are interested in contributing, visit &lt;a href=&#34;https://www.kubernetes.dev/docs/guide/#getting-started&#34;&gt;Getting Started&lt;/a&gt; on our contributor website.&lt;/p&gt;
&lt;p&gt;Sources for this data:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://k8s.devstats.cncf.io/d/11/companies-contributing-in-repository-groups?orgId=1&amp;from=1757890800000&amp;to=1765929599000&amp;var-period=d28&amp;var-repogroup_name=Kubernetes&amp;var-repo_name=kubernetes%2Fkubernetes&#34;&gt;Companies contributing to Kubernetes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://k8s.devstats.cncf.io/d/11/companies-contributing-in-repository-groups?orgId=1&amp;from=1757890800000&amp;to=1765929599000&amp;var-period=d28&amp;var-repogroup_name=All&amp;var-repo_name=kubernetes%2Fkubernetes&#34;&gt;Overall ecosystem contributions&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;events-update&#34;&gt;Events update&lt;/h2&gt;
&lt;p&gt;Explore upcoming Kubernetes and cloud native events, including KubeCon + CloudNativeCon, KCD, and other notable conferences worldwide. Stay informed and get involved with the Kubernetes community!&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;February 2026&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://www.kcddelhi.com/index.html&#34;&gt;&lt;strong&gt;KCD - Kubernetes Community Days: New Delhi&lt;/strong&gt;&lt;/a&gt;: Feb 21, 2026 | New Delhi, India&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://community.cncf.io/events/details/cncf-kcd-guadalajara-presents-kcd-guadalajara-open-source-contributor-summit/cohost-kcd-guadalajara&#34;&gt;&lt;strong&gt;KCD - Kubernetes Community Days: Guadalajara&lt;/strong&gt;&lt;/a&gt;: Feb 23, 2026 | Guadalajara, Mexico&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;March 2026&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://events.linuxfoundation.org/kubecon-cloudnativecon-europe/&#34;&gt;&lt;strong&gt;KubeCon + CloudNativeCon Europe 2026&lt;/strong&gt;&lt;/a&gt;: Mar 23-26, 2026 | Amsterdam, Netherlands&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;May 2026&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://community.cncf.io/events/details/cncf-kcd-toronto-presents-kcd-toronto-canada-2026/&#34;&gt;&lt;strong&gt;KCD - Kubernetes Community Days: Toronto&lt;/strong&gt;&lt;/a&gt;: May 13, 2026 | Toronto, Canada&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://cloudnativefinland.org/kcd-helsinki-2026/&#34;&gt;&lt;strong&gt;KCD - Kubernetes Community Days: Helsinki&lt;/strong&gt;&lt;/a&gt;: May 20, 2026 | Helsinki, Finland&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;June 2026&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://events.linuxfoundation.org/kubecon-cloudnativecon-india/&#34;&gt;&lt;strong&gt;KubeCon + CloudNativeCon India 2026&lt;/strong&gt;&lt;/a&gt;: Jun 18-19, 2026 | Mumbai, India&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://community.cncf.io/kcd-kuala-lumpur-2026/&#34;&gt;&lt;strong&gt;KCD - Kubernetes Community Days: Kuala Lumpur&lt;/strong&gt;&lt;/a&gt;: Jun 27, 2026 | Kuala Lumpur, Malaysia&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;July 2026&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://events.linuxfoundation.org/kubecon-cloudnativecon-japan/&#34;&gt;&lt;strong&gt;KubeCon + CloudNativeCon Japan 2026&lt;/strong&gt;&lt;/a&gt;: Jul 29-30, 2026 | Yokohama, Japan&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You can find the latest event details &lt;a href=&#34;https://community.cncf.io/events/#/list&#34;&gt;here&lt;/a&gt;.
​&lt;/p&gt;
&lt;h2 id=&#34;upcoming-release-webinar&#34;&gt;Upcoming release webinar&lt;/h2&gt;
&lt;p&gt;Join members of the Kubernetes v1.35 Release Team on &lt;strong&gt;Wednesday, January 14, 2026, at 5:00 PM (UTC)&lt;/strong&gt; to learn about the highlights of this release. For more information and registration, visit the &lt;a href=&#34;https://community.cncf.io/events/details/cncf-cncf-online-programs-presents-cloud-native-live-kubernetes-v135-release/&#34;&gt;event page&lt;/a&gt; on the CNCF Online Programs site.&lt;/p&gt;
&lt;h2 id=&#34;get-involved&#34;&gt;Get involved&lt;/h2&gt;
&lt;p&gt;The simplest way to get involved with Kubernetes is by joining one of the many &lt;a href=&#34;https://github.com/kubernetes/community/blob/master/sig-list.md&#34;&gt;Special Interest Groups&lt;/a&gt; (SIGs) that align with your interests. Have something you’d like to broadcast to the Kubernetes community? Share your voice at our weekly &lt;a href=&#34;https://github.com/kubernetes/community/tree/master/communication&#34;&gt;community meeting&lt;/a&gt;, and through the channels below. Thank you for your continued feedback and support.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Follow us on Bluesky &lt;a href=&#34;https://bsky.app/profile/kubernetes.io&#34;&gt;@Kubernetesio&lt;/a&gt; for the latest updates&lt;/li&gt;
&lt;li&gt;Join the community discussion on &lt;a href=&#34;https://discuss.kubernetes.io/&#34;&gt;Discuss&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Join the community on &lt;a href=&#34;http://slack.k8s.io/&#34;&gt;Slack&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Post questions (or answer questions) on &lt;a href=&#34;http://stackoverflow.com/questions/tagged/kubernetes&#34;&gt;Stack Overflow&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Share your Kubernetes &lt;a href=&#34;https://docs.google.com/a/linuxfoundation.org/forms/d/e/1FAIpQLScuI7Ye3VQHQTwBASrgkjQDSS5TP0g3AXfFhwSM9YpHgxRKFA/viewform&#34;&gt;story&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Read more about what’s happening with Kubernetes on the &lt;a href=&#34;https://kubernetes.io/blog/&#34;&gt;blog&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Learn more about the &lt;a href=&#34;https://github.com/kubernetes/sig-release/tree/master/release-team&#34;&gt;Kubernetes Release Team&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

      </description>
    </item>
    
    <item>
      <title>Kubernetes v1.35 Sneak Peek</title>
      <link>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/11/26/kubernetes-v1-35-sneak-peek/</link>
      <pubDate>Wed, 26 Nov 2025 00:00:00 +0000</pubDate>
      
      <guid>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/11/26/kubernetes-v1-35-sneak-peek/</guid>
      <description>
        
        
        &lt;p&gt;As the release of Kubernetes v1.35 approaches, the Kubernetes project continues to evolve. Features may be deprecated, removed, or replaced to improve the project&#39;s overall health. This blog post outlines planned changes for the v1.35 release that the release team believes you should be aware of to ensure the continued smooth operation of your Kubernetes cluster(s), and to keep you up to date with the latest developments. The information below is based on the current status of the v1.35 release and is subject to change before the final release date.&lt;/p&gt;
&lt;h2 id=&#34;deprecations-and-removals-for-kubernetes-v1-35&#34;&gt;Deprecations and removals for Kubernetes v1.35&lt;/h2&gt;
&lt;h3 id=&#34;cgroup-v1-support&#34;&gt;cgroup v1 support&lt;/h3&gt;
&lt;p&gt;On Linux nodes, container runtimes typically rely on cgroups (short for &amp;quot;control groups&amp;quot;).
Support for using cgroup v2 has been stable in Kubernetes since v1.25, providing an alternative to the original v1 cgroup support. While cgroup v1 provided the initial resource control mechanism, it suffered from well-known
inconsistencies and limitations. Adding support for cgroup v2 allowed use of a unified control group hierarchy, improved resource isolation, and served as the foundation for modern features, making legacy cgroup v1 support ready for removal.
The removal of cgroup v1 support will only impact cluster administrators running nodes on older Linux distributions that do not support cgroup v2; on those nodes, the &lt;code&gt;kubelet&lt;/code&gt; will fail to start. Administrators must migrate their nodes to systems with cgroup v2 enabled. More details on compatibility requirements will be available in a blog post soon after the v1.35 release.&lt;/p&gt;
&lt;p&gt;To learn more, read &lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/concepts/architecture/cgroups/&#34;&gt;about cgroup v2&lt;/a&gt;;&lt;br&gt;
you can also track the switchover work via &lt;a href=&#34;https://kep.k8s.io/5573&#34;&gt;KEP-5573: Remove cgroup v1 support&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id=&#34;deprecation-of-ipvs-mode-in-kube-proxy&#34;&gt;Deprecation of ipvs mode in kube-proxy&lt;/h3&gt;
&lt;p&gt;Many releases ago, the Kubernetes project implemented an &lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/reference/networking/virtual-ips/#proxy-mode-ipvs&#34;&gt;ipvs&lt;/a&gt; mode in &lt;code&gt;kube-proxy&lt;/code&gt;. It was adopted as a way to provide high-performance service load balancing, with better performance than the existing &lt;code&gt;iptables&lt;/code&gt; mode. However, maintaining feature parity between ipvs and other kube-proxy modes became difficult, due to technical complexity and diverging requirements. This created significant technical debt and made the ipvs backend impractical to support alongside newer networking capabilities.&lt;/p&gt;
&lt;p&gt;The Kubernetes project intends to deprecate kube-proxy &lt;code&gt;ipvs&lt;/code&gt; mode in the v1.35 release, to streamline the &lt;code&gt;kube-proxy&lt;/code&gt; codebase. For Linux nodes, the recommended &lt;code&gt;kube-proxy&lt;/code&gt; mode is already &lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/reference/networking/virtual-ips/#proxy-mode-nftables&#34;&gt;nftables&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;You can find more in &lt;a href=&#34;https://kep.k8s.io/5495&#34;&gt;KEP-5495: Deprecate ipvs mode in kube-proxy&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id=&#34;kubernetes-is-deprecating-containerd-v1-y-support&#34;&gt;Kubernetes is deprecating containerd v1.y support&lt;/h3&gt;
&lt;p&gt;While Kubernetes v1.35 still supports containerd 1.7 and other containerd LTS releases, the Kubernetes SIG Node community has, as a consequence of automated cgroup driver detection, formally agreed upon a final support timeline for containerd v1.X. Kubernetes v1.35 is the last release to offer this support (aligned with the containerd 1.7 EOL).&lt;/p&gt;
&lt;p&gt;This is a final warning: if you are using containerd 1.X, you must switch to 2.0 or later before upgrading Kubernetes to the next version. You can monitor the &lt;code&gt;kubelet_cri_losing_support&lt;/code&gt; metric to determine whether any nodes in your cluster are using a containerd version that will soon be unsupported.&lt;/p&gt;
&lt;p&gt;You can find more in the &lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/09/12/kubernetes-v1-34-cri-cgroup-driver-lookup-now-ga/#announcement-kubernetes-is-deprecating-containerd-v1-y-support&#34;&gt;official blog post&lt;/a&gt; or in &lt;a href=&#34;https://kep.k8s.io/4033&#34;&gt;KEP-4033: Discover cgroup driver from CRI&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;featured-enhancements-of-kubernetes-v1-35&#34;&gt;Featured enhancements of Kubernetes v1.35&lt;/h2&gt;
&lt;p&gt;The following enhancements are some of those likely to be included in the v1.35 release. This is not a commitment, and the release content is subject to change.&lt;/p&gt;
&lt;h3 id=&#34;node-declared-features&#34;&gt;Node declared features&lt;/h3&gt;
&lt;p&gt;When scheduling Pods, Kubernetes uses node labels, taints, and tolerations to match workload requirements with node capabilities. However, managing feature compatibility becomes challenging during cluster upgrades due to version skew between the control plane and nodes. This can lead to Pods being scheduled on nodes that lack required features, resulting in runtime failures.&lt;/p&gt;
&lt;p&gt;The &lt;em&gt;node declared features&lt;/em&gt; framework will introduce a standard mechanism for nodes to declare their supported Kubernetes features. With the new alpha feature enabled, a Node reports the features it can support, publishing this information to the control plane through a new &lt;code&gt;.status.declaredFeatures&lt;/code&gt; field. Then, the &lt;code&gt;kube-scheduler&lt;/code&gt;, admission controllers and third-party components can use these declarations. For example, you can enforce scheduling and API validation constraints, ensuring that Pods run only on compatible nodes.&lt;/p&gt;
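&lt;p&gt;As a rough sketch of what that could look like (the API is alpha and its exact shape may change before release; the feature names below are placeholders for illustration), a Node might report:&lt;/p&gt;

```yaml
# Illustrative sketch only: alpha API from KEP-5328, subject to change
apiVersion: v1
kind: Node
metadata:
  name: worker-1          # hypothetical node name
status:
  declaredFeatures:       # new field in the node declared features alpha
  - ExampleFeatureA       # placeholder feature names
  - ExampleFeatureB
```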
&lt;p&gt;This approach reduces manual node labeling, improves scheduling accuracy, and prevents incompatible pod placements proactively. It also integrates with the Cluster Autoscaler for informed scale-up decisions. Feature declarations are temporary and tied to Kubernetes feature gates, enabling safe rollout and cleanup.&lt;/p&gt;
&lt;p&gt;Targeting alpha in v1.35, &lt;em&gt;node declared features&lt;/em&gt; aims to solve version skew scheduling issues by making node capabilities explicit, enhancing reliability and cluster stability in heterogeneous version environments.&lt;/p&gt;
&lt;p&gt;To learn more about this before the official documentation is published, you can read &lt;a href=&#34;https://kep.k8s.io/5328&#34;&gt;KEP-5328&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id=&#34;in-place-update-of-pod-resources&#34;&gt;In-place update of Pod resources&lt;/h3&gt;
&lt;p&gt;Kubernetes is graduating in-place updates for Pod resources to General Availability (GA). This feature allows users to adjust &lt;code&gt;cpu&lt;/code&gt; and &lt;code&gt;memory&lt;/code&gt; resources without restarting Pods or Containers. Previously, such modifications required recreating Pods, which could disrupt workloads, particularly for stateful or batch applications.
Recent Kubernetes releases already allowed you to change these resource settings (requests and limits) for existing Pods. This allows for smoother &lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/concepts/workloads/autoscaling/vertical-pod-autoscale/&#34;&gt;vertical scaling&lt;/a&gt;, improves efficiency, and can also simplify solution development.&lt;/p&gt;
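&lt;p&gt;You can also tell the kubelet how each resource should be applied during a resize via the &lt;code&gt;resizePolicy&lt;/code&gt; field. A minimal sketch (the Pod name, container name, and values are illustrative):&lt;/p&gt;

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: resizable-pod          # illustrative name
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        cpu: "500m"
        memory: "256Mi"
    resizePolicy:
    - resourceName: cpu
      restartPolicy: NotRequired      # resize CPU without restarting the container
    - resourceName: memory
      restartPolicy: RestartContainer # restart the container when memory changes
```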
&lt;p&gt;The Container Runtime Interface (CRI) has also been improved, extending the &lt;code&gt;UpdateContainerResources&lt;/code&gt; API for Windows and future runtimes while allowing &lt;code&gt;ContainerStatus&lt;/code&gt; to report real-time resource configurations. Together, these changes make scaling in Kubernetes faster, more flexible, and disruption-free.
The feature was introduced as alpha in v1.27, graduated to beta in v1.33, and is targeting graduation to stable in v1.35.&lt;/p&gt;
&lt;p&gt;You can find more in &lt;a href=&#34;https://kep.k8s.io/1287&#34;&gt;KEP-1287: In-place Update of Pod Resources&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id=&#34;pod-certificates&#34;&gt;Pod certificates&lt;/h3&gt;
&lt;p&gt;When running microservices, Pods often require a strong cryptographic identity to authenticate with each other using mutual TLS (mTLS). While Kubernetes provides Service Account tokens, these are designed for authenticating to the API server, not for general-purpose workload identity.&lt;/p&gt;
&lt;p&gt;Before this enhancement, operators had to rely on complex, external projects like SPIFFE/SPIRE or cert-manager to provision and rotate certificates for their workloads. But what if you could issue a unique, short-lived certificate to your Pods natively and automatically? KEP-4317 is designed to enable exactly that kind of native workload identity. It opens up various possibilities for securing pod-to-pod communication by allowing the &lt;code&gt;kubelet&lt;/code&gt; to request and mount certificates for a Pod via a projected volume.&lt;/p&gt;
&lt;p&gt;This provides a built-in mechanism for workload identity, complete with automated certificate rotation, significantly simplifying the setup of service meshes and other zero-trust network policies. This feature was introduced as alpha in v1.34 and is targeting beta in v1.35.&lt;/p&gt;
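&lt;p&gt;A minimal sketch of the projected volume, based on the alpha API from KEP-4317 (the Pod name and signer name are hypothetical, and field names may still change before beta):&lt;/p&gt;

```yaml
# Sketch of the KEP-4317 alpha API; subject to change
apiVersion: v1
kind: Pod
metadata:
  name: mtls-client                       # hypothetical
spec:
  containers:
  - name: app
    image: nginx
    volumeMounts:
    - name: certs
      mountPath: /var/run/pod-certs
      readOnly: true
  volumes:
  - name: certs
    projected:
      sources:
      - podCertificate:
          signerName: example.com/my-signer   # hypothetical signer
          keyType: ED25519
          credentialBundlePath: credbundle.pem
```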
&lt;p&gt;You can find more in &lt;a href=&#34;https://kep.k8s.io/4317&#34;&gt;KEP-4317: Pod Certificates&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id=&#34;numeric-values-for-taints&#34;&gt;Numeric values for taints&lt;/h3&gt;
&lt;p&gt;Kubernetes is enhancing &lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/concepts/scheduling-eviction/taint-and-toleration/&#34;&gt;taints and tolerations&lt;/a&gt; by adding numeric comparison operators, such as &lt;code&gt;Gt&lt;/code&gt; (Greater Than) and &lt;code&gt;Lt&lt;/code&gt; (Less Than).&lt;/p&gt;
&lt;p&gt;Previously, tolerations supported only exact (&lt;code&gt;Equal&lt;/code&gt;) or existence (&lt;code&gt;Exists&lt;/code&gt;) matches, which were not suitable for numeric properties such as reliability SLAs.&lt;/p&gt;
&lt;p&gt;With this change, a Pod can use a toleration to &amp;quot;opt-in&amp;quot; to nodes that meet a specific numeric threshold. For example, a Pod can require a Node with an SLA taint value greater than 950 (&lt;code&gt;operator: Gt&lt;/code&gt;, &lt;code&gt;value: &amp;quot;950&amp;quot;&lt;/code&gt;).&lt;/p&gt;
&lt;p&gt;This approach is more powerful than Node Affinity because it supports the &lt;code&gt;NoExecute&lt;/code&gt; effect, allowing Pods to be automatically evicted if a node&#39;s numeric value drops below the tolerated threshold.&lt;/p&gt;
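&lt;p&gt;A toleration using the new operators might look like this (a sketch based on the KEP; the taint key is hypothetical):&lt;/p&gt;

```yaml
# Sketch based on KEP-5471; the numeric operators are new
apiVersion: v1
kind: Pod
metadata:
  name: sla-sensitive          # illustrative name
spec:
  tolerations:
  - key: example.com/sla-score # hypothetical taint key
    operator: Gt               # tolerate only nodes whose value is greater than...
    value: "950"
    effect: NoExecute          # evict if the node's value drops to 950 or below
  containers:
  - name: app
    image: nginx
```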
&lt;p&gt;You can find more in &lt;a href=&#34;https://kep.k8s.io/5471&#34;&gt;KEP-5471: Enable SLA-based Scheduling&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id=&#34;user-namespaces&#34;&gt;User namespaces&lt;/h3&gt;
&lt;p&gt;When running Pods, you can use &lt;code&gt;securityContext&lt;/code&gt; to drop privileges, but containers inside the pod often still run as root (UID 0). This poses a significant risk: container UID 0 maps directly to the host&#39;s root user.&lt;/p&gt;
&lt;p&gt;Before this enhancement, a container breakout vulnerability could grant an attacker full root access to the node. But what if you could dynamically remap the container&#39;s root user to a safe, unprivileged user on the host? KEP-127 adds exactly that: native support for Linux user namespaces. It opens up various possibilities for pod security by isolating container and host user/group IDs. This allows a process to have root privileges (UID 0) within its namespace, while running as a non-privileged, high-numbered UID on the host.&lt;/p&gt;
&lt;p&gt;Released as alpha in v1.25 and beta in v1.30, this feature continues to progress through beta maturity, paving the way for truly &amp;quot;rootless&amp;quot; containers that drastically reduce the attack surface for a whole class of security vulnerabilities.&lt;/p&gt;
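&lt;p&gt;Opting in is a single field in the Pod spec (the Pod name is illustrative):&lt;/p&gt;

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: userns-demo    # illustrative name
spec:
  hostUsers: false     # run this Pod in its own user namespace
  containers:
  - name: app
    image: nginx
```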
&lt;p&gt;You can find more in &lt;a href=&#34;https://kep.k8s.io/127&#34;&gt;KEP-127: User Namespaces&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id=&#34;support-for-mounting-oci-images-as-volumes&#34;&gt;Support for mounting OCI images as volumes&lt;/h3&gt;
&lt;p&gt;When provisioning a Pod, you often need to bundle data, binaries, or configuration files for your containers.
Before this enhancement, people often included that kind of data directly into the main container image, or required a custom init container to download and unpack files into an &lt;code&gt;emptyDir&lt;/code&gt;. You can still take either of those approaches, of course.&lt;/p&gt;
&lt;p&gt;But what if you could populate a volume directly from a data-only artifact in an OCI registry, just like pulling a container image? Kubernetes v1.31 added support for the &lt;code&gt;image&lt;/code&gt; volume type, allowing Pods to pull and unpack OCI container image artifacts into a volume declaratively.&lt;/p&gt;
&lt;p&gt;This allows for seamless distribution of data, binaries, or ML models using standard registry tooling, completely decoupling data from the container image and eliminating the need for complex init containers or startup scripts.
This volume type has been in beta since v1.33 and will likely be enabled by default in v1.35.&lt;/p&gt;
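&lt;p&gt;For example, a Pod can mount an OCI artifact as a read-only volume (the Pod name and artifact reference below are hypothetical):&lt;/p&gt;

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: model-server    # illustrative name
spec:
  containers:
  - name: app
    image: nginx
    volumeMounts:
    - name: model
      mountPath: /models
      readOnly: true
  volumes:
  - name: model
    image:
      reference: registry.example/models/my-model:v1  # hypothetical artifact
      pullPolicy: IfNotPresent
```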
&lt;p&gt;You can try out the beta version of &lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/concepts/storage/volumes/#image&#34;&gt;&lt;code&gt;image&lt;/code&gt; volumes&lt;/a&gt;, or you can learn more about the plans from &lt;a href=&#34;https://kep.k8s.io/4639&#34;&gt;KEP-4639: OCI Volume Source&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;want-to-know-more&#34;&gt;Want to know more?&lt;/h2&gt;
&lt;p&gt;New features and deprecations are also announced in the Kubernetes release notes. We will formally announce what&#39;s new in &lt;a href=&#34;https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.35.md&#34;&gt;Kubernetes v1.35&lt;/a&gt; as part of the CHANGELOG for that release.&lt;/p&gt;
&lt;p&gt;The Kubernetes v1.35 release is planned for &lt;strong&gt;December 17, 2025&lt;/strong&gt;. Stay tuned for updates!&lt;/p&gt;
&lt;p&gt;You can also see the announcements of changes in the release notes for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.34.md&#34;&gt;Kubernetes v1.34&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.33.md&#34;&gt;Kubernetes v1.33&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.32.md&#34;&gt;Kubernetes v1.32&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.31.md&#34;&gt;Kubernetes v1.31&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.30.md&#34;&gt;Kubernetes v1.30&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;get-involved&#34;&gt;Get involved&lt;/h2&gt;
&lt;p&gt;The simplest way to get involved with Kubernetes is by joining one of the many &lt;a href=&#34;https://github.com/kubernetes/community/blob/master/sig-list.md&#34;&gt;Special Interest Groups&lt;/a&gt; (SIGs) that align with your interests. Have something you’d like to broadcast to the Kubernetes community? Share your voice at our weekly &lt;a href=&#34;https://github.com/kubernetes/community/tree/master/communication&#34;&gt;community meeting&lt;/a&gt;, and through the channels below. Thank you for your continued feedback and support.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Follow us on Bluesky &lt;a href=&#34;https://bsky.app/profile/kubernetes.io&#34;&gt;@kubernetes.io&lt;/a&gt; for the latest updates&lt;/li&gt;
&lt;li&gt;Join the community discussion on &lt;a href=&#34;https://discuss.kubernetes.io/&#34;&gt;Discuss&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Join the community on &lt;a href=&#34;http://slack.k8s.io/&#34;&gt;Slack&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Post questions (or answer questions) on &lt;a href=&#34;https://serverfault.com/questions/tagged/kubernetes&#34;&gt;Server Fault&lt;/a&gt; or &lt;a href=&#34;http://stackoverflow.com/questions/tagged/kubernetes&#34;&gt;Stack Overflow&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Share your Kubernetes &lt;a href=&#34;https://docs.google.com/a/linuxfoundation.org/forms/d/e/1FAIpQLScuI7Ye3VQHQTwBASrgkjQDSS5TP0g3AXfFhwSM9YpHgxRKFA/viewform&#34;&gt;story&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Read more about what’s happening with Kubernetes on the &lt;a href=&#34;https://kubernetes.io/blog/&#34;&gt;blog&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Learn more about the &lt;a href=&#34;https://github.com/kubernetes/sig-release/tree/master/release-team&#34;&gt;Kubernetes Release Team&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

      </description>
    </item>
    
    <item>
      <title>Kubernetes Configuration Good Practices</title>
      <link>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/11/25/configuration-good-practices/</link>
      <pubDate>Tue, 25 Nov 2025 00:00:00 +0000</pubDate>
      
      <guid>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/11/25/configuration-good-practices/</guid>
      <description>
        
        
        &lt;p&gt;Configuration is one of those things in Kubernetes that seems small until it&#39;s not. Configuration is at the heart of every Kubernetes workload.
A missing quote, a wrong API version or a misplaced YAML indent can ruin your entire deployment.&lt;/p&gt;
&lt;p&gt;This blog brings together tried-and-tested configuration best practices. The small habits that make your Kubernetes setup clean, consistent and easier to manage.
Whether you are just starting out or already deploying apps daily, these are the little things that keep your cluster stable and your future self sane.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;This blog is inspired by the original &lt;em&gt;Configuration Best Practices&lt;/em&gt; page, which has evolved through contributions from many members of the Kubernetes community.&lt;/em&gt;&lt;/p&gt;
&lt;h2 id=&#34;general-configuration-practices&#34;&gt;General configuration practices&lt;/h2&gt;
&lt;h3 id=&#34;use-the-latest-stable-api-version&#34;&gt;Use the latest stable API version&lt;/h3&gt;
&lt;p&gt;Kubernetes evolves fast. Older APIs eventually get deprecated and stop working. So, whenever you are defining resources, make sure you are using the latest stable API version.
You can always check which API versions and resources your cluster serves with:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;kubectl api-resources
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This simple step saves you from future compatibility issues.&lt;/p&gt;
&lt;h3 id=&#34;store-configuration-in-version-control&#34;&gt;Store configuration in version control&lt;/h3&gt;
&lt;p&gt;Never apply manifest files directly from your desktop. Always keep them in a version control system like Git; it&#39;s your safety net.
If something breaks, you can instantly roll back to a previous commit, compare changes or recreate your cluster setup without panic.&lt;/p&gt;
&lt;h3 id=&#34;write-configs-in-yaml-not-json&#34;&gt;Write configs in YAML, not JSON&lt;/h3&gt;
&lt;p&gt;Write your configuration files using YAML rather than JSON. Both work technically, but YAML is just easier for humans. It&#39;s cleaner to read, less noisy, and widely used in the community.&lt;/p&gt;
&lt;p&gt;YAML has some sneaky gotchas with boolean values: use only &lt;code&gt;true&lt;/code&gt; or &lt;code&gt;false&lt;/code&gt;. Don&#39;t write &lt;code&gt;yes&lt;/code&gt;, &lt;code&gt;no&lt;/code&gt;, &lt;code&gt;on&lt;/code&gt; or &lt;code&gt;off&lt;/code&gt;; they might work in one version of YAML but break in another. To be safe, quote anything that looks like a Boolean (for example &lt;code&gt;&amp;quot;yes&amp;quot;&lt;/code&gt;).&lt;/p&gt;
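&lt;p&gt;For example (the keys are illustrative):&lt;/p&gt;

```yaml
# Unquoted YAML 1.1 scalars like "on" or "no" may be parsed as booleans
featureEnabled: true   # good: an actual boolean
logMode: "on"          # quoted, so it stays the string "on"
country: "NO"          # quoted, so Norway doesn't become `false`
```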
&lt;h3 id=&#34;keep-configuration-simple-and-minimal&#34;&gt;Keep configuration simple and minimal&lt;/h3&gt;
&lt;p&gt;Avoid setting default values that are already handled by Kubernetes. Minimal manifests are easier to debug, cleaner to review and less likely to break things later.&lt;/p&gt;
&lt;h3 id=&#34;group-related-objects-together&#34;&gt;Group related objects together&lt;/h3&gt;
&lt;p&gt;If your Deployment, Service and ConfigMap all belong to one app, put them in a single manifest file.&lt;br&gt;
It&#39;s easier to track changes and apply them as a unit.
See the &lt;a href=&#34;https://github.com/kubernetes/examples/blob/master/web/guestbook/all-in-one/guestbook-all-in-one.yaml&#34;&gt;Guestbook all-in-one.yaml&lt;/a&gt; file for an example of this syntax.&lt;/p&gt;
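&lt;p&gt;For example, a single file can hold a Service and its Deployment, separated by &lt;code&gt;---&lt;/code&gt; (the names and image are illustrative):&lt;/p&gt;

```yaml
# One file, multiple related objects, separated by ---
apiVersion: v1
kind: Service
metadata:
  name: myapp           # illustrative name
spec:
  selector:
    app.kubernetes.io/name: myapp
  ports:
  - port: 80
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 2
  selector:
    matchLabels:
      app.kubernetes.io/name: myapp
  template:
    metadata:
      labels:
        app.kubernetes.io/name: myapp
    spec:
      containers:
      - name: web
        image: nginx
```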
&lt;p&gt;You can even apply entire directories with:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;kubectl apply -f configs/
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;One command and, boom, everything in that folder gets deployed.&lt;/p&gt;
&lt;h3 id=&#34;add-helpful-annotations&#34;&gt;Add helpful annotations&lt;/h3&gt;
&lt;p&gt;Manifest files are not just for machines, they are for humans too. Use annotations to describe why something exists or what it does. A quick one-liner can save hours when debugging later and also allows better collaboration.&lt;/p&gt;
&lt;p&gt;The most helpful annotation to set is &lt;code&gt;kubernetes.io/description&lt;/code&gt;. It&#39;s like a code comment, except that it gets copied into the API so that everyone else can see it even after you deploy.&lt;/p&gt;
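&lt;p&gt;For example (the ConfigMap name and description are illustrative):&lt;/p&gt;

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: payment-config   # illustrative name
  annotations:
    kubernetes.io/description: >-
      Tuning knobs for the payment service; changing retry limits
      here requires a rollout of the payment Deployment.
data:
  maxRetries: "3"
```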
&lt;h2 id=&#34;managing-workloads-pods-deployments-and-jobs&#34;&gt;Managing Workloads: Pods, Deployments, and Jobs&lt;/h2&gt;
&lt;p&gt;A common early mistake in Kubernetes is creating Pods directly. Pods work, but they don&#39;t reschedule themselves if something goes wrong.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Naked Pods&lt;/em&gt; (Pods not managed by a controller, such as &lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/concepts/workloads/controllers/deployment/&#34;&gt;Deployment&lt;/a&gt; or a &lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/concepts/workloads/controllers/statefulset/&#34;&gt;StatefulSet&lt;/a&gt;) are fine for testing, but in real setups, they are risky.&lt;/p&gt;
&lt;p&gt;Why?
Because if the node hosting that Pod dies, the Pod dies with it and Kubernetes won&#39;t bring it back automatically.&lt;/p&gt;
&lt;h3 id=&#34;use-deployments-for-apps-that-should-always-be-running&#34;&gt;Use Deployments for apps that should always be running&lt;/h3&gt;
&lt;p&gt;A Deployment, which both creates a ReplicaSet to ensure that the desired number of Pods is always available, and specifies a strategy to replace Pods (such as &lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/concepts/workloads/controllers/deployment/#rolling-update-deployment&#34;&gt;RollingUpdate&lt;/a&gt;), is almost always preferable to creating Pods directly.
You can roll out a new version, and if something breaks, roll back instantly.&lt;/p&gt;
&lt;h3 id=&#34;use-jobs-for-tasks-that-should-finish&#34;&gt;Use Jobs for tasks that should finish&lt;/h3&gt;
&lt;p&gt;A &lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/concepts/workloads/controllers/job/&#34;&gt;Job&lt;/a&gt; is perfect when you need something to run once and then stop, like a database migration or a batch-processing task.
It will retry if the Pod fails and report success when it&#39;s done.&lt;/p&gt;
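&lt;p&gt;A minimal Job might look like this (the name and image are hypothetical):&lt;/p&gt;

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: db-migrate            # illustrative name
spec:
  backoffLimit: 3             # retry a failed Pod up to 3 times
  template:
    spec:
      restartPolicy: Never    # Jobs require Never or OnFailure
      containers:
      - name: migrate
        image: registry.example/db-migrate:v2  # hypothetical image
```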
&lt;h2 id=&#34;service-configuration-and-networking&#34;&gt;Service Configuration and Networking&lt;/h2&gt;
&lt;p&gt;Services are how your workloads talk to each other inside (and sometimes outside) your cluster. Without them, your pods exist but can&#39;t reach anyone. Let&#39;s make sure that doesn&#39;t happen.&lt;/p&gt;
&lt;h3 id=&#34;create-services-before-workloads-that-use-them&#34;&gt;Create Services before workloads that use them&lt;/h3&gt;
&lt;p&gt;When Kubernetes starts a Pod, it automatically injects environment variables for existing Services.
So, if a Pod depends on a Service, create a &lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/concepts/services-networking/service/&#34;&gt;Service&lt;/a&gt; &lt;strong&gt;before&lt;/strong&gt; its corresponding backend workloads (Deployments or StatefulSets), and before any workloads that need to access it.&lt;/p&gt;
&lt;p&gt;For example, if a Service named &lt;code&gt;foo&lt;/code&gt; exists, all containers in Pods created afterwards get the following variables in their initial environment:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;FOO_SERVICE_HOST=&amp;lt;the host the Service runs on&amp;gt;
FOO_SERVICE_PORT=&amp;lt;the port the Service runs on&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;DNS-based discovery doesn&#39;t have this problem, but it&#39;s a good habit to follow anyway.&lt;/p&gt;
&lt;h3 id=&#34;use-dns-for-service-discovery&#34;&gt;Use DNS for Service discovery&lt;/h3&gt;
&lt;p&gt;If your cluster has the DNS &lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/concepts/cluster-administration/addons/&#34;&gt;add-on&lt;/a&gt; (most do), every Service automatically gets a DNS entry. That means you can access it by name instead of IP:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;curl http://my-service.default.svc.cluster.local
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;It&#39;s one of those features that makes Kubernetes networking feel magical.&lt;/p&gt;
&lt;h3 id=&#34;avoid-hostport-and-hostnetwork-unless-absolutely-necessary&#34;&gt;Avoid &lt;code&gt;hostPort&lt;/code&gt; and &lt;code&gt;hostNetwork&lt;/code&gt; unless absolutely necessary&lt;/h3&gt;
&lt;p&gt;You&#39;ll sometimes see these options in manifests:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;hostPort&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;8080&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;hostNetwork&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;true&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;But here&#39;s the thing:
They tie your Pods to specific nodes, making them harder to schedule and scale, because each &amp;lt;&lt;code&gt;hostIP&lt;/code&gt;, &lt;code&gt;hostPort&lt;/code&gt;, &lt;code&gt;protocol&lt;/code&gt;&amp;gt; combination must be unique. If you don&#39;t specify the &lt;code&gt;hostIP&lt;/code&gt; and &lt;code&gt;protocol&lt;/code&gt; explicitly, Kubernetes uses &lt;code&gt;0.0.0.0&lt;/code&gt; as the default &lt;code&gt;hostIP&lt;/code&gt; and &lt;code&gt;TCP&lt;/code&gt; as the default &lt;code&gt;protocol&lt;/code&gt;.
Unless you&#39;re debugging or building something like a network plugin, avoid them.&lt;/p&gt;
&lt;p&gt;If you just need local access for testing, try &lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/reference/kubectl/generated/kubectl_port-forward/&#34;&gt;&lt;code&gt;kubectl port-forward&lt;/code&gt;&lt;/a&gt;:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;kubectl port-forward deployment/web 8080:80
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;See &lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/tasks/access-application-cluster/port-forward-access-application-cluster/&#34;&gt;Use Port Forwarding to access applications in a cluster&lt;/a&gt; to learn more.
Or if you really need external access, use a &lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/concepts/services-networking/service/#type-nodeport&#34;&gt;&lt;code&gt;type: NodePort&lt;/code&gt; Service&lt;/a&gt;. That&#39;s the safer, Kubernetes-native way.&lt;/p&gt;
&lt;h3 id=&#34;use-headless-services-for-internal-discovery&#34;&gt;Use headless Services for internal discovery&lt;/h3&gt;
&lt;p&gt;Sometimes, you don&#39;t want Kubernetes to load balance traffic. You want to talk directly to each Pod. That&#39;s where &lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/concepts/services-networking/service/#headless-services&#34;&gt;headless Services&lt;/a&gt; come in.&lt;/p&gt;
&lt;p&gt;You create one by setting &lt;code&gt;clusterIP: None&lt;/code&gt;.
Instead of a single IP, DNS gives you the IPs of all the backing Pods, perfect for apps that manage connections themselves.&lt;/p&gt;
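&lt;p&gt;A minimal headless Service looks like this (the names and port are illustrative):&lt;/p&gt;

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-db              # illustrative name
spec:
  clusterIP: None          # headless: DNS returns the Pod IPs directly
  selector:
    app.kubernetes.io/name: my-db
  ports:
  - port: 5432
```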
&lt;h2 id=&#34;working-with-labels-effectively&#34;&gt;Working with labels effectively&lt;/h2&gt;
&lt;p&gt;&lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/concepts/overview/working-with-objects/labels/&#34;&gt;Labels&lt;/a&gt; are key/value pairs that are attached to objects such as Pods.
Labels help you organize, query and group your resources.
They don&#39;t do anything by themselves, but they make everything else from Services to Deployments work together smoothly.&lt;/p&gt;
&lt;h3 id=&#34;use-semantics-labels&#34;&gt;Use semantic labels&lt;/h3&gt;
&lt;p&gt;Good labels help you understand what&#39;s what, even months later.
Define and use &lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/concepts/overview/working-with-objects/labels/&#34;&gt;labels&lt;/a&gt; that identify semantic attributes of your application or Deployment.
For example:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;labels&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;app.kubernetes.io/name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;myapp&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;app.kubernetes.io/component&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;web&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;tier&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;frontend&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;phase&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;test&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;ul&gt;
&lt;li&gt;&lt;code&gt;app.kubernetes.io/name&lt;/code&gt; : what the app is&lt;/li&gt;
&lt;li&gt;&lt;code&gt;tier&lt;/code&gt; : which layer it belongs to (frontend/backend)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;phase&lt;/code&gt; : which stage it&#39;s in (test/prod)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You can then use these labels to make powerful selectors.
For example:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;kubectl get pods -l &lt;span style=&#34;color:#b8860b&#34;&gt;tier&lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;=&lt;/span&gt;frontend
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This will list all frontend Pods across your cluster, no matter which Deployment they came from.
Basically you are not manually listing Pod names; you are just describing what you want.
See the &lt;a href=&#34;https://github.com/kubernetes/examples/tree/master/web/guestbook/&#34;&gt;guestbook&lt;/a&gt; app for examples of this approach.&lt;/p&gt;
&lt;h3 id=&#34;use-common-kubernetes-labels&#34;&gt;Use common Kubernetes labels&lt;/h3&gt;
&lt;p&gt;Kubernetes actually recommends a set of &lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/concepts/overview/working-with-objects/common-labels/&#34;&gt;common labels&lt;/a&gt;. It&#39;s a standardized way to name things across your different workloads or projects.
Following this convention makes your manifests cleaner, and it means that tools such as &lt;a href=&#34;https://headlamp.dev/&#34;&gt;Headlamp&lt;/a&gt;, &lt;a href=&#34;https://github.com/kubernetes/dashboard#introduction&#34;&gt;dashboard&lt;/a&gt;, or third-party monitoring systems can all
automatically understand what&#39;s running.&lt;/p&gt;
&lt;h3 id=&#34;manipulate-labels-for-debugging&#34;&gt;Manipulate labels for debugging&lt;/h3&gt;
&lt;p&gt;Since controllers (like ReplicaSets or Deployments) use labels to manage Pods, you can remove a label to “detach” a Pod temporarily.&lt;/p&gt;
&lt;p&gt;Example:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;kubectl label pod mypod app-
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The &lt;code&gt;app-&lt;/code&gt; part removes the label key &lt;code&gt;app&lt;/code&gt;.
Once that happens, the controller won’t manage that Pod anymore.
It’s like isolating it for inspection, a “quarantine mode” for debugging. To interactively remove or add labels, use &lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/reference/kubectl/generated/kubectl_label/&#34;&gt;&lt;code&gt;kubectl label&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;You can then check logs, exec into it and once done, delete it manually.
That’s a super underrated trick every Kubernetes engineer should know.&lt;/p&gt;
&lt;h2 id=&#34;handy-kubectl-tips&#34;&gt;Handy kubectl tips&lt;/h2&gt;
&lt;p&gt;These small tips make life much easier when you are working with multiple manifest files or clusters.&lt;/p&gt;
&lt;h3 id=&#34;apply-entire-directories&#34;&gt;Apply entire directories&lt;/h3&gt;
&lt;p&gt;Instead of applying one file at a time, apply the whole folder:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# Using server-side apply is also a good practice&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;kubectl apply -f configs/ --server-side
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This command looks for &lt;code&gt;.yaml&lt;/code&gt;, &lt;code&gt;.yml&lt;/code&gt; and &lt;code&gt;.json&lt;/code&gt; files in that folder and applies them all together.
It&#39;s faster, cleaner and helps keep things grouped by app.&lt;/p&gt;
&lt;h3 id=&#34;use-label-selectors-to-get-or-delete-resources&#34;&gt;Use label selectors to get or delete resources&lt;/h3&gt;
&lt;p&gt;You don&#39;t always need to type out resource names one by one.
Instead, use &lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/concepts/overview/working-with-objects/labels/#label-selectors&#34;&gt;selectors&lt;/a&gt; to act on entire groups at once:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;kubectl get pods -l &lt;span style=&#34;color:#b8860b&#34;&gt;app&lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;=&lt;/span&gt;myapp
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;kubectl delete pod -l &lt;span style=&#34;color:#b8860b&#34;&gt;phase&lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#a2f&#34;&gt;test&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;It&#39;s especially useful in CI/CD pipelines, where you want to clean up test resources dynamically.&lt;/p&gt;
&lt;h3 id=&#34;quickly-create-deployments-and-services&#34;&gt;Quickly create Deployments and Services&lt;/h3&gt;
&lt;p&gt;For quick experiments, you don&#39;t always need to write a manifest. You can spin up a Deployment right from the CLI:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;kubectl create deployment webapp --image&lt;span style=&#34;color:#666&#34;&gt;=&lt;/span&gt;nginx
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Then expose it as a Service:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;kubectl expose deployment webapp --port&lt;span style=&#34;color:#666&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;80&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This is great when you just want to test something before writing full manifests.
Also, see &lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/tasks/access-application-cluster/service-access-application-cluster/&#34;&gt;Use a Service to Access an Application in a cluster&lt;/a&gt; for an example.&lt;/p&gt;
&lt;h2 id=&#34;conclusion&#34;&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Cleaner configuration leads to calmer cluster administrators.
Stick to a few simple habits: keep configuration simple and minimal, version-control everything,
use consistent labels, and avoid relying on naked Pods. You&#39;ll save yourself hours of debugging down the road.&lt;/p&gt;
&lt;p&gt;The best part?
Clean configurations stay readable. Even after months, you or anyone on your team can glance at them and know exactly what’s happening.&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>Ingress NGINX Retirement: What You Need to Know</title>
      <link>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/11/11/ingress-nginx-retirement/</link>
      <pubDate>Tue, 11 Nov 2025 10:30:00 -0800</pubDate>
      
      <guid>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/11/11/ingress-nginx-retirement/</guid>
      <description>
        
        
        &lt;p&gt;To prioritize the safety and security of the ecosystem, Kubernetes SIG Network and the Security Response Committee are announcing the upcoming retirement of &lt;a href=&#34;https://github.com/kubernetes/ingress-nginx/&#34;&gt;Ingress NGINX&lt;/a&gt;. Best-effort maintenance will continue until March 2026. Afterward, there will be no further releases, no bugfixes, and no updates to resolve any security vulnerabilities that may be discovered. &lt;strong&gt;Existing deployments of Ingress NGINX will continue to function and installation artifacts will remain available.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;We recommend migrating to one of the many alternatives. Consider &lt;a href=&#34;https://gateway-api.sigs.k8s.io/guides/&#34;&gt;migrating to Gateway API&lt;/a&gt;, the modern replacement for Ingress. If you must continue using Ingress, many alternative Ingress controllers are &lt;a href=&#34;https://kubernetes.io/docs/concepts/services-networking/ingress-controllers/&#34;&gt;listed in the Kubernetes documentation&lt;/a&gt;. Continue reading for further information about the history and current state of Ingress NGINX, as well as next steps.&lt;/p&gt;
&lt;h2 id=&#34;about-ingress-nginx&#34;&gt;About Ingress NGINX&lt;/h2&gt;
&lt;p&gt;&lt;a href=&#34;https://kubernetes.io/docs/concepts/services-networking/ingress/&#34;&gt;Ingress&lt;/a&gt; is the original user-friendly way to direct network traffic to workloads running on Kubernetes. (&lt;a href=&#34;https://kubernetes.io/docs/concepts/services-networking/gateway/&#34;&gt;Gateway API&lt;/a&gt; is a newer way to achieve many of the same goals.) In order for an Ingress to work in your cluster, there must be an &lt;a href=&#34;https://kubernetes.io/docs/concepts/services-networking/ingress-controllers/&#34;&gt;Ingress controller&lt;/a&gt; running. There are many Ingress controller choices available, which serve the needs of different users and use cases. Some are cloud-provider specific, while others have more general applicability.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://www.github.com/kubernetes/ingress-nginx&#34;&gt;Ingress NGINX&lt;/a&gt; was an Ingress controller, developed early in the history of the Kubernetes project as an example implementation of the API. It became very popular due to its tremendous flexibility, breadth of features, and independence from any particular cloud or infrastructure provider. Since those days, many other Ingress controllers have been created within the Kubernetes project by community groups, and by cloud native vendors. Ingress NGINX has continued to be one of the most popular, deployed as part of many hosted Kubernetes platforms and within innumerable independent users’ clusters.&lt;/p&gt;
&lt;h2 id=&#34;history-and-challenges&#34;&gt;History and Challenges&lt;/h2&gt;
&lt;p&gt;The breadth and flexibility of Ingress NGINX has caused maintenance challenges. Changing expectations about cloud native software have also added complications. What were once considered helpful options have sometimes come to be considered serious security flaws, such as the ability to add arbitrary NGINX configuration directives via the &amp;quot;snippets&amp;quot; annotations. Yesterday’s flexibility has become today’s insurmountable technical debt.&lt;/p&gt;
&lt;p&gt;Despite the project’s popularity among users, Ingress NGINX has always struggled with insufficient or barely-sufficient maintainership. For years, the project has had only one or two people doing development work, on their own time, after work hours and on weekends. Last year, the Ingress NGINX maintainers &lt;a href=&#34;https://kccncna2024.sched.com/event/1hoxW/securing-the-future-of-ingress-nginx-james-strong-isovalent-marco-ebert-giant-swarm&#34;&gt;announced&lt;/a&gt; their plans to wind down Ingress NGINX and develop a replacement controller together with the Gateway API community. Unfortunately, even that announcement failed to generate additional interest in helping maintain Ingress NGINX or develop InGate to replace it. (InGate development never progressed far enough to create a mature replacement; it will also be retired.)&lt;/p&gt;
&lt;h2 id=&#34;current-state-and-next-steps&#34;&gt;Current State and Next Steps&lt;/h2&gt;
&lt;p&gt;Currently, Ingress NGINX is receiving best-effort maintenance. SIG Network and the Security Response Committee have exhausted our efforts to find additional support to make Ingress NGINX sustainable. To prioritize user safety, we must retire the project.&lt;/p&gt;
&lt;p&gt;In March 2026, Ingress NGINX maintenance will be halted, and the project will be &lt;a href=&#34;https://github.com/kubernetes-retired/&#34;&gt;retired&lt;/a&gt;. After that time, there will be no further releases, no bugfixes, and no updates to resolve any security vulnerabilities that may be discovered. The GitHub repositories will be made read-only and left available for reference.&lt;/p&gt;
&lt;p&gt;Existing deployments of Ingress NGINX will not be broken. Existing project artifacts such as Helm charts and container images will remain available.&lt;/p&gt;
&lt;p&gt;In most cases, you can check whether you use Ingress NGINX by running &lt;code&gt;kubectl get pods --all-namespaces --selector app.kubernetes.io/name=ingress-nginx&lt;/code&gt; with cluster administrator permissions.&lt;/p&gt;
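For a fuller picture, a couple of additional checks can help confirm whether a cluster depends on Ingress NGINX (the Helm release name will vary per installation, so the grep pattern is a guess):

```bash
# Look for IngressClasses backed by the ingress-nginx controller
kubectl get ingressclasses -o wide

# If it was installed via Helm, look for a matching release
helm list --all-namespaces | grep -i ingress-nginx
```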
&lt;p&gt;We would like to thank the Ingress NGINX maintainers for their work in creating and maintaining this project; their dedication remains impressive. This Ingress controller has powered billions of requests in datacenters and homelabs all around the world. In a lot of ways, Kubernetes wouldn’t be where it is without Ingress NGINX, and we are grateful for so many years of incredible effort.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;SIG Network and the Security Response Committee recommend that all Ingress NGINX users begin migration to Gateway API or another Ingress controller immediately.&lt;/strong&gt; Many options are listed in the Kubernetes documentation: &lt;a href=&#34;https://gateway-api.sigs.k8s.io/guides/&#34;&gt;Gateway API&lt;/a&gt;, &lt;a href=&#34;https://kubernetes.io/docs/concepts/services-networking/ingress-controllers/&#34;&gt;Ingress&lt;/a&gt;. Additional options may be available from vendors you work with.&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>Announcing the 2025 Steering Committee Election Results</title>
      <link>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/11/09/steering-committee-results-2025/</link>
      <pubDate>Sun, 09 Nov 2025 15:10:00 -0500</pubDate>
      
      <guid>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/11/09/steering-committee-results-2025/</guid>
      <description>
        
        
        &lt;p&gt;The &lt;a href=&#34;https://github.com/kubernetes/community/tree/master/elections/steering/2025&#34;&gt;2025 Steering Committee Election&lt;/a&gt; is now complete. The Kubernetes Steering Committee consists of 7 seats, 4 of which were up for election in 2025. Incoming committee members serve a term of 2 years, and all members are elected by the Kubernetes Community.&lt;/p&gt;
&lt;p&gt;The Steering Committee oversees the governance of the entire Kubernetes project. With that great power comes great responsibility. You can learn more about the steering committee’s role in their &lt;a href=&#34;https://github.com/kubernetes/steering/blob/master/charter.md&#34;&gt;charter&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Thank you to everyone who voted in the election; your participation helps support the community’s continued health and success.&lt;/p&gt;
&lt;h2 id=&#34;results&#34;&gt;Results&lt;/h2&gt;
&lt;p&gt;Congratulations to the elected committee members whose two year terms begin immediately (listed in alphabetical order by GitHub handle):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Kat Cosgrove (&lt;a href=&#34;https://github.com/katcosgrove&#34;&gt;@katcosgrove&lt;/a&gt;), Minimus&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Paco Xu (&lt;a href=&#34;https://github.com/pacoxu&#34;&gt;@pacoxu&lt;/a&gt;), DaoCloud&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Rita Zhang (&lt;a href=&#34;https://github.com/ritazh&#34;&gt;@ritazh&lt;/a&gt;), Microsoft&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Maciej Szulik (&lt;a href=&#34;https://github.com/soltysh&#34;&gt;@soltysh&lt;/a&gt;), Defense Unicorns&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;They join continuing members:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Antonio Ojea (&lt;a href=&#34;https://github.com/aojea&#34;&gt;@aojea&lt;/a&gt;), Google&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Benjamin Elder (&lt;a href=&#34;https://github.com/BenTheElder&#34;&gt;@BenTheElder&lt;/a&gt;), Google&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Sascha Grunert (&lt;a href=&#34;https://github.com/saschagrunert&#34;&gt;@saschagrunert&lt;/a&gt;), Red Hat&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Maciej Szulik and Paco Xu are returning Steering Committee Members.&lt;/p&gt;
&lt;h2 id=&#34;big-thanks&#34;&gt;Big thanks!&lt;/h2&gt;
&lt;p&gt;Thank you and congratulations on a successful election to this round’s election officers:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Christoph Blecker (&lt;a href=&#34;https://github.com/cblecker&#34;&gt;@cblecker&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Nina Polshakova (&lt;a href=&#34;https://github.com/npolshakova&#34;&gt;@npolshakova&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Sreeram Venkitesh (&lt;a href=&#34;https://github.com/sreeram-venkitesh&#34;&gt;@sreeram-venkitesh&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Thanks to the Emeritus Steering Committee Members. Your service is appreciated by the community:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Stephen Augustus (&lt;a href=&#34;https://github.com/justaugustus&#34;&gt;@justaugustus&lt;/a&gt;), Bloomberg&lt;/li&gt;
&lt;li&gt;Patrick Ohly (&lt;a href=&#34;https://github.com/pohly&#34;&gt;@pohly&lt;/a&gt;), Intel&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And thank you to all the candidates who came forward to run for election.&lt;/p&gt;
&lt;h2 id=&#34;get-involved-with-the-steering-committee&#34;&gt;Get involved with the Steering Committee&lt;/h2&gt;
&lt;p&gt;This governing body, like all of Kubernetes, is open to all. You can follow along with Steering Committee &lt;a href=&#34;https://bit.ly/k8s-steering-wd&#34;&gt;meeting notes&lt;/a&gt; and weigh in by filing an issue or creating a PR against their &lt;a href=&#34;https://github.com/kubernetes/steering&#34;&gt;repo&lt;/a&gt;. They have an open meeting on &lt;a href=&#34;https://github.com/kubernetes/steering&#34;&gt;the first Wednesday of every month at 8am PT&lt;/a&gt;. They can also be contacted at their public mailing list &lt;a href=&#34;mailto:steering@kubernetes.io&#34;&gt;steering@kubernetes.io&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;You can see what the Steering Committee meetings are all about by watching past meetings on the &lt;a href=&#34;https://www.youtube.com/playlist?list=PL69nYSiGNLP1yP1B_nd9-drjoxp0Q14qM&#34;&gt;YouTube Playlist&lt;/a&gt;.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;em&gt;This post was adapted from one written by the &lt;a href=&#34;https://github.com/kubernetes/community/tree/master/communication/contributor-comms&#34;&gt;Contributor Comms Subproject&lt;/a&gt;. If you want to write stories about the Kubernetes community, learn more about us.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;This article was revised in November 2025 to update the information about when the steering committee meets.&lt;/em&gt;&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>Gateway API 1.4: New Features</title>
      <link>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/11/06/gateway-api-v1-4/</link>
      <pubDate>Thu, 06 Nov 2025 09:00:00 -0800</pubDate>
      
      <guid>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/11/06/gateway-api-v1-4/</guid>
      <description>
        
        
        &lt;p&gt;&lt;img alt=&#34;Gateway API logo&#34; src=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/11/06/gateway-api-v1-4/gateway-api-logo.svg&#34;&gt;&lt;/p&gt;
&lt;p&gt;Ready to rock your Kubernetes networking? The Kubernetes SIG Network community presents the General Availability (GA) release of Gateway API (v1.4.0)! Released on October 6, 2025, version 1.4.0 reinforces the path for modern, expressive, and extensible service networking in Kubernetes.&lt;/p&gt;
&lt;p&gt;Gateway API v1.4.0 brings three new features to the &lt;em&gt;Standard channel&lt;/em&gt;
(Gateway API&#39;s GA release channel):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;BackendTLSPolicy for TLS between gateways and backends&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;supportedFeatures&lt;/code&gt; in GatewayClass status&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Named rules for Routes&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;and introduces three new experimental features:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Mesh resource for service mesh configuration&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Default gateways&lt;/strong&gt; to ease configuration burden&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;externalAuth&lt;/code&gt; filter for HTTPRoute&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;graduations-to-standard-channel&#34;&gt;Graduations to Standard Channel&lt;/h2&gt;
&lt;h3 id=&#34;backend-tls-policy&#34;&gt;Backend TLS policy&lt;/h3&gt;
&lt;p&gt;Leads: &lt;a href=&#34;https://github.com/candita&#34;&gt;Candace Holman&lt;/a&gt;, &lt;a href=&#34;https://github.com/snorwin&#34;&gt;Norwin Schnyder&lt;/a&gt;, &lt;a href=&#34;https://github.com/kl52752&#34;&gt;Katarzyna Łach&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;GEP-1897: &lt;a href=&#34;https://github.com/kubernetes-sigs/gateway-api/issues/1897&#34;&gt;BackendTLSPolicy&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://gateway-api.sigs.k8s.io/api-types/backendtlspolicy&#34;&gt;BackendTLSPolicy&lt;/a&gt; is a new Gateway API type for specifying the TLS configuration
of the connection from the Gateway to backend pod(s).
Prior to the introduction of BackendTLSPolicy, there was no API specification
that allowed encrypted traffic on the hop from Gateway to backend.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;BackendTLSPolicy&lt;/code&gt; &lt;code&gt;validation&lt;/code&gt; configuration requires a hostname. This &lt;code&gt;hostname&lt;/code&gt;
serves two purposes: it is used as the SNI header when connecting to the backend, and,
for authentication, the certificate presented by the backend must match this hostname,
&lt;em&gt;unless&lt;/em&gt; &lt;code&gt;subjectAltNames&lt;/code&gt; is explicitly specified.&lt;/p&gt;
&lt;p&gt;If &lt;code&gt;subjectAltNames&lt;/code&gt; (SANs) are specified, the &lt;code&gt;hostname&lt;/code&gt; is only used for SNI, and authentication is performed against the SANs instead. If you still need to authenticate against the hostname value in this case, you MUST add it to the &lt;code&gt;subjectAltNames&lt;/code&gt; list.&lt;/p&gt;
&lt;p&gt;BackendTLSPolicy &lt;code&gt;validation&lt;/code&gt; configuration also requires either &lt;code&gt;caCertificateRefs&lt;/code&gt; or &lt;code&gt;wellKnownCACertificates&lt;/code&gt;.
&lt;code&gt;caCertificateRefs&lt;/code&gt; refers to one or more (up to 8) PEM-encoded TLS certificate bundles. If there are no specific certificates to use,
then, depending on your implementation, you may set &lt;code&gt;wellKnownCACertificates&lt;/code&gt;
to &amp;quot;System&amp;quot; to tell the Gateway to use an implementation-specific set of trusted CA certificates.&lt;/p&gt;
&lt;p&gt;In this example, the BackendTLSPolicy is configured to use certificates defined in the auth-cert ConfigMap
to connect with a TLS-encrypted upstream connection where pods backing the auth service are expected to serve a
valid certificate for &lt;code&gt;auth.example.com&lt;/code&gt;.  It uses &lt;code&gt;subjectAltNames&lt;/code&gt; with a Hostname type, but you may also use a URI type.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;gateway.networking.k8s.io/v1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;BackendTLSPolicy&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;metadata&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;tls-upstream-auth&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;spec&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;targetRefs&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Service&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;auth&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;group&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;sectionName&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;https&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;validation&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;caCertificateRefs&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;group&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# core API group&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;ConfigMap&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;auth-cert&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;subjectAltNames&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; 
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;type&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;Hostname&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;hostname&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;auth.example.com&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;In this example, the BackendTLSPolicy is configured to use system certificates to connect with a TLS-encrypted backend connection where Pods backing the dev Service are expected to serve a valid certificate for &lt;code&gt;dev.example.com&lt;/code&gt;.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;gateway.networking.k8s.io/v1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;BackendTLSPolicy&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;metadata&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;tls-upstream-dev&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;spec&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;targetRefs&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Service&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;dev&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;group&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;sectionName&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;btls&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;validation&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;wellKnownCACertificates&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;System&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;hostname&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;dev.example.com&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;More information on the configuration of TLS in Gateway API can be found in &lt;a href=&#34;https://gateway-api.sigs.k8s.io/guides/tls/&#34;&gt;Gateway API - TLS Configuration&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id=&#34;status-information-about-the-features-that-an-implementation-supports&#34;&gt;Status information about the features that an implementation supports&lt;/h3&gt;
&lt;p&gt;Leads: &lt;a href=&#34;https://github.com/liorlieberman&#34;&gt;Lior Lieberman&lt;/a&gt;, &lt;a href=&#34;https://github.com/bexxmodd&#34;&gt;Beka Modebadze&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;GEP-2162: &lt;a href=&#34;https://github.com/kubernetes-sigs/gateway-api/blob/main/geps/gep-2162/index.md&#34;&gt;Supported features in GatewayClass Status&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;GatewayClass status has a new field, &lt;code&gt;supportedFeatures&lt;/code&gt;.
This addition allows implementations to declare the set of features they support. This provides a clear way for users and tools to understand the capabilities of a given GatewayClass.&lt;/p&gt;
&lt;p&gt;This feature&#39;s name for conformance tests (and GatewayClass status reporting) is &lt;strong&gt;SupportedFeatures&lt;/strong&gt;.
Implementations must populate the &lt;code&gt;supportedFeatures&lt;/code&gt; field in the &lt;code&gt;.status&lt;/code&gt; of the GatewayClass &lt;strong&gt;before&lt;/strong&gt; the GatewayClass
is accepted, or in the same operation.&lt;/p&gt;
&lt;p&gt;Here’s an example of a &lt;code&gt;supportedFeatures&lt;/code&gt; published under GatewayClass&#39; &lt;code&gt;.status&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;gateway.networking.k8s.io/v1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;GatewayClass&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#00f;font-weight:bold&#34;&gt;...&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;status&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;conditions&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;lastTransitionTime&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;2022-11-16T10:33:06Z&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;message&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Handled by Foo controller&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;observedGeneration&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;1&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;reason&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Accepted&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;status&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;True&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;type&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Accepted&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;supportedFeatures&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;- HTTPRoute&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;- HTTPRouteHostRewrite&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;- HTTPRoutePortRedirect&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;- HTTPRouteQueryParamMatching&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The graduation of SupportedFeatures to Standard helped improve the conformance testing process for Gateway API.
The conformance test suite will now automatically run tests based on the features populated in the GatewayClass&#39; status.
This creates a strong, verifiable link between an implementation&#39;s declared capabilities and the test results,
making it easier for implementers to run the correct conformance tests and for users to trust the conformance reports.&lt;/p&gt;
&lt;p&gt;This means that when the SupportedFeatures field is populated in the GatewayClass status, there is no need for additional
conformance test flags such as &lt;code&gt;--supported-features&lt;/code&gt;, &lt;code&gt;--exempt&lt;/code&gt;, or &lt;code&gt;--all-features&lt;/code&gt;.
It&#39;s important to note that Mesh features are an exception: they can be tested for conformance by using
&lt;em&gt;Conformance Profiles&lt;/em&gt;, or by manually providing any combination of feature-related flags, until the dedicated resource
graduates from the Experimental channel.&lt;/p&gt;
&lt;h3 id=&#34;named-rules-for-routes&#34;&gt;Named rules for Routes&lt;/h3&gt;
&lt;p&gt;Lead(s): &lt;a href=&#34;https://github.com/guicassolato&#34;&gt;Guilherme Cassolato&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;GEP-995: &lt;a href=&#34;https://gateway-api.sigs.k8s.io/geps/gep-995&#34;&gt;Adding a new name field to all xRouteRule types (HTTPRouteRule, GRPCRouteRule, etc.)&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;This enhancement enables route rules to be explicitly identified and referenced across the Gateway API ecosystem.
Some of the key use cases include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Status:&lt;/strong&gt; Allowing status conditions to reference specific rules directly by name.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Observability:&lt;/strong&gt; Making it easier to identify individual rules in logs, traces, and metrics.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Policies:&lt;/strong&gt; Enabling policies (&lt;a href=&#34;https://gateway-api.sigs.k8s.io/geps/gep-713&#34;&gt;GEP-713&lt;/a&gt;) to target specific route rules via the &lt;code&gt;sectionName&lt;/code&gt; field in their &lt;code&gt;targetRef[s]&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Tooling:&lt;/strong&gt; Simplifying filtering and referencing of route rules in tools such as &lt;code&gt;gwctl&lt;/code&gt;, &lt;code&gt;kubectl&lt;/code&gt;, and general-purpose utilities like &lt;code&gt;jq&lt;/code&gt; and &lt;code&gt;yq&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Internal configuration mapping:&lt;/strong&gt; Facilitating the generation of internal configurations that reference route rules by name within gateway and mesh implementations.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This follows the same well-established pattern already adopted for Gateway listeners, Service ports, Pods (and containers),
and many other Kubernetes resources.&lt;/p&gt;
&lt;p&gt;While the new name field is &lt;strong&gt;optional&lt;/strong&gt; (so existing resources remain valid), its use is &lt;strong&gt;strongly encouraged&lt;/strong&gt;.
Implementations are not expected to assign a default value, but they may enforce constraints such as immutability.&lt;/p&gt;
&lt;p&gt;Finally, keep in mind that the &lt;a href=&#34;https://gateway-api.sigs.k8s.io/geps/gep-995/?h=995#format&#34;&gt;name format&lt;/a&gt; is validated,
and other fields (such as &lt;a href=&#34;https://gateway-api.sigs.k8s.io/reference/spec/?h=sectionname#sectionname&#34;&gt;&lt;code&gt;sectionName&lt;/code&gt;&lt;/a&gt;)
may impose additional, indirect constraints.&lt;/p&gt;
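&lt;p&gt;As a sketch (the resource names here are illustrative), a named rule looks like this; a policy could then target it with &lt;code&gt;sectionName: checkout&lt;/code&gt; in its &lt;code&gt;targetRef&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: store-route
spec:
  parentRefs:
    - name: example-gateway
  rules:
    - name: checkout   # the new, optional rule name
      matches:
        - path:
            type: PathPrefix
            value: /checkout
      backendRefs:
        - name: checkout-service
          port: 8080
&lt;/code&gt;&lt;/pre&gt;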
&lt;h2 id=&#34;experimental-channel-changes&#34;&gt;Experimental channel changes&lt;/h2&gt;
&lt;h3 id=&#34;enabling-external-auth-for-httproute&#34;&gt;Enabling external Auth for HTTPRoute&lt;/h3&gt;
&lt;p&gt;Giving Gateway API the ability to enforce authentication (and possibly authorization) at the Gateway or HTTPRoute level has been a highly requested feature for a long time. (See the &lt;a href=&#34;https://github.com/kubernetes-sigs/gateway-api/issues/1494&#34;&gt;GEP-1494 issue&lt;/a&gt; for some background.)&lt;/p&gt;
&lt;p&gt;This Gateway API release adds an Experimental filter in HTTPRoute that tells the Gateway API implementation to call out to an external service to authenticate (and, optionally, authorize) requests.&lt;/p&gt;
&lt;p&gt;This filter is based on the &lt;a href=&#34;https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/ext_authz_filter#config-http-filters-ext-authz&#34;&gt;Envoy ext_authz API&lt;/a&gt;, and allows talking to an Auth service that uses either gRPC or HTTP for its protocol.&lt;/p&gt;
&lt;p&gt;Both methods allow the configuration of what headers to forward to the Auth service, with the HTTP protocol allowing some extra information like a prefix path.&lt;/p&gt;
&lt;p&gt;An HTTP example might look like this (note that this example requires the Experimental channel to be installed, and an implementation that supports External Auth, to actually understand the config):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;gateway.networking.k8s.io/v1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;HTTPRoute&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;metadata&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;require-auth&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;namespace&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;default&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;spec&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;parentRefs&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;your-gateway-here&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;rules&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;matches&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;path&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;          &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;type&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Prefix&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;          &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;value&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;/admin&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;filters&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;type&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;ExternalAuth&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;          &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;externalAuth&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;            &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;protocol&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;HTTP&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;            &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;backendRef&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;              &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;auth-service&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;            &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;http&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;              &lt;/span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# These headers are always sent for the HTTP protocol,&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;              &lt;/span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# but are included here for illustrative purposes&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;              &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;allowedHeaders&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;                &lt;/span&gt;- Host&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;                &lt;/span&gt;- Method&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;                &lt;/span&gt;- Path&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;                &lt;/span&gt;- Content-Length&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;                &lt;/span&gt;- Authorization&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;backendRefs&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;admin-backend&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;          &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;port&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;8080&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This allows the backend Auth service to use the supplied headers to make a determination about the authentication for the request.&lt;/p&gt;
&lt;p&gt;When a request is allowed, the external Auth service responds with an HTTP 200 response code and, optionally, extra headers to include in the request that is forwarded to the backend. When the request is denied, the Auth service responds with an HTTP 403.&lt;/p&gt;
&lt;p&gt;Since the Authorization header is used by many authentication schemes, this filter can support Basic, OAuth, JWT, and other common authentication and authorization methods.&lt;/p&gt;
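&lt;p&gt;The gRPC variant of the filter is configured the same way, minus the HTTP-specific settings. A sketch (the service name and port are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;filters:
  - type: ExternalAuth
    externalAuth:
      protocol: GRPC
      backendRef:
        name: auth-service   # illustrative; any Service implementing ext_authz over gRPC
        port: 9000
&lt;/code&gt;&lt;/pre&gt;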
&lt;h3 id=&#34;mesh-resource&#34;&gt;Mesh resource&lt;/h3&gt;
&lt;p&gt;Lead(s): &lt;a href=&#34;https://github.com/kflynn&#34;&gt;Flynn&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;GEP-3949: &lt;a href=&#34;https://github.com/kubernetes-sigs/gateway-api/issues/3949&#34;&gt;Mesh-wide configuration and supported features&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Gateway API v1.4.0 introduces a new experimental Mesh resource, which provides a way to configure mesh-wide settings and discover the features supported by a given mesh implementation. This resource is analogous to the Gateway resource and will initially be mainly used for conformance testing, with plans to extend its use to off-cluster Gateways in the future.&lt;/p&gt;
&lt;p&gt;The Mesh resource is cluster-scoped and, as an experimental feature, is named &lt;code&gt;XMesh&lt;/code&gt; and resides in the &lt;code&gt;gateway.networking.x-k8s.io&lt;/code&gt; API group. A key field is &lt;code&gt;controllerName&lt;/code&gt;, which specifies the mesh implementation responsible for the resource. The resource&#39;s &lt;code&gt;status&lt;/code&gt; stanza indicates whether the mesh implementation has accepted it and lists the features the mesh supports.&lt;/p&gt;
&lt;p&gt;One of the goals of this GEP is to avoid making it more difficult for users to adopt a mesh. To simplify adoption, mesh implementations are expected to create a default Mesh resource upon startup if one with a matching &lt;code&gt;controllerName&lt;/code&gt; doesn&#39;t already exist. This avoids the need for manual creation of the resource to begin using a mesh.&lt;/p&gt;
&lt;p&gt;The new XMesh API kind, within the &lt;code&gt;gateway.networking.x-k8s.io/v1alpha1&lt;/code&gt; API group,
provides a central point for mesh configuration and feature discovery.&lt;/p&gt;
&lt;p&gt;A minimal XMesh object specifies the &lt;code&gt;controllerName&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;gateway.networking.x-k8s.io/v1alpha1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;XMesh&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;metadata&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;one-mesh-to-mesh-them-all&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;spec&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;controllerName&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;one-mesh.example.com/one-mesh&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The mesh implementation populates the &lt;code&gt;status&lt;/code&gt; field to confirm it has accepted the resource and to list its supported features:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;status&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;conditions&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;type&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Accepted&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;status&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;True&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;reason&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Accepted&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;supportedFeatures&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;MeshHTTPRoute&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;OffClusterGateway&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h3 id=&#34;introducing-default-gateways&#34;&gt;Introducing default Gateways&lt;/h3&gt;
&lt;p&gt;Lead(s): &lt;a href=&#34;https://github.com/kflynn&#34;&gt;Flynn&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;GEP-3793: &lt;a href=&#34;https://github.com/kubernetes-sigs/gateway-api/issues/3793&#34;&gt;Allowing Gateways to program some routes by default&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;For application developers, one common piece of feedback has been the need to explicitly name a parent Gateway for every single north-south Route. While this explicitness prevents ambiguity, it adds friction, especially for developers who just want to expose their application to the outside world without worrying about the underlying infrastructure&#39;s naming scheme. To address this, we have introduced the concept of &lt;strong&gt;Default Gateways&lt;/strong&gt;.&lt;/p&gt;
&lt;h4 id=&#34;for-application-developers-just-use-the-default&#34;&gt;For application developers: Just &amp;quot;use the default&amp;quot;&lt;/h4&gt;
&lt;p&gt;As an application developer, you often don&#39;t care about the specific Gateway your traffic flows through; you just want it to work. With this enhancement, you can now create a Route and simply ask it to use a default Gateway.&lt;/p&gt;
&lt;p&gt;This is done by setting the new &lt;code&gt;useDefaultGateways&lt;/code&gt; field in your Route&#39;s &lt;code&gt;spec&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Here’s a simple &lt;code&gt;HTTPRoute&lt;/code&gt; that uses a default Gateway:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;gateway.networking.k8s.io/v1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;HTTPRoute&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;metadata&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;my-route&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;spec&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;useDefaultGateways&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;All&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;rules&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;backendRefs&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;my-service&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;port&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;80&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;That&#39;s it! No more need to hunt down the correct Gateway name for your environment. Your Route is now a &amp;quot;defaulted Route.&amp;quot;&lt;/p&gt;
&lt;h4 id=&#34;for-cluster-operators-you-re-still-in-control&#34;&gt;For cluster operators: You&#39;re still in control&lt;/h4&gt;
&lt;p&gt;This feature doesn&#39;t take control away from cluster operators (&amp;quot;Chihiro&amp;quot;).
In fact, they have explicit control over which Gateways can act as a default. A Gateway will only accept these &lt;em&gt;defaulted Routes&lt;/em&gt; if it is configured to do so.&lt;/p&gt;
&lt;p&gt;You can also use a ValidatingAdmissionPolicy to require that Routes rely on a default Gateway, or to forbid them from doing so.&lt;/p&gt;
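&lt;p&gt;For example, a policy along these lines could forbid HTTPRoutes from relying on a default Gateway. This is a hypothetical sketch: the policy name and CEL expression are illustrative, and a ValidatingAdmissionPolicyBinding is also needed to put the policy into effect:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: forbid-default-gateways  # hypothetical policy name
spec:
  matchConstraints:
    resourceRules:
    - apiGroups: [&amp;#34;gateway.networking.k8s.io&amp;#34;]
      apiVersions: [&amp;#34;*&amp;#34;]
      operations: [&amp;#34;CREATE&amp;#34;, &amp;#34;UPDATE&amp;#34;]
      resources: [&amp;#34;httproutes&amp;#34;]
  validations:
  - expression: &amp;#34;!has(object.spec.useDefaultGateways)&amp;#34;
    message: &amp;#34;Routes must reference a Gateway explicitly in spec.parentRefs.&amp;#34;
&lt;/code&gt;&lt;/pre&gt;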
&lt;p&gt;As a cluster operator, you can designate a Gateway as a default
by setting the (new) &lt;code&gt;.spec.defaultScope&lt;/code&gt; field:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;gateway.networking.k8s.io/v1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Gateway&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;metadata&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;my-default-gateway&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;namespace&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;default&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;spec&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;defaultScope&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;All&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# ... other gateway configuration&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Operators can choose to have no default Gateways, or even multiple.&lt;/p&gt;
&lt;h4 id=&#34;how-it-works-and-key-details&#34;&gt;How it works and key details&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;To maintain a clean, GitOps-friendly workflow, a default Gateway does &lt;em&gt;not&lt;/em&gt; modify the &lt;code&gt;spec.parentRefs&lt;/code&gt; of your Route. Instead, the binding is reflected in the Route&#39;s &lt;code&gt;status&lt;/code&gt; field. You can always inspect the &lt;code&gt;status.parents&lt;/code&gt; stanza of your Route to see exactly which Gateway or Gateways have accepted it. This preserves your original intent and avoids conflicts with CD tools.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The design explicitly supports having multiple Gateways designated as defaults within a cluster. When this happens, a defaulted Route will bind to &lt;em&gt;all&lt;/em&gt; of them. This enables cluster operators to perform zero-downtime migrations and testing of new default Gateways.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;You can create a single Route that handles both north-south traffic (traffic entering or leaving the cluster, via a default Gateway) and east-west/mesh traffic (traffic between services within the cluster), by explicitly referencing a Service in &lt;code&gt;parentRefs&lt;/code&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
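&lt;p&gt;Since the binding is reflected in status rather than spec, you can query it directly; for example, for the &lt;code&gt;my-route&lt;/code&gt; HTTPRoute shown earlier:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-shell&#34; data-lang=&#34;shell&#34;&gt;kubectl get httproute my-route -o jsonpath=&#39;{.status.parents[*].parentRef.name}&#39;
&lt;/code&gt;&lt;/pre&gt;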
&lt;p&gt;Default Gateways represent a significant step forward in making the Gateway API simpler and more intuitive for everyday use cases, bridging the gap between the flexibility needed by operators and the simplicity desired by developers.&lt;/p&gt;
&lt;h3 id=&#34;configuring-client-certificate-validation&#34;&gt;Configuring client certificate validation&lt;/h3&gt;
&lt;p&gt;Lead(s): &lt;a href=&#34;https://github.com/arkodg&#34;&gt;Arko Dasgupta&lt;/a&gt;, &lt;a href=&#34;https://github.com/kl52752&#34;&gt;Katarzyna Łach&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;GEP-91: &lt;a href=&#34;https://github.com/kubernetes-sigs/gateway-api/pull/3942&#34;&gt;Address connection coalescing security issue&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;This release brings updates for configuring client certificate validation, addressing a critical security vulnerability related to connection reuse.
HTTP connection coalescing is a web performance optimization that allows a client to reuse an existing TLS connection
for requests to different domains. While this reduces the overhead of establishing new connections, it introduces a security risk
in the context of API gateways.
Because a single TLS connection can be reused across multiple Listeners, client certificate
configuration must be shared consistently across those Listeners to avoid unauthorized access.&lt;/p&gt;
&lt;h4 id=&#34;why-sni-based-mtls-is-not-the-answer&#34;&gt;Why SNI-based mTLS is not the answer&lt;/h4&gt;
&lt;p&gt;One might think that using Server Name Indication (SNI) to differentiate between Listeners would solve this problem.
However, TLS SNI is not a reliable mechanism for enforcing security policies in a connection coalescing scenario.
A client could use a single TLS connection for multiple peer connections, as long as they are all covered by the same certificate.
This means that a client could establish a connection by indicating one peer identity (using SNI), and then reuse that connection
to access a different virtual host that is listening on the same IP address and port. That reuse, which is controlled by client side
heuristics, could bypass mutual TLS policies that were specific to the second listener configuration.&lt;/p&gt;
&lt;p&gt;Here&#39;s an example to help explain it:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;gateway.networking.k8s.io/v1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Gateway&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;metadata&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;wildcard-tls-gateway&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;spec&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;gatewayClassName&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;example&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;listeners&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;foo-https&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;protocol&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;HTTPS&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;port&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;443&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;hostname&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;foo.example.com&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;tls&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;certificateRefs&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;group&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# core API group&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Secret&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;foo-example-com-cert&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# SAN: foo.example.com&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;wildcard-https&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;protocol&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;HTTPS&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;port&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;443&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;hostname&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;*.example.com&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;tls&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;certificateRefs&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;group&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# core API group&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Secret&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;wildcard-example-com-cert&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# SAN: *.example.com&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;I have configured a Gateway with two listeners whose hostnames overlap.
My intention is for the &lt;code&gt;foo-https&lt;/code&gt; listener to be accessible only to clients presenting the &lt;code&gt;foo-example-com-cert&lt;/code&gt; certificate.
In contrast, the &lt;code&gt;wildcard-https&lt;/code&gt; listener should allow access to a broader audience using any certificate valid for the &lt;code&gt;*.example.com&lt;/code&gt; domain.&lt;/p&gt;
&lt;p&gt;Consider a scenario where a client initially connects to &lt;code&gt;foo.example.com&lt;/code&gt;. The server requests and successfully validates the
&lt;code&gt;foo-example-com-cert&lt;/code&gt; certificate, establishing the connection. Subsequently, the same client wishes to access other sites within this domain,
such as &lt;code&gt;bar.example.com&lt;/code&gt;, which is handled by the &lt;code&gt;wildcard-https&lt;/code&gt; listener. Due to connection reuse,
clients can access &lt;code&gt;wildcard-https&lt;/code&gt; backends without an additional TLS handshake on the existing connection.
This process functions as expected.&lt;/p&gt;
&lt;p&gt;However, a critical security vulnerability arises when the order of access is reversed.
If a client first connects to &lt;code&gt;bar.example.com&lt;/code&gt; and presents a valid &lt;code&gt;bar.example.com&lt;/code&gt; certificate, the connection is successfully established.
If this client then attempts to access &lt;code&gt;foo.example.com&lt;/code&gt;, the existing connection&#39;s client certificate will not be re-validated.
This allows the client to bypass the specific certificate requirement for the &lt;code&gt;foo&lt;/code&gt; backend, leading to a serious security breach.&lt;/p&gt;
&lt;h4 id=&#34;the-solution-per-port-tls-configuration&#34;&gt;The solution: per-port TLS configuration&lt;/h4&gt;
&lt;p&gt;The updated Gateway API adds a &lt;code&gt;tls&lt;/code&gt; field to the &lt;code&gt;.spec&lt;/code&gt; of a Gateway, which lets you define a default client certificate
validation configuration for all Listeners and then, where needed, override it on a per-port basis. This provides a flexible and
powerful way to manage your TLS policies.&lt;/p&gt;
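&lt;p&gt;In Gateway YAML terms, this could look like the sketch below. Only the structure is shown; the resource name is illustrative and the client certificate validation fields themselves are elided:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: client-mtls-gateway  # illustrative name
spec:
  gatewayClassName: example
  tls:
    default: {}   # default client certificate validation for all Listeners (fields elided)
    perPort:
    - port: 8443
      tls: {}     # override applied to Listeners on port 8443 (fields elided)
  listeners:
  # ...
&lt;/code&gt;&lt;/pre&gt;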
&lt;p&gt;Here’s a look at the updated API definitions (shown as Go source code):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-go&#34; data-lang=&#34;go&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;// GatewaySpec defines the desired state of Gateway.
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;type&lt;/span&gt; GatewaySpec &lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;struct&lt;/span&gt; {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#666&#34;&gt;...&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#080;font-style:italic&#34;&gt;// GatewayTLSConfig specifies frontend tls configuration for gateway.
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;&lt;/span&gt;    TLS &lt;span style=&#34;color:#666&#34;&gt;*&lt;/span&gt;GatewayTLSConfig &lt;span style=&#34;color:#b44&#34;&gt;`json:&amp;#34;tls,omitempty&amp;#34;`&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;// GatewayTLSConfig specifies frontend tls configuration for gateway.
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;type&lt;/span&gt; GatewayTLSConfig &lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;struct&lt;/span&gt; {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#080;font-style:italic&#34;&gt;// Default specifies the default client certificate validation configuration
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;&lt;/span&gt;    Default TLSConfig &lt;span style=&#34;color:#b44&#34;&gt;`json:&amp;#34;default&amp;#34;`&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#080;font-style:italic&#34;&gt;// PerPort specifies tls configuration assigned per port.
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;&lt;/span&gt;    PerPort []TLSPortConfig &lt;span style=&#34;color:#b44&#34;&gt;`json:&amp;#34;perPort,omitempty&amp;#34;`&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;// TLSPortConfig describes a TLS configuration for a specific port.
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;type&lt;/span&gt; TLSPortConfig &lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;struct&lt;/span&gt; {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#080;font-style:italic&#34;&gt;// The Port indicates the Port Number to which the TLS configuration will be applied.
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;&lt;/span&gt;    Port PortNumber &lt;span style=&#34;color:#b44&#34;&gt;`json:&amp;#34;port&amp;#34;`&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#080;font-style:italic&#34;&gt;// TLS store the configuration that will be applied to all Listeners handling
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;&lt;/span&gt;    &lt;span style=&#34;color:#080;font-style:italic&#34;&gt;// HTTPS traffic and matching given port.
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;&lt;/span&gt;    TLS TLSConfig &lt;span style=&#34;color:#b44&#34;&gt;`json:&amp;#34;tls&amp;#34;`&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id=&#34;breaking-changes&#34;&gt;Breaking changes&lt;/h2&gt;
&lt;h3 id=&#34;breaking-grpcroute&#34;&gt;Standard GRPCRoute - &lt;code&gt;.spec&lt;/code&gt; field required (technicality)&lt;/h3&gt;
&lt;p&gt;The promotion of GRPCRoute to Standard introduces a minor but technically breaking change: the top-level &lt;code&gt;.spec&lt;/code&gt; field is now required.
As part of achieving Standard status, the Gateway API tightened the OpenAPI schema validation in the GRPCRoute
CustomResourceDefinition (CRD)
to explicitly require the &lt;code&gt;spec&lt;/code&gt; field on all GRPCRoute resources.
This enforces stricter conformance to Kubernetes object conventions and makes the resource more stable and predictable.
While it is highly unlikely that anyone was defining a GRPCRoute without a specification, any existing automation
or manifests that relied on the relaxed interpretation allowing a completely absent &lt;code&gt;spec&lt;/code&gt; field will now fail validation
and &lt;strong&gt;must&lt;/strong&gt; be updated to include the &lt;code&gt;.spec&lt;/code&gt; field, even if empty.&lt;/p&gt;
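&lt;p&gt;Concretely, a minimal GRPCRoute must now carry a &lt;code&gt;spec&lt;/code&gt; stanza, even an empty one (the resource name here is illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;apiVersion: gateway.networking.k8s.io/v1
kind: GRPCRoute
metadata:
  name: minimal-grpc-route  # illustrative name
spec: {}  # now required, even when empty
&lt;/code&gt;&lt;/pre&gt;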
&lt;h3 id=&#34;breaking-httproute&#34;&gt;Experimental CORS support in HTTPRoute - breaking change for &lt;code&gt;allowCredentials&lt;/code&gt; field&lt;/h3&gt;
&lt;p&gt;The Gateway API subproject has introduced a breaking change to the Experimental CORS support in HTTPRoute, concerning the &lt;code&gt;allowCredentials&lt;/code&gt; field
within the CORS policy.
This field&#39;s definition has been strictly aligned with the upstream CORS specification, which dictates that the corresponding
&lt;code&gt;Access-Control-Allow-Credentials&lt;/code&gt; header must represent a Boolean value.
Previously, the schema validation was more relaxed, and implementations might have accepted non-standard or string representations such as
&lt;code&gt;&amp;quot;true&amp;quot;&lt;/code&gt;.
Users who were configuring CORS rules must now review their manifests and ensure the value for &lt;code&gt;allowCredentials&lt;/code&gt;
strictly conforms to the new, more restrictive schema.
Any existing HTTPRoute definitions that do not adhere to this stricter validation will now be rejected by the API server,
requiring a configuration update to maintain functionality.&lt;/p&gt;
&lt;h2 id=&#34;improving-the-development-and-usage-experience&#34;&gt;Improving the development and usage experience&lt;/h2&gt;
&lt;p&gt;As part of this release, we have also improved the developer experience workflow:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Added &lt;a href=&#34;https://github.com/kubernetes-sigs/kube-api-linter&#34;&gt;Kube API Linter&lt;/a&gt; to the CI/CD pipelines, reducing the burden on API reviewers and the number of common mistakes.&lt;/li&gt;
&lt;li&gt;Improved the execution time of CRD tests by using &lt;a href=&#34;https://pkg.go.dev/sigs.k8s.io/controller-runtime/pkg/envtest&#34;&gt;&lt;code&gt;envtest&lt;/code&gt;&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Additionally, as part of the effort to improve the Gateway API usage experience, we removed several ambiguities and some old tech debt from our documentation website:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The API reference is now explicit when a field is &lt;code&gt;experimental&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The GEP (Gateway API Enhancement Proposal) navigation bar is now automatically generated, reflecting the real status of each enhancement.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;try-it-out&#34;&gt;Try it out&lt;/h2&gt;
&lt;p&gt;Unlike other Kubernetes APIs, you don&#39;t need to upgrade to the latest version of
Kubernetes to get the latest version of Gateway API. As long as you&#39;re running
Kubernetes 1.26 or later, you&#39;ll be able to get up and running with this version
of Gateway API.&lt;/p&gt;
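&lt;p&gt;For example, installing the Standard channel CRDs is typically a single command (the asset name follows the project&#39;s usual release convention; verify against the guide below):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-shell&#34; data-lang=&#34;shell&#34;&gt;kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.4.0/standard-install.yaml
&lt;/code&gt;&lt;/pre&gt;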
&lt;p&gt;To try out the API, follow the &lt;a href=&#34;https://gateway-api.sigs.k8s.io/guides/&#34;&gt;Getting Started Guide&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;As of this writing, seven implementations are already conformant with Gateway API v1.4.0. In alphabetical order:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/kgateway-dev/kgateway/releases/tag/v2.2.0-alpha.1&#34;&gt;Agent Gateway (with kgateway)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/airlock/microgateway/releases/tag/4.8.0-alpha1&#34;&gt;Airlock Microgateway&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/envoyproxy/gateway/releases/tag/v1.6.0-rc.1&#34;&gt;Envoy Gateway&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://docs.cloud.google.com/kubernetes-engine/docs/concepts/gateway-api&#34;&gt;GKE Gateway&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/istio/istio/releases/tag/1.28.0-rc.1&#34;&gt;Istio&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/kgateway-dev/kgateway/releases/tag/v2.1.0&#34;&gt;kgateway&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/traefik/traefik/releases/tag/v3.6.0-rc1&#34;&gt;Traefik Proxy&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;get-involved&#34;&gt;Get involved&lt;/h2&gt;
&lt;p&gt;Wondering when a feature will be added?  There are lots of opportunities to get
involved and help define the future of Kubernetes routing APIs for both ingress
and service mesh.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Check out the &lt;a href=&#34;https://gateway-api.sigs.k8s.io/guides&#34;&gt;user guides&lt;/a&gt; to see what use-cases can be addressed.&lt;/li&gt;
&lt;li&gt;Try out one of the &lt;a href=&#34;https://gateway-api.sigs.k8s.io/implementations/&#34;&gt;existing Gateway controllers&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Or &lt;a href=&#34;https://gateway-api.sigs.k8s.io/contributing/&#34;&gt;join us in the community&lt;/a&gt;
and help us build the future of Gateway API together!&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The maintainers would like to thank &lt;em&gt;everyone&lt;/em&gt; who&#39;s contributed to Gateway
API, whether in the form of commits to the repo, discussion, ideas, or general
support. We could never have made this kind of progress without the support of
this dedicated and active community.&lt;/p&gt;
&lt;h2 id=&#34;related-kubernetes-blog-articles&#34;&gt;Related Kubernetes blog articles&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/06/02/gateway-api-v1-3/&#34;&gt;Gateway API v1.3.0: Advancements in Request Mirroring, CORS, Gateway Merging, and Retry Budgets&lt;/a&gt;
(June 2025)&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2024/11/21/gateway-api-v1-2/&#34;&gt;Gateway API v1.2: WebSockets, Timeouts, Retries, and More&lt;/a&gt;
(November 2024)&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2024/05/09/gateway-api-v1-1/&#34;&gt;Gateway API v1.1: Service mesh, GRPCRoute, and a whole lot more&lt;/a&gt;
(May 2024)&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2023/11/28/gateway-api-ga/&#34;&gt;New Experimental Features in Gateway API v1.0&lt;/a&gt;
(November 2023)&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2023/10/31/gateway-api-ga/&#34;&gt;Gateway API v1.0: GA Release&lt;/a&gt;
(October 2023)&lt;/li&gt;
&lt;/ul&gt;

      </description>
    </item>
    
    <item>
      <title>7 Common Kubernetes Pitfalls (and How I Learned to Avoid Them)</title>
      <link>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/10/20/seven-kubernetes-pitfalls-and-how-to-avoid/</link>
      <pubDate>Mon, 20 Oct 2025 08:30:00 -0700</pubDate>
      
      <guid>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/10/20/seven-kubernetes-pitfalls-and-how-to-avoid/</guid>
      <description>
        
        
        &lt;p&gt;It’s no secret that Kubernetes can be both powerful and frustrating at times. When I first started dabbling with container orchestration, I made more than my fair share of mistakes, enough to compile a whole list of pitfalls. In this post, I want to walk through seven big gotchas I’ve encountered (or seen others run into) and share some tips on how to avoid them. Whether you’re just kicking the tires on Kubernetes or already managing production clusters, I hope these insights spare you a little extra stress.&lt;/p&gt;
&lt;h2 id=&#34;1-skipping-resource-requests-and-limits&#34;&gt;1. Skipping resource requests and limits&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;The pitfall&lt;/strong&gt;: Not specifying CPU and memory requirements in Pod specifications. This typically happens because Kubernetes does not require these fields, and workloads can often start and run without them—making the omission easy to overlook in early configurations or during rapid deployment cycles.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Context&lt;/strong&gt;:
In Kubernetes, resource requests and limits are critical for efficient cluster management. Resource requests ensure that the scheduler reserves the appropriate amount of CPU and memory for each pod, guaranteeing that it has the necessary resources to operate. Resource limits cap the amount of CPU and memory a pod can use, preventing any single pod from consuming excessive resources and potentially starving other pods.
When resource requests and limits are not set:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Resource Starvation: Pods may get insufficient resources, leading to degraded performance or failures. This is because Kubernetes schedules pods based on these requests. Without them, the scheduler might place too many pods on a single node, leading to resource contention and performance bottlenecks.&lt;/li&gt;
&lt;li&gt;Resource Hoarding: Conversely, without limits, a pod might consume more than its fair share of resources, impacting the performance and stability of other pods on the same node. This can lead to issues such as other pods getting evicted or killed by the Out-Of-Memory (OOM) killer due to lack of available memory.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;how-to-avoid-it&#34;&gt;How to avoid it:&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Start with modest &lt;code&gt;requests&lt;/code&gt; (for example &lt;code&gt;100m&lt;/code&gt; CPU, &lt;code&gt;128Mi&lt;/code&gt; memory) and see how your app behaves.&lt;/li&gt;
&lt;li&gt;Monitor real-world usage and refine your values; the &lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/tasks/run-application/horizontal-pod-autoscale/&#34;&gt;HorizontalPodAutoscaler&lt;/a&gt; can help automate scaling based on metrics.&lt;/li&gt;
&lt;li&gt;Keep an eye on &lt;code&gt;kubectl top pods&lt;/code&gt; or your logging/monitoring tool to confirm you’re not over- or under-provisioning.&lt;/li&gt;
&lt;/ul&gt;
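&lt;p&gt;As a concrete starting point, here is a sketch of a container spec with modest requests and limits (the image name and the exact values are illustrative; tune them to your workload):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-yaml&#34;&gt;apiVersion: v1
kind: Pod
metadata:
  name: web
spec:
  containers:
  - name: app
    image: example.registry/app:1.2.3  # placeholder image
    resources:
      requests:
        cpu: 100m      # reserved by the scheduler when placing the Pod
        memory: 128Mi
      limits:
        memory: 256Mi  # hard cap; exceeding it gets the container OOMKilled
&lt;/code&gt;&lt;/pre&gt;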
&lt;p&gt;&lt;strong&gt;My reality check&lt;/strong&gt;: Early on, I never thought about memory limits. Things seemed fine on my local cluster. Then, on a larger environment, Pods got &lt;em&gt;OOMKilled&lt;/em&gt; left and right. Lesson learned.
For detailed instructions on configuring resource requests and limits for your containers, please refer to &lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/tasks/configure-pod-container/assign-memory-resource/&#34;&gt;Assign Memory Resources to Containers and Pods&lt;/a&gt;
(part of the official Kubernetes documentation).&lt;/p&gt;
&lt;h2 id=&#34;2-underestimating-liveness-and-readiness-probes&#34;&gt;2. Underestimating liveness and readiness probes&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;The pitfall&lt;/strong&gt;: Deploying containers without explicitly defining how Kubernetes should check their health or readiness. This tends to happen because Kubernetes will consider a container “running” as long as the process inside hasn’t exited. Without additional signals, Kubernetes assumes the workload is functioning—even if the application inside is unresponsive, initializing, or stuck.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Context&lt;/strong&gt;:&lt;br&gt;
Liveness, readiness, and startup probes are mechanisms Kubernetes uses to monitor container health and availability.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Liveness probes&lt;/strong&gt; determine if the application is still alive. If a liveness check fails, the container is restarted.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Readiness probes&lt;/strong&gt; control whether a container is ready to serve traffic. Until the readiness probe passes, the Pod is excluded from Service endpoints.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Startup probes&lt;/strong&gt; help distinguish between long startup times and actual failures.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;how-to-avoid-it-1&#34;&gt;How to avoid it:&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Add a simple HTTP &lt;code&gt;livenessProbe&lt;/code&gt; to check a health endpoint (for example &lt;code&gt;/healthz&lt;/code&gt;) so Kubernetes can restart a hung container.&lt;/li&gt;
&lt;li&gt;Use a &lt;code&gt;readinessProbe&lt;/code&gt; to ensure traffic doesn’t reach your app until it’s warmed up.&lt;/li&gt;
&lt;li&gt;Keep probes simple. Overly complex checks can create false alarms and unnecessary restarts.&lt;/li&gt;
&lt;/ul&gt;
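&lt;p&gt;The probes above take only a few lines of YAML. This sketch assumes your app serves a health endpoint at &lt;code&gt;/healthz&lt;/code&gt; on port 8080 (both the path and the port are assumptions; adjust them for your app):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-yaml&#34;&gt;containers:
- name: app
  image: example.registry/app:1.2.3  # placeholder image
  livenessProbe:
    httpGet:
      path: /healthz
      port: 8080
    initialDelaySeconds: 10  # give the app time to start before restart checks begin
    periodSeconds: 10
  readinessProbe:
    httpGet:
      path: /healthz
      port: 8080
    periodSeconds: 5  # no traffic is routed until this passes
&lt;/code&gt;&lt;/pre&gt;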
&lt;p&gt;&lt;strong&gt;My reality check&lt;/strong&gt;: I once forgot a readiness probe for a web service that took a while to load. Users hit it prematurely, got weird timeouts, and I spent hours scratching my head. A 3-line readiness probe would have saved the day.&lt;/p&gt;
&lt;p&gt;For comprehensive instructions on configuring liveness, readiness, and startup probes for containers, please refer to &lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/&#34;&gt;Configure Liveness, Readiness and Startup Probes&lt;/a&gt;
in the official Kubernetes documentation.&lt;/p&gt;
&lt;h2 id=&#34;3-we-ll-just-look-at-container-logs-famous-last-words&#34;&gt;3. “We’ll just look at container logs” (famous last words)&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;The pitfall&lt;/strong&gt;: Relying solely on container logs retrieved via &lt;code&gt;kubectl logs&lt;/code&gt;. This often happens because the command is quick and convenient, and in many setups, logs appear accessible during development or early troubleshooting. However, &lt;code&gt;kubectl logs&lt;/code&gt; only retrieves logs from currently running or recently terminated containers, and those logs are stored on the node’s local disk. As soon as the container is deleted, evicted, or the node is restarted, the log files may be rotated out or permanently lost.&lt;/p&gt;
&lt;h3 id=&#34;how-to-avoid-it-2&#34;&gt;How to avoid it:&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Centralize logs&lt;/strong&gt; using CNCF tools like &lt;a href=&#34;https://kubernetes.io/docs/concepts/cluster-administration/logging/#sidecar-container-with-a-logging-agent&#34;&gt;Fluentd&lt;/a&gt; or &lt;a href=&#34;https://fluentbit.io/&#34;&gt;Fluent Bit&lt;/a&gt; to aggregate output from all Pods.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Adopt OpenTelemetry&lt;/strong&gt; for a unified view of logs, metrics, and (if needed) traces. This lets you spot correlations between infrastructure events and app-level behavior.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Pair logs with Prometheus metrics&lt;/strong&gt; to track cluster-level data alongside application logs. If you need distributed tracing, consider CNCF projects like &lt;a href=&#34;https://www.jaegertracing.io/&#34;&gt;Jaeger&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;My reality check&lt;/strong&gt;: The first time I lost Pod logs to a quick restart, I realized how flimsy &lt;code&gt;kubectl logs&lt;/code&gt; can be on its own. Since then, I’ve set up a proper pipeline for every cluster to avoid missing vital clues.&lt;/p&gt;
&lt;h2 id=&#34;4-treating-dev-and-prod-exactly-the-same&#34;&gt;4. Treating dev and prod exactly the same&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;The pitfall&lt;/strong&gt;: Deploying the same Kubernetes manifests with identical settings across development, staging, and production environments. This often occurs when teams aim for consistency and reuse, but overlook that environment-specific factors—such as traffic patterns, resource availability, scaling needs, or access control—can differ significantly. Without customization, configurations optimized for one environment may cause instability, poor performance, or security gaps in another.&lt;/p&gt;
&lt;h3 id=&#34;how-to-avoid-it-3&#34;&gt;How to avoid it:&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Use environment overlays or &lt;a href=&#34;https://kustomize.io/&#34;&gt;kustomize&lt;/a&gt; to maintain a shared base while customizing resource requests, replicas, or config for each environment.&lt;/li&gt;
&lt;li&gt;Extract environment-specific configuration into ConfigMaps and/or Secrets. You can use a specialized tool such as &lt;a href=&#34;https://github.com/bitnami-labs/sealed-secrets&#34;&gt;Sealed Secrets&lt;/a&gt; to manage confidential data.&lt;/li&gt;
&lt;li&gt;Plan for scale in production. Your dev cluster can probably get away with minimal CPU/memory, but prod might need significantly more.&lt;/li&gt;
&lt;/ul&gt;
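&lt;p&gt;One possible layout, assuming a Deployment named &lt;code&gt;web&lt;/code&gt; lives in a shared base: a production overlay that overrides only the replica count and leaves everything else untouched (all names and values here are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-yaml&#34;&gt;# overlays/prod/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- ../../base
patches:
- target:
    kind: Deployment
    name: web
  patch: |-
    - op: replace
      path: /spec/replicas
      value: 10
&lt;/code&gt;&lt;/pre&gt;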
&lt;p&gt;&lt;strong&gt;My reality check&lt;/strong&gt;: One time, I scaled up &lt;code&gt;replicaCount&lt;/code&gt; from 2 to 10 in a tiny dev environment just to “test.” I promptly ran out of resources and spent half a day cleaning up the aftermath. Oops.&lt;/p&gt;
&lt;h2 id=&#34;5-leaving-old-stuff-floating-around&#34;&gt;5. Leaving old stuff floating around&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;The pitfall&lt;/strong&gt;: Leaving unused or outdated resources—such as Deployments, Services, ConfigMaps, or PersistentVolumeClaims—running in the cluster. This often happens because Kubernetes does not automatically remove resources unless explicitly instructed, and there is no built-in mechanism to track ownership or expiration. Over time, these forgotten objects can accumulate, consuming cluster resources, increasing cloud costs, and creating operational confusion, especially when stale Services or LoadBalancers continue to route traffic.&lt;/p&gt;
&lt;h3 id=&#34;how-to-avoid-it-4&#34;&gt;How to avoid it:&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Label everything&lt;/strong&gt; with a purpose or owner label. That way, you can easily query resources you no longer need.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Regularly audit&lt;/strong&gt; your cluster: run &lt;code&gt;kubectl get all -n &amp;lt;namespace&amp;gt;&lt;/code&gt; to see what’s actually running, and confirm it’s all legit.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Adopt Kubernetes’ Garbage Collection&lt;/strong&gt;: &lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/concepts/workloads/controllers/garbage-collection/&#34;&gt;K8s docs&lt;/a&gt; show how to remove dependent objects automatically.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Leverage policy automation&lt;/strong&gt;: Tools like &lt;a href=&#34;https://kyverno.io/&#34;&gt;Kyverno&lt;/a&gt; can automatically delete or block stale resources after a certain period, or enforce lifecycle policies so you don’t have to remember every single cleanup step.&lt;/li&gt;
&lt;/ul&gt;
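&lt;p&gt;For example, a consistent labeling scheme (the &lt;code&gt;team&lt;/code&gt; and &lt;code&gt;purpose&lt;/code&gt; keys here are conventions you would pick yourself, not anything Kubernetes enforces) makes stale resources easy to find later with a selector query such as &lt;code&gt;kubectl get all -l purpose=hackathon-demo --all-namespaces&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-yaml&#34;&gt;metadata:
  name: test-svc
  labels:
    app.kubernetes.io/name: test-svc
    team: platform          # who to ask before deleting it
    purpose: hackathon-demo # why it exists, and when it can go
&lt;/code&gt;&lt;/pre&gt;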
&lt;p&gt;&lt;strong&gt;My reality check&lt;/strong&gt;: After a hackathon, I forgot to tear down a “test-svc” pinned to an external load balancer. Three weeks later, I realized I’d been paying for that load balancer the entire time. Facepalm.&lt;/p&gt;
&lt;h2 id=&#34;6-diving-too-deep-into-networking-too-soon&#34;&gt;6. Diving too deep into networking too soon&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;The pitfall&lt;/strong&gt;: Introducing advanced networking solutions—such as service meshes, custom CNI plugins, or multi-cluster communication—before fully understanding Kubernetes&#39; native networking primitives. This commonly occurs when teams implement features like traffic routing, observability, or mTLS using external tools without first mastering how core Kubernetes networking works: including Pod-to-Pod communication, ClusterIP Services, DNS resolution, and basic ingress traffic handling. As a result, network-related issues become harder to troubleshoot, especially when overlays introduce additional abstractions and failure points.&lt;/p&gt;
&lt;h3 id=&#34;how-to-avoid-it-5&#34;&gt;How to avoid it:&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Start small: a Deployment, a Service, and a basic ingress controller such as one based on NGINX (e.g., Ingress-NGINX).&lt;/li&gt;
&lt;li&gt;Make sure you understand how traffic flows within the cluster, how service discovery works, and how DNS is configured.&lt;/li&gt;
&lt;li&gt;Only move to a full-blown mesh or advanced CNI features when you actually need them; complex networking adds overhead.&lt;/li&gt;
&lt;/ul&gt;
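&lt;p&gt;Starting small might look like this: a ClusterIP Service in front of your Pods plus a basic Ingress routing to it (the hostname, names, and ports are placeholders, and the manifest assumes an NGINX-based ingress controller is installed with class name &lt;code&gt;nginx&lt;/code&gt;):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-yaml&#34;&gt;apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  selector:
    app: web        # must match your Deployment&#39;s Pod labels
  ports:
  - port: 80
    targetPort: 8080
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web
spec:
  ingressClassName: nginx
  rules:
  - host: web.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: web
            port:
              number: 80
&lt;/code&gt;&lt;/pre&gt;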
&lt;p&gt;&lt;strong&gt;My reality check&lt;/strong&gt;: I tried Istio on a small internal app once, then spent more time debugging Istio itself than the actual app. Eventually, I stepped back, removed Istio, and everything worked fine.&lt;/p&gt;
&lt;h2 id=&#34;7-going-too-light-on-security-and-rbac&#34;&gt;7. Going too light on security and RBAC&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;The pitfall&lt;/strong&gt;: Deploying workloads with insecure configurations, such as running containers as the root user, using the &lt;code&gt;latest&lt;/code&gt; image tag, disabling security contexts, or assigning overly broad RBAC roles like &lt;code&gt;cluster-admin&lt;/code&gt;. These practices persist because Kubernetes does not enforce strict security defaults out of the box, and the platform is designed to be flexible rather than opinionated. Without explicit security policies in place, clusters can remain exposed to risks like container escape, unauthorized privilege escalation, or accidental production changes due to unpinned images.&lt;/p&gt;
&lt;h3 id=&#34;how-to-avoid-it-6&#34;&gt;How to avoid it:&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Use &lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/reference/access-authn-authz/rbac/&#34;&gt;RBAC&lt;/a&gt; to define roles and permissions within Kubernetes. While RBAC is the default and most widely supported authorization mechanism, Kubernetes also allows the use of alternative authorizers. For more advanced or external policy needs, consider solutions like &lt;a href=&#34;https://open-policy-agent.github.io/gatekeeper/&#34;&gt;OPA Gatekeeper&lt;/a&gt; (based on Rego), &lt;a href=&#34;https://kyverno.io/&#34;&gt;Kyverno&lt;/a&gt;, or custom webhooks using policy languages such as CEL or &lt;a href=&#34;https://cedarpolicy.com/&#34;&gt;Cedar&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Pin images to specific versions (no more &lt;code&gt;:latest&lt;/code&gt;!). This helps you know what’s actually deployed.&lt;/li&gt;
&lt;li&gt;Look into &lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/concepts/security/pod-security-admission/&#34;&gt;Pod Security Admission&lt;/a&gt; (or other solutions like Kyverno) to enforce non-root containers, read-only filesystems, etc.&lt;/li&gt;
&lt;/ul&gt;
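&lt;p&gt;Putting the image-pinning and non-root advice together, a hardened container spec might look like this sketch (the image reference is a placeholder):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-yaml&#34;&gt;containers:
- name: app
  image: example.registry/app:1.4.2  # pinned tag, not :latest
  securityContext:
    runAsNonRoot: true
    allowPrivilegeEscalation: false
    readOnlyRootFilesystem: true
    capabilities:
      drop:
      - ALL
&lt;/code&gt;&lt;/pre&gt;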
&lt;p&gt;&lt;strong&gt;My reality check&lt;/strong&gt;: I never had a huge security breach, but I’ve heard plenty of cautionary tales. If you don’t tighten things up, it’s only a matter of time before something goes wrong.&lt;/p&gt;
&lt;h2 id=&#34;final-thoughts&#34;&gt;Final thoughts&lt;/h2&gt;
&lt;p&gt;Kubernetes is amazing, but it’s not psychic: it won’t magically do the right thing if you don’t tell it what you need. By keeping these pitfalls in mind, you’ll avoid a lot of headaches and wasted time. Mistakes happen (trust me, I’ve made my share), but each one is a chance to learn more about how Kubernetes truly works under the hood.
If you’re curious to dive deeper, the &lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/home/&#34;&gt;official docs&lt;/a&gt; and the &lt;a href=&#34;http://slack.kubernetes.io/&#34;&gt;community Slack&lt;/a&gt; are excellent next steps. And of course, feel free to share your own horror stories or success tips, because at the end of the day, we’re all in this cloud native adventure together.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Happy Shipping!&lt;/strong&gt;&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>Spotlight on Policy Working Group</title>
      <link>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/10/18/wg-policy-spotlight-2025/</link>
      <pubDate>Sat, 18 Oct 2025 00:00:00 +0000</pubDate>
      
      <guid>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/10/18/wg-policy-spotlight-2025/</guid>
      <description>
        
        
        &lt;p&gt;&lt;em&gt;(Note: The Policy Working Group has completed its mission and is no longer active. This article reflects its work, accomplishments, and insights into how a working group operates.)&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;In the complex world of Kubernetes, policies play a crucial role in managing and securing clusters. But have you ever wondered how these policies are developed, implemented, and standardized across the Kubernetes ecosystem? To answer that, let&#39;s take a look back at the work of the Policy Working Group.&lt;/p&gt;
&lt;p&gt;The Policy Working Group was dedicated to a critical mission: providing an overall architecture that encompasses both current policy-related implementations and future policy proposals in Kubernetes. Their goal was both ambitious and essential: to develop a universal policy architecture that benefits developers and end-users alike.&lt;/p&gt;
&lt;p&gt;Through collaborative methods, this working group strove to bring clarity and consistency to the often complex world of Kubernetes policies. By focusing on both existing implementations and future proposals, they ensured that the policy landscape in Kubernetes remains coherent and accessible as the technology evolves.&lt;/p&gt;
&lt;p&gt;This blog post dives deeper into the work of the Policy Working Group, guided by insights from its former co-chairs:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://twitter.com/JimBugwadia&#34;&gt;Jim Bugwadia&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://twitter.com/poonam_lamba&#34;&gt;Poonam Lamba&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://twitter.com/sudermanjr&#34;&gt;Andy Suderman&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;em&gt;Interviewed by &lt;a href=&#34;https://twitter.com/arujjval&#34;&gt;Arujjwal Negi&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;These co-chairs explained what the Policy Working Group was all about.&lt;/p&gt;
&lt;h2 id=&#34;introduction&#34;&gt;Introduction&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Hello, thank you for the time! Let’s start with some introductions, could you tell us a bit about yourself, your role, and how you got involved in Kubernetes?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Jim Bugwadia&lt;/strong&gt;: My name is Jim Bugwadia, and I am a co-founder and the CEO at Nirmata which provides solutions that automate security and compliance for cloud-native workloads. At Nirmata, we have been working with Kubernetes since it started in 2014. We initially built a Kubernetes policy engine in our commercial platform and later donated it to CNCF as the Kyverno project. I joined the CNCF Kubernetes Policy Working Group to help build and standardize various aspects of policy management for Kubernetes and later became a co-chair.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Andy Suderman&lt;/strong&gt;: My name is Andy Suderman and I am the CTO of Fairwinds, a managed Kubernetes-as-a-Service provider. I began working with Kubernetes in 2016 building a web conferencing platform. I am an author and/or maintainer of several Kubernetes-related open-source projects such as Goldilocks, Pluto, and Polaris. Polaris is a JSON-schema-based policy engine, which started Fairwinds&#39; journey into the policy space and my involvement in the Policy Working Group.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Poonam Lamba&lt;/strong&gt;: My name is Poonam Lamba, and I currently work as a Product Manager for Google Kubernetes Engine (GKE) at Google. My journey with Kubernetes began back in 2017 when I was building an SRE platform for a large enterprise, using a private cloud built on Kubernetes. Intrigued by its potential to revolutionize the way we deployed and managed applications at the time, I dove headfirst into learning everything I could about it. Since then, I&#39;ve had the opportunity to build the policy and compliance products for GKE. I lead and contribute to GKE CIS benchmarks. I am involved with the Gatekeeper project as well as I have contributed to Policy-WG for over 2 years and served as a co-chair for the group.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Responses to the following questions represent an amalgamation of insights from the former co-chairs.&lt;/em&gt;&lt;/p&gt;
&lt;h2 id=&#34;about-working-groups&#34;&gt;About Working Groups&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;One thing even I am not aware of is the difference between a working group and a SIG. Can you help us understand what a working group is and how it is different from a SIG?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Unlike SIGs, working groups are temporary and focused on tackling specific, cross-cutting issues or projects that may involve multiple SIGs. Their lifespan is defined, and they disband once they&#39;ve achieved their objective. Generally, working groups don&#39;t own code or have long-term responsibility for managing a particular area of the Kubernetes project.&lt;/p&gt;
&lt;p&gt;(To know more about SIGs, visit the &lt;a href=&#34;https://github.com/kubernetes/community/blob/master/sig-list.md&#34;&gt;list of Special Interest Groups&lt;/a&gt;)&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;You mentioned that Working Groups involve multiple SIGS. What SIGS was the Policy WG closely involved with, and how did you coordinate with them?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The group collaborated closely with Kubernetes SIG Auth throughout our existence, and more recently, the group also worked with SIG Security since its formation. Our collaboration occurred in a few ways. We provided periodic updates during the SIG meetings to keep them informed of our progress and activities. Additionally, we utilized other community forums to maintain open lines of communication and ensured our work aligned with the broader Kubernetes ecosystem. This collaborative approach helped the group stay coordinated with related efforts across the Kubernetes community.&lt;/p&gt;
&lt;h2 id=&#34;policy-wg&#34;&gt;Policy WG&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Why was the Policy Working Group created?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;We recognized that, to enable a broad set of use cases, Kubernetes is powered by a highly declarative, fine-grained, and extensible configuration management system. We&#39;ve observed that a Kubernetes configuration manifest may have different portions that are important to various stakeholders. For example, some parts may be crucial for developers, while others might be of particular interest to security teams or address operational concerns. Given this complexity, we believe that policies governing the usage of these intricate configurations are essential for success with Kubernetes.&lt;/p&gt;
&lt;p&gt;Our Policy Working Group was created specifically to research the standardization of policy definitions and related artifacts. We saw a need to bring consistency and clarity to how policies are defined and implemented across the Kubernetes ecosystem, given the diverse requirements and stakeholders involved in Kubernetes deployments.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Can you give me an idea of the work you did in the group?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;We worked on several Kubernetes policy-related projects. Our initiatives included:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;We worked on a Kubernetes Enhancement Proposal (KEP) for the Kubernetes Policy Reports API. This aims to standardize how policy reports are generated and consumed within the Kubernetes ecosystem.&lt;/li&gt;
&lt;li&gt;We conducted a CNCF survey to better understand policy usage in the Kubernetes space. This helped gauge the practices and needs across the community at the time.&lt;/li&gt;
&lt;li&gt;We wrote a paper to guide users in achieving PCI-DSS compliance for containers. This is intended to help organizations meet important security standards in their Kubernetes environments.&lt;/li&gt;
&lt;li&gt;We also worked on a paper highlighting how shifting security down can benefit organizations. This focuses on the advantages of implementing security measures earlier in the development and deployment process.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Can you tell us what were the main objectives of the Policy Working Group and some of your key accomplishments?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The charter of the Policy WG was to help standardize policy management for Kubernetes and educate the community on best practices.&lt;/p&gt;
&lt;p&gt;To accomplish this we updated the Kubernetes documentation (&lt;a href=&#34;https://kubernetes.io/docs/concepts/policy&#34;&gt;Policies | Kubernetes&lt;/a&gt;), produced several whitepapers (&lt;a href=&#34;https://github.com/kubernetes/sig-security/blob/main/sig-security-docs/papers/policy/CNCF_Kubernetes_Policy_Management_WhitePaper_v1.pdf&#34;&gt;Kubernetes Policy Management&lt;/a&gt;, &lt;a href=&#34;https://github.com/kubernetes/sig-security/blob/main/sig-security-docs/papers/policy_grc/Kubernetes_Policy_WG_Paper_v1_101123.pdf&#34;&gt;Kubernetes GRC&lt;/a&gt;), and created the Policy Reports API (&lt;a href=&#34;https://github.com/kubernetes-retired/wg-policy-prototypes/blob/master/policy-report/docs/api-docs.md&#34;&gt;API reference&lt;/a&gt;) which standardizes reporting across various tools. Several popular tools such as Falco, Trivy, Kyverno, kube-bench, and others support the Policy Report API. A major goal for the Policy WG was to promote the Policy Reports API to a SIG-level API or find it a stable home.&lt;/p&gt;
&lt;p&gt;Beyond that, as &lt;a href=&#34;https://kubernetes.io/docs/reference/access-authn-authz/validating-admission-policy/&#34;&gt;ValidatingAdmissionPolicy&lt;/a&gt; and &lt;a href=&#34;https://kubernetes.io/docs/reference/access-authn-authz/mutating-admission-policy/&#34;&gt;MutatingAdmissionPolicy&lt;/a&gt; approached GA in Kubernetes, a key goal of the WG was to guide and educate the community on the tradeoffs and appropriate usage patterns for these built-in API objects and other CNCF policy management solutions like OPA/Gatekeeper and Kyverno.&lt;/p&gt;
&lt;h2 id=&#34;challenges&#34;&gt;Challenges&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;What were some of the major challenges that the Policy Working Group worked on?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;During our work in the Policy Working Group, we encountered several challenges:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;One of the main issues we faced was finding time to consistently contribute. Given that many of us have other professional commitments, it can be difficult to dedicate regular time to the working group&#39;s initiatives.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Another challenge we experienced was related to our consensus-driven model. While this approach ensures that all voices are heard, it can sometimes lead to slower decision-making processes. We valued thorough discussion and agreement, but this can occasionally delay progress on our projects.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;We&#39;ve also encountered occasional differences of opinion among group members. These situations require careful navigation to ensure that we maintain a collaborative and productive environment while addressing diverse viewpoints.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Lastly, we&#39;ve noticed that newcomers to the group may find it difficult to contribute effectively without consistent attendance at our meetings. The complex nature of our work often requires ongoing context, which can be challenging for those who aren&#39;t able to participate regularly.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Can you tell me more about those challenges? How did you discover each one? What has the impact been? What were some strategies you used to address them?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;There are no easy answers, but having more contributors and maintainers greatly helps! Overall the CNCF community is great to work with and is very welcoming to beginners. So, if folks out there are hesitating to get involved, I highly encourage them to attend a WG or SIG meeting and just listen in.&lt;/p&gt;
&lt;p&gt;It often takes a few meetings to fully understand the discussions, so don&#39;t feel discouraged if you don&#39;t grasp everything right away. We made a point to emphasize this and encouraged new members to review documentation as a starting point for getting involved.&lt;/p&gt;
&lt;p&gt;Additionally, differences of opinion were valued and encouraged within the Policy-WG. We adhered to the CNCF core values and resolved disagreements by maintaining respect for one another. We also strove to timebox our decisions and assign clear responsibilities to keep things moving forward.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;This is where our discussion about the Policy Working Group ends. The working group, and especially the people who took part in this article, hope this gave you some insights into the group&#39;s aims and workings. You can get more info about Working Groups &lt;a href=&#34;https://github.com/kubernetes/community/blob/master/committee-steering/governance/wg-governance.md&#34;&gt;here&lt;/a&gt;.&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>Introducing Headlamp Plugin for Karpenter - Scaling and Visibility</title>
      <link>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/10/06/introducing-headlamp-plugin-for-karpenter/</link>
      <pubDate>Mon, 06 Oct 2025 00:00:00 +0000</pubDate>
      
      <guid>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/10/06/introducing-headlamp-plugin-for-karpenter/</guid>
      <description>
        
        
&lt;p&gt;Headlamp is an open‑source, extensible UI for Kubernetes, maintained as a Kubernetes SIG UI project, that lets you explore, manage, and debug cluster resources.&lt;/p&gt;
&lt;p&gt;Karpenter is a node provisioning project from the Kubernetes Autoscaling SIG that helps clusters scale quickly and efficiently. It launches new nodes in seconds, selects appropriate instance types for workloads, and manages the full node lifecycle, including scale-down.&lt;/p&gt;
&lt;p&gt;The new Headlamp Karpenter Plugin adds real-time visibility into Karpenter’s activity directly from the Headlamp UI. It shows how Karpenter resources relate to Kubernetes objects, displays live metrics, and surfaces scaling events as they happen. You can inspect pending pods during provisioning, review scaling decisions, and edit Karpenter-managed resources with built-in validation. The Karpenter plugin was built as part of an LFX mentorship project.&lt;/p&gt;
&lt;p&gt;The Karpenter plugin for Headlamp aims to make it easier for Kubernetes users and operators to understand, debug, and fine-tune autoscaling behavior in their clusters. The rest of this post gives a brief tour of the plugin.&lt;/p&gt;
&lt;h2 id=&#34;map-view-of-karpenter-resources-and-how-they-relate-to-kubernetes-resources&#34;&gt;Map view of Karpenter Resources and how they relate to Kubernetes resources&lt;/h2&gt;
&lt;p&gt;Easily see how Karpenter resources like NodeClasses, NodePools, and NodeClaims connect with core Kubernetes resources such as Pods and Nodes.&lt;/p&gt;
&lt;p&gt;&lt;img alt=&#34;Map view showing relationships between resources&#34; src=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/10/06/introducing-headlamp-plugin-for-karpenter/mini-map-view.png&#34;&gt;&lt;/p&gt;
&lt;h2 id=&#34;visualization-of-karpenter-metrics&#34;&gt;Visualization of Karpenter Metrics&lt;/h2&gt;
&lt;p&gt;Get instant insight into resource usage versus limits, allowed disruptions, pending Pods, provisioning latency, and more.&lt;/p&gt;
&lt;p&gt;&lt;img alt=&#34;NodePool default metrics shown with controls to see different frequencies&#34; src=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/10/06/introducing-headlamp-plugin-for-karpenter/chart-1.png&#34;&gt;&lt;/p&gt;
&lt;p&gt;&lt;img alt=&#34;NodeClaim default metrics shown with controls to see different frequencies&#34; src=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/10/06/introducing-headlamp-plugin-for-karpenter/chart-2.png&#34;&gt;&lt;/p&gt;
&lt;h2 id=&#34;scaling-decisions&#34;&gt;Scaling decisions&lt;/h2&gt;
&lt;p&gt;See which instances are being provisioned for your workloads and understand why Karpenter made those choices. This is helpful when debugging.&lt;/p&gt;
&lt;p&gt;&lt;img alt=&#34;Pod Placement Decisions data including reason, from, pod, message, and age&#34; src=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/10/06/introducing-headlamp-plugin-for-karpenter/pod-decisions.png&#34;&gt;&lt;/p&gt;
&lt;p&gt;&lt;img alt=&#34;Node decision data including Type, Reason, Node, From, Message&#34; src=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/10/06/introducing-headlamp-plugin-for-karpenter/node-decisions.png&#34;&gt;&lt;/p&gt;
&lt;h2 id=&#34;config-editor-with-validation-support&#34;&gt;Config editor with validation support&lt;/h2&gt;
&lt;p&gt;Make live edits to Karpenter configurations. The editor includes diff previews and resource validation for safer adjustments.&lt;br&gt;
&lt;img alt=&#34;Config editor with validation support&#34; src=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/10/06/introducing-headlamp-plugin-for-karpenter/config-editor.png&#34;&gt;&lt;/p&gt;
&lt;h2 id=&#34;real-time-view-of-karpenter-resources&#34;&gt;Real-time view of Karpenter resources&lt;/h2&gt;
&lt;p&gt;View and track Karpenter-specific resources, such as NodeClaims, in real time as your cluster scales up and down.&lt;/p&gt;
&lt;p&gt;&lt;img alt=&#34;Node claims data including Name, Status, Instance Type, CPU, Zone, Age, and Actions&#34; src=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/10/06/introducing-headlamp-plugin-for-karpenter/node-claims.png&#34;&gt;&lt;/p&gt;
&lt;p&gt;&lt;img alt=&#34;Node Pools data including Name, NodeClass, CPU, Memory, Nodes, Status, Age, Actions&#34; src=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/10/06/introducing-headlamp-plugin-for-karpenter/nodepools.png&#34;&gt;&lt;/p&gt;
&lt;p&gt;&lt;img alt=&#34;EC2 Node Classes data including Name, Cluster, Instance Profile, Status, IAM Role, Age, and Actions&#34; src=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/10/06/introducing-headlamp-plugin-for-karpenter/nodeclass.png&#34;&gt;&lt;/p&gt;
&lt;h2 id=&#34;dashboard-for-pending-pods&#34;&gt;Dashboard for Pending Pods&lt;/h2&gt;
&lt;p&gt;View all pending pods whose scheduling requirements are unmet, with FailedScheduling events highlighting why they couldn&#39;t be scheduled.&lt;/p&gt;
&lt;p&gt;&lt;img alt=&#34;Pending Pods data including Name, Namespace, Type, Reason, From, and Message&#34; src=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/10/06/introducing-headlamp-plugin-for-karpenter/pending-pods.png&#34;&gt;&lt;/p&gt;
&lt;h2 id=&#34;karpenter-providers&#34;&gt;Karpenter Providers&lt;/h2&gt;
&lt;p&gt;This plugin should work with most Karpenter providers, but so far it has only been tested with the providers marked in the table below. Additionally, some providers expose extra provider-specific information, and the table shows for which providers the plugin displays it.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider Name&lt;/th&gt;
&lt;th&gt;Tested&lt;/th&gt;
&lt;th&gt;Extra provider specific info supported&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href=&#34;https://github.com/aws/karpenter-provider-aws&#34;&gt;AWS&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href=&#34;https://github.com/Azure/karpenter-provider-azure&#34;&gt;Azure&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href=&#34;https://github.com/cloudpilot-ai/karpenter-provider-alibabacloud&#34;&gt;AlibabaCloud&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href=&#34;https://github.com/bizflycloud/karpenter-provider-bizflycloud&#34;&gt;Bizfly Cloud&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href=&#34;https://github.com/kubernetes-sigs/karpenter-provider-cluster-api&#34;&gt;Cluster API&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href=&#34;https://github.com/cloudpilot-ai/karpenter-provider-gcp&#34;&gt;GCP&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href=&#34;https://github.com/sergelogvinov/karpenter-provider-proxmox&#34;&gt;Proxmox&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href=&#34;https://github.com/zoom/karpenter-oci&#34;&gt;Oracle Cloud Infrastructure (OCI)&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Please &lt;a href=&#34;https://github.com/headlamp-k8s/plugins/issues&#34;&gt;submit an issue&lt;/a&gt; if you test one of the untested providers, or if you want support for a particular provider (PRs are also gladly accepted).&lt;/p&gt;
&lt;h2 id=&#34;how-to-use&#34;&gt;How to use&lt;/h2&gt;
&lt;p&gt;Please see &lt;a href=&#34;https://github.com/headlamp-k8s/plugins/tree/main/karpenter&#34;&gt;plugins/karpenter/README.md&lt;/a&gt; for usage instructions.&lt;/p&gt;
&lt;h2 id=&#34;feedback-and-questions&#34;&gt;Feedback and Questions&lt;/h2&gt;
&lt;p&gt;Please &lt;a href=&#34;https://github.com/headlamp-k8s/plugins/issues&#34;&gt;submit an issue&lt;/a&gt; if you use Karpenter and have any other ideas or feedback, or come to the &lt;a href=&#34;https://kubernetes.slack.com/?redir=%2Fmessages%2Fheadlamp&#34;&gt;#headlamp channel on Kubernetes Slack&lt;/a&gt; for a chat.&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>Announcing Changed Block Tracking API support (alpha)</title>
      <link>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/09/25/csi-changed-block-tracking/</link>
      <pubDate>Thu, 25 Sep 2025 05:00:00 -0800</pubDate>
      
      <guid>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/09/25/csi-changed-block-tracking/</guid>
      <description>
        
        
        &lt;p&gt;We&#39;re excited to announce the alpha support for a &lt;em&gt;changed block tracking&lt;/em&gt; mechanism. This enhances
the Kubernetes storage ecosystem by providing an efficient way for
&lt;a href=&#34;https://kubernetes.io/docs/concepts/storage/volumes/#csi&#34;&gt;CSI&lt;/a&gt; storage drivers to identify changed
blocks in PersistentVolume snapshots. With a driver that can use the feature, you could benefit
from faster and more resource-efficient backup operations.&lt;/p&gt;
&lt;p&gt;If you&#39;re eager to try this feature, you can &lt;a href=&#34;#getting-started&#34;&gt;skip to the Getting Started section&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;what-is-changed-block-tracking&#34;&gt;What is changed block tracking?&lt;/h2&gt;
&lt;p&gt;Changed block tracking enables storage systems to identify and track modifications at the block level
between snapshots, eliminating the need to scan entire volumes during backup operations. The
improvement is a change to the Container Storage Interface (CSI), and also to the storage support
in Kubernetes itself.
With the alpha feature enabled, your cluster can:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Identify allocated blocks within a CSI volume snapshot&lt;/li&gt;
&lt;li&gt;Determine changed blocks between two snapshots of the same volume&lt;/li&gt;
&lt;li&gt;Streamline backup operations by focusing only on changed data blocks&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For Kubernetes users managing large datasets, this API enables significantly more efficient
backup processes. Backup applications can now focus only on the blocks that have changed,
rather than processing entire volumes.&lt;/p&gt;

&lt;div class=&#34;alert alert-info&#34; role=&#34;alert&#34;&gt;&lt;h4 class=&#34;alert-heading&#34;&gt;Note:&lt;/h4&gt;As of now, the Changed Block Tracking API is supported only for block volumes and not for
file volumes. CSI drivers that manage file-based storage systems will not be able to
implement this capability.&lt;/div&gt;

&lt;h2 id=&#34;benefits-of-changed-block-tracking-support-in-kubernetes&#34;&gt;Benefits of changed block tracking support in Kubernetes&lt;/h2&gt;
&lt;p&gt;As Kubernetes adoption grows for stateful workloads managing critical data, the need for efficient
backup solutions becomes increasingly important. Traditional full backup approaches face challenges with:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Long backup windows&lt;/em&gt;: Full volume backups can take hours for large datasets, making it difficult
to complete within maintenance windows.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;High resource utilization&lt;/em&gt;: Backup operations consume substantial network bandwidth and I/O
resources, especially for large data volumes and data-intensive applications.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Increased storage costs&lt;/em&gt;: Repetitive full backups store redundant data, causing storage
requirements to grow linearly even when only a small percentage of data actually changes between
backups.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The Changed Block Tracking API addresses these challenges by providing native Kubernetes support for
incremental backup capabilities through the CSI interface.&lt;/p&gt;
&lt;h2 id=&#34;key-components&#34;&gt;Key components&lt;/h2&gt;
&lt;p&gt;The implementation consists of three primary components:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;em&gt;CSI SnapshotMetadata Service API&lt;/em&gt;: A gRPC API that provides volume
snapshot and changed block metadata.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;SnapshotMetadataService API&lt;/em&gt;: A Kubernetes CustomResourceDefinition (CRD) that
advertises CSI driver metadata service availability and connection details to
cluster clients.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;External Snapshot Metadata Sidecar&lt;/em&gt;: An intermediary component that connects CSI
drivers to backup applications via a standardized gRPC interface.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id=&#34;implementation-requirements&#34;&gt;Implementation requirements&lt;/h2&gt;
&lt;h3 id=&#34;storage-provider-responsibilities&#34;&gt;Storage provider responsibilities&lt;/h3&gt;
&lt;p&gt;If you&#39;re an author of a storage integration with Kubernetes and want to support the changed block tracking feature, you must implement specific requirements:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Implement CSI RPCs&lt;/em&gt;: Storage providers need to implement the &lt;code&gt;SnapshotMetadata&lt;/code&gt; service as defined in the &lt;a href=&#34;https://github.com/container-storage-interface/spec/blob/master/csi.proto&#34;&gt;CSI specifications protobuf&lt;/a&gt;. This service requires server-side streaming implementations for the following RPCs:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;GetMetadataAllocated&lt;/code&gt;: For identifying allocated blocks in a snapshot&lt;/li&gt;
&lt;li&gt;&lt;code&gt;GetMetadataDelta&lt;/code&gt;: For determining changed blocks between two snapshots&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Storage backend capabilities&lt;/em&gt;: Ensure the storage backend has the capability to track and report block-level changes.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Deploy external components&lt;/em&gt;: Integrate with the &lt;code&gt;external-snapshot-metadata&lt;/code&gt; sidecar to expose the snapshot metadata service.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Register custom resource&lt;/em&gt;: Register the &lt;code&gt;SnapshotMetadataService&lt;/code&gt; resource using a CustomResourceDefinition and create a &lt;code&gt;SnapshotMetadataService&lt;/code&gt; custom resource that advertises the availability of the metadata service and provides connection details.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Support error handling&lt;/em&gt;: Implement proper error handling for these RPCs according to the CSI specification requirements.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
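&lt;p&gt;To illustrate the custom resource registration step, here is a sketch of what a &lt;code&gt;SnapshotMetadataService&lt;/code&gt; object might look like. The API group, version, and field names shown here (&lt;code&gt;address&lt;/code&gt;, &lt;code&gt;caCert&lt;/code&gt;, &lt;code&gt;audience&lt;/code&gt;) are drawn from the enhancement proposal; as an alpha API they may differ in your CSI driver&#39;s release, so treat this as illustrative rather than definitive:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-yaml&#34;&gt;# Illustrative only: field names follow the enhancement proposal and may
# change while the API is alpha.
apiVersion: cbt.storage.k8s.io/v1alpha1
kind: SnapshotMetadataService
metadata:
  # Conventionally named after the CSI driver it advertises.
  name: hostpath.csi.k8s.io
spec:
  # gRPC endpoint of the external-snapshot-metadata sidecar.
  address: snapshot-metadata.csi-driver-namespace:6443
  # Base64-encoded CA certificate used to validate the sidecar&#39;s TLS
  # server certificate.
  caCert: LS0tLS1CRUdJTi...
  # Audience expected in the ServiceAccount tokens presented by
  # backup application clients.
  audience: csi-snapshot-metadata
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Backup applications read this object to discover the metadata service endpoint and the token audience to request, so the advertised values must match the sidecar&#39;s actual deployment.&lt;/p&gt;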
&lt;h3 id=&#34;backup-solution-responsibilities&#34;&gt;Backup solution responsibilities&lt;/h3&gt;
&lt;p&gt;A backup solution looking to leverage this feature must:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Set up authentication&lt;/em&gt;: The backup application must provide a Kubernetes ServiceAccount token when using the
&lt;a href=&#34;https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/3314-csi-changed-block-tracking#the-kubernetes-snapshotmetadata-service-api&#34;&gt;Kubernetes SnapshotMetadataService API&lt;/a&gt;.
Appropriate access grants, such as RBAC RoleBindings, must be established to authorize the backup application
ServiceAccount to obtain such tokens.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Implement streaming client-side code&lt;/em&gt;: Develop clients that implement the streaming gRPC APIs defined in the
&lt;a href=&#34;https://github.com/kubernetes-csi/external-snapshot-metadata/blob/main/proto/schema.proto&#34;&gt;schema.proto&lt;/a&gt; file.
Specifically:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Implement streaming client code for &lt;code&gt;GetMetadataAllocated&lt;/code&gt; and &lt;code&gt;GetMetadataDelta&lt;/code&gt; methods&lt;/li&gt;
&lt;li&gt;Handle server-side streaming responses efficiently as the metadata comes in chunks&lt;/li&gt;
&lt;li&gt;Process the &lt;code&gt;SnapshotMetadataResponse&lt;/code&gt; message format with proper error handling&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The &lt;code&gt;external-snapshot-metadata&lt;/code&gt; GitHub repository provides a convenient
&lt;a href=&#34;https://github.com/kubernetes-csi/external-snapshot-metadata/tree/master/pkg/iterator&#34;&gt;iterator&lt;/a&gt;
support package to simplify client implementation.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Handle large dataset streaming&lt;/em&gt;: Design clients to efficiently handle large streams of block metadata that
could be returned for volumes with significant changes.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Optimize backup processes&lt;/em&gt;: Modify backup workflows to use the changed block metadata to identify and only
transfer changed blocks to make backups more efficient, reducing both backup duration and resource consumption.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id=&#34;getting-started&#34;&gt;Getting started&lt;/h2&gt;
&lt;p&gt;To use changed block tracking in your cluster:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Ensure your CSI driver supports volume snapshots and implements the snapshot metadata capabilities with the required &lt;code&gt;external-snapshot-metadata&lt;/code&gt; sidecar&lt;/li&gt;
&lt;li&gt;Make sure the SnapshotMetadataService custom resource is registered using CRD&lt;/li&gt;
&lt;li&gt;Verify the presence of a SnapshotMetadataService custom resource for your CSI driver&lt;/li&gt;
&lt;li&gt;Create clients that can access the API using appropriate authentication (via Kubernetes ServiceAccount tokens)&lt;/li&gt;
&lt;/ol&gt;
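&lt;p&gt;As a quick sanity check for the steps above, you can look for the advertised objects with &lt;code&gt;kubectl&lt;/code&gt;. The resource and group names used here (&lt;code&gt;snapshotmetadataservices.cbt.storage.k8s.io&lt;/code&gt;) follow the enhancement proposal and may differ in your installation:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-shell&#34;&gt;# Confirm the CRD is registered (names per the enhancement proposal).
kubectl get crd snapshotmetadataservices.cbt.storage.k8s.io

# List SnapshotMetadataService objects; one should exist for each
# CSI driver that supports the feature.
kubectl get snapshotmetadataservices
&lt;/code&gt;&lt;/pre&gt;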
&lt;p&gt;The API provides two main functions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;GetMetadataAllocated&lt;/code&gt;: Lists blocks allocated in a single snapshot&lt;/li&gt;
&lt;li&gt;&lt;code&gt;GetMetadataDelta&lt;/code&gt;: Lists blocks changed between two snapshots&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;what-s-next&#34;&gt;What’s next?&lt;/h2&gt;
&lt;p&gt;Depending on feedback and adoption, the Kubernetes developers hope to graduate the CSI Snapshot Metadata implementation to beta in a future release.&lt;/p&gt;
&lt;h2 id=&#34;where-can-i-learn-more&#34;&gt;Where can I learn more?&lt;/h2&gt;
&lt;p&gt;For those interested in trying out this new feature:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Official Kubernetes CSI Developer &lt;a href=&#34;https://kubernetes-csi.github.io/docs/external-snapshot-metadata.html&#34;&gt;Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;The &lt;a href=&#34;https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/3314-csi-changed-block-tracking&#34;&gt;enhancement proposal&lt;/a&gt; for the snapshot metadata feature.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/kubernetes-csi/external-snapshot-metadata&#34;&gt;GitHub repository&lt;/a&gt; for implementation and release status of &lt;code&gt;external-snapshot-metadata&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Complete gRPC protocol definitions for snapshot metadata API: &lt;a href=&#34;https://github.com/kubernetes-csi/external-snapshot-metadata/blob/main/proto/schema.proto&#34;&gt;schema.proto&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Example snapshot metadata client implementation: &lt;a href=&#34;https://github.com/kubernetes-csi/external-snapshot-metadata/tree/main/examples/snapshot-metadata-lister&#34;&gt;snapshot-metadata-lister&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;End-to-end example with csi-hostpath-driver: &lt;a href=&#34;https://github.com/kubernetes-csi/csi-driver-host-path/blob/master/docs/example-ephemeral.md&#34;&gt;example documentation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;how-do-i-get-involved&#34;&gt;How do I get involved?&lt;/h2&gt;
&lt;p&gt;This project, like all of Kubernetes, is the result of hard work by many contributors from diverse backgrounds working together.
On behalf of SIG Storage, I would like to offer a huge thank you to the contributors who helped review the design and implementation of the project, including but not limited to the following:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Ben Swartzlander (&lt;a href=&#34;https://github.com/bswartz&#34;&gt;bswartz&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Carl Braganza (&lt;a href=&#34;https://github.com/carlbraganza&#34;&gt;carlbraganza&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Daniil Fedotov (&lt;a href=&#34;https://github.com/hairyhum&#34;&gt;hairyhum&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Ivan Sim (&lt;a href=&#34;https://github.com/ihcsim&#34;&gt;ihcsim&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Nikhil Ladha (&lt;a href=&#34;https://github.com/Nikhil-Ladha&#34;&gt;Nikhil-Ladha&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Prasad Ghangal (&lt;a href=&#34;https://github.com/PrasadG193&#34;&gt;PrasadG193&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Praveen M (&lt;a href=&#34;https://github.com/iPraveenParihar&#34;&gt;iPraveenParihar&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Rakshith R (&lt;a href=&#34;https://github.com/Rakshith-R&#34;&gt;Rakshith-R&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Xing Yang (&lt;a href=&#34;https://github.com/xing-yang&#34;&gt;xing-yang&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Thanks also to everyone who has contributed to the project, including others who helped review the
&lt;a href=&#34;https://github.com/kubernetes/enhancements/pull/4082&#34;&gt;KEP&lt;/a&gt; and the
&lt;a href=&#34;https://github.com/container-storage-interface/spec/pull/551&#34;&gt;CSI spec PR&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;For those interested in getting involved with the design and development of CSI or any part of the Kubernetes Storage system,
join the &lt;a href=&#34;https://github.com/kubernetes/community/tree/master/sig-storage&#34;&gt;Kubernetes Storage Special Interest Group&lt;/a&gt; (SIG).
We always welcome new contributors.&lt;/p&gt;
&lt;p&gt;The SIG also holds regular &lt;a href=&#34;https://docs.google.com/document/d/15tLCV3csvjHbKb16DVk-mfUmFry_Rlwo-2uG6KNGsfw/edit&#34;&gt;Data Protection Working Group meetings&lt;/a&gt;.
New attendees are welcome to join our discussions.&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>Kubernetes v1.34: Pod Level Resources Graduated to Beta</title>
      <link>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/09/22/kubernetes-v1-34-pod-level-resources/</link>
      <pubDate>Mon, 22 Sep 2025 10:30:00 -0800</pubDate>
      
      <guid>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/09/22/kubernetes-v1-34-pod-level-resources/</guid>
      <description>
        
        
        &lt;p&gt;On behalf of the Kubernetes community, I am thrilled to announce that the Pod Level Resources feature has graduated to Beta in the Kubernetes v1.34 release and is enabled by default! This significant milestone introduces a new layer of flexibility for defining and managing resource allocation for your Pods. This flexibility stems from the ability to specify CPU and memory resources for the Pod as a whole. Pod level resources can be combined with the container-level specifications to express the exact resource requirements and limits your application needs.&lt;/p&gt;
&lt;h2 id=&#34;pod-level-specification-for-resources&#34;&gt;Pod-level specification for resources&lt;/h2&gt;
&lt;p&gt;Until recently, resource specifications that applied to Pods were primarily defined
at the individual container level. While effective, this approach sometimes required
duplicating or meticulously calculating resource needs across multiple containers
within a single Pod. As a beta feature, Kubernetes allows you to specify the CPU,
memory and hugepages resources at the Pod-level. This means you can now define
resource requests and limits for an entire Pod, enabling easier resource sharing
without requiring granular, per-container management of these resources where
it&#39;s not needed.&lt;/p&gt;
&lt;h2 id=&#34;why-does-pod-level-specification-matter&#34;&gt;Why does Pod-level specification matter?&lt;/h2&gt;
&lt;p&gt;This feature enhances resource management in Kubernetes by offering &lt;em&gt;flexible resource management&lt;/em&gt; at both the Pod and container levels.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;It provides a consolidated approach to resource declaration, reducing the need for
meticulous, per-container management, especially for Pods with multiple
containers.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Pod-level resources enable containers within a pod to share unused resources
amongst themselves, promoting efficient utilization within the pod. For example,
it prevents sidecar containers from becoming performance bottlenecks. Previously,
a sidecar (e.g., a logging agent or service mesh proxy) hitting its individual CPU
limit could be throttled and slow down the entire Pod, even if the main
application container had plenty of spare CPU. With pod-level resources, the
sidecar and the main container can share the Pod&#39;s resource budget, ensuring smooth
operation during traffic spikes: either the whole Pod is throttled, or all
containers keep working.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;When both pod-level and container-level resources are specified, pod-level
requests and limits take precedence. This gives you, and cluster administrators,
a powerful way to enforce overall resource boundaries for your Pods.&lt;/p&gt;
&lt;p&gt;For scheduling, if a pod-level request is explicitly defined, the scheduler uses
that specific value to find a suitable node, instead of the aggregated requests of
the individual containers. At runtime, the pod-level limit acts as a hard ceiling
for the combined resource usage of all containers. Crucially, this pod-level limit
is the absolute enforcer; even if the sum of the individual container limits is
higher, the total resource consumption can never exceed the pod-level limit.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Pod-level resources take &lt;strong&gt;priority&lt;/strong&gt; in determining the Quality of Service (QoS) class of the Pod.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;For Pods running on Linux nodes, the Out-Of-Memory (OOM) score adjustment
calculation considers both pod-level and container-level resource requests.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Pod-level resources are &lt;strong&gt;designed to be compatible with existing Kubernetes functionalities&lt;/strong&gt;, ensuring a smooth integration into your workflows.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;how-to-specify-resources-for-an-entire-pod&#34;&gt;How to specify resources for an entire Pod&lt;/h2&gt;
&lt;p&gt;Using the &lt;code&gt;PodLevelResources&lt;/code&gt; &lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/reference/command-line-tools-reference/feature-gates/&#34;&gt;feature
gate&lt;/a&gt; requires
Kubernetes v1.34 or newer for all cluster components, including the control plane
and every node. This feature gate is in beta and enabled by default in v1.34.&lt;/p&gt;
&lt;h3 id=&#34;example-manifest&#34;&gt;Example manifest&lt;/h3&gt;
&lt;p&gt;You can specify CPU, memory and hugepages resources directly in the Pod spec manifest at the &lt;code&gt;resources&lt;/code&gt; field for the entire Pod.&lt;/p&gt;
&lt;p&gt;Here’s an example demonstrating a Pod with both CPU and memory requests and limits
defined at the Pod level:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;v1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Pod&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;metadata&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;pod-resources-demo&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;namespace&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;pod-resources-example&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;spec&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# The &amp;#39;resources&amp;#39; field at the Pod specification level defines the overall&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# resource budget for all containers within this Pod combined.&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;resources&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# Pod-level resources&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# &amp;#39;limits&amp;#39; specifies the maximum amount of resources the Pod is allowed to use.&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# The sum of the limits of all containers in the Pod cannot exceed these values.&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;limits&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;cpu&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;1&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# The entire Pod cannot use more than 1 CPU core.&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;memory&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;200Mi&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# The entire Pod cannot use more than 200 MiB of memory.&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# &amp;#39;requests&amp;#39; specifies the minimum amount of resources guaranteed to the Pod.&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# This value is used by the Kubernetes scheduler to find a node with enough capacity.&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;requests&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;cpu&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;1&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# The Pod is guaranteed 1 CPU core when scheduled.&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;memory&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;100Mi&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# The Pod is guaranteed 100 MiB of memory when scheduled.&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;containers&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;main-app-container&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;image&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;nginx&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;...&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# This container has no resource requests or limits specified.&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;auxiliary-container&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;image&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;fedora&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;command&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;[&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;sleep&amp;#34;&lt;/span&gt;,&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;inf&amp;#34;&lt;/span&gt;]&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;...&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# This container has no resource requests or limits specified.&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;In this example, the &lt;code&gt;pod-resources-demo&lt;/code&gt; Pod as a whole requests 1 CPU and 100 MiB of memory, and is limited to 1 CPU and 200 MiB of memory. The containers within will operate under these overall Pod-level constraints, as explained in the next section.&lt;/p&gt;
&lt;h3 id=&#34;interaction-with-container-level-resource-requests-or-limits&#34;&gt;Interaction with container-level resource requests or limits&lt;/h3&gt;
&lt;p&gt;When both pod-level and container-level resources are specified, &lt;strong&gt;pod-level requests and limits take precedence&lt;/strong&gt;. This means the node allocates resources based on the pod-level specifications.&lt;/p&gt;
&lt;p&gt;Consider a Pod with two containers where pod-level CPU and memory requests and
limits are defined, and only one container has its own explicit resource
definitions:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;v1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Pod&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;metadata&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;pod-resources-demo&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;namespace&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;pod-resources-example&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;spec&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;resources&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;limits&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;cpu&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;1&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;memory&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;200Mi&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;requests&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;cpu&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;1&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;memory&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;100Mi&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;containers&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;main-app-container&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;image&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;nginx&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;resources&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;requests&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;cpu&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;0.5&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;memory&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;50Mi&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;auxiliary-container&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;image&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;fedora&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;command&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;[&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;sleep&amp;#34;&lt;/span&gt;,&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;inf&amp;#34;&lt;/span&gt;]&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# This container has no resource requests or limits specified.&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Pod-Level Limits: The pod-level limits (cpu: &amp;quot;1&amp;quot;, memory: &amp;quot;200Mi&amp;quot;) establish an absolute boundary for the entire Pod. The sum of resources consumed by all its containers is capped at this ceiling and cannot exceed it.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Resource Sharing and Bursting: Containers can dynamically borrow any unused capacity, allowing them to burst as needed, so long as the Pod&#39;s aggregate usage stays within the overall limit.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Pod-Level Requests: The pod-level requests (cpu: &amp;quot;1&amp;quot;, memory: &amp;quot;100Mi&amp;quot;) serve as the foundational resource guarantee for the entire Pod. This value informs the scheduler&#39;s placement decision and represents the minimum resources the Pod can rely on during node-level contention.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Container-Level Requests: Container-level requests create a priority system within
the Pod&#39;s guaranteed budget. Because main-app-container has an explicit request
(cpu: &amp;quot;0.5&amp;quot;, memory: &amp;quot;50Mi&amp;quot;), it is given precedence under resource
pressure over the auxiliary-container, which has no
such explicit claim.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
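&lt;p&gt;To see how this plays out in practice, you can read the pod-level spec back from the API server once the Pod above has been created; the &lt;code&gt;jsonpath&lt;/code&gt; query below is just one illustrative way to do this:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-shell&#34;&gt;# Inspect the pod-level resources of the example Pod.
kubectl get pod pod-resources-demo -n pod-resources-example \
  -o jsonpath=&#39;{.spec.resources}&#39;
&lt;/code&gt;&lt;/pre&gt;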
&lt;h2 id=&#34;limitations&#34;&gt;Limitations&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;First of all, &lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/concepts/workloads/pods/#pod-update-and-replacement&#34;&gt;in-place
resize&lt;/a&gt; of pod-level
resources is &lt;strong&gt;not supported&lt;/strong&gt; for Kubernetes v1.34 (or earlier). Attempting to
modify the &lt;em&gt;pod-level&lt;/em&gt; resource limits or requests on a running Pod results in an
error: the resize is rejected. The v1.34 implementation of Pod level resources
focuses on allowing initial declaration of an overall resource envelope, that
applies to the &lt;strong&gt;entire Pod&lt;/strong&gt;. That is distinct from in-place pod resize, which
(despite what the name might suggest) allows you
to make dynamic adjustments to &lt;em&gt;container&lt;/em&gt; resource
requests and limits, within a &lt;em&gt;running&lt;/em&gt; Pod,
and potentially without a container restart. In-place resizing is also not yet a
stable feature; it graduated to Beta in the v1.33 release.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Only CPU, memory, and hugepages resources can be specified at the pod level.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Pod-level resources are not supported for Windows pods. If the Pod specification
explicitly targets Windows (e.g., by setting spec.os.name: &amp;quot;windows&amp;quot;), the API
server will reject the Pod during the validation step. If the Pod is not explicitly
marked for Windows but is scheduled to a Windows node (e.g., via a nodeSelector),
the Kubelet on that Windows node will reject the Pod during its admission process.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The Topology Manager, Memory Manager, and CPU Manager do not
align Pods and containers based on pod-level resources, as these resource managers
don&#39;t currently support them.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;getting-started-and-providing-feedback&#34;&gt;Getting started and providing feedback&lt;/h2&gt;
&lt;p&gt;Ready to explore the &lt;em&gt;Pod Level Resources&lt;/em&gt; feature? You&#39;ll need a Kubernetes cluster running version 1.34 or later. Remember to enable the &lt;code&gt;PodLevelResources&lt;/code&gt; &lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/reference/command-line-tools-reference/feature-gates/&#34;&gt;feature gate&lt;/a&gt; across your control plane and all nodes.&lt;/p&gt;
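&lt;p&gt;On a cluster you manage directly, enabling the gate typically means passing a flag to the relevant components; the sketch below is illustrative only, since exactly how you set component flags depends on how your cluster is deployed:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-shell&#34;&gt;# Enable the PodLevelResources feature gate on control plane and nodes.
kube-apiserver --feature-gates=PodLevelResources=true
kube-scheduler --feature-gates=PodLevelResources=true
kubelet --feature-gates=PodLevelResources=true
&lt;/code&gt;&lt;/pre&gt;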
&lt;p&gt;As this feature moves through Beta, your feedback is invaluable. Please report any issues or share your experiences via the standard Kubernetes communication channels:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Slack: &lt;a href=&#34;https://kubernetes.slack.com/messages/sig-node&#34;&gt;#sig-node&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://groups.google.com/forum/#!forum/kubernetes-sig-node&#34;&gt;Mailing list&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/kubernetes/community/labels/sig%2Fnode&#34;&gt;Open Community Issues/PRs&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

      </description>
    </item>
    
    <item>
      <title>Kubernetes v1.34: Recovery From Volume Expansion Failure (GA)</title>
      <link>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/09/19/kubernetes-v1-34-recover-expansion-failure/</link>
      <pubDate>Fri, 19 Sep 2025 10:30:00 -0800</pubDate>
      
      <guid>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/09/19/kubernetes-v1-34-recover-expansion-failure/</guid>
      <description>
        
        
        &lt;p&gt;Have you ever made a typo when expanding your persistent volumes in Kubernetes? Meant to specify &lt;code&gt;2TB&lt;/code&gt;
but specified &lt;code&gt;20TiB&lt;/code&gt;? This seemingly innocuous problem was surprisingly hard to address - and took the project almost 5 years to fix.
&lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/concepts/storage/persistent-volumes/#recovering-from-failure-when-expanding-volumes&#34;&gt;Automated recovery from storage expansion&lt;/a&gt; has been around for a while in beta; however, with the v1.34 release, we have graduated this to
&lt;strong&gt;general availability&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;While it was always possible to recover from failed volume expansions manually, it usually required cluster-admin access and was tedious to do (see the aforementioned link for more information).&lt;/p&gt;
&lt;p&gt;What if you make a mistake and then realize it immediately?
With Kubernetes v1.34, as long as the expansion to the previously requested
size hasn&#39;t finished, you can amend the request and reduce the size of the
PersistentVolumeClaim (PVC). Kubernetes will
automatically work to correct it: any quota consumed by the failed expansion is returned to you, and the associated PersistentVolume is resized to the
latest size you specified.&lt;/p&gt;
&lt;p&gt;I&#39;ll walk through an example of how all of this works.&lt;/p&gt;
&lt;h2 id=&#34;reducing-pvc-size-to-recover-from-failed-expansion&#34;&gt;Reducing PVC size to recover from failed expansion&lt;/h2&gt;
&lt;p&gt;Imagine that you are running out of disk space for one of your database servers, and you want to expand the PVC from previously
specified &lt;code&gt;10TB&lt;/code&gt; to &lt;code&gt;100TB&lt;/code&gt; - but you make a typo and specify &lt;code&gt;1000TB&lt;/code&gt;.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;PersistentVolumeClaim&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;v1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;metadata&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;myclaim&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;spec&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;accessModes&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;- ReadWriteOnce&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;resources&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;requests&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;storage&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;1000TB&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# newly specified size - but incorrect!&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Now, you may have run out of disk space on your disk array, or simply exhausted your allocated quota with your cloud provider. Either way, assume that the expansion to &lt;code&gt;1000TB&lt;/code&gt; is never going to succeed.&lt;/p&gt;
&lt;p&gt;In Kubernetes v1.34, you can simply correct your mistake by requesting a new, smaller PVC size,
provided it is still larger than the original size
of the actual PersistentVolume.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;PersistentVolumeClaim&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;v1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;metadata&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;myclaim&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;spec&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;accessModes&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;- ReadWriteOnce&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;resources&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;requests&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;storage&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;100TB&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# Corrected size; has to be greater than 10TB.&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;                     &lt;/span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# You cannot shrink the volume below its actual size.&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This requires no admin intervention. Even better, any surplus Kubernetes quota that you temporarily consumed will be automatically returned.&lt;/p&gt;
&lt;p&gt;This fault recovery mechanism does have a caveat: whatever new size you specify for the PVC, it &lt;strong&gt;must&lt;/strong&gt; still be higher than the original size in &lt;code&gt;.status.capacity&lt;/code&gt;.
Since Kubernetes doesn&#39;t support shrinking your PV objects, you can never go below the size that was originally allocated for your PVC request.&lt;/p&gt;
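&lt;p&gt;If you prefer not to edit the manifest, a one-line correction works too. This sketch assumes the &lt;code&gt;myclaim&lt;/code&gt; PVC from the example above:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-shell&#34;&gt;# Reduce the pending request from 1000TB back down to the intended 100TB.
kubectl patch pvc myclaim --type=merge \
  -p &#39;{&amp;#34;spec&amp;#34;: {&amp;#34;resources&amp;#34;: {&amp;#34;requests&amp;#34;: {&amp;#34;storage&amp;#34;: &amp;#34;100TB&amp;#34;}}}}&#39;
&lt;/code&gt;&lt;/pre&gt;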
&lt;h2 id=&#34;improved-error-handling-and-observability-of-volume-expansion&#34;&gt;Improved error handling and observability of volume expansion&lt;/h2&gt;
&lt;p&gt;Implementing what might look like a relatively minor change also required us to almost
fully redo how volume expansion works under the hood in Kubernetes.
There are new API fields available in PVC objects which you can monitor to observe progress of volume expansion.&lt;/p&gt;
&lt;h3 id=&#34;improved-observability-of-in-progress-expansion&#34;&gt;Improved observability of in-progress expansion&lt;/h3&gt;
&lt;p&gt;You can query &lt;code&gt;.status.allocatedResourceStatus[&#39;storage&#39;]&lt;/code&gt; of a PVC to monitor the progress of a volume expansion operation.
For a typical block volume, this should transition through &lt;code&gt;ControllerResizeInProgress&lt;/code&gt;, &lt;code&gt;NodeResizePending&lt;/code&gt;, and &lt;code&gt;NodeResizeInProgress&lt;/code&gt;, and become nil/empty when the volume expansion has finished.&lt;/p&gt;
&lt;p&gt;If, for some reason, expanding the volume to the requested size is not feasible, the status should instead report &lt;code&gt;ControllerResizeInfeasible&lt;/code&gt; or &lt;code&gt;NodeResizeInfeasible&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;You can also observe the size towards which Kubernetes is working by watching &lt;code&gt;pvc.status.allocatedResources&lt;/code&gt;.&lt;/p&gt;
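&lt;p&gt;For instance, one way to read these fields is via &lt;code&gt;jsonpath&lt;/code&gt; queries (the PVC name here is just the one from the earlier example):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-shell&#34;&gt;# Current resize status, e.g. ControllerResizeInProgress or NodeResizePending.
kubectl get pvc myclaim -o jsonpath=&#39;{.status.allocatedResourceStatus}&#39;

# The size Kubernetes is currently working towards.
kubectl get pvc myclaim -o jsonpath=&#39;{.status.allocatedResources}&#39;
&lt;/code&gt;&lt;/pre&gt;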
&lt;h3 id=&#34;improved-error-handling-and-reporting&#34;&gt;Improved error handling and reporting&lt;/h3&gt;
&lt;p&gt;Kubernetes now retries failed volume expansions at a slower rate, making fewer requests to both the storage system and the Kubernetes API server.&lt;/p&gt;
&lt;p&gt;Errors observed during volume expansion are now reported as conditions on PVC objects and, unlike events, they persist. Kubernetes will now populate &lt;code&gt;pvc.status.conditions&lt;/code&gt; with the error keys &lt;code&gt;ControllerResizeError&lt;/code&gt; or &lt;code&gt;NodeResizeError&lt;/code&gt; when volume expansion fails.&lt;/p&gt;
&lt;h3 id=&#34;fixes-long-standing-bugs-in-resizing-workflows&#34;&gt;Fixes for long-standing bugs in resizing workflows&lt;/h3&gt;
&lt;p&gt;This feature has also allowed us to fix long-standing bugs in the resizing workflow, such as &lt;a href=&#34;https://github.com/kubernetes/kubernetes/issues/115294&#34;&gt;Kubernetes issue #115294&lt;/a&gt;.
If you observe anything broken, please report your bugs to &lt;a href=&#34;https://github.com/kubernetes/kubernetes/issues/new/choose&#34;&gt;https://github.com/kubernetes/kubernetes/issues&lt;/a&gt;, along with details about how to reproduce the problem.&lt;/p&gt;
&lt;p&gt;Working on this feature through its lifecycle was challenging and it wouldn&#39;t have been possible to reach GA
without feedback from &lt;a href=&#34;https://github.com/msau42&#34;&gt;@msau42&lt;/a&gt;, &lt;a href=&#34;https://github.com/jsafrane&#34;&gt;@jsafrane&lt;/a&gt; and &lt;a href=&#34;https://github.com/xing-yang&#34;&gt;@xing-yang&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;All of the contributors who worked on this also appreciate the input provided by &lt;a href=&#34;https://github.com/thockin&#34;&gt;@thockin&lt;/a&gt; and &lt;a href=&#34;https://github.com/liggitt&#34;&gt;@liggitt&lt;/a&gt; at various Kubernetes contributor summits.&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>Kubernetes v1.34: DRA Consumable Capacity</title>
      <link>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/09/18/kubernetes-v1-34-dra-consumable-capacity/</link>
      <pubDate>Thu, 18 Sep 2025 10:30:00 -0800</pubDate>
      
      <guid>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/09/18/kubernetes-v1-34-dra-consumable-capacity/</guid>
      <description>
        
        
        &lt;p&gt;Dynamic Resource Allocation (DRA) is a Kubernetes API for managing scarce resources across Pods and containers.
It enables flexible resource requests, going beyond simply allocating &lt;em&gt;N&lt;/em&gt; number of devices to support more granular usage scenarios.
With DRA, users can request specific types of devices based on their attributes, define custom configurations tailored to their workloads, and even share the same resource among multiple containers or Pods.&lt;/p&gt;
&lt;p&gt;In this blog, we focus on the device sharing feature and dive into a new capability introduced in Kubernetes 1.34: &lt;em&gt;DRA consumable capacity&lt;/em&gt;,
which extends DRA to support finer-grained device sharing.&lt;/p&gt;
&lt;h2 id=&#34;background-device-sharing-via-resourceclaims&#34;&gt;Background: device sharing via ResourceClaims&lt;/h2&gt;
&lt;p&gt;From the beginning, DRA introduced the ability for multiple Pods to share a device by referencing the same ResourceClaim.
This design decouples resource allocation from specific hardware, allowing for more dynamic and reusable provisioning of devices.&lt;/p&gt;
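&lt;p&gt;As a minimal sketch of that pattern, a Pod can consume a pre-created ResourceClaim by name; the claim name &lt;code&gt;shared-device-claim&lt;/code&gt; below is hypothetical, and a second Pod referencing the same name would share the device:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-yaml&#34;&gt;# Minimal sketch: a Pod consuming a pre-created ResourceClaim.
apiVersion: v1
kind: Pod
metadata:
  name: consumer-a
spec:
  resourceClaims:
  - name: device                             # local name, referenced below
    resourceClaimName: shared-device-claim   # hypothetical pre-created claim
  containers:
  - name: app
    image: nginx
    resources:
      claims:
      - name: device   # grants this container access to the claimed device
&lt;/code&gt;&lt;/pre&gt;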
&lt;p&gt;In Kubernetes 1.33, the new support for &lt;em&gt;partitionable devices&lt;/em&gt; allowed resource drivers to advertise slices of a device that are available, rather than exposing the entire device as an all-or-nothing resource.
This enabled Kubernetes to model shareable hardware more accurately.&lt;/p&gt;
&lt;p&gt;But there was still a missing piece: DRA didn&#39;t yet support scenarios
where the device driver manages fine-grained, dynamic portions of a device resource, such as network bandwidth, based on user demand,
or where those resources are shared independently of ResourceClaims, which are restricted by their spec and namespace.&lt;/p&gt;
&lt;p&gt;That’s where &lt;em&gt;consumable capacity&lt;/em&gt; for DRA comes in.&lt;/p&gt;
&lt;h2 id=&#34;benefits-of-dra-consumable-capacity-support&#34;&gt;Benefits of DRA consumable capacity support&lt;/h2&gt;
&lt;p&gt;Here&#39;s a taste of what you get in a cluster with the &lt;code&gt;DRAConsumableCapacity&lt;/code&gt;
&lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/reference/command-line-tools-reference/feature-gates/&#34;&gt;feature gate&lt;/a&gt; enabled.&lt;/p&gt;
&lt;h3 id=&#34;device-sharing-across-multiple-resourceclaims-or-devicerequests&#34;&gt;Device sharing across multiple ResourceClaims or DeviceRequests&lt;/h3&gt;
&lt;p&gt;Resource drivers can now support sharing the same device — or even a slice of a device — across multiple ResourceClaims or across multiple DeviceRequests.&lt;/p&gt;
&lt;p&gt;This means that Pods from different namespaces can simultaneously share the same device,
if permitted and supported by the specific DRA driver.&lt;/p&gt;
&lt;h3 id=&#34;device-resource-allocation&#34;&gt;Device resource allocation&lt;/h3&gt;
&lt;p&gt;Kubernetes extends the allocation algorithm in the scheduler to support allocating a portion of a device&#39;s resources, as defined in the &lt;code&gt;capacity&lt;/code&gt; field.
The scheduler ensures that the total allocated capacity across all consumers never exceeds the device’s total capacity, even when shared across multiple ResourceClaims or DeviceRequests.
This is very similar to the way the scheduler allows Pods and containers to share allocatable resources on Nodes;
in this case, it allows them to share allocatable (consumable) resources on Devices.&lt;/p&gt;
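&lt;p&gt;As a rough illustration of the invariant described above (this is not the actual scheduler code; the &lt;code&gt;device&lt;/code&gt; type and &lt;code&gt;fits&lt;/code&gt; helper are assumptions for this sketch), the capacity check amounts to summing the shares already granted and rejecting any request that would push the total past the device&#39;s capacity:&lt;/p&gt;

```go
package main

import "fmt"

// device is a simplified stand-in for a shareable device with one
// consumable capacity (units are GiB here, for readability).
type device struct {
	total     int64   // total consumable capacity of the device
	allocated []int64 // shares already granted to ResourceClaims/DeviceRequests
}

// fits reports whether a new request of size req can still be allocated
// without exceeding the device's total capacity.
func (d device) fits(req int64) bool {
	var used int64
	for _, a := range d.allocated {
		used += a
	}
	return used+req <= d.total
}

func main() {
	// 40 GiB device with 10 GiB and 20 GiB already allocated.
	gpu := device{total: 40, allocated: []int64{10, 20}}
	fmt.Println(gpu.fits(10)) // true: exactly fills the remaining 10 GiB
	fmt.Println(gpu.fits(15)) // false: would oversubscribe the device
}
```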
&lt;p&gt;This feature expands support for scenarios where the device driver is able to manage resources &lt;strong&gt;within&lt;/strong&gt; a device and on a per-process basis — for example,
allocating a specific amount of memory (e.g., 8 GiB) from a virtual GPU,
or setting bandwidth limits on virtual network interfaces allocated to specific Pods. This aims to provide safe and efficient resource sharing.&lt;/p&gt;
&lt;h3 id=&#34;distinctattribute-constraint&#34;&gt;DistinctAttribute constraint&lt;/h3&gt;
&lt;p&gt;This feature also introduces a new constraint: &lt;code&gt;DistinctAttribute&lt;/code&gt;, which is the complement of the existing &lt;code&gt;MatchAttribute&lt;/code&gt; constraint.&lt;/p&gt;
&lt;p&gt;The primary goal of &lt;code&gt;DistinctAttribute&lt;/code&gt; is to prevent the same underlying device from being allocated multiple times within a single ResourceClaim, which could happen since we are allocating shares (or subsets) of devices.
This constraint ensures that each allocation refers to a distinct resource, even if they belong to the same device class.&lt;/p&gt;
&lt;p&gt;It is useful for use cases such as allocating network devices connecting to different subnets to expand coverage or provide redundancy across failure domains.&lt;/p&gt;
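&lt;p&gt;A simplified sketch of the pairwise-distinct check that this constraint performs (attribute values are plain strings here for illustration; real DRA device attributes are typed):&lt;/p&gt;

```go
package main

import "fmt"

// distinctAttribute reports whether every device carries a different
// value for the named attribute. A missing value or a duplicate means
// the constraint is not satisfied, which prevents the same underlying
// device (or a share of it) from being picked twice in one claim.
func distinctAttribute(attr string, devices []map[string]string) bool {
	seen := map[string]bool{}
	for _, d := range devices {
		v, ok := d[attr]
		if !ok || seen[v] {
			return false
		}
		seen[v] = true
	}
	return true
}

func main() {
	devs := []map[string]string{
		{"subnet": "subnet-a"},
		{"subnet": "subnet-b"},
	}
	fmt.Println(distinctAttribute("subnet", devs)) // true: all values differ

	devs[1]["subnet"] = "subnet-a"
	fmt.Println(distinctAttribute("subnet", devs)) // false: duplicate value
}
```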
&lt;h2 id=&#34;how-to-use-consumable-capacity&#34;&gt;How to use consumable capacity?&lt;/h2&gt;
&lt;p&gt;&lt;code&gt;DRAConsumableCapacity&lt;/code&gt; is introduced as an alpha feature in Kubernetes 1.34. The &lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/reference/command-line-tools-reference/feature-gates/&#34;&gt;feature gate&lt;/a&gt; &lt;code&gt;DRAConsumableCapacity&lt;/code&gt; must be enabled in kubelet, kube-apiserver, kube-scheduler and kube-controller-manager.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;--feature-gates&lt;span style=&#34;color:#666&#34;&gt;=&lt;/span&gt;...,DRAConsumableCapacity&lt;span style=&#34;color:#666&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#a2f&#34;&gt;true&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h3 id=&#34;as-a-dra-driver-developer&#34;&gt;As a DRA driver developer&lt;/h3&gt;
&lt;p&gt;As a DRA driver developer writing in Go, you can make a device within a ResourceSlice allocatable to multiple ResourceClaims (or &lt;code&gt;devices.requests&lt;/code&gt;) by setting &lt;code&gt;AllowMultipleAllocations&lt;/code&gt; to &lt;code&gt;true&lt;/code&gt;.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-go&#34; data-lang=&#34;go&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;Device {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  &lt;span style=&#34;color:#666&#34;&gt;...&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  AllowMultipleAllocations: ptr.&lt;span style=&#34;color:#00a000&#34;&gt;To&lt;/span&gt;(&lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;true&lt;/span&gt;),
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  &lt;span style=&#34;color:#666&#34;&gt;...&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Additionally, you can define a policy to restrict how each device&#39;s &lt;code&gt;Capacity&lt;/code&gt; should be consumed by each &lt;code&gt;DeviceRequest&lt;/code&gt; by defining the &lt;code&gt;RequestPolicy&lt;/code&gt; field in &lt;code&gt;DeviceCapacity&lt;/code&gt;.
The example below shows how to define a policy that requires a GPU with 40 GiB of memory to allocate at least 5 GiB per request, with each allocation in multiples of 5 GiB.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-go&#34; data-lang=&#34;go&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;DeviceCapacity{
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  Value: resource.&lt;span style=&#34;color:#00a000&#34;&gt;MustParse&lt;/span&gt;(&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;40Gi&amp;#34;&lt;/span&gt;),
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  RequestPolicy: &lt;span style=&#34;color:#666&#34;&gt;&amp;amp;&lt;/span&gt;CapacityRequestPolicy{
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    Default: ptr.&lt;span style=&#34;color:#00a000&#34;&gt;To&lt;/span&gt;(resource.&lt;span style=&#34;color:#00a000&#34;&gt;MustParse&lt;/span&gt;(&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;5Gi&amp;#34;&lt;/span&gt;)),
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    ValidRange: &lt;span style=&#34;color:#666&#34;&gt;&amp;amp;&lt;/span&gt;CapacityRequestPolicyRange {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;      Min: ptr.&lt;span style=&#34;color:#00a000&#34;&gt;To&lt;/span&gt;(resource.&lt;span style=&#34;color:#00a000&#34;&gt;MustParse&lt;/span&gt;(&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;5Gi&amp;#34;&lt;/span&gt;)),
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;      Step: ptr.&lt;span style=&#34;color:#00a000&#34;&gt;To&lt;/span&gt;(resource.&lt;span style=&#34;color:#00a000&#34;&gt;MustParse&lt;/span&gt;(&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;5Gi&amp;#34;&lt;/span&gt;)),
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    }
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  }
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This will be published to the ResourceSlice, as partially shown below:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;resource.k8s.io/v1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;ResourceSlice&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#00f;font-weight:bold&#34;&gt;...&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;spec&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;devices&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;gpu0&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;allowMultipleAllocations&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;true&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;capacity&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;memory&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;value&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;40Gi&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;requestPolicy&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;          &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;default&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;5Gi&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;          &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;validRange&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;            &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;min&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;5Gi&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;            &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;step&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;5Gi&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;An allocated device with a specified portion of consumed capacity will have a &lt;code&gt;ShareID&lt;/code&gt; field set in the allocation status.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-go&#34; data-lang=&#34;go&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;claim.Status.Allocation.Devices.Results[i].ShareID
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This &lt;code&gt;ShareID&lt;/code&gt; allows the driver to distinguish between different allocations that refer to the &lt;strong&gt;same device or same statically-partitioned slice&lt;/strong&gt; but come from &lt;strong&gt;different &lt;code&gt;ResourceClaim&lt;/code&gt; requests&lt;/strong&gt;.&lt;br&gt;
It acts as a unique identifier for each shared slice, enabling the driver to manage and enforce resource limits independently across multiple consumers.&lt;/p&gt;
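&lt;p&gt;On the driver side, the effect is that per-consumer state can be keyed by device name plus &lt;code&gt;ShareID&lt;/code&gt;. The &lt;code&gt;limiter&lt;/code&gt; type below is a hypothetical sketch, not part of any published DRA library:&lt;/p&gt;

```go
package main

import "fmt"

// shareKey identifies one allocated share of a device: the device name
// plus the ShareID from the allocation status. ShareID is empty for
// exclusive (single-allocation) devices.
type shareKey struct {
	device  string
	shareID string
}

// limiter tracks a per-share resource limit (e.g. GiB of memory or
// Mbit/s of bandwidth) that the driver enforces for each consumer.
type limiter struct {
	limits map[shareKey]int64
}

func (l *limiter) set(device, shareID string, limit int64) {
	if l.limits == nil {
		l.limits = map[shareKey]int64{}
	}
	l.limits[shareKey{device, shareID}] = limit
}

func main() {
	var l limiter
	// Same device allocated twice, from two different ResourceClaims,
	// each with its own ShareID (values here are illustrative).
	l.set("gpu0", "share-claim-a", 5)
	l.set("gpu0", "share-claim-b", 10)
	fmt.Println(len(l.limits)) // 2: independent entries for the same device
}
```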
&lt;h3 id=&#34;as-a-consumer&#34;&gt;As a consumer&lt;/h3&gt;
&lt;p&gt;As a consumer (or user), the device resource can be requested with a ResourceClaim like this:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;resource.k8s.io/v1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;ResourceClaim&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#00f;font-weight:bold&#34;&gt;...&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;spec&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;devices&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;requests&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# for devices&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;req0&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;exactly&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;deviceClassName&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;resource.example.com&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;capacity&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;          &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;requests&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# for resources which must be provided by those devices&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;            &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;memory&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;10Gi&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This configuration ensures that the requested device can provide at least 10GiB of &lt;code&gt;memory&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Note that &lt;strong&gt;any&lt;/strong&gt; &lt;code&gt;resource.example.com&lt;/code&gt; device that has at least 10GiB of memory can be allocated.
If a device that does not support multiple allocations is chosen, the allocation would consume the entire device.
To filter only devices that support multiple allocations, you can define a selector like this:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;selectors&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;cel&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;expression&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;|-&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;        device.allowMultipleAllocations == true&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id=&#34;integration-with-dra-device-status&#34;&gt;Integration with DRA device status&lt;/h2&gt;
&lt;p&gt;In device sharing, general device information is provided through the resource slice.
However, some details are set dynamically after allocation.
These can be conveyed using the &lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#resourceclaim-device-status&#34;&gt;&lt;code&gt;.status.devices&lt;/code&gt;&lt;/a&gt; field of a ResourceClaim.
That field is only published in clusters where the &lt;code&gt;DRAResourceClaimDeviceStatus&lt;/code&gt;
feature gate is enabled.&lt;/p&gt;
&lt;p&gt;If you do have &lt;em&gt;device status&lt;/em&gt; support available, a driver can expose additional device-specific information beyond the &lt;code&gt;ShareID&lt;/code&gt;.
One particularly useful use case is for virtual networks, where a driver can include the assigned IP address(es) in the status.
This is valuable for both network service operations and troubleshooting.&lt;/p&gt;
&lt;p&gt;You can find more information by watching our recording at: &lt;a href=&#34;https://sched.co/1x71v&#34;&gt;KubeCon Japan 2025 - Reimagining Cloud Native Networks: The Critical Role of DRA&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;what-can-you-do-next&#34;&gt;What can you do next?&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Check out the &lt;a href=&#34;https://github.com/kubernetes-sigs/cni-dra-driver&#34;&gt;CNI DRA Driver project&lt;/a&gt;&lt;/strong&gt; for an example of DRA integration in Kubernetes networking. Try integrating with network resources like &lt;code&gt;macvlan&lt;/code&gt;, &lt;code&gt;ipvlan&lt;/code&gt;, or smart NICs.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Start enabling the &lt;code&gt;DRAConsumableCapacity&lt;/code&gt; feature gate and experimenting with virtualized or partitionable devices. Specify your workloads with &lt;em&gt;consumable capacity&lt;/em&gt; (for example: fractional bandwidth or memory).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Let us know your feedback:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;✅ What worked well?&lt;/li&gt;
&lt;li&gt;⚠️ What didn’t?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you encounter issues or see opportunities for enhancement,
please &lt;a href=&#34;https://github.com/kubernetes/enhancements/issues&#34;&gt;file a new issue&lt;/a&gt;
and reference &lt;a href=&#34;https://github.com/kubernetes/enhancements/issues/5075&#34;&gt;KEP-5075&lt;/a&gt; there,
or reach out via &lt;a href=&#34;https://kubernetes.slack.com/archives/C0409NGC1TK&#34;&gt;Slack (#wg-device-management)&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;conclusion&#34;&gt;Conclusion&lt;/h3&gt;
&lt;p&gt;Consumable capacity support enhances the device sharing capability of DRA by allowing devices to be shared effectively across namespaces and across claims, tailored to each Pod’s actual needs.
It also empowers drivers to enforce capacity limits, improves scheduling accuracy, and unlocks new use cases like bandwidth-aware networking and multi-tenant device sharing.&lt;/p&gt;
&lt;p&gt;Try it out, experiment with consumable resources, and help shape the future of dynamic resource allocation in Kubernetes!&lt;/p&gt;
&lt;h3 id=&#34;further-reading&#34;&gt;Further Reading&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/concepts/scheduling-eviction/dynamic-resource-allocation/&#34;&gt;DRA in the Kubernetes documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/kubernetes/enhancements/tree/master/keps/sig-scheduling/4815-dra-partitionable-devices&#34;&gt;KEP for DRA Partitionable Devices&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/4817-resource-claim-device-status&#34;&gt;KEP for DRA Device Status&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/kubernetes/enhancements/tree/master/keps/sig-scheduling/5075-dra-consumable-capacity&#34;&gt;KEP for DRA Consumable Capacity&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.kubernetes.dev/resources/release/#kubernetes-v134&#34;&gt;Kubernetes 1.34 Release Notes&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

      </description>
    </item>
    
    <item>
      <title>Kubernetes v1.34: Pods Report DRA Resource Health</title>
      <link>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/09/17/kubernetes-v1-34-pods-report-dra-resource-health/</link>
      <pubDate>Wed, 17 Sep 2025 10:30:00 -0800</pubDate>
      
      <guid>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/09/17/kubernetes-v1-34-pods-report-dra-resource-health/</guid>
      <description>
        
        
        &lt;p&gt;The rise of AI/ML and other high-performance workloads has made specialized hardware like GPUs, TPUs, and FPGAs a critical component of many Kubernetes clusters. However, as discussed in a &lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/07/03/navigating-failures-in-pods-with-devices/&#34;&gt;previous blog post about navigating failures in Pods with devices&lt;/a&gt;, when this hardware fails, it can be difficult to diagnose, leading to significant downtime. With the release of Kubernetes v1.34, we are excited to announce a new alpha feature that brings much-needed visibility into the health of these devices.&lt;/p&gt;
&lt;p&gt;This work extends the functionality of &lt;a href=&#34;https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/4680-add-resource-health-to-pod-status&#34;&gt;KEP-4680&lt;/a&gt;, which first introduced a mechanism for reporting the health of devices managed by Device Plugins. Now, this capability is being extended to &lt;em&gt;Dynamic Resource Allocation (DRA)&lt;/em&gt;. Controlled by the &lt;code&gt;ResourceHealthStatus&lt;/code&gt; feature gate, this enhancement allows DRA drivers to report device health directly into a Pod&#39;s &lt;code&gt;.status&lt;/code&gt; field, providing crucial insights for operators and developers.&lt;/p&gt;
&lt;h2 id=&#34;why-expose-device-health-in-pod-status&#34;&gt;Why expose device health in Pod status?&lt;/h2&gt;
&lt;p&gt;For stateful applications or long-running jobs, a device failure can be disruptive and costly. By exposing device health in the &lt;code&gt;.status&lt;/code&gt; field for a Pod, Kubernetes provides a standardized way for users and automation tools to quickly diagnose issues. If a Pod is failing, you can now check its status to see if an unhealthy device is the root cause, saving valuable time that might otherwise be spent debugging application code.&lt;/p&gt;
&lt;h2 id=&#34;how-it-works&#34;&gt;How it works&lt;/h2&gt;
&lt;p&gt;This feature introduces a new, optional communication channel between the Kubelet and DRA drivers, built on three core components.&lt;/p&gt;
&lt;h3 id=&#34;a-new-grpc-health-service&#34;&gt;A new gRPC health service&lt;/h3&gt;
&lt;p&gt;A new gRPC service, &lt;code&gt;DRAResourceHealth&lt;/code&gt;, is defined in the &lt;code&gt;dra-health/v1alpha1&lt;/code&gt; API group. DRA drivers can implement this service to stream device health updates to the Kubelet. The service includes a &lt;code&gt;NodeWatchResources&lt;/code&gt; server-streaming RPC that sends the health status (&lt;code&gt;Healthy&lt;/code&gt;, &lt;code&gt;Unhealthy&lt;/code&gt;, or &lt;code&gt;Unknown&lt;/code&gt;) for the devices it manages.&lt;/p&gt;
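&lt;p&gt;Conceptually, the streaming contract looks like the sketch below, using plain Go channels instead of a real gRPC stream, with illustrative type names: the driver pushes health updates for the devices it manages, and the kubelet consumes them, keeping only the latest status per device.&lt;/p&gt;

```go
package main

import "fmt"

// healthUpdate is a stand-in for one message on the server stream.
type healthUpdate struct {
	pool, device string
	health       string // "Healthy", "Unhealthy", or "Unknown"
}

// watch simulates the server-streaming RPC: the driver sends updates
// until it closes the stream.
func watch(updates []healthUpdate) <-chan healthUpdate {
	ch := make(chan healthUpdate)
	go func() {
		defer close(ch)
		for _, u := range updates {
			ch <- u
		}
	}()
	return ch
}

func main() {
	// cache plays the role of the kubelet-side health cache.
	cache := map[string]string{}
	for u := range watch([]healthUpdate{
		{"pool-a", "gpu0", "Healthy"},
		{"pool-a", "gpu0", "Unhealthy"},
	}) {
		cache[u.pool+"/"+u.device] = u.health
	}
	fmt.Println(cache["pool-a/gpu0"]) // Unhealthy: the latest status wins
}
```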
&lt;h3 id=&#34;kubelet-integration&#34;&gt;Kubelet integration&lt;/h3&gt;
&lt;p&gt;The Kubelet’s &lt;code&gt;DRAPluginManager&lt;/code&gt; discovers which drivers implement the health service. For each compatible driver, it starts a long-lived &lt;code&gt;NodeWatchResources&lt;/code&gt; stream to receive health updates. The DRA Manager then consumes these updates and stores them in a persistent &lt;code&gt;healthInfoCache&lt;/code&gt; that can survive Kubelet restarts.&lt;/p&gt;
&lt;h3 id=&#34;populating-the-pod-status&#34;&gt;Populating the Pod status&lt;/h3&gt;
&lt;p&gt;When a device&#39;s health changes, the DRA manager identifies all Pods affected by the change and triggers a Pod status update. A new field, &lt;code&gt;allocatedResourcesStatus&lt;/code&gt;, is now part of the &lt;code&gt;v1.ContainerStatus&lt;/code&gt; API object. The Kubelet populates this field with the current health of each device allocated to the container.&lt;/p&gt;
&lt;h2 id=&#34;a-practical-example&#34;&gt;A practical example&lt;/h2&gt;
&lt;p&gt;If a Pod is in a &lt;code&gt;CrashLoopBackOff&lt;/code&gt; state, you can use &lt;code&gt;kubectl describe pod &amp;lt;pod-name&amp;gt;&lt;/code&gt; to inspect its status. If an allocated device has failed, the output will now include the &lt;code&gt;allocatedResourcesStatus&lt;/code&gt; field, clearly indicating the problem:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;status&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;containerStatuses&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;my-gpu-intensive-container&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# ... other container statuses&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;allocatedResourcesStatus&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;claim:my-gpu-claim&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;resources&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;resourceID&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;example.com/gpu-a1b2-c3d4&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;health&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;Unhealthy&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This explicit status makes it clear that the issue is with the underlying hardware, not the application.&lt;/p&gt;
&lt;p&gt;Now you can improve your failure detection logic to react to unhealthy devices associated with a Pod, for example by de-scheduling the affected Pod.&lt;/p&gt;
&lt;h2 id=&#34;how-to-use-this-feature&#34;&gt;How to use this feature&lt;/h2&gt;
&lt;p&gt;As this is an alpha feature in Kubernetes v1.34, you must take the following steps to use it:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Enable the &lt;code&gt;ResourceHealthStatus&lt;/code&gt; feature gate on your kube-apiserver and kubelets.&lt;/li&gt;
&lt;li&gt;Ensure you are using a DRA driver that implements the v1alpha1 &lt;code&gt;DRAResourceHealth&lt;/code&gt; gRPC service.&lt;/li&gt;
&lt;/ol&gt;
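&lt;p&gt;As a minimal sketch of step 1, the kubelet side can be configured through the kubelet configuration file; on the kube-apiserver, the equivalent is the &lt;code&gt;--feature-gates=ResourceHealthStatus=true&lt;/code&gt; command-line flag:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  ResourceHealthStatus: true
&lt;/code&gt;&lt;/pre&gt;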
&lt;h2 id=&#34;dra-drivers&#34;&gt;DRA drivers&lt;/h2&gt;
&lt;p&gt;If you are developing a DRA driver, make sure to think about a device failure detection strategy and ensure that your driver integrates with this feature. This way, your driver will improve the user experience and simplify the debugging of hardware issues.&lt;/p&gt;
&lt;h2 id=&#34;what-s-next&#34;&gt;What&#39;s next?&lt;/h2&gt;
&lt;p&gt;This is the first step in a broader effort to improve how Kubernetes handles device failures. As we gather feedback on this alpha feature, the community is planning several key enhancements before graduating to Beta:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Detailed health messages:&lt;/em&gt; To improve the troubleshooting experience, we plan to add a human-readable message field to the gRPC API. This will allow DRA drivers to provide specific context for a health status, such as &amp;quot;GPU temperature exceeds threshold&amp;quot; or &amp;quot;NVLink connection lost&amp;quot;.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Configurable health timeouts:&lt;/em&gt; The timeout for marking a device&#39;s health as &amp;quot;Unknown&amp;quot; is currently hardcoded. We plan to make this configurable, likely on a per-driver basis, to better accommodate the different health-reporting characteristics of various hardware.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Improved post-mortem troubleshooting:&lt;/em&gt; We will address a known limitation where health updates may not be applied to pods that have already terminated. This fix will ensure that the health status of a device at the time of failure is preserved, which is crucial for troubleshooting batch jobs and other &amp;quot;run-to-completion&amp;quot; workloads.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This feature was developed as part of &lt;a href=&#34;https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/4680-add-resource-health-to-pod-status&#34;&gt;KEP-4680&lt;/a&gt;, and community feedback is crucial as we work toward graduating it to Beta. More improvements to device failure handling in Kubernetes are on the way; we encourage you to try this feature out and share your experiences with the SIG Node community!&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>Kubernetes v1.34: Moving Volume Group Snapshots to v1beta2</title>
      <link>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/09/16/kubernetes-v1-34-volume-group-snapshot-beta-2/</link>
      <pubDate>Tue, 16 Sep 2025 10:30:00 -0800</pubDate>
      
      <guid>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/09/16/kubernetes-v1-34-volume-group-snapshot-beta-2/</guid>
      <description>
        
        
        &lt;p&gt;Volume group snapshots were &lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2023/05/08/kubernetes-1-27-volume-group-snapshot-alpha/&#34;&gt;introduced&lt;/a&gt;
as an Alpha feature with the Kubernetes 1.27 release and moved to &lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2024/12/18/kubernetes-1-32-volume-group-snapshot-beta/&#34;&gt;Beta&lt;/a&gt; in the Kubernetes 1.32 release.
The recent release of Kubernetes v1.34 moved that support to a second beta.
The support for volume group snapshots relies on a set of
&lt;a href=&#34;https://kubernetes-csi.github.io/docs/group-snapshot-restore-feature.html#volume-group-snapshot-apis&#34;&gt;extension APIs for group snapshots&lt;/a&gt;.
These APIs allow users to take crash consistent snapshots for a set of volumes.
Behind the scenes, Kubernetes uses a label selector to group multiple PersistentVolumeClaims
for snapshotting.
A key aim is to allow you to restore that set of snapshots to new volumes and
recover your workload based on a crash consistent recovery point.&lt;/p&gt;
&lt;p&gt;This new feature is only supported for &lt;a href=&#34;https://kubernetes-csi.github.io/docs/&#34;&gt;CSI&lt;/a&gt; volume drivers.&lt;/p&gt;
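&lt;p&gt;As a sketch of what the label-selector grouping looks like in practice, a VolumeGroupSnapshot selects its member PersistentVolumeClaims with a label selector; the names &lt;code&gt;my-group-snapshot&lt;/code&gt;, &lt;code&gt;my-group-snapshot-class&lt;/code&gt;, and the &lt;code&gt;app: my-app&lt;/code&gt; label below are illustrative placeholders:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;apiVersion: groupsnapshot.storage.k8s.io/v1beta2
kind: VolumeGroupSnapshot
metadata:
  name: my-group-snapshot
  namespace: demo
spec:
  volumeGroupSnapshotClassName: my-group-snapshot-class
  source:
    selector:
      matchLabels:
        app: my-app
&lt;/code&gt;&lt;/pre&gt;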
&lt;h2 id=&#34;what-s-new-in-beta-2&#34;&gt;What&#39;s new in Beta 2?&lt;/h2&gt;
&lt;p&gt;While testing the beta version, we encountered an &lt;a href=&#34;https://github.com/kubernetes-csi/external-snapshotter/issues/1271&#34;&gt;issue&lt;/a&gt; where the &lt;code&gt;restoreSize&lt;/code&gt; field is not set for individual VolumeSnapshotContents and VolumeSnapshots if the CSI driver does not implement the ListSnapshots RPC call.
We evaluated various options &lt;a href=&#34;https://docs.google.com/document/d/1LLBSHcnlLTaP6ZKjugtSGQHH2LGZPndyfnNqR1YvzS4/edit?tab=t.0&#34;&gt;here&lt;/a&gt; and decided to address this by releasing a new beta of the API.&lt;/p&gt;
&lt;p&gt;Specifically, v1beta2 adds a VolumeSnapshotInfo struct, which holds information about an individual volume snapshot that is a member of a volume group snapshot.
VolumeSnapshotInfoList, a list of VolumeSnapshotInfo, is added to VolumeGroupSnapshotContentStatus, replacing VolumeSnapshotHandlePairList.
This list carries the snapshot information returned by the CSI driver to identify snapshots on the storage system, and is populated by the csi-snapshotter sidecar based on the CSI CreateVolumeGroupSnapshotResponse returned by the CSI driver&#39;s CreateVolumeGroupSnapshot call.&lt;/p&gt;
&lt;p&gt;The existing v1beta1 API objects will be converted to the new v1beta2 API objects by a conversion webhook.&lt;/p&gt;
&lt;h2 id=&#34;what-s-next&#34;&gt;What’s next?&lt;/h2&gt;
&lt;p&gt;Depending on feedback and adoption, the Kubernetes project plans to push the volume
group snapshot implementation to general availability (GA) in a future release.&lt;/p&gt;
&lt;h2 id=&#34;how-can-i-learn-more&#34;&gt;How can I learn more?&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;The &lt;a href=&#34;https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/3476-volume-group-snapshot&#34;&gt;design spec&lt;/a&gt;
for the volume group snapshot feature.&lt;/li&gt;
&lt;li&gt;The &lt;a href=&#34;https://github.com/kubernetes-csi/external-snapshotter&#34;&gt;code repository&lt;/a&gt; for volume group
snapshot APIs and controller.&lt;/li&gt;
&lt;li&gt;CSI &lt;a href=&#34;https://kubernetes-csi.github.io/docs/&#34;&gt;documentation&lt;/a&gt; on the group snapshot feature.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;how-do-i-get-involved&#34;&gt;How do I get involved?&lt;/h2&gt;
&lt;p&gt;This project, like all of Kubernetes, is the result of hard work by many contributors
from diverse backgrounds working together. On behalf of SIG Storage, I would like to
offer a huge thank you to the contributors who stepped up these last few quarters
to help the project reach beta:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Ben Swartzlander (&lt;a href=&#34;https://github.com/bswartz&#34;&gt;bswartz&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Hemant Kumar (&lt;a href=&#34;https://github.com/gnufied&#34;&gt;gnufied&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Jan Šafránek (&lt;a href=&#34;https://github.com/jsafrane&#34;&gt;jsafrane&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Madhu Rajanna (&lt;a href=&#34;https://github.com/Madhu-1&#34;&gt;Madhu-1&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Michelle Au (&lt;a href=&#34;https://github.com/msau42&#34;&gt;msau42&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Niels de Vos (&lt;a href=&#34;https://github.com/nixpanic&#34;&gt;nixpanic&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Leonardo Cecchi (&lt;a href=&#34;https://github.com/leonardoce&#34;&gt;leonardoce&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Saad Ali (&lt;a href=&#34;https://github.com/saad-ali&#34;&gt;saad-ali&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Xing Yang (&lt;a href=&#34;https://github.com/xing-yang&#34;&gt;xing-yang&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Yati Padia (&lt;a href=&#34;https://github.com/yati1998&#34;&gt;yati1998&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For those interested in getting involved with the design and development of CSI or
any part of the Kubernetes Storage system, join the
&lt;a href=&#34;https://github.com/kubernetes/community/tree/master/sig-storage&#34;&gt;Kubernetes Storage Special Interest Group&lt;/a&gt; (SIG).
We always welcome new contributors.&lt;/p&gt;
&lt;p&gt;We also hold regular &lt;a href=&#34;https://github.com/kubernetes/community/tree/master/wg-data-protection&#34;&gt;Data Protection Working Group meetings&lt;/a&gt;.
New attendees are welcome to join our discussions.&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>Kubernetes v1.34: Decoupled Taint Manager Is Now Stable</title>
      <link>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/09/15/kubernetes-v1-34-decoupled-taint-manager-is-now-stable/</link>
      <pubDate>Mon, 15 Sep 2025 10:30:00 -0800</pubDate>
      
      <guid>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/09/15/kubernetes-v1-34-decoupled-taint-manager-is-now-stable/</guid>
      <description>
        
        
        &lt;p&gt;This enhancement separates the responsibility of managing node lifecycle and pod eviction into two distinct components.
Previously, the node lifecycle controller handled both marking nodes as unhealthy with NoExecute taints and evicting pods from them.
Now, a dedicated taint eviction controller manages the eviction process, while the node lifecycle controller focuses solely on applying taints.
This separation not only improves code organization but also makes it easier to improve the taint eviction controller or to build custom implementations of taint-based eviction.&lt;/p&gt;
&lt;h2 id=&#34;what-s-new&#34;&gt;What&#39;s new?&lt;/h2&gt;
&lt;p&gt;The feature gate &lt;code&gt;SeparateTaintEvictionController&lt;/code&gt; has been promoted to GA in this release.
Users can optionally disable taint-based eviction by setting &lt;code&gt;--controllers=-taint-eviction-controller&lt;/code&gt;
in kube-controller-manager.&lt;/p&gt;
&lt;h2 id=&#34;how-can-i-learn-more&#34;&gt;How can I learn more?&lt;/h2&gt;
&lt;p&gt;For more details, refer to the &lt;a href=&#34;http://kep.k8s.io/3902&#34;&gt;KEP&lt;/a&gt; and to the beta announcement article: &lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2023/12/19/kubernetes-1-29-taint-eviction-controller/&#34;&gt;Kubernetes 1.29: Decoupling taint manager from node lifecycle controller&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;how-to-get-involved&#34;&gt;How to get involved?&lt;/h2&gt;
&lt;p&gt;We offer a huge thank you to all the contributors who helped with design,
implementation, and review of this feature and helped move it from beta to stable:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Ed Bartosh (@bart0sh)&lt;/li&gt;
&lt;li&gt;Yuan Chen (@yuanchen8911)&lt;/li&gt;
&lt;li&gt;Aldo Culquicondor (@alculquicondor)&lt;/li&gt;
&lt;li&gt;Baofa Fan (@carlory)&lt;/li&gt;
&lt;li&gt;Sergey Kanzhelev (@SergeyKanzhelev)&lt;/li&gt;
&lt;li&gt;Tim Bannister (@lmktfy)&lt;/li&gt;
&lt;li&gt;Maciej Skoczeń (@macsko)&lt;/li&gt;
&lt;li&gt;Maciej Szulik (@soltysh)&lt;/li&gt;
&lt;li&gt;Wojciech Tyczynski (@wojtek-t)&lt;/li&gt;
&lt;/ul&gt;

      </description>
    </item>
    
    <item>
      <title>Kubernetes v1.34: Autoconfiguration for Node Cgroup Driver Goes GA</title>
      <link>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/09/12/kubernetes-v1-34-cri-cgroup-driver-lookup-now-ga/</link>
      <pubDate>Fri, 12 Sep 2025 10:30:00 -0800</pubDate>
      
      <guid>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/09/12/kubernetes-v1-34-cri-cgroup-driver-lookup-now-ga/</guid>
      <description>
        
        
        &lt;p&gt;Historically, configuring the correct cgroup driver has been a pain point for users running new
Kubernetes clusters. On Linux systems, there are two different cgroup drivers:
&lt;code&gt;cgroupfs&lt;/code&gt; and &lt;code&gt;systemd&lt;/code&gt;. In the past, both the &lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/reference/command-line-tools-reference/kubelet/&#34;&gt;kubelet&lt;/a&gt;
and the CRI implementation (such as CRI-O or containerd) needed to be configured to use
the same cgroup driver, or else the kubelet would misbehave without any explicit
error message. This was a source of headaches for many cluster admins. Now, we&#39;ve
(almost) arrived at the end of that headache.&lt;/p&gt;
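&lt;p&gt;To illustrate the old pain point: before automatic detection, an admin on a systemd-based host had to keep two separate configurations in agreement by hand, roughly like the fragments below (the paths and plugin names shown are the commonly documented ones for containerd 1.x; check your own distribution):&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# kubelet: /var/lib/kubelet/config.yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cgroupDriver: systemd

# containerd: /etc/containerd/config.toml
[plugins.&amp;#34;io.containerd.grpc.v1.cri&amp;#34;.containerd.runtimes.runc.options]
  SystemdCgroup = true
&lt;/code&gt;&lt;/pre&gt;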
&lt;h2 id=&#34;automated-cgroup-driver-detection&#34;&gt;Automated cgroup driver detection&lt;/h2&gt;
&lt;p&gt;In v1.28.0, the SIG Node community introduced the feature gate
&lt;code&gt;KubeletCgroupDriverFromCRI&lt;/code&gt;, which instructs the kubelet to ask the CRI
implementation which cgroup driver to use. You can read more &lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2024/08/21/cri-cgroup-driver-lookup-now-beta/&#34;&gt;here&lt;/a&gt;.
After many releases of waiting for each CRI implementation to have major versions released
and packaged in major operating systems, this feature has gone GA as of Kubernetes 1.34.0.&lt;/p&gt;
&lt;p&gt;Because the feature gate is enabled by default now that the feature has graduated, a cluster admin only needs to ensure their CRI implementation is new enough:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;containerd: Support was added in v2.0.0&lt;/li&gt;
&lt;li&gt;CRI-O: Support was added in v1.28.0&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;announcement-kubernetes-is-deprecating-containerd-v1-y-support&#34;&gt;Announcement: Kubernetes is deprecating containerd v1.y support&lt;/h2&gt;
&lt;p&gt;While CRI-O releases versions that match Kubernetes versions, and thus CRI-O
versions without this behavior are no longer supported, containerd maintains its
own release cycle. containerd support for this feature is only in v2.0 and
later, but Kubernetes 1.34 still supports containerd 1.7 and other LTS releases
of containerd.&lt;/p&gt;
&lt;p&gt;The Kubernetes SIG Node community has formally agreed upon a final support
timeline for containerd v1.y. Kubernetes v1.35 will be the last minor release
to support containerd v1.y, and support will be dropped in
v1.36.0. To assist administrators in managing this future transition,
a new detection mechanism is available. You are able to monitor
the &lt;code&gt;kubelet_cri_losing_support&lt;/code&gt; metric to determine if any nodes in your cluster
are using a containerd version that will soon be outdated. The presence of
this metric with a version label of &lt;code&gt;1.36.0&lt;/code&gt; will indicate that the node&#39;s containerd
runtime is not new enough for the upcoming requirements. Consequently, an
administrator will need to upgrade containerd to v2.0 or a later version before,
or at the same time as, upgrading the kubelet to v1.36.0.&lt;/p&gt;
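&lt;p&gt;As an illustrative sketch (the exact label set may differ; check your own kubelet&#39;s &lt;code&gt;/metrics&lt;/code&gt; endpoint), a node whose containerd will be too old for v1.36 would expose a series such as:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;kubelet_cri_losing_support{version=&amp;#34;1.36.0&amp;#34;} 1
&lt;/code&gt;&lt;/pre&gt;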

      </description>
    </item>
    
    <item>
      <title>Kubernetes v1.34: Mutable CSI Node Allocatable Graduates to Beta</title>
      <link>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/09/11/kubernetes-v1-34-mutable-csi-node-allocatable-count/</link>
      <pubDate>Thu, 11 Sep 2025 10:30:00 -0800</pubDate>
      
      <guid>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/09/11/kubernetes-v1-34-mutable-csi-node-allocatable-count/</guid>
      <description>
        
        
        &lt;p&gt;The &lt;a href=&#34;https://kep.k8s.io/4876&#34;&gt;functionality for CSI drivers to update information about attachable volume count on the nodes&lt;/a&gt;, first introduced as Alpha in Kubernetes v1.33, has graduated to &lt;strong&gt;Beta&lt;/strong&gt; in the Kubernetes v1.34 release! This marks a significant milestone in enhancing the accuracy of stateful pod scheduling by reducing failures due to outdated attachable volume capacity information.&lt;/p&gt;
&lt;h2 id=&#34;background&#34;&gt;Background&lt;/h2&gt;
&lt;p&gt;Traditionally, Kubernetes &lt;a href=&#34;https://kubernetes-csi.github.io/docs/introduction.html&#34;&gt;CSI drivers&lt;/a&gt; report a static maximum volume attachment limit when initializing. However, actual attachment capacities can change during a node&#39;s lifecycle for various reasons, such as:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Manual or external operations attaching/detaching volumes outside of Kubernetes control.&lt;/li&gt;
&lt;li&gt;Dynamically attached network interfaces or specialized hardware (GPUs, NICs, etc.) consuming available slots.&lt;/li&gt;
&lt;li&gt;Multi-driver scenarios, where one CSI driver’s operations affect available capacity reported by another.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Static reporting can cause Kubernetes to schedule pods onto nodes that appear to have capacity but don&#39;t, leading to pods stuck in a &lt;code&gt;ContainerCreating&lt;/code&gt; state.&lt;/p&gt;
&lt;h2 id=&#34;dynamically-adapting-csi-volume-limits&#34;&gt;Dynamically adapting CSI volume limits&lt;/h2&gt;
&lt;p&gt;With this new feature, Kubernetes enables CSI drivers to dynamically adjust and report node attachment capacities at runtime. This ensures that the scheduler, as well as other components relying on this information, have the most accurate, up-to-date view of node capacity.&lt;/p&gt;
&lt;h3 id=&#34;how-it-works&#34;&gt;How it works&lt;/h3&gt;
&lt;p&gt;Kubernetes supports two mechanisms for updating the reported node volume limits:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Periodic Updates:&lt;/strong&gt; CSI drivers specify an interval to periodically refresh the node&#39;s allocatable capacity.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reactive Updates:&lt;/strong&gt; An immediate update triggered when a volume attachment fails due to exhausted resources (&lt;code&gt;ResourceExhausted&lt;/code&gt; error).&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;enabling-the-feature&#34;&gt;Enabling the feature&lt;/h3&gt;
&lt;p&gt;To use this beta feature, the &lt;code&gt;MutableCSINodeAllocatableCount&lt;/code&gt; feature gate must be enabled in these components:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;kube-apiserver&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;kubelet&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;example-csi-driver-configuration&#34;&gt;Example CSI driver configuration&lt;/h3&gt;
&lt;p&gt;Below is an example of configuring a CSI driver to enable periodic updates every 60 seconds:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  name: example.csi.k8s.io
spec:
  nodeAllocatableUpdatePeriodSeconds: 60
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;This configuration directs the kubelet to call the CSI driver&#39;s &lt;code&gt;NodeGetInfo&lt;/code&gt; method every 60 seconds, updating the node&#39;s allocatable volume count. Kubernetes enforces a minimum update interval of 10 seconds to balance accuracy and resource usage.&lt;/p&gt;
&lt;h3 id=&#34;immediate-updates-on-attachment-failures&#34;&gt;Immediate updates on attachment failures&lt;/h3&gt;
&lt;p&gt;When a volume attachment operation fails due to a &lt;code&gt;ResourceExhausted&lt;/code&gt; error (gRPC code &lt;code&gt;8&lt;/code&gt;), Kubernetes immediately updates the allocatable count instead of waiting for the next periodic update. The kubelet then marks the affected pods as Failed, enabling their controllers to recreate them. This prevents pods from getting permanently stuck in the &lt;code&gt;ContainerCreating&lt;/code&gt; state.&lt;/p&gt;
&lt;h2 id=&#34;getting-started&#34;&gt;Getting started&lt;/h2&gt;
&lt;p&gt;To enable this feature in your Kubernetes v1.34 cluster:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Enable the feature gate &lt;code&gt;MutableCSINodeAllocatableCount&lt;/code&gt; on the &lt;code&gt;kube-apiserver&lt;/code&gt; and &lt;code&gt;kubelet&lt;/code&gt; components.&lt;/li&gt;
&lt;li&gt;Update your CSI driver configuration by setting &lt;code&gt;nodeAllocatableUpdatePeriodSeconds&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Monitor and observe improvements in scheduling accuracy and pod placement reliability.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id=&#34;next-steps&#34;&gt;Next steps&lt;/h2&gt;
&lt;p&gt;This feature is currently in beta and the Kubernetes community welcomes your feedback. Test it, share your experiences, and help guide its evolution to GA stability.&lt;/p&gt;
&lt;p&gt;Join discussions in the &lt;a href=&#34;https://github.com/kubernetes/community/tree/master/sig-storage&#34;&gt;Kubernetes Storage Special Interest Group (SIG-Storage)&lt;/a&gt; to shape the future of Kubernetes storage capabilities.&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>Kubernetes v1.34: Use An Init Container To Define App Environment Variables</title>
      <link>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/09/10/kubernetes-v1-34-env-files/</link>
      <pubDate>Wed, 10 Sep 2025 10:30:00 -0800</pubDate>
      
      <guid>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/09/10/kubernetes-v1-34-env-files/</guid>
      <description>
        
        
&lt;p&gt;Kubernetes typically uses ConfigMaps and Secrets to set environment variables,
which introduces additional API calls and complexity.
For example, you need to manage the Pods of your workloads
and their configurations separately, while ensuring orderly
updates for both the configurations and the workload Pods.&lt;/p&gt;
&lt;p&gt;Alternatively, you might be using a vendor-supplied container
that requires environment variables (such as a license key or a one-time token),
but you don’t want to hard-code them or mount volumes just to get the job done.&lt;/p&gt;
&lt;p&gt;If that&#39;s the situation you are in, you now have a new (alpha) way to
achieve that. Provided you have the &lt;code&gt;EnvFiles&lt;/code&gt;
&lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/reference/command-line-tools-reference/feature-gates/&#34;&gt;feature gate&lt;/a&gt;
enabled across your cluster, you can tell the kubelet to load a container&#39;s
environment variables from a volume (the volume must be part of the Pod that
the container belongs to).
This feature allows you to load environment variables directly from a file in an emptyDir volume
without actually mounting that file into the container.
It’s a simple yet elegant solution to some surprisingly common problems.&lt;/p&gt;
&lt;h2 id=&#34;what-s-this-all-about&#34;&gt;What’s this all about?&lt;/h2&gt;
&lt;p&gt;At its core, this feature allows you to point your container to a file,
one generated by an &lt;code&gt;initContainer&lt;/code&gt;,
and have Kubernetes parse that file to set your environment variables.
The file lives in an &lt;code&gt;emptyDir&lt;/code&gt; volume (a temporary storage space that lasts as long as the pod does).
Your main container doesn’t need to mount the volume.
The kubelet will read the file and inject these variables when the container starts.&lt;/p&gt;
&lt;h2 id=&#34;how-it-works&#34;&gt;How It Works&lt;/h2&gt;
&lt;p&gt;Here&#39;s a simple example:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;v1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Pod&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;spec&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;initContainers&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;generate-config&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;image&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;busybox&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;command&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;[&lt;span style=&#34;color:#b44&#34;&gt;&amp;#39;sh&amp;#39;&lt;/span&gt;,&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#39;-c&amp;#39;&lt;/span&gt;,&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#39;echo &amp;#34;CONFIG_VAR=HELLO&amp;#34; &amp;gt; /config/config.env&amp;#39;&lt;/span&gt;]&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;volumeMounts&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;config-volume&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;mountPath&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;/config&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;containers&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;app-container&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;image&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;gcr.io/distroless/static&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;env&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;CONFIG_VAR&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;valueFrom&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;fileKeyRef&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;          &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;path&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;config.env&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;          &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;volumeName&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;config-volume&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;          &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;key&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;CONFIG_VAR&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;volumes&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;config-volume&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;emptyDir&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;{}&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Using this approach is a breeze.
You define your environment variables in the pod spec using the &lt;code&gt;fileKeyRef&lt;/code&gt; field,
which tells Kubernetes where to find the file and which key to pull.
The file itself follows standard .env syntax (think KEY=VALUE),
and (at least in this alpha stage) you must ensure that it is written into
an &lt;code&gt;emptyDir&lt;/code&gt; volume; other volume types aren&#39;t supported for this feature.
At least one init container must mount that &lt;code&gt;emptyDir&lt;/code&gt; volume (to write the file),
but the main container doesn’t need to—it just gets the variables handed to it at startup.&lt;/p&gt;
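&lt;p&gt;For illustration, the &lt;code&gt;config.env&lt;/code&gt; file that the init container writes for the manifest above might contain plain KEY=VALUE pairs like this (the key matches the &lt;code&gt;CONFIG_VAR&lt;/code&gt; referenced in the pod spec; the value is just an example):&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;CONFIG_VAR=some-value
&lt;/code&gt;&lt;/pre&gt;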
&lt;h2 id=&#34;a-word-on-security&#34;&gt;A word on security&lt;/h2&gt;
&lt;p&gt;While this feature supports handling sensitive data such as keys or tokens,
note that its implementation relies on &lt;code&gt;emptyDir&lt;/code&gt; volumes mounted into the Pod.
Operators with node filesystem access could therefore
easily retrieve this sensitive data through pod directory paths.&lt;/p&gt;
&lt;p&gt;If storing sensitive data like keys or tokens using this feature,
ensure your cluster security policies effectively protect nodes
against unauthorized access to prevent exposure of confidential information.&lt;/p&gt;
&lt;h2 id=&#34;summary&#34;&gt;Summary&lt;/h2&gt;
&lt;p&gt;This feature eliminates a number of complex workarounds used today, simplifying
application authoring and opening the door to more use cases. Kubernetes stays flexible and
open to feedback: tell us how you use this feature or what is missing.&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>Kubernetes v1.34: Snapshottable API server cache</title>
      <link>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/09/09/kubernetes-v1-34-snapshottable-api-server-cache/</link>
      <pubDate>Tue, 09 Sep 2025 10:30:00 -0800</pubDate>
      
      <guid>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/09/09/kubernetes-v1-34-snapshottable-api-server-cache/</guid>
      <description>
        
        
        &lt;p&gt;For years, the Kubernetes community has been on a mission to improve the stability and performance predictability of the API server.
A major focus of this effort has been taming &lt;strong&gt;list&lt;/strong&gt; requests, which have historically been a primary source of high memory usage and heavy load on the &lt;code&gt;etcd&lt;/code&gt; datastore.
With each release, we&#39;ve chipped away at the problem, and today, we&#39;re thrilled to announce the final major piece of this puzzle.&lt;/p&gt;
&lt;p&gt;The &lt;em&gt;snapshottable API server cache&lt;/em&gt; feature has graduated to &lt;strong&gt;Beta&lt;/strong&gt; in Kubernetes v1.34,
the culmination of a multi-release effort to allow virtually all read requests to be served directly from the API server&#39;s cache.&lt;/p&gt;
&lt;h2 id=&#34;evolving-the-cache-for-performance-and-stability&#34;&gt;Evolving the cache for performance and stability&lt;/h2&gt;
&lt;p&gt;The path to the current state involved several key enhancements over recent releases that paved the way for today&#39;s announcement.&lt;/p&gt;
&lt;h3 id=&#34;consistent-reads-from-cache-beta-in-v1-31&#34;&gt;Consistent reads from cache (Beta in v1.31)&lt;/h3&gt;
&lt;p&gt;While the API server has long used a cache for performance, a key milestone was guaranteeing &lt;em&gt;consistent reads of the latest data&lt;/em&gt; from it. This v1.31 enhancement allowed the watch cache to be used for strongly-consistent read requests for the first time, a huge win as it enabled &lt;em&gt;filtered collections&lt;/em&gt; (e.g. &amp;quot;a list of pods bound to this node&amp;quot;) to be safely served from the cache instead of etcd, dramatically reducing its load for common workloads.&lt;/p&gt;
&lt;h3 id=&#34;taming-large-responses-with-streaming-beta-in-v1-33&#34;&gt;Taming large responses with streaming (Beta in v1.33)&lt;/h3&gt;
&lt;p&gt;Another key improvement was tackling the problem of memory spikes when transmitting large responses. The streaming encoder, introduced in v1.33, allowed the API server to send list items one by one, rather than buffering the entire multi-gigabyte response in memory. This made the memory cost of sending a response predictable and minimal, regardless of its size.&lt;/p&gt;
&lt;h3 id=&#34;the-missing-piece&#34;&gt;The missing piece&lt;/h3&gt;
&lt;p&gt;Despite these huge improvements, a critical gap remained. Any request for a historical &lt;code&gt;LIST&lt;/code&gt;—most commonly used for paginating through large result sets—still had to bypass the cache and query &lt;code&gt;etcd&lt;/code&gt; directly. This meant that the cost of &lt;em&gt;retrieving&lt;/em&gt; the data was still unpredictable and could put significant memory pressure on the API server.&lt;/p&gt;
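&lt;p&gt;For example, a client paginating through a large collection issues a sequence of requests like the ones below. The &lt;code&gt;continue&lt;/code&gt; token pins subsequent pages to the &lt;code&gt;resourceVersion&lt;/code&gt; of the first page, which is exactly the kind of historical read that previously had to go to &lt;code&gt;etcd&lt;/code&gt; (the token value shown is a placeholder):&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;GET /api/v1/pods?limit=500
GET /api/v1/pods?limit=500&amp;amp;continue=&amp;lt;token-from-previous-response&amp;gt;
&lt;/code&gt;&lt;/pre&gt;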
&lt;h2 id=&#34;kubernetes-1-34-snapshots-complete-the-picture&#34;&gt;Kubernetes 1.34: snapshots complete the picture&lt;/h2&gt;
&lt;p&gt;The &lt;em&gt;snapshottable API server cache&lt;/em&gt; solves this final piece of the puzzle.
This feature enhances the watch cache, enabling it to generate efficient, point-in-time snapshots of its state.&lt;/p&gt;
&lt;p&gt;Here’s how it works: for each update, the cache creates a lightweight snapshot.
These snapshots are &amp;quot;lazy copies,&amp;quot; meaning they don&#39;t duplicate objects but simply store pointers, making them incredibly memory-efficient.&lt;/p&gt;
&lt;p&gt;When a &lt;strong&gt;list&lt;/strong&gt; request for a historical &lt;code&gt;resourceVersion&lt;/code&gt; arrives, the API server now finds the corresponding snapshot and serves the response directly from its memory.
This closes the final major gap, allowing paginated requests to be served entirely from the cache.&lt;/p&gt;
&lt;h2 id=&#34;a-new-era-of-api-server-performance&#34;&gt;A new era of API Server performance 🚀&lt;/h2&gt;
&lt;p&gt;With this final piece in place, the synergy of these three features ushers in a new era of API server predictability and performance:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Get Data from Cache&lt;/strong&gt;: &lt;em&gt;Consistent reads&lt;/em&gt; and &lt;em&gt;snapshottable cache&lt;/em&gt; work together to ensure nearly all read requests—whether for the latest data or a historical snapshot—are served from the API server&#39;s memory.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Send data via stream&lt;/strong&gt;: &lt;em&gt;Streaming list responses&lt;/em&gt; ensure that sending this data to the client has a minimal and constant memory footprint.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The result is a system where the resource cost of read operations is almost fully predictable and much more resilient to spikes in request load.
This means dramatically reduced memory pressure, a lighter load on &lt;code&gt;etcd&lt;/code&gt;, and a more stable, scalable, and reliable control plane for all Kubernetes clusters.&lt;/p&gt;
&lt;h2 id=&#34;how-to-get-started&#34;&gt;How to get started&lt;/h2&gt;
&lt;p&gt;With its graduation to Beta, the &lt;code&gt;SnapshottableCache&lt;/code&gt; feature gate is &lt;strong&gt;enabled by default&lt;/strong&gt; in Kubernetes v1.34. There are no actions required to start benefiting from these performance and stability improvements.&lt;/p&gt;
&lt;h2 id=&#34;acknowledgements&#34;&gt;Acknowledgements&lt;/h2&gt;
&lt;p&gt;Special thanks for designing, implementing, and reviewing these critical features go to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Ahmad Zolfaghari&lt;/strong&gt; (&lt;a href=&#34;https://github.com/ah8ad3&#34;&gt;@ah8ad3&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Ben Luddy&lt;/strong&gt; (&lt;a href=&#34;https://github.com/benluddy&#34;&gt;@benluddy&lt;/a&gt;) – &lt;em&gt;Red Hat&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Chen Chen&lt;/strong&gt; (&lt;a href=&#34;https://github.com/z1cheng&#34;&gt;@z1cheng&lt;/a&gt;) – &lt;em&gt;Microsoft&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Davanum Srinivas&lt;/strong&gt; (&lt;a href=&#34;https://github.com/dims&#34;&gt;@dims&lt;/a&gt;) – &lt;em&gt;Nvidia&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;David Eads&lt;/strong&gt; (&lt;a href=&#34;https://github.com/deads2k&#34;&gt;@deads2k&lt;/a&gt;) – &lt;em&gt;Red Hat&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Han Kang&lt;/strong&gt; (&lt;a href=&#34;https://github.com/logicalhan&#34;&gt;@logicalhan&lt;/a&gt;) – &lt;em&gt;CoreWeave&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;haosdent&lt;/strong&gt; (&lt;a href=&#34;https://github.com/haosdent&#34;&gt;@haosdent&lt;/a&gt;) – &lt;em&gt;Shopee&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Joe Betz&lt;/strong&gt; (&lt;a href=&#34;https://github.com/jpbetz&#34;&gt;@jpbetz&lt;/a&gt;) – &lt;em&gt;Google&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Jordan Liggitt&lt;/strong&gt; (&lt;a href=&#34;https://github.com/liggitt&#34;&gt;@liggitt&lt;/a&gt;) – &lt;em&gt;Google&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Łukasz Szaszkiewicz&lt;/strong&gt; (&lt;a href=&#34;https://github.com/p0lyn0mial&#34;&gt;@p0lyn0mial&lt;/a&gt;) – &lt;em&gt;Red Hat&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Maciej Borsz&lt;/strong&gt; (&lt;a href=&#34;https://github.com/mborsz&#34;&gt;@mborsz&lt;/a&gt;) – &lt;em&gt;Google&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Madhav Jivrajani&lt;/strong&gt; (&lt;a href=&#34;https://github.com/MadhavJivrajani&#34;&gt;@MadhavJivrajani&lt;/a&gt;) – &lt;em&gt;UIUC&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Marek Siarkowicz&lt;/strong&gt; (&lt;a href=&#34;https://github.com/serathius&#34;&gt;@serathius&lt;/a&gt;) – &lt;em&gt;Google&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;NKeert&lt;/strong&gt; (&lt;a href=&#34;https://github.com/NKeert&#34;&gt;@NKeert&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Tim Bannister&lt;/strong&gt; (&lt;a href=&#34;https://github.com/lmktfy&#34;&gt;@lmktfy&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Wei Fu&lt;/strong&gt; (&lt;a href=&#34;https://github.com/fuweid&#34;&gt;@fuweid&lt;/a&gt;) – &lt;em&gt;Microsoft&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Wojtek Tyczyński&lt;/strong&gt; (&lt;a href=&#34;https://github.com/wojtek-t&#34;&gt;@wojtek-t&lt;/a&gt;) – &lt;em&gt;Google&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;...and many others in SIG API Machinery. This milestone is a testament to the community&#39;s dedication to building a more scalable and robust Kubernetes.&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>Kubernetes v1.34: VolumeAttributesClass for Volume Modification GA</title>
      <link>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/09/08/kubernetes-v1-34-volume-attributes-class/</link>
      <pubDate>Mon, 08 Sep 2025 10:30:00 -0800</pubDate>
      
      <guid>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/09/08/kubernetes-v1-34-volume-attributes-class/</guid>
      <description>
        
        
        &lt;p&gt;The VolumeAttributesClass API, which empowers users to dynamically modify volume attributes, has officially graduated to General Availability (GA) in Kubernetes v1.34. This marks a significant milestone, providing a robust and stable way to tune your persistent storage directly within Kubernetes.&lt;/p&gt;
&lt;h2 id=&#34;what-is-volumeattributesclass&#34;&gt;What is VolumeAttributesClass?&lt;/h2&gt;
&lt;p&gt;At its core, VolumeAttributesClass is a cluster-scoped resource that defines a set of mutable parameters for a volume. Think of it as a &amp;quot;profile&amp;quot; for your storage, allowing cluster administrators to expose different quality-of-service (QoS) levels or performance tiers.&lt;/p&gt;
&lt;p&gt;Users can then specify a &lt;code&gt;volumeAttributesClassName&lt;/code&gt; in their PersistentVolumeClaim (PVC) to indicate which class of attributes they desire. The magic happens through the Container Storage Interface (CSI): when a PVC referencing a VolumeAttributesClass is updated, the associated CSI driver interacts with the underlying storage system to apply the specified changes to the volume.&lt;/p&gt;
&lt;p&gt;This means you can now:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Dynamically scale performance: Increase IOPS or throughput for a busy database, or reduce it for a less critical application.&lt;/li&gt;
&lt;li&gt;Optimize costs: Adjust attributes on the fly to match your current needs, avoiding over-provisioning.&lt;/li&gt;
&lt;li&gt;Simplify operations: Manage volume modifications directly within the Kubernetes API, rather than relying on external tools or manual processes.&lt;/li&gt;
&lt;/ul&gt;
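&lt;p&gt;As a sketch of this pattern, an administrator might define a VolumeAttributesClass and a user could reference it from a PVC like this (the class name and parameters are illustrative; the parameter keys a driver accepts are driver-specific):&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;apiVersion: storage.k8s.io/v1
kind: VolumeAttributesClass
metadata:
  name: gold
driverName: ebs.csi.aws.com
parameters:
  iops: &#34;4000&#34;
  throughput: &#34;250&#34;
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: database-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
  volumeAttributesClassName: gold
&lt;/code&gt;&lt;/pre&gt;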
&lt;h2 id=&#34;what-is-new-from-beta-to-ga&#34;&gt;What is new from Beta to GA&lt;/h2&gt;
&lt;p&gt;There are two major enhancements from beta.&lt;/p&gt;
&lt;h3 id=&#34;cancellation-support-when-errors-occur&#34;&gt;Cancellation support when errors occur&lt;/h3&gt;
&lt;p&gt;To improve resilience and user experience, the GA release introduces explicit cancel support when a requested volume modification encounters an error. If the underlying storage system or CSI driver indicates that the requested changes cannot be applied (e.g., due to invalid arguments), users can cancel the operation and revert the volume to its previous stable configuration, preventing the volume from being left in an inconsistent state.&lt;/p&gt;
&lt;h3 id=&#34;quota-support-based-on-scope&#34;&gt;Quota support based on scope&lt;/h3&gt;
&lt;p&gt;While VolumeAttributesClass doesn&#39;t add a new quota type, the Kubernetes control plane can be configured to enforce quotas on PersistentVolumeClaims that reference a specific VolumeAttributesClass.&lt;/p&gt;
&lt;p&gt;This is achieved by using the &lt;code&gt;scopeSelector&lt;/code&gt; field in a ResourceQuota to target PVCs that have &lt;code&gt;.spec.volumeAttributesClassName&lt;/code&gt; set to a particular VolumeAttributesClass name. See the &lt;a href=&#34;https://kubernetes.io/docs/concepts/policy/resource-quotas/#resource-quota-per-volumeattributesclass&#34;&gt;resource quota documentation&lt;/a&gt; for more details.&lt;/p&gt;
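&lt;p&gt;As a sketch (the class name and limits are illustrative), a ResourceQuota that constrains PVCs referencing a VolumeAttributesClass named &lt;code&gt;gold&lt;/code&gt; could look like this:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;apiVersion: v1
kind: ResourceQuota
metadata:
  name: gold-class-quota
spec:
  hard:
    persistentvolumeclaims: &#34;10&#34;
    requests.storage: &#34;500Gi&#34;
  scopeSelector:
    matchExpressions:
      - operator: In
        scopeName: VolumeAttributesClass
        values:
          - gold
&lt;/code&gt;&lt;/pre&gt;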
&lt;h2 id=&#34;drivers-support-volumeattributesclass&#34;&gt;Drivers that support VolumeAttributesClass&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Amazon EBS CSI Driver: The AWS EBS CSI driver has robust support for VolumeAttributesClass and allows you to modify parameters like volume type (e.g., gp2 to gp3, io1 to io2), IOPS, and throughput of EBS volumes dynamically.&lt;/li&gt;
&lt;li&gt;Google Compute Engine (GCE) Persistent Disk CSI Driver (pd.csi.storage.gke.io): This driver also supports dynamic modification of persistent disk attributes, including IOPS and throughput, via VolumeAttributesClass.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;contact&#34;&gt;Contact&lt;/h2&gt;
&lt;p&gt;For any inquiries or specific questions related to VolumeAttributesClass, please reach out to the &lt;a href=&#34;https://github.com/kubernetes/community/tree/master/sig-storage&#34;&gt;SIG Storage community&lt;/a&gt;.&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>Kubernetes v1.34: Pod Replacement Policy for Jobs Goes GA</title>
      <link>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/09/05/kubernetes-v1-34-pod-replacement-policy-for-jobs-goes-ga/</link>
      <pubDate>Fri, 05 Sep 2025 10:30:00 -0800</pubDate>
      
      <guid>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/09/05/kubernetes-v1-34-pod-replacement-policy-for-jobs-goes-ga/</guid>
      <description>
        
        
        &lt;p&gt;In Kubernetes v1.34, the &lt;em&gt;Pod replacement policy&lt;/em&gt; feature has reached general availability (GA).
This blog post describes the Pod replacement policy feature and how to use it in your Jobs.&lt;/p&gt;
&lt;h2 id=&#34;about-pod-replacement-policy&#34;&gt;About Pod Replacement Policy&lt;/h2&gt;
&lt;p&gt;By default, the Job controller immediately recreates Pods as soon as they fail or begin terminating (when they have a deletion timestamp).&lt;/p&gt;
&lt;p&gt;As a result, while some Pods are terminating, the total number of running Pods for a Job can temporarily exceed the specified parallelism.
For Indexed Jobs, this can even mean multiple Pods running for the same index at the same time.&lt;/p&gt;
&lt;p&gt;This behavior works fine for many workloads, but it can cause problems in certain cases.&lt;/p&gt;
&lt;p&gt;For example, popular machine learning frameworks like TensorFlow and
&lt;a href=&#34;https://jax.readthedocs.io/en/latest/&#34;&gt;JAX&lt;/a&gt; expect exactly one Pod per worker index.
If two Pods run at the same time, you might encounter errors such as:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;/job:worker/task:4: Duplicate task registration with task_name=/job:worker/replica:0/task:4
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Additionally, starting replacement Pods before the old ones fully terminate can lead to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Scheduling delays by kube-scheduler as the nodes remain occupied.&lt;/li&gt;
&lt;li&gt;Unnecessary cluster scale-ups to accommodate the replacement Pods.&lt;/li&gt;
&lt;li&gt;Temporary bypassing of quota checks by workload orchestrators like &lt;a href=&#34;https://kueue.sigs.k8s.io/&#34;&gt;Kueue&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;With Pod replacement policy, Kubernetes gives you control over when the control plane
replaces terminating Pods, helping you avoid these issues.&lt;/p&gt;
&lt;h2 id=&#34;how-pod-replacement-policy-works&#34;&gt;How Pod Replacement Policy works&lt;/h2&gt;
&lt;p&gt;This enhancement means that Jobs in Kubernetes have an optional field &lt;code&gt;.spec.podReplacementPolicy&lt;/code&gt;.&lt;br&gt;
You can choose one of two policies:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;TerminatingOrFailed&lt;/code&gt; (default): Replaces Pods as soon as they start terminating.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Failed&lt;/code&gt;: Replaces Pods only after they fully terminate and transition to the &lt;code&gt;Failed&lt;/code&gt; phase.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Setting the policy to &lt;code&gt;Failed&lt;/code&gt; ensures that a new Pod is only created after the previous one has completely terminated.&lt;/p&gt;
&lt;p&gt;For Jobs with a Pod Failure Policy, the default &lt;code&gt;podReplacementPolicy&lt;/code&gt; is &lt;code&gt;Failed&lt;/code&gt;, and no other value is allowed.
See &lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/concepts/workloads/controllers/job/#pod-failure-policy&#34;&gt;Pod Failure Policy&lt;/a&gt; to learn more about Pod Failure Policies for Jobs.&lt;/p&gt;
&lt;p&gt;You can check how many Pods are currently terminating by inspecting the Job’s &lt;code&gt;.status.terminating&lt;/code&gt; field:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-shell&#34; data-lang=&#34;shell&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;kubectl get job myjob -o&lt;span style=&#34;color:#666&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#b8860b&#34;&gt;jsonpath&lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#39;{.status.terminating}&amp;#39;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id=&#34;example&#34;&gt;Example&lt;/h2&gt;
&lt;p&gt;Here’s a Job example that executes a task two times (&lt;code&gt;spec.completions: 2&lt;/code&gt;) in parallel (&lt;code&gt;spec.parallelism: 2&lt;/code&gt;) and
replaces Pods only after they fully terminate (&lt;code&gt;spec.podReplacementPolicy: Failed&lt;/code&gt;):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;batch/v1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Job&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;metadata&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;example-job&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;spec&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;completions&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;2&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;parallelism&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;2&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;podReplacementPolicy&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Failed&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;template&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;spec&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;restartPolicy&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Never&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;containers&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;worker&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;image&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;your-image&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;If a Pod receives a SIGTERM signal (due to deletion, eviction, preemption, and so on), it begins terminating.
If the container handles termination gracefully, cleanup may take some time.&lt;/p&gt;
&lt;p&gt;When the Job starts, we will see two Pods running:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-shell&#34; data-lang=&#34;shell&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;kubectl get pods
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;NAME                READY   STATUS    RESTARTS   AGE
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;example-job-qr8kf   1/1     Running   &lt;span style=&#34;color:#666&#34;&gt;0&lt;/span&gt;          2s
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;example-job-stvb4   1/1     Running   &lt;span style=&#34;color:#666&#34;&gt;0&lt;/span&gt;          2s
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Let&#39;s delete one of the Pods (&lt;code&gt;example-job-qr8kf&lt;/code&gt;).&lt;/p&gt;
&lt;p&gt;With the &lt;code&gt;TerminatingOrFailed&lt;/code&gt; policy, as soon as one Pod (&lt;code&gt;example-job-qr8kf&lt;/code&gt;) starts terminating, the Job controller immediately creates a new Pod (&lt;code&gt;example-job-b59zk&lt;/code&gt;) to replace it.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-shell&#34; data-lang=&#34;shell&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;kubectl get pods
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;NAME                READY   STATUS        RESTARTS   AGE
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;example-job-b59zk   1/1     Running       &lt;span style=&#34;color:#666&#34;&gt;0&lt;/span&gt;          1s
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;example-job-qr8kf   1/1     Terminating   &lt;span style=&#34;color:#666&#34;&gt;0&lt;/span&gt;          17s
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;example-job-stvb4   1/1     Running       &lt;span style=&#34;color:#666&#34;&gt;0&lt;/span&gt;          17s
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;With the &lt;code&gt;Failed&lt;/code&gt; policy, the new Pod (&lt;code&gt;example-job-b59zk&lt;/code&gt;) is not created while the old Pod (&lt;code&gt;example-job-qr8kf&lt;/code&gt;) is terminating.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-shell&#34; data-lang=&#34;shell&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;kubectl get pods
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;NAME                READY   STATUS        RESTARTS   AGE
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;example-job-qr8kf   1/1     Terminating   &lt;span style=&#34;color:#666&#34;&gt;0&lt;/span&gt;          17s
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;example-job-stvb4   1/1     Running       &lt;span style=&#34;color:#666&#34;&gt;0&lt;/span&gt;          17s
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;When the terminating Pod has fully transitioned to the &lt;code&gt;Failed&lt;/code&gt; phase, a new Pod is created:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-shell&#34; data-lang=&#34;shell&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;kubectl get pods
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;NAME                READY   STATUS        RESTARTS   AGE
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;example-job-b59zk   1/1     Running       &lt;span style=&#34;color:#666&#34;&gt;0&lt;/span&gt;          1s
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;example-job-stvb4   1/1     Running       &lt;span style=&#34;color:#666&#34;&gt;0&lt;/span&gt;          25s
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id=&#34;how-can-you-learn-more&#34;&gt;How can you learn more?&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Read the user-facing documentation for &lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/concepts/workloads/controllers/job/#pod-replacement-policy&#34;&gt;Pod Replacement Policy&lt;/a&gt;,
&lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/concepts/workloads/controllers/job/#backoff-limit-per-index&#34;&gt;Backoff Limit per Index&lt;/a&gt;, and
&lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/concepts/workloads/controllers/job/#pod-failure-policy&#34;&gt;Pod Failure Policy&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Read the KEPs for &lt;a href=&#34;https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/3939-allow-replacement-when-fully-terminated&#34;&gt;Pod Replacement Policy&lt;/a&gt;,
&lt;a href=&#34;https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/3850-backoff-limits-per-index-for-indexed-jobs&#34;&gt;Backoff Limit per Index&lt;/a&gt;, and
&lt;a href=&#34;https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/3329-retriable-and-non-retriable-failures&#34;&gt;Pod Failure Policy&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;acknowledgments&#34;&gt;Acknowledgments&lt;/h2&gt;
&lt;p&gt;As with any Kubernetes feature, multiple people contributed to getting this
done, from testing and filing bugs to reviewing code.&lt;/p&gt;
&lt;p&gt;As this feature graduates to stable after two years, we would like to thank the following people:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/kannon92&#34;&gt;Kevin Hannon&lt;/a&gt; - for writing the KEP and the initial implementation.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/mimowo&#34;&gt;Michał Woźniak&lt;/a&gt; - for guidance, mentorship, and reviews.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/alculquicondor&#34;&gt;Aldo Culquicondor&lt;/a&gt; - for guidance, mentorship, and reviews.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/soltysh&#34;&gt;Maciej Szulik&lt;/a&gt; - for guidance, mentorship, and reviews.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/dejanzele&#34;&gt;Dejan Zele Pejchev&lt;/a&gt; - for taking over the feature and promoting it from Alpha through Beta to GA.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;get-involved&#34;&gt;Get involved&lt;/h2&gt;
&lt;p&gt;This work was sponsored by the Kubernetes
&lt;a href=&#34;https://github.com/kubernetes/community/tree/master/wg-batch&#34;&gt;batch working group&lt;/a&gt;
in close collaboration with the
&lt;a href=&#34;https://github.com/kubernetes/community/tree/master/sig-apps&#34;&gt;SIG Apps&lt;/a&gt; community.&lt;/p&gt;
&lt;p&gt;If you are interested in working on new features in this space, we recommend
subscribing to our &lt;a href=&#34;https://kubernetes.slack.com/messages/wg-batch&#34;&gt;Slack&lt;/a&gt;
channel and attending the regular community meetings.&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>Kubernetes v1.34: PSI Metrics for Kubernetes Graduates to Beta</title>
      <link>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/09/04/kubernetes-v1-34-introducing-psi-metrics-beta/</link>
      <pubDate>Thu, 04 Sep 2025 10:30:00 -0800</pubDate>
      
      <guid>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/09/04/kubernetes-v1-34-introducing-psi-metrics-beta/</guid>
      <description>
        
        
        &lt;p&gt;As Kubernetes clusters grow in size and complexity, understanding the health and performance of individual nodes becomes increasingly critical. We are excited to announce that as of Kubernetes v1.34, &lt;strong&gt;Pressure Stall Information (PSI) Metrics&lt;/strong&gt; has graduated to Beta.&lt;/p&gt;
&lt;h2 id=&#34;what-is-pressure-stall-information-psi&#34;&gt;What is Pressure Stall Information (PSI)?&lt;/h2&gt;
&lt;p&gt;&lt;a href=&#34;https://docs.kernel.org/accounting/psi.html&#34;&gt;Pressure Stall Information (PSI)&lt;/a&gt; is a feature of the Linux kernel (version 4.20 and later)
that provides a canonical way to quantify pressure on infrastructure resources,
in terms of whether demand for a resource exceeds current supply.
It moves beyond simple resource utilization metrics and instead
measures the amount of time that tasks are stalled due to resource contention.
This is a powerful way to identify and diagnose resource bottlenecks that can impact application performance.&lt;/p&gt;
&lt;p&gt;PSI exposes metrics for CPU, memory, and I/O, categorized as either &lt;code&gt;some&lt;/code&gt; or &lt;code&gt;full&lt;/code&gt; pressure:&lt;/p&gt;
&lt;dl&gt;
&lt;dt&gt;&lt;code&gt;some&lt;/code&gt;&lt;/dt&gt;
&lt;dd&gt;The percentage of time that &lt;strong&gt;at least one&lt;/strong&gt; task is stalled on a resource. This indicates some level of resource contention.&lt;/dd&gt;
&lt;dt&gt;&lt;code&gt;full&lt;/code&gt;&lt;/dt&gt;
&lt;dd&gt;The percentage of time that &lt;strong&gt;all&lt;/strong&gt; non-idle tasks are stalled on a resource simultaneously. This indicates a more severe resource bottleneck.&lt;/dd&gt;
&lt;/dl&gt;


&lt;figure&gt;
    &lt;img src=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/09/04/kubernetes-v1-34-introducing-psi-metrics-beta/psi-metrics-some-vs-full.svg&#34;
         alt=&#34;Diagram illustrating the difference between &amp;#39;some&amp;#39; and &amp;#39;full&amp;#39; PSI pressure.&#34;/&gt; &lt;figcaption&gt;
            &lt;h4&gt;PSI: &amp;#39;Some&amp;#39; vs. &amp;#39;Full&amp;#39; Pressure&lt;/h4&gt;
        &lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;These metrics are aggregated over 10-second, 1-minute, and 5-minute rolling windows, providing a comprehensive view of resource pressure over time.&lt;/p&gt;
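&lt;p&gt;The raw data behind these rolling windows comes from files such as &lt;code&gt;/proc/pressure/memory&lt;/code&gt;. As a rough sketch of the format (field layout per the kernel documentation; the values below are made up for illustration), each line can be parsed like this:&lt;/p&gt;

```python
# Hypothetical sketch: parsing one line from a Linux PSI interface file
# such as /proc/pressure/memory. The sample values are made up.

def parse_psi_line(line):
    """Parse a PSI line like
    'some avg10=0.12 avg60=0.08 avg300=0.02 total=1234567'
    into (kind, {metric: value})."""
    kind, *fields = line.split()
    values = {}
    for field in fields:
        key, _, raw = field.partition("=")
        # avg* fields are percentages over the window;
        # 'total' is cumulative stall time in microseconds.
        values[key] = float(raw) if key.startswith("avg") else int(raw)
    return kind, values

kind, values = parse_psi_line("some avg10=0.12 avg60=0.08 avg300=0.02 total=1234567")
```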
&lt;h2 id=&#34;psi-metrics-in-kubernetes&#34;&gt;PSI metrics in Kubernetes&lt;/h2&gt;
&lt;p&gt;With the &lt;code&gt;KubeletPSI&lt;/code&gt; feature gate enabled, the kubelet can now collect PSI metrics from the Linux kernel and expose them through two channels: the &lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/reference/instrumentation/node-metrics/#summary-api-source&#34;&gt;Summary API&lt;/a&gt; and the &lt;code&gt;/metrics/cadvisor&lt;/code&gt; Prometheus endpoint. This allows you to monitor and alert on resource pressure at the node, pod, and container level.&lt;/p&gt;
&lt;p&gt;The following new metrics are available in Prometheus exposition format via &lt;code&gt;/metrics/cadvisor&lt;/code&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;container_pressure_cpu_stalled_seconds_total&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;container_pressure_cpu_waiting_seconds_total&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;container_pressure_memory_stalled_seconds_total&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;container_pressure_memory_waiting_seconds_total&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;container_pressure_io_stalled_seconds_total&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;container_pressure_io_waiting_seconds_total&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These metrics, along with the data from the Summary API, provide a granular view of resource pressure, enabling you to pinpoint the source of performance issues and take corrective action. For example, you can use these metrics to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Identify memory leaks:&lt;/strong&gt; A steadily increasing &lt;code&gt;some&lt;/code&gt; pressure for memory can indicate a memory leak in an application.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Optimize resource requests and limits:&lt;/strong&gt; By understanding the resource pressure of your workloads, you can more accurately tune their resource requests and limits.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Autoscale workloads:&lt;/strong&gt; You can use PSI metrics to trigger autoscaling events, ensuring that your workloads have the resources they need to perform optimally.&lt;/li&gt;
&lt;/ul&gt;
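&lt;p&gt;As an illustration of the alerting use case, here is a hedged sketch of a Prometheus alerting rule built on one of these counters; the alert name, threshold, and durations are placeholders, not recommendations:&lt;/p&gt;

```yaml
# Hypothetical sketch of a Prometheus alerting rule; names, labels,
# thresholds, and durations are placeholders, not recommendations.
groups:
- name: node-pressure
  rules:
  - alert: HighCPUPressure
    # These metrics count stalled seconds, so a rate() over the counter
    # approximates the fraction of time tasks were stalled.
    expr: rate(container_pressure_cpu_waiting_seconds_total[5m]) > 0.25
    for: 10m
    annotations:
      summary: "Sustained CPU pressure on {{ $labels.container }}"
```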
&lt;h2 id=&#34;how-to-enable-psi-metrics&#34;&gt;How to enable PSI metrics&lt;/h2&gt;
&lt;p&gt;To enable PSI metrics in your Kubernetes cluster, you need to:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Ensure your nodes run Linux kernel version 4.20 or later and use cgroup v2.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Enable the &lt;code&gt;KubeletPSI&lt;/code&gt; feature gate on the kubelet.&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
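&lt;p&gt;For the second step, the feature gate can be set in the kubelet configuration file. A minimal fragment (not a complete kubelet configuration) might look like this:&lt;/p&gt;

```yaml
# Minimal kubelet configuration fragment; only the feature gate is shown.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  KubeletPSI: true
```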
&lt;p&gt;Once enabled, you can start scraping the &lt;code&gt;/metrics/cadvisor&lt;/code&gt; endpoint with your Prometheus-compatible monitoring solution, or query the Summary API, to collect and visualize the new PSI metrics. Note that PSI is a Linux kernel feature, so these metrics are not available on Windows nodes: in a cluster with a mix of Linux and Windows nodes, the kubelet on Windows nodes simply does not expose them.&lt;/p&gt;
&lt;h2 id=&#34;what-s-next&#34;&gt;What&#39;s next?&lt;/h2&gt;
&lt;p&gt;We are excited to bring PSI metrics to the Kubernetes community and look forward to your feedback. As a beta feature, we are actively working on improving and extending this functionality towards a stable GA release. We encourage you to try it out and share your experiences with us.&lt;/p&gt;
&lt;p&gt;To learn more about PSI metrics, check out the official &lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/reference/instrumentation/understand-psi-metrics/&#34;&gt;Kubernetes documentation&lt;/a&gt;. You can also get involved in the conversation on the &lt;a href=&#34;https://kubernetes.slack.com/messages/sig-node&#34;&gt;#sig-node&lt;/a&gt; Slack channel.&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>Kubernetes v1.34: Service Account Token Integration for Image Pulls Graduates to Beta</title>
      <link>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/09/03/kubernetes-v1-34-sa-tokens-image-pulls-beta/</link>
      <pubDate>Wed, 03 Sep 2025 10:30:00 -0800</pubDate>
      
      <guid>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/09/03/kubernetes-v1-34-sa-tokens-image-pulls-beta/</guid>
      <description>
        
        
        &lt;p&gt;The Kubernetes community continues to advance security best practices
by reducing reliance on long-lived credentials.
Following the successful &lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/05/07/kubernetes-v1-33-wi-for-image-pulls/&#34;&gt;alpha release in Kubernetes v1.33&lt;/a&gt;,
&lt;em&gt;Service Account Token Integration for Kubelet Credential Providers&lt;/em&gt;
has now graduated to &lt;strong&gt;beta&lt;/strong&gt; in Kubernetes v1.34,
bringing us closer to eliminating long-lived image pull secrets from Kubernetes clusters.&lt;/p&gt;
&lt;p&gt;This enhancement allows credential providers
to use workload-specific service account tokens to obtain registry credentials,
providing a secure, ephemeral alternative to traditional image pull secrets.&lt;/p&gt;
&lt;h2 id=&#34;what-s-new-in-beta&#34;&gt;What&#39;s new in beta?&lt;/h2&gt;
&lt;p&gt;The beta graduation brings several important changes
that make the feature more robust and production-ready:&lt;/p&gt;
&lt;h3 id=&#34;required-cachetype-field&#34;&gt;Required &lt;code&gt;cacheType&lt;/code&gt; field&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Breaking change from alpha&lt;/strong&gt;: The &lt;code&gt;cacheType&lt;/code&gt; field is &lt;strong&gt;required&lt;/strong&gt;
in the credential provider configuration when using service account tokens.
This field is new in beta and must be specified to ensure proper caching behavior.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# CAUTION: this is not a complete configuration example, just a reference for the &amp;#39;tokenAttributes.cacheType&amp;#39; field.&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;tokenAttributes&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;serviceAccountTokenAudience&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;my-registry-audience&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;cacheType&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;ServiceAccount&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# Required field in beta&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;requireServiceAccount&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;true&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Choose between two caching strategies:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;Token&lt;/code&gt;&lt;/strong&gt;: Cache credentials per service account token
(use when credential lifetime is tied to the token).
This is useful when the credential provider transforms the service account token into registry credentials
with the same lifetime as the token, or when registries support Kubernetes service account tokens directly.
Note: The kubelet cannot send service account tokens directly to registries;
credential provider plugins are needed to transform tokens into the username/password format expected by registries.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;ServiceAccount&lt;/code&gt;&lt;/strong&gt;: Cache credentials per service account identity
(use when credentials are valid for all pods using the same service account)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;isolated-image-pull-credentials&#34;&gt;Isolated image pull credentials&lt;/h3&gt;
&lt;p&gt;The beta release provides stronger security isolation for container images
when using service account tokens for image pulls.
It ensures that pods can only access images that were pulled using ServiceAccounts they&#39;re authorized to use.
This prevents unauthorized access to sensitive container images
and enables granular access control where different workloads can have different registry permissions
based on their ServiceAccount.&lt;/p&gt;
&lt;p&gt;When credential providers use service account tokens,
the system tracks ServiceAccount identity (namespace, name, and &lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/concepts/overview/working-with-objects/names/#uids&#34;&gt;UID&lt;/a&gt;) for each pulled image.
When a pod attempts to use a cached image,
the system verifies that the pod&#39;s ServiceAccount matches exactly with the ServiceAccount
that was used to originally pull the image.&lt;/p&gt;
&lt;p&gt;Administrators can revoke access to previously pulled images
by deleting and recreating the ServiceAccount,
which changes the UID and invalidates cached image access.&lt;/p&gt;
&lt;p&gt;For more details about this capability,
see the &lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/concepts/containers/images/#ensureimagepullcredentialverification&#34;&gt;image pull credential verification&lt;/a&gt; documentation.&lt;/p&gt;
&lt;h2 id=&#34;how-it-works&#34;&gt;How it works&lt;/h2&gt;
&lt;h3 id=&#34;configuration&#34;&gt;Configuration&lt;/h3&gt;
&lt;p&gt;Credential providers opt into using ServiceAccount tokens
by configuring the &lt;code&gt;tokenAttributes&lt;/code&gt; field:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;#&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# CAUTION: this is an example configuration.&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;#          Do not use this for your own cluster!&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;#&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;kubelet.config.k8s.io/v1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;CredentialProviderConfig&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;providers&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;my-credential-provider&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;matchImages&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;*.myregistry.io/*&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;defaultCacheDuration&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;10m&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;credentialprovider.kubelet.k8s.io/v1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;tokenAttributes&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;serviceAccountTokenAudience&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;my-registry-audience&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;cacheType&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;ServiceAccount&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# New in beta&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;requireServiceAccount&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;true&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;requiredServiceAccountAnnotationKeys&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;- &lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;myregistry.io/identity-id&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;optionalServiceAccountAnnotationKeys&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;- &lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;myregistry.io/optional-annotation&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h3 id=&#34;image-pull-flow&#34;&gt;Image pull flow&lt;/h3&gt;
&lt;p&gt;At a high level, &lt;code&gt;kubelet&lt;/code&gt; coordinates with your credential provider
and the container runtime as follows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;When the image is not present locally:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;kubelet&lt;/code&gt; checks its credential cache using the configured &lt;code&gt;cacheType&lt;/code&gt;
(&lt;code&gt;Token&lt;/code&gt; or &lt;code&gt;ServiceAccount&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;If needed, &lt;code&gt;kubelet&lt;/code&gt; requests a ServiceAccount token for the pod&#39;s ServiceAccount
and passes it, plus any required annotations, to the credential provider&lt;/li&gt;
&lt;li&gt;The provider exchanges that token for registry credentials
and returns them to &lt;code&gt;kubelet&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;kubelet&lt;/code&gt; caches credentials per the &lt;code&gt;cacheType&lt;/code&gt; strategy
and pulls the image with those credentials&lt;/li&gt;
&lt;li&gt;&lt;code&gt;kubelet&lt;/code&gt; records the ServiceAccount coordinates (namespace, name, UID)
associated with the pulled image for later authorization checks&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;When the image is already present locally:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;kubelet&lt;/code&gt; verifies the pod&#39;s ServiceAccount coordinates
match the coordinates recorded for the cached image&lt;/li&gt;
&lt;li&gt;If they match exactly, the cached image can be used
without pulling from the registry&lt;/li&gt;
&lt;li&gt;If they differ, &lt;code&gt;kubelet&lt;/code&gt; performs a fresh pull
using credentials for the new ServiceAccount&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;With image pull credential verification enabled:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Authorization is enforced using the recorded ServiceAccount coordinates,
ensuring pods only use images pulled by a ServiceAccount
they are authorized to use&lt;/li&gt;
&lt;li&gt;Administrators can revoke access by deleting and recreating a ServiceAccount;
the UID changes and previously recorded authorization no longer matches&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
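&lt;p&gt;The caching and verification behavior described above can be sketched roughly as follows. This is a simplified illustration, not the actual kubelet implementation; all function and variable names are made up:&lt;/p&gt;

```python
# Simplified illustration of the caching and verification flow described
# above; not the real kubelet code. All names here are invented.

def credential_cache_key(cache_type, namespace, name, uid, token):
    """Derive a credential cache key for the configured cacheType."""
    if cache_type == "Token":
        # Credentials live only as long as this specific token.
        return ("token", token)
    if cache_type == "ServiceAccount":
        # Credentials are shared by all pods using this ServiceAccount.
        return ("serviceaccount", namespace, name, uid)
    raise ValueError(f"unknown cacheType: {cache_type}")

def may_use_cached_image(recorded_coords, pod_coords):
    """A pod may reuse a locally cached image only if its ServiceAccount
    coordinates (namespace, name, UID) exactly match those recorded at
    pull time. Deleting and recreating the ServiceAccount changes the
    UID, so the check fails and a fresh pull is required."""
    return recorded_coords == pod_coords
```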
&lt;h3 id=&#34;audience-restriction&#34;&gt;Audience restriction&lt;/h3&gt;
&lt;p&gt;The beta release builds on service account node audience restriction
(beta since v1.33) to ensure the &lt;code&gt;kubelet&lt;/code&gt; can only request tokens for authorized audiences.
Administrators use RBAC to configure the audiences for which the kubelet may request service account tokens for image pulls:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;#&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# CAUTION: this is an example configuration.&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;#          Do not use this for your own cluster!&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;#&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;rbac.authorization.k8s.io/v1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;ClusterRole&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;metadata&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;kubelet-credential-provider-audiences&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;rules&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;verbs&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;[&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;request-serviceaccounts-token-audience&amp;#34;&lt;/span&gt;]&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiGroups&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;[&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;&amp;#34;&lt;/span&gt;]&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;resources&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;[&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;my-registry-audience&amp;#34;&lt;/span&gt;]&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;resourceNames&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;[&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;registry-access-sa&amp;#34;&lt;/span&gt;]&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# Optional: specific SA&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id=&#34;getting-started-with-beta&#34;&gt;Getting started with beta&lt;/h2&gt;
&lt;h3 id=&#34;prerequisites&#34;&gt;Prerequisites&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Kubernetes v1.34 or later&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Feature gate enabled&lt;/strong&gt;:
&lt;code&gt;KubeletServiceAccountTokenForCredentialProviders=true&lt;/code&gt; (beta, enabled by default)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Credential provider support&lt;/strong&gt;:
Update your credential provider to handle ServiceAccount tokens&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;migration-from-alpha&#34;&gt;Migration from alpha&lt;/h3&gt;
&lt;p&gt;If you&#39;re already using the alpha version,
the migration to beta requires minimal changes:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Add &lt;code&gt;cacheType&lt;/code&gt; field&lt;/strong&gt;:
Update your credential provider configuration to include the required &lt;code&gt;cacheType&lt;/code&gt; field&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Review caching strategy&lt;/strong&gt;:
Choose between &lt;code&gt;Token&lt;/code&gt; and &lt;code&gt;ServiceAccount&lt;/code&gt; cache types based on your provider&#39;s behavior&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Test audience restrictions&lt;/strong&gt;:
Ensure your RBAC configuration, or other cluster authorization rules, will properly restrict token audiences&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;example-setup&#34;&gt;Example setup&lt;/h3&gt;
&lt;p&gt;Here&#39;s a complete example
for setting up a credential provider with service account tokens
(this example assumes your cluster uses RBAC authorization):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;#&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# CAUTION: this is an example configuration.&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;#          Do not use this for your own cluster!&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;#&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# Service Account with registry annotations&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;v1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;ServiceAccount&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;metadata&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;registry-access-sa&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;namespace&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;default&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;annotations&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;myregistry.io/identity-id&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;user123&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#00f;font-weight:bold&#34;&gt;---&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# RBAC for audience restriction&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;rbac.authorization.k8s.io/v1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;ClusterRole&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;metadata&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;registry-audience-access&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;rules&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;verbs&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;[&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;request-serviceaccounts-token-audience&amp;#34;&lt;/span&gt;]&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiGroups&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;[&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;&amp;#34;&lt;/span&gt;]&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;resources&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;[&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;my-registry-audience&amp;#34;&lt;/span&gt;]&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;resourceNames&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;[&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;registry-access-sa&amp;#34;&lt;/span&gt;]&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# Optional: specific ServiceAccount&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#00f;font-weight:bold&#34;&gt;---&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;rbac.authorization.k8s.io/v1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;ClusterRoleBinding&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;metadata&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;kubelet-registry-audience&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;roleRef&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiGroup&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;rbac.authorization.k8s.io&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;ClusterRole&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;registry-audience-access&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;subjects&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Group&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;system:nodes&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiGroup&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;rbac.authorization.k8s.io&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#00f;font-weight:bold&#34;&gt;---&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# Pod using the ServiceAccount&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;v1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Pod&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;metadata&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;my-pod&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;spec&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;serviceAccountName&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;registry-access-sa&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;containers&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;my-app&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;image&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;myregistry.example/my-app:latest&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id=&#34;what-s-next&#34;&gt;What&#39;s next?&lt;/h2&gt;
&lt;p&gt;For Kubernetes v1.35, we (Kubernetes SIG Auth) expect the feature to remain in beta,
and we will continue to solicit feedback.&lt;/p&gt;
&lt;p&gt;You can learn more about this feature
on the &lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/tasks/administer-cluster/kubelet-credential-provider/#service-account-token-for-image-pulls&#34;&gt;service account token for image pulls&lt;/a&gt;
page in the Kubernetes documentation.&lt;/p&gt;
&lt;p&gt;You can also follow
&lt;a href=&#34;https://kep.k8s.io/4412&#34;&gt;KEP-4412&lt;/a&gt;
to track progress across the coming Kubernetes releases.&lt;/p&gt;
&lt;h2 id=&#34;call-to-action&#34;&gt;Call to action&lt;/h2&gt;
&lt;p&gt;In this blog post,
I covered the beta graduation of ServiceAccount token integration
for Kubelet Credential Providers in Kubernetes v1.34,
and discussed the key improvements,
including the required &lt;code&gt;cacheType&lt;/code&gt; field
and enhanced integration with Ensure Secret Pulled Images.&lt;/p&gt;
&lt;p&gt;We have been receiving positive feedback from the community during the alpha phase
and would love to hear more as we stabilize this feature for GA.
In particular, we would like feedback from credential provider implementors
as they integrate with the new beta API and caching mechanisms.
Please reach out to us on the &lt;a href=&#34;https://kubernetes.slack.com/archives/C04UMAUC4UA&#34;&gt;#sig-auth-authenticators-dev&lt;/a&gt; channel on Kubernetes Slack.&lt;/p&gt;
&lt;h2 id=&#34;how-to-get-involved&#34;&gt;How to get involved&lt;/h2&gt;
&lt;p&gt;If you are interested in getting involved in the development of this feature,
share feedback, or participate in any other ongoing SIG Auth projects,
please reach out on the &lt;a href=&#34;https://kubernetes.slack.com/archives/C0EN96KUY&#34;&gt;#sig-auth&lt;/a&gt; channel on Kubernetes Slack.&lt;/p&gt;
&lt;p&gt;You are also welcome to join the &lt;a href=&#34;https://github.com/kubernetes/community/blob/master/sig-auth/README.md#meetings&#34;&gt;SIG Auth meetings&lt;/a&gt;,
held every other Wednesday.&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>Kubernetes v1.34: Introducing CPU Manager Static Policy Option for Uncore Cache Alignment</title>
      <link>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/09/02/kubernetes-v1-34-prefer-align-by-uncore-cache-cpumanager-static-policy-optimization/</link>
      <pubDate>Tue, 02 Sep 2025 10:30:00 -0800</pubDate>
      
      <guid>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/09/02/kubernetes-v1-34-prefer-align-by-uncore-cache-cpumanager-static-policy-optimization/</guid>
      <description>
        
        
        &lt;p&gt;A new CPU Manager Static Policy Option called &lt;code&gt;prefer-align-cpus-by-uncorecache&lt;/code&gt; was introduced in Kubernetes v1.32 as an alpha feature, and has graduated to &lt;strong&gt;beta&lt;/strong&gt; in Kubernetes v1.34.
This CPU Manager Policy Option is designed to optimize performance for specific workloads running on processors with a &lt;em&gt;split uncore cache&lt;/em&gt; architecture.
In this article, I&#39;ll explain what that means and why it&#39;s useful.&lt;/p&gt;
&lt;h2 id=&#34;understanding-the-feature&#34;&gt;Understanding the feature&lt;/h2&gt;
&lt;h3 id=&#34;what-is-uncore-cache&#34;&gt;What is uncore cache?&lt;/h3&gt;
&lt;p&gt;Until relatively recently, nearly all mainstream computer processors had a
monolithic last-level cache that was shared across every core in a multi-core
CPU package.
This monolithic cache is also referred to as &lt;em&gt;uncore cache&lt;/em&gt;
(because it is not linked to a specific core), or as Level 3 cache.
As well as the Level 3 cache, there is other cache, commonly called Level 1 and Level 2 cache,
that &lt;strong&gt;is&lt;/strong&gt; associated with a specific CPU core.&lt;/p&gt;
&lt;p&gt;In order to reduce access latency between the CPU cores and their cache, recent AMD64 and ARM
architecture based processors have introduced a &lt;em&gt;split uncore cache&lt;/em&gt; architecture,
where the last-level cache is divided into multiple physical caches
that are aligned to specific CPU groupings within the physical package.
The shorter distances within the CPU package help to reduce latency.
&lt;img alt=&#34;Diagram showing monolithic cache on the left and split cache on the right&#34; src=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/09/02/kubernetes-v1-34-prefer-align-by-uncore-cache-cpumanager-static-policy-optimization/mono_vs_split_uncore.png&#34;&gt;&lt;/p&gt;
&lt;p&gt;Kubernetes is able to place workloads in a way that accounts for the cache
topology within the CPU package(s).&lt;/p&gt;
&lt;h3 id=&#34;cache-aware-workload-placement&#34;&gt;Cache-aware workload placement&lt;/h3&gt;
&lt;p&gt;The matrix below shows the &lt;a href=&#34;https://github.com/nviennot/core-to-core-latency&#34;&gt;CPU-to-CPU latency&lt;/a&gt; measured in nanoseconds (lower is better) when
passing a packet between CPUs via the cache coherence protocol, on a processor that
uses a split uncore cache.
In this example, the processor package consists of 2 uncore caches.
Each uncore cache serves 8 CPU cores.
&lt;img alt=&#34;Table showing CPU-to-CPU latency figures&#34; src=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/09/02/kubernetes-v1-34-prefer-align-by-uncore-cache-cpumanager-static-policy-optimization/c2c_latency.png&#34;&gt;
Blue entries in the matrix represent latency between CPUs sharing the same uncore cache, while grey entries indicate latency between CPUs corresponding to different uncore caches. Latency between CPUs that correspond to different caches is higher than the latency between CPUs that belong to the same cache.&lt;/p&gt;
&lt;p&gt;With &lt;code&gt;prefer-align-cpus-by-uncorecache&lt;/code&gt; enabled, the
&lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/concepts/policy/node-resource-managers/#static-policy&#34;&gt;static CPU Manager&lt;/a&gt; attempts to allocate CPU resources for a container such that all CPUs assigned to the container share the same uncore cache.
This policy operates on a best-effort basis, aiming to minimize the distribution of a container&#39;s CPU resources across uncore caches, based on the
container&#39;s requirements, and accounting for allocatable resources on the node.&lt;/p&gt;
&lt;p&gt;By running a workload, where it can, on a set of CPUs that spans the smallest feasible number of uncore caches, applications benefit from reduced cache latency (as seen in the matrix above),
and from reduced contention with other workloads, which can result in higher overall throughput.
The benefit only shows up if your nodes use a split uncore cache topology for their processors.&lt;/p&gt;
&lt;p&gt;The diagram below illustrates uncore cache alignment when the feature is enabled.&lt;/p&gt;
&lt;p&gt;&lt;img alt=&#34;Diagram showing an example workload CPU assignment, default static policy, and with prefer-align-cpus-by-uncorecache&#34; src=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/09/02/kubernetes-v1-34-prefer-align-by-uncore-cache-cpumanager-static-policy-optimization/cache-align-diagram.png&#34;&gt;&lt;/p&gt;
&lt;p&gt;By default, Kubernetes does not account for uncore cache topology; containers are assigned CPU resources using a packed methodology.
As a result, Container 1 and Container 2 can experience a noisy-neighbor impact due to
cache access contention on Uncore Cache 0. Additionally, Container 2 will have CPUs distributed across both caches, which can introduce cross-cache latency.&lt;/p&gt;
&lt;p&gt;With &lt;code&gt;prefer-align-cpus-by-uncorecache&lt;/code&gt; enabled, each container is isolated on an individual cache. This resolves the cache contention between the containers and minimizes the cache latency for the CPUs being utilized.&lt;/p&gt;
&lt;h2 id=&#34;use-cases&#34;&gt;Use cases&lt;/h2&gt;
&lt;p&gt;Common use cases can include telco applications like vRAN, Mobile Packet Core, and Firewalls. It&#39;s important to note that the optimization provided by &lt;code&gt;prefer-align-cpus-by-uncorecache&lt;/code&gt; can be dependent on the workload. For example, applications that are memory bandwidth bound may not benefit from uncore cache alignment, as utilizing more uncore caches can increase memory bandwidth access.&lt;/p&gt;
&lt;h2 id=&#34;enabling-the-feature&#34;&gt;Enabling the feature&lt;/h2&gt;
&lt;p&gt;To enable this feature, set the CPU Manager Policy to &lt;code&gt;static&lt;/code&gt; and enable the CPU Manager Policy Options with &lt;code&gt;prefer-align-cpus-by-uncorecache&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;For Kubernetes 1.34, the feature is in the beta stage and requires the &lt;code&gt;CPUManagerPolicyBetaOptions&lt;/code&gt;
&lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/reference/command-line-tools-reference/feature-gates/&#34;&gt;feature gate&lt;/a&gt; to also be enabled.&lt;/p&gt;
&lt;p&gt;Append the following to the kubelet configuration file:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;KubeletConfiguration&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;kubelet.config.k8s.io/v1beta1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;featureGates&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;...&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;CPUManagerPolicyBetaOptions&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;true&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;cpuManagerPolicy&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;static&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;cpuManagerPolicyOptions&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;prefer-align-cpus-by-uncorecache&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;true&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;reservedSystemCPUs&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;0&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#00f;font-weight:bold&#34;&gt;...&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;If you&#39;re making this change to an existing node, remove the &lt;code&gt;cpu_manager_state&lt;/code&gt; file and then restart kubelet.&lt;/p&gt;
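&lt;p&gt;For example, assuming the default kubelet root directory of &lt;code&gt;/var/lib/kubelet&lt;/code&gt; and a systemd-managed kubelet (both of which can vary between distributions), the reset might look like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Stop the kubelet, remove the stale CPU Manager checkpoint, then restart
sudo systemctl stop kubelet
sudo rm /var/lib/kubelet/cpu_manager_state
sudo systemctl start kubelet
&lt;/code&gt;&lt;/pre&gt;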
&lt;p&gt;&lt;code&gt;prefer-align-cpus-by-uncorecache&lt;/code&gt; can be enabled on nodes with a monolithic uncore cache processor. The feature will mimic a best-effort socket alignment effect and will pack CPU resources on the socket similar to the default static CPU Manager policy.&lt;/p&gt;
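&lt;p&gt;Note that the static policy, and therefore uncore cache alignment, only applies to containers in &lt;em&gt;Guaranteed&lt;/em&gt; QoS Pods that request whole (integer) CPUs. A minimal illustrative Pod (names and values are placeholders):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: uncore-aligned-app
spec:
  containers:
  - name: app
    image: registry.example/app:latest
    resources:
      requests:
        cpu: 4        # whole CPUs, required for exclusive allocation
        memory: 2Gi
      limits:
        cpu: 4        # limits must equal requests for Guaranteed QoS
        memory: 2Gi
&lt;/code&gt;&lt;/pre&gt;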
&lt;h2 id=&#34;further-reading&#34;&gt;Further reading&lt;/h2&gt;
&lt;p&gt;See &lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/concepts/policy/node-resource-managers/&#34;&gt;Node Resource Managers&lt;/a&gt; to learn more about the CPU Manager and the available policies.&lt;/p&gt;
&lt;p&gt;Reference the documentation for &lt;code&gt;prefer-align-cpus-by-uncorecache&lt;/code&gt; &lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/concepts/policy/node-resource-managers/#prefer-align-cpus-by-uncorecache&#34;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Please see the &lt;a href=&#34;https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/4800-cpumanager-split-uncorecache&#34;&gt;Kubernetes Enhancement Proposal&lt;/a&gt; for more information on how &lt;code&gt;prefer-align-cpus-by-uncorecache&lt;/code&gt; is implemented.&lt;/p&gt;
&lt;h2 id=&#34;getting-involved&#34;&gt;Getting involved&lt;/h2&gt;
&lt;p&gt;This feature is driven by &lt;a href=&#34;https://github.com/Kubernetes/community/blob/master/sig-node/README.md&#34;&gt;SIG Node&lt;/a&gt;. If you are interested in helping develop this feature, sharing feedback, or participating in any other ongoing SIG Node projects, please attend the SIG Node meeting for more details.&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>Kubernetes v1.34: DRA has graduated to GA</title>
      <link>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/09/01/kubernetes-v1-34-dra-updates/</link>
      <pubDate>Mon, 01 Sep 2025 10:30:00 -0800</pubDate>
      
      <guid>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/09/01/kubernetes-v1-34-dra-updates/</guid>
      <description>
        
        
        &lt;p&gt;Kubernetes 1.34 is here, and it has brought a huge wave of enhancements for Dynamic Resource Allocation (DRA)! This
release marks a major milestone with many APIs in the &lt;code&gt;resource.k8s.io&lt;/code&gt; group graduating to General Availability (GA),
unlocking the full potential of how you manage devices on Kubernetes. On top of that, several key features have
moved to beta, and a fresh batch of new alpha features promise even more expressiveness and flexibility.&lt;/p&gt;
&lt;p&gt;Let&#39;s dive into what&#39;s new for DRA in Kubernetes 1.34!&lt;/p&gt;
&lt;h2 id=&#34;the-core-of-dra-is-now-ga&#34;&gt;The core of DRA is now GA&lt;/h2&gt;
&lt;p&gt;The headline feature of the v1.34 release is that the core of DRA has graduated to General Availability.&lt;/p&gt;
&lt;p&gt;Kubernetes &lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/concepts/scheduling-eviction/dynamic-resource-allocation/&#34;&gt;Dynamic Resource Allocation (DRA)&lt;/a&gt; provides
a flexible framework for managing specialized hardware and infrastructure resources, such as GPUs or FPGAs. DRA
provides APIs that enable each workload to specify the properties of the devices it needs, while leaving it to the
scheduler to allocate actual devices, which increases reliability and improves utilization of expensive hardware.&lt;/p&gt;
&lt;p&gt;With the graduation to GA, DRA is stable and will be part of Kubernetes for the long run. The community can still
expect a steady stream of new features being added to DRA over the next several Kubernetes releases, but these will
not include breaking changes to DRA. So users and developers of DRA drivers can start adopting DRA with confidence.&lt;/p&gt;
&lt;p&gt;Starting with Kubernetes 1.34, DRA is enabled by default; the DRA features that have reached beta are &lt;strong&gt;also&lt;/strong&gt; enabled by default.
That&#39;s because the default API version for DRA is now the stable &lt;code&gt;v1&lt;/code&gt; version, and not the earlier versions
(eg: &lt;code&gt;v1beta1&lt;/code&gt; or &lt;code&gt;v1beta2&lt;/code&gt;) that needed explicit opt in.&lt;/p&gt;
&lt;h2 id=&#34;features-promoted-to-beta&#34;&gt;Features promoted to beta&lt;/h2&gt;
&lt;p&gt;Several powerful features have been promoted to beta, adding more control, flexibility, and observability to resource
management with DRA.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#admin-access&#34;&gt;Admin access labelling&lt;/a&gt; has been updated.
In v1.34, you can restrict admin-level device access to the people (or software) authorized to use it. This is meant
as a way to avoid privilege escalation if a DRA driver grants additional privileges when admin access is requested
and to avoid accessing devices which are in use by normal applications, potentially in another namespace.
The restriction works by ensuring that only users with access to a namespace with the
&lt;code&gt;resource.k8s.io/admin-access: &amp;quot;true&amp;quot;&lt;/code&gt; label are authorized to create
ResourceClaim or ResourceClaimTemplates objects with the &lt;code&gt;adminAccess&lt;/code&gt; field set to true. This ensures that non-admin users cannot misuse the feature.&lt;/p&gt;
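&lt;p&gt;For example, a cluster administrator could mark a namespace as authorized for admin access like this (the namespace name is illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;kubectl label namespace dra-admin resource.k8s.io/admin-access=true
&lt;/code&gt;&lt;/pre&gt;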
&lt;p&gt;&lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#prioritized-list&#34;&gt;Prioritized list&lt;/a&gt; lets users specify
a list of acceptable devices for their workloads, rather than just a single type of device. So while the workload
might run best on a single high-performance GPU, it might also be able to run on 2 mid-level GPUs. The scheduler will
attempt to satisfy the alternatives in the list in order, so the workload will be allocated the best set of devices
available on the node.&lt;/p&gt;
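&lt;p&gt;As a rough sketch, the alternatives are expressed as subrequests within a ResourceClaim; the device class names below are hypothetical, and the field layout reflects the author&#39;s understanding of the &lt;code&gt;resource.k8s.io/v1&lt;/code&gt; API rather than a definitive reference:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: gpu-claim
spec:
  devices:
    requests:
    - name: gpu
      firstAvailable:              # alternatives, tried in order
      - name: one-large-gpu
        deviceClassName: large-gpu.example.com
        count: 1
      - name: two-medium-gpus
        deviceClassName: medium-gpu.example.com
        count: 2
&lt;/code&gt;&lt;/pre&gt;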
&lt;p&gt;The kubelet&#39;s PodResources API has been updated to report the resources allocated to Pods through DRA. This lets node
monitoring agents discover the DRA resources in use on a node, and makes it possible to build new features and integrations
on top of that information.&lt;/p&gt;
&lt;h2 id=&#34;new-alpha-features&#34;&gt;New alpha features&lt;/h2&gt;
&lt;p&gt;Kubernetes 1.34 also introduces several new alpha features that give us a glimpse into the future of resource management with DRA.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#extended-resource&#34;&gt;Extended resource mapping&lt;/a&gt; support in DRA allows
cluster administrators to advertise DRA-managed resources as &lt;em&gt;extended resources&lt;/em&gt;, allowing developers to consume them using
the familiar, simpler request syntax while still benefiting from dynamic allocation. This makes it possible for existing
workloads to start using DRA without modifications, simplifying the transition to DRA for both application developers and
cluster administrators.&lt;/p&gt;
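&lt;p&gt;As an illustrative sketch of this alpha feature (the names are made up, and the field layout should be checked against the alpha API documentation), a cluster administrator could map a DeviceClass to an extended resource name:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-yaml&#34;&gt;apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: gpu.example.com
spec:
  extendedResourceName: example.com/gpu   # advertised to workloads as an extended resource
  selectors:
  - cel:
      expression: device.driver == &amp;quot;gpu.example.com&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A container can then simply request &lt;code&gt;example.com/gpu: 1&lt;/code&gt; in its &lt;code&gt;resources.limits&lt;/code&gt;, without any DRA-specific fields in the Pod spec.&lt;/p&gt;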
&lt;p&gt;&lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#consumable-capacity&#34;&gt;Consumable capacity&lt;/a&gt; introduces a flexible
device sharing model where multiple, independent resource claims from unrelated
pods can each be allocated a share of the same underlying physical device. This new capability is managed through optional,
administrator-defined sharing policies that govern how a device&#39;s total capacity is divided and enforced by the platform for
each request. This allows for sharing of devices in scenarios where pre-defined partitions are not viable.&lt;/p&gt;
&lt;p&gt;For more information, see &lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/09/18/kubernetes-v1-34-dra-consumable-capacity/&#34;&gt;Kubernetes v1.34: DRA Consumable Capacity&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#binding-conditions&#34;&gt;Binding conditions&lt;/a&gt; improve scheduling
reliability for certain classes of devices by allowing the Kubernetes scheduler to delay binding a pod to a node until its
required external resources, such as attachable devices or FPGAs, are confirmed to be fully prepared. This prevents premature
pod assignments that could lead to failures and ensures more robust, predictable scheduling by explicitly modeling resource
readiness before the pod is committed to a node.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Resource health status&lt;/em&gt; for DRA improves observability by exposing the health status of devices allocated to a Pod via Pod Status.
This works whether the device is allocated through DRA or Device Plugin. This makes it easier to understand the cause of an
unhealthy device and respond properly.&lt;/p&gt;
&lt;p&gt;For more information, see &lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/09/17/kubernetes-v1-34-pods-report-dra-resource-health/&#34;&gt;Kubernetes v1.34: Pods Report DRA Resource Health&lt;/a&gt;.&lt;/p&gt;
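&lt;p&gt;For illustration, this health information surfaces in the Pod status roughly as follows (the names and IDs are made up, and the exact fields should be checked against the Pod API reference):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-yaml&#34;&gt;status:
  containerStatuses:
  - name: main
    allocatedResourcesStatus:
    - name: claim:gpu-claim     # the ResourceClaim backing the device
      resources:
      - resourceID: gpu-0
        health: Unhealthy
&lt;/code&gt;&lt;/pre&gt;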
&lt;h2 id=&#34;what-s-next&#34;&gt;What’s next?&lt;/h2&gt;
&lt;p&gt;While DRA was promoted to GA this cycle, the hard work on DRA doesn&#39;t stop. There are several features in alpha and beta that
we plan to bring to GA over the next couple of releases, and we are looking to continue improving the performance, scalability,
and reliability of DRA. So expect an equally ambitious set of features in DRA for the 1.35 release.&lt;/p&gt;
&lt;h2 id=&#34;getting-involved&#34;&gt;Getting involved&lt;/h2&gt;
&lt;p&gt;A good starting point is joining the WG Device Management &lt;a href=&#34;https://kubernetes.slack.com/archives/C0409NGC1TK&#34;&gt;Slack channel&lt;/a&gt; and &lt;a href=&#34;https://docs.google.com/document/d/1qxI87VqGtgN7EAJlqVfxx86HGKEAc2A3SKru8nJHNkQ/edit?tab=t.0#heading=h.tgg8gganowxq&#34;&gt;meetings&lt;/a&gt;, which happen at US/EU and EU/APAC friendly time slots.&lt;/p&gt;
&lt;p&gt;Not all enhancement ideas are tracked as issues yet, so come talk to us if you want to help or have some ideas yourself! We have work to do at all levels, from difficult core changes to usability enhancements in kubectl, which could be picked up by newcomers.&lt;/p&gt;
&lt;h2 id=&#34;acknowledgments&#34;&gt;Acknowledgments&lt;/h2&gt;
&lt;p&gt;A huge thanks to the new contributors to DRA this cycle:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Alay Patel (&lt;a href=&#34;https://github.com/alaypatel07&#34;&gt;alaypatel07&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Gaurav Kumar Ghildiyal (&lt;a href=&#34;https://github.com/gauravkghildiyal&#34;&gt;gauravkghildiyal&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;JP (&lt;a href=&#34;https://github.com/Jpsassine&#34;&gt;Jpsassine&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Kobayashi Daisuke (&lt;a href=&#34;https://github.com/KobayashiD27&#34;&gt;KobayashiD27&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Laura Lorenz (&lt;a href=&#34;https://github.com/lauralorenz&#34;&gt;lauralorenz&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Sunyanan Choochotkaew (&lt;a href=&#34;https://github.com/sunya-ch&#34;&gt;sunya-ch&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Swati Gupta (&lt;a href=&#34;https://github.com/guptaNswati&#34;&gt;guptaNswati&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Yu Liao (&lt;a href=&#34;https://github.com/yliaog&#34;&gt;yliaog&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;

      </description>
    </item>
    
    <item>
      <title>Kubernetes v1.34: Finer-Grained Control Over Container Restarts</title>
      <link>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/08/29/kubernetes-v1-34-per-container-restart-policy/</link>
      <pubDate>Fri, 29 Aug 2025 10:30:00 -0800</pubDate>
      
      <guid>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/08/29/kubernetes-v1-34-per-container-restart-policy/</guid>
      <description>
        
        
        &lt;p&gt;With the release of Kubernetes 1.34, a new alpha feature is introduced
that gives you more granular control over container restarts within a Pod. This
feature, named &lt;strong&gt;Container Restart Policy and Rules&lt;/strong&gt;, allows you to specify a
restart policy for each container individually, overriding the Pod&#39;s global
restart policy. It also allows you to conditionally restart
individual containers based on their exit codes. This feature is available
behind the alpha feature gate &lt;code&gt;ContainerRestartRules&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;This has been a long-requested feature. Let&#39;s dive into how it works and how you
can use it.&lt;/p&gt;
&lt;h2 id=&#34;the-problem-with-a-single-restart-policy&#34;&gt;The problem with a single restart policy&lt;/h2&gt;
&lt;p&gt;Before this feature, the &lt;code&gt;restartPolicy&lt;/code&gt; was set at the Pod level. This meant
that all containers in a Pod shared the same restart policy (&lt;code&gt;Always&lt;/code&gt;,
&lt;code&gt;OnFailure&lt;/code&gt;, or &lt;code&gt;Never&lt;/code&gt;). While this works for many use cases, it can be
limiting in others.&lt;/p&gt;
&lt;p&gt;For example, consider a Pod with a main application container and an init
container that performs some initial setup. You might want the main container
to always restart on failure, but the init container should only run once and
never restart. With a single Pod-level restart policy, this wasn&#39;t possible.&lt;/p&gt;
&lt;h2 id=&#34;introducing-per-container-restart-policies&#34;&gt;Introducing per-container restart policies&lt;/h2&gt;
&lt;p&gt;With the new &lt;code&gt;ContainerRestartRules&lt;/code&gt; feature gate, you can now specify a
&lt;code&gt;restartPolicy&lt;/code&gt; for each container in your Pod&#39;s spec. You can also define
&lt;code&gt;restartPolicyRules&lt;/code&gt; to control restarts based on exit codes. This gives you
the fine-grained control you need to handle complex scenarios.&lt;/p&gt;
&lt;h2 id=&#34;use-cases&#34;&gt;Use cases&lt;/h2&gt;
&lt;p&gt;Let&#39;s look at some real-life use cases where per-container restart policies can
be beneficial.&lt;/p&gt;
&lt;h3 id=&#34;in-place-restarts-for-training-jobs&#34;&gt;In-place restarts for training jobs&lt;/h3&gt;
&lt;p&gt;In ML research, it&#39;s common to orchestrate a large number of long-running AI/ML
training workloads. In these scenarios, workload failures are unavoidable. When
a workload fails with a retriable exit code, you want the container to restart
quickly without rescheduling the entire Pod, which consumes a significant amount
of time and resources. Restarting the failed container &amp;quot;in-place&amp;quot; is critical
for better utilization of compute resources. The container should only restart
&amp;quot;in-place&amp;quot; if it failed due to a retriable error; otherwise, the container and
Pod should terminate and possibly be rescheduled.&lt;/p&gt;
&lt;p&gt;This can now be achieved with container-level &lt;code&gt;restartPolicyRules&lt;/code&gt;. The workload
can exit with different codes to represent retriable and non-retriable errors.
With &lt;code&gt;restartPolicyRules&lt;/code&gt;, the workload can be restarted in-place quickly, but
only when the error is retriable.&lt;/p&gt;
&lt;h3 id=&#34;try-once-init-containers&#34;&gt;Try-once init containers&lt;/h3&gt;
&lt;p&gt;Init containers are often used to perform initialization work for the main
container, such as setting up environments and credentials. Sometimes, you want
the main container to always be restarted, but you don&#39;t want to retry
initialization if it fails.&lt;/p&gt;
&lt;p&gt;With a container-level &lt;code&gt;restartPolicy&lt;/code&gt;, this is now possible. The init container
can be executed only once, and its failure would be considered a Pod failure. If
the initialization succeeds, the main container can always be restarted.&lt;/p&gt;
&lt;h3 id=&#34;pods-with-multiple-containers&#34;&gt;Pods with multiple containers&lt;/h3&gt;
&lt;p&gt;For Pods that run multiple containers, you might have different restart
requirements for each container. Some containers might have a clear definition
of success and should only be restarted on failure. Others might need to be
always restarted.&lt;/p&gt;
&lt;p&gt;This is now possible with a container-level &lt;code&gt;restartPolicy&lt;/code&gt;, allowing individual
containers to have different restart policies.&lt;/p&gt;
&lt;h2 id=&#34;how-to-use-it&#34;&gt;How to use it&lt;/h2&gt;
&lt;p&gt;To use this new feature, you need to enable the &lt;code&gt;ContainerRestartRules&lt;/code&gt; feature
gate on your Kubernetes cluster control-plane and worker nodes running
Kubernetes 1.34+. Once enabled, you can specify the &lt;code&gt;restartPolicy&lt;/code&gt; and
&lt;code&gt;restartPolicyRules&lt;/code&gt; fields in your container definitions.&lt;/p&gt;
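&lt;p&gt;How the gate is enabled depends on how your cluster is deployed; as a sketch, it is passed to the API server and the kubelet like any other feature gate (other flags elided):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;kube-apiserver --feature-gates=ContainerRestartRules=true ...
kubelet --feature-gates=ContainerRestartRules=true ...
&lt;/code&gt;&lt;/pre&gt;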
&lt;p&gt;Here are some examples:&lt;/p&gt;
&lt;h3 id=&#34;example-1-restarting-on-specific-exit-codes&#34;&gt;Example 1: Restarting on specific exit codes&lt;/h3&gt;
&lt;p&gt;In this example, the container should restart if and only if it fails with a
retriable error, represented by exit code 42.&lt;/p&gt;
&lt;p&gt;To achieve this, the container has &lt;code&gt;restartPolicy: Never&lt;/code&gt;, and a restart
policy rule that tells Kubernetes to restart the container in-place if it exits
with code 42.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;v1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Pod&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;metadata&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;restart-on-exit-codes&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;annotations&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kubernetes.io/description&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;This Pod restarts the container only when it exits with code 42.&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;spec&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;restartPolicy&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Never&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;containers&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;restart-on-exit-codes&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;image&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;docker.io/library/busybox:1.28&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;command&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;[&lt;span style=&#34;color:#b44&#34;&gt;&amp;#39;sh&amp;#39;&lt;/span&gt;,&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#39;-c&amp;#39;&lt;/span&gt;,&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#39;sleep 60 &amp;amp;&amp;amp; exit 0&amp;#39;&lt;/span&gt;]&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;restartPolicy&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Never    &lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# Container restart policy must be specified if rules are specified&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;restartPolicyRules&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# Only restart the container if it exits with code 42&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;action&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Restart&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;exitCodes&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;operator&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;In&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;values&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;[&lt;span style=&#34;color:#666&#34;&gt;42&lt;/span&gt;]&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h3 id=&#34;example-2-a-try-once-init-container&#34;&gt;Example 2: A try-once init container&lt;/h3&gt;
&lt;p&gt;In this example, the main container should always be restarted once initialization
succeeds, but the initialization itself should only be tried once.&lt;/p&gt;
&lt;p&gt;To achieve this, the Pod has an &lt;code&gt;Always&lt;/code&gt; restart policy. The &lt;code&gt;init-once&lt;/code&gt;
init container will only try once. If it fails, the Pod will fail. This allows
the Pod to fail if the initialization failed, but also keep running once the
initialization succeeds.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;v1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Pod&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;metadata&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;fail-pod-if-init-fails&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;annotations&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kubernetes.io/description&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;This Pod has an init container that runs only once. After initialization succeeds, the main container will always be restarted.&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;spec&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;restartPolicy&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Always&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;initContainers&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;init-once     &lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# This init container will only try once. If it fails, the Pod will fail.&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;image&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;docker.io/library/busybox:1.28&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;command&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;[&lt;span style=&#34;color:#b44&#34;&gt;&amp;#39;sh&amp;#39;&lt;/span&gt;,&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#39;-c&amp;#39;&lt;/span&gt;,&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#39;echo &amp;#34;Failing initialization&amp;#34; &amp;amp;&amp;amp; sleep 10 &amp;amp;&amp;amp; exit 1&amp;#39;&lt;/span&gt;]&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;restartPolicy&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Never&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;containers&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;main-container&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# This container will always be restarted once initialization succeeds.&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;image&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;docker.io/library/busybox:1.28&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;command&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;[&lt;span style=&#34;color:#b44&#34;&gt;&amp;#39;sh&amp;#39;&lt;/span&gt;,&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#39;-c&amp;#39;&lt;/span&gt;,&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#39;sleep 1800 &amp;amp;&amp;amp; exit 0&amp;#39;&lt;/span&gt;]&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h3 id=&#34;example-3-containers-with-different-restart-policies&#34;&gt;Example 3: Containers with different restart policies&lt;/h3&gt;
&lt;p&gt;In this example, there are two containers with different restart requirements. One
should always be restarted, while the other should only be restarted on failure.&lt;/p&gt;
&lt;p&gt;This is achieved by using a different container-level &lt;code&gt;restartPolicy&lt;/code&gt; on each of
the two containers.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;v1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Pod&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;metadata&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;on-failure-pod&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;annotations&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kubernetes.io/description&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;This Pod has two containers with different restart policies.&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;spec&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;containers&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;restart-on-failure&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;image&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;docker.io/library/busybox:1.28&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;command&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;[&lt;span style=&#34;color:#b44&#34;&gt;&amp;#39;sh&amp;#39;&lt;/span&gt;,&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#39;-c&amp;#39;&lt;/span&gt;,&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#39;echo &amp;#34;Not restarting after success&amp;#34; &amp;amp;&amp;amp; sleep 10 &amp;amp;&amp;amp; exit 0&amp;#39;&lt;/span&gt;]&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;restartPolicy&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;OnFailure&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;restart-always&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;image&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;docker.io/library/busybox:1.28&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;command&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;[&lt;span style=&#34;color:#b44&#34;&gt;&amp;#39;sh&amp;#39;&lt;/span&gt;,&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#39;-c&amp;#39;&lt;/span&gt;,&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#39;echo &amp;#34;Always restarting&amp;#34; &amp;amp;&amp;amp; sleep 1800 &amp;amp;&amp;amp; exit 0&amp;#39;&lt;/span&gt;]&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;restartPolicy&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Always&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id=&#34;learn-more&#34;&gt;Learn more&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Read the documentation for
&lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/concepts/workloads/pods/pod-lifecycle/#container-restart-rules&#34;&gt;container restart policy&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Read the KEP for the
&lt;a href=&#34;https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/5307-container-restart-policy&#34;&gt;Container Restart Rules&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;roadmap&#34;&gt;Roadmap&lt;/h2&gt;
&lt;p&gt;More actions and signals to restart Pods and containers are coming! Notably,
there are plans to add support for restarting the entire Pod. Planning and
discussions on these features are in progress. Feel free to share feedback or
requests with the SIG Node community!&lt;/p&gt;
&lt;h2 id=&#34;your-feedback-is-welcome&#34;&gt;Your feedback is welcome!&lt;/h2&gt;
&lt;p&gt;This is an alpha feature, and the Kubernetes project would love to hear your feedback.
Please try it out. This feature is driven by
&lt;a href=&#34;https://github.com/kubernetes/community/blob/master/sig-node/README.md&#34;&gt;SIG Node&lt;/a&gt;.
If you are interested in helping develop this feature, sharing feedback, or
participating in any other ongoing SIG Node projects, please reach out to the
SIG Node community!&lt;/p&gt;
&lt;p&gt;You can reach SIG Node by several means:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Slack: &lt;a href=&#34;https://kubernetes.slack.com/messages/sig-node&#34;&gt;#sig-node&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://groups.google.com/forum/#!forum/kubernetes-sig-node&#34;&gt;Mailing list&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/kubernetes/community/labels/sig%2Fnode&#34;&gt;Open Community Issues/PRs&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

      </description>
    </item>
    
    <item>
      <title>Kubernetes v1.34: User preferences (kuberc) are available for testing in kubectl 1.34</title>
      <link>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/08/28/kubernetes-v1-34-kubectl-kuberc-beta/</link>
      <pubDate>Thu, 28 Aug 2025 10:30:00 -0800</pubDate>
      
      <guid>https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/blog/2025/08/28/kubernetes-v1-34-kubectl-kuberc-beta/</guid>
      <description>
        
        
        &lt;p&gt;Have you ever wished you could enable &lt;a href=&#34;https://kep.k8s.io/3895&#34;&gt;interactive delete&lt;/a&gt;,
by default, in &lt;code&gt;kubectl&lt;/code&gt;? Or maybe, you&#39;d like to have custom aliases defined,
but not necessarily &lt;a href=&#34;https://github.com/ahmetb/kubectl-aliases&#34;&gt;generate hundreds of them manually&lt;/a&gt;?
Look no further. &lt;a href=&#34;https://git.k8s.io/community/sig-cli/&#34;&gt;SIG-CLI&lt;/a&gt;
has been working hard to add &lt;a href=&#34;https://kep.k8s.io/3104&#34;&gt;user preferences to kubectl&lt;/a&gt;,
and we are happy to announce that this functionality is reaching beta as part
of the Kubernetes v1.34 release.&lt;/p&gt;
&lt;h2 id=&#34;how-it-works&#34;&gt;How it works&lt;/h2&gt;
&lt;p&gt;A full description of this functionality is available &lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/reference/kubectl/kuberc/&#34;&gt;in our official documentation&lt;/a&gt;,
but this blog post will answer both of the questions from the beginning of this
article.&lt;/p&gt;
&lt;p&gt;Before we dive into details, let&#39;s quickly cover what the user preferences file
looks like and where to place it. By default, &lt;code&gt;kubectl&lt;/code&gt; looks for a &lt;code&gt;kuberc&lt;/code&gt;
file in your default &lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/concepts/configuration/organize-cluster-access-kubeconfig/&#34;&gt;kubeconfig&lt;/a&gt;
directory, which is &lt;code&gt;$HOME/.kube&lt;/code&gt;. Alternatively, you can specify a different location
using the &lt;code&gt;--kuberc&lt;/code&gt; option or the &lt;code&gt;KUBERC&lt;/code&gt; environment variable.&lt;/p&gt;
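&lt;p&gt;As a quick sketch (the file path below is just an example), pointing &lt;code&gt;kubectl&lt;/code&gt; at a preferences file outside the default location could look like this:&lt;/p&gt;

```shell
# Default: kubectl reads $HOME/.kube/kuberc, no extra flags needed.
# To use a preferences file elsewhere (example path), pass the option:
kubectl get pods --kuberc=/path/to/my-kuberc

# ...or export the environment variable for the whole shell session:
export KUBERC=/path/to/my-kuberc
kubectl get pods
```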
&lt;p&gt;Just like every Kubernetes manifest, a &lt;code&gt;kuberc&lt;/code&gt; file starts with an &lt;code&gt;apiVersion&lt;/code&gt;
and a &lt;code&gt;kind&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;kubectl.config.k8s.io/v1beta1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Preference&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# the user preferences will follow here&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h3 id=&#34;defaults&#34;&gt;Defaults&lt;/h3&gt;
&lt;p&gt;Let&#39;s start by setting default values for &lt;code&gt;kubectl&lt;/code&gt; command options. Our goal
is to always use interactive delete, which means we want the &lt;code&gt;--interactive&lt;/code&gt;
option for &lt;code&gt;kubectl delete&lt;/code&gt; to always be set to &lt;code&gt;true&lt;/code&gt;. This can be achieved
with the following addition to our &lt;code&gt;kuberc&lt;/code&gt; file:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;defaults&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;command&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;delete&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;options&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;interactive&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;default&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;true&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;In the above example, I&#39;m introducing the &lt;code&gt;defaults&lt;/code&gt; section, which allows users to
define default values for &lt;code&gt;kubectl&lt;/code&gt; options. In this case, we&#39;re setting the
interactive option for &lt;code&gt;kubectl delete&lt;/code&gt; to be &lt;code&gt;true&lt;/code&gt; by default. This default
can be overridden if a user explicitly provides a different value such as
&lt;code&gt;kubectl delete --interactive=false&lt;/code&gt;, in which case the explicit option takes
precedence.&lt;/p&gt;
&lt;p&gt;Another highly encouraged default from SIG-CLI is using &lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/reference/using-api/server-side-apply/&#34;&gt;Server-Side Apply&lt;/a&gt;.
To do so, you can add the following snippet to your preferences:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# continuing defaults section&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;command&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;apply&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;options&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;server-side&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;default&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;true&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h3 id=&#34;aliases&#34;&gt;Aliases&lt;/h3&gt;
&lt;p&gt;The ability to define aliases allows us to save precious seconds when typing
commands. I bet you already have one defined for &lt;code&gt;kubectl&lt;/code&gt;, because
typing seven letters takes longer than just pressing &lt;code&gt;k&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;For this reason, the ability to define aliases was a must-have when we decided
to implement user preferences, alongside defaulting. To define an alias for any
of the built-in commands, expand your &lt;code&gt;kuberc&lt;/code&gt; file with the following addition:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;aliases&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;gns&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;command&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;get&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;prependArgs&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;   &lt;/span&gt;- namespace&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;options&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;   &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;output&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;     &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;default&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;json&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;There&#39;s a lot going on above, so let me break this down. First, we&#39;re introducing
a new section: &lt;code&gt;aliases&lt;/code&gt;. Here, we&#39;re defining a new alias &lt;code&gt;gns&lt;/code&gt;, which is mapped
to the &lt;code&gt;get&lt;/code&gt; command. Next, we&#39;re defining an argument (the &lt;code&gt;namespace&lt;/code&gt; resource)
that will be inserted right after the command name. Additionally, we&#39;re setting the
&lt;code&gt;--output=json&lt;/code&gt; option for this alias. The structure of the &lt;code&gt;options&lt;/code&gt; block is identical
to the one in the &lt;code&gt;defaults&lt;/code&gt; section.&lt;/p&gt;
&lt;p&gt;You probably noticed that we&#39;ve introduced a mechanism for prepending arguments,
and you might wonder if there is a complementary setting for appending them (in
other words, adding to the end of the command, after user-provided arguments).
This can be achieved through the &lt;code&gt;appendArgs&lt;/code&gt; block, which is presented below:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# continuing aliases section&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;runx&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;command&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;run&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;options&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;image&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;default&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;busybox&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;namespace&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;default&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;test-ns&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;appendArgs&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;- --&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;- custom-arg&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Here, we&#39;re introducing another alias: &lt;code&gt;runx&lt;/code&gt;, which invokes the &lt;code&gt;kubectl run&lt;/code&gt; command,
passing the &lt;code&gt;--image&lt;/code&gt; and &lt;code&gt;--namespace&lt;/code&gt; options with predefined values, and also
appending &lt;code&gt;--&lt;/code&gt; and &lt;code&gt;custom-arg&lt;/code&gt; at the end of the invocation.&lt;/p&gt;
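&lt;p&gt;Putting the snippets from this post together, the resulting &lt;code&gt;kuberc&lt;/code&gt; file would look like this:&lt;/p&gt;

```yaml
apiVersion: kubectl.config.k8s.io/v1beta1
kind: Preference
defaults:
- command: delete
  options:
  - name: interactive
    default: "true"
- command: apply
  options:
  - name: server-side
    default: "true"
aliases:
- name: gns
  command: get
  prependArgs:
  - namespace
  options:
  - name: output
    default: json
- name: runx
  command: run
  options:
  - name: image
    default: busybox
  - name: namespace
    default: test-ns
  appendArgs:
  - --
  - custom-arg
```

&lt;p&gt;With this file in place, &lt;code&gt;kubectl gns&lt;/code&gt; lists namespaces as JSON, and &lt;code&gt;kubectl runx my-pod&lt;/code&gt; (the pod name here is just an example) behaves like &lt;code&gt;kubectl run my-pod --image=busybox --namespace=test-ns -- custom-arg&lt;/code&gt;.&lt;/p&gt;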
&lt;h2 id=&#34;debugging&#34;&gt;Debugging&lt;/h2&gt;
&lt;p&gt;We hope that &lt;code&gt;kubectl&lt;/code&gt; user preferences will open up new possibilities for our users.
Whenever you&#39;re in doubt, feel free to run &lt;code&gt;kubectl&lt;/code&gt; with increased verbosity.
At &lt;code&gt;-v=5&lt;/code&gt;, you should get all the possible debugging information from this feature,
which will be crucial when reporting issues.&lt;/p&gt;
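&lt;p&gt;For example, any command can be run with increased verbosity (the resource here is arbitrary):&lt;/p&gt;

```shell
# -v=5 prints, among other things, how kubectl processed the kuberc file
kubectl get pods -v=5
```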
&lt;p&gt;To learn more, I encourage you to read through &lt;a href=&#34;https://deploy-preview-54621--kubernetes-io-main-staging.netlify.app/docs/reference/kubectl/kuberc/&#34;&gt;our official documentation&lt;/a&gt;
and the &lt;a href=&#34;https://git.k8s.io/enhancements/keps/sig-cli/3104-introduce-kuberc/README.md&#34;&gt;enhancement proposal&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;get-involved&#34;&gt;Get involved&lt;/h2&gt;
&lt;p&gt;The kubectl user preferences feature has reached beta, and we are very interested
in your feedback. We&#39;d love to hear what you like about it and what problems
you&#39;d like to see it solve. Feel free to join &lt;a href=&#34;https://kubernetes.slack.com/archives/C2GL57FJ4&#34;&gt;SIG-CLI slack channel&lt;/a&gt;,
or open an issue against &lt;a href=&#34;https://git.k8s.io/kubectl/&#34;&gt;kubectl repository&lt;/a&gt;.
You can also join us at our &lt;a href=&#34;https://git.k8s.io/community/sig-cli/#meetings&#34;&gt;community meetings&lt;/a&gt;,
which happen every other Wednesday, and share your stories with us.&lt;/p&gt;

      </description>
    </item>
    
  </channel>
</rss>
