GKE Inference Quickstart (GIQ) now offers recommendations for distributed AI inference.
This enables you to deploy optimized, full configurations for advanced models, such as the Qwen and gpt-oss model families, on NVIDIA GPUs and Cloud TPUs.
This release also integrates GKE Inference Gateway with llm-d inference scheduling. You can select optimized configurations for workloads such as Advanced Customer Support, Code Completion, and Deep Research, tuning your infrastructure to the specific latency and throughput requirements of each application.
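As a rough sketch, GIQ recommendations can be explored and exported from the `gcloud container ai profiles` command group. The exact command names, flags, release track, and model identifier below are assumptions; verify them against the current gcloud reference before use.

```shell
# List models for which GKE Inference Quickstart has benchmarked profiles
# (command group and flags are assumptions; see `gcloud container ai profiles --help`).
gcloud container ai profiles models list

# Show accelerator and configuration recommendations for a specific model
# (the model identifier here is illustrative).
gcloud container ai profiles list \
    --model=openai/gpt-oss-20b

# Generate an optimized Kubernetes manifest for the chosen model and
# accelerator, then deploy it to your GKE cluster.
gcloud container ai profiles manifests create \
    --model=openai/gpt-oss-20b \
    --accelerator-type=nvidia-l4 > manifest.yaml
kubectl apply -f manifest.yaml
```

This keeps the recommendation step declarative: the generated manifest captures the full serving configuration, so it can be reviewed and version-controlled before being applied.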
For more information, see Analyze model serving performance and costs with Inference Quickstart.