Tagged "throughput-optimization"
- Prefill Is Compute-Bound, Decode Is Memory-Bound: Optimizing GPU Utilization for LLM Inference
- Qwen 3.5 27B Achieves 1.1M Tokens/Second on B200 GPUs with Optimized vLLM Config
- P-EAGLE: Faster LLM Inference with Parallel Speculative Decoding in vLLM
- DeepSeek Releases DualPath: Addressing Storage Bandwidth Bottlenecks in Agentic Inference