
Qdrant Query Latency Optimization Guide

v20260420
qdrant-minimize-latency
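This guide explains how to optimize Qdrant query latency. It covers in-depth performance-tuning techniques, focusing on memory management, segment count tuning, and HNSW parameter optimization. It describes how strategies such as quantization, vertical scaling, and local NVMe storage keep search latency consistently low, and it warns about common mistakes to avoid when optimizing for performance.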
Overview

Scaling for Query Latency

The latency of a single query is determined by the slowest component in its execution path. Latency is sometimes correlated with throughput, but not always: the two are tuned in opposite directions, so settings that help one often hurt the other.

Latency optimization aims to let a single query use as many resources (for example, CPU cores) as possible at once. Throughput optimization aims to minimize per-query resource usage so that more queries can run in parallel.

Performance Tuning for Lower Latency

  • Increase the segment count to match the number of CPU cores (for example, default_segment_number: 16 on a 16-core node); see Minimizing latency
  • Keep quantized vectors (always_ram=true) and the HNSW index (hnsw_config on_disk=false) in RAM
  • Reduce hnsw_ef at query time to trade some recall for speed; see Search params
  • Use local NVMe storage and avoid network-attached storage
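The settings above can be expressed as Qdrant REST request bodies. This minimal sketch builds them with the Python standard library so they can be inspected without a running server. Several details are illustrative assumptions: the 768-dimension Cosine vector config, int8 scalar quantization with quantile 0.99, the hnsw_ef value of 64, and the helper function names. Check the field names against the REST API reference for your Qdrant version before sending them with PUT /collections/{name} and POST /collections/{name}/points/query.

```python
import json
import os
from typing import List, Optional


def latency_collection_body(dim: int, cores: Optional[int] = None) -> dict:
    """Hypothetical helper: body for PUT /collections/{name}, tuned for latency."""
    # One segment per CPU core, so a single query can search all segments in parallel.
    segments = cores or os.cpu_count() or 1
    return {
        "vectors": {"size": dim, "distance": "Cosine"},  # assumed example config
        # Keep the HNSW graph in RAM instead of on disk.
        "hnsw_config": {"on_disk": False},
        "optimizers_config": {"default_segment_number": segments},
        # int8 scalar quantization, with the quantized vectors pinned in RAM.
        "quantization_config": {
            "scalar": {"type": "int8", "quantile": 0.99, "always_ram": True}
        },
    }


def latency_query_body(vector: List[float], hnsw_ef: int = 64, limit: int = 10) -> dict:
    """Hypothetical helper: body for POST /collections/{name}/points/query."""
    # A smaller hnsw_ef visits fewer graph candidates: lower latency, lower recall.
    return {"query": vector, "params": {"hnsw_ef": hnsw_ef}, "limit": limit}


if __name__ == "__main__":
    print(json.dumps(latency_collection_body(dim=768, cores=16), indent=2))
    print(json.dumps(latency_query_body([0.1, 0.2, 0.3, 0.4]), indent=2))
```

The hnsw_ef value is sent with each request. You can therefore lower it only for latency-critical calls and leave other queries at the collection default.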

Memory Pressure and Latency

RAM is the most critical resource for latency. If the working set exceeds available RAM, OS page-cache eviction causes severe, sustained latency degradation.
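A rough estimate is enough to judge whether the working set fits in RAM. This sketch uses the rule of thumb from Qdrant's capacity-planning documentation: number of vectors × dimension × 4 bytes × 1.5, where the 1.5 factor approximates metadata, indexes, and temporary segments created during optimization. Treat that factor, the example collection size, and the helper names as assumptions. The 80% and 90% cut-offs are the ones used elsewhere in this guide.

```python
def estimate_working_set_bytes(num_vectors: int, dim: int,
                               bytes_per_dim: float = 4.0,
                               overhead: float = 1.5) -> float:
    """Rough RAM estimate: vectors x dimension x bytes per dimension x overhead."""
    return num_vectors * dim * bytes_per_dim * overhead


def ram_pressure(working_set_bytes: float, ram_bytes: float) -> str:
    """Classify a working set against the 80% and 90% thresholds in this guide."""
    ratio = working_set_bytes / ram_bytes
    if ratio > 0.9:
        return "unsafe"    # above 90%: expect cache eviction and long latency incidents
    if ratio > 0.8:
        return "scale_up"  # above 80%: add RAM or quantize before it gets worse
    return "ok"


GIB = 1024 ** 3
ws = estimate_working_set_bytes(num_vectors=10_000_000, dim=768)  # assumed example
for ram_gib in (64, 48):
    print(f"{ws / GIB:.1f} GiB on {ram_gib} GiB RAM -> {ram_pressure(ws, ram_gib * GIB)}")
```

With these assumed numbers, the estimate is about 42.9 GiB. The 64 GiB node is classified "ok" and the 48 GiB node "scale_up".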

  • Scale RAM vertically first. This is critical once the working set exceeds 80% of available RAM.
  • Use quantization: scalar (int8, 4x reduction) or binary (1 bit per dimension, up to 32x reduction versus float32); see Quantization
  • Move payload indexes to disk if filtering is infrequent; see On-disk payload index
  • Set optimizer_cpu_budget to limit the number of CPUs used by background optimization
  • Schedule indexing: set a high indexing_threshold during peak hours so that index building is deferred
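The quantization savings follow from the number of bytes stored per dimension. This sketch compares the in-RAM size of the vectors alone for float32 originals, int8 scalar quantization, and 1-bit binary quantization. It ignores the HNSW graph, payload, and storage padding. It also builds hypothetical peak and off-peak bodies for PATCH /collections/{name} that raise indexing_threshold during peak hours. The threshold values are placeholders, not recommendations. In recent Qdrant versions the field is interpreted in kilobytes, but verify this for your version.

```python
# Approximate bytes stored per vector dimension for each encoding.
BYTES_PER_DIM = {"float32": 4.0, "scalar_int8": 1.0, "binary_1bit": 1 / 8}


def vectors_ram_bytes(num_vectors: int, dim: int, encoding: str) -> float:
    """RAM for the vectors alone (ignores the HNSW graph, payload and padding)."""
    return num_vectors * dim * BYTES_PER_DIM[encoding]


def reduction_factor(encoding: str) -> float:
    """How many times smaller an encoding is than float32."""
    return BYTES_PER_DIM["float32"] / BYTES_PER_DIM[encoding]


def indexing_patch_body(peak_hours: bool,
                        peak_threshold: int = 1_000_000,
                        off_peak_threshold: int = 10_000) -> dict:
    """Hypothetical helper: body for PATCH /collections/{name}; placeholder values."""
    # A high threshold during peak hours defers HNSW index building.
    threshold = peak_threshold if peak_hours else off_peak_threshold
    return {"optimizers_config": {"indexing_threshold": threshold}}


for enc in BYTES_PER_DIM:
    gib = vectors_ram_bytes(10_000_000, 768, enc) / 1024 ** 3
    print(f"{enc}: {gib:.2f} GiB ({reduction_factor(enc):.0f}x smaller than float32)")
print(indexing_patch_body(peak_hours=True))
```

Raising the threshold has a cost. Segments smaller than indexing_threshold keep a plain (unindexed) layout, so data written during peak hours is searched by brute force. This lasts until the threshold is lowered again or the segment grows past it.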

Vertical Scaling for Latency

More RAM and faster CPU directly reduce latency. See Vertical Scaling for node sizing guidelines.

What NOT to Do

  • Do not expect to optimize latency and throughput simultaneously on the same node
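  • Do not use a small number of large segments for latency-sensitive workloads (segments are searched in parallel, so fewer, larger segments mean less parallelism and a longer search per segment)
  • Do not run above 90% RAM usage (cache eviction causes severe latency degradation that can last for days)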
  • Do not ignore optimizer status during performance debugging
  • Do not scale down RAM without load testing (cache eviction causes days-long latency incidents)
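Checking optimizer status during performance debugging can be scripted. This minimal sketch inspects the result object returned by GET /collections/{name} and reads its status, optimizer_status, and segments_count fields. It flags three conditions: a collection that is still optimizing ("yellow"), an optimizer error, or fewer segments than CPU cores. The function name and the segment-versus-core heuristic are hypothetical, and the sample JSON is made up for illustration.

```python
import json
from typing import List


def diagnose_collection(info: dict, cpu_cores: int) -> List[str]:
    """Hypothetical helper: flag latency risks in a collection info result."""
    findings = []
    # "green" means ready; "yellow" means optimizations are still running.
    if info.get("status") != "green":
        findings.append(f"collection status is {info.get('status')!r}")
    # optimizer_status is the string "ok" or an object such as {"error": "..."}.
    if info.get("optimizer_status") != "ok":
        findings.append(f"optimizer reported: {info.get('optimizer_status')!r}")
    # Fewer segments than cores limits how much of the node one query can use.
    segments = info.get("segments_count", 0)
    if segments < cpu_cores:
        findings.append(f"only {segments} segments for {cpu_cores} cores")
    return findings


sample = json.loads('{"status": "yellow", "optimizer_status": "ok", "segments_count": 2}')
for finding in diagnose_collection(sample, cpu_cores=16):
    print(finding)
```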
Info
Category: Hardware Engineering
Size: 2.51 KB
Updated: 2026-04-24