OV-DEIM: Real-time DETR-Style Open-Vocabulary Object Detection with GridSynthetic Augmentation

Intellindust AI Lab
Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen, China
College of Computer Science and Software Engineering, Shenzhen University, China
School of Information Technology, Carleton University, Canada
Institute for Research in Biomedicine, Bellinzona, Switzerland
OV-DEIM Architecture

Figure: Architecture of the OV-DEIM framework.

Abstract

Real-time open-vocabulary object detection (OVOD) is essential for practical deployment in dynamic environments, where models must recognize a large and evolving set of categories under strict latency constraints. Current real-time OVOD methods are predominantly built upon YOLO-style models, while real-time DETR-based methods still lag behind in inference latency, model size, and overall performance. In this work, we present OV-DEIM, an end-to-end DETR-style open-vocabulary detector built upon the recent DEIMv2 framework, with integrated vision–language modeling for efficient open-vocabulary inference. We further introduce a simple query supplement strategy that improves Fixed AP without compromising inference speed. Beyond architectural improvements, we propose GridSynthetic, a simple yet effective data augmentation strategy that composes multiple training samples into structured image grids. By exposing the model to richer object co-occurrence patterns and spatial layouts within a single forward pass, GridSynthetic mitigates the negative impact of noisy localization signals on the classification loss and improves semantic discrimination, particularly for rare categories. Extensive experiments demonstrate that OV-DEIM achieves state-of-the-art performance on open-vocabulary detection benchmarks, delivering superior efficiency and notable improvements on challenging rare categories.

Method Overview

OV-DEIM extends the real-time closed-set detector DEIMv2 to the open-vocabulary setting by incorporating language modeling and vision–text alignment into the detection pipeline. The framework consists of six components: (1) an image backbone for visual feature extraction, (2) a hybrid encoder for multi-scale feature aggregation, (3) a text encoder for language representation, (4) a text-aware query selection module that identifies high-quality object queries conditioned on language, (5) a transformer decoder that refines the selected queries for final prediction, and (6) a prediction head that outputs vision–text alignment scores and bounding box coordinates.
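The six-stage pipeline above can be sketched end-to-end. The module names, feature shapes, and query count below are illustrative placeholders, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the six OV-DEIM stages (names and dimensions
# are illustrative, not the paper's actual modules).
def image_backbone(image):                 # (1) visual feature extraction
    return rng.standard_normal((100, 256))           # 100 tokens, dim 256

def hybrid_encoder(feats):                 # (2) multi-scale aggregation
    return feats + 0.1 * rng.standard_normal(feats.shape)

def text_encoder(prompts):                 # (3) language representation
    return rng.standard_normal((len(prompts), 256))  # one embedding per prompt

def select_queries(feats, text_emb, k=10): # (4) text-aware query selection
    scores = (feats @ text_emb.T).max(axis=1)        # best text match per token
    return feats[np.argsort(scores)[-k:]]            # keep top-k tokens as queries

def decoder(queries, feats):               # (5) iterative query refinement
    return queries + 0.1 * feats.mean(axis=0)

def head(queries, text_emb):               # (6) alignment scores + boxes
    cls = queries @ text_emb.T                       # vision–text similarity
    boxes = rng.uniform(size=(len(queries), 4))      # cxcywh in [0, 1]
    return cls, boxes

def ov_deim_forward(image, prompts):
    feats = hybrid_encoder(image_backbone(image))
    text_emb = text_encoder(prompts)
    queries = decoder(select_queries(feats, text_emb), feats)
    return head(queries, text_emb)

cls, boxes = ov_deim_forward(None, ["cat", "dog", "zebra"])
print(cls.shape, boxes.shape)  # (10, 3) (10, 4)
```

The key open-vocabulary ingredient is that classification logits are similarities between query embeddings and text embeddings, so the class set is defined at inference time by the prompts rather than baked into the head.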

Key Contributions

  • DETR-Style Architecture: OV-DEIM leverages one-to-one matching to eliminate NMS post-processing, enabling fast and stable inference that scales favorably with vocabulary size.
  • Query Supplement Trick: A lightweight strategy to increase detection candidates for improved Fixed AP evaluation without additional inference cost.
  • GridSynthetic Augmentation: A novel data augmentation strategy that constructs synthetic training images through structured grid composition, reducing localization difficulty and enhancing semantic alignment.
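As a concrete illustration of the GridSynthetic idea — composing multiple training samples into a structured image grid — here is a minimal 2×2 sketch. Equal-sized tiles and pixel-coordinate boxes are simplifying assumptions; the paper's actual recipe may differ:

```python
import numpy as np

def grid_compose(samples, rows=2, cols=2):
    """Tile rows*cols (image, boxes) samples into one grid image.

    Each sample is an (H, W, 3) uint8 array plus a list of [x1, y1, x2, y2]
    boxes in pixel coordinates. All tiles sharing the same H and W is an
    assumption made here for simplicity.
    """
    h, w = samples[0][0].shape[:2]
    canvas = np.zeros((rows * h, cols * w, 3), dtype=np.uint8)
    all_boxes = []
    for idx, (img, boxes) in enumerate(samples[: rows * cols]):
        r, c = divmod(idx, cols)                       # tile's cell in the grid
        canvas[r * h:(r + 1) * h, c * w:(c + 1) * w] = img
        for x1, y1, x2, y2 in boxes:
            # shift each box into its tile's cell on the grid canvas
            all_boxes.append([x1 + c * w, y1 + r * h, x2 + c * w, y2 + r * h])
    return canvas, all_boxes

img = np.full((64, 64, 3), 128, dtype=np.uint8)
samples = [(img, [[4, 4, 20, 20]])] * 4
grid, boxes = grid_compose(samples)
print(grid.shape, boxes[3])  # (128, 128, 3) [68, 68, 84, 84]
```

One forward pass over the composed image then sees four samples' worth of objects and layouts at once, which is the richer co-occurrence signal the abstract refers to.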

Zero-Shot Detection Results

GridSynthetic Effectiveness

Effectiveness of GridSynthetic. The EMA-smoothed GIoU loss curves show that GridSynthetic consistently achieves the lowest loss throughout training, easing localization and yielding more effective classification supervision.
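The EMA smoothing referenced above is the standard exponential moving average applied to the raw per-iteration losses; a minimal version (the smoothing factor here is illustrative, not the value used for the figure):

```python
def ema_smooth(values, alpha=0.98):
    """Smooth a loss curve with an exponential moving average.

    alpha close to 1 gives heavier smoothing; the curve is initialized
    at the first raw value.
    """
    out, ema = [], values[0]
    for v in values:
        ema = alpha * ema + (1 - alpha) * v  # blend history with the new value
        out.append(ema)
    return out

raw = [1.0, 0.8, 0.6, 0.4]
smoothed = ema_smooth(raw, alpha=0.5)
print(smoothed)
```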

Zero-Shot Evaluation on LVIS

OV-DEIM surpasses previous state-of-the-art methods in both zero-shot performance and inference speed. OV-DEIM-S/M/L outperform YOLOEv8-S/M/L by 2.0/0.7/0.4 AP, while achieving comparable Fixed AP, together with 8.9×/6.4×/5.4× inference speedups on an NVIDIA T4 GPU.

Method Backbone Params Pre-trained Data FPS AP/AP(Fixed) APr/APr(Fixed) APc/APc(Fixed) APf/APf(Fixed)
GLIP-T Swin-T 232M OG - -/24.9 -/17.7 -/19.5 -/31.0
GLIPv2-T Swin-T 232M OG,Cap4M - -/29.0 -/- -/- -/-
GDINO-T Swin-T 172M OG - -/25.6 -/14.4 -/19.6 -/32.2
G1.5-Edge EfficientViT-L1 - G-20M - -/33.5 -/28.0 -/34.3 -/33.9
YOLO-Worldv2-S YOLOv8-S 13M OG - -/24.4 -/17.1 -/22.5 -/27.3
YOLO-Worldv2-M YOLOv8-M 29M OG - -/32.4 -/28.4 -/29.6 -/35.5
YOLO-Worldv2-L YOLOv8-L 48M OG - -/35.5 -/25.6 -/34.6 -/38.1
YOLOEv8-S YOLOv8-S 12M OG 216/18* 25.7/27.9 19.0/22.3 25.9/27.8 26.7/29.0
YOLOEv8-M YOLOv8-M 27M OG 145/17* 29.9/32.6 23.6/26.9 29.2/31.9 31.7/34.4
YOLOEv8-L YOLOv8-L 45M OG 103/17* 33.3/35.9 30.8/33.2 32.2/34.8 34.6/37.3
YOLOEv11-S YOLOv11-S 10M OG 216/18* 25.2/27.5 19.3/21.4 24.4/26.8 26.7/29.3
YOLOEv11-M YOLOv11-M 21M OG 151/18* 30.5/33.0 22.4/26.9 30.4/32.5 32.1/34.5
YOLOEv11-L YOLOv11-L 26M OG 122/17* 32.4/35.2 25.6/29.1 31.9/35.0 34.1/36.5
OV-DEIM-S ViT-T 11M OG 161 27.7/29.6 23.6/25.2 28.1/30.2 28.0/30.0
OV-DEIM-M ViT-T+ 20M OG 109 30.6/32.6 25.3/26.9 30.2/31.5 31.9/34.1
OV-DEIM-L ViT-S 36M OG 91 33.7/35.9 34.3/36.8 33.4/35.5 34.0/36.0

Inference speed (FPS) is measured on an NVIDIA T4 GPU using TensorRT. * FPS includes NMS post-processing overhead. OG = Objects365v1 + GoldG, Cap4M = 4M image–text pairs collected by GLIP, G-20M = Grounding-20M. Evaluated on the LVIS minival split.
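The quoted 8.9×/6.4×/5.4× speedups can be cross-checked directly from the FPS columns, taking OV-DEIM's TensorRT FPS against the starred YOLOEv8 FPS that include NMS overhead:

```python
# FPS numbers copied from the LVIS table above.
ov_deim_fps = {"S": 161, "M": 109, "L": 91}
yoloev8_fps_with_nms = {"S": 18, "M": 17, "L": 17}  # the * entries

for scale in "SML":
    speedup = ov_deim_fps[scale] / yoloev8_fps_with_nms[scale]
    print(f"{scale}: {speedup:.1f}x")
# S: 8.9x, M: 6.4x, L: 5.4x — matching the speedups quoted above
```

The gap comes largely from the NMS-free design: YOLOE's raw FPS (103–216) collapses to 17–18 once NMS post-processing is included, while OV-DEIM's one-to-one matching needs no such step.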

Zero-Shot Evaluation on COCO

OV-DEIM demonstrates strong zero-shot transfer capability on COCO. Across all model scales, it consistently outperforms YOLO-World in zero-shot transfer and even surpasses the linear-probing results of YOLOE.

Method Backbone Params AP AP50 AP75
Zero-shot transfer
YOLO-Worldv1-S YOLOv8-S 13M 37.6 52.3 40.7
YOLO-Worldv1-M YOLOv8-M 29M 42.8 58.3 46.4
YOLO-Worldv1-L YOLOv8-L 48M 44.4 59.8 48.3
Linear probing
YOLOEv8-S YOLOv8-S 12M 35.6 51.5 38.9
YOLOEv8-M YOLOv8-M 27M 42.2 59.2 46.3
YOLOEv8-L YOLOv8-L 45M 45.4 63.3 50.0
YOLOEv11-S YOLOv11-S 10M 37.0 52.9 40.4
YOLOEv11-M YOLOv11-M 21M 43.1 60.6 47.4
YOLOEv11-L YOLOv11-L 26M 45.1 62.8 49.5
Zero-shot transfer
OV-DEIM-S ViT-T 11M 40.8 56.3 44.4
OV-DEIM-M ViT-T+ 20M 43.3 60.2 48.0
OV-DEIM-L ViT-S 35M 45.9 62.3 49.9

Evaluated on COCO val2017 split with standard COCO AP metrics.

BibTeX

@misc{wang2026ovdeim,
      title={OV-DEIM: Real-time DETR-Style Open-Vocabulary Object Detection with GridSynthetic Augmentation}, 
      author={Leilei Wang and Longfei Liu and Xi Shen and Xuanlong Yu and Ying Tiffany He and Fei Richard Yu and Yingyi Chen},
      year={2026},
      eprint={2603.07022},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2603.07022}, 
}