DiffusionEngine: Diffusion Model is Scalable Data Engine for Object Detection

¹School of Computer Science and Engineering, Sun Yat-sen University; ²ByteDance Inc.
*Equal Contribution, Corresponding Author
DiffusionEngine generates high-quality detection-oriented training pairs in a single stage.


Data is the cornerstone of deep learning. This paper reveals that the recently-developed Diffusion Model is a scalable data engine for object detection.

Existing methods for scaling up detection-oriented data often rely on manual collection or generative models to obtain target images, followed by data augmentation and labeling to produce training pairs; these pipelines are costly, complex, or lacking in diversity. To address these issues, we present DiffusionEngine (DE), a data scaling-up engine that provides high-quality detection-oriented training pairs in a single stage. DE consists of a pre-trained diffusion model and an effective Detection-Adapter, which together generate scalable, diverse, and generalizable detection data in a plug-and-play manner. The Detection-Adapter is trained to align the implicit semantic and location knowledge in off-the-shelf diffusion models with detection-aware signals, yielding better bounding-box predictions. Additionally, we contribute two datasets, COCO-DE and VOC-DE, that scale up existing detection benchmarks to facilitate follow-up research. Extensive experiments demonstrate that data scaling-up via DE achieves significant improvements in diverse scenarios, including various detection algorithms, self-supervised pre-training, and data-sparse, label-scarce, cross-domain, and semi-supervised learning. For example, when DE with a DINO-based Detection-Adapter is used to scale up data, mAP improves by 3.1% on COCO, 7.6% on VOC, and 11.5% on Clipart.


The upper figure shows the training procedure of DiffusionEngine, in which the latent diffusion model (LDM) is kept frozen. Each image undergoes a single noise-adding step followed by a single denoising step, which simulates the last image generation step in the LDM. The Detection-Adapter learns to use the pyramid features extracted from the denoising U-Net decoder for detection. The lower figure shows how the trained DiffusionEngine is used for data scaling-up. A reference image undergoes a random number \(k\) of noise-adding steps, resulting in a noisy sample \(z_k\), which is then denoised with the text embedding as guidance. Finally, low-confidence detections are filtered out.
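The two ends of the scaling-up procedure, drawing a random number of noise-adding steps \(k\) to obtain \(z_k\) and filtering out low-confidence detections, can be sketched roughly as follows. This is a minimal illustration, not the released implementation: it assumes the standard DDPM closed-form noising with a linear beta schedule, a latent of shape 4×64×64, and a score threshold of 0.3, none of which are stated on this page. The function names (`linear_alpha_bar`, `add_noise`, `filter_detections`) are hypothetical, and the text-guided denoising and the Detection-Adapter itself are omitted.

```python
import numpy as np


def linear_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative product of (1 - beta_t) for a linear beta schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def add_noise(z0, k, alpha_bar, rng):
    """Jump to step k in closed form: z_k = sqrt(a_k) * z0 + sqrt(1 - a_k) * eps."""
    eps = rng.standard_normal(z0.shape)
    a_k = alpha_bar[k - 1]  # steps are 1-indexed here
    return np.sqrt(a_k) * z0 + np.sqrt(1.0 - a_k) * eps


def filter_detections(boxes, scores, labels, score_thresh=0.3):
    """Keep only detections whose confidence reaches the (assumed) threshold."""
    keep = scores >= score_thresh
    return boxes[keep], scores[keep], labels[keep]


rng = np.random.default_rng(0)
alpha_bar = linear_alpha_bar()

# Stand-in for the encoded reference image (assumed latent shape).
z0 = rng.standard_normal((4, 64, 64))

# Random number of noise-adding steps k, then the noisy sample z_k.
k = int(rng.integers(1, len(alpha_bar) + 1))
z_k = add_noise(z0, k, alpha_bar, rng)

# Text-guided denoising of z_k and box prediction by the adapter would
# happen here; the detections below are made-up placeholder values.
boxes = np.array([[10, 20, 110, 220], [5, 5, 50, 50], [30, 40, 90, 160]], dtype=float)
scores = np.array([0.92, 0.12, 0.55])
labels = np.array([1, 3, 1])

kept_boxes, kept_scores, kept_labels = filter_detections(boxes, scores, labels)
```

Larger values of \(k\) push \(z_k\) further from the reference image, which is one plausible way to trade fidelity for diversity; the threshold controls how aggressively noisy labels are discarded.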


Data Scaling-Up with DiffusionEngine

DiffusionEngine is scalable, produces diverse data, and generalizes robustly across domains.

  • (a) Data scaling-up for photo.

  • (b) Data scaling-up for clipart.

  • (c) Data scaling-up across domains.



      title={DiffusionEngine: Diffusion Model is Scalable Data Engine for Object Detection}, 
      author={Manlin Zhang and Jie Wu and Yuxi Ren and Ming Li and Jie Qin and Xuefeng Xiao and Wei Liu and Rui Wang and Min Zheng and Andy J. Ma},