Automated Labeling of Object Detection Datasets Using GroundingDino


A practical guide to tagging object detection datasets with the GroundingDino algorithm. Code included.

Annotations by the author using GroundingDino with the ‘ripened tomato’ prompt. Image by Markus Spiske.


Until recently, object detection models performed a specific task, like detecting penguins in an image. However, recent advances in deep learning have given rise to foundation models. These are large models trained on massive datasets in a general manner, making them adaptable to a wide range of tasks. Examples of such models include CLIP for image classification, SAM for segmentation, and GroundingDino for object detection. Foundation models are generally large and computationally demanding. When resources are not a constraint, they can be used directly for zero-shot inference. Otherwise, they can be used to tag a dataset for training a smaller, more specific model in a process known as distillation.

In this guide, we’ll learn how to use the GroundingDino model for zero-shot inference on an image of tomatoes. We’ll explore the algorithm’s capabilities and use it to tag an entire tomato dataset. The resulting dataset can then be used to train a downstream target model such as YOLO.



GroundingDino is a state-of-the-art (SOTA) algorithm developed by IDEA-Research in 2023 [1]. It detects objects in images using text prompts. The name “GroundingDino” combines “grounding” (a process that links vision and language understanding in AI systems) and the transformer-based detector “DINO” [2]. The algorithm is a zero-shot object detector, which means it can identify objects from categories it was not specifically trained on, without needing to see any examples (shots).


  1. The model takes pairs of an image and a text description as inputs.
  2. Image features are extracted with an image backbone such as Swin Transformer, and text features with a text backbone like BERT.
  3. To fuse the image and text modalities into a single representation, both types of features are fed into the Feature Enhancer module.
  4. Next, the ‘Language-guided Query Selection’ module selects the features most relevant to the input text to use as decoder queries.
  5. These queries are then fed into a decoder to refine the prediction of object detection boxes that best align with the text information.
  6. The model outputs 900 object bounding boxes and their similarity scores to the input words. The boxes with similarity scores above the box_threshold are kept, and the words whose similarities are higher than the text_threshold become the predicted labels.
Image by Xiangyu et al., 2023 [3]
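
Step 6 can be illustrated with a small, self-contained sketch. It is not the repository's code: the function name, the toy similarity values, and the choice of scoring each box by its best-matching word are assumptions made for illustration only.

```python
def select_predictions(similarities, tokens, box_threshold, text_threshold):
    """Keep boxes whose best word similarity exceeds box_threshold and label
    each kept box with the words whose similarity exceeds text_threshold."""
    kept = []
    for box_index, row in enumerate(similarities):
        best = max(row)
        if best <= box_threshold:
            continue  # no word matches this box strongly enough
        words = [tok for tok, sim in zip(tokens, row) if sim > text_threshold]
        kept.append((box_index, best, " ".join(words)))
    return kept


# Toy scores (made up) for 3 candidate boxes against the words of "ripened tomato".
tokens = ["ripened", "tomato"]
similarities = [
    [0.62, 0.71],  # strong match on both words
    [0.10, 0.48],  # matches "tomato" only
    [0.05, 0.20],  # weak match, dropped by box_threshold
]
predictions = select_predictions(similarities, tokens,
                                 box_threshold=0.35, text_threshold=0.25)
for box_index, score, label in predictions:
    print(box_index, score, label)
```

In this toy run the second box is kept but labeled only "tomato", which shows how the two thresholds play different roles: one decides whether a box survives, the other decides which words describe it.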

Prompt Engineering

The GroundingDino model encodes text prompts into a learned latent space. Altering the prompt can lead to different text features, which can affect the performance of the detector. To improve prediction performance, it’s advisable to experiment with several prompts and choose the one that delivers the best results. Note that while writing this article I had to try several prompts before finding the ideal one, sometimes encountering unexpected results.
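
One way to make such prompt experiments less ad hoc is to tabulate a few numbers per prompt. The sketch below assumes a hypothetical `detect(prompt)` callable that returns a list of box scores; here it is replaced by a stand-in with made-up values so the example runs without the model.

```python
def compare_prompts(prompts, detect, box_threshold=0.35):
    """Run a detector once per prompt and summarise the confident detections."""
    summary = {}
    for prompt in prompts:
        scores = [s for s in detect(prompt) if s > box_threshold]
        mean = sum(scores) / len(scores) if scores else 0.0
        summary[prompt] = {"count": len(scores), "mean_score": mean}
    return summary


# Stand-in detector with made-up scores (hypothetical, not real model output).
def fake_detect(prompt):
    canned = {
        "tomato": [0.81, 0.64, 0.40, 0.30],
        "ripe tomato": [0.77, 0.52],
        "red fruit": [0.33],
    }
    return canned.get(prompt, [])


summary = compare_prompts(["tomato", "ripe tomato", "red fruit"], fake_detect)
for prompt, stats in summary.items():
    print(prompt, stats)
```

Counts and mean scores only narrow the candidates down. They say nothing about whether the boxes are correct, so the final choice still calls for looking at the annotated images.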

Code Implementation

Getting Began

To begin, we’ll clone the GroundingDino repository from GitHub, set up the environment by installing the required dependencies, and download the pre-trained model weights.

# Clone:
!git clone

# Install
%cd GroundingDINO/
!pip install -r requirements.txt
!pip install -q -e .

# Get weights
!wget -q

Inference on an image

We’ll start our exploration of the object detection algorithm by applying it to a single image of tomatoes. Our initial goal is to detect all the tomatoes in the image, so we’ll use the text prompt tomato. If you want to use several class names, you can separate them with a dot (.). Note that the colors of the bounding boxes are random and have no particular meaning.
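
For several classes, the prompt can be assembled from a list instead of typed by hand. This is a minimal sketch: `build_prompt` is a hypothetical helper, and the lower-casing and the spaces around the dot are conventions assumed here rather than requirements stated in this article.

```python
def build_prompt(class_names):
    """Join class names into a single dot-separated text prompt."""
    cleaned = [name.strip().lower() for name in class_names if name.strip()]
    return " . ".join(cleaned)


prompt = build_prompt(["Tomato", "leaf ", "stem"])
print(prompt)  # tomato . leaf . stem
```

The resulting string can be passed as the value of --text_prompt.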

python3 demo/ \
--config_file 'groundingdino/config/' \
--checkpoint_path 'groundingdino_swint_ogc.pth' \
--image_path 'tomatoes_dataset/tomatoes1.jpg' \
--text_prompt 'tomato' \
--box_threshold 0.35 \
--text_threshold 0.01 \
--output_dir 'outputs'
Annotations with the ‘tomato’ prompt. Image by Markus Spiske.
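
The same inference can also be driven from Python. The sketch below assumes the helper functions `load_model`, `load_image`, `predict`, and `annotate` from `groundingdino.util.inference`, as demonstrated in the repository's README. The wrapper `run_inference` and its parameter names are hypothetical, and the config and checkpoint paths are left as arguments for you to supply.

```python
def run_inference(config_path, checkpoint_path, image_path, text_prompt,
                  box_threshold=0.35, text_threshold=0.01,
                  output_path="annotated.jpg"):
    """Detect objects matching text_prompt and save an annotated copy of the image."""
    # Imported inside the function so this file loads even where
    # GroundingDINO and OpenCV are not installed.
    import cv2
    from groundingdino.util.inference import load_model, load_image, predict, annotate

    model = load_model(config_path, checkpoint_path)
    image_source, image = load_image(image_path)
    boxes, logits, phrases = predict(
        model=model,
        image=image,
        caption=text_prompt,
        box_threshold=box_threshold,
        text_threshold=text_threshold,
    )
    # Draw the boxes and phrases on the original image and write it to disk.
    annotated = annotate(image_source=image_source, boxes=boxes,
                         logits=logits, phrases=phrases)
    cv2.imwrite(output_path, annotated)
    return boxes, logits, phrases
```

Returning the boxes, scores, and phrases makes it easier to inspect or post-process the detections than parsing the saved image. On a machine without a GPU, the model-loading and prediction calls may need an explicit CPU device argument.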

GroundingDino not only detects objects as categories, such as tomato, but also comprehends the input text, a task known as Referring Expression Comprehension (REC). Let’s change the text prompt from tomato to ripened tomato and look at the result:

python3 demo/ \
--config_file 'groundingdino/config/' \
--checkpoint_path 'groundingdino_swint_ogc.pth' \
--image_path 'tomatoes_dataset/tomatoes1.jpg' \
--text_prompt 'ripened tomato' \
--box_threshold 0.35 \
--text_threshold 0.01 \
--output_dir 'outputs'
Annotations with the ‘ripened tomato’ prompt. Image by Markus Spiske.

Remarkably, the model can ‘understand’ the text and differentiate between a ‘tomato’ and a ‘ripened tomato’. It even tags partially ripened tomatoes that aren’t fully red. If our task requires tagging only fully ripened red tomatoes, we can raise the box_threshold from the default 0.35 to 0.5.

python3 demo/ \
--config_file 'groundingdino/config/' \
--checkpoint_path 'groundingdino_swint_ogc.pth' \
--image_path 'tomatoes_dataset/tomatoes1.jpg' \
--text_prompt 'ripened tomato' \
--box_threshold 0.5 \
--text_threshold 0.01 \
--output_dir 'outputs'
Annotations with the ‘ripened tomato’ prompt, with box_threshold = 0.5. Image by Markus Spiske.
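
When choosing a box_threshold, it can help to see how many boxes survive at each candidate value before rerunning the model. The scores below are made up for illustration, and `count_by_threshold` is a hypothetical helper.

```python
def count_by_threshold(scores, thresholds):
    """Return how many detection scores exceed each candidate threshold."""
    return {t: sum(1 for s in scores if s > t) for t in thresholds}


# Made-up box scores for one image under the 'ripened tomato' prompt.
scores = [0.72, 0.66, 0.58, 0.47, 0.41, 0.37]
counts = count_by_threshold(scores, [0.35, 0.5, 0.6])
print(counts)  # {0.35: 6, 0.5: 3, 0.6: 2}
```

A sharp drop between two thresholds suggests a group of borderline detections worth inspecting visually, such as the partially ripened tomatoes mentioned above.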

Generation of tagged dataset

Even though GroundingDino has remarkable capabilities, it’s a large and slow model. If real-time object detection is required, consider using a faster model like YOLO. Training YOLO and similar models requires a lot of tagged data, which can be expensive and time-consuming to produce. However, if your data isn’t unique, you can use GroundingDino to tag it. To learn more about efficient YOLO training, refer to my previous article [4].

The GroundingDino repository includes a script to annotate image datasets in the COCO format, which is suitable for YOLOx, for instance.

from demo.create_coco_dataset import main

main(image_directory='tomatoes_dataset',
     text_prompt='tomato',
     box_threshold=0.35,
     text_threshold=0.01,
     export_dataset=True,
     view_dataset=False,
     export_annotated_images=True,
     weights_path='groundingdino_swint_ogc.pth',
     config_path='groundingdino/config/',
     subsample=None)
  • export_dataset: If set to True, the COCO format annotations will be saved in a directory named ‘coco_dataset’.
  • view_dataset: If set to True, the annotated dataset will be displayed for visualization in the FiftyOne app.
  • export_annotated_images: If set to True, the annotated images will be saved in a directory named ‘images_with_bounding_boxes’.
  • subsample (int): If specified, only this number of images from the dataset will be annotated.
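
To make the exported format concrete, here is a sketch of a single COCO annotation record. It is not what the script runs internally. It assumes the detector's boxes arrive as normalized (cx, cy, w, h) values, and the helper names are hypothetical. The COCO side is standard: `bbox` is [x_min, y_min, width, height] in pixels.

```python
def to_coco_bbox(box_cxcywh, image_width, image_height):
    """Convert a normalized (cx, cy, w, h) box to COCO [x_min, y_min, width, height] in pixels."""
    cx, cy, w, h = box_cxcywh
    width = w * image_width
    height = h * image_height
    x_min = cx * image_width - width / 2
    y_min = cy * image_height - height / 2
    return [x_min, y_min, width, height]


def make_annotation(annotation_id, image_id, category_id, box_cxcywh,
                    image_width, image_height):
    """Build one entry for the 'annotations' list of a COCO JSON file."""
    bbox = to_coco_bbox(box_cxcywh, image_width, image_height)
    return {
        "id": annotation_id,
        "image_id": image_id,
        "category_id": category_id,
        "bbox": bbox,
        "area": bbox[2] * bbox[3],
        "iscrowd": 0,
    }


annotation = make_annotation(1, 1, 1, (0.5, 0.5, 0.25, 0.5),
                             image_width=640, image_height=480)
print(annotation)
```

A box centered in a 640x480 image that covers a quarter of the width and half of the height becomes [240.0, 120.0, 160.0, 240.0].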

Different YOLO algorithms require different annotation formats. If you’re planning to train YOLOv5 or YOLOv8, you’ll need to export your dataset in the YOLOv5 format. Although the export type is hard-coded in the main script, you can easily change it by adjusting the dataset_type argument in create_coco_dataset.main from fo.types.COCODetectionDataset to fo.types.YOLOv5Dataset (line 72). To keep things organized, we’ll also change the output directory name from ‘coco_dataset’ to ‘yolov5_dataset’. After changing the script, run create_coco_dataset.main again.

  if export_dataset:
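
For reference, the YOLOv5 label format stores one line per object, "class_id cx cy w h", with all four values normalized by the image size. The sketch below converts a COCO-style pixel box into such a line. The function name is hypothetical, and this shows only the target format, not the export code used by the script.

```python
def coco_to_yolo_line(class_id, bbox, image_width, image_height):
    """Format a COCO [x_min, y_min, width, height] pixel box as a YOLO label line."""
    x_min, y_min, width, height = bbox
    cx = (x_min + width / 2) / image_width
    cy = (y_min + height / 2) / image_height
    w = width / image_width
    h = height / image_height
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"


line = coco_to_yolo_line(0, [240, 120, 160, 240], image_width=640, image_height=480)
print(line)  # 0 0.500000 0.500000 0.250000 0.500000
```

Note that YOLO class ids start at 0, while COCO category ids often start at 1, so the id mapping is worth checking after any conversion.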

Concluding remarks

GroundingDino offers a significant leap in object detection annotation by using text prompts. In this tutorial, we have explored how to use the model for automated labeling of a single image or a whole dataset. It’s crucial, however, to manually review and verify these annotations before they are used to train subsequent models.


A user-friendly Jupyter notebook containing the complete code is included for your convenience:

Thank you for reading!

Want to learn more?


[1] Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection, 2023.

[2] DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection, 2022.

[3] An Open and Comprehensive Pipeline for Unified Object Grounding and Detection, 2023.

[4] The practical guide for Object Detection with YOLOv5 algorithm, by Dr. Lihi Gur Arie.


Automated Labeling of Object Detection Datasets Using GroundingDino was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.
