Multimodal Keypoint Detection Research < Jerod Weinman < CompSci < Grinnell
Motivation
Figure 1. Example of keypoints with rich
descriptions. Attributes are drawn from CUB images and applied
to corresponding keypoints of NABird species. (Photo Credit:
Laura Erickson via
NABirds)
Multimodal (text+image) models demonstrate strong object detection performance in both zero-shot and finetuned settings, in part because they replace a limited set of object classes with the natural language descriptions used by people, supporting open-world phrase grounding by learning the associations between words and pixels. We investigated whether the same approach could apply to smaller-scale object keypoint detection (i.e., for pose estimation), allowing us to use not only part names but also attributes (adjectives) describing those parts.
Overview
Figure 2. GLIP-KP model overview. (Photo credit: Davor Desancic via
NABirds)
The GLIP model fuses the linguistic and visual encodings in an interlinking multilayer head for the region proposal network. Although GLIP was originally designed for phrase grounding, we adapt it to keypoint detection by treating ground-truth keypoints as small objects (40×40 pixel boxes). To evaluate this approach, we train and test on the North American Birds (NABirds) data set, which contains nearly 50,000 images of 400 unique species, with eleven bird keypoints annotated in each image. The captions we ask the model to pinpoint are the common names of the eleven keypoints. However, to test the added value of the language stream, we also evaluate a variant that cuts off the semantic contribution by replacing each keypoint label with a single character. This forces the deep fusion layers to rely on visual characteristics alone, accompanied only by a symbol with no intrinsic meaning. One strength of multimodal models is the ability to leverage declarative or descriptive attributes for few-shot transfer learning to novel categories. We therefore create a set of richer descriptive attributes for the keypoint query captions, as shown in the table below.
Table 1: Keypoint names with their grammatically ordered
attributes and select examples.
This work is the first to incorporate such attributes for keypoint detection, rather than whole-image classification or phrase grounding. The three train/test variants of the data are:
- Symbols: each keypoint label replaced by a single, meaningless character
- Names: the common name of each keypoint
- Names with attributes: keypoint names enriched with the descriptive attributes of Table 1
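As a concrete illustration of the adaptation described above, the sketch below converts a ground-truth keypoint into a COCO-style box and constructs the caption variants. The specific part names and the example attribute wording here are illustrative assumptions, not the paper's exact annotations:

```python
# Sketch of the keypoint-as-box adaptation (illustrative, not the paper's code).

# NABirds annotates eleven parts per image; these common names are assumed here.
KEYPOINT_NAMES = [
    "bill", "crown", "nape", "left eye", "right eye", "belly",
    "breast", "back", "tail", "left wing", "right wing",
]

def keypoint_to_box(x, y, size=40):
    """Treat a ground-truth keypoint as a small object: a size x size
    pixel box centered on (x, y), in COCO [x, y, w, h] format."""
    half = size / 2.0
    return [x - half, y - half, float(size), float(size)]

# Caption variants for the query stream:
symbols = [chr(ord("a") + i) for i in range(len(KEYPOINT_NAMES))]  # no semantics
names = KEYPOINT_NAMES                                             # common names
# The third variant enriches each name with grammatically ordered attributes,
# e.g. "bill" -> "pointed black bill" (wording here is hypothetical).
```

Centering a fixed-size box on each keypoint lets an off-the-shelf detection head localize parts without any architectural change.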
Evaluation
Figure 3. The range of OKS thresholds used for the mAP
calculation. (Photo credit: Davor Desancic via
NABirds)
Both our system and our evaluation combine elements of object and keypoint detection. We note that, unlike traditional keypoint detectors, our system can learn not to ground a keypoint caption label. Both PCK (percentage of correct keypoints) and the COCO OKS (object keypoint similarity) ignore false positives for keypoints that are not visible. These measures can be misleading in scenarios where keypoint visibility is unknown yet false positives should be minimized (for instance, when the labeled keypoints are to be used in teaching people). To remedy these shortcomings, we adapt an anisotropic version of the COCO OKS measure, scoring within the COCO object detection framework by replacing IoU with our OKS calculation.
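A minimal sketch of the scoring idea follows, with hypothetical per-keypoint falloff constants (the paper's actual anisotropic constants are reflected in Figure 3). Standard COCO OKS uses one isotropic sigma per keypoint; an anisotropic variant allows different falloffs along x and y, and the resulting similarity stands in for IoU when sweeping detection thresholds for mAP:

```python
import numpy as np

def anisotropic_oks(pred, gt, scale, sx, sy, visible):
    """Anisotropic object keypoint similarity (sketch).

    pred, gt : (K, 2) predicted and ground-truth keypoint coordinates
    scale    : object scale (e.g., square root of the object area)
    sx, sy   : (K,) per-keypoint falloff constants along x and y
               (hypothetical values; anisotropy lets, e.g., the breast
               tolerate error in one direction more than the other)
    visible  : (K,) boolean mask of annotated keypoints
    """
    dx = pred[:, 0] - gt[:, 0]
    dy = pred[:, 1] - gt[:, 1]
    # Isotropic COCO OKS would use a single constant: exp(-d^2 / (2 s^2 k^2)).
    e = dx**2 / (2 * scale**2 * sx**2) + dy**2 / (2 * scale**2 * sy**2)
    ks = np.exp(-e)
    return ks[visible].mean() if visible.any() else 0.0

# This similarity replaces IoU when computing COCO-style mAP over
# thresholds 0.50, 0.55, ..., 0.95.
oks_thresholds = np.arange(0.50, 1.00, 0.05)
```

A perfect prediction scores 1.0, and the score decays toward 0 as predictions drift from the ground truth, faster along the axis with the smaller constant.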
The range of Object Keypoint Similarity (OKS) values used for sweeping mean Average Precision (mAP) is shown in Figure 3. Note the anisotropy of the breast and crown, while the bill and eye are far more stringent (having smaller rings) than the belly and wings.
Results
Figure 4. Test results for data-limited scenarios: zero-, one-, and few-shot learning.
Using only the pretrained GLIP weights, the keypoint names provide low but measurably better results than meaningless symbols in a zero-shot test scenario. With one-shot learning, the absolute results improve with the use of part names, though the relative benefit of names remains roughly the same. With ten examples of each part in few-shot learning, results continue to improve and the gap widens. In all the limited-learning scenarios, including attributes adds a 1 to 2 point AP boost over names alone.
Figure 5. Test results for fully-finetuned models.
After training to convergence, the GLIP-KP model outperforms earlier heatmap-based approaches, even with the same vision backbone. Using the keypoint names performs better than the meaningless symbols, but only marginally. Adding the descriptions to the keypoint names raises performance a full six points. Thus, adding descriptive attributes to keypoint names improves performance significantly. But there is an equally important hidden story: the time to train. Finetuning the symbolic model, which ignores linguistic cues, takes 51 epochs to converge. Using the keypoint names takes 20% less time, and the richer descriptive attributes reduce training time by another 15%.
Related Paper
Experimental Data+Code
If this data or code is used in a publication, please cite the appropriate paper above.
Complete Annotations
The annotations with descriptive attributes used in the ICVS 2023 paper are available. Note: These are only the COCO-style annotation files; the original images from NABirds must be acquired separately.
Experimental Code
This repository contains the modifications to GLIPv1 that are used in the ICVS 2023 paper.
Contributors
Acknowledgments
The text herein is adapted in part from the ICVS paper referenced
above. This work was supported in part by the Pioneer Centre for
AI, DNRF grant number P1.