In recent years, methods for visual object tracking and segmentation have achieved excellent results, largely due to the adoption of memory-based approaches; one particularly promising method is the Segment Anything Model 2 (SAM2). Despite their sophistication, these methods struggle when tracking parts of objects, mainly because they are trained on datasets whose annotations cover entire objects. Examples of object part tracking are rare in existing datasets, and no dataset specialized for this task exists yet.
In this work, we present a training dataset, YT-VOS-PT (train), and an evaluation dataset, YT-VOS-PT (eval), both derived from the YouTube-VOS dataset and containing annotated examples of object part tracking. The training dataset is used to retrain SAM2. We evaluate several training configurations of the method on the YT-VOS-PT (eval) dataset, demonstrating an improvement in the J&F score of up to 7%. On selected examples from the DiDi dataset where object parts are tracked, we show tracking-quality improvements of up to 16%, and up to 39% when integrated with DAM4SAM.
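For readers unfamiliar with the J&F score mentioned above: it averages a region-similarity term J (the Jaccard index, i.e. mask IoU) with a boundary-accuracy term F (an F1 score over matched boundary pixels). Below is a minimal sketch of this idea, not the official DAVIS/YouTube-VOS evaluation code: masks are modeled as sets of pixel coordinates, and the one-pixel boundary-matching tolerance is a simplifying assumption.

```python
def jaccard(pred, gt):
    """Region similarity J: intersection over union of two pixel sets."""
    if not pred and not gt:
        return 1.0
    return len(pred & gt) / len(pred | gt)

def boundary(mask):
    """Pixels of the mask with at least one 4-neighbour outside the mask."""
    return {
        (r, c) for (r, c) in mask
        if any((r + dr, c + dc) not in mask
               for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)))
    }

def boundary_f(pred, gt, tol=1):
    """Boundary measure F: F1 over boundary pixels, matched within a
    Chebyshev distance of `tol` (a simplified matching rule)."""
    bp, bg = boundary(pred), boundary(gt)
    if not bp and not bg:
        return 1.0
    if not bp or not bg:
        return 0.0
    near = lambda p, s: any(max(abs(p[0] - q[0]), abs(p[1] - q[1])) <= tol
                            for q in s)
    precision = sum(near(p, bg) for p in bp) / len(bp)
    recall = sum(near(g, bp) for g in bg) / len(bg)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def j_and_f(pred, gt):
    """J&F: the mean of the region and boundary scores."""
    return 0.5 * (jaccard(pred, gt) + boundary_f(pred, gt))

# Example: a 4x4 ground-truth square vs. a prediction shifted by one column.
gt = {(r, c) for r in range(4) for c in range(4)}
pred = {(r, c + 1) for r in range(4) for c in range(4)}
print(round(j_and_f(pred, gt), 3))  # prints 0.8 (J = 0.6, F = 1.0)
```

In the example, the shifted mask overlaps 12 of 20 union pixels (J = 0.6), while every boundary pixel finds a match within the one-pixel tolerance (F = 1.0), so J&F = 0.8.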