Assignment-Space-based Multi-Object Tracking and Segmentation

  • Agricultural and Biological Engineering
  • Electrical and Computer Engineering
  • Computer Science
  • Aerospace Engineering
  • Coordinated Science Lab
  • National Center for Supercomputing Applications (NCSA)
  • Carl R. Woese Institute for Genomic Biology

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Multi-object tracking and segmentation (MOTS) is important for understanding dynamic scenes in video data. Existing methods perform well on multi-object detection and segmentation for independent video frames, but tracking of objects over time remains a challenge. MOTS methods formulate tracking locally, i.e., frame-by-frame, leading to sub-optimal results. Classical global methods on tracking operate directly on object detections, which leads to a combinatorial growth in the detection space. In contrast, we formulate a global method for MOTS over the space of assignments rather than detections: First, we find all top-k assignments of objects detected and segmented between any two consecutive frames and develop a structured prediction formulation to score assignment sequences across any number of consecutive frames. We use dynamic programming to find the global optimizer of this formulation in polynomial time. Second, we connect objects which reappear after having been out of view for some time. For this we formulate an assignment problem. On the challenging KITTI-MOTS and MOTSChallenge datasets, this achieves state-of-the-art results among methods which don't use depth data.

Publication series

  • Proceedings of the IEEE International Conference on Computer Vision

ASJC Scopus subject areas

  • Computer Vision and Pattern Recognition

Online availability

  • 10.1109/ICCV48922.2021.01334



T1 - Assignment-Space-based Multi-Object Tracking and Segmentation

AU - Choudhuri, Anwesa

AU - Chowdhary, Girish

AU - Schwing, Alexander G.

N1 - Publisher Copyright: © 2021 IEEE


UR - http://www.scopus.com/inward/record.url?scp=85127824049&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85127824049&partnerID=8YFLogxK

DO - 10.1109/ICCV48922.2021.01334

M3 - Conference contribution

AN - SCOPUS:85127824049

T3 - Proceedings of the IEEE International Conference on Computer Vision

BT - Proceedings - 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021

PB - Institute of Electrical and Electronics Engineers Inc.

T2 - 18th IEEE/CVF International Conference on Computer Vision, ICCV 2021

Y2 - 11 October 2021 through 17 October 2021

Multi-Object Tracking and Segmentation

16 papers with code • 2 benchmarks • 3 datasets

Multiple object tracking and segmentation requires detecting, tracking, and segmenting objects belonging to a set of given classes.

(Image and definition credit: Prototypical Cross-Attention Networks for Multiple Object Tracking and Segmentation, NeurIPS 2021, Spotlight)


Most implemented papers

BDD100K: A Diverse Driving Dataset for Heterogeneous Multitask Learning

bdd100k/bdd100k • CVPR 2020

Datasets drive vision progress, yet existing driving datasets are impoverished in terms of visual content and supported tasks to study multitask learning for autonomous driving.

EagerMOT: 3D Multi-Object Tracking via Sensor Fusion

aleksandrkim61/EagerMOT • 29 Apr 2021

Multi-object tracking (MOT) enables mobile robots to perform well-informed motion planning and navigation by localizing surrounding objects in 3D space and time.

D2Conv3D: Dynamic Dilated Convolutions for Object Segmentation in Videos


We further show that D2Conv3D out-performs trivial extensions of existing dilated and deformable convolutions to 3D.

Segment as Points for Efficient Online Multi-Object Tracking and Segmentation

The resulting online MOTS framework, named PointTrack, surpasses all the state-of-the-art methods, including 3D tracking methods, by large margins (5.4% higher MOTSA and 18 times faster than MOTSFusion) at near real-time speed (22 FPS).

PointTrack++ for Effective Online Multi-Object Tracking and Segmentation

In this work, we present PointTrack++, an effective on-line framework for MOTS, which remarkably extends our recently proposed PointTrack framework.

Online Multi-Object Tracking and Segmentation with GMPHD Filter and Mask-based Affinity Fusion

SonginCV/MAF_HDA • 31 Aug 2020

One affinity, for position and motion, is computed by using the GMPHD filter, and the other affinity, for appearance, is computed by using the responses from a single object tracker such as a kernelized correlation filter.

Continuous Copy-Paste for One-Stage Multi-Object Tracking and Segmentation

Current one-step multi-object tracking and segmentation (MOTS) methods lag behind recent two-step methods.

Assignment-Space-Based Multi-Object Tracking and Segmentation

In contrast, we formulate a global method for MOTS over the space of assignments rather than detections: First, we find all top-k assignments of objects detected and segmented between any two consecutive frames and develop a structured prediction formulation to score assignment sequences across any number of consecutive frames.

Prototypical Cross-Attention Networks for Multiple Object Tracking and Segmentation

We propose Prototypical Cross-Attention Network (PCAN), capable of leveraging rich spatio-temporal information for online multiple object tracking and segmentation.

Do Different Tracking Tasks Require Different Appearance Models?

We show how most tracking tasks can be solved within this framework, and that the same appearance model can be successfully used to obtain results that are competitive against specialised methods for most of the tasks considered.

MOTS: Multi-Object Tracking and Segmentation


The moving target tracking and segmentation method based on space-time fusion

  • Open access
  • Published: 21 September 2022
  • Volume 82, pages 12245–12262 (2023)


  • Jie Wang 1,
  • Shibin Xuan (ORCID: orcid.org/0000-0002-9301-2953) 1, 2,
  • Hao Zhang 1 &
  • Xuyang Qin 1


At present, target tracking methods based on correlation operations mainly use deep learning to extract spatial information from video frames and then perform correlations on this basis. However, they do not extract the motion features of tracked targets along the time axis, so tracked targets can easily be lost when occlusion occurs. To this end, a spatiotemporal motion target tracking model incorporating Kalman filtering is proposed with the aim of alleviating the occlusion problem during tracking. In combination with the segmentation model, a suitable model is selected by score to predict or detect the current state of the target. We use an elliptic-fitting strategy to evaluate the bounding boxes online. Experiments demonstrate that our approach performs well and remains stable in the face of multiple challenges (such as occlusion) on the VOT2016 and VOT2018 datasets while maintaining real-time performance.


1 Introduction

Target tracking has become a popular research topic in computer vision because of its wide application and great potential in areas such as intelligent surveillance, autonomous driving, human–computer interaction, and intelligent transportation. Before 2010, target tracking was mostly done using classical algorithms such as particle filtering [10], Kalman filtering [32], mean shift [35], and the optical flow method. In 2010, Bolme et al. [3] applied the correlation filtering method to tracking; later, KCF [16], BACF [19], SRDCF [7], DSST [8], CACF [24], and Siamese [2] methods were employed. In 2016, Bertinetto et al. [2] proposed a tracking method that combines a Siamese network in deep learning with correlation filtering and achieved great success; most deep-learning-based target tracking methods [4, 5, 17, 20, 22, 29, 30, 36] have since built on it. In 2018, Li et al. [21] proposed the SiamRPN method, which combines SiameseFC with an RPN and abandons the traditional multiscale detection scheme. Wang et al. [29] proposed the SiamMask method, which adds a segmentation network to the SiamRPN framework; tracking accuracy is significantly improved by tracking targets based on segmentation results.

Although these methods have achieved excellent results, they focus more on the merits of the target's features and neglect the construction of motion models for target tracking. Even though good features can easily improve the performance of a tracker, target tracking faces problems such as occlusion, lighting changes, scale changes, deformation, and motion blur. Moreover, no method is available for extracting good features in every scenario. In the face of heavy occlusion, it is difficult for detection- or segmentation-based methods to extract sufficient features, since the target does not appear in the field of view. In contrast, Kalman filtering methods can accurately predict the state of a target by learning from its states in past frames, even in the absence of sufficient target features.

The major contributions of this study are as follows. Firstly, we propose the use of Kalman filtering to build motion models in combination with the SiamMask method to address the problem of missing tracked targets in complex environments resulting from the lack of accurate information on segmented target objects. Secondly, an elliptical fitting strategy is used to evaluate the angle and size of the rotating bounding boxes, and an attention mechanism is used to focus the model more on the contribution of the target subject area and to reduce the influence of the background.

The strengths of the proposed system are as follows. Firstly, we refine the tracking results of the tracking model by combining it with the segmentation model, which allows the method to maintain high accuracy and robustness in a variety of complex environments. Secondly, by incorporating a Kalman filter, we propose a spatiotemporal motion model that effectively alleviates the negative impact of occlusion; benefiting from this, the target can still be tracked for a short time even when sufficient appearance features cannot be extracted. Thirdly, we use an ellipse-fitting strategy to refine the final bounding box, which greatly improves the accuracy of the algorithm while consuming minimal resources.

Section 1 gives a brief introduction to our approach. Section 2 introduces related work, and Section 3 describes our approach in detail, including the main structure and core modules of the algorithm. Section 4 compares our approach with other popular algorithms on two datasets, VOT2016 and VOT2018; the strengths and weaknesses of our method are analyzed, and future directions to address the weaknesses are proposed. Finally, Section 5 reviews the paper and draws conclusions.

2 Related works

In this section, we briefly review the research progress on Siamese networks in the target tracking field in recent years. Bertinetto et al. [2] proposed the SiamFC method, combining the Siamese network with correlation filtering methods for the first time and successfully applying it to target tracking. However, the SiamFC method adapts poorly to the environment, cannot handle scale changes, and its accuracy and precision cannot meet the requirements of tracking in complex circumstances. SiamRPN was proposed by Li et al. [21]; it introduces an RPN network and, by pre-setting multiple anchors, determines the position and size of the target in the current frame through pre-learned classification and position branches. Mask_RCNN [15] adds a branch to Faster RCNN [27] to segment the target instance while performing detection. SiamMask [29] follows the Mask_RCNN approach and adds a segmentation branch to SiamRPN, maps the segmentation back to the original image, and uses the segmented object as the final tracking result, achieving real-time target tracking. The SiamMask method greatly improves tracking accuracy. However, because only the influence of positive samples on the tracking results is considered, SiamMask often incorrectly segments background regions with high similarity to the target as the target when intra-class interference or severe occlusion occurs, resulting in inaccurate tracking or even tracking loss.

The above methods all rely on network parameters fully trained offline, with almost no online learning strategy. However, the uncertainty of the tracked object and the complex, changeable tracking scenes mean that a pre-trained network can hardly fully represent a changing target and the influence of the background on tracking in every video. A reasonable online learning strategy is therefore necessary.

Currently popular methods mainly emphasize tracking networks trained offline. Zhang [36] proposed relying on temporal and spatial context, modeling the spatiotemporal information of the tracked target through a Bayesian framework to obtain the correlation between the target and surrounding features. The Kalman filter is based on a state-transition equation and an observation state; an optimal estimate is obtained by combining these two Gaussian distributions, and it is used for linear filtering and prediction problems.

In this study, Kalman filtering is used to construct a motion model and to predict the state of the target when there are obvious deviations and errors in the tracking. It is experimentally demonstrated that this method is superior to the SiamMask algorithm when faced with occlusion and intra-class interference problems.

3 Our method

In this section, we describe our approach in detail. We divide the tracking system into three modules: a prediction module, a segmentation module, and a correction module. The prediction module efficiently predicts the state of the object in frames where it is heavily occluded or subject to intra-class interference. The segmentation module uses a Siamese network with a segmentation branch to efficiently segment the target object in each frame. The correction module applies an elliptic-fitting strategy to correct the final bounding box obtained from the segmentation results. The main structure of the algorithm is shown in Fig. 1.

Figure 1: Algorithm structure diagram

3.1 Prediction module

In target tracking, most existing methods focus only on extracting high-quality features while ignoring the spatiotemporal continuity of the tracked target. SiamMask [29] determines the final state of the target from the segmentation results, but in several experiments we found that the segmentation branch easily confuses similar background objects with the tracked target under intra-class interference. Kalman filtering does not depend on the quality of the extracted features but on the motion trend of the target in the spatiotemporal sequence. Given a set of video observation sequences $Y_t$, the observation state can be expressed linearly in the state variable $Z_t$ as

$$Y_t = H_t Z_t + n_t \tag{1}$$

where $H_t$ is the observation matrix, $n_t$ is the observation noise, and $Z_t$ represents the state of the target at time $t$. The transfer of the target state can be represented by the linear state-transfer equation

$$Z_t = \Phi_{t,t-1} Z_{t-1} + w_t \tag{2}$$

where $\Phi_{t,t-1}$ is the state-transfer matrix and $w_t$ is the error of the state model, with covariance matrix $Q_t$.

The update proceeds in two stages: prediction (state and error-covariance prediction) followed by correction (Kalman-gain computation, state update, and error-covariance update). The state prediction equation is

$$\hat{Z}_{t|t-1} = \Phi_{t,t-1}\,\hat{Z}_{t-1} \tag{3}$$

The covariance prediction equation is

$$P_{t|t-1} = \Phi_{t,t-1}\,P_{t-1}\,\Phi_{t,t-1}^{\top} + Q_t \tag{4}$$

The Kalman gain equation is

$$K_t = P_{t|t-1} H_t^{\top}\!\left(H_t P_{t|t-1} H_t^{\top} + R_t\right)^{-1} \tag{5}$$

where $R_t$ is the covariance of the observation noise $n_t$. The state update equation is

$$\hat{Z}_t = \hat{Z}_{t|t-1} + K_t\!\left(Y_t - H_t\,\hat{Z}_{t|t-1}\right) \tag{6}$$

The covariance update equation is

$$P_t = \left(I - K_t H_t\right) P_{t|t-1} \tag{7}$$

In this study, the center point of the target, $X = [x, y]^{\top}$, is modeled as a feature point undergoing uniformly accelerated motion, which yields the quadratic polynomial motion model

$$X_t = X_{t-1} + \dot{X}_{t-1}\,\Delta t + \tfrac{1}{2}\,\ddot{X}_{t-1}\,\Delta t^2 \tag{8}$$

so the state is $Z_t = [X^{\top}, \dot{X}^{\top}, \ddot{X}^{\top}]^{\top}$. The state-transfer matrix $\Phi_{t,t-1}$ and the observation matrix $H_t$ are

$$\Phi_{t,t-1} = \begin{bmatrix} I_2 & \Delta t\, I_2 & \tfrac{\Delta t^2}{2}\, I_2 \\ 0_2 & I_2 & \Delta t\, I_2 \\ 0_2 & 0_2 & I_2 \end{bmatrix}, \qquad H_t = \begin{bmatrix} I_2 & 0_2 & 0_2 \end{bmatrix} \tag{9}$$

where $I_2$ represents a two-dimensional identity matrix and $0_2$ represents a two-dimensional zero matrix. According to Eqs. (1) and (2), the Kalman filter can be used to accurately predict the state of the target. The final state $S_t$ of the target in this study is determined by

$$S_t = \begin{cases} M_t, & \text{if } score \geq \eta \text{ and } dist \leq \sigma, \\ Z_t, & \text{otherwise,} \end{cases} \tag{10}$$

where $M_t$ is the target state obtained by the segmentation branch, $Z_t$ is the target state predicted by Kalman filtering, $dist$ is the Euclidean distance between the center of the target state $S_{t-1}$ in the previous frame and $M_t$, and $score$ is the score of the feature with dimensions 1 × 1 × 256, which represents the similarity between target and candidate samples. The more similar the candidate and the template are, the higher the score. We choose the more reasonable target state between $M_t$ and $Z_t$ using Eq. (10). If the target is severely occluded or out of view, all score values fall below $\eta$. However, if there are similar objects in the candidate area, the actual state of the target cannot be distinguished by the score value alone. We therefore assume by default that the target does not move over a large range between two frames. The rationale for this decision is that the target state $M_t$ obtained by segmentation may deviate substantially, and similar objects in the candidate area may be mistakenly identified as the tracked target. Through many experiments, we found that when score < $\eta$, the target has typically been occluded over a large area or has left the field of view; in such a case, the original tracker still assumes that most of the target is visible in the field of view and is therefore forced to pick a background region with high similarity to the template as the target, continuing to track it erroneously. We derived suitable parameter values from numerous experiments and set $\sigma$ = 100 and $\eta$ = 0.9. At the same time, to ensure the stability of the tracker, we assume that the value of $dist$ stays within a certain range and that the target's trajectory does not exhibit large-scale fluctuations: in our experiments, except for sequences with fast-moving objects, targets rarely move long distances between frames, and long-distance movement of the tracking result is usually caused by losing the target. Therefore, to improve the robustness of the algorithm, we use Eq. (10) as the selection criterion.
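To make the recursion above concrete, the following Python sketch implements the constant-acceleration Kalman filter of Eqs. (3)-(9) together with a selection rule in the spirit of Eq. (10). The noise covariances Q and R, the frame interval dt, and the exact form of the selection are illustrative assumptions; the paper does not publish code.

```python
import numpy as np

# Constant-acceleration Kalman filter for the target centre, following
# Eqs. (3)-(9). dt, Q and R are illustrative values, not taken from the paper.
dt = 1.0
I2, O2 = np.eye(2), np.zeros((2, 2))
Phi = np.block([[I2, dt * I2, 0.5 * dt**2 * I2],
                [O2, I2,      dt * I2],
                [O2, O2,      I2]])                 # state-transfer matrix (Eq. 9)
H = np.block([I2, O2, O2])                          # observe only (x, y) (Eq. 9)
Q = 1e-2 * np.eye(6)                                # process noise (assumed)
R = 1e-1 * np.eye(2)                                # observation noise (assumed)

def kalman_step(z, P, y):
    """One predict/correct cycle on the state z = [x, y, vx, vy, ax, ay]."""
    z_pred = Phi @ z                                # state prediction (Eq. 3)
    P_pred = Phi @ P @ Phi.T + Q                    # covariance prediction (Eq. 4)
    K = P_pred @ H.T @ np.linalg.inv(H @ P_pred @ H.T + R)  # Kalman gain (Eq. 5)
    z_new = z_pred + K @ (y - H @ z_pred)           # state update (Eq. 6)
    P_new = (np.eye(6) - K @ H) @ P_pred            # covariance update (Eq. 7)
    return z_new, P_new

def select_state(M_t, Z_t, S_prev, score, sigma=100.0, eta=0.9):
    """Selection rule sketched after Eq. (10): trust the segmentation result
    M_t only if it is confident and has not jumped far from the previous state."""
    dist = np.linalg.norm(np.asarray(M_t)[:2] - np.asarray(S_prev)[:2])
    return M_t if (score >= eta and dist <= sigma) else Z_t
```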

3.2 Segmentation module

We used SiamMask [29] as the segmentation module of this study. SiamMask uses an RPN [21] to compute classification scores and bounding boxes, so that each candidate window of the fully convolutional Siamese network encodes the information needed to generate a pixel-level binary segmentation mask. The two inputs (a template $T$ and a search region $SR$) pass through the same convolutional neural network $f_\theta$, and a depth-wise cross-correlation of the two feature maps is performed to obtain

$$g_\theta(T, SR) = f_\theta(T) \star f_\theta(SR)$$

SiamMask uses a simple two-layer neural network $h_\phi$ with learned parameters $\phi$ to predict a binary mask of size $w \times h$. The predicted mask $m^n$ of the $n$th candidate window $g_\theta^n(T, SR)$ is

$$m^n = h_\phi\!\left(g_\theta^n(T, SR)\right)$$

From Fig. 1, we can see that the network has three parallel branches: a classification branch, a regression branch, and a segmentation branch. The classification branch distinguishes the target from the background; it predicts a target score and a background score for each sample, and its loss function is denoted $L_{cls}$. The regression branch fine-tunes the candidate area to obtain the predicted position and bounding-box size; its loss function is denoted $L_{reg}$. The segmentation branch extracts the feature with the highest score in the feature map and decodes it to generate a binary segmentation mask; its loss function is denoted $L_{mask}$. The total loss function of the SiamMask method is therefore

$$L = \lambda_1 L_{cls} + \lambda_2 L_{reg} + \lambda_3 L_{mask}$$

where $\lambda_1$, $\lambda_2$, and $\lambda_3$ are weighting parameters.
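The depth-wise cross-correlation above can be written compactly in PyTorch by treating the template features as per-channel filters over the search-region features; this is a standard construction in SiamMask-style trackers, sketched here under assumed tensor shapes rather than as the authors' code.

```python
import torch
import torch.nn.functional as F

def depthwise_xcorr(template_feat: torch.Tensor, search_feat: torch.Tensor) -> torch.Tensor:
    """Depth-wise cross-correlation of f_theta(T) with f_theta(SR).
    template_feat: (B, C, h, w); search_feat: (B, C, H, W) with H >= h, W >= w."""
    b, c, h, w = template_feat.shape
    x = search_feat.reshape(1, b * c, *search_feat.shape[-2:])
    kernel = template_feat.reshape(b * c, 1, h, w)
    out = F.conv2d(x, kernel, groups=b * c)     # one response map per channel
    return out.reshape(b, c, *out.shape[-2:])
```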

3.3 Correction module

After many experiments, we found that the segmentation results often do not perfectly separate the target from the background. The SiamMask method uses the smallest rectangular bounding box of the segmentation mask as the final result in the current frame, so even a small amount of background included in the segmentation result strongly affects the final bounding box. In this study, an ellipse-fitting strategy is used to refine the rotated bounding box, biasing the final result toward the torso of the target and reducing the accuracy loss caused by small segmentation errors. An ellipse can be represented by a conic equation with the following constraint:

$$a\,i^2 + b\,ij + c\,j^2 + d\,i + e\,j + f = 0, \qquad b^2 - 4ac < 0,$$

where $a, b, c, d, e, f$ are the coefficients of the ellipse and $(i, j)$ is a point on the ellipse. Because the image needs to be rotated around the center of the ellipse, the following transfer matrix is used to compute the coordinates of a transferred point in the original image:

$$\begin{bmatrix} i' \\ j' \end{bmatrix} = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} \begin{bmatrix} i - i_{cen} \\ j - j_{cen} \end{bmatrix} + \begin{bmatrix} i_{cen} \\ j_{cen} \end{bmatrix},$$

where $\theta$ is the rotation angle and $(i_{cen}, j_{cen})$ is the center point. If $Mask_a$ is the set of all points in the segmentation mask, then $Mask_b$, the point set of the segmentation mask after the transfer, is given by

$$Mask_b = \{\,(i', j') \mid (i, j) \in Mask_a\,\}.$$

$rec_a$ is the smallest rectangular bounding box of the ellipse of the target mask after rotation, and $rec_{mask}$ is the smallest rectangle of the segmentation result. The intersection $rec_l$ of $rec_a$ and $rec_{mask}$ is taken as the optimized bounding rectangle. The segmented region $rec_l$ is then rotated back to its original position according to the rotation angle $\theta$, and the rotated $rec_l^{\theta}$ is output as the final bounding box. Figure 2 shows the main flow of the calculations.

Figure 2: Ellipse fitting strategy
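The correction step can be sketched with OpenCV as below. The helper is illustrative only: it fits an ellipse to the mask pixels, rotates the point set about the ellipse centre, and intersects the two axis-aligned boxes. Names such as refine_box and the exact intersection handling are assumptions, not the authors' implementation.

```python
import cv2
import numpy as np

def refine_box(mask: np.ndarray):
    """Ellipse-fitting correction sketch: returns the optimized rectangle rec_l
    (in the rotated frame) and the angle theta used to map it back.
    The mask must contain at least 5 foreground pixels for cv2.fitEllipse."""
    pts = np.column_stack(np.nonzero(mask)[::-1]).astype(np.float32)    # mask pixels (i, j)
    (icen, jcen), _, theta = cv2.fitEllipse(pts.reshape(-1, 1, 2))      # ellipse centre, angle
    rot = cv2.getRotationMatrix2D((float(icen), float(jcen)), theta, 1.0)
    pts_rot = cv2.transform(pts.reshape(-1, 1, 2), rot).reshape(-1, 2)  # Mask_b: rotated points
    box = lambda p: np.array([p[:, 0].min(), p[:, 1].min(), p[:, 0].max(), p[:, 1].max()])
    a, m = box(pts_rot), box(pts)                                       # rec_a and rec_mask
    rec_l = np.array([max(a[0], m[0]), max(a[1], m[1]),
                      min(a[2], m[2]), min(a[3], m[3])])                # intersection rec_l
    return rec_l, theta   # rotate rec_l back by -theta for the final bounding box
```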

4 Experiment

In this section, we evaluate the improved methods we propose on the VOT2016 and VOT2018 datasets and compare them with a number of popular methods. The experimental results demonstrate that the proposed method has high accuracy and precision. To ensure a fair comparison, the SiamMask part employed here uses the same structure and parameters as in Wang et al. [29]. Our experimental setup used a computer with a Ryzen 7 4800H CPU, a GeForce GTX 1650 Ti GPU, and 16 GB of memory, running Windows 10, with the algorithm implemented in Python.

4.1 Evaluation criteria

The evaluation indicators used in this study were the average overlap ratio, tracking length, failure rate, and robustness. The average overlap ratio is the intersection over union between the predicted target area and the ground-truth area; the larger this ratio, the smaller the error. The tracking length is the number of frames, from the start of tracking, for which the center-point error remains below an acceptable threshold. The failure rate is defined as follows: when the overlap ratio falls below a threshold, tracking has failed and the bounding box is reinitialized; the shorter the tracked segments, the greater the failure rate. During the $k$th repetition of the algorithm, the robustness on a video is calculated as

$$R_k = e^{-a M}, \qquad M = \frac{F_1}{N},$$

where $M$ is the average number of failures per frame, $F_1$ is the total number of failures, $N$ is the length of the video sequence, and $a$ is a parameter. $F(i, k)$ denotes the number of times the algorithm fails to track in video image $i$ and is reinitialized after five frames.
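Read literally, the measure amounts to the following; the value of the parameter a here is illustrative, not taken from the paper.

```python
import numpy as np

def robustness(total_failures: int, seq_len: int, a: float = 30.0) -> float:
    """Video robustness R_k = exp(-a * M) with M = F_1 / N, the average
    number of failures per frame (`a` is an assumed scaling parameter)."""
    M = total_failures / seq_len
    return float(np.exp(-a * M))
```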

4.2 Experimental results

The proposed algorithm was tested and evaluated on the VOT2016 and VOT2018 datasets, and the results were compared with those of ECO [9], ECO_HC, VITAL [28], SiamMask, SiamRPN, TADT [11], and SiamCAR [13]. The videos were divided into nine categories, and the results closely reflect the performance of each algorithm in different scenarios. The experimental results demonstrate that the algorithm performs well in the face of various challenges and that its stability is clearly stronger than that of the compared algorithms.

4.3 Analysis of experimental results

From Table 1, we can see that our method achieves good results under motion change, camera motion, and scale variation and ranks first in average video accuracy, which demonstrates that the algorithm is robust. The segmentation results can accurately segment the target in the face of motion-state changes, the Kalman filter can more accurately predict the target location and fine-tune the target state obtained from segmentation, and the response to scale variation achieves excellent results. The elliptic-fitting strategy fine-tunes the segmentation results to achieve good accuracy. The method performs well on the occlusion problem on the VOT2018 dataset, as listed in Table 2, where it is superior to the other compared algorithms, but performs worse than ECO, TADT, and VITAL on the VOT2016 dataset. These performance differences can be explained as follows: ECO uses more comprehensive features (CNN + HOG + CN) to cope with the occlusion problem faced by single-feature target tracking algorithms; VITAL uses a generative adversarial network that randomly generates numerous masks and retains the most robust masks among the target features to augment the positive samples; and TADT uses pixel-level losses to guide channel selection. The accuracy results of VITAL and TADT are significantly higher than those of our algorithm; however, compared to SiamMask, our algorithm still achieves an accuracy improvement of 0.05. From Tables 3 and 4, we can see that the proposed strategy ranks first in terms of expected average overlap (EAO), overlap, and failure metrics, with a strong overall performance, outperforming SiamMask by almost 0.04, which indicates that the improvement is effective. Figures 3 and 6 list the A-R ranks of the EAO metrics of the various algorithms on the nine types of videos in VOT2016 and VOT2018. It can be seen that the robustness and accuracy of our algorithm are higher than those of the other algorithms in most of the challenges.

Figure 3: EAO comparison results (VOT2016)

From the expected overlap curves in Figs. 4 and 7, it can be seen that our algorithm does not behave like ECO and other methods, whose overlap decreases significantly as the number of video frames increases: our method adopts deep learning to extract features, and the depth features are more robust. The segmentation result is not easily affected by the previous motion state of the target, and the selection strategy adopted in Eq. (10) is reasonable even for long videos, so robustness is still guaranteed for these video challenges. From the expected overlap scores in Figs. 5 and 8, we see that the algorithm is much stronger than the other algorithms in terms of the average expected overlap score, indicating that it has high accuracy compared to the other algorithms. This strength can also be attributed to the segmentation branch we used and to the elliptic-fitting strategy used to optimize the bounding box of the segmentation results.

Figure 4: Expected overlap curve (VOT2016)

Figure 5: Expected overlap score (VOT2016)

Figure 9 shows the real-time performance of this algorithm and other algorithms over multiple video frames, in which the red rotated rectangle shows the output of our algorithm. It can be seen that the algorithm performs well across frames separated by large time intervals, which demonstrates its excellent stability (Figs. 6, 7, 8 and 9).

Figure 6: EAO comparison results (VOT2018)

Figure 7: Expected overlap curve (VOT2018)

Figure 8: Expected overlap score (VOT2018)

Figure 9: Effect display diagram

4.4 Future outlook

Although the algorithm has demonstrated strong advantages in the experiments, remaining robust in the face of various challenges, its performance is still unsatisfactory under changing illumination, as indicated in Table 1. The Kalman filter's prediction of the target state is not accurate enough in this setting, which explains the weaker results: its contribution to the tracker's performance is limited there. In the future, we may make fuller use of spatiotemporal context information, comparing information between successive frames to predict the current state of the target more accurately, and construct a better tracker by combining detection or segmentation methods.

5 Conclusion

The performance of a tracker commonly degrades when it faces a heavily occluded target because effective target features cannot be extracted. In view of this, a spatiotemporal fusion approach to motion target tracking and segmentation is proposed in this study. Based on Siamese networks and segmentation structures, the method utilizes a spatiotemporal motion target tracking model combined with Kalman filtering to mitigate the occlusion problem during tracking, extracting motion features of the tracked target along the time axis and building a motion model of the target over the time series. Because current target tracking methods neglect the importance of an online strategy, we propose using Kalman filtering to construct a motion model of the target and to reasonably predict its motion when the target is missing or heavily occluded for a short period. We use a segmentation network to separate the target from the background to achieve accurate tracking, and an ellipse-fitting strategy to correct the error caused by imprecise segmentation results and to improve the tracker's accuracy. The experiments demonstrate that this method is feasible and achieves excellent results compared with other algorithms. However, problems of insufficient segmentation accuracy and insufficient prediction accuracy remain. From [1, 6, 13, 23, 25, 31, 34], we can foresee that algorithms combining segmentation and tracking will become more pervasive in the future and that target tracking methods combining deep learning with traditional methods [12, 14, 18, 26, 33] have a bright future.

Ahrnbom M, Nilsson MG, Ardö H (2021) Real-time and online segmentation multi-target tracking with track revival re-identification. In: VISIGRAPP, pp 777–784

Bertinetto L, Valmadre J, Henriques JF, et al. (2016) Fully-convolutional siamese networks for object tracking. European Conference on Computer Vision. Springer, Cham, 850–865

Bolme DS, Beveridge JR, Draper BA, et al. (2010) Visual object tracking using adaptive correlation filters. 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE

Cheng S, Zhong B, Li G, et al. (2021) Learning to filter: Siamese relation network for robust tracking. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 4421–4431

Choi J, Jin Chang H, Jeong J, et al. (2016) Visual tracking using attention-modulated disintegration and integration. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4321–4330

Choudhuri A, Chowdhary G, Schwing AG (2021) Assignment-Space-based Multi-Object Tracking and Segmentation. Proceedings of the IEEE/CVF International Conference on Computer Vision, 13598–13607

Danelljan M, Hager G, Shahbaz Khan F, et al. (2015) Learning spatially regularized correlation filters for visual tracking. Proceedings of the IEEE International Conference on Computer Vision, 4310–4318

Danelljan M, Häger G, Khan FS et al (2016) Discriminative scale space tracking. IEEE Trans Pattern Anal Mach Intell 39(8):1561–1575


Danelljan M, Bhat G, Shahbaz Khan F, et al. (2017) Eco: Efficient convolution operators for tracking. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 6638–6646

Gu L, Liu J, Wang C, Cao M (2013) Particle filter tracking based on fragment multi-cue integration. Int J Appl Math Stats: 31–40

Guo D, Wang J, Cui Y, et al. (2020) SiamCAR: Siamese Fully Convolutional Classification and Regression for Visual Tracking. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 6269–6277

Han K, Peng J, Yang Q, Tian W (2021) An end-to-end dehazing Siamese region proposal network for high robustness object tracking. IEEE Access 9:91983–91994

Han W, Lekamalage CKL, Huang GB (2022) Efficient joint model learning, segmentation and model updating for visual tracking. Neural Netw 147:175–185

Han X, Qin Q, Wang Y, et al. (2022) CS-Siam: Siamese-Type Network Tracking Method with Added Cluster Segmentation. International Conference on Advanced Data Mining and Applications. Springer: Cham, 251–262

He K, Gkioxari G, Dollár P, et al. (2017) Mask r-cnn. Proceedings of the IEEE International Conference on Computer Vision, 2961–2969

Henriques JF, Caseiro R, Martins P et al (2014) High-speed tracking with kernelized correlation filters. IEEE Trans Pattern Anal Mach Intell 37(3):583–596

Huang B, Chen J, Xu T, Wang Y, Jiang S, Wang Y, Wang L, Li J (2021) SiamSTA: Spatio-Temporal Attention based Siamese Tracker for Tracking UAVs , Computer Vision Workshops (ICCVW) 2021 IEEE/CVF International Conference on, pp. 1204–1212

Jiang S, Xu B, Zhao J, Shen F (2021) Faster and simpler siamese network for single object tracking.  https://doi.org/10.48550/arXiv.2105.03049

Kiani Galoogahi H, Fagg A, Lucey S (2017) Learning background-aware correlation filters for visual tracking. Proceedings of the IEEE International Conference on Computer Vision, 1135–1143

Kiran M, Nguyen-Meidine LT, Sahay R, Cruz RMOE, Blais-Morin LA, Granger E (2022) Generative target update for adaptive siamese tracking. https://doi.org/10.48550/arXiv.2202.09938

Li B, Yan J, Wu W, et al. (2018) High performance visual tracking with siamese region proposal network. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 8971–8980

Li B, Wu W, Wang Q, et al. (2019) Siamrpn++: Evolution of siamese visual tracking with very deep networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4282–4291

Lukezic A, Matas J, Kristan M (2020) D3S-A Discriminative Single Shot Segmentation Tracker. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7133–7142

Mueller M, Smith N, Ghanem B (2017) Context-aware correlation filter tracking. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1396–1404

Noor S, Waqas M, Saleem MI et al (2021) Automatic object tracking and segmentation using unsupervised SiamMask. IEEE Access 9:106550–106559

Oleksiienko I, Iosifidis A (2022) 3D object detection and tracking. Deep Learning for Robot Perception and Cognition. Academic Press, 313–340

Ren S, He K, Girshick R et al (2016) Faster r-cnn: Towards real-time object detection with region proposal networks. IEEE Trans Pattern Anal Mach Intell 39(6):1137–1149

Song Y, Ma C, Wu X, et al. (2018) Vital: Visual tracking via adversarial learning. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 8990–8999

Wang Q, Zhang L, Bertinetto L, et al. (2019) Fast online object tracking and segmentation: A unifying approach. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1328–1338

Wang N, Song Y, Ma C, et al. (2019) Unsupervised deep tracking. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1308–1317

Wang J, He Y, Wang X, Yu X, Chen X (2019) Prediction-tracking-segmentation. https://doi.org/10.48550/arXiv.1904.03280

Xu J, Xun J et al (2012) Data fusion for target tracking in wireless sensor networks using quantized innovations and Kalman filtering. SCIENCE CHINA Inf Sci 55(03):530–544


Yang D (2022) Research on multi-target tracking technology based on machine vision. Appl Nanosci:1–11.  https://doi.org/10.1007/s13204-021-02293-6

Yao R, Lin G, Xia S, Zhao J, Zhou Y (2020) Video object segmentation and tracking: A survey. ACM Transactions on Intelligent Systems and Technology (TIST):1–47.  https://doi.org/10.1145/3391743

Yin H, Chai Y, Yang SX, Yang X (2011) Fast-moving target tracking based on mean shift and frame-difference methods. J Syst Eng Electron 22(04):587–592

Zhang J, Jin X, Sun J et al (2020) Spatial and semantic convolutional features for robust visual object tracking. Multimed Tools Appl 79(21):15095–15115


Acknowledgments

This research is partially supported by the National Natural Science Foundation of China (61866003).

Author information

Authors and Affiliations

School of Artificial Intelligence, Guangxi Minzu University, Nanning, 530006, China

Jie Wang, Shibin Xuan, Hao Zhang & Xuyang Qin

Guangxi Key Laboratory of Hybrid Computation and IC Design Analysis, Nanning, 530006, China

Shibin Xuan


Corresponding author

Correspondence to Shibin Xuan .

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.



About this article

Wang, J., Xuan, S., Zhang, H. et al. The moving target tracking and segmentation method based on space-time fusion. Multimed Tools Appl 82, 12245–12262 (2023). https://doi.org/10.1007/s11042-022-13703-4

Download citation

Received: 22 December 2020

Revised: 09 March 2022

Accepted: 24 August 2022

Published: 21 September 2022

Issue Date: March 2023

DOI: https://doi.org/10.1007/s11042-022-13703-4


  • Target tracking
  • Kalman filtering
  • Segmentation
  • Elliptic fitting

Assignment-Space-based Multi-Object Tracking and Segmentation (Reading Notes)

奶茶不二

The paper proposes an assignment-space-based method for multi-object tracking and segmentation.

1. Introduction

Classical paradigms for multi-object tracking and segmentation fall into two groups. The first uses the entire sequence, i.e., offline methods that achieve global optimization; the second tracks frame by frame, i.e., online tracking, which may be more practical for multi-object tracking. However, traditional methods track on top of object detections, which leads to a combinatorial growth of options in global tracking and complicates the optimization in online tracking, since a best path must be found for every object. This paper instead proposes tracking over the space of assignments.

Concretely, the Hungarian-Murty algorithm is used to find the top-k assignments of the detections between any two consecutive frames. A structured prediction formulation then scores assignment sequences globally, rather than detection sequences, and dynamic programming is used to find the global optimum. The tracking parameters and the detection/segmentation network are learned jointly. To establish long-term connections, a post-processing step is introduced that constructs long-range links between previously unassigned object detections and detection sequences.

Multi-object tracking methods fall roughly into two classes: batch (offline) methods and online methods.

Batch methods assume that all frames are available to some extent and solve for object trajectories globally. Methods such as hierarchical trajectory association, dynamic programming, and network flows have been proposed.

Online methods: online methods rely on past video frames to estimate the state in the current frame.

2.2 Multi-object tracking and segmentation

Deep segmentation networks perform very well on image segmentation tasks. Recently, mask-based tracking has become increasingly popular because it is more robust than bounding-box-based methods.

Earlier work obtains local assignments frame by frame with the Hungarian algorithm, but a set of best local assignments is not necessarily globally optimal. Dynamic-programming-based methods therefore optimize the best trajectory of each object in the video; however, as noted above, global optimization is difficult when the number of objects is large and unknown. Unlike the methods above, the paper proposes a batch MOTS method that directly infers the assignments of the objects detected and segmented across all video frames by constructing a structured prediction. The authors argue that operating on a pruned assignment space is much simpler than operating on the detection space: the formulation only needs to find a single optimal path.

Structured prediction is used for the joint prediction of multiple variables.

Overview: after a deep network computes the detections and segmentations for T frames, the Hungarian algorithm is used to compute the top-k best assignments of the detections between each of the T-1 pairs of consecutive frames. A structured prediction is then built over the assignment space (the k assignments for each of the T-1 consecutive frame pairs), as described in Sec. 3.1. To solve this formulation and obtain the globally optimal minimum-cost path, dynamic programming is employed; this is discussed in Sec. 3.2. Sec. 3.3 describes the end-to-end learning of the parameters of the structured prediction formulation and of the detection/segmentation deep net. Finally, as illustrated in Fig. 3, long-range assignments are recovered in a post-processing step via an assignment formulation over the space of tracklets obtained from step 1; see Sec. 3.4 for details.


Suppose the deep network produces a set of detections D^t = {d^t_1, d^t_2, ...}. Each detection is associated with a set of node attributes, including the corresponding segmentation mask, an appearance feature vector, the video frame, and the optical flow computed between the previous frame and the current frame. First, the assignment space is constructed using the k best assignments for each of the T-1 consecutive frame pairs; then a cost function over consecutive assignments is proposed.

Constructing the assignment space: formally, a matrix a^t represents the assignment from D^{t-1} to D^t, with the row and column sums of a^t constrained to equal 1. So that detections need not always be assigned, auxiliary detections representing "unassigned" are introduced. Because it is hard to optimize directly over the assignment matrices between consecutive frame pairs, the assignment space between consecutive frames is reduced: for each pair of consecutive frames, the k best possible assignments are found, denoted Y^t, with k_t = min(20, number of possible assignments). Each y^t points to an assignment matrix a^t(y^t) (e.g., y^t = 1 selects the best local assignment).

The constraint set A^t ensures that a^t is a valid assignment by forcing the row and column sums to equal 1. One might argue that the actual assignment space is much larger and that pruning the local space to the 20 best assignments is sub-optimal. Empirically, any k_t ≥ 15 is a reasonable choice: the cost function is well optimized, so the optimal assignment lies within the top 20 best local assignments.
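A minimal sketch of the local step: the Hungarian algorithm yields the single best assignment between the detections of two consecutive frames, and the Hungarian-Murty algorithm enumerates the k best assignments by repeatedly partitioning the solution space around solutions already found. Only the k = 1 core is shown, under the assumption that the cost matrix (including auxiliary "unassigned" rows and columns) has been built beforehand.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def best_local_assignment(cost: np.ndarray):
    """Best (k = 1) assignment between D^{t-1} (rows) and D^t (columns).
    Auxiliary rows/columns for 'unassigned' detections are assumed to have
    been appended to `cost` beforehand, as described in the notes above."""
    rows, cols = linear_sum_assignment(cost)   # Hungarian algorithm
    return list(zip(rows, cols)), float(cost[rows, cols].sum())
```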

Cost function (from d^{t-1} to d^t):

$$c(d^{t-1}_i, d^t_j) = \lambda_{iou}\, f_{iou}(d^{t-1}_i, d^t_j) + \lambda_{app}\, f_{app}(d^{t-1}_i, d^t_j) + \lambda_{dist}\, f_{dist}(d^{t-1}_i, d^t_j) \tag{1}$$

Here, f_iou(d^{t-1}_i, d^t_j) is the intersection over union computed between the segmentation d^t_j and the segmentation d^{t-1}_i warped to frame t using optical flow; RAFT [52] is used to compute the flow. f_app(d^{t-1}_i, d^t_j) denotes the Euclidean distance between the appearance feature vectors of detections d^{t-1}_i and d^t_j, obtained from PointTrack [58]. f_dist(d^{t-1}_i, d^t_j) denotes the Euclidean distance between the bounding-box centers of detections d^{t-1}_i and d^t_j. The parameters λ_iou, λ_app, and λ_dist are trainable.
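As code, Eq. (1) is just a weighted feature combination. The sign convention for the IoU term (higher overlap should mean lower cost) is an assumption here, since the notes only record the general form.

```python
def detection_cost(iou, app_dist, center_dist,
                   lam_iou=1.0, lam_app=1.0, lam_dist=1.0):
    """c(d^{t-1}_i, d^t_j) from Eq. (1): flow-warped mask IoU, appearance
    distance and box-centre distance, combined with trainable weights."""
    return -lam_iou * iou + lam_app * app_dist + lam_dist * center_dist
```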

Figure 2b illustrates the assignment space discussed above. Each possible assignment y^t ∈ Y^t is illustrated by a plate. Note that each plate, i.e., each assignment between two consecutive frames, carries a unary cost when selected:

$$c_u(y^t) = \sum_{i,j} a^t(y^t)_{i,j}\; c(d^{t-1}_i, d^t_j) \tag{2}$$

Intuitively, the costs of all assignments (d^{t-1}_i, d^t_j) indicated by the assignment matrix a^t(y^t) are accumulated. In addition, a pair of consecutive plates (y^t, y^{t+1}), representing the frame pairs (t-1, t) and (t, t+1), carries a pairwise cost c_p(y^t, y^{t+1}), built analogously from second-order terms weighted by λ_iou,2, λ_app,2, and λ_dist,2 (Eq. (3)).

Given the costs defined in Eq. (2) and Eq. (3), the goal is to find the minimum-cost assignment sequence y* = (y^{1,*}, ..., y^{T,*}) ∈ Y = ∏_{t=1}^{T} Y^t, taking exactly one assignment per frame pair:

$$y^* = \arg\min_{y \in \mathcal{Y}} \; \sum_{t} c_u(y^t) + \sum_{t} c_p(y^t, y^{t+1}) \tag{5}$$

Importantly, since the domain of the program given in Eq. (5) is discrete, and the loss L(y) consists only of functions that depend on pairs of consecutive variables (y^t, y^{t+1}) ∀t, classical dynamic programming is directly applicable. It produces the global minimizer y* of the program given in Eq. (5) in polynomial time.
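Because the cost decomposes over consecutive pairs (y^t, y^{t+1}), the global optimum is a shortest path through the plates, computable by a Viterbi-style recursion. A sketch under assumed array shapes (this is not the authors' code):

```python
import numpy as np

def best_assignment_sequence(unary, pairwise):
    """Minimum-cost sequence y* over top-k assignment indices.
    unary[t]    : (k_t,)          unary costs c_u for frame pair t
    pairwise[t] : (k_t, k_{t+1})  pairwise costs c_p between pairs t and t+1"""
    cost = np.asarray(unary[0], dtype=float)
    back = []
    for t in range(1, len(unary)):
        total = cost[:, None] + np.asarray(pairwise[t - 1]) + np.asarray(unary[t])[None, :]
        back.append(total.argmin(axis=0))       # best predecessor for each choice
        cost = total.min(axis=0)
    y = [int(cost.argmin())]
    for bp in reversed(back):                   # backtrack the optimal path
        y.append(int(bp[y[-1]]))
    return y[::-1]                              # globally optimal (y^{1,*}, ..., y^{T,*})
```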

Note that the overall minimizer y* specifies an assignment y^{t,*} for every frame pair (t-1, t); this points to an assignment matrix a^t(y^{t,*}), which in turn points to the chosen assignment between D^{t-1} at time t-1 and D^t at time t.

Note that the program in Eq. (5) is solved during inference and depends on the tracking parameters λ = [λ_iou, λ_app, λ_dist, λ_iou,2, λ_app,2, λ_dist,2]. It also depends on the deep-net parameters θ, i.e., on the detections used in the cost functions given in Eq. (1) and Eq. (4). For readability, the dependence on θ is not made explicit. λ and θ are trained jointly, end to end.

To derive the learning objective, the classical structured-prediction objective is followed: the correctly labeled configuration should have a lower cost than any other configuration. Intuitively, if parameters are found that achieve the lowest cost on a large number of samples, a suitable tracker has been obtained. Hence, following classical structured prediction, an objective is used that linearly penalizes the trainable parameters whenever the learning goal is not satisfied. Moreover, this objective permits back-propagation into the detection/segmentation network. Formally, the learning objective takes the standard structured-hinge form, penalizing configurations y whose cost undercuts that of the ground truth y^GT by less than the task loss Δ(y, y^GT) (Eq. (6)).
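As a sketch of that objective (names and the enumeration over Y are illustrative; in practice the maximization is carried out by loss-augmented inference, not by enumeration):

```python
def structured_hinge(costs, gt_cost, deltas):
    """Structured hinge for cost-based prediction: positive whenever some
    sequence y is not beaten by the ground truth by at least Delta(y, y_GT).
    costs[y] = L(y), deltas[y] = Delta(y, y_GT), gt_cost = L(y_GT)."""
    worst = max(deltas[y] - costs[y] for y in costs)   # loss-augmented inference
    return max(0.0, worst + gt_cost)
```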

The parameters are updated via stochastic gradient descent; the gradient w.r.t. λ is easy to compute (Eq. (7)).

In this case, Δ is obtained by summing the number of wrong assignments per frame over all frames. Formally, this loss is the squared Frobenius norm applied to the difference of the assignment matrices pointed to by y and y^GT, i.e.,

$$\Delta(y, y^{GT}) = \sum_t \left\| a^t(y^t) - a^t(y^{t,GT}) \right\|_F^2$$
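In code, the task loss is a one-liner (assignment matrices assumed given as dense arrays):

```python
import numpy as np

def task_loss(A_pred, A_gt):
    """Delta(y, y_GT): squared Frobenius norm of the difference between the
    assignment matrices selected by y and by y_GT, summed over all frame pairs."""
    return sum(np.linalg.norm(a - g, ord='fro') ** 2 for a, g in zip(A_pred, A_gt))
```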

Using the approach discussed in Sec. 3.2, a path through the assignment space connecting every frame pair is obtained. This path is globally optimal for the formulated costs and results in multiple tracklets in the detection space, as shown in Fig. 3 (left). However, the formulation in Sec. 3.2 cannot recover links between detections that are more than one frame apart. Due to occlusions and missed detections, an object may reappear in the video after several frames. To address this, a global assignment-based method is designed to link the obtained tracklets. This is illustrated in Fig. 3 and described next.

Consider the r tracklets obtained by optimizing the program given in Eq. (5). Note that the r tracklets also include detections that have not yet been assigned. An r × r cost matrix C^lr ∈ R^{r×r} is constructed, whose (i, j)-th element denotes the cost of linking the i-th tracklet to the j-th tracklet. The cost is based on the detections in tracklet i and the detections in tracklet j. Concretely, the cost function is

$$c^{lr}_{i,j} = \lambda_{app,lr}\, f_{app,lr}(i, j) + \lambda_{dist,lr}\, f_{dist,lr}(i, j)$$

Here, f_app,lr(i, j) denotes the average Euclidean distance between the appearance feature vectors of the detections in tracklets i and j, and f_dist,lr(i, j) is the distance between the last detection in tracklet i and the first detection in tracklet j. The appearance features are again obtained from PointTrack [58]. The parameters λ_app,lr and λ_dist,lr are trainable; the learning objective follows Eq. (6). Since no temporally decomposable structure needs to be taken into account, dynamic programming is not required for the long-range assignment. The task loss Δ is either 0 or 1, depending on whether the particular long-range assignment is a valid assignment.

Note that the linking of tracklets should be temporally consistent, i.e., a tracklet ending at or before frame t can only be merged with a tracklet that starts at frame t+1 or later. It is also assumed that objects that have been absent from the scene for more than 40 consecutive frames have left the scene and will not reappear; in other words, a tracklet ending at or before frame t cannot be merged with a tracklet starting after frame t+40. These constraints are enforced by placing a very high cost (10^5) at the corresponding positions of the cost matrix. In Fig. 3, the dark-blue positions in the cost matrix denote temporally inconsistent assignments.

After constructing the cost matrix, the Hungarian algorithm [33] is used to solve this assignment problem. Tracklets are linked according to the assignment solution if the associated cost is below an empirically determined threshold; otherwise the tracklet is assigned a new track ID.
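A sketch of this long-range step, with the temporal-consistency constraints imposed as very high costs before running the Hungarian algorithm (the threshold and the cost construction are assumptions):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def link_tracklets(C_lr: np.ndarray, end_frame, start_frame,
                   threshold: float, max_gap: int = 40, big: float = 1e5):
    """Link tracklet i -> tracklet j when the assignment cost is low enough.
    end_frame[i] / start_frame[j] give each tracklet's last / first frame."""
    C = C_lr.copy()
    r = C.shape[0]
    for i in range(r):
        for j in range(r):
            gap = start_frame[j] - end_frame[i]
            if gap < 1 or gap > max_gap:    # temporally inconsistent link
                C[i, j] = big
    rows, cols = linear_sum_assignment(C)   # Hungarian algorithm [33]
    return [(i, j) for i, j in zip(rows, cols) if C[i, j] < threshold]
```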
