How To Train Your Deep Multi-Object Tracker

IEEE Computer Vision and Pattern Recognition

Yihong Xu1, Aljoša Ošep2, Yutong Ban1,3, Radu Horaud1,
Laura Leal-Taixé2, Xavier Alameda-Pineda1
1Inria, LJK, Univ. Grenoble Alpes, France    2Technical University of Munich, Germany
3Distributed Robotics Lab, CSAIL, MIT, USA

arxiv | pdf | HAL | code

  TITLE = {How To Train Your Deep Multi-Object Tracker},
  AUTHOR = {Xu, Y. and Osep, A. and Ban, Y. and Horaud, R. and Leal-Taix{\'e}, L. and Alameda-Pineda, X.},
  BOOKTITLE = {Computer Vision and Pattern Recognition},
  ADDRESS = {Seattle, United States},
  YEAR = {2020},
  MONTH = Jun,
  PDF = {}

Abstract. The recent trend in vision-based multi-object tracking (MOT) is heading towards leveraging the representational power of deep learning to jointly learn to detect and track objects. However, existing methods train only certain submodules using loss functions that often do not correlate with established tracking evaluation measures such as Multi-Object Tracking Accuracy (MOTA) and Precision (MOTP). As these measures are not differentiable, the choice of appropriate loss functions for end-to-end training of multi-object tracking methods is still an open research problem. In this paper, we bridge this gap by proposing a differentiable proxy of MOTA and MOTP, which we combine in a loss function suitable for end-to-end training of deep multi-object trackers. As a key ingredient, we propose a Deep Hungarian Net (DHN) module that approximates the Hungarian matching algorithm. DHN allows estimating the correspondence between object tracks and ground truth objects to compute differentiable proxies of MOTA and MOTP, which are in turn used to optimize deep trackers directly. We experimentally demonstrate that the proposed differentiable framework improves the performance of existing multi-object trackers, and we establish a new state of the art on the MOTChallenge benchmark.


  •  We propose novel loss functions that are directly inspired by standard MOT evaluation measures [1] for end-to-end training of multi-object trackers.

  •  In order to back-propagate losses through the network, we propose a new network module – Deep Hungarian Net (DHN, see below) – that learns to match predicted tracks to ground-truth objects in a differentiable manner.

  • We demonstrate the merit of the proposed loss functions and differentiable matching module by training the recently published Tracktor [2] using our proposed framework. DeepMOT improves the baseline and establishes a new state-of-the-art result on MOTChallenge benchmark datasets [3, 4].



We propose DeepMOT, a general framework for training deep multi-object trackers, including the DeepMOT loss, which directly correlates with established tracking evaluation measures [1]. The key component of our method is the Deep Hungarian Net (DHN, see below), which provides a soft approximation of the optimal prediction-to-ground-truth assignment and delivers the gradient, back-propagated from the approximated tracking performance measures, that is needed to update the tracker weights.

DeepMOT Loss

We approximate the most widely used MOT evaluation metrics, MOTA and MOTP [1], in a differentiable way and use them as the objective functions to optimize a deep multi-object tracker.
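To make concrete what is being approximated, the sketch below computes the standard, non-differentiable MOTA and MOTP from per-frame counts. It is not the DeepMOT loss itself. The `FrameStats` container and the function names are hypothetical, and MOTP is taken here as the mean bounding-box overlap of matched pairs (the MOTChallenge convention, assumed for illustration; [1] states it in terms of a distance).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FrameStats:
    """Per-frame counts after hard matching (hypothetical container)."""
    num_gt: int                # ground-truth objects in the frame
    false_negatives: int       # ground-truth objects left unmatched
    false_positives: int       # predicted tracks left unmatched
    id_switches: int           # matched objects whose track identity changed
    matched_overlaps: List[float]  # overlap of each matched pair (assumed IoU)


def mota(frames: List[FrameStats]) -> float:
    """MOTA = 1 - (FN + FP + IDSW) / GT, with all counts summed over frames."""
    total_gt = sum(f.num_gt for f in frames)
    if total_gt == 0:
        raise ValueError("MOTA is undefined without ground-truth objects")
    errors = sum(
        f.false_negatives + f.false_positives + f.id_switches for f in frames
    )
    return 1.0 - errors / total_gt


def motp(frames: List[FrameStats]) -> float:
    """MOTP as the mean overlap over all matched pairs in all frames."""
    overlaps = [v for f in frames for v in f.matched_overlaps]
    if not overlaps:
        raise ValueError("MOTP is undefined without matched pairs")
    return sum(overlaps) / len(overlaps)


# Two toy frames with four ground-truth objects each.
frames = [
    FrameStats(4, 1, 0, 0, [0.8, 0.7, 0.9]),
    FrameStats(4, 0, 1, 1, [0.6, 0.8, 0.7, 0.9]),
]
print(f"MOTA = {mota(frames):.3f}, MOTP = {motp(frames):.3f}")
```

The error counts are integers produced by a hard matching step, so they carry no gradient with respect to the tracker's outputs. This is why differentiable proxies are needed before these measures can serve as training objectives.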

Deep Hungarian Net: DHN

A key step in computing MOTA and MOTP is using the Hungarian algorithm [5] to determine the optimal assignment between predicted tracks and ground-truth objects. However, the Hungarian algorithm cannot be expressed in analytical form and contains non-differentiable operations. We approximate the Hungarian algorithm with a Bi-RNN-based DHN, as shown in the figure above.
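The contrast between a hard and a soft assignment can be illustrated with a small, self-contained sketch. The first function finds the optimal one-to-one assignment by brute force over permutations, which gives the same optimum as the Hungarian algorithm on small square matrices but is not the Hungarian algorithm itself. The second function is a simple row-wise softmax, used here only to show what a soft assignment matrix looks like; it is not DHN, which is a learned network. All function names and the temperature parameter are assumptions made for this example.

```python
import math
from itertools import permutations
from typing import List

Matrix = List[List[float]]


def optimal_assignment(cost: Matrix) -> List[int]:
    """Return cols such that row i is matched to column cols[i] at minimum
    total cost. Brute force over permutations, feasible for small n only."""
    n = len(cost)
    best = min(
        permutations(range(n)),
        key=lambda p: sum(cost[i][p[i]] for i in range(n)),
    )
    return list(best)


def row_softmax(cost: Matrix, temperature: float = 0.1) -> Matrix:
    """Soft, differentiable row-wise relaxation: every row becomes a
    distribution over columns. It does not enforce one-to-one matching."""
    soft = []
    for row in cost:
        low = min(row)  # subtract the row minimum for numerical stability
        weights = [math.exp(-(c - low) / temperature) for c in row]
        total = sum(weights)
        soft.append([w / total for w in weights])
    return soft


# Rows: predicted tracks, columns: ground-truth objects (toy distances).
cost = [
    [0.1, 0.2, 0.9],
    [0.1, 0.8, 0.9],
    [0.9, 0.9, 0.3],
]
print("hard assignment:", optimal_assignment(cost))
for row in row_softmax(cost):
    print("soft row:", [round(v, 3) for v in row])
```

In this toy matrix the first two tracks both prefer the first ground-truth object, so the row-wise softmax puts most of its mass on the same column for both rows, whereas the optimal assignment resolves the conflict globally. Closing that gap while staying differentiable is the role the text assigns to DHN.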




The project was funded by mobility grants from the Department for Science and Technology of the French Embassy in Berlin (SST) and the French Institute for Research in Computer Science and Automation (Inria). It was also partially funded by the Humboldt Foundation through the Sofja Kovalevskaja Award. We are grateful to the Dynamic Vision and Learning Group, Technical University of Munich, for hosting this work.


[1] Keni Bernardin and Rainer Stiefelhagen. Evaluating multiple object tracking performance: The CLEAR MOT metrics. JIVP, 2008:1:1–1:10, 2008.
[2] Philipp Bergmann, Tim Meinhardt, and Laura Leal-Taixé. Tracking without bells and whistles. ICCV, 2019.
[3] Anton Milan, Laura Leal-Taixé, Ian Reid, Stefan Roth, and Konrad Schindler. MOT16: A benchmark for multi-object tracking. arXiv preprint arXiv:1603.00831, 2016.
[4] Laura Leal-Taixé, Anton Milan, Ian Reid, Stefan Roth, and Konrad Schindler. MOTChallenge 2015: Towards a benchmark for multi-target tracking. arXiv preprint arXiv:1504.01942, 2015.
[5] Harold William Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, pages 83–97, 1955.
