Over the past two decades, the problem of person detection, localization, and
identification has received significant attention from various research communities.
This coincides with the rising demand for information about the positions and
identities of individuals. This demand is driven mainly by the needs of surveillance
and security, intelligent environments, and sports science. In surveillance,
knowing the individuals’ position and identity enables us not only to determine
their presence or absence, but also to analyze their behavior, detect abnormalities
in it, and reconstruct events. Similarly, recovery of the athletes’ trajectories
provides an opportunity for consistent and objective analysis of various game parameters,
such as movement of individual players and whole teams, intensity and
physiological demands of the game, players’ activities, and their adherence to the
predefined strategy.
Various localization solutions have been proposed, based on different sensor
modalities. The two most prominent research areas are detection and tracking
using video cameras, and localization using radio-based technology. Due to
its unobtrusive nature, computer-vision-based multi-view multi-target detection
and tracking presents an especially attractive choice. Advances in this
field have been fostered to a great extent by the proliferation of the so-called
tracking-by-detection paradigm. Under this paradigm, the first step involves independent,
robust detection and localization of the individuals on a frame-by-frame basis. In
the second step, the obtained anonymous detections are linked into trajectories
using a global optimization method. However, in the majority of the multi-view
multi-target tracking approaches, this linking step is done solely based on the spatiotemporal
proximity of the hypothesized detections, with no long-term identity
validation. This may result in the propagation of identity switches when the individuals
come close and then disperse again. Such errors are unacceptable from
the perspective of the end-user application, as the propagation of a single identity
switch effectively renders the subsequent trajectory data invalid, both in terms of
proper localization and the derived motion patterns. The preservation of identity
has become a prominent issue only recently, with the emergence of approaches that extract
and incorporate appearance information in their tracking step.
In this dissertation, we propose to extend the paradigm of tracking-by-detection with the one we call tracking-by-identification. Under the proposed paradigm, the
first step involves detection, localization, and identification of the individuals. Depending
on the quality of the available information, this results in either fully- or
semi-identified detection hypotheses, which helps prevent the propagation
of identity switches. When the identity information is strong enough, it can be
directly used to split the detections; each sequence of the identified detections can
then be separately linked into a trajectory, even using an unmodified tracking approach
that otherwise does not consider any identity information. Alternatively,
the existing tracking approaches that extract and use the appearance information
could be modified to use the more general identity information instead. This
opens the possibility of standardizing the interface between the detection and the
tracking step, while reducing the amount of information required by the tracking
step, such as image data.
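The splitting of strongly identified detections described above can be illustrated with a minimal sketch. It assumes that every detection already carries a single identity label; the `Detection` type, its field names, and the `split_by_identity` helper are hypothetical and serve only as an illustration, not as the implementation used in this dissertation.

```python
from collections import defaultdict
from typing import NamedTuple


class Detection(NamedTuple):
    # Hypothetical record for one fully-identified detection.
    frame: int
    x: float
    y: float
    identity: str  # label assumed to come from the identification step


def split_by_identity(detections):
    """Group detections by identity and order each group by frame number."""
    groups = defaultdict(list)
    for det in detections:
        groups[det.identity].append(det)
    # Each time-ordered group can be handed to a tracker separately.
    return {ident: sorted(dets, key=lambda d: d.frame)
            for ident, dets in groups.items()}


# Two individuals that come close in frame 1 and then swap sides.
detections = [
    Detection(0, 1.0, 1.0, "A"),
    Detection(0, 5.0, 5.0, "B"),
    Detection(1, 3.0, 3.0, "B"),
    Detection(1, 2.9, 3.1, "A"),
    Detection(2, 5.0, 5.0, "A"),
    Detection(2, 1.0, 1.0, "B"),
]
trajectories = split_by_identity(detections)
```

Because the grouping relies on the identity label rather than on spatial proximity, the close encounter in frame 1 cannot cause the two sequences to be exchanged in this toy example.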
Within the context of tracking-by-identification, the presented dissertation
consists of three main scientific contributions. The first is a new methodology for
evaluation of the systems’ performance, which, in contrast to the established ones,
considers the tracking results from the perspective of the end-user. It therefore
discards the notion of tracking and the associated error types, and instead focuses
on the manifestation of such errors in terms of the resulting false positives, false
negatives, and localization error. This is done under several assignment strategies,
which reveal different aspects of the system: detection, localization, and
identification. Therefore, the proposed methodology is applicable both to systems
that perform only detection and localization and to those that also
perform identification. In the latter case, it offers means to analyze the identity
switches both in terms of their duration and the involved individuals, as well as
their actual effect in terms of resulting localization error.
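To make the end-user view of errors concrete, the following minimal sketch counts false positives, false negatives, and localization errors for a single frame. The greedy nearest-neighbour matching and the `max_dist` gating threshold are assumptions chosen for illustration; they stand in for one possible assignment strategy and are not the strategies defined in this dissertation.

```python
import math


def evaluate_frame(ground_truth, detections, max_dist=0.5):
    """Greedily match detections to ground-truth positions within max_dist.

    Returns (false_positives, false_negatives, localization_errors).
    """
    # All candidate pairs, ordered from the closest to the farthest.
    pairs = sorted(
        (math.dist(g, d), gi, di)
        for gi, g in enumerate(ground_truth)
        for di, d in enumerate(detections)
    )
    used_gt, used_det, errors = set(), set(), []
    for dist, gi, di in pairs:
        if dist > max_dist:
            break  # remaining pairs are even farther apart
        if gi in used_gt or di in used_det:
            continue  # each position may be matched only once
        used_gt.add(gi)
        used_det.add(di)
        errors.append(dist)
    false_positives = len(detections) - len(used_det)
    false_negatives = len(ground_truth) - len(used_gt)
    return false_positives, false_negatives, errors


# One detection is close to a person; the other one is spurious.
fp, fn, errors = evaluate_frame([(0.0, 0.0), (2.0, 2.0)],
                                [(0.1, 0.0), (5.0, 5.0)])
```

In this toy frame, the spurious detection counts as a false positive, the unmatched person as a false negative, and the matched pair contributes its distance to the localization error.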
The second contribution encompasses a novel tracking-by-identification approach,
obtained by fusing a commercially available localization solution based
on Ultra-Wideband radio technology and a state-of-the-art computer-vision-based
detection and tracking approach. Using the proposed evaluation methodology, we
thoroughly evaluate both subsystems and obtain insights into their strengths and
weaknesses when used in a realistically cluttered environment. Afterwards, we
fuse the systems by combining the best of both worlds: good camera-based
localization and reliable radio-based identification. The proposed fusion scheme
is shown to outperform its components, both in terms of localization errors and
maintaining the identity of individuals. The multi-modal dataset, used for development
and evaluation of our approach, is also publicly available on our website,
with the aim of sparking further interest in such multi-modal fusion.
The third and final contribution is a novel multi-modal framework for
frame-by-frame person detection, localization, and identification, based on the fusion
of multiple weakly discriminative cues/features. The weakly discriminative cues,
used to distinguish between the individuals, are encoded using feature maps, a
proposed generalization of an occupancy map, which allows consistent aggregation
and encoding of features across the views. The framework builds on two
ideas: the use of multiple weak features and feature fusion performed by one
or more trained classifiers. Experimental evaluation shows that even when the
obtained identity information is not strong (i.e., a detection is assigned multiple
possible identities), it still helps prevent the propagation of identity switches
and improves the tracking results.