We exploit the complementarity of audio and visual information to track multiple persons and to assign speech segments to each person over time. The tracker is based on a variational Bayesian formulation that yields a computationally tractable solution. Please visit our research page for more details.
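To give a flavor of how a variational formulation stays tractable, the sketch below shows one soft-assignment update in the style of variational EM: each observation (e.g. a visual detection or a localized sound) is assigned a posterior responsibility over tracked persons, and person positions are re-estimated from those responsibilities. This is a minimal illustration under simplified assumptions (isotropic Gaussians, known number of persons, a single iteration), not the actual model used in this work; all function names here are hypothetical.

```python
import math

def responsibilities(obs, means, var=1.0):
    # Soft-assign each 2-D observation to each person:
    # responsibility r[i][j] is proportional to the Gaussian
    # likelihood of observation i under person j's position.
    out = []
    for x, y in obs:
        lik = [math.exp(-((x - mx) ** 2 + (y - my) ** 2) / (2 * var))
               for mx, my in means]
        total = sum(lik)
        out.append([l / total for l in lik])
    return out

def update_means(obs, resp, n_persons):
    # Re-estimate each person's position as the responsibility-
    # weighted average of the observations (an M-step-like update).
    means = []
    for j in range(n_persons):
        n_j = sum(r[j] for r in resp)
        mx = sum(r[j] * x for r, (x, y) in zip(resp, obs)) / n_j
        my = sum(r[j] * y for r, (x, y) in zip(resp, obs)) / n_j
        means.append((mx, my))
    return means

# Toy example: two clusters of observations, two persons.
obs = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9)]
means = [(1.0, 1.0), (4.0, 4.0)]
resp = responsibilities(obs, means)
means = update_means(obs, resp, 2)
```

Alternating these two closed-form updates is what makes such formulations computationally tractable: no sampling or exhaustive data association is required.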
Acknowledgments: Work funded by the European Union under the ERC Advanced Grant VHIA.