Abstract: Robots have gradually moved from factory floors into populated spaces, creating a crucial need to endow them with communicative skills. A prerequisite of human-robot communication (or, more generally, interaction) is the ability of robots to perceive their environment: to detect people, to track them over time, and to identify communicative cues such as “who looks at whom” and “who speaks to whom”. We are therefore interested in analysing situations in which several people are present, in understanding their activities, and in estimating who is speaking and who is not. For that purpose we combine computer vision, audio signal processing, and machine learning methods. We briefly present the research that we have carried out on this topic and stress the importance of learning from sensory data.