by Wen Guo*, Xiaoyu Bie*, Xavier Alameda-Pineda and Francesc Moreno-Noguer
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2022, New Orleans, USA
[paper] [code] [data]
Abstract. Human motion prediction aims to forecast future poses given a sequence of past 3D skeletons. While this problem has recently received increasing attention, it has mostly been tackled for single humans in isolation. In this paper, we explore this problem for humans performing collaborative tasks: we seek to predict the future motion of two interacting persons given two sequences of their past skeletons.
We propose a novel cross-interaction attention mechanism that exploits the historical information of both persons and learns to predict cross dependencies between the two pose sequences. Since no dataset is available for training on such interactive situations, we collected ExPI (Extreme Pose Interaction), a new lab-based person-interaction dataset of professional dancers performing Lindy Hop dancing actions, which contains 115 sequences with 30K frames annotated with 3D body poses and shapes. We thoroughly evaluate our cross-interaction network on ExPI and show that, in both short- and long-term predictions, it consistently outperforms state-of-the-art methods for single-person motion prediction.
Download Dataset: Please follow the instructions at this link to apply for and download our dataset.
Important note: If you are affiliated with an institution from a country not covered by the GDPR (China, the U.S., etc.; see the full list at the link above), a legal representative of your institution will need to sign the standard contract provided at that link.
Dataset Information
We present the Extreme Pose Interaction (ExPI) Dataset, a new person-interaction dataset of Lindy Hop aerial steps [1]. In these aerials, the two dancers execute distinct movements that require a high level of synchronization. The actions are composed of extreme poses and demand strict, close cooperation between the two persons, which makes them highly suitable for the study of human interaction.
Our dataset contains 2 couples of dancers performing 16 extreme actions, yielding 115 sequences with 30k frames per viewpoint and 60k instances annotated with 3D body poses and shapes.
- Dataset Structure
In the ExPI dataset, 16 different actions are performed, some by both dancer couples and some by only one of them, as shown in Table 1. The first seven aerials (A1~A7) are performed by both couples. Six more aerials (A8~A13) are performed only by Couple 1, while the other three (A14~A16) are performed only by Couple 2.
Each of the actions was repeated five times to account for variability. Overall, ExPI contains 115 short sequences, each one depicting the execution of one of the actions. More precisely, for each recorded sequence ExPI provides:
- Multi-view image sequences at 25 FPS, from all the available cameras.
- Mocap data: the 3D positions of all joints in each frame, at 25 FPS and synchronized with the image sequences (a minimal loading sketch follows this list).
- Camera calibration information.
- 3D shapes as a textured mesh for each frame.
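To make the structure concrete, below is a minimal Python sketch for loading one mocap sequence and splitting it into the two dancers. The file name, the `.npy` array layout, and the helper name are assumptions for illustration only; the actual data format is documented with the dataset and in the code repo.

```python
import numpy as np

NUM_JOINTS = 36  # 18 joints per dancer: leader first (0-17), follower second (18-35)

def load_sequence(path):
    """Load one mocap sequence stored as a (T, 36, 3) array of 3D joint positions."""
    joints = np.load(path)  # hypothetical .npy export of the mocap data
    assert joints.ndim == 3 and joints.shape[1:] == (NUM_JOINTS, 3)
    leader, follower = joints[:, :18], joints[:, 18:]  # split by person
    return leader, follower  # each of shape (T, 18, 3)
```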
- Data Collection
The data were collected on the Kinovis platform [2], which has two acquisition systems: a system of 68 color cameras (4 Mpixels) that provides full shape and appearance information as 3D textured meshes, and a standard motion-capture (mocap) system of 20 cameras that provides infrared-reflective marker trajectories.
We track 18 joints per person. The order of the keypoints is as follows, where “F” and “L” denote the Follower and the Leader respectively, and “f”, “l” and “r” denote “forward”, “left” and “right” (see also Figure 1): (0) `L-fhead`, (1) `L-lhead`, (2) `L-rhead`, (3) `L-back`, (4) `L-lshoulder`, (5) `L-rshoulder`, (6) `L-lelbow`, (7) `L-relbow`, (8) `L-lwrist`, (9) `L-rwrist`, (10) `L-lhip`, (11) `L-rhip`, (12) `L-lknee`, (13) `L-rknee`, (14) `L-lheel`, (15) `L-rheel`, (16) `L-ltoes`, (17) `L-rtoes`, (18) `F-fhead`, (19) `F-lhead`, (20) `F-rhead`, (21) `F-back`, (22) `F-lshoulder`, (23) `F-rshoulder`, (24) `F-lelbow`, (25) `F-relbow`, (26) `F-lwrist`, (27) `F-rwrist`, (28) `F-lhip`, (29) `F-rhip`, (30) `F-lknee`, (31) `F-rknee`, (32) `F-lheel`, (33) `F-rheel`, (34) `F-ltoes`, and (35) `F-rtoes`.
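For convenience, the keypoint ordering above can be encoded directly in code. The sketch below is only a transcription of the documented order; the helper function is ours, not part of the dataset tooling.

```python
# Per-person joint names, in the documented order (indices 0-17 for each dancer).
JOINT_NAMES_PER_PERSON = [
    "fhead", "lhead", "rhead", "back", "lshoulder", "rshoulder",
    "lelbow", "relbow", "lwrist", "rwrist", "lhip", "rhip",
    "lknee", "rknee", "lheel", "rheel", "ltoes", "rtoes",
]
# Global ordering: leader ("L") joints 0-17, then follower ("F") joints 18-35.
JOINT_NAMES = (["L-" + n for n in JOINT_NAMES_PER_PERSON]
               + ["F-" + n for n in JOINT_NAMES_PER_PERSON])

def joint_index(person, name):
    """Map e.g. ("F", "lwrist") to its global keypoint index (here 26)."""
    return {"L": 0, "F": 18}[person] + JOINT_NAMES_PER_PERSON.index(name)
```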
- Data cleaning
When collecting the motion-capture data, some points are missed by the system due to occlusions or tracking losses. To overcome this issue, we manually annotated the missing points in post-processing, and we designed and implemented a 3D hand-labeling toolbox to ease this process. The labeled joints are visualized in 3D and projected onto 2D images from several viewpoints to verify the quality of the annotations by visual inspection.
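This visual-inspection step amounts to a standard pinhole projection of the labeled 3D joints into each camera view. Below is a minimal sketch; the calibration parameters `K`, `R`, and `t` are placeholders, since the actual calibration file format ships with the dataset.

```python
import numpy as np

def project_joints(joints_3d, K, R, t):
    """Project (N, 3) world-space joints to (N, 2) pixel coordinates.

    K: (3, 3) intrinsics; R: (3, 3) rotation; t: (3,) translation,
    assumed to follow the usual x_cam = R @ x_world + t convention.
    """
    cam = joints_3d @ R.T + t        # world -> camera coordinates
    uvw = cam @ K.T                  # camera -> homogeneous pixel coordinates
    return uvw[:, :2] / uvw[:, 2:3]  # perspective divide
```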
- Evaluation
We propose several evaluation metrics for the human motion prediction task in the paper. Please find more details in the paper and the code repo.
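For flavor only, the sketch below computes a generic mean per-joint position error, a common baseline metric in motion prediction. It is not necessarily one of the metrics defined in our paper; please refer to the paper and code repo for the exact definitions.

```python
import numpy as np

def mean_per_joint_error(pred, gt):
    """Generic MPJPE-style error between predicted and ground-truth joints.

    pred, gt: (T, J, 3) arrays of 3D joint positions; returns the mean
    Euclidean distance over all frames and joints (illustrative only).
    """
    return np.linalg.norm(pred - gt, axis=-1).mean()
```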
Reference
If you find our data useful, please cite BOTH of the following sources:
@inproceedings{guo2021multi,
  title={Multi-Person Extreme Motion Prediction},
  author={Guo, Wen and Bie, Xiaoyu and Alameda-Pineda, Xavier and Moreno-Noguer, Francesc},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2022}
}
@misc{expidatarepo,
  title={The ExPI Dataset Repository},
  author={Guo, Wen and Bie, Xiaoyu and Alameda-Pineda, Xavier and Moreno-Noguer, Francesc},
  doi={10.5281/zenodo.7567798},
  url={https://zenodo.org/record/7567798},
  year={2022},
}