Recently, significant progress has been made in text-based motion generation, enabling the synthesis of diverse, high-quality human motions that conform to textual descriptions. However, generating motions beyond the distribution of the original training datasets, i.e., zero-shot generation, remains challenging. Adopting a divide-and-conquer strategy, we propose a new framework named Fine-Grained Human Motion Diffusion Model (FG-MDM) for zero-shot human motion generation. Specifically, we first leverage a large language model to parse the vague textual annotations of existing datasets into fine-grained descriptions of individual body parts. We then use these fine-grained descriptions to guide a transformer-based diffusion model that additionally adopts a part-token design. Because fine-grained descriptions capture the essence of a motion more closely, FG-MDM can generate human motions beyond the scope of the original datasets. Our experimental results demonstrate the superiority of FG-MDM over previous methods in zero-shot settings. We will release our fine-grained textual annotations for HumanML3D and KIT on the project page.
First, we use ChatGPT to paraphrase a given vague textual description into fine-grained descriptions of different body parts, expanding one concise caption into several part-level sentences. FG-MDM then uses these fine-grained descriptions to guide a diffusion model in generating human motion.
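Below is a minimal sketch of this paraphrasing step, assuming access to the OpenAI chat API. The prompt template, body-part list, and model name are illustrative assumptions, not the authors' exact configuration.

```python
# Sketch of fine-grained paraphrasing with an LLM (assumed prompt and setup).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical body-part decomposition; the actual partition may differ.
BODY_PARTS = ["head", "torso", "left arm", "right arm", "left leg", "right leg"]

PROMPT_TEMPLATE = (
    "Rewrite the following motion description as one short sentence per body part "
    "({parts}), describing what each part does:\n\n\"{caption}\""
)

def paraphrase_caption(caption: str) -> str:
    """Expand a coarse motion caption into per-body-part descriptions."""
    prompt = PROMPT_TEMPLATE.format(parts=", ".join(BODY_PARTS), caption=caption)
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    # e.g. "a person waves with their right hand" -> six part-level sentences
    print(paraphrase_caption("a person waves with their right hand"))
```

The resulting part-level sentences can then be encoded (e.g., with a text encoder such as CLIP's) and attached to the denoiser as separate part tokens, in line with the part-token design mentioned above.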
@article{shi2023generating,
title={Generating Fine-Grained Human Motions Using ChatGPT-Refined Descriptions},
author={Shi, Xu and Luo, Chuanchen and Peng, Junran and Zhang, Hongwen and Sun, Yunlian},
journal={arXiv preprint arXiv:2312.02772},
year={2023}
}