FG-MDM: Towards Zero-Shot Human Motion Generation via ChatGPT-Refined Descriptions

ICPR 2024
¹Nanjing University of Science and Technology  ²Shandong University  ³Institute of Automation, Chinese Academy of Sciences  ⁴Beijing Normal University

Abstract

Recently, significant progress has been made in text-based motion generation, enabling the generation of diverse and high-quality human motions that conform to textual descriptions. However, generating motions beyond the distribution of the original training datasets, i.e., zero-shot generation, remains challenging. Adopting a divide-and-conquer strategy, we propose a new framework named Fine-Grained Human Motion Diffusion Model (FG-MDM) for zero-shot human motion generation. Specifically, we first leverage a large language model to parse the vague textual annotations of existing datasets into fine-grained descriptions of different body parts. We then use these fine-grained descriptions to guide a transformer-based diffusion model that further adopts a design of part tokens. Because these fine-grained descriptions are closer to the essence of motions, FG-MDM can generate human motions beyond the scope of the original datasets. Our experimental results demonstrate the superiority of FG-MDM over previous methods in zero-shot settings. We will release our fine-grained textual annotations for HumanML3D and KIT on the project page.
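The abstract mentions a transformer-based diffusion model with a design of part tokens. As a rough, non-authoritative illustration of what such conditioning could look like, the Python sketch below fuses one learnable token per body part with the encoded fine-grained description of that part before passing the whole sequence through a transformer encoder. The six-part split, all dimensions (263 is the HumanML3D feature size), and the conditioning scheme are assumptions made for the sake of a runnable example, not the paper's actual implementation.

import torch
import torch.nn as nn

NUM_PARTS = 6  # hypothetical split: head, torso, two arms, two legs

class PartTokenDenoiser(nn.Module):
    """Illustrative transformer denoiser conditioned via learnable part tokens.

    NOT the authors' implementation; dimensions and the conditioning scheme
    are assumptions chosen to keep the example self-contained and runnable.
    """

    def __init__(self, motion_dim=263, latent_dim=256, text_dim=512, num_layers=4):
        super().__init__()
        self.motion_proj = nn.Linear(motion_dim, latent_dim)
        self.text_proj = nn.Linear(text_dim, latent_dim)
        # One learnable token per body part; each is fused with the text
        # feature of the corresponding fine-grained part description.
        self.part_tokens = nn.Parameter(torch.randn(NUM_PARTS, latent_dim))
        layer = nn.TransformerEncoderLayer(latent_dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.out = nn.Linear(latent_dim, motion_dim)

    def forward(self, x_t, t_emb, part_text_feats):
        # x_t:             (B, T, motion_dim)        noisy motion at step t
        # t_emb:           (B, latent_dim)           diffusion timestep embedding
        # part_text_feats: (B, NUM_PARTS, text_dim)  encoded per-part descriptions
        bsz = x_t.size(0)
        parts = self.part_tokens.unsqueeze(0).expand(bsz, -1, -1)
        parts = parts + self.text_proj(part_text_feats)
        seq = torch.cat([t_emb.unsqueeze(1), parts, self.motion_proj(x_t)], dim=1)
        hidden = self.encoder(seq)
        # Discard the conditioning tokens; predict the clean motion per frame.
        return self.out(hidden[:, 1 + NUM_PARTS:])

if __name__ == "__main__":
    model = PartTokenDenoiser()
    x0 = model(torch.randn(2, 60, 263), torch.randn(2, 256),
               torch.randn(2, NUM_PARTS, 512))
    print(x0.shape)  # torch.Size([2, 60, 263])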

Method Overview


First, we adopt ChatGPT to paraphrase the given vague textual description at a fine-grained level, expanding the concise text into descriptions of different body parts. FG-MDM then uses these fine-grained descriptions to guide a diffusion model for human motion generation.
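As a concrete sketch of this paraphrasing step, the snippet below prompts an LLM through the OpenAI Python client to expand a short motion caption into per-part descriptions. The prompt wording, the six-part split, and the model choice are illustrative assumptions; the exact prompt used by the authors may differ.

from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical body-part split; the paper's actual split may differ.
PARTS = ["head", "torso", "left arm", "right arm", "left leg", "right leg"]

def refine_description(text: str) -> str:
    """Ask the LLM to expand a vague motion caption into per-part detail."""
    prompt = (
        f"Describe how each body part ({', '.join(PARTS)}) moves when a "
        f'person performs the following action: "{text}". '
        "Give one short sentence per body part."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(refine_description("a person waves while walking forward"))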

Stylized Text-to-Motion

Fine-Grained Text-to-Motion

Comparison with MDM and MotionDiffuse

BibTeX

@article{shi2023generating,
  title={Generating Fine-Grained Human Motions Using ChatGPT-Refined Descriptions},
  author={Shi, Xu and Luo, Chuanchen and Peng, Junran and Zhang, Hongwen and Sun, Yunlian},
  journal={arXiv preprint arXiv:2312.02772},
  year={2023}
}