FG-MDM: Towards Zero-Shot Human Motion Generation via Fine-Grained Descriptions

1Nanjing University of Science and Technology · 2Institute of Automation, Chinese Academy of Sciences · 3cheery.ai · 4Beijing Normal University

Abstract

Recently, significant progress has been made in text-based motion generation, enabling the generation of diverse and high-quality human motions that conform to textual descriptions. However, generating motions beyond the distribution of the original datasets, i.e., zero-shot generation, remains challenging. Adopting a divide-and-conquer strategy, we propose a new framework named Fine-Grained Human Motion Diffusion Model (FG-MDM) for zero-shot human motion generation. Specifically, we first parse the vague textual annotations of existing datasets into fine-grained descriptions of different body parts by leveraging a large language model. We then use these fine-grained descriptions to guide a transformer-based diffusion model that adopts a part-token design. Because such descriptions are closer to the essence of motion, FG-MDM can generate human motions beyond the scope of the original datasets. Our experimental results demonstrate the superiority of FG-MDM over previous methods in zero-shot settings. We will release our fine-grained textual annotations for HumanML3D and KIT.

Method Overview


First, we use ChatGPT to paraphrase the given vague textual description at a fine-grained level, expanding the concise text into separate descriptions of different body parts. FG-MDM then uses these fine-grained descriptions to guide a diffusion model for human motion generation. A minimal sketch of each stage follows.
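For concreteness, here is a minimal sketch of how the paraphrasing stage might be scripted against the OpenAI chat API. The body-part list, prompt wording, and model name are illustrative assumptions, not the exact prompt used in the paper.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical body-part breakdown; the paper's actual prompt may differ.
PARTS = ["head", "torso", "left arm", "right arm", "left leg", "right leg"]

SYSTEM_PROMPT = (
    "You describe human motions. Given a short motion caption, rewrite it "
    "as fine-grained descriptions of these body parts: "
    + ", ".join(PARTS) + ". Output one line per part."
)

def paraphrase(vague_description: str) -> str:
    """Expand a concise motion caption into per-body-part descriptions."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # illustrative; any capable chat model works
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": vague_description},
        ],
    )
    return response.choices[0].message.content

print(paraphrase("a person walks forward like a zombie"))

The second stage conditions the diffusion model on these descriptions. The toy PyTorch sketch below illustrates the part-token design mentioned in the abstract: per-part text features are projected into condition tokens and prepended to the noised motion sequence before a transformer encoder predicts the clean motion. The dimensions, token layout, and use of 512-d text features (e.g., from a CLIP text encoder) are our own assumptions, not the released architecture.

import torch
import torch.nn as nn

class PartTokenDenoiser(nn.Module):
    """Toy transformer denoiser conditioned on per-part text tokens."""

    def __init__(self, motion_dim=263, latent_dim=512, text_dim=512,
                 nhead=8, nlayers=4):
        super().__init__()
        self.motion_proj = nn.Linear(motion_dim, latent_dim)
        self.time_embed = nn.Sequential(
            nn.Linear(latent_dim, latent_dim), nn.SiLU(),
            nn.Linear(latent_dim, latent_dim))
        self.text_proj = nn.Linear(text_dim, latent_dim)
        layer = nn.TransformerEncoderLayer(latent_dim, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, nlayers)
        self.out = nn.Linear(latent_dim, motion_dim)

    def forward(self, x_t, t_emb, global_feat, part_feats):
        # x_t: (B, T, motion_dim) noised motion frames
        # t_emb: (B, latent_dim) diffusion-timestep embedding
        # global_feat: (B, text_dim) feature of the full sentence
        # part_feats: (B, P, text_dim) features of the P part descriptions
        cond = torch.cat([
            self.time_embed(t_emb).unsqueeze(1),       # timestep token
            self.text_proj(global_feat).unsqueeze(1),  # global text token
            self.text_proj(part_feats),                # P part tokens
        ], dim=1)
        h = torch.cat([cond, self.motion_proj(x_t)], dim=1)
        h = self.encoder(h)
        # drop the condition tokens; predict the clean motion frames
        return self.out(h[:, cond.size(1):])

model = PartTokenDenoiser()  # motion_dim=263 matches HumanML3D features
x_t = torch.randn(2, 60, 263)                        # 60 noised frames
t_emb = torch.randn(2, 512)
g, p = torch.randn(2, 512), torch.randn(2, 6, 512)   # global + 6 part feats
print(model(x_t, t_emb, g, p).shape)                 # torch.Size([2, 60, 263])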

Stylized Text-to-Motion

Fine-Grained Text-to-Motion

Comparison with MDM and MotionDiffuse

BibTeX

@article{shi2023generating,
  title={Generating Fine-Grained Human Motions Using ChatGPT-Refined Descriptions},
  author={Shi, Xu and Luo, Chuanchen and Peng, Junran and Zhang, Hongwen and Sun, Yunlian},
  journal={arXiv preprint arXiv:2312.02772},
  year={2023}
}