1. Introduction
With advances in GPUs and the increasing performance of computers, users can now experience diverse virtual environments and immerse themselves in virtual characters. It has become essential that characters in virtual space move as they would in reality. Given a limited amount of motion capture data, developing control strategies for characters has been a long-standing problem in character animation. Traditionally, motion generation has been addressed by manually rearranging motion clips or by connecting motion data as nodes in a directed graph structure [1]; for goal-directed motion, planning a valid trajectory for the character, so-called motion planning, is another option. However, these conventional methods cannot produce diverse, complex motions in environments that differ from the one in which the data were captured.
Recent research interest has shifted to deep learning-based methods, which have proven efficient: generative models predict plausible, diverse motion, and reinforcement learning produces task-based locomotion. Imitation learning is one example of such an approach. Multiple studies have applied imitation learning to replicate physics-based motion by mimicking a provided motion database; however, it cannot respond to varied scenarios when the character is exposed to adversarial surroundings. To handle such interactions without explicit programming, Generative Adversarial Imitation Learning [2] is an alternative. Adversarial Motion Prior [3] and Adversarial Skill Embeddings [4] are two notable approaches that apply generative adversarial imitation learning to character animation and have shown remarkable results in generating natural motion. From the perspective of generative modeling, both leverage Generative Adversarial Networks to provide information about the generated (fake) distribution; compared to approaches such as the Variational Autoencoder, this enables the seamless and natural execution of different motions. On top of these data-driven methods, reinforcement learning is integrated so that characters can perform each task while preserving the distinctive features of the mocap data. Section 2 of this paper provides an overall survey of these deep learning methods.
On top of these state-of-the-art methods, we have implemented the downstream tasks of punching and boxing. Our contributions can be summarized as follows:
- We first review several generative model approaches and reinforcement learning methods for generating physics-based tasks. These methods train the system to perform desired tasks by leveraging embedded skills or motion clips.
- We enable our character to perform punching and boxing tasks, and compare two recent models for generating dynamic motions: Adversarial Motion Prior (AMP) and Adversarial Skill Embedding (ASE). We also implemented a boxing task with ASE based on competitive multi-agent reinforcement learning using the TimeChamber framework. Our redesigned reward functions are shown to be effective in several experiments.
2. Related works
In general, character animation can be categorized by its approach: kinematic methods and physics-based methods. The former predicts action spaces consisting of kinematic poses, while the latter produces joint torques through physics simulation. Over the past few years, studies have demonstrated generating kinematic or physics-based motion through neural networks. Kinematic approaches have produced plausible results without relying on physical formulas [5-7], while physics-based approaches have been shown to create more natural motion [8, 9].
Reconstructing motion by conditioning on a latent space or motion manifold is one of the popular data-driven methods in recent work. It is meaningful in that its non-linear deformation represents hidden variables or intrinsic features underlying the data [10]. Generative models produce new variations and transitions of a given character motion by learning a compact latent-space representation of motion data and sampling from it.
Data-driven methods using generative models such as Convolutional Autoencoders [11], the Conditional Variational Autoencoder (CVAE) [12-14], and Normalizing Flows [15] have demonstrated efficient ways of predicting long sequences of motion. Although these methods produce long-term motion, several limitations remain. Because a CVAE depends heavily on the prior distribution of its input data, it may fail to establish connections between different motions, a problem known as posterior collapse. This limitation can arise when synthesizing motion in more complex scenarios, such as punching a random target. To address this, an explicit condition was added to each encoder and decoder to integrate acyclic motions [12], or the motion dataset was preprocessed with a motion graph to overcome the lack of transitions or connectivity between clips [13]. Because of the weak connection between each skill and the latent space, a discriminator was additionally used together with a state-conditioned prior and posterior [14]. This indicates that a standalone CVAE depends heavily on the quantity and distribution of the dataset and, in turn, may fail to create transitions between distinct motions.
Generally, a GAN generates new data samples that closely resemble the ground truth by training a generator against a discriminator until the discriminator can no longer distinguish ground-truth data from generated data. Compared to VAEs, GANs generate diverse, high-quality motions and are well suited for transitioning between distinct actions such as kicking and punching. Inspired by the Generative Adversarial Imitation Learning (GAIL) [2] framework and by motion priors used in pose estimation to measure the similarity between generated motion and ground truth, Adversarial Motion Prior (AMP) [3] was introduced to generate novel motion sequences without explicit information about clip selection and sequencing. In each iteration, the policy first collects trajectories. Unlike the state-action pairs used in GAIL, the discriminator (the motion prior) is trained on state transitions, since only states are observed in the data; it is optimized with a least-squares objective and provides a style reward. The style reward is recorded along each trajectory, and the transitions are stored in a replay buffer, which prevents the discriminator from overfitting and stabilizes training. Once all trajectories are recorded with rewards, those rewards are used to update the policy and value functions; the agent learns to maximize the expected return of its trajectories given a goal. Finally, the discriminator is updated using mini-batches of transitions sampled from the ground-truth data and from the replay buffer.
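As an illustration of the style reward described above, the following is a minimal sketch assuming a PyTorch discriminator `disc` that scores state transitions; the clipped least-squares form follows the AMP formulation, while the tensor shapes and names are illustrative.

```python
import torch

def amp_style_reward(disc, s, s_next):
    """Style reward from an AMP-like discriminator.

    disc      : network scoring a state transition (s, s'), trained with a
                least-squares objective (mocap transitions -> 1, policy -> -1).
    s, s_next : batches of simulated states, shape (N, state_dim).
    """
    with torch.no_grad():
        d = disc(torch.cat([s, s_next], dim=-1))        # (N, 1) scores
        # Clipped least-squares reward: close to 1 when the transition
        # resembles the mocap data, 0 when it is clearly fake.
        r_style = torch.clamp(1.0 - 0.25 * (d - 1.0) ** 2, min=0.0)
    return r_style.squeeze(-1)
```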
Similar to AMP, Adversarial Skill Embedding (ASE) [4] also adopts a discriminator. Unlike AMP, however, ASE proposes two stages: a pre-training stage for a low-level policy conditioned on the state and embedded skills, and a transfer stage in which the embedded skills are reused by a high-level policy. This hierarchical structure enables the policy to learn adequate physics-based actions without learning from scratch, whereas AMP requires a full dataset and a separate policy for each task. In low-level training, the policy π(at|st,z) learns a skill embedding obtained by imitating the ground-truth motions. In this work, the latent space Z is modeled as a hypersphere; because the sphere carries a uniform prior distribution, it helps prevent unnatural regions of the action space from being sampled. In each iteration, a batch of trajectories is collected with the policy conditioned on latents sampled from the unit sphere. The reward at each time step is the sum of a least-squares discriminator term and a latent-reconstruction term from the encoder. The transitions are stored in a data buffer and later used to update the encoder and discriminator by sampling mini-batches. Other recent work has generated novel behavior by using natural language processing: inspired by ASE, PADL [16] provides an effective interface that lets users direct a character's behavior without requiring prior knowledge.
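As a small sketch of the hypersphere prior mentioned above: sampling a Gaussian vector and normalizing it yields latents distributed uniformly on the unit sphere (the latent dimensionality below is an arbitrary placeholder).

```python
import torch

def sample_skill_latents(num_envs: int, latent_dim: int = 64) -> torch.Tensor:
    """Sample skill latents z uniformly from the unit hypersphere, as in
    ASE-style pre-training: draw a Gaussian vector and project it onto the
    sphere by normalizing."""
    z = torch.randn(num_envs, latent_dim)
    return z / torch.norm(z, dim=-1, keepdim=True)
```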
Controlling characters toward purposeful motion is one of the challenges in computer animation. Planning trajectories for characters [17] and supervised learning methods [18] have shown promising results in kinematic animation. In physics-based animation, reinforcement learning is widely used, whether for a pure task objective [9], for imitating reference data [8], or for synthesizing motion while optimizing controllers [19]. This is because motion from a dynamic model passes through the simulator in one direction only, so the non-differentiable prediction cannot be reused directly for optimization [20].
Multi-agent reinforcement learning (MARL) develops algorithms for multiple agents that learn and interact with one another in a shared environment. Each autonomous agent receives rewards based on its individual actions and the collective state of the multi-agent environment while cooperating or competing with others. Several MARL works have demonstrated competitive policies for physics-based character control [21], as well as for hide-and-seek games [22]. For human-like results, Won et al. [23] first imitate the motion and then train two agents in a competitive scenario. Similarly, TimeChamber [24], a self-play framework for multiple agents inspired by ASE, trains two agents to discover skills while being constrained by a low-level policy trained beforehand in ASE's pre-training stage. The transfer stage is based on a classic PPO self-play algorithm, which lets multiple agents learn by playing against each other and improve during training.
3. Experiments
First, we collected related mocap data from the CMU dataset [25] for our tasks. Our database ranged from simple locomotion to boxing actions. The mocap dataset was retargeted to fit the existing humanoid model: irrelevant joints were dropped while joints with matching names were kept, reducing the 31 joints in the original data to 28. Inspired by Won et al. [23], both hands of the humanoid are scaled to 1.75 times their original size (see Figure 1).
The data were then weighted according to motion type to balance the overall distribution. The weights were set as a ratio relative to the largest motion type. Suppose there are two data types, A and B, with type A larger than B; the weight for each type is then given by the rule below.
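The weight formula itself did not survive extraction; a reconstruction consistent with Table 1 (248.46 s / 74.11 s ≈ 3.4) is:

$$ w_{B} = \frac{L_{A}}{L_{B}}, \qquad w_{A} = 1.0, $$

where $L_{A}$ and $L_{B}$ denote the total clip lengths of types A and B.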
Table 1 summarizes the type of motion data and weights that were used in training. The same datasets were used in each task.
| Dataset | Clips | Length (s) | Weight |
|---|---|---|---|
| Locomotion | 17 | 248.46 | 1.0 |
| Boxing | 5 | 74.11 | 3.4 |
The character punching tasks consist of two independent scenarios: run straight forward and then punch (punching task), or run toward the target with a random facing direction (joystick task). Both scenarios use the same mocap data and the same environment. The character is directed to attack a plain cuboid target measuring 0.4 m x 0.4 m x 1.8 m, placed randomly within a range of 5 to 10 m.
Using AMP, we designed a new environment in which the character punches randomly placed targets. Unlike previous AMP work using the Bullet physics engine [26], we conducted experiments with the publicly available AMP implementation built on Isaac Gym [27], a GPU-based physics simulator, running 4096 parallel environments. A single NVIDIA RTX 2080 Ti GPU was used. We also trained ASE for comparison with AMP, using the same motion assets and environment for its low-level policy. Unlike the AMP framework, ASE employs the low-level policy in a transfer stage, where a high-level policy learns a meaningful latent vector from the state, goal state, and reward. Training took approximately 10 days for pre-training and 22 hours for transfer, both on a single NVIDIA RTX 2080 Ti GPU.
Inspired by the task weighting approach in PADL [16], we departed from static weights for the task and the discriminator. Instead, we adopted dynamic task weights during ASE training, regulated in a manner similar to a proportional-derivative (PD) controller. This adaptive scheme balances task and skill rewards and promotes training efficiency.
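A minimal sketch of the kind of PD-style weight update described here; the target ratio, gains, and clamping range are illustrative assumptions rather than the values actually used.

```python
def update_task_weight(w_task, task_reward, style_reward, prev_error,
                       target_ratio=0.5, kp=0.01, kd=0.005):
    """PD-style adjustment of the task-reward weight.

    The error is the gap between the desired and current share of the task
    reward in the combined return; the weight is nudged by a proportional
    and a derivative term, then clamped to [0, 1].
    """
    total = task_reward + style_reward + 1e-8
    error = target_ratio - task_reward / total
    w_task += kp * error + kd * (error - prev_error)
    return min(max(w_task, 0.0), 1.0), error
```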
Our observation spaces for the two scenarios are similar. In the joystick task we additionally append the local target moving direction and facing direction. The target moving direction is the local direction from the character's initial position to the target cuboid, while the target facing direction is a random local facing direction that remains fixed until the environment resets. Both directions are projected onto the ground. Table 2 details the continuous observation spaces used.
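As a sketch of how such a ground-projected local direction might be computed (the quaternion-rotation helper and tensor shapes are assumptions):

```python
import torch

def local_ground_direction(root_pos, root_rot_inv, target_pos, quat_rotate):
    """Direction from the character root to the target, expressed in the
    character's local frame and projected onto the ground plane.

    root_pos, target_pos : (N, 3) world positions
    root_rot_inv         : (N, 4) inverse root-rotation quaternions
    quat_rotate          : helper that rotates vectors by quaternions
    """
    to_target = target_pos - root_pos
    to_target[..., 2] = 0.0                              # drop the vertical component
    to_target = torch.nn.functional.normalize(to_target, dim=-1)
    return quat_rotate(root_rot_inv, to_target)          # world -> local frame
```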
To strike the target in a natural way, we divide the task into a two-step behavior, as suggested by AMP. If the distance between the target location and the character's root position exceeds 1.2 m, the character first approaches the target; once within this threshold, it strikes the target. When punching, we focus solely on the target's up-vector to determine whether it has been struck. Based on the reward design of AMP's strike downstream task, we redefined and improved rfar to reflect the character's relative position, velocity, and facing direction; these terms are multiplied together in rfar. Compared to the previous return, a facing term is additionally appended. Our task-related reward function is defined as follows:
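The equation itself did not survive extraction; a reconstruction consistent with the 1.2 m threshold and the near/far split described above would be:

$$
r_{G}(s_t, g) =
\begin{cases}
r_{\text{near}}, & \lVert x_{\text{tar}} - x_{\text{root}} \rVert < 1.2\,\text{m} \\
r_{\text{far}}, & \text{otherwise.}
\end{cases}
$$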
In rnear, the reward motivates the character's hands to reach and strike the target. The first term rdist encourages the character to punch the target with its hand, while rvel constrains the glove's striking velocity. The larger rnear between the left and right gloves is then selected. Here, d is a unit vector in the direction from the character to the goal.
On the other hand, rfar stimulates the character to move closer to the target; both the punching task and the joystick task share the same reward function. Overall, rfar is maximized and confined to values up to 1.0. Equivalent to the first term in rnear, the rdist term in rfar encourages the character to approach the target.
The second term rvel diminishes the error between the character's root velocity velroot along the goal direction d* and the target speed vel* (1.2 m/s). errvel is the tangential speed error, where velroot is the root velocity projected onto the ground.
To make the character directly face the desired direction, we added the facing reward rfacing, which minimizes the error between the target facing direction d^ and the character's current heading direction d̄. Both directions are normalized. For an ablation study of the facing reward, see Figure 3.
In short, rfar maximizes the overall reward by minimizing the distance between the character and the target (rdist), closely matching the specified speed with the root's linear velocity (rvel), and ensuring the character faces the target (rfacing). By satisfying all these conditions, the character is prompted to punch the given cuboid.
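A plausible form of rfar consistent with the product structure described above (the exponential shaping and the scale constants α, β, γ are assumptions, not values given in the text):

$$
r_{\text{far}} = \exp\!\left(-\alpha \lVert x_{\text{tar}} - x_{\text{root}} \rVert^{2}\right)
\cdot \exp\!\left(-\beta \, \mathrm{err}_{\text{vel}}^{2}\right)
\cdot \exp\!\left(-\gamma \lVert \hat{d} - \bar{d} \rVert^{2}\right),
$$

so that each factor lies in (0, 1] and rfar approaches 1.0 only when all three conditions are satisfied.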
For this task we applied TimeChamber, a framework in which multiple agents compete with each other as in boxing. Based on ASE, the competitive policy learned with the discriminator is transferred as a high-level policy for each agent, so both agents learn tactics against their opponent on top of basic skills. As in the previous task, our boxing task was trained in Isaac Gym with 4096 parallel environments on a single GPU. Motivated by ASE's task training, TimeChamber learns latent information adequate to the situation. For diversity during boxing matches, TimeChamber maintains a pool of opponent players, prioritized and sorted by winning rate, where each entry consists of a high-level policy, a winning rate, and an environment index. During training, opponents are sampled according to their winning rate and allocated to each parallel environment; at the end of each game, the winning rates in the player pool are updated. This relies on the ELO rating system, which estimates the relative skill levels of two players.
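A minimal sketch of the standard ELO update used to maintain such a player pool (the K-factor and rating scale are the usual defaults, assumed here rather than taken from TimeChamber):

```python
def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Update ELO ratings after one match.

    score_a is 1.0 if player A wins, 0.5 for a draw, and 0.0 for a loss.
    Returns the updated ratings of both players.
    """
    expected_a = 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))
    rating_a_new = rating_a + k * (score_a - expected_a)
    rating_b_new = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return rating_a_new, rating_b_new
```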
Compared to a single-agent environment, the observation space in the competitive environment additionally contains the opponent's behavior state in the local frame. The observation size is 65 in total.
Our boxing strategy is akin to that of the punching task: when the opponent is far away, the player approaches it; the player then maneuvers to strike down the opponent for a high return.
When the players approach each other, rnear makes the character move closer to the opponent (rpos) while maintaining a target speed of 1.2 m/s (rvel). rfacing makes the players face forward, since attacking an opponent from behind is considered a violation in boxing (see Figure 9); it minimizes the error between the target facing direction d^ and the character's current heading direction d̄.
Unlike the punching task, a penalty is added to rnear if the player has fallen or crossed the arena boundary. A facing-direction penalty has also been added, reflecting boxing regulations: without it, the player tends to attack the opponent while holding it down from behind. If the opponent falls, the player receives a reward of 500. The overall reward function is similar to that of the strike downstream task from TimeChamber. Table 3 lists the weights used in the reward function.
| Weight | Value |
|---|---|
| wdamage | 1.0 |
| wclose | 4.0 |
| wfacing | 10.0 |
| wfall | 200.0 |
| wenergy | 0.001 |
| wpenalty | 30.0 |
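A plausible combination of these terms, assuming a weighted sum as in TimeChamber's strike task (the exact composition is not reproduced in the text); the individual terms are described below:

$$
r = w_{\text{damage}}\, r_{\text{damage}} + w_{\text{close}}\, r_{\text{close}} + w_{\text{facing}}\, r_{\text{facing}} + w_{\text{fall}}\, r_{\text{fall}} - w_{\text{energy}}\, r_{\text{energy}} - w_{\text{penalty}}\, r_{\text{penalty}}.
$$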
rdamage measures how much the player has been damaged by force. All forces are the normal contact forces applied to the player (Fop→ego) or to the opponent (Fego→op).
rclose motivates the character to punch the opponent's target points: the torso and the head. rclose is identical to the rnear used in the punching task. rfacing encourages the player and the opponent to face each other; d denotes the opponent's local facing direction toward the player, while d^ is the character's local facing direction toward the opponent, both normalized. The previous facing reward from Won et al. [23] makes it difficult for characters to perform proper boxing maneuvers: it only considers whether the character faces the target and disregards the opponent's heading direction, which allows the agent to attack from the opposite direction (see Figure 9).
If the opponent is about to fall, rfall computes the error between the opponent's up-vector vop and the global up-vector vup. To restrain the character from attacking from behind, the fall reward is given only when the players face each other. renergy induces the character to behave efficiently and less aggressively, where ajoint is the angular acceleration of the character's joints and l is the distance between the players.
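Plausible forms of these two terms, consistent with the description above but with the exact shaping left as an assumption:

$$
r_{\text{fall}} \propto \lVert v_{\text{up}} - v_{\text{op}} \rVert^{2}, \qquad
r_{\text{energy}} \propto f(l)\sum_{j} \lVert a_{\text{joint},j} \rVert^{2},
$$

where f(l) is a weighting that depends on the distance l between the players, and the fall reward is granted only while the players face each other.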
4. Results
Compared to the previous AMP environment, which required approximately 140 hours of training, our setup needed about 8 hours, a significant reduction in training time. This is achieved by training with parallel environments, which allows the network to collect data faster. Not only does this improve training time, it also improves the overall correlation between the multiple trajectories in the dataset. Snapshots of our results are shown in Figure 2.
If the facing term is ignored, the character tends to punch the target regardless of its facing direction, which leads to an unnatural style of locomotion: without the term, the character keeps its initial facing direction while approaching (see Figure 3). Throughout the experiments, the facing-direction reward proved significant, as it allows the character to retain its initial random facing while moving toward the target in the joystick task (see Figure 4). Although the facing error requires tuning, it remains important. Overall, the learning curves increase gradually in both tasks.
In the joystick task, the character manages to face the desired direction while heading toward the target. In Figure 4, the red and blue arrows point to the target and the desired heading direction, respectively.
Both tasks' learning curves increased gradually during training, which indicates that the character behaves so as to obtain maximum reward. Figure 5 shows the learning curves of the mean reward on each task.
Our ASE experiment on the punching task also fulfills the objective. However, while the character moves toward the target, it behaves less naturally than in the AMP result. This may be caused by exploitation of the discriminator: the character executes only a subset of skills because the given latent information yields the highest reward (see Figure 6). Furthermore, the data used in this work contained only essential motions due to limited GPU resources. Potential strategies to address this include training the low-level policy on a more diverse dataset or assigning distinct latent vectors to each individual dataset; both aim to mitigate mode collapse. Overall training took approximately 22 hours. Figure 7 shows the learning curves of the average task return and the task weight, which is updated automatically via the PD controller to balance the task and style reward weights.
Training the character boxing task took around 6 to 7 days. Multi-agent reinforcement learning takes longer due to its non-stationarity and increased complexity; the balance between exploration and exploitation may also contribute. Hence, learning curves in MARL may not increase monotonically, especially in competitive, zero-sum game environments. Meaningful skills emerged as the number of epochs increased: after training for 40000 epochs, the characters attacked and defended themselves naturally (see Figure 8).
5. Conclusion
In this paper, we explored effective frameworks from recent studies through our character punching and boxing tasks, comparing two existing methods: AMP and ASE. Without requiring explicit conditioning in the form of sequences or labels, both models exhibit natural transitions between locomotion and punching.
However, AMP is limited in handling various tasks with a single motion prior. Since the discriminator alone does not encode implicit information in a latent space that would allow distinct behaviors to be recovered for each dataset, it must be trained from scratch for each individual task. AMP is also prone to mode collapse when the task objective is vague, leading to repetition of specific behaviors. ASE, on the other hand, uses a low-level policy to learn a latent space of meaningful skills. During pre-training, the low-level policy is trained to imitate and balance behaviors from the dataset; when training the task-oriented high-level policy, the low-level policy constrains the set of possible actions, which reduces the exploitation of abnormal movements compared to AMP. Still, ASE is also prone to mode collapse, in that the combined reward between the latent information of the corresponding trajectory and the discriminator may exploit only a portion of the skills. PADL addresses this by allocating unique latent information to each motion dataset. Motivated by PADL's task weighting, instead of using fixed weights for the task and the discriminator, we implemented task weights that are adjusted in a manner similar to a PD controller.
We also demonstrated a competitive boxing task performed by two characters via TimeChamber, an ASE-based framework. Due to the complex nature of competitive MARL, training took longer than in the single-agent environment. Balancing exploration and exploitation is also intricate, as the agents observe moving opponents while maximizing their return. As the training epochs increase, the characters become capable of more meaningful and sophisticated skills. However, as noted in ASE, some movements, such as "walking" toward the opponent, are not natural.
For future work, implementing a diffusion model could be a promising direction. Diffusion models are popular in recent work and generate high-quality kinematic motions through stochastic processes [28-30]; physics-based motion controllers based on a pre-trained diffusion model also exist [31]. However, most state-of-the-art diffusion models aim to control a single character. It would be intriguing to build an adversarial environment in which multiple agents interact while the user controls characters through natural language prompts. We look forward to building interesting environments on top of generative models.