Motion Capture and Estimation Using Python

Understanding BVH format and time-series pose classification

Shaashwat Agrawal
The Startup

--

Pose estimation AI (e.g. PoseNet) is becoming more widely accepted and used in our daily lives, from fitness applications to research projects. It has the power to replace much of the capital spent on dedicated hardware and other electronics used for the same task. Take games as an example: to clone the movement of a real-life sportsperson into a game, heavy motion-capture equipment is required. I believe that in the future the same task could be performed efficiently by these neural networks.

In this article, we will cover some basic concepts that lead up to classifying movement: pose estimation, movement, the BVH file format, and the CMU dataset. I will provide examples and visualizations to help you understand each.

Pose Estimation

Image source: http://www.cs.cmu.edu/~ILIM/projects/IM/humanpose/humanpose.html

The primary aim of pose estimation, as the name suggests, is to replicate the pose of a human body (or bodies) in a given scenario. The posture of a body can be represented by a skeleton, a segmentation, or even a skinned linear model. Most algorithms estimate poses by predicting key points of the body that correspond to joints such as the shoulders, hips, elbows, and knees. A human body has roughly 244 degrees of freedom across about 230 joints; to capture a pose perfectly, a system would need to model all of these specifications, which is why perfecting pose estimation AI is still a work in progress.

Algorithm

Image by TensorFlow, from https://medium.com/tensorflow/real-time-human-pose-estimation-in-the-browser-with-tensorflow-js-7dd0bc881cd5

The aim of these algorithms is to find the Cartesian coordinates of the key points in a given image. The task is usually divided into two parts: predicting a probability (heatmap) for each key point, and predicting an offset vector that refines the location of the highest-probability point. Some techniques also use time-series estimation across frames to improve the prediction of the current posture. The input is the image in question, and the output contains the coordinates of the points along with a probability for each key point. The output dimension can vary with the required precision.
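This two-step decoding (pick the highest-probability cell, then refine it with the offset vector) can be sketched with synthetic data. The 9×9 heatmap, the offsets, and the stride of 16 below are all made-up values for illustration, not taken from any real model:

```python
import numpy as np

# Hypothetical PoseNet-style output for ONE key point: a coarse heatmap
# of probabilities plus an offset vector at each cell. Synthetic values.
rng = np.random.default_rng(1)
heatmap = rng.random((9, 9))        # probability per grid cell
offsets = rng.random((9, 9, 2))     # sub-cell (dy, dx) refinement
stride = 16                         # image pixels per heatmap cell

# 1. Take the grid cell with the highest probability.
y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)

# 2. Add the offset vector at that cell to get image coordinates.
py = y * stride + offsets[y, x, 0]
px = x * stride + offsets[y, x, 1]
```

Repeating this for every key point yields the full set of coordinates, each paired with the probability from its winning cell.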

  • Most available PoseNets give 2-dimensional output, with key points determined by their x and y coordinates in the frame.
  • Modern algorithms can predict 3-dimensional outputs as well (i.e. x, y, and z coordinates).
  • Emerging approaches may be able to predict about 6 dimensions, with rotation added around each axis (i.e. 3 translation and 3 rotation values).
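The shapes these three output styles imply can be written down directly; the arrays below are placeholders (all zeros, with an invented confidence score), just to fix the dimensions discussed above:

```python
import numpy as np

# Illustrative shapes for the outputs listed above; values are synthetic.
n_keypoints = 17

pose_2d = np.zeros((n_keypoints, 2))   # (x, y) per key point
pose_3d = np.zeros((n_keypoints, 3))   # (x, y, z) per key point
pose_6d = np.zeros((n_keypoints, 6))   # 3 translations + 3 rotations

# Many models also return a confidence score per key point, letting us
# drop unreliable detections before further processing.
scores = np.full(n_keypoints, 0.9)
threshold = 0.5
reliable = pose_2d[scores > threshold]  # keeps all 17 rows in this toy case
```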

Implementations

Current open-source implementations include TensorFlow's TFLite models, OpenPose, and PyTorch models. Models like the Epipolar approach require a multi-view recording of a body to predict key points in 3-dimensional space. Most algorithms predict about 17 to 19 key points, which give a basic body skeleton; some manage 30 to 40, covering small details like the toes, fingers, eyes, and heels.

The base implementation of fall-detection technology comes from pose estimation. It is used extensively in the fitness industry for posture correction and training, and the recent popularization of gesture recognition and processing is also being aided by this technology. If you've watched Mission Impossible, then you know what pose estimation is capable of 😉.

Research

Current research on pose estimation aims at optimizing the time and computational complexity of PoseNets. When predicting poses in a live video feed, some models are very slow, resulting in low-FPS output. There will always be a trade-off between accuracy and speed, so better algorithms must be formulated. The evolution of pose estimation will also be measured by the number of key points an algorithm can predict, and in what spatial dimension.

Some application-driven advances are being made in this field with regard to pose analysis. A few models today can analyze a person's posture frame by frame and flag conditions such as ataxia or skeletal disorders. Probable future applications include theft detection, behavioral prediction, and so on.

Movement

A series of postures makes up a movement; technically, a movement is a time-series stack of postures with a sequential progression. Although a single posture can describe a body's structure clearly, it cannot convey intent. A movement, on the other hand, both defines the skeleton and tells us what the postures are trying to do. Think of it as the difference between an RNN and an LSTM: increased memory capacity brings increased understanding.

Image source: https://www.arxiv-vanity.com/papers/1702.07486/

How is movement understood? Consider a single posture of shape (17, 2), representing 17 key points of the body with x and y coordinates. In each frame, these 17 coordinates change position dynamically, forming a sequence. It would normally take about 100 frames for anyone to actually recognize the movement, and every motion has different dynamics: running, for example, takes about 60–70 frames to make sense, whereas walking needs at least 100.
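Stacking those per-frame postures gives the tensor a sequence model would consume. Here is a minimal sketch with synthetic data: 100 frames of (17, 2) key points, with one key point moved along a circle to stand in for a real trajectory:

```python
import numpy as np

# A movement as a time series: 100 frames of (17, 2) key points.
# Shapes follow the article; the coordinates here are synthetic.
n_frames, n_keypoints = 100, 17
movement = np.zeros((n_frames, n_keypoints, 2))

# Simulate key point 9 (say, a wrist) tracing a circle over time.
t = np.linspace(0, 2 * np.pi, n_frames)
movement[:, 9, 0] = np.sin(t)   # x-coordinate over the frames
movement[:, 9, 1] = np.cos(t)   # y-coordinate over the frames

# Frame-to-frame displacement is the kind of feature a sequence model
# (RNN/LSTM) would learn from when classifying the motion.
velocity = np.diff(movement, axis=0)
```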

BVH(BioVision Hierarchy)

A posture can simply be plotted and interpreted with a 3D graph, but a movement cannot. For understanding and visualizing movements, there exist various motion-capture file formats such as BVH and FBX. These files define the body structure and how each part is positioned throughout the movement.

BVH is the most widely used motion-capture data format today. Generally speaking, it stores motion data, and it is most often used in animation to drive characters. I have chosen this format because it is easy to obtain and to interpret.

Structure

The structure of a BVH file is quite simple. It has two parts: a header section, which describes the hierarchy and initial pose of the skeleton, and a data section, which contains the motion data. The header always starts with the keyword "HIERARCHY" and defines the skeleton structure: the offset of each child from its parent, where each key point (joint) falls in the hierarchy, and the rotation channels of that joint and its children. The keyword "End Site" marks a node with no children. Each node in the hierarchy is a joint in the body, and the nesting of joints forms the connections of the skeleton.

The second part of the file is the motion data. Each posture in the movement is represented by one element of this section, containing all position and rotation values in the channel order defined by the hierarchy above.
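A minimal, hand-written example makes the two sections concrete. The skeleton below (a root with one spine joint and an end site) and its two frames of motion data are invented for illustration, and the small parser only reads the MOTION section:

```python
# A tiny hand-written BVH file: one root joint with one child.
# Offsets, angles, and frame values are illustrative only.
BVH_TEXT = """HIERARCHY
ROOT Hips
{
    OFFSET 0.0 0.0 0.0
    CHANNELS 6 Xposition Yposition Zposition Zrotation Xrotation Yrotation
    JOINT Spine
    {
        OFFSET 0.0 5.0 0.0
        CHANNELS 3 Zrotation Xrotation Yrotation
        End Site
        {
            OFFSET 0.0 5.0 0.0
        }
    }
}
MOTION
Frames: 2
Frame Time: 0.033333
0.0 0.0 0.0 0.0 0.0 0.0 10.0 0.0 0.0
0.0 0.1 0.0 0.0 0.0 0.0 12.0 0.0 0.0
"""

def parse_motion(text):
    """Return (frame_time, frames) from the MOTION section of a BVH string."""
    lines = text.splitlines()
    start = lines.index("MOTION")
    frame_time = float(lines[start + 2].split(":")[1])
    frames = [[float(v) for v in line.split()]
              for line in lines[start + 3:] if line.strip()]
    return frame_time, frames

frame_time, frames = parse_motion(BVH_TEXT)
# Each frame has 9 values: 6 Hips channels followed by 3 Spine channels,
# exactly matching the channel order declared in the hierarchy.
```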

Implementation

Logically understanding and manipulating BVH is quite simple. The structure of the skeleton and the hierarchy of the joints can even be changed without impacting the overall movement. There exist various Python libraries and implementations that make BVH parsing smooth.

  • bpy — Blender is a famous 3D-creation suite with extensive Python scripting support. It helps with parsing BVH data, converting to and from various formats, and more.
  • bvh-toolbox — A Python package for manipulating frames and the hierarchy, and an easy way to convert BVH files in Python.
  • bvh-converter — We will make extensive use of this package to convert our BVH data into user-friendly CSV files.
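Once a BVH file has been converted to CSV, it can be explored with ordinary tools like pandas. The snippet below fakes a tiny world-position CSV in memory rather than running the converter; the `Time` and `Hips.X/Y/Z` column names are illustrative of that kind of per-frame layout, not copied from a real export:

```python
import io
import pandas as pd

# Stand-in for a converted motion-capture CSV: one row per frame,
# with a timestamp and world-space coordinates for the root joint.
fake_csv = io.StringIO(
    "Time,Hips.X,Hips.Y,Hips.Z\n"
    "0.000,0.0,90.0,0.0\n"
    "0.033,0.5,90.1,0.0\n"
    "0.066,1.0,90.0,0.0\n"
)
df = pd.read_csv(fake_csv)

# Horizontal distance travelled by the root joint across the clip —
# the kind of simple feature a movement classifier could start from.
dx = df["Hips.X"].iloc[-1] - df["Hips.X"].iloc[0]
```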

Note: The hierarchy and motion data are tightly linked through orientation. Even one faulty motion-data element with an incorrect orientation can make the whole animation break down.

CMU Dataset

The Carnegie Mellon University (CMU) dataset is a collection of about 2,500 BVH files covering many types of movements, from simple ones such as jumping, walking, and running to complex ones such as swordplay and swimming. The data, along with the BVH hierarchy the dataset specifies, has set a standard for much of the research in this field, and its complexity and diversity make it adequate for most training needs.

Let us try to visualize some of the movements from this dataset in Blender and see the results. We will start with simple movements like walking and running, then see some complex ones.

a. Running (left), b. Walking (right)

To understand the complexity in movements, we now visualize an example salsa dance. As you can see, even after 500 or so frames it is difficult to identify the exact dance form.

I would like to thank Sagnik Sarkar for his help with the content of this article. Interactive graphs in Plotly and motion-capture visualizations in Blender also helped a lot in explaining the concepts.

With this, we come to the end of the article. I hope you learned something new about motion capture and pose estimation. If you have any doubts or questions, comment below. I will soon write another article based entirely on the code implementation.
