PoseNet: What and Why?

Sujal Paudel
5 min read · May 21, 2021


Pose Estimation refers to computer vision techniques that detect human figures in images and videos, so that one could determine, for example, where someone’s elbow shows up in an image [1]. This technology opens up a whole new world of possibilities, such as building surveillance systems with a far higher level of intelligence. Note that pose estimation is not for recognizing who is in an image; it only detects what pose they are in.

Seventeen different keypoints (body parts) are defined by the PoseNet API, and each comes with a confidence score between 0.0 and 1.0, 1.0 being the highest. The score reflects how certain the PoseNet model is about the keypoint it detected in the image or video. The technicalities of PoseNet can be read further in [2].

As mentioned earlier, the PoseNet API provides us with 17 different outputs, which opens an ocean of opportunities to build several applications. Its primary field of application is a smarter, intelligence-driven form of surveillance. In this write-up, we present how PoseNet can actually be used to figure out people's poses and, from those, determine their actions.

The picture below shows a subject who has just sat down in a chair. When we ran the PoseNet model on this photograph, we got the JSON response shown in the following figure.

Fig1: Man who has just sat in a chair (side view)

Fig2: JSON response for the image above

Explanation of the JSON response:

So, the JSON response contains 17 entries in total, each defining (the shape is sketched in code after this list):

1) Part: the body part detected

2) Position: the position of that body part in 2D space (X and Y coordinates)

3) Score: the model's confidence that the body part was actually detected at that position
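For concreteness, here is a minimal TypeScript sketch of what one such entry looks like; the field values are illustrative, not taken from Fig2.

```typescript
// Shape of a single PoseNet keypoint: one of the 17 entries in the
// response's keypoints array.
interface Keypoint {
  part: string;                        // body part name, e.g. "nose" or "leftEye"
  position: { x: number; y: number };  // pixel coordinates in the 2D image
  score: number;                       // confidence between 0.0 and 1.0
}

// An example entry resembling the response in Fig2 (values are made up):
const example: Keypoint = {
  part: "nose",
  position: { x: 253.4, y: 76.1 },
  score: 0.998,
};
```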

If we look at Fig1, our human intelligence can immediately tell that the left side of the subject is clearly visible while the right side is hidden. For a computer, however, the same judgment is far from trivial. Nevertheless, the PoseNet model provides state-of-the-art detection: it has been trained on large datasets and has the capacity to differentiate between various poses.

If we look at the first two results of the array in Fig2, indexed 0 and 1 and representing the nose and left eye respectively, we see that the PoseNet model reports a confidence score above 0.99, meaning it is almost 100% confident that the nose and left eye are present at the positions given by the X and Y coordinates. Index 2 represents the right eye, where the model reports a confidence score of only 0.66. That is quite low, and it signifies that the model is not confident that the subject's right eye is actually visible in the image. There are places where the model gets confused, but this can, and indeed must, be sorted out, since PoseNet is the most readily available API we can use for pose detection.
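A simple way to act on these scores is to keep only the keypoints that clear a confidence threshold. The sketch below reuses the Keypoint shape from earlier; the 0.7 cutoff is an illustrative choice, not a value prescribed by PoseNet.

```typescript
// Keep only the body parts the model is reasonably sure about.
const CONFIDENCE_THRESHOLD = 0.7; // hypothetical cutoff for this sketch

function visibleParts(keypoints: Keypoint[]): string[] {
  return keypoints
    .filter((kp) => kp.score >= CONFIDENCE_THRESHOLD)
    .map((kp) => kp.part);
}
```

For the response in Fig2, this would keep "nose" and "leftEye" (scores above 0.99) but drop "rightEye" (score 0.66), matching the side-view reading of the image.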

From JSON response to Poses

The response we get is a set of confidence scores, but how can we actually turn those into poses and determine whether the subject is standing, sitting, or in some other position? We can follow several approaches. The most primitive is to reason directly from the confidence scores of the various keypoints: intuitively, a frame with a high confidence score for the nose is one where the subject is facing the camera directly, and the same goes for the eyes. Such pre-determined observations can be used to train a model that finally figures out the pose from the confidence scores.
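As a concrete illustration, here is one possible hand-written rule in TypeScript. The classifyPose helper, the keypoints it picks, and the 0.5 cutoffs are all hypothetical choices for this sketch; PoseNet itself does not classify sitting versus standing.

```typescript
type PoseLabel = "sitting" | "standing" | "unknown";

// Hypothetical heuristic: when a subject sits, the vertical hip-to-knee
// distance shrinks to well under the torso length; when standing, the two
// stay comparable. Image y-coordinates grow downward.
function classifyPose(keypoints: Keypoint[]): PoseLabel {
  const byPart = new Map<string, Keypoint>();
  for (const kp of keypoints) byPart.set(kp.part, kp);

  const shoulder = byPart.get("leftShoulder");
  const hip = byPart.get("leftHip");
  const knee = byPart.get("leftKnee");
  if (!shoulder || !hip || !knee ||
      Math.min(shoulder.score, hip.score, knee.score) < 0.5) {
    return "unknown"; // not enough reliable keypoints to decide
  }

  const torsoLength = hip.position.y - shoulder.position.y;
  const hipToKneeDrop = knee.position.y - hip.position.y;
  return hipToKneeDrop < 0.5 * torsoLength ? "sitting" : "standing";
}
```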

PoseNet For Us and Our System

As we intend to build a position-determining system, this API can take over most of our computationally expensive tasks. Suppose we have a raw feed from surveillance cameras and we intend to apply the PoseNet API to it. From the video feed we can pull frames one by one, treat each frame as an image, and run the PoseNet model on that frame. Much like a carbon footprint, the result can be considered the digital footprint of the subject. Finally, a cumulative result can be compiled over all the poses the subject was in, which helps analyze the subject's behavior (a sketch of this pipeline follows the workflow list below).

At a high level, pose estimation happens in two phases:

1. An input RGB image is fed through a convolutional neural network.

2. Either a single-pose or multi-pose decoding algorithm is used to decode poses, pose confidence scores, keypoint positions, and keypoint confidence scores from the model outputs (both phases are sketched in code below).
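These two phases map onto two calls in the TensorFlow.js PoseNet package. A minimal sketch, assuming an already-loaded HTMLImageElement and the @tensorflow-models/posenet 2.x API:

```typescript
import * as posenet from "@tensorflow-models/posenet";

async function estimate(img: HTMLImageElement) {
  // Phase 1: load() prepares the convolutional network that the input
  // RGB image is fed through.
  const net = await posenet.load();

  // Phase 2: single-pose decoding turns the network outputs into a pose
  // confidence score plus 17 keypoint positions and scores.
  const pose = await net.estimateSinglePose(img, { flipHorizontal: false });
  console.log(pose.score, pose.keypoints);
}
```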

The Workflow Available Today

Currently, the demo provided by the PoseNet community uses a real-time video feed to locate the 17 keypoints of the body, displaying dots at those points and lines between them. This happens in real time. From a more analytical perspective, however, the workflow can be designed around frame-by-frame analysis of the video; this will not run in real time, and indeed it cannot deliver its best results if forced to run in real time. A rough sketch of the overlay drawing follows Fig3.

Fig3: Response that can be observed in real time
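Here is a rough sketch of how such an overlay can be drawn: dots for the 17 keypoints and lines between adjacent ones, on a 2D canvas. It relies on the getAdjacentKeyPoints utility exported by the package; the 0.5 confidence cutoff is again an illustrative choice.

```typescript
function drawPose(pose: posenet.Pose, ctx: CanvasRenderingContext2D) {
  // Dots: one per sufficiently confident keypoint.
  for (const kp of pose.keypoints) {
    if (kp.score < 0.5) continue; // skip low-confidence points
    ctx.beginPath();
    ctx.arc(kp.position.x, kp.position.y, 4, 0, 2 * Math.PI);
    ctx.fill();
  }

  // Lines: connect keypoints that are adjacent in the skeleton.
  for (const [a, b] of posenet.getAdjacentKeyPoints(pose.keypoints, 0.5)) {
    ctx.beginPath();
    ctx.moveTo(a.position.x, a.position.y);
    ctx.lineTo(b.position.x, b.position.y);
    ctx.stroke();
  }
}
```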

The Workflow We Intend (For Analysis, Not Real-Time)

  1. A frame from the video (each frame works as an image); we consider it a still.
  2. The still, when passed to the PoseNet model, gives a JSON response of 17 different values.
  3. Figure out the pose of the subject from the JSON response for that still.
  4. Further analysis thereafter (sketched in code below).
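Putting the four steps together, here is a minimal sketch of the offline pipeline, assuming the frames have already been pulled from the video (for example, by grabbing each frame onto a canvas) and reusing the hypothetical classifyPose helper from earlier.

```typescript
async function analyzeVideo(frames: HTMLImageElement[]) {
  const net = await posenet.load();
  const tally: Record<PoseLabel, number> = { sitting: 0, standing: 0, unknown: 0 };

  for (const frame of frames) {
    // Steps 1-2: each still goes through PoseNet for a 17-keypoint response.
    const pose = await net.estimateSinglePose(frame, { flipHorizontal: false });
    // Step 3: turn the keypoints into a pose label.
    tally[classifyPose(pose.keypoints)] += 1;
  }

  // Step 4: the cumulative tally is the subject's "digital footprint",
  // ready for further behavior analysis.
  return tally;
}
```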

[1] https://www.tensorflow.org/lite/models/pose_estimation/overview#how_it_works

[2] https://www.tensorflow.org/lite/models/pose_estimation/overview
