Real-Time Gesture Recognition Using GOOGLE’S MediaPipe Hands — Add Your Own Gestures [Tutorial #1]

Vaibhav Mudgal
9 min read · Jun 12, 2021
AI vs Brain

Artificial Intelligence is improving day by day. We are becoming smarter, and our machines are becoming smarter still with each passing moment. Object detection has been a hot topic for many years now, and there are plenty of Machine Learning and Deep Learning models with near-perfect accuracy. But that perfection comes at a price: high computational power. The need for high-performance machines is not a new problem for us.

Our goal is to get close to 100% accuracy, and there are two ways to accomplish it:

  1. High Performance Machines
  2. Optimized Algorithms/Models

This article is all about implementing an optimized model for detecting hand gestures.

1. WHAT IS MEDIAPIPE?

MediaPipe is a cross-platform framework, created by Google, for building multimodal applied machine learning pipelines. It provides cutting-edge ML models such as:

  • Face Detection
  • Multi-Hand Tracking
  • Human Pose
  • and many more

For more MediaPipe solutions, visit https://google.github.io/mediapipe/solutions/solutions.html.

And to read more about MediaPipe Hands, visit https://google.github.io/mediapipe/solutions/hands.

MediaPipe models are built to run on machines with minimal computing power. Below are the performance characteristics (real-time FPS) of the MediaPipe Facemesh model:

Image From: https://blog.tensorflow.org/2020/03/face-and-hand-tracking-in-browser-with-mediapipe-and-tensorflowjs.html

In this article, we’ll use MediaPipe Hands to detect Hand Landmarks and will add a few gestures based on these landmarks to the model.

2. MEDIAPIPE HANDS

Hand landmarks are the joints of the fingers as well as the fingertips. MediaPipe Hands detects the 21 landmarks shown below.

Image From: hand_landmarks.png

The MediaPipe code returns the normalized coordinates of these 21 landmarks.

Firstly, install the MediaPipe Python library from the terminal:

pip install mediapipe
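
The hands module also exposes the 21 landmark indices as a named enum, mp.solutions.hands.HandLandmark, which doubles as a quick check that the install worked:

import mediapipe as mp

# Print the index and name of each of the 21 hand landmarks
for lm in mp.solutions.hands.HandLandmark:
    print(lm.value, lm.name)   # 0 WRIST, 1 THUMB_CMC, ..., 20 PINKY_TIP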

Below is the implementation of MediaPipe Hands from https://google.github.io/mediapipe/solutions/hands.

Importing MediaPipe, OpenCV and NumPy, and creating shortcuts for the hands module and the drawing utilities (both are used later):

import mediapipe as mp
import cv2
import numpy as np

mp_hands = mp.solutions.hands
mp_drawing = mp.solutions.drawing_utils

Capturing webcam video using OpenCV:

cap = cv2.VideoCapture(0)
with mp_hands.Hands(
        min_detection_confidence=0.5,
        min_tracking_confidence=0.5) as hands:
    while cap.isOpened():
        success, img = cap.read()
        if not success:
            print("Ignoring empty camera frame.")
            continue
        image = img.copy()

If loading a video, use ‘break’ instead of ‘continue’ in the above code.

Flipping the image horizontally for a later selfie-view display, and converting the BGR image to RGB:

image = cv2.cvtColor(cv2.flip(image, 1), cv2.COLOR_BGR2RGB)

To improve performance, optionally mark the image as not writeable to pass by reference:

image.flags.writeable = False
results = hands.process(image)

Draw the hand annotations on the image:

image.flags.writeable = True
image = cv2.cvtColor(image, cv2.COLOR_RGB2BGR)
if results.multi_hand_landmarks:
    for hand_landmarks in results.multi_hand_landmarks:
        mp_drawing.draw_landmarks(
            image, hand_landmarks, mp_hands.HAND_CONNECTIONS)
cv2.imshow('MediaPipe Hands', image)
if cv2.waitKey(5) & 0xFF == ord("q"):
    break

cap.release()   # after the loop ends

3. METHODS USED TO EXTRACT AND CALCULATE REAL COORDINATES

What does results.multi_hand_landmarks return?

It returns each landmark with (x, y, z) values, where x and y are coordinates normalized to [0, 1] by the image width and height, and z represents the landmark depth with the wrist as the origin (the smaller the value, the closer the landmark is to the camera). Below is the output of results.multi_hand_landmarks[-1] ([-1] selects the last hand in the list of detected hands):

Input:
print(results.multi_hand_landmarks[-1])

Output:
landmark {                                #Landmark 0
x: 0.5968636274337769
y: 0.896553099155426
z: 2.2493122742162086e-06
}
landmark { #Landmark 1
x: 0.5569524168968201
y: 0.9146661758422852
z: 0.009924606420099735
}
landmark { #Landmark 2
x: 0.5252435803413391
y: 0.9221382141113281
z: -0.00661493418738246
}
landmark { #Landmark 3
x: 0.5062761902809143
y: 0.9374220967292786
z: -0.025194035843014717
}
landmark { #Landmark 4
x: 0.4931206703186035
y: 0.9585299491882324
z: -0.04859820753335953
}
landmark { #Landmark 5
x: 0.5249694585800171
y: 0.8639355301856995
z: -0.06959360837936401
}
landmark { #Landmark 6
x: 0.4925597906112671
y: 0.8875953555107117
z: -0.10742112994194031
}
landmark { #Landmark 7
x: 0.4766075015068054
y: 0.925879716873169
z: -0.12491398304700851
}
landmark { #Landmark 8
x: 0.4691225588321686
y: 0.961577832698822
z: -0.1354815512895584
}
landmark { #Landmark 9
x: 0.5465156435966492
y: 0.8603861927986145
z: -0.09077927470207214
}
landmark { #Landmark 10
x: 0.5134445428848267
y: 0.8893011212348938
z: -0.139773428440094
}
landmark { #Landmark 11
x: 0.495861679315567
y: 0.9380254149436951
z: -0.1563994288444519
}
landmark { #Landmark 12
x: 0.48566314578056335
y: 0.9797104597091675
z: -0.16294890642166138
}
landmark { #Landmark 13
x: 0.5695441365242004
y: 0.8747822642326355
z: -0.10704094916582108
}
landmark { #Landmark 14
x: 0.5467929244041443
y: 0.9169320464134216
z: -0.15352743864059448
}
landmark { #Landmark 15
x: 0.5310368537902832
y: 0.9595043659210205
z: -0.16797395050525665
}
landmark { #Landmark 16
x: 0.5206668376922607
y: 0.9926384687423706
z: -0.17157959938049316
}
landmark { #Landmark 17
x: 0.590991735458374
y: 0.8993032574653625
z: -0.11928978562355042
}
landmark { #Landmark 18
x: 0.5841020345687866
y: 0.9504171013832092
z: -0.14728157222270966
}
landmark { #Landmark 19
x: 0.5759584307670593
y: 0.9837712645530701
z: -0.14999689161777496
}
landmark { #Landmark 20
x: 0.5684304237365723
y: 1.0093547105789185
z: -0.15041688084602356
}

Above are the 21 normalized landmark coordinates.

  • Extracting the normalized x, y and z coordinates of a landmark:

Input:
results.multi_hand_landmarks[-1].landmark[0]

Output:
x: 0.5968636274337769 #Landmark 0
y: 0.896553099155426
z: 2.2493122742162086e-06

Input:
x_str = str(results.multi_hand_landmarks[-1].landmark[0]).split('\n')[0]
y_str = str(results.multi_hand_landmarks[-1].landmark[0]).split('\n')[1]
z_str = str(results.multi_hand_landmarks[-1].landmark[0]).split('\n')[2]
print(x_str)
print(y_str)
print(z_str)

Output:
x: 0.5968636274337769
y: 0.896553099155426
z: 2.2493122742162086e-06

Input:
x = float(x_str.split(" ")[1])
y = float(y_str.split(" ")[1])
z = float(z_str.split(" ")[1])
#x, y and z are the normalized coordinates of landmark 0

  • To get the real (pixel) coordinates, multiply the x coordinate by the width of the video and the y coordinate by the height of the video:

height = img.shape[0]
width = img.shape[1]
x_real = x * width
y_real = y * height
#"x" and "y" are the normalized coordinates of the landmark, and "img" is the frame matrix captured at that instant
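
String-parsing the protobuf text works, but each landmark object also exposes x, y and z directly as attributes, which is simpler and avoids the string round-trip (a minimal sketch using the same objects as above):

hand = results.multi_hand_landmarks[-1]
lm0 = hand.landmark[0]            # landmark 0 (the wrist)
x, y, z = lm0.x, lm0.y, lm0.z     # normalized values, no string parsing needed
x_real = int(x * img.shape[1])    # pixel x = normalized x * frame width
y_real = int(y * img.shape[0])    # pixel y = normalized y * frame height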

4.1 ADDING GESTURES

4.1.1 Orientation of our hand:

Take a look at Landmark 0 and Landmark 9; we can use these two points to determine the approximate orientation of the hand.

Image From: hand_landmarks.png

The angle θ between the horizontal and the line joining Landmark 0 and Landmark 9 is given by:

tan θ = |(m₂ − m₁) / (1 + m₁·m₂)|

where m₁ is the slope of the horizontal line, i.e., 0, and m₂ is the slope of the line created by Landmark 0 and Landmark 9, i.e.,

m₂ = (y₉ − y₀) / (x₉ − x₀)

Since m₁ = 0, this reduces to:

tan θ = |m₂|

Since tan is an increasing function, the angle grows with the slope m₂: |m₂| > 1 means an angle above 45° (a mostly vertical hand), while |m₂| ≤ 1 means an angle of at most 45° (a mostly horizontal hand).

Orientation : Upward

For |m₂| > 1, if the y coordinate of Landmark 9 is SMALLER than the y coordinate of Landmark 0 (remember that y increases downwards in image coordinates), the orientation will be upwards.

Orientation : Downward

For |m₂| > 1, if the y coordinate of Landmark 9 is GREATER than the y coordinate of Landmark 0, the orientation will be downwards.

Orientation : Right

For 0 ≤ |m₂| ≤ 1, if the x coordinate of Landmark 9 is GREATER than the x coordinate of Landmark 0, the orientation will be to the right.

Orientation : Left

For 0 ≤ |m₂| ≤ 1, if the x coordinate of Landmark 9 is SMALLER than the x coordinate of Landmark 0, the orientation will be to the left.

The orientation function takes two tuples as arguments, containing the (x, y) coordinates of Landmark 0 and Landmark 9 respectively.

def orientation(coordinate_landmark_0, coordinate_landmark_9):
    x0 = coordinate_landmark_0[0]
    y0 = coordinate_landmark_0[1]

    x9 = coordinate_landmark_9[0]
    y9 = coordinate_landmark_9[1]

    if abs(x9 - x0) < 0.05:    #the line is almost vertical, so its slope --> ∞
        m = 1000000000
    else:
        m = abs((y9 - y0) / (x9 - x0))

    if m >= 0 and m <= 1:
        if x9 > x0:
            return "Right"
        else:
            return "Left"
    if m > 1:
        if y9 < y0:    #since y decreases upwards
            return "Up"
        else:
            return "Down"

Let’s try this function out:

Input:
orientation((0, 0), (1, 4))
Output:
'Down'
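
A couple more checks with made-up coordinates exercise the remaining branches:

Input:
orientation((0.5, 0.9), (0.51, 0.6))   #near-vertical line, y9 < y0
Output:
'Up'
Input:
orientation((0.3, 0.5), (0.6, 0.4))    #|m2| < 1 and x9 > x0
Output:
'Right'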

4.1.2 CHECKING WHICH FINGER IS CLOSED

Take a look at the pictures below; the tips of the fingers are numbered in black and the adjacent joints/landmarks are numbered in white.

Fig 1: With Open Fingers
Fig 2: With Closed Fingers

OBSERVATION:

FIG 1 : With open fingers, the distances between Landmark 0 and the fingertips are greater than the distances between Landmark 0 and their respective adjacent landmarks.

FIG 2 : With closed fingers, the distances between Landmark 0 and the fingertips are less than the distances between Landmark 0 and their respective adjacent landmarks.

Hence, the above observation can be used to detect a closed finger.

Let's code this:

  • The x_coordinate function returns the x coordinate of the landmark passed.
  • The y_coordinate function returns the y coordinate of the landmark passed.
  • The finger function returns the list of closed fingers when z = “finger” is passed (the landmark argument is ignored in that case), whereas when z = “true coordinate” is passed, it returns the TRUE or REAL COORDINATES of the landmark as a tuple.

#For calculating the distance between two points
from math import dist

def x_coordinate(landmark):   #landmark --> 0 to 20
    return float(str(results.multi_hand_landmarks[-1].landmark[int(landmark)]).split('\n')[0].split(" ")[1])

def y_coordinate(landmark):   #landmark --> 0 to 20
    return float(str(results.multi_hand_landmarks[-1].landmark[int(landmark)]).split('\n')[1].split(" ")[1])

def finger(landmark, z):   #if z="finger", it returns which fingers are closed; if z="true coordinate", it returns the true coordinates of the landmark
    if results.multi_hand_landmarks is not None:
        try:
            p0x = x_coordinate(0)    #coordinates of landmark 0 (the wrist)
            p0y = y_coordinate(0)

            p7x = x_coordinate(7)    #joint adjacent to the index fingertip
            p7y = y_coordinate(7)
            d07 = dist([p0x, p0y], [p7x, p7y])
            p8x = x_coordinate(8)    #tip of the index finger
            p8y = y_coordinate(8)
            d08 = dist([p0x, p0y], [p8x, p8y])

            p11x = x_coordinate(11)  #joint adjacent to the middle fingertip
            p11y = y_coordinate(11)
            d011 = dist([p0x, p0y], [p11x, p11y])
            p12x = x_coordinate(12)  #tip of the middle finger
            p12y = y_coordinate(12)
            d012 = dist([p0x, p0y], [p12x, p12y])

            p15x = x_coordinate(15)  #joint adjacent to the ring fingertip
            p15y = y_coordinate(15)
            d015 = dist([p0x, p0y], [p15x, p15y])
            p16x = x_coordinate(16)  #tip of the ring finger
            p16y = y_coordinate(16)
            d016 = dist([p0x, p0y], [p16x, p16y])

            p19x = x_coordinate(19)  #joint adjacent to the pinky fingertip
            p19y = y_coordinate(19)
            d019 = dist([p0x, p0y], [p19x, p19y])
            p20x = x_coordinate(20)  #tip of the pinky finger
            p20y = y_coordinate(20)
            d020 = dist([p0x, p0y], [p20x, p20y])

            close = []   #closed fingers: 1=index, 2=middle, 3=ring, 4=pinky
            if z == "finger":
                if d07 > d08:    #tip closer to the wrist than its adjacent joint --> finger closed
                    close.append(1)
                if d011 > d012:
                    close.append(2)
                if d015 > d016:
                    close.append(3)
                if d019 > d020:
                    close.append(4)
                return close

            if z == "true coordinate":
                plandmark_x = x_coordinate(landmark)
                plandmark_y = y_coordinate(landmark)
                return (int(1280 * plandmark_x), int(720 * plandmark_y))   #assumes a 1280x720 frame
        except Exception:
            pass

(Landmarks: 0 is the wrist. 8, 12, 16 and 20 are the fingertips, whereas 7, 11, 15 and 19 are the adjacent landmark points.)

The above finger function has two functionalities:

if z = “finger” is passed, then the function returns a list of fingers that are closed whereas if z = “true coordinate” is passed, it returns a tuple containing the true coordinates of the landmark.
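
For instance, inside the capture loop (with hypothetical return values for illustration):

closed = finger(None, "finger")          # e.g. [1, 2, 3, 4] when all four fingers are closed
tip_px = finger(8, "true coordinate")    # e.g. (632, 418): pixel coordinates of the index fingertip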

4.1.3 ADDING “OKAY” OR “THUMBS UP” GESTURE

OBSERVATION:

For detecting “Thumbs Up” using left hand:

  • Orientation should be “Right”
  • All of the fingers should be closed
  • x coordinate of Landmark 4 should be less than x coordinate of Landmark 5
if finger(None, "finger") == [1, 2, 3, 4]:   #all four fingers are closed
    if orientation(finger(0, "true coordinate"), finger(9, "true coordinate")) == "Right":
        if finger(4, "true coordinate")[0] < finger(5, "true coordinate")[0]:
            cv2.putText(image, "Okay!!", (500, 200), cv2.FONT_HERSHEY_SIMPLEX,
                        2.0, (0, 0, 255), 2)

4.1.4 DRAWING ON THE SCREEN USING YOUR FINGER

Let's draw using your index finger.

STEPS FOR DRAWING:

  • Every finger except your index finger should be closed.
  • Append the coordinates of the tip of the index finger (Landmark 8) on every frame while it is open.
  • Create a line by joining all the appended points using OpenCV.

Below is the code for the same:

points = []   #define this before the capture loop so it persists across frames

if finger(None, "finger") == [2, 3, 4]:   #middle, ring and pinky closed; index open
    points.append(finger(8, "true coordinate"))
    #since landmark 8 is the tip of the index finger

for i in range(len(points) - 1):
    cv2.line(image, (points[i][0], points[i][1]),
             (points[i + 1][0], points[i + 1][1]),
             color=(255, 255, 0), thickness=1)

STEPS FOR ERASING THE DRAWING:

Let's erase the drawing when all of the fingers are closed:

if finger(None, "finger") == [1, 2, 3, 4]:
    points.clear()
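
Putting it together, here is a minimal sketch of how the drawing and erasing pieces sit inside the capture loop from Section 2 (it assumes the 1280x720 frame size hard-coded in finger(), and reuses cap, mp_hands and the helper functions defined above):

points = []                                    #persists across frames
with mp_hands.Hands(min_detection_confidence=0.5,
                    min_tracking_confidence=0.5) as hands:
    while cap.isOpened():
        success, img = cap.read()
        if not success:
            continue
        image = cv2.cvtColor(cv2.flip(img, 1), cv2.COLOR_BGR2RGB)
        results = hands.process(image)         #the helpers read this global
        image = cv2.cvtColor(image, cv2.COLOR_RGB2BGR)

        closed = finger(None, "finger")
        if closed == [2, 3, 4]:                #only the index finger is open --> draw
            pt = finger(8, "true coordinate")
            if pt:
                points.append(pt)
        elif closed == [1, 2, 3, 4]:           #all four fingers closed --> erase
            points.clear()

        for i in range(len(points) - 1):
            cv2.line(image, points[i], points[i + 1], (255, 255, 0), 1)

        cv2.imshow('MediaPipe Hands', image)
        if cv2.waitKey(5) & 0xFF == ord("q"):
            break
cap.release()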

5. HANDEDNESS

MediaPipe allows you to identify the left and right hand using the code below:

results = hands.process(image)
results.multi_handedness
#Check MEDIAPIPE HANDS (SECTION 2) for "results" and add it in between the code

The output of the above code looks like this:

Input:
print('Handedness:', results.multi_handedness)
Output:
#None if no hand is detected
Handedness: None
#If only left hand is detected
Handedness: [classification {
index: 0
score: 0.9713482856750488
label: "Left"
}
]
#If only the right hand is detected
Handedness: [classification {
index: 1
score: 0.9999989867210388
label: "Right"
}
]
#If both left and right hand are detected
Handedness: [classification {
index: 0
score: 0.9999902844429016
label: "Left"
}
, classification {
index: 1
score: 0.9998350739479065
label: "Right"
}
]

Handedness could be used to extend the types and number of gestures and much more.
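
For example, the labels can be read out of the classification results shown above (a short sketch):

if results.multi_handedness:
    for hand in results.multi_handedness:
        label = hand.classification[0].label   # "Left" or "Right"
        score = hand.classification[0].score   # confidence of the prediction
        print(label, score)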

This is a beginner tutorial on how to use MediaPipe Hands for gesture recognition. You can play around and add more gestures.
