Real-Time Gesture Recognition Using GOOGLE’S MediaPipe Hands — Add Your Own Gestures [Tutorial #1]

Vaibhav Mudgal
9 min read · Jun 12, 2021
AI vs Brain

Artificial Intelligence is improving day by day. We are becoming smarter, and our machines are becoming smarter still with each passing moment. Object detection has been a hot topic for many years now, and there are plenty of Machine Learning and Deep Learning models with near-perfect accuracy. But that perfection comes at a price: high computational power. The need for high-performance machines is not a new problem for us.

Our goal is to get close to 100% accuracy, and there are two ways to accomplish it:

  1. High Performance Machines
  2. Optimized Algorithms/Models

This article is all about implementing an optimized model for detecting hand gestures.

1. WHAT IS MEDIAPIPE?

MediaPipe is a cross-platform framework, created by Google, for building multimodal applied machine learning pipelines. It provides cutting-edge ML models such as:

  • Face Detection
  • Multi-Hand Tracking
  • Human Pose
  • and many more

For more MediaPipe solutions, visit https://google.github.io/mediapipe/solutions/solutions.html.

And to read more about MediaPipe Hands, visit https://google.github.io/mediapipe/solutions/hands.

MediaPipe models are built to run on machines with minimal computing power. Below are the performance characteristics (real-time FPS) of the MediaPipe Facemesh model:

Image From: https://blog.tensorflow.org/2020/03/face-and-hand-tracking-in-browser-with-mediapipe-and-tensorflowjs.html

In this article, we’ll use MediaPipe Hands to detect Hand Landmarks and will add a few gestures based on these landmarks to the model.

2. MEDIAPIPE HANDS

Hand landmarks are the joints of the fingers as well as the fingertips. MediaPipe Hands detects the 21 landmarks shown below.

Image From: hand_landmarks.png

The MediaPipe code returns the normalized coordinates of these 21 landmarks.

Firstly, install the MediaPipe Python library from the terminal:

pip install mediapipe
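
The hands module also exposes the 21 landmark indices as a named enum, mp.solutions.hands.HandLandmark, which doubles as a quick check that the install worked:

import mediapipe as mp

# Print the index and name of each of the 21 hand landmarks
for lm in mp.solutions.hands.HandLandmark:
    print(lm.value, lm.name)   # 0 WRIST, 1 THUMB_CMC, ..., 20 PINKY_TIP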

Below is the implementation of MediaPipe Hands from https://google.github.io/mediapipe/solutions/hands.

Importing MediaPipe, OpenCV and NumPy, and creating shortcuts for the hands module and the drawing utilities (both are used later):

import mediapipe as mp
import cv2
import numpy as np

mp_hands = mp.solutions.hands
mp_drawing = mp.solutions.drawing_utils

Capturing webcam video using OpenCV:

cap = cv2.VideoCapture(0)
with mp_hands.Hands(
        min_detection_confidence=0.5,
        min_tracking_confidence=0.5) as hands:
    while cap.isOpened():
        success, img = cap.read()
        if not success:
            print("Ignoring empty camera frame.")
            continue
        image = img.copy()

If loading a video, use ‘break’ instead of ‘continue’ in the above code.

Flipping the image horizontally for a later selfie-view display, and converting the BGR image to RGB:

image = cv2.cvtColor(cv2.flip(image, 1), cv2.COLOR_BGR2RGB)

To improve performance, optionally mark the image as not writeable to pass by reference:

image.flags.writeable = False
results = hands.process(image)

Draw the hand annotations on the image:

image.flags.writeable = True
image = cv2.cvtColor(image, cv2.COLOR_RGB2BGR)
if results.multi_hand_landmarks:
    for hand_landmarks in results.multi_hand_landmarks:
        mp_drawing.draw_landmarks(
            image, hand_landmarks, mp_hands.HAND_CONNECTIONS)
cv2.imshow('MediaPipe Hands', image)
if cv2.waitKey(5) & 0xFF == ord("q"):
    break

cap.release()   # after the loop ends

3. METHODS USED TO EXTRACT AND CALCULATE REAL COORDINATES

What does results.multi_hand_landmarks return?

It returns each landmark with (x, y, z) values, where x and y are coordinates normalized to [0, 1] by the image width and height, and z represents the landmark depth with the wrist as the origin (the smaller the value, the closer the landmark is to the camera). Below is the output of results.multi_hand_landmarks[-1] ([-1] selects the last hand in the list of detected hands):

Input:
print(results.multi_hand_landmarks[-1])

Output:
landmark {                                #Landmark 0
x: 0.5968636274337769
y: 0.896553099155426
z: 2.2493122742162086e-06
}
landmark { #Landmark 1
x: 0.5569524168968201
y: 0.9146661758422852
z: 0.009924606420099735
}
landmark { #Landmark 2
x: 0.5252435803413391
y: 0.9221382141113281
z: -0.00661493418738246
}
landmark { #Landmark 3
x: 0.5062761902809143
y: 0.9374220967292786
z: -0.025194035843014717
}
landmark { #Landmark 4
x: 0.4931206703186035
y: 0.9585299491882324
z: -0.04859820753335953
}
landmark { #Landmark 5
x: 0.5249694585800171
y: 0.8639355301856995
z: -0.06959360837936401
}
landmark { #Landmark 6
x: 0.4925597906112671
y: 0.8875953555107117
z: -0.10742112994194031
}
landmark { #Landmark 7
x: 0.4766075015068054
y: 0.925879716873169
z: -0.12491398304700851
}
landmark { #Landmark 8
x: 0.4691225588321686
y: 0.961577832698822
z: -0.1354815512895584
}
landmark { #Landmark 9
x: 0.5465156435966492
y: 0.8603861927986145
z: -0.09077927470207214
}
landmark { #Landmark 10
x: 0.5134445428848267
y: 0.8893011212348938
z: -0.139773428440094
}
landmark { #Landmark 11
x: 0.495861679315567
y: 0.9380254149436951
z: -0.1563994288444519
}
landmark { #Landmark 12
x: 0.48566314578056335
y: 0.9797104597091675
z: -0.16294890642166138
}
landmark { #Landmark 13
x: 0.5695441365242004
y: 0.8747822642326355
z: -0.10704094916582108
}
landmark { #Landmark 14
x: 0.5467929244041443
y: 0.9169320464134216
z: -0.15352743864059448
}
landmark { #Landmark 15
x: 0.5310368537902832
y: 0.9595043659210205
z: -0.16797395050525665
}
landmark { #Landmark 16
x: 0.5206668376922607
y: 0.9926384687423706
z: -0.17157959938049316
}
landmark { #Landmark 17
x: 0.590991735458374
y: 0.8993032574653625
z: -0.11928978562355042
}
landmark { #Landmark 18
x: 0.5841020345687866
y: 0.9504171013832092
z: -0.14728157222270966
}
landmark { #Landmark 19
x: 0.5759584307670593
y: 0.9837712645530701
z: -0.14999689161777496
}
landmark { #Landmark 20
x: 0.5684304237365723
y: 1.0093547105789185
z: -0.15041688084602356
}

Above are the 21 normalized landmark coordinates.

  • Extracting the normalized x, y and z coordinates of a landmark:

Input:
results.multi_hand_landmarks[-1].landmark[0]

Output:
x: 0.5968636274337769 #Landmark 0
y: 0.896553099155426
z: 2.2493122742162086e-06

Input:
x_str = str(results.multi_hand_landmarks[-1].landmark[0]).split('\n')[0]
y_str = str(results.multi_hand_landmarks[-1].landmark[0]).split('\n')[1]
z_str = str(results.multi_hand_landmarks[-1].landmark[0]).split('\n')[2]
print(x_str)
print(y_str)
print(z_str)

Output:
x: 0.5968636274337769
y: 0.896553099155426
z: 2.2493122742162086e-06

Input:
x = float(x_str.split(" ")[1])
y = float(y_str.split(" ")[1])
z = float(z_str.split(" ")[1])
#x, y and z are the normalized coordinates of landmark 0

  • To get the real (pixel) coordinates, multiply the x coordinate by the width of the video and the y coordinate by the height of the video:

height = img.shape[0]
width = img.shape[1]
x_real = x * width
y_real = y * height
#"x" and "y" are the normalized coordinates of the landmark, and "img" is the frame matrix captured at that instant
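
String-parsing the protobuf text works, but each landmark object also exposes x, y and z directly as attributes, which is simpler and avoids the string round-trip (a minimal sketch using the same objects as above):

hand = results.multi_hand_landmarks[-1]
lm0 = hand.landmark[0]            # landmark 0 (the wrist)
x, y, z = lm0.x, lm0.y, lm0.z     # normalized values, no string parsing needed
x_real = int(x * img.shape[1])    # pixel x = normalized x * frame width
y_real = int(y * img.shape[0])    # pixel y = normalized y * frame height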

4.1 ADDING GESTURES

4.1.1 Orientation of our hand:

Take a look at Landmark 0 and Landmark 9; we can use these two points to determine the approximate orientation of the hand.

Image From: hand_landmarks.png

The angle θ between the horizontal and the line joining Landmark 0 and Landmark 9 is given by:

tan θ = |(m₂ − m₁) / (1 + m₁·m₂)|

where m₁ is the slope of the horizontal line, i.e., 0, and m₂ is the slope of the line created by Landmark 0 and Landmark 9, i.e.,

m₂ = (y₉ − y₀) / (x₉ − x₀)

Since m₁ = 0, this reduces to:

tan θ = |m₂|

Since tan is an increasing function, the angle grows with the slope m₂: |m₂| > 1 means an angle above 45° (a mostly vertical hand), while |m₂| ≤ 1 means an angle of at most 45° (a mostly horizontal hand).

Orientation : Upward

For |m₂| > 1, if the y coordinate of Landmark 9 is SMALLER than the y coordinate of Landmark 0 (remember that y increases downwards in image coordinates), the orientation will be upwards.

Orientation : Downward

For |m₂| > 1, if the y coordinate of Landmark 9 is GREATER than the y coordinate of Landmark 0, the orientation will be downwards.

Orientation : Right

For 0 ≤ |m₂| ≤ 1, if the x coordinate of Landmark 9 is GREATER than the x coordinate of Landmark 0, the orientation will be to the right.

Orientation : Left

For 0 ≤ |m₂| ≤ 1, if the x coordinate of Landmark 9 is SMALLER than the x coordinate of Landmark 0, the orientation will be to the left.

The orientation function takes two tuples as arguments, containing the (x, y) coordinates of Landmark 0 and Landmark 9 respectively.

def orientation(coordinate_landmark_0, coordinate_landmark_9):
    x0 = coordinate_landmark_0[0]
    y0 = coordinate_landmark_0[1]

    x9 = coordinate_landmark_9[0]
    y9 = coordinate_landmark_9[1]

    if abs(x9 - x0) < 0.05:    #the line is almost vertical, so its slope --> ∞
        m = 1000000000
    else:
        m = abs((y9 - y0) / (x9 - x0))

    if m >= 0 and m <= 1:
        if x9 > x0:
            return "Right"
        else:
            return "Left"
    if m > 1:
        if y9 < y0:    #since y decreases upwards
            return "Up"
        else:
            return "Down"

Let’s try this function out:

Input:
orientation((0, 0), (1, 4))
Output:
'Down'
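
A couple more checks with made-up coordinates exercise the remaining branches:

Input:
orientation((0.5, 0.9), (0.51, 0.6))   #near-vertical line, y9 < y0
Output:
'Up'
Input:
orientation((0.3, 0.5), (0.6, 0.4))    #|m2| < 1 and x9 > x0
Output:
'Right'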

4.1.2 CHECKING WHICH FINGER IS CLOSED

Take a look at the pictures below; the tips of the fingers are numbered in black and the adjacent joints/landmarks are numbered in white.

Fig 1: With Open Fingers
Fig 2: With Closed Fingers

OBSERVATION:

FIG 1 : With open fingers, the distances between Landmark 0 and the fingertips are greater than the distances between Landmark 0 and their respective adjacent landmarks.

FIG 2 : With closed fingers, the distances between Landmark 0 and the fingertips are less than the distances between Landmark 0 and their respective adjacent landmarks.

Hence, the above observation can be used to detect a closed finger.

Let's code this:

  • The x_coordinate function returns the x coordinate of the landmark passed.
  • The y_coordinate function returns the y coordinate of the landmark passed.
  • The finger function returns the list of closed fingers when z = “finger” is passed (the landmark argument is ignored in that case), whereas when z = “true coordinate” is passed, it returns the TRUE or REAL COORDINATES of the landmark as a tuple.

#For calculating the distance between two points
from math import dist

def x_coordinate(landmark):   #landmark --> 0 to 20
    return float(str(results.multi_hand_landmarks[-1].landmark[int(landmark)]).split('\n')[0].split(" ")[1])

def y_coordinate(landmark):   #landmark --> 0 to 20
    return float(str(results.multi_hand_landmarks[-1].landmark[int(landmark)]).split('\n')[1].split(" ")[1])

def finger(landmark, z):   #if z="finger", it returns which fingers are closed; if z="true coordinate", it returns the true coordinates of the landmark
    if results.multi_hand_landmarks is not None:
        try:
            p0x = x_coordinate(0)    #coordinates of landmark 0 (the wrist)
            p0y = y_coordinate(0)

            p7x = x_coordinate(7)    #joint adjacent to the index fingertip
            p7y = y_coordinate(7)
            d07 = dist([p0x, p0y], [p7x, p7y])
            p8x = x_coordinate(8)    #tip of the index finger
            p8y = y_coordinate(8)
            d08 = dist([p0x, p0y], [p8x, p8y])

            p11x = x_coordinate(11)  #joint adjacent to the middle fingertip
            p11y = y_coordinate(11)
            d011 = dist([p0x, p0y], [p11x, p11y])
            p12x = x_coordinate(12)  #tip of the middle finger
            p12y = y_coordinate(12)
            d012 = dist([p0x, p0y], [p12x, p12y])

            p15x = x_coordinate(15)  #joint adjacent to the ring fingertip
            p15y = y_coordinate(15)
            d015 = dist([p0x, p0y], [p15x, p15y])
            p16x = x_coordinate(16)  #tip of the ring finger
            p16y = y_coordinate(16)
            d016 = dist([p0x, p0y], [p16x, p16y])

            p19x = x_coordinate(19)  #joint adjacent to the pinky fingertip
            p19y = y_coordinate(19)
            d019 = dist([p0x, p0y], [p19x, p19y])
            p20x = x_coordinate(20)  #tip of the pinky finger
            p20y = y_coordinate(20)
            d020 = dist([p0x, p0y], [p20x, p20y])

            close = []   #closed fingers: 1=index, 2=middle, 3=ring, 4=pinky
            if z == "finger":
                if d07 > d08:    #tip closer to the wrist than its adjacent joint --> finger closed
                    close.append(1)
                if d011 > d012:
                    close.append(2)
                if d015 > d016:
                    close.append(3)
                if d019 > d020:
                    close.append(4)
                return close

            if z == "true coordinate":
                plandmark_x = x_coordinate(landmark)
                plandmark_y = y_coordinate(landmark)
                return (int(1280 * plandmark_x), int(720 * plandmark_y))   #assumes a 1280x720 frame
        except Exception:
            pass

(Landmarks: 0 is the wrist. 8, 12, 16 and 20 are the fingertips, whereas 7, 11, 15 and 19 are the adjacent landmark points.)

The above finger function has two functionalities:

if z = “finger” is passed, then the function returns a list of fingers that are closed whereas if z = “true coordinate” is passed, it returns a tuple containing the true coordinates of the landmark.
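
For instance, inside the capture loop (with hypothetical return values for illustration):

closed = finger(None, "finger")          # e.g. [1, 2, 3, 4] when all four fingers are closed
tip_px = finger(8, "true coordinate")    # e.g. (632, 418): pixel coordinates of the index fingertip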

4.1.3 ADDING “OKAY” OR “THUMBS UP” GESTURE

OBSERVATION:

For detecting “Thumbs Up” using left hand:

  • Orientation should be “Right”
  • All of the fingers should be closed
  • x coordinate of Landmark 4 should be less than x coordinate of Landmark 5
if finger(None, "finger") == [1, 2, 3, 4]:   #all four fingers are closed
    if orientation(finger(0, "true coordinate"), finger(9, "true coordinate")) == "Right":
        if finger(4, "true coordinate")[0] < finger(5, "true coordinate")[0]:
            cv2.putText(image, "Okay!!", (500, 200), cv2.FONT_HERSHEY_SIMPLEX,
                        2.0, (0, 0, 255), 2)

4.1.4 DRAWING ON THE SCREEN USING YOUR FINGER

Let's draw using your index finger.

STEPS FOR DRAWING:

  • Every finger except your index finger should be closed.
  • Append the coordinates of the tip of the index finger (Landmark 8) on every frame while it is open.
  • Create a line by joining all the appended points using OpenCV.

Below is the code for the same:

points = []   #define this before the capture loop so it persists across frames

if finger(None, "finger") == [2, 3, 4]:   #middle, ring and pinky closed; index open
    points.append(finger(8, "true coordinate"))
    #since landmark 8 is the tip of the index finger

for i in range(len(points) - 1):
    cv2.line(image, (points[i][0], points[i][1]),
             (points[i + 1][0], points[i + 1][1]),
             color=(255, 255, 0), thickness=1)

STEPS FOR ERASING THE DRAWING:

Let's erase the drawing when all of the fingers are closed:

if finger(None, "finger") == [1, 2, 3, 4]:
    points.clear()
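
Putting it together, here is a minimal sketch of how the drawing and erasing pieces sit inside the capture loop from Section 2 (it assumes the 1280x720 frame size hard-coded in finger(), and reuses cap, mp_hands and the helper functions defined above):

points = []                                    #persists across frames
with mp_hands.Hands(min_detection_confidence=0.5,
                    min_tracking_confidence=0.5) as hands:
    while cap.isOpened():
        success, img = cap.read()
        if not success:
            continue
        image = cv2.cvtColor(cv2.flip(img, 1), cv2.COLOR_BGR2RGB)
        results = hands.process(image)         #the helpers read this global
        image = cv2.cvtColor(image, cv2.COLOR_RGB2BGR)

        closed = finger(None, "finger")
        if closed == [2, 3, 4]:                #only the index finger is open --> draw
            pt = finger(8, "true coordinate")
            if pt:
                points.append(pt)
        elif closed == [1, 2, 3, 4]:           #all four fingers closed --> erase
            points.clear()

        for i in range(len(points) - 1):
            cv2.line(image, points[i], points[i + 1], (255, 255, 0), 1)

        cv2.imshow('MediaPipe Hands', image)
        if cv2.waitKey(5) & 0xFF == ord("q"):
            break
cap.release()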

5. HANDEDNESS

MediaPipe allows you to identify the left and right hand using the code below:

results = hands.process(image)
results.multi_handedness
#Check MEDIAPIPE HANDS (SECTION 2) for "results" and add it in between the code

The output of the above code looks like this:

Input:
print('Handedness:', results.multi_handedness)
Output:
#None if no hand is detected
Handedness: None
#If only left hand is detected
Handedness: [classification {
index: 0
score: 0.9713482856750488
label: "Left"
}
]
#If only the right hand is detected
Handedness: [classification {
index: 1
score: 0.9999989867210388
label: "Right"
}
]
#If both left and right hand are detected
Handedness: [classification {
index: 0
score: 0.9999902844429016
label: "Left"
}
, classification {
index: 1
score: 0.9998350739479065
label: "Right"
}
]

Handedness could be used to extend the types and number of gestures and much more.
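
For example, the labels can be read out of the classification results shown above (a short sketch):

if results.multi_handedness:
    for hand in results.multi_handedness:
        label = hand.classification[0].label   # "Left" or "Right"
        score = hand.classification[0].score   # confidence of the prediction
        print(label, score)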

This is a beginner tutorial on how to use MediaPipe Hands for gesture recognition. You can play around and add more gestures.
