Controlling my computer using my hands

Tanmay Choudhary
9 min read · Jun 23, 2024


Photo by Hammer & Tusk on Unsplash

Hi everyone!

So a couple of months ago, I decided to make a Python program that would allow me to move my mouse with just my hands or eyes. It was a fairly ambitious project, and I didn't know what the end result was going to be. Or how hard it was going to be (answer: extremely).

Now that it is finished, the end result lets you control the mouse with your hand and even perform actions like a single left click, a double left click, and a long press of the left mouse button.

This is the story of what I did, where I failed, and where I succeeded, from the start to the end of the project.

Coming up with a plan

I started off by researching existing implementations of what I wanted to achieve. In my mind, the ideal end result would have worked like this:

  1. The program divides the screen into a 5 × 5 grid.
  2. The program tracks where I am looking and outputs the area of the screen corresponding to it.
  3. The program moves the mouse to the centre of that area.
  4. The program tracks my hand movements, letting me move the mouse within that area and giving me precise control over the final position.

The additional requirement was to make the program generic, so that it would be plug and play for anybody.

I knew training a custom (and lightweight) model to output face and hand landmarks was going to be a huge task, so I wanted to use pre-trained models for those. Thankfully, MediaPipe has a whole library of such models, and they are very easy to use too.

I was able to quickly set up a basic project with hand and face landmark detection using opencv-python and mediapipe. Now that I had the pre-trained models ready, it was time to train my own.
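To give an idea of how little code that starter project takes, here is a rough sketch of it. I am using the classic mp.solutions API here for brevity; the project itself uses the newer tasks API you will see below.

import cv2
import mediapipe as mp

hands = mp.solutions.hands.Hands(max_num_hands=1)
face_mesh = mp.solutions.face_mesh.FaceMesh(refine_landmarks=True)

cap = cv2.VideoCapture(0)
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        continue
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    hand_results = hands.process(rgb)      # -> .multi_hand_landmarks
    face_results = face_mesh.process(rgb)  # -> .multi_face_landmarks
    cv2.imshow('Landmarks', frame)
    if cv2.waitKey(5) & 0xFF == 27:
        break
cap.release()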

Messing with eye tracking

I decided to build the models in the order in which they would be required in the main event loop. That meant starting with a model that would track eye movement and move the mouse over the cell I was looking at.

I was able to quickly divide the screen into a 5 × 5 grid. I used pyautogui for this, as I knew I would have to move the mouse too.
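The grid logic boils down to something like this (a minimal sketch; I am assuming cells are numbered row by row from the top left):

import pyautogui

ROWS, COLS = 5, 5
screen_w, screen_h = pyautogui.size()

def cell_centre(cell):
    """Centre of a cell (0..24), counted row by row from the top left."""
    row, col = divmod(cell, COLS)
    x = (col + 0.5) * screen_w / COLS
    y = (row + 0.5) * screen_h / ROWS
    return x, y

# e.g. move the mouse to the middle cell of the 5 x 5 grid
pyautogui.moveTo(*cell_centre(12))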

After that came the hellish part: data collection. I wrote a program to move the mouse across the grid. Every time the mouse moved to a new location, the user had to look at it and press Enter. The program would capture the face landmarks and store the positions of the left and right eyes relative to where the eyes were when the user looked at each of the screen corners. The picture below shows this:

The red dot is the location the user is looking at. The arrows represent the distances from the corners. For every data point, the program stored the lengths of the arrows.
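In code, the feature extraction looked roughly like this (a simplified sketch; the variable names and the calibration values are hypothetical, and the eye positions are the normalised MediaPipe coordinates):

import math

def corner_distances(eye_pos, corner_refs):
    """Lengths of the four 'arrows': distance from the current eye position
    to the eye position recorded while looking at each screen corner."""
    return [math.dist(eye_pos, ref) for ref in corner_refs]

# hypothetical calibration values: eye positions recorded while looking at the
# top-left, top-right, bottom-left and bottom-right corners of the screen
corners_left_eye = [(0.42, 0.38), (0.45, 0.38), (0.42, 0.41), (0.45, 0.41)]
left_eye_now = (0.435, 0.395)

# one part of a training row, labelled with the grid cell being looked at
features = corner_distances(left_eye_now, corners_left_eye)
print(features)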

I went with this method because I thought it would let me train a model that was device-independent. I collected data from all the devices I had (two standalone displays and two laptops), using a range of grid sizes for the mouse movement on each device. I normalised the data and started training models.

It did not work

Yup. It did not.

Of course I tried fixing it. I tried tuning a lot of models. Didn't work. I tried collecting more data. Didn't work. I made sure the subject was always centred in the frame. Didn't work. I even generated extra data with numpy's linspace function, interpolating approximate distances for points lying between the ones I had already collected.
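That augmentation looked roughly like this (a simplified sketch with made-up feature vectors; numpy broadcasts linspace over array endpoints):

import numpy as np

# feature vectors (corner distances) collected for two neighbouring grid cells
cell_a = np.array([0.12, 0.34, 0.28, 0.41])
cell_b = np.array([0.15, 0.31, 0.30, 0.38])

# interpolate 5 approximate feature vectors for points lying between them
synthetic = np.linspace(cell_a, cell_b, num=5)
print(synthetic.shape)  # (5, 4)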

Didn’t work.

I spent a whole solid month on this and even changed the way I was collecting the data. This included predicting where the eyes would be if the user looked at the corners, including aspect ratios as features, directly transforming the physical size of the display into the coordinate system MediaPipe was using, and so on. Explaining everything I tried would take far too long, so I am not going to go into it here.

The main problem was that I was not able to find a way to stop the dimensions of the device from affecting the data I collected. To make the program completely generic, it needed to transform the coordinates in a way that completely ignored user orientation and device dimensions. Cell number 16 on a laptop and on a 4K display should produce the same data points. This proved to be a very difficult problem for me.

So, sadly, I decided to rework my goals and drop eye tracking from the final project for now.

I will get back to it. If you have any suggestions about making this work, please drop a comment. It would help a lot. I am currently experimenting with the idea of training the model on device, but that comes with its own set of problems that need to be solved.

The new goal

The new goal was to make a program that uses your hand location to move the mouse. I had already tested a prototype of this and it worked well. This is the basic code for it:

import cv2
import mediapipe as mp
import pyautogui
from utils.average import Average
from utils.mouse_behaviour import MouseBehaviour
import time

# mediapipe settings
BaseOptions = mp.tasks.BaseOptions
HandLandmarker = mp.tasks.vision.HandLandmarker
HandLandmarkerOptions = mp.tasks.vision.HandLandmarkerOptions
HandLandmarkerResult = mp.tasks.vision.HandLandmarkerResult
VisionRunningMode = mp.tasks.vision.RunningMode

# pyautogui settings
pyautogui.FAILSAFE = False

# instantiation of required classes and functions
cap = cv2.VideoCapture(0)
avg = Average(lim=15)
mouse_behaviour = MouseBehaviour()

def callback(results, _, __):
    global avg, mouse_behaviour
    if results.hand_landmarks:
        # cursor = average of all landmarks except the thumb (4) and index (8) tips
        cursor = (
            sum(i.x for idx, i in enumerate(results.hand_landmarks[0]) if idx not in [4, 8]) / 19,
            sum(i.y for idx, i in enumerate(results.hand_landmarks[0]) if idx not in [4, 8]) / 19,
        )

        # smooth the cursor over the last few frames
        cursor = avg.get(cursor[0], cursor[1])

        mouse_behaviour.detect(
            cursor,
            [results.hand_landmarks[0][8].x, results.hand_landmarks[0][8].y, results.hand_landmarks[0][8].z],
            [results.hand_landmarks[0][4].x, results.hand_landmarks[0][4].y, results.hand_landmarks[0][4].z],
        )

options = HandLandmarkerOptions(
    base_options=BaseOptions(model_asset_path='models/scripts/mouse_movement/utils/hand_landmarker.task'),
    running_mode=VisionRunningMode.LIVE_STREAM,
    result_callback=callback
)

with HandLandmarker.create_from_options(options) as hands:
    while cap.isOpened():
        success, image = cap.read()
        if not success:
            print("Ignoring empty camera frame.")
            continue
        height, width = image.shape[:2]

        image = cv2.flip(image, 1)
        mp_image = mp.Image(image_format=mp.ImageFormat.SRGB, data=image)
        hands.detect_async(mp_image, round(time.time() * 1000))

        # cursor
        if mouse_behaviour.cursor is not None:
            cv2.circle(
                image,
                (int(mouse_behaviour.cursor[0] * width),
                 int(mouse_behaviour.cursor[1] * height)),
                3, (255, 0, 255), 2
            )

        # center
        cv2.circle(
            image,
            (int(mouse_behaviour.center[0] * width),
             int(mouse_behaviour.center[1] * height)),
            3, (0, 0, 255), 2
        )

        cv2.imshow('Webcam View', image)
        if cv2.waitKey(5) & 0xFF == 27:
            break

cap.release()

This is the first iteration of the main code. It supports single click, double click, and long press. The Average class reduces jitter in the mouse movement by averaging the previous 15 cursor locations. The MouseBehaviour class moves the mouse and performs the click actions. For the project, I decided to register a click when the index finger and the thumb touch each other. The mouse position is derived from the average of all the other landmarks. The center is a fixed point in the webcam frame that corresponds to the centre of the screen: when your hand hovers over it, the mouse sits near the centre of the screen.
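To make the mapping concrete: with center = (0.75, 0.75) and mulx = 4.5, a hand at x = 0.75 puts the mouse at the horizontal centre of the screen, while a hand at x ≈ 0.86 maps to 0.5 − (0.75 − 0.86) × 4.5 ≈ 0.995 of the screen width, i.e. near the right edge. The multipliers let a small hand movement cover the whole screen.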

average.py

class Average:
    def __init__(self, lim=7):
        self.x = []
        self.y = []
        self.lim = lim

    def get(self, x, y):
        # keep only the last `lim` samples
        if len(self.x) == self.lim and len(self.y) == self.lim:
            self.x.pop(0)
            self.y.pop(0)

        self.x.append(x)
        self.y.append(y)

        # divide by the number of stored samples so the average is not
        # biased towards zero before the window fills up
        return sum(self.x) / len(self.x), sum(self.y) / len(self.y)

mouse_behaviour.py

from pyautogui import mouseDown, mouseUp, leftClick, doubleClick, moveTo, size
import time


class MouseBehaviour:
    def __init__(self) -> None:
        self.time_start = None
        self.last_contact = None
        self.is_dragging = False
        self.distance_threshold = 0.045
        self.drag_time_threshold = 0.6
        self.double_click_time_threshold = 0.8

        self.cursor = None
        self.center = (.75, .75)

        self.swidth, self.sheight = size()

    def _euc(self, v1, v2):
        # Euclidean distance between two 3D landmark positions
        return pow(
            pow(v1[0] - v2[0], 2) + pow(v1[1] - v2[1], 2) + pow(v1[2] - v2[2], 2),
            0.5
        )

    def detect(self, cursor, index, thumb, mulx=4.5, muly=5.5):
        self.cursor = cursor
        distance = self._euc(index, thumb)

        if distance <= self.distance_threshold:
            # index and thumb are touching
            if self.time_start is None:
                if self.last_contact is not None and time.time() - self.last_contact < self.double_click_time_threshold:
                    # second tap within the threshold -> double click
                    doubleClick(_pause=False)
                    self.time_start = None
                else:
                    self.time_start = time.time()
                    leftClick(_pause=False)
            elif time.time() - self.time_start >= self.drag_time_threshold:
                # fingers held together long enough -> long press / drag
                self.is_dragging = True
                mouseDown(_pause=False)
        else:
            if self.is_dragging:
                mouseUp(_pause=False)
                self.is_dragging = False
                self.last_contact = None
                self.time_start = None
            else:
                self.last_contact = self.time_start
                self.time_start = None

        # map the hand position relative to the "center" point to an
        # absolute screen coordinate
        moveTo(
            self.swidth * (0.5 - ((self.center[0] - cursor[0]) * mulx)),
            self.sheight * (0.5 - ((self.center[1] - cursor[1]) * muly)),
            _pause=False
        )

However, this approach had a problem: the mouse would move every time my hands came into the webcam's view. Apart from that, it worked fine.

To solve the problem, I decided to train a simple model that would take the coordinates of the landmarks as input and output whether or not the hand was in 'mouse mode'.

What’s mouse mode?

It’s simple. In this position, your hand is in mouse mode:

The red dot corresponds to the centre of your screen. The purple dot is your current mouse location.

Putting your hand in this position will activate mouse mode and then moving your hand around will move the mouse. Otherwise nothing will happen.
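In the main loop, the classifier simply gates the mouse code. A rough sketch of that check, assuming the trained model is saved with joblib and expects the 63 flattened landmark values (x, y, z for each of the 21 points); the file name here is hypothetical:

import joblib

mouse_mode_model = joblib.load("mouse_mode_classifier.joblib")  # hypothetical path

def is_mouse_mode(hand_landmarks):
    # flatten the 21 landmarks into the 63-value row the model was trained on
    row = [v for lm in hand_landmarks for v in (lm.x, lm.y, lm.z)]
    return mouse_mode_model.predict([row])[0] == 1

# inside the result callback:
# if results.hand_landmarks and is_mouse_mode(results.hand_landmarks[0]):
#     mouse_behaviour.detect(cursor, index_tip, thumb_tip)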

So how did I train this?

Since I needed a binary classifier (mouse mode is either on or off), I stuck with a random forest classifier. I collected the data for mouse mode myself and used the SMOTE technique to increase the number of data points.

To collect data for hand positions that do not classify as mouse mode, I downloaded several YouTube videos of people performing and teaching magic tricks. I chose such videos because hands are visible in them for a considerable amount of time. I wrote this script to collect data for mouse_mode_off:

record.py

import cv2
import mediapipe as mp
import time
from utils import *


BaseOptions = mp.tasks.BaseOptions
HandLandmarker = mp.tasks.vision.HandLandmarker
HandLandmarkerOptions = mp.tasks.vision.HandLandmarkerOptions
HandLandmarkerResult = mp.tasks.vision.HandLandmarkerResult
VisionRunningMode = mp.tasks.vision.RunningMode

positions = []


def get_finger_tips(result: HandLandmarkerResult):
    if result.hand_landmarks:
        landmark_data = []

        for landmark in result.hand_landmarks[0]:
            landmark_data.append(float(landmark.x))
            landmark_data.append(float(landmark.y))
            landmark_data.append(float(landmark.z))

        positions.append(landmark_data)


options = HandLandmarkerOptions(
    base_options=BaseOptions(model_asset_path='hand_landmarker.task'),
    running_mode=VisionRunningMode.VIDEO,
)

with HandLandmarker.create_from_options(options) as landmarker:
    cap = cv2.VideoCapture("path/to/video.mp4")

    while True:
        _, frame = cap.read()
        if frame is None:
            break

        height, width = frame.shape[:2]
        mp_image = mp.Image(image_format=mp.ImageFormat.SRGB, data=frame)
        results = landmarker.detect_for_video(mp_image, round(time.time() * 1000))

        if results:
            get_finger_tips(results)

        image = draw_landmarks_on_image(mp_image.numpy_view(), results)

        cv2.imshow('Mouse mode detection', image)

        if cv2.waitKey(5) & 0xFF == 27:
            break

    cap.release()
    cv2.destroyAllWindows()

with open(f'mouse_mode_off_{str(time.time())}.csv', 'a') as file:
    data_points = positions
    data = ""
    for d in data_points:
        data += ", ".join([str(i) for i in d]) + "\n"
    file.write(data)

utils.py

from mediapipe import solutions
from mediapipe.framework.formats import landmark_pb2
import numpy as np
import cv2

MARGIN = 10 # pixels
FONT_SIZE = 1
FONT_THICKNESS = 1
HANDEDNESS_TEXT_COLOR = (88, 205, 54) # vibrant green

def draw_landmarks_on_image(rgb_image, detection_result):
    hand_landmarks_list = detection_result.hand_landmarks
    handedness_list = detection_result.handedness
    annotated_image = np.copy(rgb_image)

    # Loop through the detected hands to visualize.
    for idx in range(len(hand_landmarks_list)):
        hand_landmarks = hand_landmarks_list[idx]
        handedness = handedness_list[idx]

        # Draw the hand landmarks.
        hand_landmarks_proto = landmark_pb2.NormalizedLandmarkList()
        hand_landmarks_proto.landmark.extend([
            landmark_pb2.NormalizedLandmark(x=landmark.x, y=landmark.y, z=landmark.z) for landmark in hand_landmarks
        ])
        solutions.drawing_utils.draw_landmarks(
            annotated_image,
            hand_landmarks_proto,
            solutions.hands.HAND_CONNECTIONS,
            solutions.drawing_styles.get_default_hand_landmarks_style(),
            solutions.drawing_styles.get_default_hand_connections_style())

        # Get the top left corner of the detected hand's bounding box.
        height, width, _ = annotated_image.shape
        x_coordinates = [landmark.x for landmark in hand_landmarks]
        y_coordinates = [landmark.y for landmark in hand_landmarks]
        text_x = int(min(x_coordinates) * width)
        text_y = int(min(y_coordinates) * height) - MARGIN

        # Draw handedness (left or right hand) on the image.
        cv2.putText(annotated_image, f"{handedness[0].category_name}",
                    (text_x, text_y), cv2.FONT_HERSHEY_DUPLEX,
                    FONT_SIZE, HANDEDNESS_TEXT_COLOR, FONT_THICKNESS, cv2.LINE_AA)

    return annotated_image

Then I trained a model with the data:

Initial data points
Data points after SMOTE. I used the imblearn module for this.
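The training step boils down to something like this (a condensed sketch; in the real script, X holds the 63-value landmark rows loaded from the CSVs and y the 0/1 mouse-mode labels, replaced here with random placeholder data so the snippet runs on its own):

from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
import numpy as np

# placeholder data: 300 rows of 63 landmark values, heavily imbalanced labels
rng = np.random.default_rng(0)
X = rng.random((300, 63))
y = np.array([0] * 250 + [1] * 50)

# oversample the minority class, then train and evaluate a random forest
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
X_train, X_test, y_train, y_test = train_test_split(X_res, y_res, test_size=0.2, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print(f1_score(y_test, clf.predict(X_test)))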

Using the model, I got an F1 score of 0.99 on the test set. Here is the result:

Putting it all together

Here is everything put together:

In the GIF, I tap my index finger and thumb together once for a single click, twice for a double click, and hold them together for a long press to simulate a mouse-down event.

Future plans

As promised, I will be actively looking for ways to control the mouse using eye movement and if I am successful there, I will surely inform you all about it.

In the meantime, thanks for reading till the end. If you like such articles, follow me and leave a clap!
