ClearShot: Authored By: Davide Balzarotti, Marco Cova and Giovanni Vigna; Presented By: Md Raihan Majumder
1. ClearShot: Eavesdropping on Keyboard Input from Video
Davide Balzarotti, Marco Cova and Giovanni Vigna
Presented By:
Md Raihan Majumder
04/14/2017
2. Outline
Introduction
Motivation and Contributions
Approach
• Computer Vision Analysis
• Text Analysis
Evaluation
Related Works
Conclusion
3. Introduction
Cryptography is widely used to prevent eavesdropping on
electronic communication.
However, side channels remain, such as the electromagnetic emissions of
monitors, the timing of key presses and the sound generated
by keyboards.
The goal of this paper is to automatically reconstruct the
text typed by a user from a video of their typing
session.
A tool called ClearShot has been implemented to test
the approach.
4. Motivation and Contributions
Manually watching someone’s typing activity is long and tedious work.
A scene from the 1992 Robert Redford movie “Sneakers” is
the main motivation for automating the whole task.
Another motivating factor is the wide availability of webcams.
The main contribution is an approach to reconstruct the
keys typed by a user.
Improving the reconstruction rate.
Developing a tool, “ClearShot”, that operates on low-resolution
video.
5. Approach
Why monitor the keyboard?
Only the video camera and its position are under the attacker’s control.
The whole approach is divided into two main phases.
The first phase analyzes the video recorded by the camera using
computer vision techniques.
The result of the first phase is noisy, so a second phase called “Text
Analysis” removes errors using both language and
context-sensitive techniques.
The result of the second phase is the reconstructed text, where each
word is represented by a list of candidates, ranked by similarity.
6. Approach- Computer Vision Analysis
Similar to Gesture recognition research? No!
Computer vision analysis is divided into two subtasks.
• Hand Tracking Analysis: provides information about the position of the user’s hands.
• Key Pressing Analysis: provides information about whether or not a key was
pressed at a certain point in time.
Fig: Overview of the analysis steps performed by ClearShot
7. Hand Tracking Analysis- Contour Analysis
A contour detection technique is used to track the user’s hands. This
provides information about where movement is
happening.
• The user’s hands are assumed to be the only moving objects in the scene.
• The contours of the moving parts of the hands are determined by differencing
each frame w.r.t. the previous one.
As expected, the moving regions are concentrated around the
fingertips and the borders of the hands.
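The frame-differencing step can be illustrated with a minimal sketch. It is not ClearShot's actual code: the function names, the threshold value and the toy frames (grayscale images as nested lists) are all assumptions made for illustration.

```python
def motion_mask(prev_frame, frame, threshold=25):
    """Mark pixels whose brightness changed by more than `threshold` (assumed value)."""
    return [
        [abs(cur - old) > threshold for old, cur in zip(prev_row, row)]
        for prev_row, row in zip(prev_frame, frame)
    ]


def bounding_box(mask):
    """Return (top, left, bottom, right) of the changed pixels, or None if nothing moved."""
    coords = [(r, c) for r, row in enumerate(mask) for c, moved in enumerate(row) if moved]
    if not coords:
        return None
    rows = [r for r, _ in coords]
    cols = [c for _, c in coords]
    return min(rows), min(cols), max(rows), max(cols)


def blank(height, width):
    """Create an all-black toy frame."""
    return [[0] * width for _ in range(height)]


# Toy example: a bright 2x2 "fingertip" moves one pixel to the right.
prev_frame = blank(6, 6)
frame = blank(6, 6)
for r in (2, 3):
    prev_frame[r][1] = prev_frame[r][2] = 200
    frame[r][2] = frame[r][3] = 200

mask = motion_mask(prev_frame, frame)
box = bounding_box(mask)
changed_pixels = sum(cell for row in mask for cell in row)
```

In this toy case only the leading and trailing edges of the moving patch are flagged, while its unchanged interior is not, which mirrors the observation that motion concentrates around fingertips and hand borders.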
8. Key pressing Analysis- Light-based Analysis
Light-based analysis leverages lighting features to determine
changes in the status of the key.
As part of the process, they:
• Detect the contours of the keys on the keyboard.
• Difference the contours on adjacent frames and, if the
difference is above a fixed threshold, assume that the
corresponding key is likely to have been pressed.
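The thresholding idea can be sketched as follows. This is a simplification under stated assumptions: the real tool compares key contours, whereas this sketch compares the mean brightness of a rectangular region per key. The region coordinates, the threshold and all names are hypothetical.

```python
def region_mean(frame, region):
    """Mean brightness inside region = (top, left, bottom, right), bottom/right exclusive."""
    top, left, bottom, right = region
    pixels = [frame[r][c] for r in range(top, bottom) for c in range(left, right)]
    return sum(pixels) / len(pixels)


def likely_pressed(prev_frame, frame, key_regions, threshold=30.0):
    """Return keys whose region changed by more than a fixed threshold (assumed value)."""
    pressed = []
    for key, region in key_regions.items():
        change = abs(region_mean(frame, region) - region_mean(prev_frame, region))
        if change > threshold:
            pressed.append(key)
    return pressed


# Hypothetical layout: two keys side by side in a 2x4 grayscale image.
key_regions = {"a": (0, 0, 2, 2), "s": (0, 2, 2, 4)}
prev_frame = [[100, 100, 100, 100], [100, 100, 100, 100]]
frame = [[100, 100, 40, 40], [100, 100, 40, 40]]  # the "s" key got darker

candidates = likely_pressed(prev_frame, frame, key_regions)
```

A fixed threshold is simple but sensitive to lighting, which is consistent with the noisy output that the later text analysis phase has to clean up.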
9. Key pressing Analysis- Light-based and Occlusion-based
Analysis
Light-based analysis leverages lighting features to determine changes
in the status of the key.
Occlusion-based analysis identifies all the keys that are not
pressed in a given frame.
Fig: Keys that are not pressed are represented by dark polygons
10. Approach- Text Analysis
“The goal of the text analysis phase is to suggest a sequence of meaningful
words starting from the set of candidate letters provided by the video
analysis”
• General-purpose techniques (e.g. edit distance measures, context-sensitive
spelling correction, etc.) work poorly on the problem of text reconstruction, because:
1. Most errors come from inaccuracies in the video analysis phase, not from
typing mistakes.
2. Errors are more random and harder to predict.
Text Analysis part is divided into two modules:
•Language Analysis
•Context Sensitive Analysis
11. Text Analysis- Language Analysis
The goal of language analysis is to determine the probability P(w|s) of
a certain word w, given the set of keys s typed by a user.
Error Model: The computer vision analysis produces two
vectors for each frame.
1. Key list (keys likely to have been pressed)
2. Exclusion list (keys likely to be untouched)
A key appearing in two consecutive frames is called a key grouping.
By analyzing the groupings, they identify character models, which
are placeholders for the characters in the word they are
reconstructing.
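Extracting key groupings from per-frame key lists might look like the sketch below. It assumes that a grouping is a run of at least two consecutive frames in which the same key appears in the key list; the function name, the tuple layout and the sample frames are hypothetical.

```python
def key_groupings(frames, min_frames=2):
    """Find runs of consecutive frames in which a key stays in the key list.

    frames: list of sets, one set of candidate keys per video frame.
    Returns a list of (key, first_frame, last_frame) tuples sorted by start frame.
    """
    groupings = []
    active = {}  # key -> index of the frame where its current run started
    # An extra empty frame at the end closes any run that is still open.
    for index, keys in enumerate(frames + [set()]):
        for key in list(active):
            if key not in keys:
                start = active.pop(key)
                if index - start >= min_frames:
                    groupings.append((key, start, index - 1))
        for key in keys:
            active.setdefault(key, index)
    return sorted(groupings, key=lambda g: (g[1], g[0]))


# Made-up key lists for eight frames; "a" shows up in one frame only.
frames = [{"c"}, {"c"}, {"c", "h"}, {"h"}, set(), {"a"}, {"n"}, {"n"}]
groupings = key_groupings(frames)
```

Here the single-frame appearance of "a" is dropped as too short, while "c" and "h" yield groupings that share frame 2.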
12. Language Analysis Contd.
Character models are created based on following rules:
• If a key grouping doesn't overlap with any other, a new model is created that contains
only that key.
• If two key groupings partially overlap, they are considered consecutive
and two models are created.
• If there is a complete overlap between two or more key groupings, they create
a single model that contains all of those keys.
• If the run of empty frames between two consecutive key groupings is
longer than a certain threshold, an empty model is created.
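The rules above depend on how two key groupings relate in time, which can be sketched as simple interval logic. This is an illustration, not the paper's algorithm: "complete overlap" is assumed to mean that one grouping's frame range contains the other's, and the gap threshold and all names are hypothetical.

```python
def relation(first, second, gap_threshold=3):
    """Classify two groupings (key, first_frame, last_frame); `first` starts no later.

    Returns one of:
      "disjoint" - no overlap: each key gets its own model
      "gap"      - no overlap and a long pause: an empty model goes in between
      "partial"  - partial overlap: two consecutive models
      "complete" - one range contains the other: a single model with both keys
    """
    _, a_start, a_end = first
    _, b_start, b_end = second
    if b_start > a_end:
        empty_frames = b_start - a_end - 1
        return "gap" if empty_frames > gap_threshold else "disjoint"
    if b_end <= a_end or a_start == b_start:
        return "complete"
    return "partial"


partial = relation(("c", 0, 2), ("h", 2, 3))
complete = relation(("a", 0, 4), ("s", 1, 3))
disjoint = relation(("a", 0, 1), ("b", 3, 4))
gap = relation(("a", 0, 1), ("b", 8, 9))
```

Keeping both keys in one model for a complete overlap defers the decision to the later word-level analysis, since the video alone cannot tell which of the two keys was actually pressed.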
Fig: Key lists generated by the video analysis
13. Language Analysis Contd.
Fig: Word model graph when the user types the word “change”. Dashed boxes
are character models. Nodes represent individual keys in a model.
14. Context- sensitive Analysis
• For each word, the language analysis produces a list of candidate words
sorted by score, but this alone does not yield meaningful sentences.
• They address that problem with n-gram sequences, using the Google 3-gram and 4-gram
dataset.
In their context-sensitive analysis, they take three consecutive sets of candidate words provided
by the language analysis and combine them to form all the possible
three-word sentences. They then extract the frequency of each sentence from the
Google 3-gram dataset.
{we, mill, will} {walk, table, talk} {tomorrow, tomato, automate}
Most of the combinations are incorrect, which is why they are not present in the Google
dataset. The wrong candidates that never appear in any valid 3-gram are
discarded. Result of the 3-gram analysis:
{we, will} {talk, walk} {tomorrow}
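The 3-gram filtering step can be sketched in a few lines. The frequency table below is made up for illustration and is not taken from the Google dataset; the function name and data layout are likewise assumptions.

```python
from itertools import product


def filter_by_trigrams(candidate_sets, trigram_counts):
    """Keep only candidate words that occur in at least one known 3-gram.

    candidate_sets: three lists of candidate words, one list per word position.
    trigram_counts: dict mapping (w1, w2, w3) to a frequency.
    Returns (valid trigrams sorted by frequency, pruned candidate lists).
    """
    scored = [(words, trigram_counts.get(words, 0)) for words in product(*candidate_sets)]
    valid = sorted((pair for pair in scored if pair[1] > 0), key=lambda pair: -pair[1])
    pruned = [
        [word for word in candidates if any(words[pos] == word for words, _ in valid)]
        for pos, candidates in enumerate(candidate_sets)
    ]
    return valid, pruned


candidate_sets = [
    ["we", "mill", "will"],
    ["walk", "table", "talk"],
    ["tomorrow", "tomato", "automate"],
]
# Hypothetical frequencies, invented for this example only.
trigram_counts = {
    ("will", "talk", "tomorrow"): 300,
    ("we", "talk", "tomorrow"): 120,
    ("we", "walk", "tomorrow"): 40,
    ("will", "walk", "tomorrow"): 15,
}

valid, pruned = filter_by_trigrams(candidate_sets, trigram_counts)
```

With these invented counts, 23 of the 27 combinations are discarded and the surviving candidates match the result listed above, up to ordering.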
Misfit:
15. Evaluation
Setup:
Keyboard Model: SK-8110 keyboard, black, Dell
Recording Device: Unibrain Fire-i web camera, 15 frames/sec,
640x480 resolution
Lighting Condition: fluorescent lamps
Machine: Pentium 4, 3.6 GHz, 2 GB RAM
All entries containing non-alphabetic characters are stored in a
MySQL database
16. Evaluation Contd.
Experiments:
Two users.
Each typed a 118-word paragraph.
The 1st user took 3m 55s to complete the task.
The 2nd user took 3m 10s to complete the task.
5 misspelled words by the 1st user.
6 misspelled words by the 2nd user.
17. Evaluation Contd.
Reconstruction Capability:
Manual reconstruction took analyst 1 59 mins with 89%
accuracy, and analyst 2 1h 55m with 96% accuracy.
Then they ran ClearShot on the two recordings. It increased
the number of correct words by 3% for the 1st user and by 7% for the
2nd user.
Performance of ClearShot:
It is not optimized for speed, because it is implemented in Python
instead of C.
18. Related Works
1. Compromising emanations (leveraging the emissions
generated by computing devices).
2. Acoustic emanations (caused by typing on a regular keyboard).
3. Traffic analysis techniques, which have been used to eavesdrop on
encrypted communication transmitted over a network.
19. Conclusion
Extracting information by analyzing video is
a difficult task. Still, this sort of attack can be avoided by placing a
physical shield over the keyboard. Some of those techniques are
used in ATMs and POS terminals.
Future work will focus on improving the motion tracking algorithm
so that it works in different settings (lighting, camera settings,
types, etc.).
Editor's Notes
Published On: 2008 IEEE Symposium on Security and Privacy