
Marriage of speech, vision and natural language processing

Researcher at State University of New York at Buffalo
27 Dec 2020

  1. Marriage of Computer Vision, Speech and Natural Language - Yaman Kumar (MIDAS Lab-IIITD, SUNY at Buffalo) - Rajiv Ratn Shah (MIDAS Lab-IIITD)
  2. What is Speech? Text Part of Speech Vision Part of Speech Aural Part of Speech
  3. Why Speech?
  4. Marriage of Speech & Language Conversational speech Information in the acoustic signal beyond the words Interactive nature of conversations
  5. Speech is ... more than spoken words • Rich in ‘extra-linguistic’ information • breathing noises • lip-smacks • Hand movements • Facial Expressions • Rich in ‘para-linguistic’ information • Personality • Attitude • Emotion • Individuality
  6. Some Examples • Disfluency • I am uh uh very …. I am very excited to see you • He is my em …… Yaman is my best friend • Intonation and Stress 1. *This* is my laptop (and not that) • This is *my* laptop (and not yours) • This is my *laptop* (and not book) 2. He found it on the street? • And in reply, He found it on the street • No punctuation and very open grammar • ASR errors
  7. Speech is Adaptive • to the listener • a child (‘parentese’) • a non-native person • a hearing-impaired individual • an animal • a machine(!) • to the cognitive load • interaction with other tasks • stressful/emotional situations • to the environment • noise • reverberation • to the task • casual conversation • reading out loud • public speaking
  8. Content • Content in the spoken medium is the "information or experiences directed towards end-users or an audience". Why is Content Important? Whom do you prefer? • A speaker with style, elegance and panache but with weak content (talking too much off-topic, not providing enough details about facts). OR • An average speaker but with good content (ideas stick to the main topic, provides interesting/required background information).
  9. Content What defines Good Content? (High Relevance and High Sufficiency) Relevance • Related to the topic • Connected to the prompt in a bigger story • No unwanted or off-topic information. Sufficiency • Adequate details (which are also relevant) • All points covered • No missing parts
  10. Response: IVE ACCOMPLISHED UM MANY THINGS IN LIFE ONE OF THEM IS IS BEING A PHILANTHROPIST IVE HELPED A LOT OF PEOPLE MOST SPECIALLY CHILDREN I GO TO SOME UM POOR AREAS AND WE TEACH LIKE THOSE CHILDREN SOME KNOWLEDGE THAT THEY DONT KNOW YET LIKE FOR EXAMPLE IM GOING TO BE THEIR TEACHER AND I I INFORM THEM ALL THE THINGS LIKE UM WHAT TO WRITE HOW TO READ HOW TO DESCRIBE SOMETHING AND THIS IS REALLY IMPORTANT IN MY LIFE BECAUSE BEING A TEACHER IS REALLY GOOD FOR ME AND I THINK IT WILL REALLY HELP ME GROW MY ABILITY TO HELP PEOPLE MOST SPECIALLY CHILDREN Response: IT IS IMPORTANT TO CHOOSE WISELY FOR YOUR CAREER AND ITS ALSO IMPORTANT THAT YOU CHOOSE THAT CAREER BECAUSE UH THIS IS YOUR PASSION AND THIS IS YOUR REALLY ONE JOB AND BECAUSE IF YOU DONT WANT THAT JOB OR CAR CAREER BUT YOU CHOOSE IT UH YOU WILL AT THE END OF THE DAY YOU WILL NOT BE UH MOTIVATED TO WORK WITH IT AND YOU WILL NOT BE YOU ARE UH THERES A TENDENCY THAT YOU WILL NOT ACHIEVE YOUR GOAL OR DESIRE IN YOUR IN THAT CAREER AND YOURE NOT BE WILL BE SUCCESSFUL IN THAT CAREER IT IS IMPORTANT TO CHOOSE WISELY YOUR CAREER AND UH CONSIDER THAT THIS IS YOUR UH THIS IS WHAT YOU REALLY WANT AND THIS IS YOUR PASSIONS AND ARE IT IS UH IF YOU CHOOSE YOUR CAREER BE SURE YOU ARE ENJOYING IT NOT DOING IT Relevance: High. The speaker sticks to the things asked in the prompt (being a philanthropist or teacher as an accomplishment, and its importance). Sufficiency: High. Explains in detail how he helped children as a teacher and why that mattered. Relevance: Low. The speaker goes too far off topic from what is being asked (choosing a career, being successful, what a good career is, instead of talking about an accomplishment). Sufficiency: Low. Provides no information that addresses the points in the prompt. Prompt: You have to narrate to a career advisor one thing you accomplished that you are proud of and why it was important to you.
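For illustration only (not part of the original deck), one crude proxy for the relevance dimension above is lexical overlap between the prompt and a response, e.g., TF-IDF cosine similarity; the prompt and response strings below are shortened stand-ins for the transcripts on the slide.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

prompt = "narrate one thing you accomplished which you are proud of and why it was important"
responses = [
    "i have accomplished many things one of them is being a philanthropist i helped children as a teacher",
    "it is important to choose wisely for your career because this is your passion and your one job",
]

# Cosine similarity between TF-IDF vectors of the prompt and each response;
# a higher score is a (very rough) signal that the response stays on topic.
tfidf = TfidfVectorizer(stop_words="english")
vectors = tfidf.fit_transform([prompt] + responses)
print(cosine_similarity(vectors[0], vectors[1:]))
```

Real content-scoring systems go well beyond lexical overlap (semantic similarity, coverage of prompt points), but this captures the relevance intuition.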
  11. D…Di….Disfluencies • Interruptions in the smooth flow of speech • These interruptions often occur in spoken communication. They usually help the speakers to buy more time while they express their thought process. • Reparandum (RM) - Refers to the unintended and unnecessary part of the disfluency span (This span can be deleted in order to obtain fluency) • Interregnum (IM) - Refers to the part that lies between RM and RR. (This span helps the speaker to fill the intermediate gap) • Repair (RR) - Refers to the corrected span of the RM. (This span should maintain the context of RM)
  12. D…Di….Disfluencies • Examples • Filled pauses : "This is a uhmm … good example" • Discourse Markers : " It's really nice to .. you know .. play outside sometimes." • Self-Correction : " So we will... we can go there." • Repetitions : "The... the... the decision was not mine to make" • Restart : "We would like to eat ... let’s go to the park" • Why can't we recognize these disfluencies solely by looking at the words ? 🤔 • Consideration of the audio helps in understanding the intention of speaker and hence deciding if there is a disfluency or not. • Can get confused with some fluently done repetitions - "Superman is the most most most powerful superhero ! " • Can also get confused from various other interruptions like non-verbal sounds and even silence !
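As a toy illustration (not from the deck), the RM/IM/RR span structure defined on the previous slide can be written out directly: deleting the reparandum and interregnum while keeping the repair yields the fluent utterance.

```python
# Toy span structure for the self-correction example from the slide:
#   "So we will ... we can go there."
disfluency = {
    "reparandum": "we will",   # RM: the abandoned, unneeded span
    "interregnum": "...",      # IM: pause/filler between RM and the repair
    "repair": "we can",        # RR: corrected version of RM
}

def make_fluent(prefix, d, suffix):
    # Deleting RM and IM while keeping RR recovers the fluent utterance.
    return " ".join([prefix, d["repair"], suffix])

print(make_fluent("So", disfluency, "go there."))   # -> "So we can go there."
```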
  13. Pronunciation /prəˌnʌnsɪˈeɪʃ(ə)n/ Mispronunciation Detection: the problem where the perceived pronunciation doesn't match the intended pronunciation, but we can still understand the meaning. Example: pronunciation of the word park. • Phoneme Recognition Problem: state-of-the-art phoneme (the sounds of a language) recognition systems have a phoneme error rate of 18% on native speech data. • Non-native accent: phonemes might be recognized correctly, but acoustic models (the models used to detect phonemes) are often confused by non-native speech. Some phonemes (sounds) exist in the native language which have no counterpart in the non-native language. E.g., the Je sound in French has no English mapping, which confuses the acoustic model into predicting wrong sequences of phonemes. (A small alignment sketch follows below.)
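A minimal sketch (an illustration, not the deck's method): mispronunciation detection is often approximated by aligning the recognizer's phoneme string against the canonical one; the ARPAbet sequences for "park" below are hand-picked examples.

```python
def edit_distance(ref, hyp):
    # Standard Levenshtein distance between two phoneme sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + sub)
    return d[-1][-1]

canonical = ["P", "AA", "R", "K"]     # intended pronunciation of "park"
recognized = ["P", "AO", "K"]         # hypothetical phoneme-recognizer output
errors = edit_distance(canonical, recognized)
print(f"phoneme error rate = {errors / len(canonical):.0%}")  # flags a possible mispronunciation
```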
  14. Pronunciation Intelligibility: there can be a large gap between the intended speech and the spoken speech. Example: the pronunciation of the word mEssage is incorrect. A good ASR system will perceive it as mAssage and rate it as correctly pronounced. However, the user meant to say mEssage.
  15. Discourse Coherence • Discourse is a coherent combination of spoken (or written) utterances communicated between a speaker (or writer) and a listener (or reader). • Discourse is a PRODUCT? ✍️ (linguistic perspective) • Discourse is a PROCESS!! 🤔🤔 (cognitive perspective) • Discourse coherence is the semantic relationship between propositions or communicative events in discourse. • It is a feature of the perception 👀👂 of discourse rather than the content of discourse itself.
  16. Discourse Coherence Discourse as Product ✍ • A well written speech. • How the discourse content is structured and organized by the speaker. • Cohesion in text, use of discourse markers, connectives, etc. • How readable is the text, how complex is the text, etc. Discourse as Process 🤔 • A well delivered speech. • How the discourse content is delivered efficiently to the listener. • Prosodic variation, use of stress, intonation, pauses, etc. • How intelligible is the speech, how focused is the listener, etc.
  17. Prosody • Prosodic features span... • several speech segments • several syllables • whole utterances • Such ‘suprasegmental’ behaviour includes ... • lexical stress (Prominence of Syllables) • lexical tone (Pitch pattern to distinguish words) • rhythmic stress (Emphasis) • intonation (Difference of Expressive meaning)
  18. It’s not what you say, but how you say it.
  19. The Two Ronnies - Four Candles vs Fork Handles Speech is Ambiguous
  20. Silent Speech is Even More Ambiguous • Elephant Juice vs I Love You • Million vs Billion • Pet vs Bell vs Men Speak them to yourself! Your lip movements are exactly the same!
  21. Exploring Semi-Supervised Learning for Predicting Listener Backchannels Accepted at CHI’21! Vidit Jain, Maitree Leekha, Jainendra Shukla, Rajiv Ratn Shah
  22. Introduction ● Developing human-like conversational agents is important! ○ Applications in education and healthcare ● Challenge: how to make them seem natural? ○ Human conversations are complex! ● Listener backchannels: a crucial element of human conversation: ○ Listener’s “regular” feedback to the speaker, indicating presence ○ Verbal: e.g., short utterances ○ Non-verbal: e.g., head shake, nod, smile etc. ● We focus on modelling these backchannels as a step towards natural Human Robot Interactions (HRIs).
  23. Research Questions Key Research Gaps: ● Prior works [1, 2 and more] relied on large amounts of manually annotated data to train listener backchannel prediction (LBP) models ○ This is expensive in terms of man hours ● In addition, all previous works have focused only on English conversations Major Contributions: ● Validating the use of semi-supervised techniques for LBP ○ Models using only 25% of the manual annotations performed on par! ● Unlike past works, we use Hindi conversations [1] Park, Hae Won, et al. "Telling stories to robots: The effect of backchanneling on a child's storytelling." 2017 12th ACM/IEEE International Conference on Human-Robot Interaction (HRI). IEEE, 2017. [2] Goswami, Mononito, Minkush Manuja, and Maitree Leekha. "Towards Social & Engaging Peer Learning: Predicting Backchanneling and Disengagement in Children." arXiv preprint arXiv:2007.11346 (2020).
  24. Dataset ● We use the multimodal Hindi based Vyaktitv dataset [3] ○ 25 conversations, each ~16 min long ○ Video and audio feeds available for each participant (50 recordings) ● Annotations Done: ○ 3 annotators ○ Signal (kappa): Nod (0.7), Head-shake (0.6), Mouth (0.6), Eyebrow (0.5), Utterances (0.5) ● Features Extracted: ○ OpenFace - visual features: 18 facial action units (FAU), gaze velocities & accelerations, translational and rotational head velocities & accelerations, blink rate, pupil location, and smile ratio ○ pyAudioAnalysis - audio features: voice activity, MFCC, F0, energy [3] Khan, Shahid Nawaz, et al. "Vyaktitv: A Multimodal Peer-to-Peer Hindi Conversations based Dataset for Personality Assessment." 2020 IEEE Sixth International Conference on Multimedia Big Data (BigMM). IEEE, 2020.
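For concreteness (not the deck's exact pipeline, which uses OpenFace and pyAudioAnalysis), a librosa-based stand-in for the audio features named above (MFCC, energy, F0, a crude voice-activity proxy); "listener.wav" is a hypothetical per-participant recording.

```python
import numpy as np
import librosa

y, sr = librosa.load("listener.wav", sr=16000)       # hypothetical audio feed

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # (13, frames)
energy = librosa.feature.rms(y=y)                    # (1, frames)
f0 = librosa.yin(y, fmin=65, fmax=400, sr=sr)        # (frames,)
voiced = energy[0] > 0.5 * energy[0].mean()          # crude voice-activity proxy

# Align frame counts and stack into one feature vector per short-time frame.
n = min(mfcc.shape[1], energy.shape[1], f0.shape[0])
features = np.vstack([mfcc[:, :n], energy[:, :n], f0[np.newaxis, :n],
                      voiced[np.newaxis, :n].astype(float)])
print(features.shape)
```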
  25. System Architecture Methodology: (i) Semi-supervised learning for identifying backchannels and type of signals emitted using a subset of labeled data. (ii) Learning to predict these instances and signals using the speaker's context.
  26. Task Formulations Identification: given a listener's audio and video feeds, identify when they backchannel. These are the true labels for the prediction task; we use semi-supervision here to generate these pseudo-labels (instance & type). Prediction: given a speaker's context (~3-7 sec long), predict whether the listener will backchannel immediately after it. Only the speaker's features are used to predict the instance & type of backchannel (verbal/visual).
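A minimal sketch of the pseudo-labelling idea, using scikit-learn's self-training wrapper as a stand-in for the deck's semi-supervised procedure; the 25% seed fraction and the random-forest base learner mirror the slides, while the feature dimensions and threshold are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.semi_supervised import SelfTrainingClassifier

# X: per-window listener features (e.g., facial action units + audio stats);
# y: backchannel labels, with -1 marking the ~75% of windows left unannotated.
rng = np.random.default_rng(0)
X = rng.random((1000, 40))
y = np.full(1000, -1)
y[:250] = rng.integers(0, 2, 250)          # the 25% manually annotated seed

self_training = SelfTrainingClassifier(RandomForestClassifier(n_estimators=200),
                                       threshold=0.8)
self_training.fit(X, y)

pseudo_labels = self_training.transduction_   # labels for every window (seed + pseudo)
print((pseudo_labels != -1).sum(), "windows labelled after self-training")
```

The confidently pseudo-labelled windows can then serve as targets when training the speaker-context prediction model.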
  27. Key Findings ● The semi-supervised process was able to identify backchannel instances and signal types very well ○ Respective accuracies: 0.90 (ResNet) & 0.85 (RF), with only 25% manual annotation as seed! ● Comparing prediction models trained using manually annotated vs. semi-supervised pseudo labels: ○ Using semi-supervision, we reach ~94% of the baseline performance! ● Qualitative Study: the majority of participants could not distinguish between the two prediction models!
  28. Demo: Our final system trained using semi-supervision
  29. Lip Movement as Input for Information Retrieval https://www.aaai.org/ojs/index.php/AAAI/article/view/5649 https://www.aaai.org/ojs/index.php/AAAI/article/view/5148 https://www.aaai.org/ojs/index.php/AAAI/article/view/4106 https://www.isca-speech.org/archive/Interspeech_2019/pdfs/3269.pdf https://www.isca-speech.org/archive/Interspeech_2019/abstracts/3273.html https://www.youtube.com/watch?v=3BqQQnTfnlE&list=PL9rvax0EIUA6PDoiDT2Wp462GsTnikrvY
  30. Visual Speech Recognition
  31. Let’s put your lip reading abilities to the test (SHOW OF HANDS)
  32. CONFIDENCE CONFERENCE CONCERNS CONFLICT
  33. CONFIDENCE CONFERENCE CONCERNS CONFLICT
  34. MOBIVSR Predictions: •CONFERENCE 65% •CONFLICT 20% •OFFICERS 10% •OFFICE 5%
  35. SPECIAL SPONGE DESPERATION SPEECH
  36. SPECIAL SPONGE DESPERATION SPEECH
  37. MOBIVSR Predictions: •SPEECH 85% •BRITISH 10% •PRESSURE 2% •INFLATION 1%
  38. Let’s jump to MobiVSR difficulty level. Your options: ABOUT ABSOLUTELY ABUSE ACCESS ACCORDING ACCUSED ACROSS ACTION ACTUALLY AFFAIRS AFFECTED AFRICA AFTER AFTERNOON AGAIN AGAINST AGREE AGREEMENT AHEAD ALLEGATIONS ALLOW ALLOWED ALMOST ALREADY ALWAYS AMERICA AMERICAN AMONG AMOUNT ANNOUNCED ANOTHER ANSWER ANYTHING AREAS AROUND ARRESTED ASKED ASKING ATTACK ATTACKS AUTHORITIES BANKS BECAUSE BECOME BEFORE BEHIND BEING BELIEVE BENEFIT BENEFITS BETTER BETWEEN BIGGEST BILLION BLACK BORDER BRING BRITAIN BRITISH BROUGHT BUDGET BUILD BUILDING BUSINESS BUSINESSES CALLED CAMERON CAMPAIGN CANCER CANNOT CAPITAL CASES CENTRAL CERTAINLY CHALLENGE CHANCE CHANGE CHANGES CHARGE CHARGES CHIEF CHILD CHILDREN CHINA CLAIMS CLEAR CLOSE CLOUD COMES COMING COMMUNITY COMPANIES COMPANY CONCERNS CONFERENCE CONFLICT CONSERVATIVE CONTINUE CONTROL COULD COUNCIL COUNTRIES COUNTRY COUPLE COURSE COURT CRIME CRISIS CURRENT CUSTOMERS DAVID DEATH DEBATE DECIDED DECISION DEFICIT DEGREES DESCRIBED DESPITE DETAILS DIFFERENCE DIFFERENT DIFFICULT DOING DURING EARLY EASTERN ECONOMIC ECONOMY EDITOR EDUCATION ELECTION EMERGENCY ENERGY ENGLAND ENOUGH EUROPE EUROPEAN EVENING EVENTS EVERY EVERYBODY EVERYONE EVERYTHING EVIDENCE EXACTLY EXAMPLE EXPECT EXPECTED EXTRA FACING FAMILIES FAMILY FIGHT FIGHTING FIGURES FINAL FINANCIAL FIRST FOCUS FOLLOWING FOOTBALL FORCE FORCES FOREIGN FORMER FORWARD FOUND FRANCE FRENCH FRIDAY FRONT FURTHER FUTURE GAMES GENERAL GEORGE GERMANY GETTING GIVEN GIVING GLOBAL GOING GOVERNMENT GREAT GREECE GROUND GROUP GROWING GROWTH GUILTY HAPPEN HAPPENED HAPPENING HAVING HEALTH HEARD HEART HEAVY HIGHER HISTORY HOMES HOSPITAL HOURS HOUSE HOUSING HUMAN HUNDREDS IMMIGRATION IMPACT IMPORTANT INCREASE INDEPENDENT INDUSTRY INFLATION INFORMATION INQUIRY INSIDE INTEREST INVESTMENT INVOLVED IRELAND ISLAMIC ISSUE ISSUES ITSELF JAMES JUDGE JUSTICE KILLED KNOWN LABOUR LARGE LATER LATEST LEADER LEADERS LEADERSHIP LEAST LEAVE LEGAL LEVEL LEVELS LIKELY LITTLE LIVES LIVING LOCAL LONDON LONGER LOOKING
  39. MAJOR MAJORITY MAKES MAKING MANCHESTER MARKET MASSIVE MATTER MAYBE MEANS MEASURES MEDIA MEDICAL MEETING MEMBER MEMBERS MESSAGE MIDDLE MIGHT MIGRANTS MILITARY MILLION MILLIONS MINISTER MINISTERS MINUTES MISSING MOMENT MONEY MONTH MONTHS MORNING MOVING MURDER NATIONAL NEEDS NEVER NIGHT NORTH NORTHERN NOTHING NUMBER NUMBERS OBAMA OFFICE OFFICERS OFFICIALS OFTEN OPERATION OPPOSITION ORDER OTHER OTHERS OUTSIDE PARENTS PARLIAMENT PARTIES PARTS PARTY PATIENTS PAYING PEOPLE PERHAPS PERIOD PERSON PERSONAL PHONE PLACE PLACES PLANS POINT POLICE POLICY POLITICAL POLITICIANS POLITICS POSITION POSSIBLE POTENTIAL POWER POWERS PRESIDENT PRESS PRESSURE PRETTY PRICE PRICES PRIME PRISON PRIVATE PROBABLY PROBLEM PROBLEMS PROCESS PROTECT PROVIDE PUBLIC QUESTION QUESTIONS QUITE RATES RATHER REALLY REASON RECENT RECORD REFERENDUM REMEMBER REPORT REPORTS RESPONSE RESULT RETURN RIGHT RIGHTS RULES RUNNING RUSSIA RUSSIAN SAYING SCHOOL SCHOOLS SCOTLAND SCOTTISH SECOND SECRETARY SECTOR SECURITY SEEMS SENIOR SENSE SERIES SERIOUS SERVICE SERVICES SEVEN SEVERAL SHORT SHOULD SIDES SIGNIFICANT SIMPLY SINCE SINGLE SITUATION SMALL SOCIAL SOCIETY SOMEONE SOMETHING SOUTH SOUTHERN SPEAKING SPECIAL SPEECH SPEND SPENDING SPENT STAFF STAGE STAND START STARTED STATE STATEMENT STATES STILL STORY STREET STRONG SUNDAY SUNSHINE SUPPORT SYRIA SYRIAN SYSTEM TAKEN TAKING TALKING TALKS TEMPERATURES TERMS THEIR THEMSELVES THERE THESE THING THINGS THINK THIRD THOSE THOUGHT THOUSANDS THREAT THREE THROUGH TIMES TODAY TOGETHER TOMORROW TONIGHT TOWARDS TRADE TRIAL TRUST TRYING UNDER UNDERSTAND UNION UNITED UNTIL USING VICTIMS VIOLENCE VOTERS WAITING WALES WANTED WANTS WARNING WATCHING WATER WEAPONS WEATHER WEEKEND WEEKS WELCOME WELFARE WESTERN WESTMINSTER WHERE WHETHER WHICH WHILE WHOLE WINDS WITHIN WITHOUT WOMEN WORDS WORKERS WORKING WORLD WORST WOULD WRONG YEARS YESTERDAY YOUNG
  40. MOBIVSR Predictions: •DIFFICULT 40% •GIVING 20% •GIVEN 10% •EVERYTHING 5%
  41. Speech as Inputs for Information Retrieval
  42. Lip Movement / Speech for Information Retrieval https://www.aaai.org/ojs/index.php/AAAI/article/view/4106 https://www.isca-speech.org/archive/Interspeech_2019/pdfs/3269.pdf https://www.isca-speech.org/archive/Interspeech_2019/abstracts/3273.html https://www.youtube.com/watch?v=3BqQQnTfnlE&list=PL9rvax0EIUA6PDoiDT2Wp462GsTnikrvY https://www.aaai.org/ojs/index.php/AAAI/article/view/5649 https://www.aaai.org/ojs/index.php/AAAI/article/view/5148
  43. Demonstration: English Video to Speech Reconstruction (Note: during speech reconstruction, the sex of the speaker is preserved)
  44. Demo: Chinese Video to Speech Reconstruction
  45. Demonstration: Hindi Video to Speech Reconstruction
  46. Example person with dysarthria
  47. Lip Movement / Speech / Video for Information Retrieval https://www.aaai.org/ojs/index.php/AAAI/article/view/4106 https://www.isca-speech.org/archive/Interspeech_2019/pdfs/3269.pdf https://www.aaai.org/ojs/index.php/AAAI/article/view/5649 https://www.aaai.org/ojs/index.php/AAAI/article/view/5148 https://www.isca-speech.org/archive/Interspeech_2019/abstracts/3273.html https://www.youtube.com/watch?v=3BqQQnTfnlE&list=PL9rvax0EIUA6PDoiDT2Wp462GsTnikrvY
  48. Video Construction
  49. GAN output for an English phrase
  50. Viseme concatenation vs. TC-GAN generated output with inter-visemes, for an English phrase: Good Bye
  51. GAN output for a Hindi phrase
  52. Viseme concatenation vs. TC-GAN generated output with inter-visemes, for a Hindi phrase: Aap Kaise hai (How are you)
  53. LIFI: Towards Linguistically Informed Frame Interpolation Aradhya Neeraj Mathur¹, Devansh Batra², Yaman Kumar¹, Rajiv Ratn Shah¹, Roger Zimmermann³ Indraprastha Institute of Information Technology Delhi, India¹ Netaji Subhas University of Technology, Delhi² National University of Singapore (NUS)³
  54. Motivation • Speech videos are extremely common across the internet (lectures, YouTube videos and even video-calling apps), but no video interpolation methods pay heed to the nuances of speech videos. • The visual modality of speech is complicated. While uttering a single sentence, our lips cycle through dozens of visemes. • First 30 frames of a speaker speaking the sentence "I don't exactly walk around with a hundred and thirty five million dollars in my wallet". Notice the rich lip movement with opening and closing of the mouth.
  55. Motivation We try to reconstruct this speech video by interpolating the intermediate frames from the first and last frames using state-of-the-art models. Expected: original frames (with rich mouth movements). Observed: interpolated frames (with virtually no mouth movements). Yet the standard metrics look surprisingly good: L1 = 0.0498, MSE = 0.0088, SSIM = 0.9521, PSNR = 20.5415. This means that we need better evaluation criteria for the interpolation or reconstruction of speech videos.
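A small sketch (illustrative only, with random arrays standing in for real frames) of how the frame-level metrics quoted above are typically computed; because most pixels in a talking-face video are static background, such metrics can stay high even when the mouth region is wrong.

```python
import numpy as np
from skimage.metrics import structural_similarity, peak_signal_noise_ratio, mean_squared_error

# Hypothetical (frames, H, W) grayscale clips in [0, 1]: the original video and a
# reconstruction that simply cross-fades between the first and last frames.
original = np.random.rand(30, 96, 96)
alphas = np.linspace(0, 1, 30)[:, None, None]
interpolated = (1 - alphas) * original[0] + alphas * original[-1]

o, p = original[15], interpolated[15]          # compare one intermediate frame
print("L1  ", np.abs(o - p).mean())
print("MSE ", mean_squared_error(o, p))
print("SSIM", structural_similarity(o, p, data_range=1.0))
print("PSNR", peak_signal_noise_ratio(o, p, data_range=1.0))
```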
  56. Proposed Work 1. Challenge Datasets for Speech Video Reconstruction (based on LRS3-TED) Guess the words spoken? …………… "Well the short answer to the question is no, it's not the same thing" Random Frame Corruption (40%) Extreme Sparsity Corruption (75%) Prefix Corruption Suffix Corruption
  57. Proposed Work 1. Challenge Datasets for Speech Video Reconstruction (based on LRS3-TED) Visemic Corruption (visemes of a particular type being corrupted and requiring regeneration) Intra-Word Corruption (corruption of frames within the occurrence of a large word) Inter-Word Corruption (corruption of frames across word boundaries)
  58. Proposed Work 2. Visemic reconstruction with an ROI Loss unit A modified FCN3D with an ROI extraction unit to calculate the ROI loss. Instead of training the reconstruction network with only the L1 loss between reconstructed and original images, we introduce an ROI Loss which measures the similarity between the visemic regions of interest of the observed and generated facial images. To accomplish this, we develop an ROI unit as shown on the left.
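A minimal PyTorch sketch of the idea, under the assumption that the ROI unit yields a mouth-region mask; the weighting, mask shape and mouth box below are illustrative, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def reconstruction_loss(pred, target, roi_mask, lam=1.0):
    """L1 loss over the whole frame plus an extra L1 term restricted to the
    visemic region of interest (mouth area). Shapes: pred/target (B, C, H, W),
    roi_mask (B, 1, H, W) with 1s over the mouth; lam is an assumed weight."""
    full_l1 = F.l1_loss(pred, target)
    roi_l1 = F.l1_loss(pred * roi_mask, target * roi_mask)
    return full_l1 + lam * roi_l1

# Toy usage with random tensors standing in for generated and original frames.
pred = torch.rand(2, 3, 96, 96, requires_grad=True)
target = torch.rand(2, 3, 96, 96)
mask = torch.zeros(2, 1, 96, 96)
mask[:, :, 60:90, 20:76] = 1.0     # assumed mouth bounding box from the ROI unit
loss = reconstruction_loss(pred, target, mask)
loss.backward()
```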
  59. Proposed Work Key Findings We evaluate a Fully Convolutional Network (FCN3D), a convolutional bi-directional LSTM, and the original FCN3D network after addition of the ROI unit and visemic loss during training. We observe: 1. Different networks perform differently on different types of corruption. 2. While SuperSloMo performs very well on random frame corruption, it performs much more poorly on the other types of corruption. 3. As expected, a sequential LSTM-based generator works much better than a fully convolutional network when consecutive frames are corrupted, as in prefix and suffix corruption. 4. Most importantly, the addition of an ROI loss also helps a network perform better on all forms of corruption and on non-ROI based metrics, as shown by the results for (FCN3D+ROI). Tables: performance of different models over datasets containing random, prefix and suffix corruptions; performance of different models over datasets containing corruptions on different visemes.
  60. Touchless Typing Using Head Movement-based Gestures Shivam Rustagi¹, Aakash Garg¹, Pranay Raj Anand², Rajesh Kumar³, Yaman Kumar², Rajiv Ratn Shah² Delhi Technological University, India¹ Indraprastha Institute of Information Technology Delhi, India² Haverford College, USA³
  61. Motivation Traditional input devices, and diseases which render these devices useless: ● Upper limb paralysis ● Deformed limb ● Damaged fingers/hand ● Various other disabilities
  62. Motivation
  63. Related Work [1] A. Nowosielski, "Two-letters-key keyboard for predictive touchless typing with head movements" [2] J. Tu, H. Tao, and T. Huang, "Face as mouse through visual face tracking" [3] M. Nabati and A. Behrad, "3D head pose estimation and camera mouse implementation using a monocular video camera"
  64. Related Work Mid-air touchless typing techniques: [4] A. Markussen, M. R. Jakobsen, and K. Hornbæk, "Vulture: A mid-air word-gesture keyboard" (using fingers) [5] C. Yu, Y. Gu, Z. Yang, X. Yi, H. Luo, and Y. Shi, "Tap, dwell or gesture? Exploring head-based text entry techniques for HMDs" (using head)
  65. Proposed Work For the 10,000 most common English words there are 8,529 unique cluster sequences, with each sequence corresponding to 1.17 different words on average. So once we predict the cluster sequence, it can be translated to 1-2 valid words on average (see the sketch below).
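A toy sketch of that inversion step; the letter-to-cluster grouping below is made up for illustration (the real system uses the colour-coded QWERTY key clusters from the keyboard layout):

```python
from collections import defaultdict

# Hypothetical letter -> cluster assignment, standing in for the colour-coded keyboard.
letter_to_cluster = {ch: i // 4 for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz")}

def cluster_sequence(word):
    return tuple(letter_to_cluster[c] for c in word)

# Stand-in for the 10,000-word vocabulary: map each cluster sequence to its candidate words.
words = ["locate", "single", "family", "would", "place"]
seq_to_words = defaultdict(list)
for w in words:
    seq_to_words[cluster_sequence(w)].append(w)

# After the model predicts a cluster sequence, look up the (typically 1-2) valid words.
print(seq_to_words[cluster_sequence("place")])
```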
  66. Data Collection: Setup Equipment (Configuration; Purpose): • Monitor and keyboard (17-inch monitor and standard keyboard; the colour-coded QWERTY keyboard was displayed on the monitor, and the keyboard was used to start and stop recording). • Cameras on tripods (3 Samsung M10 mobile cameras recording video at 30 fps and 1920 x 1080 resolution, all with the OpenCamera app installed, plus 1 Samsung M10 mobile with the MUSE2 app; the 3 cameras were kept at -45, 0 and 45 degrees respectively to record the head movements). • MUSE2 headband (sensors such as accelerometer and gyroscope; the sensors recorded the acceleration and rotation of the head). • Moderator's laptop (standard; a Python script on the laptop started and stopped the cameras simultaneously). *Note: For our research we have used only the central view (Camera-2) recordings.
  67. Data Collection: Description ❑ Total number of users volunteered = 25 (16 male; 9 female; data of 3 users discarded on manual inspection) ❑ Each user recorded 3 video samples for each of 35 items (20 words, 10 phrases, 5 sentences, as per Table 1) ❑ Total number of video samples = 2310 (22 x 35 x 3) Table 1. List of the 20 words, 10 phrases and 5 sentences typed by each user (each repeated 3 times). Words: locate, single, family, would, place, large, work, take, live, box, method, listen, house, learn, come, some, ice, old, fly, leg. Phrases: hello, excuse me, i am sorry, thank you, good bye, see you, nice to meet you, you are welcome, how are you, have a good time. Sentences: i never gave up, best time to live, catch the trade winds, hear a voice within you, he will forget it.
  68. Data Collection: Procedure (Camera-1 and Camera-2 views)
  69. Data Collection: Statistics Average number of letters per entry: Words 4.33, Phrases 10.6, Sentences 18.6. ❏ The words were selected to have proper cluster coverage. ❏ The phrases and sentences were selected from the OuluVS [6] and TIMIT [7] datasets respectively. Fig.: Coverage of each cluster across the dataset. Fig.: Average gestures per minute for each user (avg = 49.26, std = 5.3).
  70. HopeNet Architecture The proposed method is based on a CNN-RNN architecture. The feature extractor part, as shown above, is based on the HopeNet architecture, which predicts the yaw, pitch and roll features for the input image. The network is trained using a multi-task classification scheme. We utilize the available model pretrained on large-pose face images from the 300W dataset.
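A short sketch of the multi-task classification idea as used in the public HopeNet code: each Euler angle is classified into bins, and a continuous angle is recovered as the softmax-weighted expectation over bin centres. The 66-bin / 3-degree / 99-degree-offset constants are assumptions taken from that public code, not stated in the deck.

```python
import torch

def bins_to_degrees(logits, num_bins=66, bin_width=3.0, offset=99.0):
    # Convert per-angle bin logits into a continuous angle (degrees) via the
    # expectation over bin indices, as done in HopeNet-style heads.
    probs = torch.softmax(logits, dim=-1)
    idx = torch.arange(num_bins, dtype=probs.dtype)
    return (probs * idx).sum(dim=-1) * bin_width - offset

yaw_logits = torch.randn(1, 66)        # toy output of the yaw classification head
print(bins_to_degrees(yaw_logits))     # continuous yaw estimate in degrees
```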
  71. Working of HopeNet HopeNet output visualized on a user. The three vectors are constructed from the Euler angles (features) predicted by the network.
  72. CNN-RNN Architecture The features from HopeNet are passed into a multi-layered BiGRU network, which is trained using a CTC loss function. During the inference phase we use beam search to decode the cluster sequence.
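A minimal PyTorch sketch of that training setup, assuming per-frame yaw/pitch/roll features and a small number of key clusters; the sizes below are illustrative, and the beam-search decoding of the per-frame posteriors is omitted.

```python
import torch
import torch.nn as nn

num_clusters = 8                      # assumed number of key clusters; CTC blank = index 0
gru = nn.GRU(input_size=3, hidden_size=128, num_layers=2,
             bidirectional=True, batch_first=True)
head = nn.Linear(2 * 128, num_clusters + 1)
ctc = nn.CTCLoss(blank=0)

feats = torch.randn(4, 120, 3)                             # (batch, frames, yaw/pitch/roll)
out, _ = gru(feats)
log_probs = head(out).log_softmax(-1).permute(1, 0, 2)     # (frames, batch, classes) for CTC

targets = torch.randint(1, num_clusters + 1, (4, 10))      # target cluster sequences
loss = ctc(log_probs, targets,
           input_lengths=torch.full((4,), 120),
           target_lengths=torch.full((4,), 10))
loss.backward()                                            # CTC needs no frame-level alignment
```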
  73. Evaluation Metric: DTW
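For reference, a minimal dynamic-time-warping sketch of the kind of distance such a metric computes between a predicted and a ground-truth cluster sequence (plain NumPy, illustrative only):

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic-time-warping distance between two 1-D sequences, tolerating
    insertions, deletions and timing differences in the predicted sequence."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

print(dtw_distance([1, 2, 2, 3], [1, 2, 3, 3]))   # small distance for a near-match
```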
  74. Results The method is evaluated in two scenarios: ● Inter-user: training on user set S1 and testing on user set S2, such that S1 and S2 are mutually exclusive; cluster sequences are kept the same for training and testing. ● Intra-user: for every user, i.e. the set S = S1 ∪ S2, we record 3 samples per sequence; 2 samples were used for training and testing is done on the 3rd sample.
  75. Conclusion and Future Work Our work presents a meaningful way of mapping gestures to character (cluster) sequences, which could be beneficial for people with disabilities. Also, our dataset is publicly available, which could help improve the current system. In the future, the aim is to improve performance by: 1. Using more training data containing a variety of meaningful sequences, and 2. Combining video feeds from multiple cameras, brainwaves recorded via EEG sensors, and the acceleration and rotation of the user's head recorded via accelerometer and gyroscope. Other future applications could also work in the direction of integrating the interface with wearable devices and mobile computing. This would bring a newer set of applications, such as browsing from wearable glasses.
  76. Information Retrieval through Soft Biometrics https://arxiv.org/pdf/2001.09134.pdf
  77. SeekSuspect: Retrieving Suspects from Criminal Datasets using Visual Memory Aayush Jain*, Meet Shah*, Suraj Pandey*, Mansi Agarwal*, Rajiv Ratn Shah, Yifang Yin ● Police maintain a crime dossier system that contains information such as photographs and physical details. ● Finding suspects by name is possible, but this fails when we only have an informant's visual memory. ● Law enforcement agencies used to hire sketch artists, but they are limited in number. ● We propose SeekSuspect, a fast interactive suspect retrieval system. ● SeekSuspect employs sophisticated deep learning and computer vision techniques ○ to modify the search space and ○ find the envisioned image effectively and efficiently. (Figure: the informant starts from a rough description, "Female, fair, black hair...", without exactly remembering who she was; SeekSuspect returns similar and relevant images and asks "Is this the person you wish to search for?")
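A toy sketch of the retrieval core such a system might use (not SeekSuspect's actual implementation): rank dossier photos by cosine similarity between their face embeddings and the embedding of whichever image the informant marks as closest so far.

```python
import numpy as np

def rank_gallery(query_emb, gallery_embs, top_k=5):
    # Cosine similarity between the query embedding and every gallery embedding,
    # returning the indices of the most similar dossier photos.
    q = query_emb / np.linalg.norm(query_emb)
    g = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    scores = g @ q
    return np.argsort(-scores)[:top_k]

gallery = np.random.rand(1000, 128)    # stand-in embeddings for dossier photos
query = np.random.rand(128)            # embedding of the informant's current "closest" pick
print(rank_gallery(query, gallery))
```

Iterating this loop with the informant's feedback progressively narrows the search toward the envisioned face.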
  78. SeekSuspect
  79. https://midas.iiitd.edu.in/ https://facebook.com/midasiiitd/ https://twitter.com/midasiiitd/ https://linkedin.com/company/midasiiitd/
  80. Team • Director: Dr. Rajiv Ratn Shah • PhD Students: Hitkul, Shivangi, Ritwik, Mohit, Yaman, Hemant, Kriti, Astha • MTech Students: Abhishek, Suraj, Meet, Aayush, William, Subhani, etc. • Research Assistants: Manraj, Pakhi, Karmanya, Mehar, Saket, Anuj, etc. • BTech Students (both full-time and remote students): • DTU: Maitree Leekha, Mansi Agarwal, Shivang Chopra, Rohan Mishra, Himanshu, etc. • NSUT: Ramit Sahwney, Puneet Mathur, Avinash Swaminathan, Rohit Jain, Hritwik, etc. • IIT: Pradyumn Gupta, Abhigyan Khaund, Palak Goenka, Amit Jindal, Prateek Manocha, etc. • IIIT: Vedant Bhatia, Raj K Gupta, Shagun Uppal, Osheen Sachdev, Siddharth Dhawan, etc. • Alumni (Placements, Internships, MS Admissions): • Companies: Google, Microsoft, Amazon, Adobe, Tower Research, Walmart, Qualcomm, Goldman Sachs, Bloomberg, IBM Research, Wadhwani AI, Samsung Research, etc. • Academia: CMU, Columbia University, University of Pennsylvania, University of Maryland, University of Southern California, Erasmus Mundus, University of Virginia, Georgia Tech, etc.
  81. Collaborators • Prof Roger Zimmermann, National University of Singapore, Singapore • Prof Changyou Chen, State University of New York at Buffalo, USA • Prof Mohan Kankanhalli, National University of Singapore, Singapore • Prof Ponnurangam Kumaraguru (PK), IIIT Delhi, India • Dr. Amanda Stent, Bloomberg, New York, USA • Dr. Debanjan Mahata, Bloomberg, New York, USA • Prof. Rada Mihalcea, University of Michigan, USA • Prof. Shin'ichi Satoh, National Institute of Informatics, Japan • Prof. Jessy Li, University of Texas at Austin, USA • Prof. Huan Liu, Arizona State University, USA • Prof. Naimul Khan, Ryerson University, Canada • Prof. Diyi Yang, Georgia Institute of Technology, USA • Prof Payman Vafaee, Columbia University, USA • Prof Cornelia Caragea, University of Illinois at Chicago, USA • Dr. Mika Hama, SLTI, USA, and many more...
  82. Research (AI for Social Good) • NLP and multimedia based systems for society (education, healthcare, etc.) • Automatic speech recognition (ASR) for different domains and accents (e.g., Indian, African) • Visual speech recognition/reconstruction (VSR) such as lipreading and speech reconstruction • Hate speech and malicious user detection in code-switched scenarios on social media • Mental health problems such as suicidal ideation and depression detection on social media • Building multimodal information retrieval and information extraction systems • Knowledge graph construction for different domains, e.g., medical, e-commerce, defence, etc. • Automated systems for number plate and damage detection, car insurance claims, e-challan, etc. • Multimodal sentiment analysis and its applications in education, policy making, etc. • Detecting, analyzing, and recommending advertisements in video streams • Fake news detection and propagation, suspect detection, personality detection, etc. • Publications (but are not limited to) • AAAI, CIKM, ACL, EMNLP, WSDM, COLING, ACM Multimedia, ICDM, INTERSPEECH, WWW, ICASSP, WACV, BigMM, IEEE ISM, NAACL, ACM Hypertext, ACM SIGSPATIAL, Elsevier KBS, IEEE Intelligent Systems, IEEE MIPR, ACM MM Asia, AACL, Springer book chapters, etc.
  83. Research (AI for Social Good) • Awards (but are not limited to) • Won the outstanding paper award at COLING 2020 • Got selected to the Heidelberg Laureate Forum (HLF) in 2018, 2019, 2020 • Best student poster at AAAI 2019, Honolulu, Hawaii, USA • Best poster and best industrial paper at IEEE BigMM 2019, Singapore • Winner of the ACM INDIA Chapters Technology Solution Contest 2019 in Jaipur, India • Won the honorable mention award in the ICDM Knowledge Graph Contest 2019 in Beijing, China • Won the best poster runner-up award at the IEEE ISM 2018 conference in Taichung, Taiwan • Skills, Tools, and Frameworks (but are not limited to) • Natural Language Processing, Image Processing, Speech Processing • Multimodal Computing • Python, JavaScript, Java • AI/ Machine Learning/ Deep Learning • TensorFlow, PyTorch, Keras, etc.
  84. Sponsors:
  85. References 1. Conversational Systems and the Marriage of Speech & Language by Mari Ostendorf (University of Washington) 2. Speech 101 by Robert Moore, The University of Sheffield 3. https://www.youtube.com/watch?v=PWGeUztTkRA&ab_channel=Mark_Mitton 4. The Two Ronnies Show 5. Preliminaries to a Theory of Speech Disfluencies (Elizabeth Shriberg, 1994) 6. A Short Analysis of Discourse Coherence (Wang and Guo, 2014) 7. A. Nowosielski, “Two-letters-key keyboard for predictive touchless typing with head movements,” 07 2017, pp. 68–79. 8. J. Tu, H. Tao, and T. Huang, “Face as mouse through visual face tracking,” Comput. Vis. Image Underst., vol. 108, no. 1–2, pp. 35–40, Oct. 2007. [Online]. Available: https://doi.org/10.1016/j.cviu.2006.11.007 9. M. Nabati and A. Behrad, “3D head pose estimation and camera mouse implementation using a monocular video camera,” Signal, Image and Video Processing, vol. 9, 01 2012. 10. A. Markussen, M. R. Jakobsen, and K. Hornbæk, “Vulture: A mid-air word-gesture keyboard,” in CHI ’14, 2014. 11. C. Yu, Y. Gu, Z. Yang, X. Yi, H. Luo, and Y. Shi, “Tap, dwell or gesture? Exploring head-based text entry techniques for HMDs,” in CHI ’17, 2017. 12. Zhao G, Barnard M & Pietikäinen M (2009) Lipreading with local spatiotemporal descriptors. IEEE Transactions on Multimedia 11(7):1254-1265. 13. Garofolo, J. & Lamel, L. & Fisher, W. & Fiscus, J. & Pallett, D. & Dahlgren, N. & Zue, V. (1992). TIMIT Acoustic-Phonetic Continuous Speech Corpus. Linguistic Data Consortium.
  86. 1. Gandharv Mohan, MIDAS Lab IIITD, Btech 2021 2. Akash Sharma, MIDAS Lab IIITD, Btech 2021 3. Rajaswa Patil, MIDAS Lab IIITD, Btech 2021 4. Avyakt Gupta, MIDAS Lab IIITD, Btech 2021 5. Gaurav Aggarwal, MIDAS Lab IIITD, Btech 2021 6. Devansh Batra, MIDAS Lab IIITD, Btech 2021 7. Aradhya Neeraj Mathur, MIDAS Lab IIITD, PhD Student 8. Maitree Leekha, MIDAS Lab IIITD, Btech 2020 9. Jainendra Shukla, HMI Lab IIITD, Assistant Professor 10. Vidit Jain, MIDAS and HMI Lab, Btech 2021 11. Rajesh Kumar, Haverford College USA, Assistant Professor 12. Shivam, Akash, Mohit, Vishaal, Mansi, Aayush, Meet, Suraj, and many other MIDAS members Acknowledgements