Speech Data Mining for Security Applications (Dolování dat z řeči pro bezpečnostní aplikace) - Jan Černocký


  1. Dolování dat z řeči pro bezpečnostní aplikace (Speech data mining for security applications)
     Honza Černocký, BUT Speech@FIT, FIT VUT v Brně
     Security Session, 11.4.2015
  2. Agenda
     • Introduction
     • Gender ID example
     • Speech recognition
     • Language identification
     • Speaker recognition
     • Conclusions
     Security Session Honza Cernocky 11/4/2015
  3. Needle in a haystack
     • Speech is the most important modality of human-to-human communication (~80% of information), and criminals and terrorists also communicate by speech.
     • Speech is easy to acquire in the scenarios of interest.
     • Finding what we are looking for is more difficult.
     • The search is typically done by human experts, but always count on:
       • Limited personnel
       • Limited budget
       • Not enough languages spoken
       • Insufficient security clearances
     • Speech processing technologies are not almighty, but they can help to narrow the search space.
  4. Data mining from spontaneous unprepared speech
     [Diagram: audio (speech) is fed to several technologies]
     • Speaker/voice recognition: who speaks? (John Doe)
     • Gender recognition: what gender? (male or female)
     • Language recognition: what language? (English/German/??)
     • Speech recognition: what was said? ("Hello John!", "John" spotted)
     • Time/relation analysis: who asked whom? (John asked Paul)
  5. How do we work?
     • According to recipes from pattern recognition textbooks!
     [Flowchart labels: a priori knowledge of the problem; collect data; choose features; choose model; train model; evaluate the classifier; Happy (or deadline passed)?; Unhappy?; deployment]
  6. The result
     [Diagram labels: input; feature extraction; evaluation of probabilities or likelihoods; models; "decoding"; decision]
  7. The simplest example: GID (Gender Identification)
     • Tag speech segments as male or female.
  8. So how is Gender-ID done?
     [Diagram labels: input; MFCC; evaluation of GMM likelihoods; Gaussian mixture models (boys, girls); decision male/female]
  9. Features: Mel Frequency Cepstral Coefficients
     • The signal is not stationary.
     • And hearing is not linear.
  10. Features: a vector every 10 ms
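
The two feature slides above (a non-stationary signal cut into short frames, and a non-linear frequency scale) can be illustrated with a small sketch. Only the 10 ms frame shift comes from the slides; the 25 ms window length, the function names, and the widely used mel formula 2595·log10(1 + f/700) are assumptions chosen for illustration, not necessarily what the BUT systems use.

```python
import math


def frame_signal(samples, sample_rate, win_ms=25, shift_ms=10):
    """Cut a signal into overlapping frames; a new frame starts every shift_ms.

    win_ms=25 is an assumed, commonly used window length.
    """
    win = int(sample_rate * win_ms / 1000)
    shift = int(sample_rate * shift_ms / 1000)
    return [samples[i:i + win] for i in range(0, len(samples) - win + 1, shift)]


def hz_to_mel(freq_hz):
    """Map frequency in Hz to the (assumed) mel scale, which compresses high frequencies."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


if __name__ == "__main__":
    one_second = [0.0] * 8000          # 1 s of silence at 8 kHz (telephone rate)
    frames = frame_signal(one_second, 8000)
    print(len(frames), len(frames[0]))  # number of frames, samples per frame
    print(hz_to_mel(1000.0), hz_to_mel(4000.0))
```

Each frame would then be turned into one MFCC vector, which is what gives "a vector every 10 ms".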
  11. The evaluation of likelihoods: GMM
  12. Decision: "decoding"
  13. Gender ID summary
      • Needed data: several hours of speech (from the target channels) labeled as M or F.
      • Accuracy: the most accurate of our speech data mining tools, >96% accuracy on challenging channels.
      • What do we get: limiting the search space by 50%.
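
The Gender-ID chain described above (feature vectors, GMM likelihoods for two classes, decision) can be sketched in a few lines. This is a minimal illustration under stated assumptions: diagonal-covariance GMMs, hand-written toy 2-dimensional models instead of trained ones, and hypothetical function names. Real systems train the models on hours of labeled speech and use MFCC vectors.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at point x."""
    return -0.5 * sum(math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))


def gmm_loglik(x, gmm):
    """Log-likelihood of one frame under a GMM given as [(weight, mean, var), ...]."""
    terms = [math.log(w) + log_gauss_diag(x, m, v) for w, m, v in gmm]
    top = max(terms)  # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(t - top) for t in terms))


def classify_gender(frames, gmm_male, gmm_female):
    """Sum per-frame log-likelihood differences and decide by the sign."""
    score = sum(gmm_loglik(f, gmm_male) - gmm_loglik(f, gmm_female) for f in frames)
    return ("male" if score > 0 else "female"), score


if __name__ == "__main__":
    # Toy models (assumed values, for illustration only).
    gmm_male = [(0.5, [0.0, 0.0], [1.0, 1.0]), (0.5, [1.0, 1.0], [1.0, 1.0])]
    gmm_female = [(0.5, [4.0, 4.0], [1.0, 1.0]), (0.5, [5.0, 5.0], [1.0, 1.0])]
    print(classify_gender([[0.2, 0.1], [0.9, 1.1]], gmm_male, gmm_female))
```

Summing frame-level log-likelihoods treats the frames as independent, which is the usual simplifying assumption in this kind of classifier.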
  14. Speech recognition
      • Voice2text (V2T), Speech2text (S2T), transcription …
      • Large vocabulary continuous speech recognition (LVCSR)
      [Diagram labels: speech; feature extraction; evaluation of likelihoods (scores of hypotheses); acoustic models; "decoding"; recognition network; language model; pronunciation dictionary; text]
  15. LVCSR technically …
      • Acoustic models
        • … how well do speech segments match basic speech units (phonemes)
        • Trained on large (>100 h) quantities of carefully transcribed speech data
        • Classically Gaussian mixture models
      • Language models
        • … how do the words follow each other: "President George Bush" vs. "President George push"
        • Need to be trained on large quantities (gigabytes) of text from the target domain
      • Pronunciation dictionary
        • Translates words into phonemes: dog → d oh g
        • The basis needs to be created by hand; the rest is generated using a trained grapheme-to-phoneme (g2p) converter
      • A toolkit to do all this … HTK, KALDI, proprietary.
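
The language-model example above ("President George Bush" vs. "President George push") can be made concrete with a toy bigram model. This is a sketch only: the bigram order, the add-one smoothing, the function names and the tiny training corpus are all assumptions for illustration, whereas real language models are trained on gigabytes of target-domain text.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count unigrams and bigrams, with <s> marking the sentence start."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams


def log_prob(sentence, unigrams, bigrams):
    """Log probability of a sentence under an add-one smoothed bigram model."""
    vocab_size = len(unigrams)
    words = ["<s>"] + sentence.lower().split()
    return sum(math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab_size))
               for a, b in zip(words, words[1:]))


if __name__ == "__main__":
    corpus = ["president george bush spoke", "george bush visited prague"]  # toy data
    uni, bi = train_bigram(corpus)
    print(log_prob("President George Bush", uni, bi))
    print(log_prob("President George push", uni, bi))
```

The acoustically similar hypothesis ending in "push" receives a lower score because the word pair was never seen in training, which is how the language model helps the decoder pick the right word.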
  16. Making LVCSR work well
      • Neural networks
        • Eating up other techniques (feature extraction, scoring, LM): DNNs
        • Bottle-neck NNs
      • Speaker adaptation
        • Asking the speaker to read a text in dictation systems …
        • Unsupervised adaptation is needed!
        • MAP, MLLR, CMLLR, RDLT, SAT …
  17. Challenges in LVCSR
      • LVCSR is relatively mature in well-represented languages (US English, Modern Standard Arabic, Czech)
      • Fast development of recognizers for new languages with limited resources: the IARPA Babel project
        • Limited language packs: 10 h + some 70 h of untranscribed data
        • 2013 languages: Cantonese, Turkish, Pashto, Tagalog; surprise language: Vietnamese
        • 2014 languages: Bengali, Assamese, Zulu, Haitian Creole, Lao; surprise language: Tamil
      • How to re-use resources from other languages?
      • How to adapt to the user's language/domain without seeing his/her data?
  18. Some examples …
      • and then they have one week to retrain their keyword results ... and ... give you might ask why one we there a lot of research or evaluation methods ... the people are trying out what keywords or so it is important to leave a ... sufficient amount of time there as well ...
      • uhuh kade sengifowunelwe nguThami manje ithi angazi e- ekhuluma nomunye ubhuti wakwamasipala ukuthi ene usho ukuthi kunabantu ekufanele baphelelwe ngumsebenzi ngoba uNomvula emecabanga uzokhokha (()) ngoba yena uzoy ithela uzoyi uzoyihlulisela ngoba phela kukhona aba- abaphethe u-Adam angithi
  19. LVCSR: what to expect
      • Accuracies (word accuracy)
        • Dictation: >90%
        • Reasonable languages: >70%
        • Babel languages: ~70% WER (example on Tamil)
      • Is this OK??
        • Usually not usable for direct reading, and it is questionable whether a trained secretary would not be faster when 100% accurate output is needed.
        • Yes, usable for search; for rare languages it is often the only alternative.
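
The figures above are quoted as word accuracy and word error rate (WER). As a reminder of what is being measured, here is a minimal sketch of WER as the word-level edit distance divided by the number of reference words. The function name is hypothetical, a non-empty reference is assumed, and word accuracy is taken here as 1 − WER, which is a common but not universal convention.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1] / len(ref)


if __name__ == "__main__":
    wer = word_error_rate("President George Bush", "President George push")
    print(wer, 1.0 - wer)  # one substitution in three words
```

Because insertions count as errors, WER can exceed 100% on very hard data, which is one reason results on difficult languages are usually reported as WER rather than accuracy.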
  20. LVCSR: user data
      • Speech (for acoustic models):
        • Many hours of data as close as possible to the target use (language, dialect, speaking style …)
        • Needs to be transcribed better than TV subtitles.
      • Text (for language models):
        • Newspapers and TV news work for dictation but not here.
        • Target text data is needed (including very dirty language).
        • Can be simulated by looking for dirty Internet data (Twitter, discussion forums).
      • Pronunciations: generally not a big deal, a list of words is needed. Problematic for languages without expertise.
      • Privacy issues:
        • Speech and text are sensitive.
        • Re-training of LVCSR by the users has so far not been successful.
        • Work on modularization: collection of statistics by the user, shipping to development teams …
        • Opportunity to collect this data jointly, especially for languages relevant for security across Europe.
  21. Language identification (LID)
      • Which language is in the recording?
  22. Standard approaches
      • Acoustics
      • Phonotactics
  23. LID: current state-of-the-art system
      • A large GMM ("Universal Background Model", UBM) collects sufficient statistics: a vector of several thousand parameters per utterance (fixed size!)
      • Projection to a "language print": several hundred values.
      • These language prints are scored and the score is calibrated.
  24. LID: what to expect
      • Performance on nice data: NIST LRE 2009, 23 languages
      • And on terrible data: RATS 2014, 5 languages (EER)
      [Charts not recoverable; surviving labels: axis 0% to 10%; durations 30 s, 10 s, 3 s; Best 1 to Best 5; Phase 1 to Phase 3]
  25. LID: user data
      • Tens of hours of data per target language or dialect
      • Only the language label is needed; no transcription is necessary.
      • This allows the user to:
        • Improve the model of an existing language.
        • Add a new language or dialect, or even a target group.
      • LID is a technology where the user can modify the system himself/herself.
      • Language prints do not carry information on the content: potential for cooperation.
      • Backup solution:
        • Automatic acquisition of language-specific telephone data from public sources (EOARD project)
  26. Speaker recognition
      Two hypotheses:
      • H0: the speaker in the test recording IS THE SAME as the one seen in enrollment
      • H1: the speaker in the test recording IS DIFFERENT
      • Log likelihood ratio
  27. SRE classical scheme
      • Feature extraction: Mel Frequency Cepstral Coefficients
      • Background model implemented as a Gaussian mixture model
      • Adapted to the target speaker.
      • At test time, both models produce (log-)likelihoods that are subtracted and thresholded.
      Such a system
      • Can be built by a reasonably skilled student equipped with Matlab in half a day
      • Will function reasonably if enrollment and test take place under similar conditions.
      IKR!
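
The log likelihood ratio named on the speaker recognition slide, and the "subtracted and thresholded" step of the classical scheme, can be written out as follows. The formula on the original slide did not survive extraction, so this is the standard textbook form rather than a reproduction; the symbols X (feature vectors of the test recording) and θ (decision threshold) are notation introduced here.

```latex
\mathrm{LLR}(X) = \log p(X \mid H_0) - \log p(X \mid H_1),
\qquad
\text{accept } H_0 \text{ if } \mathrm{LLR}(X) > \theta .
```

In the classical scheme, p(X | H0) would be evaluated with the GMM adapted to the target speaker and p(X | H1) with the background model.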
  28. Inter-session variability: NOT HAVING THE SAME CONDITIONS!
      • Intrinsic variability
        • Language
        • Emotions, stress, Lombard effect
        • Health condition
        • Content of the message
      • Extrinsic variability
        • Noise
        • Transmission channel
        • Codec (or series of codecs)
        • Recording device …
  29. Years of SRE R&D fighting the variability …
      [Diagram labels: front-end processing; target model; background model; adapt; LR; score normalization]
      • Feature domain
        • Noise removal
        • Tone removal
        • Cepstral mean subtraction
        • RASTA filtering
        • Mean & variance normalization
        • Feature warping
        • Feature mapping
        • Eigenchannel adaptation in feature domain
      • Model domain
        • Speaker Model Synthesis
        • Eigenchannel compensation
        • Joint Factor Analysis
        • Nuisance Attribute Projection
      • Score domain
        • Z-norm
        • T-norm
        • ZT-norm
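
Two of the feature-domain techniques listed above, cepstral mean subtraction and mean & variance normalization, are simple enough to sketch directly. The sketch assumes per-recording normalization over all frames and a hypothetical function name; practical systems often normalize over a sliding window instead.

```python
import math


def cmvn(frames, eps=1e-10):
    """Per-dimension mean and variance normalization of a list of feature vectors.

    Subtracting the mean alone is cepstral mean subtraction; dividing by the
    standard deviation adds variance normalization. eps avoids division by zero.
    """
    n = len(frames)
    dims = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dims)]
    stds = [math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n)
            for d in range(dims)]
    return [[(f[d] - means[d]) / (stds[d] + eps) for d in range(dims)]
            for f in frames]


if __name__ == "__main__":
    print(cmvn([[1.0, 10.0], [3.0, 30.0]]))
```

A constant offset in the cepstral domain corresponds to a fixed linear channel, so removing the mean is a cheap way to reduce the influence of the transmission channel.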
  30. Current state-of-the-art
      • Low-dimensional representation of whole recordings
      • i-Vectors (for R&D), voiceprints (for business)
      • Allows for very fast scoring.
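
Why a fixed-size, low-dimensional representation allows very fast scoring can be seen from a toy example: comparing two recordings reduces to one vector operation. The sketch below uses cosine similarity, which is a simpler stand-in and not the PLDA scoring mentioned later in the talk; the function name and the example vectors are made up for illustration.

```python
import math


def cosine_score(voiceprint_a, voiceprint_b):
    """Cosine similarity between two voiceprints (higher means more similar)."""
    dot = sum(x * y for x, y in zip(voiceprint_a, voiceprint_b))
    norm_a = math.sqrt(sum(x * x for x in voiceprint_a))
    norm_b = math.sqrt(sum(y * y for y in voiceprint_b))
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    enrolled = [0.3, -1.2, 0.8]   # toy 3-dimensional voiceprints; real ones
    test = [0.25, -1.0, 0.9]      # have several hundred dimensions
    print(cosine_score(enrolled, test))
```

The cost per comparison is a few hundred multiplications regardless of how long the original recordings were.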
  31. What to expect I.
      • Works very nicely for long telephone recordings (EER ~2%): multiple successes in NIST evaluations.
      • Examples …
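
The equal error rate (EER) quoted above is the operating point where the false acceptance rate equals the false rejection rate. A minimal sketch of estimating it from two lists of scores follows; the function name and the toy scores are assumptions, and the simple threshold sweep without interpolation is one of several ways to compute the EER.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Estimate the EER by sweeping the threshold over all observed scores.

    A trial is accepted when its score is >= threshold. The function returns
    the mean of the false acceptance and false rejection rates at the
    threshold where the two are closest.
    """
    best_gap, eer = None, None
    for threshold in sorted(set(target_scores) | set(nontarget_scores)):
        frr = sum(s < threshold for s in target_scores) / len(target_scores)
        far = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2.0
    return eer


if __name__ == "__main__":
    same_speaker = [2.0, 3.0, 4.0, 5.0]       # toy scores of target trials
    different_speaker = [0.0, 1.0, 2.5, 1.5]  # toy scores of non-target trials
    print(equal_error_rate(same_speaker, different_speaker))
```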
  32. What to expect II.
      • Noise, varying communication channels, and short recordings (10 s) are still a problem: DARPA RATS program
      • Examples …
  33. SRE: user data
      • The performance of the SRE system depends crucially on how close the training data is to the deployment conditions.
      • UBM: needs lots (100s of hours) of unannotated data, not very sensitive.
      • VoicePrint extractor: ditto.
      • Scoring is done by PLDA
        • Voice-prints with speaker labels (A, B, C, …) are needed
        • Even 50 speakers help to increase the accuracy by 30%.
        • … but some users are not able to collect/label even this amount.
        • Work is running on unsupervised adaptation on unannotated data.
  34. The charm of voice-prints
      • Allowing for transfer of speaker identities
        • without giving out the original WAV
        • without the possibility to reconstruct what was said.
      [Figure label: "No content"]
      • Opening a range of opportunities for
        • Cooperation between customers and law enforcement
        • Cooperation with R&D teams.
  35. Conclusions
      • Speech data mining technologies are already serving in security and defense (and you can test, and eventually buy, those from several vendors).
      • International crime calls for an international reaction: standardization (even in the form of an informal working draft) should take place ASAP to allow police forces to exchange voice-prints regardless of vendor.
      … we're on it.
  36. Thanks for the invitation to Security Session! Questions?
  37. BACKUP SLIDES
  38. Who am I
      • MS in Radioelectronics from BUT, 1993.
      • PhD in Signal processing jointly from Universite d’Orsay (France) and BUT
      • Started speech coding in 1992 and has stayed in speech processing since
      • Was with Oregon Graduate Institute (Portland, OR) in the group of Prof. Hermansky in 2001
      • Since 2002 at the Faculty of Information Technology of BUT, habilitation to Associate Professor (Doc.) in 2003.
      • Executive leader of the BUT Speech@FIT research group
      • Since 2008 Head of the Department of Computer Graphics and Multimedia
  39. BUT Speech@FIT
      • Founded in 1997 (1 person)
      • ~20 people in 2013 (faculty, researchers, grad and pre-grad students, support staff)
      • Active in all technologies this presentation is about
      • Supported by EU, local and US (DARPA and IARPA) grants
  40. International cooperation and standardization
      • NIST evaluation campaigns
        • Allowing for objective comparison of technologies
        • Often on too good data.
      • US-funded projects
        • Realistic testing on noisy channels (DARPA RATS) and new languages (IARPA Babel)
        • Restricted to participants
      • EU project examples
        • Past: MOBIO EU FP7 (mobile biometry) helped fast speaker recognition based on low-dimensional voice-prints.
        • SIIP: addressing topic SEC-2013.5.1-2 "Audio and voice analysis, speaker identification for security applications", Integration Project, starting now.
      Standardization: not much …
      • UK Home Office Forensic Speech and Audio (FSA) Group: bring forensic speech and audio under the regulation of ISO 17025
      • ANSI/NIST-ITL Standard 1-2013, Data Format for Interchange, Record Type-11: Forensic and investigatory voice record

Editor's notes

  • Put the spkID picture here and frame the column with gender!!!
  • Can do this in more detail later …
  • Question for the audience: where to get such data? Maybe a demo in Czech with swear words, etc.
  • Question for the audience: what is the biggest challenge here?
    … detecting where speech is at all: VAD!
  • It might be problematic to collect even these 50 speakers (if possible on different communication channels…)
