
Georgetown Law Lecture, June 2, 2012: "Triggers," Preservation & Search

Anzeige
Anzeige
Anzeige
Anzeige
Anzeige
Anzeige
Anzeige
Anzeige
Anzeige
Anzeige
Anzeige
Anzeige
Wird geladen in …3
×

Hier ansehen

1 von 85 Anzeige

Weitere Verwandte Inhalte

Anzeige

Ähnlich wie Georgetown lecture 2012 6 2 full (20)

Anzeige

Aktuellste (20)

Georgetown lecture 2012 6 2 full

1. “Triggers,” Preservation & Search
   June 2, 2012, Georgetown Law
   Sonya L. Sigler

2. Overview
   Triggers & Preservation
   • What is it?
   • Why does it matter?
   Search
   • Keyword Search
   • Clustering
   • Ontologies
   • Technology Enhanced Review - Sampling
   • Social Network Analysis
   • Relationship Analysis

3. “Triggers” & Preservation
   What is a Trigger?
   – Litigation reasonably anticipated
   – Who decides?
   Litigation Hold Continuum
   – Established in hindsight
   – Threat
   – Letter about litigation
   – Filing suit
   Cases: Pippins, Zubulake, Pension Committee

4. Pippins v. KPMG
   How much data to preserve?
   – All hard drives (Pippins’ position)
   – 100 sample hard drives (KPMG’s position)
   To cooperate or NOT to cooperate?
   How judges react to a lack of cooperation

5. Zubulake
   Litigation Holds
   – Cannot send a request into the ether
   Preservation
   – Have to follow up
   – Take affirmative steps to monitor compliance
   In-house Counsel Duty
   – Cannot leave it to employees’ discretion
   – Document what was done

6. Pension Committee
   No intentional destruction of data, but careless & indifferent conduct
   No latchkey custodians (alone & unsupervised)
   – Identify custodians
   – Monitor their efforts
   – Including former employees and third parties
   A proactive, consistent, reasonable approach

7. Triggers
   When does a duty to preserve arise?

8. What To Do?
   Who to include?
   – Not about data volume
   – Not about contact with the underlying “litigation”
   Key Players (Zubulake opinions)
   – Likely to have relevant information
   – CEO, Board, Committees, employees, etc.
   Produce it from the Key Player (not others)
   – Nursing Home Pension Fund v. Oracle
   – Produce emails from the CEO (15), not from others (1,650)

9. Spoliation
   Failure to Preserve
   – Didn’t ask the right person, in the right place
   – Didn’t follow up
   Destruction of Data
   – Intentional
   – Inadvertent
   What can happen
   – Sanctions
   – Adverse inferences
10. Search
    How to use it to find information
    How to use it to ignore information
    When to use which search methodology

11. Search - Data Assessment
    Where is the data?
    – Data mapping: databases, servers, desktops, laptops, IMs, smart phones, voicemail, other records
    Defining the process from collection to review to production
    Collection strategy, process, approach
    – Scope of collection: custodians, date ranges, topics
    Reports on the data processing
    – File types, de-duplication rates, password-protected files, encrypted files, etc.
    Not Reasonably Accessible data
    Assessing the risk of data loss

12. Search - Case Assessment
    Who – cast of characters
    What – what the heck happened?
    Where – where did it take place?
    When – what time period are we concerned with?
    How – fraud, antitrust violation, etc.
    WHY – what were the motives involved?
    Data Assessment ≠ Effective Case Assessment

13. Keyword Search Under Scrutiny
    United States v. O’Keefe (Facciola)
    – Questioned lawyers’ ability to decide which search terms are more likely to produce relevant information
    – Facciola has also suggested that litigants look at advanced search methodologies
    Victor Stanley, Inc. v. Creative Pipe, Inc. (Grimm)
    – Defensibility of both process AND execution lies with the party relying on the search protocol, which must be able to explain the search rationale, show that it is appropriate, and show that it was properly implemented
    – Advocates quality assurance, e.g., by sampling
    – Searches should be designed by a competent practitioner

14. Keyword-Specific Case
    William A. Gross Construction Associates, Inc. v. American Manufacturers Mutual Insurance Company (S.D.N.Y., Judge Andrew Peck)
    The keyword list was in the thousands
    Use the actual data set and custodians to figure out keywords
    “This case is just the latest example of lawyers designing keyword searches in the dark, by the seat of the pants, without adequate (indeed, here, apparently without any) discussion with those who wrote the emails. Prior decisions from Magistrate Judges in the Baltimore-Washington Beltway have warned counsel of this problem, but the message has not gotten through to the Bar in this District.”

15. $6M Keyword Mistake
    In re Fannie Mae Securities Litigation
    Third party - OFHEO
    D.C. Circuit - Judge David Tatel
    The attorney agreed to something he did NOT understand
    Long list of key terms
    Taxpayers suffered the consequences

16. What This Means
    • The courts are finally catching up
    • Courts are actively ruling on standards of care and process
    • Lawyers are getting wise

17. Case Law Effects on Discovery
    Defensibility of the review process is now a focus
    – Culling now can kill you later
    – Cooperation is a hot topic
    – Tussle between inside & outside counsel
    – Beginning to see planning as a necessity
    Increased focus on quality
    – Heightened involvement expected from corporate clients in the overall process
    – Cases pushing this: Qualcomm, Creative Pipe

18. What Else Is There?
    Effort to establish & codify uniform “Best Practices”
    – Quickly becoming a roadmap for an uneducated industry
    – Increasingly relied upon by judges as a measure of reasonable or standard behavior
    Publications have addressed:
    – Document retention & production
    – Email management
    – Search & retrieval
    – Protective orders & confidentiality
    – ESI admissibility
19. Getting to a Manageable Review Set
    Focus on finding, reviewing & using the “right” data, not just filtering data.
    (Chart: Intake Data 100%, broken down as Duplicates 25%, Junk/Spam/Porn 20%, Non-Responsive 20%, NR/Priv 20%, Responsive & Priv 15%; Produced: 12.25%.)
    These figures vary based upon the data set received.
20. Search Methodologies
    (Diagram: a continuum from content to concept to context, with visualization and measurement increasing up the stack.)
    – Keyword: specific exact words, proximity searches, stemming
    – Ontology: generalized words or phrases
    – Clustering: similarity of salient features
    – Social Network Analysis: relationships among relevant people
    – Relationship Analysis: documents with a causal or sequential relationship
21. Keyword Accuracy Example
    Keyword search reduced the document set by only 47%
    88% of the documents returned by keyword search were not responsive (over-inclusive)
    8,553 responsive documents were missed by keyword search, almost 8% of all responsive documents (under-inclusive)
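Those three percentages pin down the standard retrieval metrics. A minimal worked example, using hypothetical counts chosen only to be consistent with the slide's figures (none of these totals come from the underlying case):

```python
# Hypothetical counts, consistent with the slide's percentages (assumed, not from the case).
total_docs = 1_550_000   # documents in the collection
returned   =   820_000   # documents hit by the keyword search
responsive =   107_000   # truly responsive documents in the collection
missed     =     8_553   # responsive documents the search did not hit

found = responsive - missed            # responsive documents the search did hit
precision = found / returned           # share of hits that are worth reviewing
recall    = found / responsive         # share of responsive documents captured
culled    = 1 - returned / total_docs  # how much of the set the search removed

print(f"precision = {precision:.0%}")  # ~12%, i.e. 88% of hits are non-responsive
print(f"recall    = {recall:.0%}")     # ~92%, i.e. ~8% of responsive docs missed
print(f"culled    = {culled:.0%}")     # ~47% reduction of the review set
```

At 12% precision, reviewers discard roughly seven of every eight documents the search returns, which is where the review cost goes.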
22. Myth: Keyword Searching is the Way to Go
    “If I agree to keyword terms, I am OK”
    Keyword search cases
    Keyword replacement and substitution examples
    Missing in action (under-inclusive)
    Unwanted extras (over-inclusive)
    Multiple subjects/persons (disambiguate)

23. Fact or Myth?
    “Manual review by humans of large amounts of information is as accurate and complete as possible - perhaps even perfect - and constitutes the gold standard by which all searches should be measured.”
    This is “the reigning myth of ‘perfect’ retrieval using traditional means.”
    – Best Practices Commentary on the Use of Search and Information Retrieval Methods in E-Discovery, The Sedona Conference Journal (2007), p. 199
    Human beings retrieved less than 20% of the relevant documents when they believed they were retrieving over 75%.
    – An Evaluation of Retrieval Effectiveness for a Full-Text Document Retrieval System, Blair & Maron (1985)
24. Blair and Maron (1985)
    A classic study of retrieval effectiveness; earlier studies were on unrealistically small collections
    Studied an archive of documents for a legal suit
    – ~350,000 pages of text
    – 40 queries
    – Focus on high recall
    – Used IBM’s STAIRS full-text system
    Main result: the system retrieved less than 20% of the relevant documents for a particular information need; the lawyers thought they had 75%
    But many queries had very high precision

25. Blair and Maron, cont.
    How they estimated recall
    – Generated partially random samples of unseen documents
    – Had users (unaware these were random) judge them for relevance
    Other results:
    – Two lawyers’ searches had similar performance
    – The lawyers’ recall was not much different from the paralegals’

26. Blair and Maron, cont.
    Why recall was low
    – Users can’t foresee the exact words and phrases that will indicate relevant documents
      • “accident” referred to by those responsible as “event,” “incident,” “situation,” “problem,” …
      • differing technical terminology
      • slang, misspellings
    – Perhaps the value of higher recall decreases as the number of relevant documents grows, so more detailed queries were not attempted once the users were satisfied
27. Keyword Search Summary
    Pro
    – Word stemming: hous* matches house, housemate, household
    – Easy to use, explain, and agree on
    – Familiar
    – Fast results
    Con
    – Over-inclusive (must disambiguate)
    – Under-inclusive
    – The word must be present
    – Hard to craft
    – Ineffective with short messages, IMs
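Since the Pro column leans on wildcard stemming (hous*), here is a small illustrative sketch of how a trailing-wildcard term behaves, including the over-inclusion problem that slide 73 returns to; the vocabulary list is invented for the example:

```python
import re

def wildcard_to_regex(term: str) -> re.Pattern:
    """Translate a trailing-* search term into a whole-word regex."""
    stem = re.escape(term.rstrip("*"))
    return re.compile(rf"\b{stem}\w*\b", re.IGNORECASE)

vocab = ["house", "housemate", "household", "alter", "altered",
         "alternate", "alternative", "altercation"]

hous = wildcard_to_regex("hous*")
print([w for w in vocab if hous.fullmatch(w)])
# ['house', 'housemate', 'household']  -- the intended word family

alter = wildcard_to_regex("alter*")
print([w for w in vocab if alter.fullmatch(w)])
# ['alter', 'altered', 'alternate', 'alternative', 'altercation']
# -- 'alternate', 'alternative', 'altercation' are the unwanted extras
```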
28. Keyword Truths
    Under-inclusive: missing relevant or important info
    Over-inclusive: costly to review
    A “reasonable keyword search” doesn’t exist
    Effective keyword search is difficult, if not impossible, without analysis
    – Index the data, analyze the index
    – Then suggest keywords or an approach
    Keywords may not be appropriate for the data
    Keyword search is ONE tool in your arsenal

29. Keyword Accuracy Example (repeats slide 21: 47% reduction, 88% of hits non-responsive, 8,553 responsive documents missed)

30. Search Methodology Continuum
    Review methodology - decided upfront
    Identify the issues in the case
    – Formulate queries and approaches for finding responsive documents
    – Formulate relevancy and responsiveness guidelines
    Identify the primary participants
    Select or triage documents for review
31. Review Tools for Relevancy Assessment
    Keyword searches, culling
    – Slices of data are reviewed
    Categorization of data
    – The entire dataset is categorized
    – Review targeted data
    Automated review
    – Categorization of the dataset
    – Random sampling (statistically significant; see the sketch below)
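For the “statistically significant” sampling step, a common rule of thumb is the normal-approximation sample size for estimating a proportion. A minimal sketch, assuming a large collection and worst-case 50% prevalence:

```python
import math

def sample_size(confidence_z: float = 1.96,  # z-score for 95% confidence
                margin: float = 0.02,        # +/- 2% margin of error
                p: float = 0.5) -> int:      # worst-case prevalence
    """Sample size needed to estimate a proportion (large-population case)."""
    return math.ceil(confidence_z**2 * p * (1 - p) / margin**2)

print(sample_size())             # 2401 documents for 95% confidence, +/-2%
print(sample_size(margin=0.05))  # 385 documents for 95% confidence, +/-5%
```

Once the collection is large, the required sample barely depends on collection size, which is why a couple of thousand judged documents can validate a multi-million-document set.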
32. Categorization of Data for Review
    Categorize the entire data set
    – Spam/porn/system files
    – Personal/private data
    – Non-relevant business data
    Business data
    – Relevancy assessment by topic
    – Privilege review
    Keyword and topic analysis: overlap, holes

33. Search Methodologies (the continuum roadmap from slide 20, repeated)

34. Categorization Methods
    Statistical methods (number-based)
    – Topic clustering
      • Statistical similarity
      • Counting words and how often they appear together
    – Latent Semantic Indexing
    – Supervised v. unsupervised clustering
    Linguistic methods (word-based)
    – Keyword (a culling method)
    – Ontologies

35. Clustering
    Clustering just means putting documents into groups that have something in common:
    – Manually (that’s what manual review is)
    – Keyword searches
    – Ontologies (linguistic filters)
    – Automated clustering (using technology)
      • by document type (all the Word documents go into one basket)
      • by creation date
      • by actor
      • by statistical similarity (statistical clustering)
      • … and many other approaches
36. Clustering -- “Options”: 1 Cluster or 4 Clusters?
    – Financial/energy trading options
    – Email/computer menu-driven options
    – Stock options (ISOs)
    – The generic idea of an available choice of action
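As a rough illustration of how statistical clustering can pull those senses of “options” apart, here is a minimal sketch using TF-IDF, Latent Semantic Indexing, and k-means from scikit-learn; the four toy documents and the cluster count are assumptions made for the example:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

# Toy corpus standing in for a processed document set (invented).
docs = [
    "stock option grant approved by the board",
    "board meeting minutes: option grants for executives",
    "lunch options near the office",
    "what are our options for lunch on Friday",
]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Latent Semantic Indexing: project term weights into a low-rank topic space.
lsi = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)

# Unsupervised clustering in that space tends to separate the two senses.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(lsi)
print(labels)  # e.g. [0, 0, 1, 1]: stock-option documents vs. lunch documents
```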
37. Clustering
    Software implements statistical methods of finding groups of “similar” documents
    – “Similar” must be defined appropriately for the application
    Documents are categorized with very little effort by the user
    May help with document review
    – A single reviewer can look at similar documents together and produce consistent review decisions
    – Tight clustering can be used to detect “near duplicates” caused by OCR errors (see the sketch below)
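A minimal sketch of the near-duplicate point: character n-grams tolerate OCR-style character substitutions, so cosine similarity stays high between a clean copy and a noisy one; the three strings are invented:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

a = "The parties shall preserve all electronically stored information."
b = "The parti3s sha1l preserve a11 electronically stored information."  # OCR noise
c = "Lunch is at noon on Friday."

# Character n-grams are robust to single-character OCR substitutions.
vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))
m = vec.fit_transform([a, b, c])
sim = cosine_similarity(m)

print(round(sim[0, 1], 2))  # high: a and b cluster as near-duplicates
print(round(sim[0, 2], 2))  # low: c is unrelated
```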
38. Clustering vs. Queries
    Clustering is unpredictable compared to keywords or taxonomies
    Items that look very similar (to the clustering algorithm) may not actually be similar in ways that matter
    – Relevancy may depend upon fine legal distinctions
    – May vary in the same matter by subpoena and/or jurisdiction

39. Ontologies
    Implement ontologies for directed searches
    – Approach searching from a knowledge-representation viewpoint
    – The field is 25 years old; lots of work has been done
    – Advantages:
      • Disambiguate different meanings of the same word from their context → more accurate
      • Encapsulate many ways of saying the same thing → more thorough
      • Search for concepts, not individual words → more intuitive, more reusable, and faster
    Can be combined with other methods (unsupervised clustering, discussions)

40. Subjectivity
    GOOD WEATHER
    – Sun
    – Calm
    BAD WEATHER
    – Rain
    – Snow
    – Wind

41. A More Realistic Ontology
    ROYALTY CONCEPT: royalty, royalties, rty, commission, commissions, comm., honorarium, honorariums, honoraria, usage fee, usage charge, usg fee, use fee, fee for use, fee for usage, incent*, insent*, earn a fee, eam a fee, charge for use, charged for use, charging for use, charges for use, licence fee, license fee, lisense fee, “take cut”~2, “takes cut”~2, “took cut”~2, “slice pie”~5, “piece pie”~5, “piece action”~5, “slice action”~5, plus exclusions: -king, -queen, -prince, -princess
42. Ontology as a Query
    But an ontology can be slightly cumbersome to deal with directly in query form (shown here at about a quarter of its regular size):
    q ((+(std:%CapacityReports_% std:%DINCapacity_%) +(std:%ACMEEPPlant_% std:%ProductName_%)) (+(std:%ACMEPNPlant_% std:%ProductName_%) +(std:%ProductiveCapability_% std:%CapacityReports_%)) (+(std:%CapacityCreep_% std:%OperationsImprovement_% std:%CapacityExpansion_% std:%CapacityRestoration_%) +(std:%ACMEPNPlant_% std:%ProductName_%)) (+(std:%EquipmentReplacement_% std:%FinishingColumn_%) +(std:%ACMEPNPlant_% std:%ProductName_%)) (std:%Audit_% actor:%Audit_%) (+(std:%SettlementNegotiations_% std:%ContractNegotiations_%) +(actor:%ACMEOutsideCounsel_% std:%ACMEOutsideCounsel_% actor:%AcmeSubOutsideCounsel_% std:%AcmeSubOutsideCounsel_% actor:%AcmeSub_% std:%AcmeSub_%)) (std:%FTC_% actor:%FTC_%) ((+subject:%ProductName_% +(std:swap std:"supply agreement" std:"exchange agreement" std:"agree to exchange")) std:"name …
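One way to picture how a concept's variant list (slide 41) turns into a query like this is to compile the terms into a single OR clause. The sketch below emits Lucene-style syntax matching the slide's operators, but the compilation logic itself is an assumption, not the actual tooling behind the slide:

```python
def compile_concept(terms: list[str]) -> str:
    """Join a concept's variants into one Lucene-style OR query (sketch)."""
    clauses = []
    for t in terms:
        if " " in t and "~" not in t:
            clauses.append(f'"{t}"')  # plain multi-word variant -> phrase query
        else:
            clauses.append(t)         # single word, wildcard, or proximity term
    return "(" + " OR ".join(clauses) + ")"

royalty = ["royalty", "royalties", "commission", "honorarium",
           "usage fee", "license fee", '"take cut"~2', '"slice pie"~5']
print(compile_concept(royalty))
# (royalty OR royalties OR commission OR honorarium OR "usage fee"
#  OR "license fee" OR "take cut"~2 OR "slice pie"~5)
```

Each concept compiles once and is reused across matters, which is where the claimed reusability comes from.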
43. Ontology Pros & Cons
    Pros
    – Identify acronyms
    – Normalize variants
    – Disambiguate terms
    – Identify overly broad keywords
    – Identify and correct keywords with errors
    – Create extensive libraries of ontologies
    – Can be used as a clustering method
    – Topics can appear in more than one language
    – Reusable for different types of litigation, e.g., antitrust, product liability, etc. (and for both offense and defense)
    Cons
    – As with keywords, word-based
    – Labor intensive upfront

44. “Search” Terminology
    Technology-Enhanced Review / Technology Assisted Review / Automated Review / Predictive Coding
    • Process & workflow
    • Technology
    • People: subject matter, review, feedback, privilege, production
    • Quality control
45. The Predictive Coding Loop (diagram)
    – Set up a sample; the expert judges it (responsive / non-responsive)
    – The model learns
    – The model predicts (responsive / non-responsive)
    – The model categorizes all remaining documents
    – Repeat as needed
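The loop on this slide is, in machine-learning terms, an active-learning classifier. A minimal sketch with scikit-learn, where `docs` and the `expert_judge` callable are hypothetical stand-ins for the collection and the human reviewer; uncertainty sampling is one common way to pick the next batch, though the slide does not specify one:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def tar_loop(docs, expert_judge, rounds=5, batch=100, seed=0):
    """Iterative 'sample -> judge -> learn -> predict' loop from the slide."""
    X = TfidfVectorizer(stop_words="english").fit_transform(docs)
    rng = np.random.default_rng(seed)

    # Setup sample: the expert judges a random seed set. (Assumes the seed
    # contains both responsive and non-responsive documents; real tools
    # guarantee this before training.)
    idx = list(rng.choice(X.shape[0], size=min(batch, X.shape[0]), replace=False))
    y = [expert_judge(i) for i in idx]

    model = None
    for _ in range(rounds):
        model = LogisticRegression(max_iter=1000).fit(X[idx], y)
        proba = model.predict_proba(X)[:, 1]      # model predicts responsiveness
        # Route the documents the model is least sure about back to the expert.
        seen = set(idx)
        uncertain = [i for i in np.argsort(np.abs(proba - 0.5)) if i not in seen]
        new = uncertain[:batch]
        if not new:
            break
        idx += new
        y += [expert_judge(i) for i in new]

    return model.predict(X)  # categorize all remaining documents
```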
46. Automated Review Methodology (diagram)

47. Technology Enhanced Review: Speed, Predictable Costs, and Accuracy
    Example from a real case; any portion of the review can be automated.
    (Chart, starting from 100% source data: eliminate duplicates & system files; non-responsive isolation via ontologies; NR by Technology Enhanced Review (removed another 18%); responsive by Technology Enhanced Review (removed another 7%); privileged by high-speed manual review; segment sizes shown: 30%, 30%, 22%, 15%, 3%.)
48. Search Methodologies (the continuum roadmap from slide 20, repeated)

49. From Document Analysis to Social Network Analysis (diagram)

50. From Social Network Analysis to Discussions (diagram)

51. Search Methodologies (the continuum roadmap from slide 20, repeated)
52. Analytics
    Analytics are based on the model and on discussions.
53. Better Answers and Better Questions
    – When were customary work practices circumvented?
    – When did established norms of behavior change?
    – Who knew, or likely knew, what facts?
    – Who interacted with whom, and how intimately?
    – Who was involved in what types of decisions or meetings?
    – Who are the real ‘insiders’?
    – What data is hidden or missing?
    – When were electronically documented conversations “taken off line,” possibly in an attempt to avoid detection?
    – How did the importance of different actors change over time?
    (A small sketch of the graph view behind several of these questions follows.)
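Several of these questions (“who interacted with whom,” “who are the real insiders”) reduce to centrality measures on a communication graph. A minimal sketch with networkx over invented email-traffic tuples:

```python
import networkx as nx

# Hypothetical email metadata: (sender, recipient, message count) tuples.
traffic = [
    ("ceo", "cfo", 120), ("cfo", "ceo", 95),
    ("ceo", "counsel", 40), ("analyst", "cfo", 230),
    ("cfo", "analyst", 10), ("assistant", "ceo", 15),
]

G = nx.DiGraph()
for sender, recipient, n in traffic:
    G.add_edge(sender, recipient, weight=n)

# Degree centrality is one simple proxy for who the real 'insiders' are;
# tracking it per time window shows how actor importance changes over time.
for actor, score in sorted(nx.degree_centrality(G).items(),
                           key=lambda kv: -kv[1]):
    print(f"{actor:10s} {score:.2f}")
```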
54. Bear Stearns: a Lower Bar For Fraud?
    Two hedge fund managers arrested
    Charged with securities and wire fraud, and one with insider trading
    Internal emails:
    – “I’m fearful of these markets. ... As we discussed it may not be a meltdown for the general economy but in our world it will be.”
    – “I think we should close the funds now.”
    External communications:
    – “We are very comfortable with exactly where we are.”
    – “The funds are performing exactly as they were designed to.”

55. Sentiment Analysis Visualization (diagram)

56. Analysis of Anomalous Communication Patterns
    Unusual levels relative to a particular type of activity pop out
    Color-coded graphs show relative communication densities for apples-to-apples comparisons

57. Spread of Information (diagram)

58. Emotive Tone: a Whistle-blower Scenario (diagram)

59. “Call Me” Events
    Sequence Viewer used for analytics-driven review (diagram)

60. Search Risks
    – Failure to find responsive documents
    – Failure to recognize responsive documents
    – Failure to recognize privileged documents
    – Inconsistent treatment of documents (e.g., duplicates)
    – Failure to complete the project in a timely manner
    Sophisticated tools
    – Understand what they do and don’t do well
    – Inform yourself; speak to references and consultants

61. Transparency of Process
    Discussing review protocols
    – Provide transparent, defensible, sophisticated search based on document content
    – Clustering, ontologies, analytics, and yes, sometimes keywords too
    Develop search methodologies for each case
    – Use technology experts in consultation with case/legal experts
    Results verifiable by quality control
    – Defensible sampling

62. Thank you!
    Sonya L. Sigler
    Vice President, Product Strategy, SFL Data
    415-321-8385 | sonya@sfldata.com | www.sfldata.com
63. Review Protocol ≠ Agreeing to Search Terms
    Data culling (upfront or backend)
    Search methodologies - a continuum
    – Keyword positive list
    – Ontologies
    – Clustering
    – Technology Enhanced Review
    – Relationship Analysis
    Quality control process & procedures
    Privilege review, sensitivities
    Production format & timing

64. Search
    The courts are finally starting to catch up to technology, making more aggressive rulings:
    – Forcing attorneys to live with the results of bad searches
    – Sanctioning those who screw up, even with no allegation of fraud
    – Demanding repeatable, demonstrable process, using terms like “quality assurance”

65. Search Under Scrutiny
    Facciola’s opinions - United States v. O’Keefe
    “for lawyers and judges to dare opine that a certain search term or terms would be more likely to produce information than [other] search terms … is truly to go where angels fear to tread.”
    He has also suggested that litigants take a good look at more advanced search methodologies, including the use of computational linguistics and technology assisted review

66. Reasonableness of Search Methods
    Victor Stanley, Inc. v. Creative Pipe, Inc., 2008 WL 2221841 (D. Md. May 29, 2008)
    “Common sense suggests that even a properly designed and executed keyword search may prove to be over-inclusive or under-inclusive... the only prudent way to test the reliability of the keyword search is to perform some appropriate sampling.”
    “Selection of the appropriate search and information retrieval technique requires careful advance planning by persons qualified to design effective search methodology. The implementation of the methodology selected should be tested for quality assurance; and the party selecting the methodology must be prepared to explain the rationale for the method chosen to the court, demonstrate that it is appropriate for the task, and show that it was properly implemented.”

67. From Pre-Discovery to Production Completeness
    Henry v. Quicken Loans → 26(f) consulting
    – Lawyers agreed to keyword lists and a process
    – Ran their own (unsanctioned) searches with an expert
    – Told to live with the bad results, and to pay for it
    Qualcomm → smell test; dig deeper
    – In-house counsel (Qualcomm) v. outside counsel (Day Casebeer)
    – Sanctions, attorney-client privilege problems
    – An associate found documents and was told they weren’t relevant; found out the hard way that those and 230,000 other pages were relevant
    Judge Rader’s protocol in Texas for patent cases
    – 5 custodians
    – 5 search terms (can you say overbroad…)
68. Under-inclusive - Missing in Action
    Missing abbreviations / acronyms / clippings:
    – incentive stock option but not ISO
    – Board of Directors but not BOD
    – 1998 plan but not 98 plan
    Missing inflectional variants:
    – grant but not grants, granted, granting
    Missing spellings or common misspellings:
    – gray but not grey
    – privileged but not priviliged, priviledged, privilidged, priveliged, privelidged, priveledged, …

69. Missing in Action II
    Missing syntactic variants: board of directors meeting but not:
    meeting of the board of directors, BOD meeting, board meeting, BOD mtg, board mtg, directors’ meeting, directors’ mtg, mtg of the BOD, mtg of the directors, BOD meetings, board meetings, BOD mtgs, board mtgs, directors’ meetings, directors’ mtgs, mtgs of the BOD, mtgs of the directors

70. Missing in Action III
    Missing synonyms / paraphrases:
    – hire date but not start date
    – approved by Smith but not: Smith’s approval, the approval of Smith, Smith’s ok, Smith’s go-ahead, Smith’s goahead, the go-ahead from Smith, the goahead from Smith, the nod from Smith, Smith’s signature, Smith’s sign-off, the sign-off of Smith, the signoff of Smith

71. Missing in Action IV
    As a keyword item, the address 101 E. Bergen Ave., Temple, CA 90200 does not match any of:
    – 101 East Bergen Avenue
    – the Bergen site
    – the Temple location
    – our 90200 outlet

72. Over-inclusive - Unwanted Extras
    Options
    – Target: “Sheila was granted 100,000 options at $10”
    – Match: “What are our options for lunch?”
    – Match in a signature line: “Amanda Wacz, Acme Stock Options Administrator”
    Destroy
    – Target: “destroy evidence”
    – Match in a disclaimer: “The information in this email, and any attachments, may contain confidential and/or privileged information and is intended solely for the use of the named recipient(s). Any disclosure or dissemination in whatever form, by anyone other than the recipient is strictly prohibited. If you have received this transmission in error, please contact the sender and destroy this message and any attachments. Thank you.”

73. Unwanted Extras II
    alter*
    – Target: alter, alters, altered, altering
    – Matches: alternate, alternative, alternation, altercate, altercation, alterably, …
    grant
    – Target: stock option grant
    – Matches names: Grant Woods, Howard Grant
74. Tuning an Ontology
    Linguists are briefed as reviewers
    Linguists read the data
    Linguists study the complaint and other relevant documents
    Linguists analyze the search index
    The legal team provides input and feedback

75. A Simple Linguistic Ontology
    ROYALTY CONCEPT
    – Royalty
    – Commission
    – Honorarium
    – Usage Fee
    – Slice of the Pie

76. A Simple Pricing Concept
    PRICING CONCEPT
    – Purchase Order
    – PO
    – Dollar amount
    – Invoice

77. Adding Subjective Content
    PRICING CONCEPT
    – Purchase Order
    – PO
    – Dollar amount
    – Invoice
    – Cylinder
    – Canister
    – Bottle

78. Ontology Usage
    Identifying misspellings, slang, nicknames, etc.
    Variant generation
    – Help the user find what he meant (names, words, suggestions)
    – Buy* → Buying, Buys, Bought, etc.
    – Kenneth Lay, Ken Lay, klay, kenneth.lay
    View variations in context to choose topics
    Document segmentation - text blocks, signatures
    Finding words in context, with frequency (e.g., “at serious risk of losing”: 25; “are certain risks inherent in”: 16)
79. Identifying Misspellings, Slang, etc.
    1. Match the index against an electronic dictionary.
    2. From the remaining material (not in the dictionary), remove any items that are merely numbers.
    3. Find (in the ontologies) any words that are similar to what remains.
    4. Add the similar words to the ontology.
    This increases coverage, i.e., it ensures that we retrieve documents that otherwise would have been missed. (A sketch of these four steps follows.)
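A minimal sketch of those four steps using difflib's similarity matching; the dictionary, index terms, and similarity cutoff are assumptions, and a production system would use a real lexicon and a tuned distance measure:

```python
from difflib import get_close_matches

def mine_misspellings(index_terms, dictionary, ontology_terms, cutoff=0.8):
    """Sketch of the four-step expansion described on the slide."""
    unknown = [t for t in index_terms if t not in dictionary]  # step 1
    unknown = [t for t in unknown if not t.isdigit()]          # step 2
    found = {}
    for concept_word in ontology_terms:                        # step 3
        hits = get_close_matches(concept_word, unknown, n=5, cutoff=cutoff)
        if hits:
            found[concept_word] = hits
    return found  # step 4: merge these back into the ontology

index = ["privleged", "12345", "priviledged", "lunch", "royalty", "royaltys"]
print(mine_misspellings(index, {"lunch", "royalty"}, ["privileged", "royalties"]))
# {'privileged': ['priviledged', 'privleged'], 'royalties': ['royaltys']}
```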
80. Variant Generation
    Help the user search for what he meant
    Take names, numbers, and other entities for which the user wants to search
    Automatically generate likely synonyms

81. Variant Generation
    Show the context of these variations, so the user can evaluate them.

82. Document Segmentation
    Examples of signatures:
    – Jean-Louis Koenig, President GGDA Region, MegaCorp International SA, Rue de Concours 2280, Bern, Switzerland
    – Robert Guilliam, Product Regulatory Affairs & Compliance, MegaCorp International, Neuchatel, Switzerland, Tél. +41 (31) 125 2366
    – Alberto Goreman, Manager Printing & Packaging, Eastern Region, +57 3 451 7195, alberto_goreman@megacorp.com
83. Finding Words in Context
    Phrase                                 Total Instances
    risks alienating some                  37
    at serious risk of losing              25
    are certain risks inherent in          16
    are at risk of running                 15
    it be risking anything by              15
    difference a risk o why                14
    and the risks inherent in              12
    without assuming any risk               8
    we could risk losing next               7
    avoid transferring risk to the          5
    requires taking risks and the           4
    can t risk not living                   3
    and unknown risks and uncertainties     2
    a potential risk that was               2
    avoid transfering risk to the           2
    This increases coverage AND precision
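A table like this is a keyword-in-context (KWIC) tally. A minimal sketch that counts a few words of context around every inflection of a keyword; the two sample emails are invented:

```python
from collections import Counter
import re

def keyword_contexts(texts, keyword="risk", window=2):
    """Count the phrases in which a keyword (or its inflections) appears."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z']+", text.lower())
        for i, tok in enumerate(tokens):
            if tok.startswith(keyword):  # risk, risks, risking, ...
                lo, hi = max(0, i - window), i + window + 1
                counts[" ".join(tokens[lo:hi])] += 1
    return counts

emails = ["We are at serious risk of losing the account.",
          "There are certain risks inherent in the plan."]
print(keyword_contexts(emails).most_common())
# [('at serious risk of losing', 1), ('are certain risks inherent in', 1)]
```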
84. Multi-Lingual Issues
    Does language matter?
    – Lucerne, Luzern, Lucerna: these are all names for the same city
    – The name of a city is not necessarily expressed in the same language as the rest of the document
    In Europe, many email threads and documents are mixed-language, and must be properly categorized as such
85. Automated Ontology Expansion Tools
    Currently implemented expansion modules:
    – Spelling variants: color >> colour, defense >> defence, labeled >> labelled
    – Lemmatization (recovering the uninflected form): walking >> walk, ate >> eat
    – Morphological variants: eat >> eats, eating, eaten, ate; hablar >> hablo, hablas, habla, hablan, habláis, hablamos
    – Number expansion: $2.5B >> two point five billion dollars; 2,567 >> two thousand five hundred sixty-seven; 13 >> 13th, thirteenth
    – Name variants: Elizabeth Van der Beek >> “Liz Van der Beek”, “Liz Vander Beek”, “Van der Beek, Elizabeth”, “Beth Vanderbeek”, etc.
    – Email variants (mined from an alias clusters file): Elizabeth Van der Beek >> evanderbeek, liz.vanderbeek, vanderbeekl, emvanderbeek, etc.
    – Abbreviations: administrative project meeting >> admin project meeting, admin project mtg, admin proj mtg, etc.
    (A toy sketch of a few such modules follows.)
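A toy sketch of what a few such expansion modules might look like. The rules are deliberately tiny and illustrative; real modules rely on lexicons and morphological analyzers rather than suffix tricks:

```python
def spelling_variants(word: str) -> set[str]:
    """US/UK spelling variants (tiny rule set, illustrative only)."""
    out = {word}
    if word.endswith("or"):
        out.add(word[:-2] + "our")  # color -> colour
    if word.endswith("se"):
        out.add(word[:-2] + "ce")   # defense -> defence
    return out

def morphological_variants(verb: str) -> set[str]:
    """Regular inflections only; irregulars (ate/eaten) need a lexicon."""
    stem = verb[:-1] if verb.endswith("e") else verb
    return {verb, verb + "s", stem + "ing", stem + "ed"}

def ordinal_variants(n: int) -> set[str]:
    """Number expansion for ordinals: 13 -> 13, 13th."""
    suffix = {1: "st", 2: "nd", 3: "rd"}.get(
        n % 10 if n % 100 not in (11, 12, 13) else 0, "th")
    return {str(n), f"{n}{suffix}"}

print(spelling_variants("color"))      # {'color', 'colour'}
print(morphological_variants("walk"))  # {'walk', 'walks', 'walking', 'walked'}
print(ordinal_variants(13))            # {'13', '13th'}
```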
