Open source indexing and search engine
Web scale
Lucene builds an inverted index; queries against the index return results.
Lucene runs inside a servlet container or a J2EE application server.
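An inverted index maps each term to the documents that contain it, rather than each document to its terms. A toy in-memory sketch of the idea in Java (illustrative only; Lucene's actual on-disk index is far more elaborate):

import java.util.*;

// Toy inverted index: term -> IDs of documents containing that term.
class ToyInvertedIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<String, Set<Integer>>();

    // Record every whitespace-separated, lowercased term of a document.
    void add(int docId, String text) {
        for (String term : text.toLowerCase().split("\\s+")) {
            Set<Integer> docs = postings.get(term);
            if (docs == null) {
                docs = new TreeSet<Integer>();
                postings.put(term, docs);
            }
            docs.add(docId);
        }
    }

    // Look up a term directly: no scan over the documents is needed.
    Set<Integer> search(String term) {
        Set<Integer> docs = postings.get(term.toLowerCase());
        return docs == null ? Collections.<Integer>emptySet() : docs;
    }
}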
WARNING: Java approaching!
Java is strongly object oriented
Perl:

  my @gene_names = ();
  push(@gene_names, $gene);
  print @gene_names;

Java:

  List<String> geneNames = new ArrayList<String>();
  geneNames.add(gene);
  System.out.println(geneNames.toString());
Perl:

  my $gene = Gene->new('ENS12345');
  $gene->set_name('BRCA2');

Java:

  Gene gene = new Gene("ENS12345");
  gene.setName("BRCA2");
Java is strongly typed

Perl:

  my $number = "100";
  $number = $number + 400;
  print $number;

Java:

  Integer number = new Integer(100);
  number = number + 400;
  System.out.println(number);
Java is good at error handling

Perl:

  eval { $gene->transform };
  warn $@ if $@;

Java:

  try {
      gene.transform();
  } catch (IOException e) {
      e.printStackTrace();
  }
Java is surprisingly easy to learn

Perl:

  Conditionals and loops
  Variables have scope
  Extras from CPAN
  Performance is important

Java:

  Conditionals and loops
  Variables have scope
  Extras available as JAR files
  Performance is important
Recipe 1:
Indexing a collection of documents
org.ensembl.lucene.Writer
public static void main(String[] args) {
    HashMap<String, String> arguments = new HashMap<String, String>();
    String key = null;

    for (String s : args) {
        if (key == null) {
            key = s;
        } else {
            arguments.put(key, s);
            key = null;
        }
    }

    Writer writer = new Writer();

    writer.setIndexLocation(arguments.get("-index"));
    writer.setInputLocation(arguments.get("-input"));

    if (arguments.get("-mergefactor") != null) {
        writer.setMergeFactor(Integer.valueOf(arguments.get("-mergefactor")));
    }

    if (arguments.get("-maxmergedocs") != null) {
        writer.setMaxMergeDocs(Integer.valueOf(arguments.get("-maxmergedocs")));
    }

    try {
        writer.index();
    } catch (IOException e) {
        e.printStackTrace();
    }

    System.out.println("Indexing complete");
}
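A hypothetical invocation (the flag names come from the code above; the paths are invented for illustration):

  java org.ensembl.lucene.Writer -index /path/to/index -input /path/to/documents -mergefactor 100 -maxmergedocs 10000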
Max-merge-docs:
how many documents are added to a segment

Merge-factor:
how often Lucene merges index segments when adding documents
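Both knobs are plain setters on the writer. A hedged example of batch-indexing settings (the values are illustrative, not recommendations from this deck): a larger merge factor speeds up bulk indexing but leaves more segments behind, which slows searching until the index is optimized.

  // Illustrative batch-indexing settings (Lucene 2.x setters, as used above).
  writer.setMergeFactor(100);                     // merge less often while indexing
  writer.setMaxMergeDocs(Integer.MAX_VALUE);      // no cap on documents per segment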
public void index() throws IOException {
    File index = new File(getIndexLocation());
    File location = new File(getInputLocation());
    IndexWriter writer = new IndexWriter(index, new StandardAnalyzer(), true);

    writer.setMergeFactor(getMergeFactor());
    writer.setMaxMergeDocs(getMaxMergeDocs());

    indexDocuments(writer, location);

    writer.optimize();
    writer.close();
}

private static void indexDocuments(IndexWriter writer, File location) throws IOException {

    if (location.canRead()) {
        if (location.isDirectory()) {
            // Recurse into directories, indexing every readable file.
            String[] files = location.list();
            if (files != null) {
                for (int i = 0; i < files.length; i++) {
                    indexDocuments(writer, new File(location, files[i]));
                }
            }
        } else {
            System.out.println("Indexing " + location);
            try {
                GeneFileDocument.index(writer, location);
            } catch (FileNotFoundException e) {
                System.out.println("Caught exception: " + e);
            }
        }
    }
}
org.ensembl.lucene.GeneFileDocument
public static void index(IndexWriter writer, File f) throws IOException {

    String[] fields = {"subtype", "id", "url", "keywords", "description"};

    FileReader input = new FileReader(f);
    BufferedReader bufRead = new BufferedReader(input);
    String line = bufRead.readLine();

    while (line != null) {

        Document doc = new Document();

        int count = 0;
        String[] terms = line.split("\t");

        while (count < terms.length) {
            String field = fields[count];
            String item = terms[count];
            doc.add(new Field(field, item, Field.Store.YES, Field.Index.TOKENIZED));
            count++;
        }

        writer.addDocument(doc);

        line = bufRead.readLine();
    }

    bufRead.close();
}
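Each input line is one document, with the five fields above separated by tabs. A hypothetical example line (the ID and subtype echo the search examples later in this deck; the url, keywords, and description values are invented):

  Vega_havana processed_pseudogene Gene<TAB>OTTHUMG00000000423<TAB>/geneview?gene=OTTHUMG00000000423<TAB>pseudogene havana<TAB>Processed pseudogene (hypothetical description)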
Quite a lot of memory: ~1.5 GB
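Indexing at this scale needs a correspondingly large JVM heap; the default is much smaller. A typical way to grant it (standard JVM flag; the exact value is up to you):

  java -Xmx1500m org.ensembl.lucene.Writer -index /path/to/index -input /path/to/documents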
Creates index
Merge indices to form a master search index
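Lucene can merge separately built indices into one. A minimal sketch using the Lucene 2.x-era API that matches the constructors above (the directory paths are hypothetical):

  // Merge several partial indices into a single master index.
  IndexWriter master = new IndexWriter(new File("master-index"), new StandardAnalyzer(), true);

  Directory[] parts = {
      FSDirectory.getDirectory("index-part-1"),
      FSDirectory.getDirectory("index-part-2")
  };

  master.addIndexes(parts);  // copies and merges the segments
  master.optimize();
  master.close();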
Recipe 2:
Finding documents containing a search term
Easy
org.ensembl.lucene.Search
public static void main(String[] args) {

    // Timer is a simple stopwatch helper from the example project
    // (not java.util.Timer).
    Timer timer = new Timer();

    String index = "index";
    try {

        timer.start();
        Searcher searcher = new IndexSearcher(index);
        timer.stop();

        System.out.println("Loaded " + searcher.maxDoc() + " documents in " + timer.elapsed() + "ms");

        search(searcher, "subtype", "Vega_havana processed_pseudogene Gene");
        search(searcher, "id", "OTTHUMG00000000423");

        searcher.close();

    } catch (Exception e) {
        e.printStackTrace();
    }
}
private static void search(Searcher searcher, String field, String queryString)
        throws ParseException, IOException {

    Timer timer = new Timer();
    timer.start();

    System.out.println("Search (" + field + "): " + queryString);

    QueryParser parser = new QueryParser(field, new StandardAnalyzer());
    Query query = parser.parse(queryString);

    Hits hits = searcher.search(query);

    int count = 1;
    Iterator<Hit> hiterator = hits.iterator();
    while (hiterator.hasNext()) {
        Hit hit = hiterator.next();
        Document document = hit.getDocument();
        System.out.println(count + ": ID: " + document.get("id"));
        System.out.println(count + ": Subtype: " + document.get("subtype"));
        count++;
    }

    int hitCount = hits.length();
    timer.stop();

    System.out.println("Hits: " + hitCount);
    System.out.println("Completed in " + timer.elapsed() + "ms");
}
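One behaviour worth knowing (standard QueryParser semantics, not specific to this deck): whitespace-separated terms are combined with OR by default, and StandardAnalyzer lowercases them, so the subtype query above matches documents containing any of the three terms. To require all terms instead:

  parser.setDefaultOperator(QueryParser.AND_OPERATOR);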
Recipe 3:
Querying a remote document index
Wrap everything into a single file
Copy that file to an application server
Restart the application server
Voilà!
(almost never that easy)
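The "single file" here is a standard J2EE web archive (WAR). A sketch of the expected layout (contents are illustrative; the actual archive depends on the project):

  search.war
    WEB-INF/web.xml     (servlet configuration)
    WEB-INF/classes/    (compiled org.ensembl.lucene classes)
    WEB-INF/lib/        (Lucene core JAR and other dependencies)
  (the index itself can be shipped alongside or mounted separately)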
You will need...
Bonus recipe!
Automate tasks with Ant

XML based configuration
Automated compiles
Automated test runner
Automated deployment
Platform independent
Flexible (but complex)
ant deploy

clean code → clean index → compile → build index → build jar → build war → deploy

(a skeletal build file sketch follows)
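A skeletal build.xml wiring that chain together (target, path, and property names are invented for illustration; the Ant tasks shown are standard core tasks):

  <project name="ensembl-lucene" default="deploy">

    <target name="clean" description="clean code and index">
      <delete dir="build"/>
      <delete dir="index"/>
    </target>

    <target name="compile" depends="clean">
      <mkdir dir="build"/>
      <javac srcdir="src" destdir="build" classpath="lib/lucene-core.jar"/>
    </target>

    <target name="build-index" depends="compile">
      <java classname="org.ensembl.lucene.Writer" fork="true" maxmemory="1500m">
        <classpath path="build:lib/lucene-core.jar"/>
        <arg line="-index index -input data"/>
      </java>
    </target>

    <target name="build-war" depends="build-index">
      <mkdir dir="dist"/>
      <war destfile="dist/search.war" webxml="web.xml">
        <classes dir="build"/>
        <lib dir="lib"/>
      </war>
    </target>

    <target name="deploy" depends="build-war">
      <copy file="dist/search.war" todir="${appserver.deploy.dir}"/>
    </target>

  </project>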
Could this work for Ensembl?
lucene.apache.org
Java IDEs rock: get stuck in
Thank you
