SlideShare ist ein Scribd-Unternehmen logo
1 von 53
Downloaden Sie, um offline zu lesen
Confidential, Copyright © Quanticate
What's with all the 1s and What's with all the 1s and 0s0s??
Making sense of binary Making sense of binary datadata at scale at scale
Confidential, Copyright © Quanticate
Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance
Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity
What's with the 1s and 0s?
●
Detecting File Types
●
Detecting Text
●
A quick introduction to Apache Tika
●
How things can go wrong at Scale...
●
Working out if things are getting better or worse!
●
Other tools to consider
Confidential, Copyright © Quanticate
Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance
Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity
Detecting File TypesDetecting File Types
Confidential, Copyright © Quanticate
Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance
Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity
Isn't detecting simple?
●
 Surely you just know what a file is on your computer?
●
 Well, probably on your computer, and maybe elsewhere?
●
 OK, so maybe people rename things, but it's close, no?
●
 Ah, the internet... But that's normally right isn't it?
●
 Hmm, well most web servers tell the truth, right?
●
 They wouldn't get it that wrong?
●
 A few percent of the internet, that's hardly that much?
●
 And people would never rename things by accident?
●
 Operating Systems would never “help”, would they?
Confidential, Copyright © Quanticate
Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance
Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity
Filenames
●
 Filenames – normally, but not always have extensions
●
 There aren't that many extension combinations
●
 There are probably more file formats than that
●
 No official way to reserve an extension
●
 So everyone just picks a “sensible” one, and hopes that don't 
have (too many) clashes...
●
 What happens if you rename a file though?
●
 Or have a file without one?
●
 Quick, but dirty, and may not be right...
Confidential, Copyright © Quanticate
Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance
Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity
Mime Magic
●
 Most file formats have a well known structure
●
 Most of these have a (mostly) unique pattern near the start
●
 These are often called Mime Magic Numbers
●
 In some cases, these are numbers
●
 More commonly, they're some sort of number / text / bit mask
●
 Ideally located at a fixed offset, even better, right at the start 
of the file
●
 But not always...
Confidential, Copyright © Quanticate
Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance
Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity
More on Magic
●
 PDFs (should) start with   %PDF­
●
 Microsoft Office OLE2 docs start with  0xd0cf11e0a1b11ae1
●
 Most Zip files start with   PK003004
●
 AIFF starts with   FORM????AI(FF|FC)   (mask 5­8)
●
 PE Executables normally have PE000000  at 128 or 240
●
 Not all of these are true constants
●
 Not all of these are unique ­ 0xfffe can be UTF­16LE or MP3
●
 Container formats – Zip can be Zip, OOXML, iWorks etc
Confidential, Copyright © Quanticate
Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance
Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity
Container Formats
●
 Some file formats are actually containers, and can hold lots 
of different things in them
●
 For example, a .zip file could just be a zip of random files
●
 Or it could be a Microsoft OOXML file (eg .docx, .pptx)
●
 Or it could be an OpenDocument Format file (eg .ods)
●
 Or it could be an iWorks file (Numbers, Pages, Keynote)
●
 Or it could be an ePub file
●
 Or....
●
 A .ogg could be audio, video, text, or many!
Confidential, Copyright © Quanticate
Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance
Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity
Dealing with Containers
●
 We can use Mime Magic to detect the container itself
●
 But to go beyond that, we need to actually parse the 
container, and look for special entries
●
 All the container based formats have some way of identifying 
the contents, of varying difficulty
●
 eg Check zip entries, look for _rels/.rels, it's OOXML, then 
read that file to get the mimetype
●
 eg Check Ogg for CMML, use that to get the primary stream 
type, else count stream types
Confidential, Copyright © Quanticate
Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance
Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity
Patterns inside the file
●
 Not all file formats have nice Mime Magic near­ish the start
●
 Many do have some sort of structure, eg repeating patterns 
within them
●
 Varies a great deal between file formats, and even between 
what's in a given collection of files
●
 If you manually classify a number of files of a given type, you 
can then use machine learning to build byte histograms and n­
gram patterns for them, along with negative patterns too
●
 Then compare positive and negatives patterns to detect
●
 Needs training data sets and lots of pre­computation
Confidential, Copyright © Quanticate
Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance
Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity
Pulling it all together
●
 Need to combine various techniques
●
 eg Use Mime Magic to spot it's a container, then use 
container­specific code to identify the type
●
 eg Use filename to tell the difference between two very 
similar different file types
●
 Using more techniques can increase quality of detection, but 
slows things down. Consider Speed vs Accuracy
●
 May need custom weightings, eg from machine learning
●
 Apache Tika combines different techniques for you via 
DefaultDetector, which auto­loads all available detections
Confidential, Copyright © Quanticate
Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance
Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity
Detecting TextDetecting Text
Confidential, Copyright © Quanticate
Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance
Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity
Encodings 101
●
 Many different ways to encode text in a file
●
 “A” could be 0x41 (ascii,utf8 etc), 0x0041 (utf16le), 0x4100 
(utf16be), 0xC1 (ebcdic)
●
 1­byte­per­character encodings historically very common
●
 But means that you need lots of different encodings to cope 
with different langauges and character sets
●
 0xE1 – could be:  á     с  α ‫ב‬ ‫ﻑ‬ แ (or something else too!)
●
 Some formats include what encoding they've used
●
 But many key ones, including plain text, do not!
Confidential, Copyright © Quanticate
Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance
Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity
Languages
●
 Different languages have different common patterns of 
letters, based on words, spellings and patterns
●
 If you see accents like á  è  ç then it probably isn't English
●
 If you see a word starting with an S, it probably isn't Spanish, 
but if you see lots starting “ES” it might be
●
 You can look for these patterns, and use those to identify 
what language some text might be in
●
 Really needs quite a bit of text to work on though, it's very 
hard to make meaningful guesses on just a few letters!
Confidential, Copyright © Quanticate
Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance
Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity
n­grams
●
 Wikipedia says “An n­gram model is a type of probabilistic 
language model for predicting the next item in such a 
sequence in the form of a (n ­ 1)–order Markov model”
●
 Basically, for us, it's all the possible character combinations 
(including start + end markers) along with their frequency
●
 The “n” is the size
●
 Trigrams of “hello” are [ ]he, hel, ell, llo, lo[ ]
●
 Quadgrams of “hello” are [ ]hel, ello, llo[ ]
●
 Can be used to identify both language and encoding
Confidential, Copyright © Quanticate
Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance
Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity
False Positives, Problems
●
 For encoding detection to work, your tool (eg Tika) needs to 
recognise the file as Plain Text
●
 Too many control characters near the start can cause Tika 
and friends to decide it isn't Plain Text, so won't detect
●
 Some encodings are very similar, hard to tell apart
●
 For short runs of text, very hard to be sure what it is
●
 Same pattern can crop up in different languages
●
 Same pattern could occur between different languages when 
in different encodings
Confidential, Copyright © Quanticate
Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance
Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity
Apache Tika in a nutshell Apache Tika in a nutshell 
Confidential, Copyright © Quanticate
Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance
Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity
Apache Tika in a nutshell
“small, yellow and leech­like, and 
probably the oddest thing in the Universe”
●
 Like a Babel Fish for content!
●
 Helps you work out what sort of thing 
your content (1s & 0s) is
●
 Helps you extract the metadata from it, 
in a consistent way
●
 Lets you get a plain text version of your 
content, eg for full text indexing
●
 Provides a rich (XHTML) version too
Confidential, Copyright © Quanticate
Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance
Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity
(Some) Supported Formats
●
 Microsoft Office – Word, Excel, PowerPoint, Works, 
Publisher, Visio – Binary and OOXML formats
●
 OpenDocument (OpenOffice), 
●
 iWorks – Keynote, Pages, Numbers
●
 HTML, XHTML, XML, PDF, RTF, Plain Text, CHM Help
●
 Compression / Archive – Zip, Tar, Ar, 7z, bz2, gz etc
●
 Atom, RSS, ePub            Lots of Scientific formats
●
 Audio – MP3, MP4, Vorbis, Opus, Speex, MIDI, Wav
●
 Image – JPEG, TIFF, PNG, BMP, GIF, ICO
Confidential, Copyright © Quanticate
Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance
Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity
Ways of calling Tika
●
 Tika­App – command line tool
●
 Pure Java – Tika Facade – simple way from Java
●
 Pure Java – Direct – full control over what happens
●
 OSGi – all your dependencies nicely wrapped up
●
 Forked Parser – parsing in a different JVM
●
 JAX­RS Network Server – RESTful interface to Tika
●
 SOLR Plugin – Tika parsing from within SOLR upload
●
 Anything you want to write yourself!
●
 (These are just the main ways)
Confidential, Copyright © Quanticate
Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance
Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity
Tika App
●
 Single runnable jar, containing all of Tika + dependencies
●
 GUI mode, ideal for quick testing and demos
●
 CLI mode, suitable for calling from non­Java programs
●
 Detection: ­­detect
●
 Metadata: ­­metadata
●
 Plain Text: ­­text
●
 XHTML: ­­xml
●
 Information on available mimetypes, parsers etc
●
 JVM startup costs means not the fastest for lots of docs
Confidential, Copyright © Quanticate
Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance
Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity
JAX­RS Network Server
●
 JAX­RS – JSR­311 RESTful network server
●
 Powered by Apache CXF
●
 The tika­server jar can be run, starts up the server
●
 Content is sent via HTTP PUT calls to service URLs
●
 Supports detection, metadata extraction, plain text, xhtml, 
and a few other things too, plus info on parsers/formats/etc
●
 See https://wiki.apache.org/tika/TikaJAXRS
●
 Recommended way to call Tika from non­Java languages
●
 Runs in a new JVM
Confidential, Copyright © Quanticate
Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance
Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity
Tika Facade (Java)
●
 Very simple way to call Tika
●
 Tika facade class – org.apache.tika.Tika
●
 Detection from streams, bytes, files, urls etc
●
 Parsing to Strings, Readers, Plain Text or XHTML
●
 By default, uses all available parsers and detectors
●
 Requires that all the Tika jars, and their dependencies are 
present on the classpath, and don't clash
●
 Harder to check if that hasn't worked properly, eg missing 
jars or jar clashes
Confidential, Copyright © Quanticate
Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance
Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity
Java – Direct calls
●
 Slightly more lines of code, but you get full control
●
 Normally start with org.apache.tika.TikaConfig
●
 From that fetch Detectors, Parsers, Mime Types list etc
●
 Typically want to use AutoDetectParser to actually parse
●
 If possible, wrap your content with a TikaInputStream before 
parsing – create from InputStream or File
●
 Still needs all jars on classpath
●
 Ask DefaultParser what Parsers + mimetypes are available
Confidential, Copyright © Quanticate
Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance
Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity
Java – Forked Parser
●
 By default, Tika runs in the same JVM as the rest of your 
code, so if Tika (or a library it depends on) breaks, so does the 
rest of your JVM
●
 This shouldn't happen, but some broken files can trigger 
parsers to run out of memory, or loop forever
●
 If you're doing internet scale things, this becomes all the 
more likely to be hit eventually!
●
 Forked Parser fires up a new JVM, sends over Tika + 
dependencies, then calls to that JVM to do the parsing
●
 If that crashed, doesn't affect your main JVM
Confidential, Copyright © Quanticate
Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance
Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity
What Tika tries to do
●
 Tika aims to map file format specific metadata onto a 
common, consistent set, based on well known standards
●
 Tika aims to provide semantically meaningful, but not 
cluttered XHTML
●
 It isn't supposed to be a perfect rendition of the original file 
into XHTML
●
 It is supposed to be simple and clean
●
 eg Word Parser reports tables and style names, but not 
Word­style HTML full of fonts +colours
●
 eg Excel Parser returns what's there, not a CSV
Confidential, Copyright © Quanticate
Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance
Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity
Extending Tika
●
 Mimetypes: Tika has >1400 defined, not all with magic
●
 You can define your own in the same structure, put a 
org/apache/tika/mime/custom­mimetypes.xml on classpath
●
 Parsers: Tika has >70 parsers available, some formats have 
multiple which can be used
●
 You can write your own quickly and register them for use, 
see http://tika.apache.org/1.8/parser_guide.html for how
●
 Basic parser is 13 lines of code, work involved depends on 
what libraries you have available for your file format
●
 Tika Config XML allows you control of which parser to use
Confidential, Copyright © Quanticate
Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance
Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity
Embedded ResourcesEmbedded Resources
Confidential, Copyright © Quanticate
Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance
Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity
Not just containers
●
 Container formats like Zip contain embedded resources, the 
files within them
●
 Many office documents support embedded resources too
●
 Microsoft Office documents can have other documents 
embedded within them, eg Excel in PowerPoint
●
 Most of the document formats support embedding images 
within the document, and many video too
●
 Where possible, Tika gives you the embedded resource, and 
indicates in the XHTML where it came from
●
 Up to you how / where / if to handle these resources
Confidential, Copyright © Quanticate
Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance
Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity
Going wrong at Scale...Going wrong at Scale...
Confidential, Copyright © Quanticate
Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance
Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity
Lots of Data is Junk
●
 At scale, you're going to hit lots of edge cases
●
 At scale, you're going to come across lots of junk 
or corrupted documents
●
 1% of a lot is still a lot...
●
 Bound to find files which are unusual or corrupted 
enough to be mis­identified
●
 You need to plan for failures!
Confidential, Copyright © Quanticate
Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance
Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity
Unusual Types
●
 If you're working on a big data scale, you're bound to come 
across lots of valid but unusual + unknown files
●
 You're never going to be able to add support for all of them!
●
 May be worth adding support for the more common 
“uncommon” unsupported types
●
 Which means you'll need to track something about the files 
you couldn't understand
●
 If Tika knows the mimetype but has no parser, just log the 
mimetype
●
 If mimetype unknown, maybe log first few bytes
Confidential, Copyright © Quanticate
Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance
Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity
Failure at Scale
●
 Tika will sometimes mis­identify something, so sometimes 
the wrong parser will run and object
●
 Some files will cause parsers or their underlying libraries to 
do something silly, such as use lots of memory or get into 
loops with lots to do
●
 Some files will cause parsers or their underlying libraries to 
OOM, or infinite loop, or something else bad
●
 If a file fails once, will probably fail again, so blindly just re­
running that task again won't help
Confidential, Copyright © Quanticate
Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance
Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity
Failure at Scale continued...
●
 You'll need approaches that plan for failure
●
 Consider what will happen if a file locks up your JVM, or kills 
it with an OOM
●
 Forked Parser may be worth using
●
 Running a separate Tika Server could be good
●
 Depending on work needed, could have a smaller pool of 
Tika Server instances for big data code to call
●
 Think about failure modes, then think about retries (or not)
●
 Track common problems, report and fix them!
Confidential, Copyright © Quanticate
Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance
Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity
Getting Better? Or Worse?Getting Better? Or Worse?
Confidential, Copyright © Quanticate
Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance
Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity
When things go wrong
Confidential, Copyright © Quanticate
Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance
Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity
When things go wrong
You don't know what you can't find...
Confidential, Copyright © Quanticate
Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance
Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity
What can go wrong
●
 Catastrophic failures
●
 Out of Memory Errors
●
 Infinite Hangs
●
 Memory Leaks
●
 Exceptions: Null Pointer, etc.
●
 Extraction with loss of fidelity
●
 Missing text/metadata/attachments
●
 Extra text (eg placeholder, dummy, hidden, internal etc)
●
 Garbled text
Confidential, Copyright © Quanticate
Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance
Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity
Did this change help?
●
 Unit tests can tell you about major failures
●
 Unit tests only check the things you know about though...
●
 And only check a small number of files of each type
●
 There is a much wider variety of files out there than in even a 
decent­sized test suite
●
 File type distributions are uneven, a “minor issue” for me 
might be a “critical issue” for someone else!
●
 You can check 10 documents by eye, you can't check 10gb 
of files by eye, let alone 1tb!
Confidential, Copyright © Quanticate
Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance
Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity
Did this library change help?
●
 Large scale testing needed to spot these kinds of problems
●
 Needs automation of running and of analysis!
●
 Flag up key issues, look at those in detail
●
 Exceptions and Errors are easy  (1st
 one per file anyway...)
●
 Metadata little harder, but sizes small
●
 Attachment counts easy, contents less so
●
 Finding junk text harder
●
 Missing text tougher still
Confidential, Copyright © Quanticate
Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance
Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity
Did this library change help?
●
 Need to store output from many runs to compare
●
 Common terms per file can identify problems, if it goes from 
“differences” to “4NGTMKY” then something's wrong
●
 Common terms can help with missing text, but not always
●
 Token entropy can show garbage text, or data heavy files!
●
 Too many errors to fix all of them
●
 Need metrics to know which to focus on
●
 Govdocs1, Common Crawl output, more datasets needed!
Confidential, Copyright © Quanticate
Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance
Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity
Apache Tika at ScaleApache Tika at Scale
Confidential, Copyright © Quanticate
Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance
Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity
Lots of Data is Junk
●
At scale, you're going to hit lots of edge cases
●
At scale, you're going to come across lots of junk 
or corrupted documents
●
1% of a lot is still a lot...
●
Bound to find files which are unusual or 
corrupted enough to be mis­identified
●
You need to plan for failures!
Confidential, Copyright © Quanticate
Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance
Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity
Unusual Types
●
If you're working on a big data scale, you're bound to come 
across lots of valid but unusual + unknown files
●
You're never going to be able to add support for all of them!
●
May be worth adding support for the more common 
“uncommon” unsupported types
●
Which means you'll need to track something about the files 
you couldn't understand
●
If Tika knows the mimetype but has no parser, just log the 
mimetype
●
If mimetype unknown, maybe log first few bytes
Confidential, Copyright © Quanticate
Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance
Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity
Failure At Scale
●
 Tika will sometimes mis­identify something, so sometimes 
the wrong parser will run and object
●
 Some files will cause parsers or their underlying libraries to 
do something silly, such as use lots of memory or get into 
loops with lots to do
●
 Some files will cause parsers or their underlying libraries to 
OOM, or infinite loop, or something else bad
●
 If a file fails once, will probably fail again, so blindly just re­
running that task again won't help
●
 A library upgrade might help, so track which ones failed
Confidential, Copyright © Quanticate
Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance
Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity
Failure At Scale Continued
●
 You'll need approaches that plan for failure
●
 Consider what will happen if a file locks up your JVM, or kills 
it with an OOM
●
 Forked Parser may be worth using
●
 Running a separate Tika Server could be good
●
 Depending on work needed, could have a smaller pool of 
Tika Server instances for big data code to call
●
 Think about failure modes, then think about retries (or not)
●
 Track common problems, report and fix them!
Confidential, Copyright © Quanticate
Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance
Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity
Tika Eval
●
 Runs Tika against a corpus of documents, or compares the 
results of running two different versions on those documents
●
 Stack traces and exception counts
●
 File id, language id
●
 Attachment and metadata counts
●
 Top 10 most common words
●
 Content length
●
 Token length statistics, Token entropy
●
 Gives reports on key things for a human to check
Confidential, Copyright © Quanticate
Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance
Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity
Tika Batch
●
 Tools for running Tika across lots of documents, where some 
 of them are going to fail
●
 Tika Batch App is complete, used by current Eval work
●
 WIP: Tika Batch Hadoop and Tika Batch Storm
●
 http://wiki.apache.org/tika/TikaInHadoop has lots of pointers 
on how to handle Tika on Hadoop right now
●
 Related: https://github.com/DigitalPebble/behemoth/
Confidential, Copyright © Quanticate
Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance
Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity
Other Projects to know aboutOther Projects to know about
Confidential, Copyright © Quanticate
Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance
Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity
Apache Projects
●
 Nutch
●
 Storm Crawler
●
 Any23
●
 OpenNLP
●
 UIMA
●
 CTAKES
●
 Any from the audience?
Confidential, Copyright © Quanticate
Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance
Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity
External Projects
●
 Hardened Tika
●
 Behemoth, and Behemoth extensions to Tika
●
 Google Chrome “Compact Language Detector” (CLD)
●
 Mozilla's LibCharsetDetect
●
 ICU4J
●
 Any from the audience?
Confidential, Copyright © Quanticate
Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance
Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity
Any Questions?Any Questions?
Confidential, Copyright © Quanticate
Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance
Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity
Nick BurchNick Burch
@Gagravarr@Gagravarr

Weitere ähnliche Inhalte

Ähnlich wie What's with the 1s and 0s? Making sense of binary data at scale - Berlin Buzzwords 2015

Ensuring Data Quality in Databricks Unleashing the Power of Great Expectation...
Ensuring Data Quality in Databricks Unleashing the Power of Great Expectation...Ensuring Data Quality in Databricks Unleashing the Power of Great Expectation...
Ensuring Data Quality in Databricks Unleashing the Power of Great Expectation...Knoldus Inc.
 
ARMA Calgary Spring Seminar: The Nuts and Bolts of Metadata Tagging and Taxon...
ARMA Calgary Spring Seminar: The Nuts and Bolts of Metadata Tagging and Taxon...ARMA Calgary Spring Seminar: The Nuts and Bolts of Metadata Tagging and Taxon...
ARMA Calgary Spring Seminar: The Nuts and Bolts of Metadata Tagging and Taxon...Concept Searching, Inc
 
Turning XML to XLS on the JVM, without loosing your Sanity, with Groovy
Turning XML to XLS on the JVM, without loosing your Sanity, with GroovyTurning XML to XLS on the JVM, without loosing your Sanity, with Groovy
Turning XML to XLS on the JVM, without loosing your Sanity, with Groovygagravarr
 
Ashley Ohmann--Data Governance Final 011315
Ashley Ohmann--Data Governance Final 011315Ashley Ohmann--Data Governance Final 011315
Ashley Ohmann--Data Governance Final 011315Ashley Ohmann
 
Getting Started with Product Analytics - A 101 Implementation Guide for Begin...
Getting Started with Product Analytics - A 101 Implementation Guide for Begin...Getting Started with Product Analytics - A 101 Implementation Guide for Begin...
Getting Started with Product Analytics - A 101 Implementation Guide for Begin...Vishrut Shukla
 
Hadoop and Your Data Warehouse
Hadoop and Your Data WarehouseHadoop and Your Data Warehouse
Hadoop and Your Data WarehouseCaserta
 
Is Your Content Migration Strategy Garbage In, Garbage Out? Webinar
Is Your Content Migration Strategy Garbage In, Garbage Out? WebinarIs Your Content Migration Strategy Garbage In, Garbage Out? Webinar
Is Your Content Migration Strategy Garbage In, Garbage Out? WebinarConcept Searching, Inc
 
Research data management for medical data with pyradigm
Research data management for medical data with pyradigmResearch data management for medical data with pyradigm
Research data management for medical data with pyradigmPradeep Redddy Raamana
 
Data Profiling: The First Step to Big Data Quality
Data Profiling: The First Step to Big Data QualityData Profiling: The First Step to Big Data Quality
Data Profiling: The First Step to Big Data QualityPrecisely
 
Amp Up Your Testing by Harnessing Test Data
Amp Up Your Testing by Harnessing Test DataAmp Up Your Testing by Harnessing Test Data
Amp Up Your Testing by Harnessing Test DataTechWell
 
Correlation does not mean causation
Correlation does not mean causationCorrelation does not mean causation
Correlation does not mean causationPeter Varhol
 
Effectively Capturing Paper and Digital Documents in your Existing Applicatio...
Effectively Capturing Paper and Digital Documents in your Existing Applicatio...Effectively Capturing Paper and Digital Documents in your Existing Applicatio...
Effectively Capturing Paper and Digital Documents in your Existing Applicatio...J. Kevin Parker, CIP
 
Data quality management Basic
Data quality management BasicData quality management Basic
Data quality management BasicKhaled Mosharraf
 
Data Quality at the Speed of Work
Data Quality at the Speed of WorkData Quality at the Speed of Work
Data Quality at the Speed of WorkTechWell
 
[DSC Europe 22] Govern your event streams - Ivan Dundovic
[DSC Europe 22] Govern your event streams - Ivan Dundovic[DSC Europe 22] Govern your event streams - Ivan Dundovic
[DSC Europe 22] Govern your event streams - Ivan DundovicDataScienceConferenc1
 

Ähnlich wie What's with the 1s and 0s? Making sense of binary data at scale - Berlin Buzzwords 2015 (20)

Ensuring Data Quality in Databricks Unleashing the Power of Great Expectation...
Ensuring Data Quality in Databricks Unleashing the Power of Great Expectation...Ensuring Data Quality in Databricks Unleashing the Power of Great Expectation...
Ensuring Data Quality in Databricks Unleashing the Power of Great Expectation...
 
ARMA Calgary Spring Seminar: The Nuts and Bolts of Metadata Tagging and Taxon...
ARMA Calgary Spring Seminar: The Nuts and Bolts of Metadata Tagging and Taxon...ARMA Calgary Spring Seminar: The Nuts and Bolts of Metadata Tagging and Taxon...
ARMA Calgary Spring Seminar: The Nuts and Bolts of Metadata Tagging and Taxon...
 
1 d.1
1 d.11 d.1
1 d.1
 
Turning XML to XLS on the JVM, without loosing your Sanity, with Groovy
Turning XML to XLS on the JVM, without loosing your Sanity, with GroovyTurning XML to XLS on the JVM, without loosing your Sanity, with Groovy
Turning XML to XLS on the JVM, without loosing your Sanity, with Groovy
 
Ashley Ohmann--Data Governance Final 011315
Ashley Ohmann--Data Governance Final 011315Ashley Ohmann--Data Governance Final 011315
Ashley Ohmann--Data Governance Final 011315
 
Big data ppt
Big data pptBig data ppt
Big data ppt
 
Getting Started with Product Analytics - A 101 Implementation Guide for Begin...
Getting Started with Product Analytics - A 101 Implementation Guide for Begin...Getting Started with Product Analytics - A 101 Implementation Guide for Begin...
Getting Started with Product Analytics - A 101 Implementation Guide for Begin...
 
RPE.pptx
RPE.pptxRPE.pptx
RPE.pptx
 
Hadoop and Your Data Warehouse
Hadoop and Your Data WarehouseHadoop and Your Data Warehouse
Hadoop and Your Data Warehouse
 
Is Your Content Migration Strategy Garbage In, Garbage Out? Webinar
Is Your Content Migration Strategy Garbage In, Garbage Out? WebinarIs Your Content Migration Strategy Garbage In, Garbage Out? Webinar
Is Your Content Migration Strategy Garbage In, Garbage Out? Webinar
 
Research data management for medical data with pyradigm
Research data management for medical data with pyradigmResearch data management for medical data with pyradigm
Research data management for medical data with pyradigm
 
Data Profiling: The First Step to Big Data Quality
Data Profiling: The First Step to Big Data QualityData Profiling: The First Step to Big Data Quality
Data Profiling: The First Step to Big Data Quality
 
Amp Up Your Testing by Harnessing Test Data
Amp Up Your Testing by Harnessing Test DataAmp Up Your Testing by Harnessing Test Data
Amp Up Your Testing by Harnessing Test Data
 
Machine Learning & Apache Mahout
Machine Learning & Apache MahoutMachine Learning & Apache Mahout
Machine Learning & Apache Mahout
 
Correlation does not mean causation
Correlation does not mean causationCorrelation does not mean causation
Correlation does not mean causation
 
Effectively Capturing Paper and Digital Documents in your Existing Applicatio...
Effectively Capturing Paper and Digital Documents in your Existing Applicatio...Effectively Capturing Paper and Digital Documents in your Existing Applicatio...
Effectively Capturing Paper and Digital Documents in your Existing Applicatio...
 
Taxonomy 101
Taxonomy 101Taxonomy 101
Taxonomy 101
 
Data quality management Basic
Data quality management BasicData quality management Basic
Data quality management Basic
 
Data Quality at the Speed of Work
Data Quality at the Speed of WorkData Quality at the Speed of Work
Data Quality at the Speed of Work
 
[DSC Europe 22] Govern your event streams - Ivan Dundovic
[DSC Europe 22] Govern your event streams - Ivan Dundovic[DSC Europe 22] Govern your event streams - Ivan Dundovic
[DSC Europe 22] Govern your event streams - Ivan Dundovic
 

Mehr von gagravarr

But we're already open source! Why would I want to bring my code to Apache?
But we're already open source! Why would I want to bring my code to Apache?But we're already open source! Why would I want to bring my code to Apache?
But we're already open source! Why would I want to bring my code to Apache?gagravarr
 
What's new with Apache Tika?
What's new with Apache Tika?What's new with Apache Tika?
What's new with Apache Tika?gagravarr
 
The Apache Way
The Apache WayThe Apache Way
The Apache Waygagravarr
 
The other Apache Technologies your Big Data solution needs
The other Apache Technologies your Big Data solution needsThe other Apache Technologies your Big Data solution needs
The other Apache Technologies your Big Data solution needsgagravarr
 
How Big is Big – Tall, Grande, Venti Data?
How Big is Big – Tall, Grande, Venti Data?How Big is Big – Tall, Grande, Venti Data?
How Big is Big – Tall, Grande, Venti Data?gagravarr
 
If You Have The Content, Then Apache Has The Technology!
If You Have The Content, Then Apache Has The Technology!If You Have The Content, Then Apache Has The Technology!
If You Have The Content, Then Apache Has The Technology!gagravarr
 
But We're Already Open Source! Why Would I Want To Bring My Code To Apache?
But We're Already Open Source! Why Would I Want To Bring My Code To Apache?But We're Already Open Source! Why Would I Want To Bring My Code To Apache?
But We're Already Open Source! Why Would I Want To Bring My Code To Apache?gagravarr
 
What's With The 1S And 0S? Making Sense Of Binary Data At Scale With Tika And...
What's With The 1S And 0S? Making Sense Of Binary Data At Scale With Tika And...What's With The 1S And 0S? Making Sense Of Binary Data At Scale With Tika And...
What's With The 1S And 0S? Making Sense Of Binary Data At Scale With Tika And...gagravarr
 
What's with the 1s and 0s? Making sense of binary data at scale with Tika and...
What's with the 1s and 0s? Making sense of binary data at scale with Tika and...What's with the 1s and 0s? Making sense of binary data at scale with Tika and...
What's with the 1s and 0s? Making sense of binary data at scale with Tika and...gagravarr
 
The other Apache technologies your big data solution needs!
The other Apache technologies your big data solution needs!The other Apache technologies your big data solution needs!
The other Apache technologies your big data solution needs!gagravarr
 
Apache Tika end-to-end
Apache Tika end-to-endApache Tika end-to-end
Apache Tika end-to-endgagravarr
 
Apache Content Technologies
Apache Content TechnologiesApache Content Technologies
Apache Content Technologiesgagravarr
 

Mehr von gagravarr (12)

But we're already open source! Why would I want to bring my code to Apache?
But we're already open source! Why would I want to bring my code to Apache?But we're already open source! Why would I want to bring my code to Apache?
But we're already open source! Why would I want to bring my code to Apache?
 
What's new with Apache Tika?
What's new with Apache Tika?What's new with Apache Tika?
What's new with Apache Tika?
 
The Apache Way
The Apache WayThe Apache Way
The Apache Way
 
The other Apache Technologies your Big Data solution needs
The other Apache Technologies your Big Data solution needsThe other Apache Technologies your Big Data solution needs
The other Apache Technologies your Big Data solution needs
 
How Big is Big – Tall, Grande, Venti Data?
How Big is Big – Tall, Grande, Venti Data?How Big is Big – Tall, Grande, Venti Data?
How Big is Big – Tall, Grande, Venti Data?
 
If You Have The Content, Then Apache Has The Technology!
If You Have The Content, Then Apache Has The Technology!If You Have The Content, Then Apache Has The Technology!
If You Have The Content, Then Apache Has The Technology!
 
But We're Already Open Source! Why Would I Want To Bring My Code To Apache?
But We're Already Open Source! Why Would I Want To Bring My Code To Apache?But We're Already Open Source! Why Would I Want To Bring My Code To Apache?
But We're Already Open Source! Why Would I Want To Bring My Code To Apache?
 
What's With The 1S And 0S? Making Sense Of Binary Data At Scale With Tika And...
What's With The 1S And 0S? Making Sense Of Binary Data At Scale With Tika And...What's With The 1S And 0S? Making Sense Of Binary Data At Scale With Tika And...
What's With The 1S And 0S? Making Sense Of Binary Data At Scale With Tika And...
 
What's with the 1s and 0s? Making sense of binary data at scale with Tika and...
What's with the 1s and 0s? Making sense of binary data at scale with Tika and...What's with the 1s and 0s? Making sense of binary data at scale with Tika and...
What's with the 1s and 0s? Making sense of binary data at scale with Tika and...
 
The other Apache technologies your big data solution needs!
The other Apache technologies your big data solution needs!The other Apache technologies your big data solution needs!
The other Apache technologies your big data solution needs!
 
Apache Tika end-to-end
Apache Tika end-to-endApache Tika end-to-end
Apache Tika end-to-end
 
Apache Content Technologies
Apache Content TechnologiesApache Content Technologies
Apache Content Technologies
 

Kürzlich hochgeladen

How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsAndolasoft Inc
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providermohitmore19
 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AIABDERRAOUF MEHENNI
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfkalichargn70th171
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsAlberto González Trastoy
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...kellynguyen01
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerThousandEyes
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...Health
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVshikhaohhpro
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comFatema Valibhai
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...OnePlan Solutions
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...MyIntelliSource, Inc.
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️anilsa9823
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfkalichargn70th171
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...panagenda
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsArshad QA
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Modelsaagamshah0812
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️Delhi Call girls
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...harshavardhanraghave
 

Kürzlich hochgeladen (20)

How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.js
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS LiveVip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
 

What's with the 1s and 0s? Making sense of binary data at scale - Berlin Buzzwords 2015

  • 2. Confidential, Copyright © Quanticate Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity What's with the 1s and 0s? ● Detecting File Types ● Detecting Text ● A quick introduction to Apache Tika ● How things can go wrong at Scale... ● Working out if things are getting better or worse! ● Other tools to consider
  • 3. Confidential, Copyright © Quanticate Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity Detecting File TypesDetecting File Types
  • 4. Confidential, Copyright © Quanticate Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity Isn't detecting simple? ●  Surely you just know what a file is on your computer? ●  Well, probably on your computer, and maybe elsewhere? ●  OK, so maybe people rename things, but it's close, no? ●  Ah, the internet... But that's normally right isn't it? ●  Hmm, well most web servers tell the truth, right? ●  They wouldn't get it that wrong? ●  A few percent of the internet, that's hardly that much? ●  And people would never rename things by accident? ●  Operating Systems would never “help”, would they?
  • 5. Confidential, Copyright © Quanticate Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity Filenames ●  Filenames – normally, but not always have extensions ●  There aren't that many extension combinations ●  There are probably more file formats than that ●  No official way to reserve an extension ●  So everyone just picks a “sensible” one, and hopes that don't  have (too many) clashes... ●  What happens if you rename a file though? ●  Or have a file without one? ●  Quick, but dirty, and may not be right...
  • 6. Confidential, Copyright © Quanticate Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity Mime Magic ●  Most file formats have a well known structure ●  Most of these have a (mostly) unique pattern near the start ●  These are often called Mime Magic Numbers ●  In some cases, these are numbers ●  More commonly, they're some sort of number / text / bit mask ●  Ideally located at a fixed offset, even better, right at the start  of the file ●  But not always...
  • 7. Confidential, Copyright © Quanticate Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity More on Magic ●  PDFs (should) start with   %PDF­ ●  Microsoft Office OLE2 docs start with  0xd0cf11e0a1b11ae1 ●  Most Zip files start with   PK003004 ●  AIFF starts with   FORM????AI(FF|FC)   (mask 5­8) ●  PE Executables normally have PE000000  at 128 or 240 ●  Not all of these are true constants ●  Not all of these are unique ­ 0xfffe can be UTF­16LE or MP3 ●  Container formats – Zip can be Zip, OOXML, iWorks etc
  • 8. Confidential, Copyright © Quanticate Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity Container Formats ●  Some file formats are actually containers, and can hold lots  of different things in them ●  For example, a .zip file could just be a zip of random files ●  Or it could be a Microsoft OOXML file (eg .docx, .pptx) ●  Or it could be an OpenDocument Format file (eg .ods) ●  Or it could be an iWorks file (Numbers, Pages, Keynote) ●  Or it could be an ePub file ●  Or.... ●  A .ogg could be audio, video, text, or many!
  • 9. Confidential, Copyright © Quanticate Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity Dealing with Containers ●  We can use Mime Magic to detect the container itself ●  But to go beyond that, we need to actually parse the  container, and look for special entries ●  All the container based formats have some way of identifying  the contents, of varying difficulty ●  eg Check zip entries, look for _rels/.rels, it's OOXML, then  read that file to get the mimetype ●  eg Check Ogg for CMML, use that to get the primary stream  type, else count stream types
  • 10. Confidential, Copyright © Quanticate Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity Patterns inside the file ●  Not all file formats have nice Mime Magic near­ish the start ●  Many do have some sort of structure, eg repeating patterns  within them ●  Varies a great deal between file formats, and even between  what's in a given collection of files ●  If you manually classify a number of files of a given type, you  can then use machine learning to build byte histograms and n­ gram patterns for them, along with negative patterns too ●  Then compare positive and negatives patterns to detect ●  Needs training data sets and lots of pre­computation
  • 11. Confidential, Copyright © Quanticate Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity Pulling it all together ●  Need to combine various techniques ●  eg Use Mime Magic to spot it's a container, then use  container­specific code to identify the type ●  eg Use filename to tell the difference between two very  similar different file types ●  Using more techniques can increase quality of detection, but  slows things down. Consider Speed vs Accuracy ●  May need custom weightings, eg from machine learning ●  Apache Tika combines different techniques for you via  DefaultDetector, which auto­loads all available detections
  • 12. Confidential, Copyright © Quanticate Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity Detecting TextDetecting Text
  • 13. Confidential, Copyright © Quanticate Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity Encodings 101 ●  Many different ways to encode text in a file ●  “A” could be 0x41 (ascii,utf8 etc), 0x0041 (utf16le), 0x4100  (utf16be), 0xC1 (ebcdic) ●  1­byte­per­character encodings historically very common ●  But means that you need lots of different encodings to cope  with different langauges and character sets ●  0xE1 – could be:  á     с  α ‫ב‬ ‫ﻑ‬ แ (or something else too!) ●  Some formats include what encoding they've used ●  But many key ones, including plain text, do not!
  • 14. Confidential, Copyright © Quanticate Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity Languages ●  Different languages have different common patterns of  letters, based on words, spellings and patterns ●  If you see accents like á  è  ç then it probably isn't English ●  If you see a word starting with an S, it probably isn't Spanish,  but if you see lots starting “ES” it might be ●  You can look for these patterns, and use those to identify  what language some text might be in ●  Really needs quite a bit of text to work on though, it's very  hard to make meaningful guesses on just a few letters!
  • 15. Confidential, Copyright © Quanticate Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity n­grams ●  Wikipedia says “An n­gram model is a type of probabilistic  language model for predicting the next item in such a  sequence in the form of a (n ­ 1)–order Markov model” ●  Basically, for us, it's all the possible character combinations  (including start + end markers) along with their frequency ●  The “n” is the size ●  Trigrams of “hello” are [ ]he, hel, ell, llo, lo[ ] ●  Quadgrams of “hello” are [ ]hel, ello, llo[ ] ●  Can be used to identify both language and encoding
  • 16. Confidential, Copyright © Quanticate Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity False Positives, Problems ●  For encoding detection to work, your tool (eg Tika) needs to  recognise the file as Plain Text ●  Too many control characters near the start can cause Tika  and friends to decide it isn't Plain Text, so won't detect ●  Some encodings are very similar, hard to tell apart ●  For short runs of text, very hard to be sure what it is ●  Same pattern can crop up in different languages ●  Same pattern could occur between different languages when  in different encodings
  • 17. Confidential, Copyright © Quanticate Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity Apache Tika in a nutshell Apache Tika in a nutshell 
  • 18. Confidential, Copyright © Quanticate Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity Apache Tika in a nutshell “small, yellow and leech­like, and  probably the oddest thing in the Universe” ●  Like a Babel Fish for content! ●  Helps you work out what sort of thing  your content (1s & 0s) is ●  Helps you extract the metadata from it,  in a consistent way ●  Lets you get a plain text version of your  content, eg for full text indexing ●  Provides a rich (XHTML) version too
  • 19. Confidential, Copyright © Quanticate Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity (Some) Supported Formats ●  Microsoft Office – Word, Excel, PowerPoint, Works,  Publisher, Visio – Binary and OOXML formats ●  OpenDocument (OpenOffice),  ●  iWorks – Keynote, Pages, Numbers ●  HTML, XHTML, XML, PDF, RTF, Plain Text, CHM Help ●  Compression / Archive – Zip, Tar, Ar, 7z, bz2, gz etc ●  Atom, RSS, ePub            Lots of Scientific formats ●  Audio – MP3, MP4, Vorbis, Opus, Speex, MIDI, Wav ●  Image – JPEG, TIFF, PNG, BMP, GIF, ICO
  • 20. Confidential, Copyright © Quanticate Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity Ways of calling Tika ●  Tika­App – command line tool ●  Pure Java – Tika Facade – simple way from Java ●  Pure Java – Direct – full control over what happens ●  OSGi – all your dependencies nicely wrapped up ●  Forked Parser – parsing in a different JVM ●  JAX­RS Network Server – RESTful interface to Tika ●  SOLR Plugin – Tika parsing from within SOLR upload ●  Anything you want to write yourself! ●  (These are just the main ways)
  • 21. Confidential, Copyright © Quanticate Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity Tika App ●  Single runnable jar, containing all of Tika + dependencies ●  GUI mode, ideal for quick testing and demos ●  CLI mode, suitable for calling from non­Java programs ●  Detection: ­­detect ●  Metadata: ­­metadata ●  Plain Text: ­­text ●  XHTML: ­­xml ●  Information on available mimetypes, parsers etc ●  JVM startup costs means not the fastest for lots of docs
  • 22. Confidential, Copyright © Quanticate Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity JAX­RS Network Server ●  JAX­RS – JSR­311 RESTful network server ●  Powered by Apache CXF ●  The tika­server jar can be run, starts up the server ●  Content is sent via HTTP PUT calls to service URLs ●  Supports detection, metadata extraction, plain text, xhtml,  and a few other things too, plus info on parsers/formats/etc ●  See https://wiki.apache.org/tika/TikaJAXRS ●  Recommended way to call Tika from non­Java languages ●  Runs in a new JVM
  • 23. Confidential, Copyright © Quanticate Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity Tika Facade (Java) ●  Very simple way to call Tika ●  Tika facade class – org.apache.tika.Tika ●  Detection from streams, bytes, files, urls etc ●  Parsing to Strings, Readers, Plain Text or XHTML ●  By default, uses all available parsers and detectors ●  Requires that all the Tika jars, and their dependencies are  present on the classpath, and don't clash ●  Harder to check if that hasn't worked properly, eg missing  jars or jar clashes
  • 24. Confidential, Copyright © Quanticate Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity Java – Direct calls ●  Slightly more lines of code, but you get full control ●  Normally start with org.apache.tika.TikaConfig ●  From that fetch Detectors, Parsers, Mime Types list etc ●  Typically want to use AutoDetectParser to actually parse ●  If possible, wrap your content with a TikaInputStream before  parsing – create from InputStream or File ●  Still needs all jars on classpath ●  Ask DefaultParser what Parsers + mimetypes are available
  • 25. Confidential, Copyright © Quanticate Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity Java – Forked Parser ●  By default, Tika runs in the same JVM as the rest of your  code, so if Tika (or a library it depends on) breaks, so does the  rest of your JVM ●  This shouldn't happen, but some broken files can trigger  parsers to run out of memory, or loop forever ●  If you're doing internet scale things, this becomes all the  more likely to be hit eventually! ●  Forked Parser fires up a new JVM, sends over Tika +  dependencies, then calls to that JVM to do the parsing ●  If that crashed, doesn't affect your main JVM
  • 26. Confidential, Copyright © Quanticate Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity What Tika tries to do ●  Tika aims to map file format specific metadata onto a  common, consistent set, based on well known standards ●  Tika aims to provide semantically meaningful, but not  cluttered XHTML ●  It isn't supposed to be a perfect rendition of the original file  into XHTML ●  It is supposed to be simple and clean ●  eg Word Parser reports tables and style names, but not  Word­style HTML full of fonts +colours ●  eg Excel Parser returns what's there, not a CSV
  • 27. Confidential, Copyright © Quanticate Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity Extending Tika ●  Mimetypes: Tika has >1400 defined, not all with magic ●  You can define your own in the same structure, put a  org/apache/tika/mime/custom­mimetypes.xml on classpath ●  Parsers: Tika has >70 parsers available, some formats have  multiple which can be used ●  You can write your own quickly and register them for use,  see http://tika.apache.org/1.8/parser_guide.html for how ●  Basic parser is 13 lines of code, work involved depends on  what libraries you have available for your file format ●  Tika Config XML allows you control of which parser to use
  • 28. Confidential, Copyright © Quanticate Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity Embedded ResourcesEmbedded Resources
  • 29. Confidential, Copyright © Quanticate Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity Not just containers ●  Container formats like Zip contain embedded resources, the  files within them ●  Many office documents support embedded resources too ●  Microsoft Office documents can have other documents  embedded within them, eg Excel in PowerPoint ●  Most of the document formats support embedding images  within the document, and many video too ●  Where possible, Tika gives you the embedded resource, and  indicates in the XHTML where it came from ●  Up to you how / where / if to handle these resources
  • 30. Confidential, Copyright © Quanticate Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity Going wrong at Scale...Going wrong at Scale...
  • 31. Confidential, Copyright © Quanticate Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity Lots of Data is Junk ●  At scale, you're going to hit lots of edge cases ●  At scale, you're going to come across lots of junk  or corrupted documents ●  1% of a lot is still a lot... ●  Bound to find files which are unusual or corrupted  enough to be mis­identified ●  You need to plan for failures!
  • 32. Confidential, Copyright © Quanticate Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity Unusual Types ●  If you're working on a big data scale, you're bound to come  across lots of valid but unusual + unknown files ●  You're never going to be able to add support for all of them! ●  May be worth adding support for the more common  “uncommon” unsupported types ●  Which means you'll need to track something about the files  you couldn't understand ●  If Tika knows the mimetype but has no parser, just log the  mimetype ●  If mimetype unknown, maybe log first few bytes
  • 33. Confidential, Copyright © Quanticate Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity Failure at Scale ●  Tika will sometimes mis­identify something, so sometimes  the wrong parser will run and object ●  Some files will cause parsers or their underlying libraries to  do something silly, such as use lots of memory or get into  loops with lots to do ●  Some files will cause parsers or their underlying libraries to  OOM, or infinite loop, or something else bad ●  If a file fails once, will probably fail again, so blindly just re­ running that task again won't help
  • 34. Confidential, Copyright © Quanticate Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity Failure at Scale continued... ●  You'll need approaches that plan for failure ●  Consider what will happen if a file locks up your JVM, or kills  it with an OOM ●  Forked Parser may be worth using ●  Running a separate Tika Server could be good ●  Depending on work needed, could have a smaller pool of  Tika Server instances for big data code to call ●  Think about failure modes, then think about retries (or not) ●  Track common problems, report and fix them!
  • 35. Confidential, Copyright © Quanticate Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity Getting Better? Or Worse?Getting Better? Or Worse?
  • 36. Confidential, Copyright © Quanticate Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity When things go wrong
  • 37. Confidential, Copyright © Quanticate Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity When things go wrong You don't know what you can't find...
  • 38. Confidential, Copyright © Quanticate Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity What can go wrong ●  Catastrophic failures ●  Out of Memory Errors ●  Infinite Hangs ●  Memory Leaks ●  Exceptions: Null Pointer, etc. ●  Extraction with loss of fidelity ●  Missing text/metadata/attachments ●  Extra text (eg placeholder, dummy, hidden, internal etc) ●  Garbled text
  • 39. Confidential, Copyright © Quanticate Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity Did this change help? ●  Unit tests can tell you about major failures ●  Unit tests only check the things you know about though... ●  And only check a small number of files of each type ●  There is a much wider variety of files out there than in even a  decent­sized test suite ●  File type distributions are uneven, a “minor issue” for me  might be a “critical issue” for someone else! ●  You can check 10 documents by eye, you can't check 10gb  of files by eye, let alone 1tb!
  • 40. Confidential, Copyright © Quanticate Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity Did this library change help? ●  Large scale testing needed to spot these kinds of problems ●  Needs automation of running and of analysis! ●  Flag up key issues, look at those in detail ●  Exceptions and Errors are easy  (1st  one per file anyway...) ●  Metadata little harder, but sizes small ●  Attachment counts easy, contents less so ●  Finding junk text harder ●  Missing text tougher still
  • 41. Confidential, Copyright © Quanticate Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity Did this library change help? ●  Need to store output from many runs to compare ●  Common terms per file can identify problems, if it goes from  “differences” to “4NGTMKY” then something's wrong ●  Common terms can help with missing text, but not always ●  Token entropy can show garbage text, or data heavy files! ●  Too many errors to fix all of them ●  Need metrics to know which to focus on ●  Govdocs1, Common Crawl output, more datasets needed!
  • 42. Confidential, Copyright © Quanticate Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity Apache Tika at ScaleApache Tika at Scale
  • 43. Confidential, Copyright © Quanticate Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity Lots of Data is Junk ● At scale, you're going to hit lots of edge cases ● At scale, you're going to come across lots of junk  or corrupted documents ● 1% of a lot is still a lot... ● Bound to find files which are unusual or  corrupted enough to be mis­identified ● You need to plan for failures!
  • 44. Confidential, Copyright © Quanticate Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity Unusual Types ● If you're working on a big data scale, you're bound to come  across lots of valid but unusual + unknown files ● You're never going to be able to add support for all of them! ● May be worth adding support for the more common  “uncommon” unsupported types ● Which means you'll need to track something about the files  you couldn't understand ● If Tika knows the mimetype but has no parser, just log the  mimetype ● If mimetype unknown, maybe log first few bytes
  • 45. Confidential, Copyright © Quanticate Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity Failure At Scale ●  Tika will sometimes mis­identify something, so sometimes  the wrong parser will run and object ●  Some files will cause parsers or their underlying libraries to  do something silly, such as use lots of memory or get into  loops with lots to do ●  Some files will cause parsers or their underlying libraries to  OOM, or infinite loop, or something else bad ●  If a file fails once, will probably fail again, so blindly just re­ running that task again won't help ●  A library upgrade might help, so track which ones failed
  • 46. Confidential, Copyright © Quanticate Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity Failure At Scale Continued ●  You'll need approaches that plan for failure ●  Consider what will happen if a file locks up your JVM, or kills  it with an OOM ●  Forked Parser may be worth using ●  Running a separate Tika Server could be good ●  Depending on work needed, could have a smaller pool of  Tika Server instances for big data code to call ●  Think about failure modes, then think about retries (or not) ●  Track common problems, report and fix them!
  • 47. Confidential, Copyright © Quanticate Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity Tika Eval ●  Runs Tika against a corpus of documents, or compares the  results of running two different versions on those documents ●  Stack traces and exception counts ●  File id, language id ●  Attachment and metadata counts ●  Top 10 most common words ●  Content length ●  Token length statistics, Token entropy ●  Gives reports on key things for a human to check
  • 48. Confidential, Copyright © Quanticate Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity Tika Batch ●  Tools for running Tika across lots of documents, where some   of them are going to fail ●  Tika Batch App is complete, used by current Eval work ●  WIP: Tika Batch Hadoop and Tika Batch Storm ●  http://wiki.apache.org/tika/TikaInHadoop has lots of pointers  on how to handle Tika on Hadoop right now ●  Related: https://github.com/DigitalPebble/behemoth/
  • 49. Confidential, Copyright © Quanticate Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity Other Projects to know aboutOther Projects to know about
  • 50. Confidential, Copyright © Quanticate Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity Apache Projects ●  Nutch ●  Storm Crawler ●  Any23 ●  OpenNLP ●  UIMA ●  CTAKES ●  Any from the audience?
  • 51. Confidential, Copyright © Quanticate Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity External Projects ●  Hardened Tika ●  Behemoth, and Behemoth extensions to Tika ●  Google Chrome “Compact Language Detector” (CLD) ●  Mozilla's LibCharsetDetect ●  ICU4J ●  Any from the audience?
  • 52. Confidential, Copyright © Quanticate Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity Any Questions?Any Questions?
  • 53. Confidential, Copyright © Quanticate Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance Our Values: Excellence • Customer Focus • Team Work • Passion • Integrity Nick BurchNick Burch @Gagravarr@Gagravarr