Session Description: Understanding and accessing large volumes of content often requires a multi-faceted approach that goes well beyond simple batch processing jobs. In many cases, one needs both ad hoc, real-time access to the content and the ability to discover interesting information based on a variety of features such as recommendations, summaries and other insights. In this talk, we'll discuss real-world use cases across several industries as well as how to effectively leverage open source tools like Hadoop, Solr, Mahout and others to better enable user access to big data.
How do you gain insight?
- The search box is the UI for data these days
- Feed improvements back into the system for users
- Extract key metrics for business understanding
Challenges
- Many of these are intense or iterative calculations
- Many are subjective and require a lot of experimentation
Single nucleotide polymorphisms (SNPs) are used as markers in linkage and association studies to detect which regions in the human genome may be involved in disease.
Single nucleotide polymorphism (SNP) studies and random mutagenesis projects identify amino acid substitutions in protein-coding regions. Each substitution has the potential to affect protein function. SIFT (Sorting Intolerant From Tolerant) is a program that predicts whether an amino acid substitution affects protein function so that users can prioritize substitutions for further study.
Make into images?
- Search
- Storage and processing
- Experiment management
- Tools
- NLP
- Statistical analysis
- Scalable
- Low cost
- Production monitoring
- Provisioning
- Bulk and near real-time
- Handle volume in sub-second processing
- Solr takes care of leader election, etc., so no more master/slave
- 1 second (default) soft commits for NRT updates
- 1 minute (default) hard commits (no searcher reopen)
- Transaction logs for recovery
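
The commit settings above are automatic intervals, but a client can also ask for the same kinds of commits explicitly on an update request. The following is a minimal sketch of that, not the configuration used in the talk. It assumes a hypothetical Solr node at http://localhost:8983/solr and a hypothetical collection named docs, and the helper name solr_update_url is made up for illustration. It only builds the request URL and sends nothing. The parameters softCommit, commit, openSearcher and commitWithin are standard Solr update request parameters.

```python
from urllib.parse import urlencode


def solr_update_url(base_url, collection, *, soft_commit=False,
                    hard_commit=False, open_searcher=True,
                    commit_within_ms=None):
    """Build a Solr /update URL carrying explicit commit parameters.

    Hypothetical helper for illustration; it does not contact Solr.
    """
    params = {}
    if soft_commit:
        # Soft commit: makes recent updates searchable (near real-time)
        # without flushing index segments to stable storage.
        params["softCommit"] = "true"
    if hard_commit:
        # Hard commit: flushes the index to stable storage.
        params["commit"] = "true"
        if not open_searcher:
            # Skip reopening the searcher, as in "no searcher reopen".
            params["openSearcher"] = "false"
    if commit_within_ms is not None:
        # Ask Solr to commit the update within this many milliseconds.
        params["commitWithin"] = str(commit_within_ms)

    url = f"{base_url.rstrip('/')}/{collection}/update"
    return f"{url}?{urlencode(params)}" if params else url


if __name__ == "__main__":
    base = "http://localhost:8983/solr"  # assumed local node
    print(solr_update_url(base, "docs", soft_commit=True))
    print(solr_update_url(base, "docs", hard_commit=True, open_searcher=False))
    print(solr_update_url(base, "docs", commit_within_ms=1000))
```

In practice the automatic intervals usually make explicit commits unnecessary; a sketch like this is mainly useful in tests or bulk loads where the caller wants to control when documents become visible.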