Bionic Info Pro: Machine Learning, Taxonomies and Big Data
1. Bionic Info Pro:
New Takes on an Old Theme
Machine Learning, Taxonomy Creation, Big Data,
Competitive Intelligence, and the Human Element
Elaine M. Lasda Bergman
Annual Conference
Special Libraries Association
Vancouver, BC, Canada
Monday, June 9, 2014
2. Overview
• A little bit about Machine Learning
• A little bit about Taxonomies
• A little bit about Big Data
• A little bit about Hybrid Techniques
3. NOT NEW:
Machine Learning for CI
Mena, Jesus. (1996). Data Mining for
Competitive Intelligence, Competitive
Intelligence Review, 7(4):18-25.
5. Refinement of Machine Learning
• Support Vector Machines
– Predictive Classification
• Association Rules
– Market basket analysis
• Natural Language Processing
– Sentiment Analysis
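To make the market basket bullet concrete, here is a minimal sketch of the two measures usually attached to an association rule, support and confidence. The transactions and item names are invented for illustration and do not come from the talk.

```python
# Hypothetical toy transactions; the item names are invented for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for b in baskets if itemset <= b) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Support of antecedent and consequent together, divided by support of the antecedent.

    Assumes the antecedent occurs in at least one basket.
    """
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)

# Evaluate the candidate rule "bread -> milk".
rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
print(f"bread -> milk: support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

A rule is usually kept only if both numbers clear thresholds chosen by the analyst, which is one place where domain knowledge enters.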
6. Getting up to Speed
• http://efytimes.com
• 6 Video Tutorials and Playlists on
Machine Learning (January 2014)
7. NOT NEW: Taxonomies in
Information Retrieval
http://comsaad.blogspot.com/p/old-computer-photos.html
http://commons.wikimedia.org/wiki/File:A_Library_Primer_illustration_Joined_Hand.jpg
8. Need for Taxonomic Structures
http://farm9.staticflickr.com/8262/8673326413_4492b5dc68_o.jpg
11. Big Data Sources and Analysis
Data Type | Qualities | Analysis Tools | Result
Social Media | Demographics | API integration | More profiles of like-minded users
“Social Influencers” | User Reviews | NLP, Text Analysis | Sentiment readings
“Internet of Things” | Logs/Sensors/Check-Ins | Parsing | Usage and behavior patterns
SaaS | Cloud/Web-based/Subscription software | Dist. data integration/in-memory caching technology/API integration | Usage behavior patterns, customer data, etc.
Public Data | e.g., Amazon Data Market, WorldBank, Wikipedia | All above (depends on data structure) | Depends on Dataset (and there are LOTS of them!)
Hadoop/MapReduce | Volume! | Parallel Processing/Parsing/Reduction | Big patterns, correlations, needles in haystacks
Data Warehouses | Internal transactional data | Likely same as above | Correlations, market basket, etc.
NoSQL/Columnar | Volume! | Fills gaps in Parallel processing tools | Real time activity and patterns
In-Stream Monitoring | Network traffic (streaming videos, system outages) | Packet evaluation, distributed query processing | Network/Stream usage patterns
Legacy Data | Usually PDFs & Documents/SemiStructured | Transformation tools (e.g., Xenos d2e) + above | Depends on content (could be all)
http://www.zdnet.com/top-10-categories-for-big-data-sources-and-mining-technologies-7000000926/
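The Hadoop/MapReduce row names parallel processing, parsing, and reduction. The sketch below shows only the shape of that map-then-reduce pattern, in a single Python process. It does not use Hadoop, and the log lines are hypothetical.

```python
from collections import Counter
from functools import reduce

def map_phase(line):
    """Map step: parse one log line into partial word counts."""
    return Counter(line.lower().split())

def reduce_phase(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right

# Hypothetical check-in/sensor log lines, invented for illustration.
log_lines = [
    "checkin store42",
    "sensor store42",
    "checkin store17",
]

# In a real cluster the map calls would run on many machines at once;
# here they run one after another.
totals = reduce(reduce_phase, map(map_phase, log_lines), Counter())
print(totals.most_common(2))
```

Because each line is mapped independently and the merge step is associative, the same logic can be spread across many workers, which is what makes the approach suited to volume.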
13. Advantages
• When a specific term is too infrequent to appear in
frequent itemsets/rule sets
• Create more interesting rules using
more general, aggregated concepts
[DVD, wheat bread, home electronics,
electronics, food]
Kumar, T.S. (2005) Introduction to Data Science
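A minimal sketch of the first advantage: rolling specific items up to more general taxonomy concepts can lift a pattern above a support threshold it would otherwise miss. "DVD", "wheat bread", "home electronics", and "food" come from the slide; "headphones", "skim milk", the baskets, and the parent mapping are assumptions made for illustration.

```python
# Hypothetical two-level taxonomy: specific item -> general concept.
parent = {
    "DVD": "home electronics",
    "headphones": "home electronics",
    "wheat bread": "food",
    "skim milk": "food",
}

# Hypothetical baskets, invented for illustration.
baskets = [
    {"DVD", "wheat bread"},
    {"headphones", "skim milk"},
    {"DVD", "skim milk"},
    {"headphones", "wheat bread"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for b in baskets if itemset <= b) / len(baskets)

def generalize(basket, parent):
    """Replace each item with its parent concept; items without a parent are kept."""
    return {parent.get(item, item) for item in basket}

general_baskets = [generalize(b, parent) for b in baskets]

specific = support({"DVD", "wheat bread"}, baskets)
general = support({"home electronics", "food"}, general_baskets)
print(f"specific pair: {specific:.2f}, generalized pair: {general:.2f}")
```

Here no single pair of specific items is frequent, yet every basket combines home electronics with food, so only the generalized rule would survive a support cutoff.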
14. Disadvantages
• How low and how high in the hierarchy
do you set the threshold?
• Increased computation time
• If threshold is too high, redundant rules
for more specific terms can be
summarized by rules using more
general terms
15. Hybrid Taxonomic Development
• Understand your auto-classification
model
• Work with domain experts to create
basic taxonomy
• Test Taxonomy in the Model
• Rinse, repeat
Wendy Pohs, ASIS&T Bulletin 12/1/13
16. Domain Knowledge
and Thick Data
• Thick Data analysis primarily relies on human brain power to
process a small “N” while big data analysis requires
computational power (of course with humans writing the
algorithms) to process a large “N”.
• Big Data reveals insights with a particular range of data
points, while Thick Data reveals the social context of and
connections between data points. Big Data delivers numbers;
thick data delivers stories. Big data relies on machine
learning; thick data relies on human learning.
http://ethnographymatters.net/blog/2013/05/13/big-data-needs-thick-data/ (Tricia Wang)
17. Data Driven CI is Meaningless
Without Human/Domain
Knowledge
http://www.wired.com/2014/04/your-big-data-is-worthless-if-you-dont-bring-it-into-the-real-world/
18. Recap
• Data Mining for CI is not new
• Refinement and Improvement
• Bigger, Weirder Data
19. Recap
• Where it’s at: Hybrid Schemas
• Thick Data, not just Big Data
• HUMAN ELEMENT IS ESSENTIAL
“automatic discovery of patterns using software to analyze vast amounts of records in a database”
What else was going on in tech in 1996?
The 1996 article mentioned transactional data, “all the rage”
Marketing,
Inventory,
Risk mitigation
Efficiency and waste
Allow us to formulate solutions in English
“Library Hand” – we’ve been doing indexing, taxonomies, and classification since the beginning of our profession
Machine-created taxonomies are not new: text mining, extraction, and indexing have been automated since the 1960s. The earliest I could find was a paper published by the RAND Corporation in 1961.
Wider need for classification: Building Enterprise Taxonomies, Stewart
The pendulum – “searching” versus “browsing” paradigms
Search = lack of context, precision versus recall, relevancy ranking, choice of terminology
Proper syntax for each search tool, where to search? Spelling variants, bad labels
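The precision versus recall tension mentioned in these notes can be stated numerically. This is a minimal sketch with hypothetical document IDs: precision is the share of retrieved items that are relevant, and recall is the share of relevant items that were retrieved.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one search result set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical document IDs, invented for illustration.
retrieved = ["d1", "d2", "d3", "d4"]       # what the search tool returned
relevant = ["d2", "d4", "d5", "d6", "d7"]  # what the user actually needed

precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

A broader query tends to raise recall and lower precision, and a narrower one does the reverse. Browsing a taxonomy is one way to sidestep the choice-of-terminology problem that drives this trade-off.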
Where do we find taxonomies and ontologies today? Here are some of their natural habitats
Web sites
Discipline/Domain Classification
Machine Learning Algorithms
Training dataset and a testing dataset.
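As a rough illustration of the training/testing idea, the sketch below holds back part of a labeled collection so the model can be checked on records it has not seen. The documents, the 25% test fraction, and the fixed seed are all assumptions chosen for the example.

```python
import random

def train_test_split(records, test_fraction=0.25, seed=0):
    """Shuffle a copy of `records` and split it into (training, testing) lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical labeled documents: (text, category).
documents = [(f"document {i}", "finance" if i % 2 else "legal") for i in range(20)]

training, testing = train_test_split(documents)
print(len(training), "training records;", len(testing), "testing records")
```

The classifier is fitted on the training list only. Its accuracy on the testing list is what a human reviewer would inspect before trusting the resulting categories.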
As Heather Hedden points out in her book The Accidental Taxonomist, the efficacy of machine-created taxonomies improves dramatically with human quality control
Relational DBs – ENTITY RELATIONSHIP
Legacy systems
Hierarchical models
Network models
The diagram for a relational database is in rows and columns.
Classes, variables, attributes, qualities, fields; observations, instances, records, cases
NoSQL
Multimedia
Unstructured
Andrew Brust
“Bigger data means weirder data” (Jeffrey Stanton, in his Intro to Data Science book)
Big Data: A Revolution That Will Transform How We Live, Work, and Think
Weed out data noise
Algorithms can be programmed with human quality control to account for redundancy and to catch inconsistencies and differing terms
http://it.toolbox.com/blogs/irm-blog/the-benefits-of-a-data-taxonomy-4916
https://www.earley.com/blog/why-taxonomy-critical-master-data-management-mdm
Autoclassification model:
Linguistic/lexical: gather and rank representative words and phrases that are associated with the concepts to be classified;
Rules Based: there is no common syntax for developing rules; it varies by tool. Rules syntax can range from Boolean to the more complex syntax commonly used in programming languages. Because of this lack of consistency, the people who create and maintain these rules need a more specialized skill set and more training.
Machine Learning/Predictive: These systems rely on iteration for continuous validation. A traditional hierarchical taxonomy may not be needed; reference terms or document sets can serve as the model. Maintaining a machine learning system means repeated training, especially when you add new content. You will also help revise the larger machine-learning model as you learn more about your content.
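To give a feel for the rules-based approach, here is a minimal sketch of a Boolean autoclassifier. The rule format (all_of / any_of / none_of), the category names, and the terms are hypothetical. Real tools each have their own syntax, as noted above.

```python
import re

# Hypothetical Boolean rules: a document matches a category when it contains
# every "all_of" term, at least one "any_of" term (if any are listed),
# and no "none_of" term.
RULES = {
    "competitive intelligence": {
        "all_of": {"competitor"},
        "any_of": {"pricing", "market"},
        "none_of": {"recipe"},
    },
    "taxonomy": {
        "all_of": {"taxonomy"},
        "any_of": set(),
        "none_of": set(),
    },
}

def tokenize(text):
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))

def classify(text, rules):
    """Return every category whose Boolean rule the text satisfies."""
    tokens = tokenize(text)
    matched = []
    for category, rule in rules.items():
        has_all = rule["all_of"] <= tokens
        has_any = not rule["any_of"] or bool(rule["any_of"] & tokens)
        has_none = not (rule["none_of"] & tokens)
        if has_all and has_any and has_none:
            matched.append(category)
    return matched

print(classify("Competitor pricing shifted in the market", RULES))
```

Every rule here was written by hand, which is why this approach depends on people who know both the content and the tool's syntax. The machine learning approach replaces the hand-written rules with patterns learned from example documents.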
Examples of Domain Knowledge
-Big Data revolution book – building inspectors needed to predict which buildings should have priority inspections
-Web design for user-generated content – automatically categorizes user-driven content, but the taxonomy is refined by humans
As the taxonomy is refined, the autoclassifier improves (“gets smarter”)
We as knowledge experts fill in the gaps!
We can be facilitators with those in the field/analysts and those programming the algorithms
Example of meaningless data: Google Flu Trends
Scientific controlled experiments limit external sources; domain knowledge fills in the gaps in real-world data analysis
http://www.wired.com/2014/04/your-big-data-is-worthless-if-you-dont-bring-it-into-the-real-world/