Bionic Info Pro: Machine Learning, Taxonomies and Big Data

Bionic Info Pro:
New Takes on an Old Theme
Machine Learning, Taxonomy Creation, Big Data,
Competitive Intelligence, and the Human Element
Elaine M. Lasda Bergman
Annual Conference
Special Libraries Association
Vancouver, BC, Canada
Monday, June 9, 2014

Overview
• A little bit about Machine Learning
• A little bit about Taxonomies
• A little bit about Big Data
• A little bit about Hybrid Techniques

NOT NEW:
Machine Learning for CI
Mena, Jesus. (1996). Data Mining for
Competitive Intelligence, Competitive
Intelligence Review, 7(4):18-25.

Refinement of Machine Learning
• Decision Trees/Classification
• Clustering
• Anomaly Detection

Refinement of Machine Learning
• Support Vector Machines
– Predictive Classification
• Association Rules
– Market basket analysis
• Natural Language Processing
– Sentiment Analysis
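
To make the "Association Rules / Market basket analysis" bullet concrete, here is a minimal sketch of the two measures such rules usually rest on: support (how often items occur together) and confidence (how often the consequent appears when the antecedent does). The transactions, thresholds, and function names are invented for illustration and do not come from the talk.

```python
from itertools import combinations

# Hypothetical shopping baskets, invented for illustration only.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

def pair_rules(transactions, min_support=0.4, min_confidence=0.6):
    """List single-item rules x -> y that pass both (assumed) thresholds."""
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in combinations(items, 2):
        s = support({a, b}, transactions)
        if s < min_support:
            continue  # the pair is too rare to be worth reporting
        for x, y in ((a, b), (b, a)):
            c = confidence({x}, {y}, transactions)
            if c >= min_confidence:
                rules.append((x, y, s, c))
    return rules

for x, y, s, c in pair_rules(transactions):
    print(f"{x} -> {y}: support={s:.2f}, confidence={c:.2f}")
```

Production tools use smarter search (for example, the Apriori algorithm) instead of checking every pair, but the measures being thresholded are the same.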

Getting up to Speed
• http://efytimes.com
• 6 Video Tutorials and Playlists on
Machine Learning (January 2014)

NOT NEW: Taxonomies in
Information Retrieval
http://comsaad.blogspot.com/p/old-computer-photos.html
http://commons.wikimedia.org/wiki/File:A_Library_Primer_illustration_Joined_Hand.jpg

Need for Taxonomic Structures
http://farm9.staticflickr.com/8262/8673326413_4492b5dc68_o.jpg

NOT NEW: Datasets
http://www.conceptdraw.com/solution-park/resource/images/solutions/entity-relationship-diagram-(erd)/Diagramming-Crow's-Foot-ERD-Sample60.png

Enter BIG DATA
http://commons.wikimedia.org/wiki/File:DARPA_Big_Data.jpg

Big Data Sources and Analysis
Data Type | Qualities | Analysis Tools | Result
Social Media | Demographics | API integration | More profiles of like-minded users
“Social Influencers” | User Reviews | NLP, Text Analysis | Sentiment readings
“Internet of Things” | Logs/Sensors/Check-Ins | Parsing | Usage and behavior patterns
SaaS | Cloud/Web-based/Subscription software | Dist. data integration/in-memory caching technology/API integration | Usage behavior patterns, customer data, etc.
Public Data | e.g., Amazon Data Market, WorldBank, Wikipedia | All above (depends on data structure) | Depends on Dataset (and there are LOTS of them!)
Hadoop/MapReduce | Volume! | Parallel Processing/Parsing/Reduction | Big patterns, correlations, needles in haystacks
Data Warehouses | Internal transactional data | Likely same as above | Correlations, market basket, etc.
NoSQL/Columnar | Volume! | Fills gaps in Parallel processing tools | Real time activity and patterns
In-Stream Monitoring | Network traffic (streaming videos, system outages) | Packet evaluation, distributed query processing | Network/Stream usage patterns
Legacy Data | Usually PDFs & Documents/SemiStructured | Transformation tools (e.g., Xenos d2e) + above | Depends on content (could be all)
http://www.zdnet.com/top-10-categories-for-big-data-sources-and-mining-technologies-7000000926/

Why “Concept Hierarchies” in
an Unstructured Environment?

Advantages
• When a term sits too low in the hierarchy
to appear in frequent itemsets or rules
• Create more interesting rules using
more general, aggregated concepts
[DVD, wheat bread, home electronics,
electronics, food]
Kumar, T.S. (2005) Introduction to Data Science
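
The advantage above can be sketched in a few lines: each transaction is extended with the more general concepts of its items, so a combination that is too rare at the item level can become frequent at the concept level. The `PARENT` mapping reuses the terms from the slide's bracketed list; the extra items (`laptop`, `skim milk`) and all of the transactions are hypothetical, added only so the numbers have something to work with.

```python
# Hypothetical child -> parent concept mapping. "DVD", "wheat bread",
# "home electronics", "electronics" and "food" come from the slide;
# "laptop" and "skim milk" are invented for illustration.
PARENT = {
    "DVD": "home electronics",
    "home electronics": "electronics",
    "laptop": "electronics",
    "wheat bread": "food",
    "skim milk": "food",
}

def ancestors(item):
    """Return the chain of more general concepts above item."""
    chain = []
    while item in PARENT:
        item = PARENT[item]
        chain.append(item)
    return chain

def extend(transaction):
    """Add every ancestor concept of every item to the transaction."""
    extended = set(transaction)
    for item in transaction:
        extended.update(ancestors(item))
    return extended

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

# Hypothetical baskets, invented for illustration only.
transactions = [
    {"DVD", "wheat bread"},
    {"laptop", "skim milk"},
    {"DVD", "skim milk"},
    {"laptop", "wheat bread"},
    {"wheat bread"},
]
extended = [extend(t) for t in transactions]

print(sorted(extended[0]))
print("item level:   ", support({"DVD", "wheat bread"}, transactions))
print("concept level:", support({"electronics", "food"}, extended))
```

In this toy data the specific pair appears in one basket out of five, while the general pair (electronics, food) appears in four, so only the general rule would survive a typical support threshold. The extended transactions are also longer, which is where the increased computation time comes from.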

Disadvantages
• How low and how high in the hierarchy
do you set the threshold?
• Increased computation time
• If the threshold is too high, redundant rules
for more specific terms can be
summarized by rules using more
general terms

Hybrid Taxonomic Development
• Understand your auto-classification
model
• Work with domain experts to create
basic taxonomy
• Test Taxonomy in the Model
• Rinse, repeat
Wendy Pohs, ASIS&T Bulletin, 12/1/13

Domain Knowledge
and Thick Data
• Thick Data analysis primarily relies on human brain power to
process a small “N” while big data analysis requires
computational power (of course with humans writing the
algorithms) to process a large “N”.
• Big Data reveals insights with a particular range of data
points, while Thick Data reveals the social context of and
connections between data points. Big Data delivers numbers;
thick data delivers stories. Big data relies on machine
learning; thick data relies on human learning.
http://ethnographymatters.net/blog/2013/05/13/big-data-needs-thick-data/ (Tricia Wang)

Data Driven CI is Meaningless
Without Human/Domain
Knowledge
http://www.wired.com/2014/04/your-big-data-is-worthless-if-you-dont-bring-it-into-the-real-world/

Recap
• Data Mining for CI is not new
• Refinement and Improvement
• Bigger, Weirder Data

Recap
• Where it’s at: Hybrid Schemas
• Thick Data, not just Big Data
• HUMAN ELEMENT IS ESSENTIAL

Questions?
Elaine Lasda Bergman
University at Albany
http://www.slideshare.net/librarian68
elasdabergman@albany.edu
@ElaineLibrarian
