Big Data & Text Mining: Finding Nuggets in Mountains of Textual Data
A large amount of information is available in textual form in databases or online sources, and for many enterprise functions (marketing, maintenance, finance, etc.) it represents a huge opportunity to improve their business knowledge. For example, text mining is starting to be used in marketing, more specifically in analytical customer relationship management, to achieve the holy grail of a 360° view of the customer (integrating elements from inbound mails, web comments, surveys, internal notes, etc.).
Facing this new domain, I did some personal research and wrote up a synthesis, which helped me clarify some ideas. The presentation below does not intend to be exhaustive on the subject, but may bring you some useful insights.
2. Information context
A large amount of information is available in textual form in databases and online sources.
In this context, manual analysis and effective extraction of useful information are not possible.
Automatic tools are therefore needed to analyze large textual collections.
www.decideo.fr/bruley
3. Text mining definition
The objective of text mining is to exploit information contained in textual documents in various ways, including … discovery of patterns and trends in data, associations among entities, predictive rules, etc.
The results can be important both for:
– the analysis of the collection, and
– providing intelligent navigation and browsing methods
4. Text mining pipeline
Unstructured text (implicit knowledge)
Information retrieval
Information extraction
Knowledge discovery
Structured content (explicit knowledge)
Semantic search / Data mining
Semantic metadata
5. Text mining process
Text preprocessing
– Syntactic/semantic text analysis
Feature generation
– Bag of words
Feature selection
– Simple counting
– Statistics
Text/data mining
– Classification (supervised learning)
– Clustering (unsupervised learning)
Analyzing results
– Mapping/visualization
– Result interpretation
The process is iterative and interactive.
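The first three steps of this process (preprocessing, bag-of-words feature generation, and feature selection by simple counting) can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the sample documents, and the `min_docs` threshold are assumptions made for the example, not part of the original presentation.

```python
import re
from collections import Counter

def tokenize(text):
    """Text preprocessing: lowercase the text and keep alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def bag_of_words(documents):
    """Feature generation: one word-count Counter per document."""
    return [Counter(tokenize(doc)) for doc in documents]

def select_features(bags, min_docs=2):
    """Feature selection by simple counting: keep words that occur
    in at least `min_docs` documents (hypothetical threshold)."""
    doc_freq = Counter()
    for bag in bags:
        doc_freq.update(bag.keys())
    return sorted(word for word, n in doc_freq.items() if n >= min_docs)

def vectorize(bags, vocabulary):
    """Turn each bag into a count vector over the selected vocabulary."""
    return [[bag[word] for word in vocabulary] for bag in bags]

# Made-up sample documents for illustration only.
docs = [
    "The customer sent a complaint by mail.",
    "The customer praised the service in a web comment.",
    "Internal notes mention the complaint.",
]
bags = bag_of_words(docs)
vocab = select_features(bags, min_docs=2)
vectors = vectorize(bags, vocab)
print(vocab)
print(vectors)
```

The resulting count vectors are the kind of structured input that a classification or clustering step could then consume.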
6. Text mining actors
Publishers
Enriched content
Annotation tools
Tools for authors
New applications based on annotation layers
Richer cross linking based on content…
Analysts
Empowers them
Annotating research output
Hypothesis generation
Summarisation of findings
Focused semantic search…
Libraries
Linking between Institutional repositories
Access to richer metadata
Aggregation
Aids to subject analysis/classification …
7. Challenges in text mining
The collected data is “free text” and is not well organized (semi-structured or unstructured).
There is no uniform access across sources; each source has separate storage and algebra (examples: email, databases, applications, web).
A quintuple heterogeneity: semantic, linguistic, structure, format, size of unit information.
Learning techniques for processing text typically need annotated training data.
XML as the common model; it allows:
– Manipulating data with standards
– Mining becomes more like data mining
– RDF is emerging as a complementary model
The more structure you can explore, the better the mining you can do.
8. Data source administration
Intranet
– File system
– Databases
– EDMS
Internet
– Web crawling
– On-line databank
XML normalisation
– Subject
– Author
– Text corpora
– Keywords
Information provider
Format filter
9. Text mining tasks
Name Extractions
Term Extraction
Feature extraction
Categorization
Text Analysis Tools
Abbreviation Extraction
Relationship Extraction
Summarization
Clustering
Hierarchical Clustering
Binary relational Clustering
TM
Text search engine
Web Searching Tools
NetQuestion Solution
Web Crawler
10. Information extraction
Keyword Ranking
Link Analysis
Query Log Analysis
Metadata Extraction
Intelligent Match
Duplicate Elimination
Extract domain-specific information from natural language text
– Need a dictionary of extraction patterns (e.g., “traveled to <x>” or “presidents of <x>”)
• Constructed by hand
• Automatically learned from hand-annotated training data
– Need a semantic lexicon (dictionary of words with semantic category labels)
• Typically constructed by hand
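As an illustration of the two hand-built resources described above, the following Python sketch applies a small dictionary of extraction patterns and looks each extracted value up in a semantic lexicon. The slot rule (a run of capitalized words), the lexicon entries, and the sample sentence are assumptions made for this example; real systems use much richer patterns.

```python
import re

# Hypothetical hand-built dictionary of extraction patterns; <x> marks the slot.
EXTRACTION_PATTERNS = ["traveled to <x>", "presidents of <x>"]

# Hypothetical hand-built semantic lexicon: phrase -> semantic category label.
SEMANTIC_LEXICON = {"New York": "location", "France": "location"}

# Assumption: a slot value is a run of capitalized words (a crude proper-noun test).
SLOT = r"([A-Z][\w-]*(?:\s+[A-Z][\w-]*)*)"

def compile_pattern(template):
    """Turn a template such as 'traveled to <x>' into a compiled regex."""
    before, after = template.split("<x>")
    return re.compile(re.escape(before) + SLOT + re.escape(after))

def extract(text):
    """Return (pattern, extracted value, semantic category or None) triples."""
    results = []
    for template in EXTRACTION_PATTERNS:
        for match in compile_pattern(template).finditer(text):
            value = match.group(1)
            results.append((template, value, SEMANTIC_LEXICON.get(value)))
    return results

# Made-up sample sentence for illustration only.
text = "The delegation traveled to New York in May. Former presidents of France attended."
results = extract(text)
for row in results:
    print(row)
```

Values that are missing from the lexicon come back with a category of `None`, which is one simple way to flag candidates for manual review.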
13. Aster Data position for Text Analysis
Data Acquisition
Gather text from relevant sources (web crawling, document scanning, news feeds, Twitter feeds, …)
Pre-Processing
Perform processing required to transform and store text data and information (stemming, parsing, indexing, entity extraction, …)
Mining
Apply data mining techniques to derive insights about stored information (statistical analysis, classification, natural language processing, …)
Analytic Applications
Leverage insights from text mining to provide information that improves decisions and processes (sentiment analysis, document management, fraud analysis, e-discovery, ...)
Aster Data Fit
Third-Party Tools Fit
Aster Data Value: massive scalability of text storage and processing, functions for text processing, flexibility to develop diverse custom analytics and incorporate third-party libraries
14. Aster Data Value for Text Analytics
• Ability to store and process massive volumes of text data
– Massively parallel data stores and massively parallel analytics engine
– SQL-MapReduce framework enables in-database processing for specialized text analytics tools
• Tools and extensibility for processing diverse text data
– SQL-MapReduce framework enables loading and transforming diverse sources and types of text data
– Pre-built functions for text processing
• Flexible platform for building and processing diverse analytics
– SQL-MapReduce framework enables creation of flexible, reusable analytics
– Embedded MapReduce processing engine for high-performance analytics
15. Aster Data Capabilities for Text Data
Pre-built SQL-MapReduce functions for text processing
• Data transformation utilities
– Pack: compress multi-column data into a single column
– Unpack: extract nested data for further analysis
• Web log analysis
– Sessionization: identify unique browsing sessions in clickstream data
• Text analysis
– Text parser: general tool for tokenizing, stemming, and counting text data
– nGram: split text into component parts (words & phrases)
– Levenshtein distance: compute “distance” between words
[Diagram: Aster Data nCluster. Labels: Custom and Packaged Analytics (App), Aster Data Analytic Foundation, SQL-MapReduce, SQL, Data]
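To make two of the text analysis functions listed on this slide concrete, here is a plain Python sketch of word n-grams and of the Levenshtein (edit) distance. It is not the Aster Data SQL-MapReduce implementation, whose interface is not shown in the slides; the function names and the whitespace tokenization are assumptions made for the example.

```python
def ngrams(text, n):
    """Split text into overlapping word n-grams (whitespace tokenization assumed)."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]

bigrams = ngrams("Text mining finds nuggets", 2)
distance = levenshtein("kitten", "sitting")
print(bigrams)
print(distance)
```

A small edit distance is a common way to spot misspelled variants of the same word, for example before counting terms.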
Editor's notes
Input Data System:
This part of the system handles data collection:
- Getting data from the internet with a crawler
- Getting data from online vendors
- Getting data from the internal data banks
Regarding the input format (physical and logical), data are physically reformatted into HTML and then loaded into an XML format.
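The HTML-to-XML step described in this note could look roughly like the following Python sketch, which uses only the standard library. The element names (`document`, `subject`, `author`, `text`, `keywords`, `keyword`) are hypothetical; they mirror the fields shown on the data source slide but are not a schema given in the presentation, and the sample page is made up.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the visible text of an HTML page, ignoring the tags."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        if data.strip():
            self.parts.append(data.strip())

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)

def normalize_document(subject, author, html, keywords):
    """Build one XML record (hypothetical schema) from an HTML source."""
    doc = ET.Element("document")
    ET.SubElement(doc, "subject").text = subject
    ET.SubElement(doc, "author").text = author
    ET.SubElement(doc, "text").text = html_to_text(html)
    keywords_node = ET.SubElement(doc, "keywords")
    for keyword in keywords:
        ET.SubElement(keywords_node, "keyword").text = keyword
    return doc

# Made-up sample page for illustration only.
page = "<html><body><h1>Complaint</h1><p>The device stopped working &amp; support did not answer.</p></body></html>"
record = normalize_document("maintenance", "J. Doe", page, ["device", "support"])
xml_string = ET.tostring(record, encoding="unicode")
print(xml_string)
```

Once every source is stored in one such format, the later mining steps can address all documents through the same fields.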
1. Feature extraction tools
Feature extraction recognizes significant vocabulary items in documents and measures their importance to the document content.
2. Clustering tools
Clustering is used to segment a document collection into subsets, called clusters.
3. Summarization tool
Summarization is the process of condensing a source text into a shorter version while preserving its information content.
4. Categorization tool
Categorization is used to assign objects to predefined categories, or classes from a taxonomy.
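The notes do not say how a feature extraction tool measures the importance of a vocabulary item. One common weighting is TF-IDF (term frequency times inverse document frequency), sketched below in Python as an assumption, not as the method used by any tool named in the slides. The function names and sample documents are made up for the example.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def tf_idf(documents):
    """Return one {term: weight} dict per document.
    weight = (term count / document length) * ln(N / document frequency)."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each term once per document
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

# Made-up sample documents for illustration only.
docs = [
    "customer reports engine failure after maintenance",
    "customer reports late delivery",
    "customer praises fast delivery",
]
scores = tf_idf(docs)
for term, weight in sorted(scores[0].items(), key=lambda item: -item[1]):
    print(f"{term}: {weight:.3f}")
```

A term that appears in every document, such as "customer" here, gets a weight of zero, while terms specific to one document rank highest.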