Text mining
michel.bruley@teradata.com

Extract from various presentations: Temis, URI-INIST-CNRS, Aster Data …
www.decideo.fr/bruley
Information context
A large amount of information is available in textual form in databases and online sources
In this context, manual analysis and effective extraction of useful information are not feasible
Automatic tools are therefore needed for analyzing large textual collections
Text mining definition
The objective of Text Mining is to exploit
information contained in textual documents
in various ways, including … discovery of
patterns and trends in data, associations
among entities, predictive rules, etc.
The results can be important both for:
– the analysis of the collection, and
– providing intelligent navigation and browsing methods
Text mining pipeline
Unstructured text (implicit knowledge)
Information retrieval
Information extraction
Knowledge discovery
Structured content (explicit knowledge)
Diagram labels: Semantic search / Data mining; Semantic metadata
Text mining process
Text preprocessing
– Syntactic/semantic text analysis
Features generation
– Bag of words
Features selection
– Simple counting
– Statistics
Text/data mining
– Classification (supervised learning)
– Clustering (unsupervised learning)
Analyzing results
– Mapping/visualization
– Result interpretation
Iterative and interactive process
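
The feature generation and selection steps above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any tool named in these slides: the function names, the `min_doc_count` threshold, and the sample documents are all assumptions made for the example. It builds a bag of words per document, then keeps only the terms that appear in at least two documents (a simple counting criterion).

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(docs):
    """Features generation: one term-frequency Counter per document."""
    return [Counter(tokenize(doc)) for doc in docs]


def select_features(bags, min_doc_count=2):
    """Features selection by simple counting.

    Keeps the terms found in at least `min_doc_count` documents
    (the threshold is a hypothetical choice for this sketch).
    """
    doc_freq = Counter()
    for bag in bags:
        doc_freq.update(bag.keys())
    return sorted(term for term, n in doc_freq.items() if n >= min_doc_count)


def vectorize(bags, vocabulary):
    """Turn each bag into a count vector over the selected vocabulary."""
    return [[bag[term] for term in vocabulary] for bag in bags]


# Hypothetical sample documents.
docs = [
    "Text mining finds patterns in text",
    "Data mining finds patterns in data",
    "Clustering groups similar documents",
]
bags = bag_of_words(docs)
vocab = select_features(bags)
vectors = vectorize(bags, vocab)
print(vocab)
print(vectors)
```

The resulting count vectors are the kind of input a classification or clustering step would then consume.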
Text mining actors
Publishers
– Enriched content
– Annotation tools
– Tools for authors
– New applications based on annotation layers
– Richer cross-linking based on content…

Analysts
– Empowers them
– Annotating research output
– Hypothesis generation
– Summarisation of findings
– Focused semantic search…

Libraries
– Linking between institutional repositories
– Access to richer metadata
– Aggregation
– Aids to subject analysis/classification …
Challenges in text mining
Collected data is “free text” and is not well organized (semi-structured or unstructured)
No uniform access across sources: each source has its own storage and algebra (examples: email, databases, applications, web)
A fivefold heterogeneity: semantic, linguistic, structure, format, size of information unit
Learning techniques for processing text typically need annotated training data
XML as the common model; it allows:
– Manipulating data with standards
– Mining that becomes closer to data mining
– RDF is emerging as a complementary model
The more structure you can explore, the better the mining
Data source administration

Information providers
– Intranet: file system, databases, EDMS
– Internet: web crawling, on-line databank
Format filter
XML normalisation
– subject
– author
– text corpora
– keywords
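
The XML normalisation step can be illustrated with Python's standard `xml.etree.ElementTree` module. This is only a sketch under stated assumptions: the element names (`document`, `subject`, `author`, `text`, `keywords`, `keyword`) mirror the fields listed on the slide but are not a schema given in the source, and the sample record is invented.

```python
import xml.etree.ElementTree as ET


def normalize_document(subject, author, text, keywords):
    """Wrap one source document in a common XML structure.

    The element names are hypothetical; they follow the fields
    listed on the slide (subject, author, text, keywords).
    """
    doc = ET.Element("document")
    ET.SubElement(doc, "subject").text = subject
    ET.SubElement(doc, "author").text = author
    ET.SubElement(doc, "text").text = text
    keywords_el = ET.SubElement(doc, "keywords")
    for keyword in keywords:
        ET.SubElement(keywords_el, "keyword").text = keyword
    return doc


# Invented sample record standing in for the output of a format filter.
record = normalize_document(
    subject="Text mining",
    author="Example Author",
    text="Large textual collections need automatic analysis.",
    keywords=["text mining", "XML"],
)
xml_string = ET.tostring(record, encoding="unicode")
print(xml_string)
```

Once every source is mapped to the same structure, later steps can read the fields uniformly regardless of whether the document came from a file system, a database, or a web crawl.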
Text mining tasks
Text analysis tools
– Feature extraction: name extraction, term extraction, abbreviation extraction, relationship extraction
– Categorization
– Summarization
– Clustering: hierarchical clustering, binary relational clustering
Web searching tools
– Text search engine
– NetQuestion Solution
– Web crawler
Information extraction
Keyword Ranking
Link Analysis
Query Log Analysis
Metadata Extraction
Intelligent Match
Duplicate Elimination


Extract domain-specific information from natural language text
– Need a dictionary of extraction patterns (e.g., “traveled to <x>” or “presidents of <x>”)
• Constructed by hand
• Automatically learned from hand-annotated training data
– Need a semantic lexicon (dictionary of words with semantic category labels)
• Typically constructed by hand
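
A hand-built dictionary of extraction patterns and a small semantic lexicon could look like the following Python sketch. Everything here is an assumption made for illustration: the regular expressions approximate the two example patterns with a slot that matches capitalized words, and the lexicon entries, category labels, and sample sentence are invented.

```python
import re

# Slot filler <x>: one or more capitalized words (a simplifying assumption).
SLOT = r"((?:[A-Z][a-z]+)(?: [A-Z][a-z]+)*)"

# Hand-constructed dictionary of extraction patterns.
PATTERNS = [
    re.compile(r"traveled to " + SLOT),
    re.compile(r"presidents? of " + SLOT),
]

# Hand-constructed semantic lexicon: phrase -> semantic category label.
# Entries and labels are hypothetical.
LEXICON = {
    "New York": "LOCATION",
    "Acme Corporation": "ORGANIZATION",
}


def extract(text):
    """Return (filler, category) pairs for every pattern match in the text."""
    results = []
    for pattern in PATTERNS:
        for match in pattern.finditer(text):
            filler = match.group(1)
            results.append((filler, LEXICON.get(filler, "UNKNOWN")))
    return results


sample = (
    "The delegation traveled to New York. "
    "She met the president of Acme Corporation."
)
found = extract(sample)
print(found)
```

Both resources are written by hand here; learning the patterns automatically would require the hand-annotated training data mentioned above.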
Document collections treatment

Categorization
Clustering
Text Mining example: Obama vs. McCain

Aster Data position for Text Analysis
Data acquisition
– Gather text from relevant sources (web crawling, document scanning, news feeds, Twitter feeds, …)
Pre-processing
– Perform processing required to transform and store text data and information (stemming, parsing, indexing, entity extraction, …)
Mining
– Apply data mining techniques to derive insights about stored information (statistical analysis, classification, natural language processing, …)
Analytic applications
– Leverage insights from text mining to provide information that improves decisions and processes (sentiment analysis, document management, fraud analysis, e-discovery, ...)
Legend: Aster Data fit; third-party tools fit
Aster Data value: massive scalability of text storage and processing; functions for text processing; flexibility to develop diverse custom analytics and incorporate third-party libraries
Aster Data Value for Text Analytics
• Ability to store and process massive volumes of text data
– Massively parallel data stores and massively parallel analytics engine
– SQL-MapReduce framework enables in-database processing for specialized text analytics tools
• Tools and extensibility for processing diverse text data
– SQL-MapReduce framework enables loading and transforming diverse sources and types of text data
– Pre-built functions for text processing
• Flexible platform for building and processing diverse analytics
– SQL-MapReduce framework enables creation of flexible, reusable analytics
– Embedded MapReduce processing engine for high-performance analytics
Aster Data Capabilities for Text Data
Pre-built SQL-MapReduce functions for text processing
• Data transformation utilities
- Pack: compress multi-column data into a single column
- Unpack: extract nested data for further analysis
• Web log analysis
- Sessionization: identify unique browsing sessions in clickstream data
• Text analysis
- Text parser: general tool for tokenizing, stemming, and counting text data
- nGram: split text into component parts (words & phrases)
- Levenshtein distance: compute “distance” between words
Diagram labels: Custom and packaged analytics (apps); Aster Data nCluster; Aster Data Analytic Foundation; SQL-MapReduce; SQL; Data
Big Data & Text Mining

  • 1. Text mining michel.bruley@teradata.com Extract from various presentations: Temis, URI-INIST-CNRS, Aster Data … www.decideo.fr/bruley
  • 2. Information context Big amount of information is available in textual form in databases and online sources In this context, manual analysis and effective extraction of useful information are not possible It is relevant to provide automatic tools for analyzing large textual collections www.decideo.fr/bruley
  • 3. Text mining definition: The objective of text mining is to exploit information contained in textual documents in various ways, including … discovery of patterns and trends in data, associations among entities, predictive rules, etc. The results can be important both for the analysis of the collection and for providing intelligent navigation and browsing methods.
  • 4. Text mining pipeline: Unstructured text (implicit knowledge) → Information retrieval → Information extraction → Knowledge discovery → Structured content (explicit knowledge). Diagram labels: Semantic Search / Data Mining; Semantic metadata.
  • 5. Text mining process: Text preprocessing (syntactic/semantic text analysis); Features generation (bag of words); Features selection (simple counting, statistics); Text/data mining (classification: supervised learning; clustering: unsupervised learning); Analyzing results (mapping/visualization, result interpretation). This is an iterative and interactive process.
  • 6. Text mining actors: Publishers (enriched content; annotation tools; tools for authors; new applications based on annotation layers; richer cross linking based on content…). Analysts (empowers them; annotating research output; hypothesis generation; summarisation of findings; focused semantic search…). Libraries (linking between institutional repositories; access to richer metadata; aggregation; aids to subject analysis/classification …).
  • 7. Challenges in text mining: The data collected is “free text” and is not well organized (semi-structured or unstructured). There is no uniform access across sources; each source has separate storage and algebra (examples: email, databases, applications, web). A quintuple heterogeneity: semantic, linguistic, structure, format, size of unit information. Learning techniques for processing text typically need annotated training data. XML as the common model allows manipulating data with standards, and mining becomes more like data mining; RDF is emerging as a complementary model. The more structure you can explore, the better you can do mining.
  • 8. Data source administration (diagram labels): Intranet; File System; Databases; EDMS; Internet; Web Crawling; On-line Databank; XML Normalisation (subject, author, text corpora, keywords); Information Provider; Format filter.
  • 9. Text mining tasks (diagram labels): Name Extraction; Term Extraction; Feature extraction; Categorization; Text Analysis Tools; Abbreviation Extraction; Relationship Extraction; Summarization; Clustering; Hierarchical Clustering; Binary relational Clustering; TM; Text search engine; Web Searching Tools; NetQuestion Solution; Web Crawler.
  • 10. Information extraction: Keyword ranking; Link analysis; Query log analysis; Metadata extraction; Intelligent match; Duplicate elimination. Extract domain-specific information from natural language text. This needs a dictionary of extraction patterns (e.g., “traveled to <x>” or “presidents of <x>”), either constructed by hand or automatically learned from hand-annotated training data. It also needs a semantic lexicon (a dictionary of words with semantic category labels), typically constructed by hand.
  • 12. Text Mining example: Obama vs. McCain
  • 13. Aster Data position for Text Analysis: Data Acquisition: gather text from relevant sources (web crawling, document scanning, news feeds, Twitter feeds, …). Pre-Processing: perform processing required to transform and store text data and information (stemming, parsing, indexing, entity extraction, …). Mining: apply data mining techniques to derive insights about stored information (statistical analysis, classification, natural language processing, …). Analytic Applications: leverage insights from text mining to provide information that improves decisions and processes (sentiment analysis, document management, fraud analysis, e-discovery, ...). Diagram labels: Aster Data Fit; Third-Party Tools Fit. Aster Data Value: massive scalability of text storage and processing, functions for text processing, flexibility to develop diverse custom analytics and incorporate third-party libraries.
  • 14. Aster Data Value for Text Analytics: (1) Ability to store and process massive volumes of text data: massively parallel data stores and a massively parallel analytics engine; the SQL-MapReduce framework enables in-database processing for specialized text analytics tools. (2) Tools and extensibility for processing diverse text data: the SQL-MapReduce framework enables loading and transforming diverse sources and types of text data; pre-built functions for text processing. (3) Flexible platform for building and processing diverse analytics: the SQL-MapReduce framework enables creation of flexible, reusable analytics; embedded MapReduce processing engine for high-performance analytics.
  • 15. Aster Data Capabilities for Text Data: Pre-built SQL-MapReduce functions for text processing. Data transformation utilities: Pack (compress multi-column data into a single column); Unpack (extract nested data for further analysis). Web log analysis: Sessionization (identify unique browsing sessions in clickstream data). Text analysis: Text parser (general tool for tokenizing, stemming, and counting text data); nGram (split text into component parts: words and phrases); Levenshtein distance (compute “distance” between words). Diagram labels: Custom and Packaged Analytics; Aster Data nCluster; App; Aster Data Analytic Foundation; SQL-MapReduce; SQL; Data.
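
Slide 15 names two text-analysis functions, nGram and Levenshtein distance, without showing what they compute. The following is a minimal Python sketch of both ideas. It is not Aster Data's SQL-MapReduce implementation; the function names `ngrams` and `levenshtein`, and the whitespace tokenization, are assumptions made for illustration only.

```python
def ngrams(text, n):
    """Split text into overlapping n-word phrases (assumes whitespace tokenization)."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def levenshtein(a, b):
    """Edit distance between two words: the minimum number of single-character
    insertions, deletions, and substitutions needed to turn a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(ngrams("Text mining finds patterns", 2))
    print(levenshtein("kitten", "sitting"))
```

Keeping only one previous row of the distance table holds memory proportional to the length of the second word, which matters when comparing many word pairs.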

Editor's notes

  1. Input data system: This part of the system handles data collection: getting data from the internet with a crawler, from online vendors, and from the internal data banks. Regarding the input format (physical and logical), data are physically reformatted into HTML and then loaded into an XML format.
  2. Feature extraction tools: These recognize significant vocabulary items in documents and measure their importance to the document content. Clustering tools: Clustering is used to segment a document collection into subsets, called clusters. Summarization tool: Summarization is the process of condensing a source text into a shorter version that preserves its information content. Categorization tool: Categorization is used to assign objects to predefined categories, or classes from a taxonomy.
  3. http://services.alphaworks.ibm.com/manyeyes/view/SWhH8QsOtha6qL3F~y5HQ2~
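
Note 2 above describes feature extraction as measuring how important a vocabulary item is to a document. One common way to do this is TF-IDF weighting. The sketch below assumes that weighting for illustration only; the notes do not say which measure the tools actually use, and the function name `tf_idf` and the whitespace tokenization are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    weight = (term count / document length) * log(number of docs / docs containing term)
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


if __name__ == "__main__":
    for doc_weights in tf_idf(["text mining text", "data mining"]):
        print(doc_weights)
```

A term that occurs in every document gets weight zero, which reflects the idea that such terms do not distinguish one document's content from another's.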