
1 _text_mining_v0a


text mining

Published in: Technology

  1. Text mining. michel.bruley@teradata.com, www.decideo.fr/bruley. Extract from various presentations: Temis, URI-INIST-CNRS, Aster Data …
  2. Information context
      • A large amount of information is available in textual form in databases and online sources
      • In this context, manual analysis and effective extraction of useful information are not possible
      • It is therefore relevant to provide automatic tools for analyzing large textual collections
  3. Text mining definition
      • The objective of text mining is to exploit information contained in textual documents in various ways, including … discovery of patterns and trends in data, associations among entities, predictive rules, etc.
      • The results can be important both for the analysis of the collection and for providing intelligent navigation and browsing methods
  4. Text mining pipeline (diagram). Labels: unstructured text (implicit knowledge); structured content (explicit knowledge); information extraction; semantic metadata; knowledge discovery; information retrieval; semantic search / data mining
  5. Text mining process (iterative and interactive)
      • Text preprocessing: syntactic/semantic text analysis
      • Features generation: bag of words
      • Features selection: simple counting, statistics
      • Text/data mining: classification (supervised learning), clustering (unsupervised learning)
      • Analyzing results: mapping/visualization, result interpretation
  6. Text mining actors
      • Publishers: enriched content; annotation tools; tools for authors; new applications based on annotation layers; richer cross linking based on content…
      • Analysts: empowers them; annotating research output; hypothesis generation; summarisation of findings; focused semantic search…
      • Libraries: linking between institutional repositories; access to richer metadata; aggregation; aids to subject analysis/classification …
  7. Challenges in text mining
      • The collected data is “free text” and is not well organized (semi-structured or unstructured)
      • There is no uniform access over all sources; each source has separate storage and algebra (examples: email, databases, applications, web)
      • A quintuple heterogeneity: semantic, linguistic, structure, format, size of unit information
      • Learning techniques for processing text typically need annotated training data
      • XML as the common model allows:
        – Manipulating data with standards
        – Mining becomes more like data mining
        – RDF is emerging as a complementary model
      • The more structure you can explore, the better you can do mining
  8. Data source administration (diagram). Labels:
      • Sources: Intranet; Internet; On-line Databank; Information Provider; File System; Databases; EDMS; Web
      • Crawling
      • XML normalisation: subject, author, text corpora, keywords
      • Format filter
  9. Text mining tasks (TM)
      • Text analysis tools
        – Feature extraction: name extraction, term extraction, abbreviation extraction, relationship extraction
        – Categorization
        – Summarization
        – Clustering: hierarchical clustering, binary relational clustering
      • Web searching tools
        – Text search engine
        – NetQuestion Solution
        – Web crawler
  10. Information extraction
      • Extract domain-specific information from natural language text
        – Need a dictionary of extraction patterns (e.g., “traveled to <x>” or “presidents of <x>”), either constructed by hand or automatically learned from hand-annotated training data
        – Need a semantic lexicon (dictionary of words with semantic category labels), typically constructed by hand
      • Other labels on the slide: Link Analysis; Query Log Analysis; Metadata Extraction; Keyword Ranking; Intelligent Match; Duplicate Elimination
  11. Document collections treatment: Categorization; Clustering
  12. Text mining example: Obama vs. McCain
  13. Aster Data position for Text Analysis
      • Data Acquisition: gather text from relevant sources (web crawling, document scanning, news feeds, Twitter feeds, …)
      • Pre-Processing: perform processing required to transform and store text data and information (stemming, parsing, indexing, entity extraction, …)
      • Mining: apply data mining techniques to derive insights about stored information (statistical analysis, classification, natural language processing, …)
      • Analytic Applications: leverage insights from text mining to provide information that improves decisions and processes (sentiment analysis, document management, fraud analysis, e-discovery, ...)
      • Fit labels on the slide: Third-Party Tools Fit; Aster Data Fit
      • Aster Data Value: massive scalability of text storage and processing, functions for text processing, flexibility to develop diverse custom analytics and incorporate third-party libraries
  14. Aster Data Value for Text Analytics
      • Ability to store and process massive volumes of text data
        – Massively parallel data stores and massively parallel analytics engine
        – SQL-MapReduce framework enables in-database processing for specialized text analytics tools
      • Tools and extensibility for processing diverse text data
        – SQL-MapReduce framework enables loading and transforming diverse sources and types of text data
        – Pre-built functions for text processing
      • Flexible platform for building and processing diverse analytics
        – SQL-MapReduce framework enables creation of flexible, reusable analytics
        – Embedded MapReduce processing engine for high-performance analytics
  15. Aster Data Capabilities for Text Data: pre-built SQL-MapReduce functions for text processing
      • Data transformation utilities
        – Pack: compress multi-column data into a single column
        – Unpack: extract nested data for further analysis
      • Web log analysis
        – Sessionization: identify unique browsing sessions in clickstream data
      • Text analysis
        – Text parser: general tool for tokenizing, stemming, and counting text data
        – nGram: split text into component parts (words & phrases)
        – Levenshtein distance: compute “distance” between words
      • Architecture diagram labels: Data; Aster Data Analytic Foundation; SQL; SQL-MapReduce; App; Custom and Packaged Analytics; Aster Data nCluster
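
To make the nGram and Levenshtein distance functions named on slide 15 concrete, here is a minimal sketch in plain Python. It is not the Aster Data SQL-MapReduce implementation, whose interface the slides do not show; the function names `ngrams` and `levenshtein` and the whitespace tokenization are assumptions chosen for illustration.

```python
def ngrams(text, n):
    """Split text into word n-grams (assumption: tokens are whitespace-separated)."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def levenshtein(a, b):
    """Edit distance: minimum number of single-character insertions,
    deletions, and substitutions needed to turn string a into string b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


bigrams = ngrams("text mining is fun", 2)
distance = levenshtein("kitten", "sitting")
```

Here `bigrams` holds the three two-word phrases of the sample sentence, and `distance` is the edit distance between the two sample words. The row-by-row table keeps memory proportional to the length of the second word, which matters when many word pairs are compared.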

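The features generation and features selection steps of the text mining process (slide 5) can be sketched the same way. This is a minimal illustration using only the Python standard library, assuming a crude letters-only tokenizer and a document-frequency threshold as the "simple counting" rule; the names `tokenize`, `bag_of_words`, `select_features`, `vectorize`, and the parameter `min_doc_count` are hypothetical and do not come from the slides.

```python
import re
from collections import Counter


def tokenize(text):
    """Crude preprocessing: lowercase and keep runs of ASCII letters only."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(documents):
    """Features generation: one term-frequency Counter per document."""
    return [Counter(tokenize(doc)) for doc in documents]


def select_features(bags, min_doc_count=2):
    """Features selection by simple counting: keep terms that appear
    in at least min_doc_count documents, in alphabetical order."""
    doc_freq = Counter()
    for bag in bags:
        doc_freq.update(bag.keys())  # count each term once per document
    return sorted(term for term, n in doc_freq.items() if n >= min_doc_count)


def vectorize(bags, vocabulary):
    """Turn each bag into a fixed-length count vector over the vocabulary."""
    return [[bag[term] for term in vocabulary] for bag in bags]


docs = [
    "Text mining exploits information in text.",
    "Data mining finds patterns in data.",
    "Clustering groups similar documents.",
]
bags = bag_of_words(docs)
vocab = select_features(bags)
vectors = vectorize(bags, vocab)
```

The resulting `vectors` are the kind of structured input that a classification or clustering step would consume. A real pipeline would normally add stop-word removal and stemming before counting, since a pure frequency threshold also keeps function words such as "in".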