Introduction to text and data visualization for modelling text analytics applications. Covers: who is Treparel, why visualize data, how do we visualize data, and multiple coupled views and interaction.
Text and Data Visualization Introduction 2012
1. Introduction to
Text
Visualization
Dr. Anton Heijs
CEO
Treparel
Delftechpark 26
2628 XH Delft
The Netherlands
July 2012
www.treparel.com
2. KMX enables information and knowledge professionals
to gain faster, more reliable, more precise insights into large,
complex, unstructured data sets, allowing them to make
better-informed decisions.
Treparel is a leading technology solution provider in
Big Data Text Analytics & Visualization
Treparel KMX – All rights reserved 2012
3. Topics covered in this presentation
• Who is Treparel?
• Why visualize data?
• How do we visualize data?
• Multiple coupled views and interaction.
4. Nexus of Forces: Social, Cloud, Mobile, Information
IT Market shift driving Big Data challenges
Copyright: Gartner, 2011
80% of data is Unstructured (Documents, Text, Images, Graphs)
5. About Treparel
• Delft, The Netherlands, 2006.
• Treparel is an innovative technology solution provider in Big Data
Analytics, Text Mining and Visualization.
• KMX is an integrated data analysis toolset that provides faster,
more reliable, intelligent insights into large, complex, unstructured
data sets, allowing companies to make better-informed decisions.
• Clients: Philips, Bayer, Abbott, European Patent Office, European
Commission
• Part of a research center and university ecosystem: TU Delft and the
universities of Paris and São Paulo
• More info: www.treparel.com
6. Positioning of Treparel’s KMX technology
Text Acquisition & Preparation (‘Seek’)
• External sources: Patents, Legal, Media and publishing, Research, Media / Publishers
• Other sources: Documents, Websites, Blogs, Newsfeeds, Email, Application notes, Search results, Social networks
Analysis and processing (‘Model’)
• Text preprocessing
• Indexing
• Clustering
• Classification
• Semantic Analysis
• Visualization
• Information extraction (entities, facts, relationships, concepts, patents)
Output and display (‘Adapt’)
• Reporting & Presentation
• Databases
• Content management systems
• Line-of-business applications
• Research applications
• Search engines
Management, Development and Configuration
Copyright: Gartner, J. Popkin 2010
7. Why visualize data?
• Gives a bird’s-eye view of large, comprehensive datasets
• Reveals previously unknown patterns in the dataset
• Reveals inherent problems in the data (noise, outliers)
• Enables examination of both the large-scale features of the
dataset and its local features (frame of reference)
• Allows the user to form hypotheses from the insights gained
and make better-informed decisions
8. Why visualize data?
Analyse without visualization
• Initial selection of training documents is ‘blind’
• No insight into how the data is separated
• Improving the Classifier™ is a lengthy process
• No clear insight into the performance of a Classifier™
9. Why visualize data?
Analyse with visualization
• Representative selection of training documents by the information
professional
• A clean training set (no noise or outliers) is used for building a Classifier™
• Visual feedback on whether the Classifier™ yields a well-separated region
within the frame of reference of the whole dataset
• Insight into the performance of a Classifier™
10. How do we visualize data?
• Create a representation we can use (vector space model)
• Correct for (normalization):
– Terms that appear often in most documents (non-discriminative
terms)
– Length of documents (remove bias towards longer documents)
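The two corrections above can be sketched with a standard TF-IDF weighting followed by length normalization. This is a minimal illustration and not a description of how KMX is implemented: the whitespace tokenizer, the plain log(N / df) weighting, the function names and the sample documents are all assumptions made for the example.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, vectors): one unit-length TF-IDF vector per document."""
    # Assumption: naive lowercase/whitespace tokenization, for illustration only.
    tokenized = [doc.lower().split() for doc in docs]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # IDF down-weights terms that appear in most documents; a term that
    # occurs in every document gets weight log(1) = 0.
    idf = {term: math.log(n_docs / doc_freq[term]) for term in vocabulary}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = [counts[term] * idf[term] for term in vocabulary]
        # L2 normalization removes the bias towards longer documents.
        norm = math.sqrt(sum(w * w for w in weights))
        vectors.append([w / norm if norm else 0.0 for w in weights])
    return vocabulary, vectors

def cosine(u, v):
    """Cosine similarity of two vectors that are already unit length."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical sample documents; the second is the first repeated twice.
docs = [
    "the patent describes a coated stent",
    "the patent describes a coated stent the patent describes a coated stent",
    "the report covers the quarterly revenue",
]
vocabulary, vectors = tfidf_vectors(docs)
print(cosine(vectors[0], vectors[1]))  # short vs. doubled document
print(cosine(vectors[0], vectors[2]))  # documents sharing only "the"
```

In this sketch the doubled document ends up with the same vector as the original, and the term "the", which occurs in every document, contributes nothing to similarity.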
11. Three types of visualization
1. Visualize frequency distribution
2. Visualize classification scores
3. Visualize document similarity
12. 1. Visualize frequency distribution
Requires careful interpretation
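One possible reason for caution (an assumption for illustration; the slide does not say which pitfalls are meant) is that a raw term count can be dominated by a single document. The sketch below computes both the total count of each term and the number of documents containing it, using hypothetical sample documents and a hypothetical function name.

```python
from collections import Counter

def frequency_distributions(docs):
    """Return (total term counts, number of documents containing each term)."""
    total = Counter()
    doc_freq = Counter()
    for doc in docs:
        tokens = doc.lower().split()  # assumption: whitespace tokenization
        total.update(tokens)          # every occurrence counts
        doc_freq.update(set(tokens))  # each document counts once per term
    return total, doc_freq

# Hypothetical documents: "stent" is frequent overall but occurs in one document.
docs = [
    "stent stent stent stent coating",
    "coating polymer",
    "coating catheter",
]
total, doc_freq = frequency_distributions(docs)
for term, count in total.most_common():
    bar = "#" * count
    print(f"{term:10s} {bar:5s} (in {doc_freq[term]} of {len(docs)} documents)")
```

Here "stent" tops the raw frequency chart although only one document mentions it, while "coating" is spread over all three.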
13. 2. Visualize classification scores
Parallel Coordinate Plot
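In a parallel coordinate plot each classifier gets its own vertical axis and each document is drawn as a polyline crossing every axis at the height of its score. The sketch below only computes the polyline coordinates; the classifier names, scores and function name are hypothetical, and rescaling each axis to [0, 1] is one common choice rather than something stated in the slides.

```python
def parallel_coordinates(scores, axes):
    """Map each document's scores to a polyline of (x, y) points.

    x is the index of the axis (one axis per classifier) and y is the
    score rescaled to [0, 1] using the minimum and maximum on that axis.
    """
    lows = {a: min(row[a] for row in scores.values()) for a in axes}
    highs = {a: max(row[a] for row in scores.values()) for a in axes}
    polylines = {}
    for doc_id, row in scores.items():
        points = []
        for x, axis in enumerate(axes):
            span = highs[axis] - lows[axis]
            # An axis on which all documents score the same is drawn mid-height.
            y = (row[axis] - lows[axis]) / span if span else 0.5
            points.append((x, y))
        polylines[doc_id] = points
    return polylines

# Hypothetical classifier scores for three documents.
axes = ["classifier_a", "classifier_b", "classifier_c"]
scores = {
    "doc1": {"classifier_a": 0.9, "classifier_b": 0.2, "classifier_c": 0.4},
    "doc2": {"classifier_a": 0.1, "classifier_b": 0.8, "classifier_c": 0.4},
    "doc3": {"classifier_a": 0.5, "classifier_b": 0.5, "classifier_c": 0.4},
}
polylines = parallel_coordinates(scores, axes)
for doc_id, points in polylines.items():
    print(doc_id, points)
```

Documents that score alike produce polylines that run close together, so groups and outliers show up as bundles and stray lines.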
14. 3. Visualize document similarity
Also visualizes the classification scores
15. How do we visualize data?
What is Least Squares Projection (LSP)?
• LSP attempts to retain the neighborhood relations of
objects in a high dimensional feature space when the
data gets projected onto a two dimensional surface.
Advantage:
Objects that are close to each other (similar) in the high-dimensional
feature space are also close to each other in the LSP visualization.
This results in the formation of clusters on the two-dimensional surface.
Disadvantage:
As a result of the projection to two dimensions (for human
interpretation), information is lost. This can result in clusters that,
for human interpretation, are not condensed into an identifiable cluster
(e.g. scattered over the visualization).
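The neighborhood-preserving idea can be sketched as a small least-squares problem: every point is asked to sit at the mean of its nearest neighbours in feature space, while a few control points are pulled towards fixed 2D positions. This is a much-simplified illustration under stated assumptions, not KMX's implementation: the control positions are supplied by hand, the function names and sample data are hypothetical, and the system is solved through the normal equations in plain Python.

```python
import math

def knn(points, i, k):
    """Indices of the k nearest neighbours of points[i] (Euclidean distance)."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return others[:k]

def solve(matrix, rhs):
    """Solve a square linear system by Gaussian elimination with pivoting."""
    n = len(matrix)
    aug = [row[:] + [r] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x

def lsp(points, control, k=2):
    """Project points to 2D; control maps point index -> target (x, y)."""
    n = len(points)
    rows, targets = [], []
    # Neighbourhood rows: each point should sit at the mean of its neighbours.
    for i in range(n):
        row = [0.0] * n
        row[i] = 1.0
        for j in knn(points, i, k):
            row[j] = -1.0 / k
        rows.append(row)
        targets.append((0.0, 0.0))
    # Control rows: pull the control points towards their given 2D positions.
    for i, xy in control.items():
        row = [0.0] * n
        row[i] = 1.0
        rows.append(row)
        targets.append(xy)
    # Least squares via the normal equations (A^T A) x = A^T b, per coordinate.
    ata = [[sum(r[p] * r[q] for r in rows) for q in range(n)] for p in range(n)]
    coords = []
    for axis in range(2):
        atb = [sum(r[p] * t[axis] for r, t in zip(rows, targets)) for p in range(n)]
        coords.append(solve(ata, atb))
    return list(zip(coords[0], coords[1]))

# Hypothetical data: two well-separated groups of three points in 4D.
points = [
    (0, 0, 0, 0), (1, 0, 0, 0), (0, 1, 0, 0),
    (10, 10, 10, 10), (11, 10, 10, 10), (10, 11, 10, 10),
]
# Hand-picked 2D positions for two control points per group (an assumption).
control = {0: (0.0, 0.0), 1: (1.0, 0.0), 3: (5.0, 5.0), 4: (6.0, 5.0)}
projection = lsp(points, control, k=2)
for index, (x, y) in enumerate(projection):
    print(index, round(x, 3), round(y, 3))
```

Because the control positions enter as least-squares terms rather than hard constraints, the control points are only approximately placed at their targets. The points without a control position (indices 2 and 5) land between the control points of their own group, so each group stays together in 2D.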
16. 3. Visualize document classification scores
After applying LSP
17. Multiple coupled views and interaction
18. Treparel is a leading technology solution provider
in Big Data Text Analytics & Visualization
Treparel
Delftechpark 26
2628 XH Delft
The Netherlands
www.treparel.com