Seth Redmore talks about text and data visualization at this year's Smart Data Conference.
He covers:
-Common software packages for visualization
-Structured plots for unstructured text: Lines vs. bars vs. boxplots vs. piecharts vs. bubble charts
-Less structured plots: word clouds vs. treemaps vs. clusters vs. graphs
-Moving plots: animations over time
The (in)famous word cloud. A lot of our customers ask for word clouds. We think that there are much better ways to visualize data, but if you insist on a word cloud, there are good ways and poor ways to use one.
Word clouds are produced by packing algorithms. Part of the problem is that, in order to pack the words, any change in the word list causes the whole cloud to be redrawn. This makes it really hard to compare one word cloud to another, so they are poorly suited to comparative analysis.
The answer to the question is on the next slide. Ponder this for like 30 seconds, and then go to the next slide.
The missing word was “suffix.” Which word is missing on this slide?
See how much easier that was? And we’re actually giving you three pieces of information here: the number of occurrences, the word itself, and the rank order of the words.
The problem is that this list just doesn’t look very sexy. Hold that thought.
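A ranked list like the one described above takes only a few lines to produce. The following is a minimal sketch, assuming Python and a simple regex tokenizer; the function name `ranked_terms` and the sample text are hypothetical and do not come from the presentation.

```python
import re
from collections import Counter

def ranked_terms(text, top_n=10):
    """Return (rank, word, count) tuples, most frequent word first."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    # Sort by count descending, then alphabetically so ties are stable.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, word, count)
            for rank, (word, count) in enumerate(ordered[:top_n], start=1)]

# Hypothetical sample text, for illustration only.
sample = "prefix suffix stem suffix stem stem"
for rank, word, count in ranked_terms(sample):
    print(f"{rank:>2}. {word:<10} {count}")
```

Because the order is fixed by the counts, removing one word only shifts the entries below it, which is what makes two such lists easy to compare.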
Let’s step back for a second.
The “content-derived” set contains some of the most important things that you extract from unstructured data.
On the right are structured data that you will often want to associate with the unstructured data. There are, of course, other things like sales, customer ID, etc.
It is important to think about how you’re going to show these two things together.
One piece of information that you can convey visually is the “relationships” between things. You can do so spatially, through clustering or graphs. You can do side-by-side bars, and there are a lot of other ways you can group things together.
Text is interesting in that one piece of text may mention concept A and concept B, and another piece of text may mention B and C. There’s a relationship there, but not a complete overlap. This can sometimes be important for analytical purposes.
Grouping is an important concept, and is one that we’ll examine more later on in this presentation, particularly when we’re talking about graphing and clustering.
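The partial overlap described above (one text mentions A and B, another mentions B and C) can be captured by counting concept co-occurrences. This is a minimal sketch, assuming each document has already been reduced to a set of extracted concepts; `cooccurrence_counts` and the sample data are hypothetical.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(docs):
    """Count how many documents mention each unordered pair of concepts."""
    pairs = Counter()
    for concepts in docs:
        # sorted() gives each pair a canonical order, so (A, B) == (B, A).
        for pair in combinations(sorted(set(concepts)), 2):
            pairs[pair] += 1
    return pairs

# Hypothetical documents, each represented by the concepts it mentions.
docs = [{"A", "B"}, {"B", "C"}]
counts = cooccurrence_counts(docs)
for pair, n in sorted(counts.items()):
    print(pair, n)
```

Here A and C never appear together, yet both are linked to B. The resulting pair counts are the kind of edge list that a graph or clustering layout could take as input.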
Pie charts are another default visualization. Think about them for a second. Do you really need to show people how to do percentages? They don’t work very well for anything more than 3-4 things. They’re also really easy to lie with. Ever see a 3D pie chart that’s kinda on an angle? The tilt makes it really hard to judge the relative sizes of the slices.
Tables are better. Use tables. They’re not going to take up graphical space, but they’re really easy to read.
Since we’re talking about pie charts, let’s talk about another easy area for manipulation.
You need to ask “what are the bounds of neutral?” If the neutral band is narrow, then more content lands in positive and negative.
We’ve actually found with our software that the results agree best with humans when the bounds are a little lopsided (-0.1 for negative, +0.15 for positive).
Scales and such will differ, but for any sentiment tool that is reporting numbers, it’s important to understand these bounds.
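To make the effect of the neutral bounds concrete, here is a minimal sketch that labels numeric scores using the lopsided bounds mentioned above. It assumes scores on a scale of roughly -1 to +1 and treats the bounds themselves as neutral. Both are assumptions, and `label_sentiment` is a hypothetical helper, not the API of any particular sentiment tool.

```python
def label_sentiment(score, neg_bound=-0.1, pos_bound=0.15):
    """Map a numeric sentiment score to a label using a neutral band."""
    if score < neg_bound:
        return "negative"
    if score > pos_bound:
        return "positive"
    # Scores on or between the two bounds count as neutral (an assumption).
    return "neutral"

# Hypothetical scores, for illustration only.
scores = [-0.4, -0.05, 0.0, 0.12, 0.3]
print([label_sentiment(s) for s in scores])
# Narrowing the neutral band moves borderline scores out of "neutral".
print([label_sentiment(s, neg_bound=-0.01, pos_bound=0.01) for s in scores])
```

The same five scores produce different positive/negative/neutral splits under the two settings, which is why a chart of sentiment proportions is hard to interpret without knowing the bounds.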
Lines are great. Now you start getting into more information packed onto the graph.
This chart is the result of taking 150,000 songs, running them line-by-line through our sentiment engine, and then bucketing them into songs that follow different topologies – for example, “positive-positive” would be a song that in the first half is positive, and then stays positive for the last half.
This graph has a flaw, namely the sharp dropoff. The dropoff comes from a lack of data in the later years, so the plot is accurate but gives a misleading impression.
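The bucketing step described above can be sketched as follows. This is a minimal sketch, assuming each song arrives as a list of per-line sentiment scores and that a half counts as positive when its mean score is at or above zero. The talk does not say how the halves were actually scored, so that rule and the name `song_topology` are assumptions.

```python
from statistics import mean

def song_topology(line_scores):
    """Label a song by the average sentiment of its first and second halves."""
    if len(line_scores) < 2:
        raise ValueError("need at least two scored lines")
    mid = len(line_scores) // 2

    def half_label(scores):
        # Assumed rule: a mean at or above zero counts as positive.
        return "positive" if mean(scores) >= 0 else "negative"

    return f"{half_label(line_scores[:mid])}-{half_label(line_scores[mid:])}"

# Hypothetical per-line scores for two songs.
print(song_topology([0.4, 0.2, 0.3, 0.5]))
print(song_topology([0.4, 0.2, -0.3, -0.5]))
```

Counting these labels per release year, and dividing by the number of songs in that year, would give the percentages plotted on the chart.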
This is the same information as was on the last graph, but put with side-by-side bars.
What is interesting about this graph is that you can see, by percentage, how the positive-positive songs decrease until the 2000s and then start to increase again, while the converse happens with negative-negative songs.
That’s more immediately informative than the previous line chart where you have to dig a little harder to get that information.
Bubble charts allow you to present up to four dimensions of information: two for the x and y positions, a third for the size of the bubble, and a fourth for the color of the bubble.
I’m a big fan of bubble charts, and you’ll be seeing at least one more of these in the presentation in a real world example.
This chart is courtesy of Provalis Research, who make cool statistical and text analytics packages for desktop use.
http://www.provalisresearch.com
Let’s talk a little bit about words and relationships.
When you stem a word, you’re trying to find the root word. We’ll discuss the difference between stems and lemmas on the next slide.
The point of this slide is that if you ignore phrases and just pull out the stemmed forms, you’re missing part of the deal. In the top example, “satisfied” is modified by “greatly” and should be associated with “dinner.” If you just stemmed the words down, you’d end up at “great,” which would be cool, but knowing “great” and its relationship to “satisfied” and eventually “dinner” is really important.
Similarly with the bottom example. Cracked screen is a whole thing in and of itself. You can probably infer if you just see “crack” that there was a, well, crack. But you don’t know what was cracked on the thing and you don’t know what thing it was – which is why it’s important to expose “cracked screen” and associate it with “phone.”
Stemming is trying to find the root word without taking the part-of-speech type into account. Lemmatization (also okay if spelled with an “s” – lemmatisation) takes the part-of-speech into account.
Meeting can be a verb or a noun. If a verb, the root form is “meet” – if a noun, then the root form is actually “meeting.”
You can generally get away with just stems, but lemmas provide a richer experience.
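The difference between the two can be shown with a toy example. This is a minimal sketch under stated assumptions: `toy_stem` strips only three suffixes and `TOY_LEMMAS` is a hand-written, hypothetical lookup table. Real stemmers and lemmatizers use far larger rule sets and dictionaries.

```python
def toy_stem(word):
    """Strip a few common suffixes, ignoring part of speech."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical lookup keyed by (word, part of speech).
TOY_LEMMAS = {
    ("meeting", "VERB"): "meet",
    ("meeting", "NOUN"): "meeting",
}

def toy_lemmatize(word, pos):
    """Return the dictionary form for a word given its part of speech."""
    return TOY_LEMMAS.get((word, pos), word)

print(toy_stem("meeting"))               # same result for verb and noun
print(toy_lemmatize("meeting", "VERB"))
print(toy_lemmatize("meeting", "NOUN"))
```

The stemmer returns the same form regardless of how the word is used, while the lemmatizer needs the part of speech as an extra input and can therefore keep the noun “meeting” intact.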
On the next few slides, we’re going to work through content that was gathered around the time of the Samsung Galaxy S5 announcement.
We’re going to focus on themes, which is a way to extract noun phrases that are contextually important.
This is the simplest possible “word cloud” of these themes. No size, no color, just a list of the terms.
Here we have a word cloud, where size is dependent on occurrence, but color isn’t used.
And now we add color for sentiment. You can immediately see that “Gangnam Style” and “Android Source Code” are negative themes.
You should be wondering why at this point.
Here’s the exact same set of themes, but arranged in a bubble chart.
Now you have a timeline, so you can actually make some inferences about when things happened and what themes co-occurred in time.
For example, the Android Source Code thing happened later, whereas the Gangnam Style thing happened around the time of the launch.
Now we add back in the sentiment.
Here you see all of the information that was in the word cloud, but arranged in time so that you can see stuff that was associated with the launch, vs. content that occurred later in time.
The Gangnam Style negativity happened around launch time. Digging into the content (not shown) you just see that the song was overplayed and people were like “really?”
The Android Source Code bit happens later on, and is associated with the ongoing legal battles between Apple and Samsung.
It’s really nice to be able to start integrating structured data with the unstructured visualization. In this case, I have demographic information based on the names associated with the tweets. You could do location, or any of a number of other things.
This slide becomes too complex, and so we need a different visualization to show the demographic interaction.
We already know when and how much (and sentiment) for each of the themes. We just need to know something about the demographics.
And so we can do this sort of visualization, where the gender is represented by the female/male symbol. You can see that men were more positive on the announcement than women were, and you can see that it was men who were most negative about the Gangnam Style tie-in.
The next few slides are going to compare word clouds to treemaps.
Here’s a word cloud from content surrounding one of the recent Olympics.
Here’s the same content as a treemap.
Treemaps are really best for content that is easily divided into “subsets” – I’ll show an example of this in a few slides.
Let’s do the same exercise we did before with the two word clouds early on.
Which word is missing?
The difference in packing algorithms between a treemap and a word cloud means that these differences are really much easier to pick out. The ordering of the sizes helps tremendously.
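To see why ordered sizes make a missing item easier to spot, here is a deliberately simplified layout sketch. It places items in a single row of side-by-side rectangles, largest first, with widths proportional to size. Real treemap tools typically use more elaborate layouts and handle nested hierarchies; the name `slice_layout` and the sample counts are hypothetical.

```python
def slice_layout(sizes, width=100.0, height=100.0):
    """Lay out items as side-by-side rectangles, largest first.

    Returns a list of (label, x, y, w, h) tuples.
    """
    total = sum(sizes.values())
    if total <= 0:
        raise ValueError("sizes must sum to a positive number")
    # Order by size descending, then by label so ties are stable.
    ordered = sorted(sizes.items(), key=lambda kv: (-kv[1], kv[0]))
    rects, x = [], 0.0
    for label, size in ordered:
        w = width * size / total
        rects.append((label, x, 0.0, w, height))
        x += w
    return rects

# Hypothetical term counts, before and after one term is removed.
before = slice_layout({"gold": 50, "medal": 30, "team": 20})
after = slice_layout({"gold": 50, "team": 20})
print([r[0] for r in before])
print([r[0] for r in after])
```

Removing one term leaves the remaining items in the same relative order, so the two layouts differ only where the missing rectangle used to be.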
Here’s a treemap of the Usenet hierarchy.
(Remember Usenet? I do. If you don’t remember it, it was basically a decentralized set of newsgroups. It’s still around in one form or another.)
Note the hierarchy/subset nature, where you can see different parts of Usenet and their relative sizes. Treemaps make for nice navigational interfaces for highly complex, but inherently ordered content.
Force-directed graphs use simulated physics, including repulsion between nodes, to lay out information in a pleasing, relationship-retaining way.
There are a number of packages that will lay out force-directed graphs for you. (The packages are listed at the end of the presentation.)
It’s important to note the inherent connections – you can see the words that are literally and figuratively attached to other words or concepts. Some graphs have directionality indicated by arrows, others simply indicate connectivity.
Graphs like this can get really messy if you have too much content and haven’t pruned the connections down, but careful zooming and filtering can really help.
Bottlenose provides a platform for analysis of streaming data, of which text content is a part. (http://www.bottlenose.com)
Clustering is both a text analytics concept and a visual concept.
The two feed into each other very nicely. There are a lot of different ways to do clustering of text, but they all have to do with similarities. You could cluster based on all content that mentioned a particular entity – which is what you see on the Bottlenose graphic from the previous slide. The content is inherently clustered, but is laid out in a simple layout because it is clustering based on the occurrence of a particular phrase or entity.
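The simplest form of clustering mentioned above, grouping content by the entity it mentions, can be sketched as follows. This is a minimal sketch that assumes case-insensitive substring matching stands in for real entity extraction; `cluster_by_entity` and the sample documents are hypothetical and do not reflect how Bottlenose works.

```python
from collections import defaultdict

def cluster_by_entity(docs, entities):
    """Group document IDs under each entity that the document mentions."""
    clusters = defaultdict(list)
    for doc_id, text in docs.items():
        lowered = text.lower()
        for entity in entities:
            # Substring matching is a stand-in for real entity extraction.
            if entity.lower() in lowered:
                clusters[entity].append(doc_id)
    return dict(clusters)

# Hypothetical documents, for illustration only.
docs = {
    1: "Samsung announced the Galaxy S5 today.",
    2: "Apple and Samsung are back in court.",
    3: "Apple had no comment.",
}
print(cluster_by_entity(docs, ["Samsung", "Apple"]))
```

A document that mentions two entities lands in both clusters, which is one reason a graph layout can be a better fit for this kind of clustering than a strict partition.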
Other clustering algorithms take all of the content into account. This is what Quid is doing (http://www.quid.com). They are using both force-directed graphs (which is a layout technology) and clustering to add information to the topology of the graph itself. Different shapes of clusters mean different things: if a cluster is nice and round and compact, then all the articles are saying roughly the same thing. Spread-out clusters contain highly differentiated stories. Clusters in the center of the graph indicate central topics or bridging ideas. Distance between clusters shows how inter-related the stories are; closer means more inter-relationships.
This slide shows one cluster around bitcoin regulation. You can see that some of the articles take a different slant than others, but that they’re all related to the same core topic.
Individual articles further away from each other take a different slant on the core topic.
Dendrograms are another way to show clusters. You can see the politicians on the right, and how they relate naturally to each other via their communications.
Different concepts are along the top, again clustered by similarity through the dendrogram.
The frequency of each concept as uttered by each politician is shown in the heatmap. This allows for a nice visual grasp of some of the differences in the stances of the politicians.
This is also a really common type of visualization for gene expression, showing how different genes relate to each other.
This is a list of some of the easiest and most sophisticated free visualization tools out there.
No-code systems allow you to upload something like a CSV and graph from there.
The coding systems require you to do some coding, but are far more sophisticated in how they allow you to graph.
If you have some free time, spend some time in the D3 gallery – the breadth of visualizations is quite amazing.
There are two other ways to approach these visualizations.
One is to get a commercial toolkit. The list on the right is roughly ordered from top to bottom in terms of how “full” a package they are. For example, Tableau and JReport are all about visualization, whereas SAS and SPSS are full-blown statistics packages.
The other way is to get all your text from an off-the-shelf system that includes all the content, like a Social Marketing System, or Customer Experience Management System.
The TED talk is a really great example of using animations to tell a story.
Telling a story is, at the end of the day, what we’re all trying to accomplish. The goal of that story is to help make a point, to drive a change in behavior.
If you take anything away from this presentation, it should be that word clouds and pie charts aren’t the best way to tell that story and that there are myriad other ways to accomplish the storytelling.