Bestseller Analysis: Visualizing Fiction
Lynn Cherny
@arnicas
PyData Boston 2013
Language, Sex, Violence
(also spoilers)
TEXT
Today’s Books
BASED ON A PREVIOUS TALK:
http://blogger.ghostweather.com/2013/06/analysis-of-fiction-my-openvisconf-talk.html

THE VIDEO OF THAT TALK:
http://www.youtube.com/watch?v=f41U936WqPM

This talk goes into more technical detail and spends more time on topic analysis.

The IPython notebook of code samples for this lives here:
http://ghostweather.com/essays/talks/openvisconf/Pydata_Code.ipynb
http://www.economist.com/blogs/graphicdetail/2012/11/fifty-shades-data-visualisations
BY
Text Classification (Commonly)
• “Bag of words” – each document is considered a collection of words, independent of order
• Frequencies of certain words are used to identify the texts

Seems like this should work with sex scenes, right? Only so many body parts and behaviors, right?!
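A minimal sketch of the bag-of-words idea, using only the Python standard library. The tokenizing regex and the function name `bag_of_words` are assumptions made for illustration, not something taken from the talk's notebook.

```python
from collections import Counter
import re

def bag_of_words(text):
    """Count lowercase word tokens, ignoring the order they appear in."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

# Two "documents" with the same words in a different order
a = bag_of_words("He waves me toward the couch")
b = bag_of_words("Toward the couch he waves me")

print(a == b)        # word order does not matter to the bag
print(a["couch"])    # frequency of one word
```

A classifier built on this representation only ever sees the counts, which is why word order and sentence structure play no part in its decisions.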
Data                          Label
Estdsgfd fdsatreatret dfds    Yes
Dsrdsf drerear ewrewtrew      No
Reret retdrtd rewrewrtew      Yes
Dsfgdg fdsfd                  Yes

Algorithm
Train
Test
New data in the wild
Sex Scene Detection First Steps
1. Buy 50 Shades on Amazon, unlock text in Calibre, save as TXT file.
2. Cut up a doc into 500 “word” chunks using Python
Cutting up the document
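One way the chunking step could look, as a rough sketch: split on whitespace and regroup into runs of 500 tokens. The function name `chunk_words` and the whitespace definition of a "word" are assumptions; the notebook may well do it differently.

```python
def chunk_words(text, size=500):
    """Split text on whitespace and regroup it into chunks of `size` "words"."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

# Stand-in for the book text: 1200 dummy tokens
sample = " ".join("word%d" % i for i in range(1200))
chunks = chunk_words(sample)
print([len(c.split()) for c in chunks])  # [500, 500, 200]
```

The last chunk is simply whatever is left over, so it can be much shorter than 500 words.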
“Would you like to sit?” He waves me toward an L-shaped white leather couch.	

	

His office is way too big for just one man. In front of the floor-to-ceiling windows, there’s a
modern dark wood desk that six people could comfortably eat around. It matches the
coffee table by the couch. Everything else is white—ceiling, floors, and walls, except for the
wall by the door, where a mosaic of small paintings hang, thirty-six of them arranged in a
square.They are exquisite—a series of mundane, forgotten objects painted in such precise
detail they look like photographs. Displayed together, they are breathtaking.	

	

“A local artist. Trouton,” says Grey when he catches my gaze.

	

“They’re lovely. Raising the ordinary to extraordinary,” I murmur, distracted both by him
and the paintings. He cocks his head to one side and regards me intently.	

	

“I couldn’t agree more, Miss Steele,” he replies, his voice soft, and for some inexplicable
reason I find myself blushing.	

Sample of 50 Shades of Grey
Manual labeling suckage
http://www.deargrumpycat.com/
Outsourced to Mechanical Turk
WHAT’S A SEX SCENE,
ANYWAY?
Zara.com
http://www.ebay.com/itm/Adult-Sex-Toys-Tools-Handcuffs-Eye-mask-Neck-Band-Strap-Whip-Rope-/330845727274?pt=UK_Home_Garden_Celebrations_Occasions_ET&hash=item4d07f12a2a
Sexually Exxxplicit,
but still a
http://www.icts.uiowa.edu/sites/default/files/contract.jpg
How’d the raters do?
Sex Scenes
Steamy Scenes
Comparing to “Pornographic”…
Comparing:
On to the learning algorithm…
So, the training data:
- The text chunks
- The score the raters gave it (averaged) as “truth”

I started with Python’s NLTK (Natural Language Toolkit) and Naïve Bayes for classifying (working in an IPython notebook).
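A sketch of how averaged rater scores might be turned into "truth" labels for a classifier. The rating scale (0 to 2), the chunk ids, and the cutoff of 1.0 are all hypothetical; the slides do not say what scale or threshold was used.

```python
from statistics import mean

# Hypothetical ratings: chunk id -> scores from several raters (assumed 0-2 scale)
ratings = {
    "chunk_001": [0, 0, 1],
    "chunk_002": [2, 2, 1],
    "chunk_003": [1, 2, 2],
}

THRESHOLD = 1.0  # assumed cutoff, not from the talk

def label_chunks(ratings, threshold=THRESHOLD):
    """Average each chunk's ratings and map the average to 'pos' or 'neg'."""
    labels = {}
    for chunk_id, scores in ratings.items():
        avg = mean(scores)
        labels[chunk_id] = ("pos" if avg >= threshold else "neg", avg)
    return labels

labels = label_chunks(ratings)
for chunk_id, (label, avg) in sorted(labels.items()):
    print(chunk_id, label, round(avg, 2))
```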
Resources on NLTK Naïve Bayes
• The NLTK book chapter: http://nltk.googlecode.com/svn/trunk/doc/book/ch06.html
• Jacob Perkins’ example of sentiment analysis with NLTK: http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/
Perkins’ NLTK code for this…
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews  # needs nltk.download('movie_reviews')

def word_feats(words):
    # Bag-of-words features: every word present in the document maps to True
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in posids]
# First 3/4 of each class for training, the rest for testing
negcutoff = len(negfeats) * 3 // 4
poscutoff = len(posfeats) * 3 // 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]
print('train on %d instances, test on %d instances' % (len(trainfeats), len(testfeats)))
classifier = NaiveBayesClassifier.train(trainfeats)
print('accuracy:', nltk.classify.util.accuracy(classifier, testfeats))
classifier.show_most_informative_features()
His movie sentiment output
72% accuracy, trained on 1500 inputs.
My results on 50 Shades Sex Scenes
82% accuracy!
Previously with less “pos” data: not so great at 68%
“packet” (they use a lot of condoms)
Python’s sklearn (scikit-learn)
Lots of classifiers for sparse data like text!
http://scikit-learn.org/0.13/auto_examples/document_classification_20newsgroups.html
Using a lemmatizer step in the pipeline (to strip endings off words, since some fiction in my later samples was in present tense)

Pipelines in sklearn make it incredibly easy to run lots of experiments.

Fit the model, using training data and “target” answers (in this case, “50 Shades of Grey”)

Test the model on new data (in this case, “50 Shades Darker”). Check how it did against the answers.

Now we’re at 88%
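A rough sketch of what such a pipeline could look like with current scikit-learn. Everything specific here is an assumption: the toy sentences stand in for the rated book chunks, `crude_lemmas` is a toy suffix stripper standing in for a real lemmatizer, and the choice of `TfidfVectorizer` plus `SGDClassifier` is illustrative rather than the exact setup from the notebook.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

def crude_lemmas(text):
    """Tokenize and strip a few common endings (toy stand-in for a lemmatizer)."""
    out = []
    for token in re.findall(r"[a-z']+", text.lower()):
        for suffix in ("ing", "ed", "s"):
            if token.endswith(suffix) and len(token) > len(suffix) + 2:
                token = token[: -len(suffix)]
                break
        out.append(token)
    return out

# Hypothetical stand-ins: "book one" for training, "book two" for testing
train_texts = [
    "he kissed her slowly", "she kisses him again",
    "his lips brushed her neck", "they kissed in the elevator",
    "they signed the contract", "he reads the contract in his office",
    "she drives to the office", "the meeting started late",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]
test_texts = ["he is kissing her neck", "she signs papers at the office"]
test_labels = [1, 0]

model = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer=crude_lemmas)),  # custom analyzer does the "lemmatizing"
    ("clf", SGDClassifier(random_state=0)),
])
model.fit(train_texts, train_labels)      # fit on the first book
predicted = model.predict(test_texts)     # predict on the second book
accuracy = model.score(test_texts, test_labels)
print(predicted, accuracy)
```

Swapping the classifier or the vectorizer is a one-line change in the `Pipeline` list, which is what makes running many experiments cheap.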
Interpreting the results…
Let’s make a tool!
Demo: http://www.ghostweather.com/essays/talks/openvisconf/text_scores/rollover.html
Really amazing P.S. here…
I paid for coding of a bunch of fan-fiction for sex scenes too, and fed it into the sklearn SGD classifier.

(Note that 50 Shades started life as Twilight fanfic.)

97% accuracy achieved!*

*cross-validating with entire set, not just 50 Shades books.
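For anyone unfamiliar with cross-validation, here is a small sketch using scikit-learn's `cross_val_score`. The twelve toy sentences, the 3-fold setting, and the `CountVectorizer` front end are assumptions for illustration; the import path is the current one (`sklearn.model_selection`), which is newer than the 0.13 release linked in these slides.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical labeled chunks pooled from all sources (books plus fan fiction)
texts = [
    "he kissed her slowly", "she kisses him again",
    "his lips brushed her neck", "they kissed in the elevator",
    "her lips found his", "he kissed her neck again",
    "they signed the contract", "he reads the contract in his office",
    "she drives to the office", "the meeting started late",
    "he signed the papers", "she reads her email at the office",
]
labels = [1] * 6 + [0] * 6

model = make_pipeline(CountVectorizer(), SGDClassifier(random_state=0))

# Each fold is held out once while the model trains on the other folds
scores = cross_val_score(model, texts, labels, cv=3)
print(scores, scores.mean())
```

Because every chunk is used for testing exactly once, the averaged score is less dependent on one lucky train/test split than a single held-out book would be.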
TOPIC ANALYSIS
Moving on to Dan Brown!
Almost naked, Silas hurled his pale body down the staircase. He knew he
had been betrayed, but by whom? When he reached the foyer, more
officers were surging through the front door. Silas turned the other way
and dashed deeper into the residence hall. The women's entrance. Every
Opus Dei building has one. Winding down narrow hallways, Silas snaked
through a kitchen, past terrified workers, who left to avoid the naked
albino as he knocked over bowls and silverware, bursting into a dark
hallway near the boiler room. He now saw the door he sought, an exit light
gleaming at the end.
Running full speed through the door out into the rain, Silas leapt off the
low landing, not seeing the officer coming the other way until it was too
late. The two men collided, Silas's broad, naked shoulder grinding into the
man's sternum with crushing force. He drove the officer backward onto the
pavement, landing hard on top of him. The officer's gun clattered away.
Silas could hear men running down the hall shouting. Rolling, he grabbed
the loose gun just as the officers emerged. A shot rang out on the stairs,
and Silas felt a searing pain below his ribs. Filled with rage, he opened
fire at all three officers, their blood spraying.
A dark shadow loomed behind, coming out of nowhere. The angry
hands that grabbed at his bare shoulders felt as if they were infused with
the power of the devil himself. The man roared in his ear. SILAS, NO!
Silas spun and fired. Their eyes met. Silas was already screaming in
Chapter 96
DaVinci Code
Blei (2011)
Resources for Topic Analysis
• David Mimno’s Java Mallet is “the one everyone uses”:
- http://mallet.cs.umass.edu/index.php
- The R mallet package is rather nice, too: http://www.cs.princeton.edu/~mimno/R/
- This is a GUI wrapper for mallet that outputs nice csv and html pages: https://code.google.com/p/topic-modeling-tool/
• Some pure python (and C) implementations (toy code, primarily) are listed on Blei’s website: http://www.cs.princeton.edu/~blei/topicmodeling.html
Topic Modeling Tool (GUI)
Post run…
Pros/Cons vs CMD-Line Mallet
Pros
• Lets you specify a stopword file
• Produces csv and html output in a neat dir structure
• Has a GUI (simpler to just get going)
Cons
• Runs with defaults, so no optimize-interval or other cmd line options
• No diagnostic output (a command-line option)
• Not super-well doc’d

Tutorial on cmd line usage: http://programminghistorian.org/lessons/topic-modeling-and-mallet
2 of the 3 CSV Output files
Notice a horrible thing here:
My notebook has lots of code to
process these files…
A few pandas stats…
107 chapters, 10 topics “requested”…
Topic proportion distribution…
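A small sketch of the kind of pandas summary meant here. The long-format table (one row per chapter and topic with a `proportion` column) is a hypothetical layout chosen for illustration; the tool's real CSV files are laid out differently and need reshaping first.

```python
import pandas as pd

# Hypothetical long-format data: one row per (chapter, topic) pair
rows = [
    {"chapter": 1, "topic": 0, "proportion": 0.70},
    {"chapter": 1, "topic": 1, "proportion": 0.30},
    {"chapter": 2, "topic": 0, "proportion": 0.20},
    {"chapter": 2, "topic": 1, "proportion": 0.80},
    {"chapter": 3, "topic": 0, "proportion": 0.55},
    {"chapter": 3, "topic": 1, "proportion": 0.45},
]
df = pd.DataFrame(rows)

n_chapters = df["chapter"].nunique()
n_topics = df["topic"].nunique()
print(n_chapters, "chapters,", n_topics, "topics")

# Dominant topic per chapter: the row with the largest proportion in each group
top = df.loc[df.groupby("chapter")["proportion"].idxmax(), ["chapter", "topic"]]
print(top)

# Distribution of proportions for each topic (count, mean, quartiles, ...)
summary = df.groupby("topic")["proportion"].describe()
print(summary)
```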
The default HTML output is a little lacking…
A bipartite graph of chapters and topics is an obvious vis method….
Network JSON in D3.js
Making the objects:
Make objects of nodes, links, and any extra data values on each that you want…
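A sketch of building that nodes-and-links structure in Python before handing it to D3. The input triples, the field names (`name`, `type`, `value`), and the `min_weight` filter are assumptions; links refer to nodes by their index in the `nodes` list, a form the D3 force layout accepts.

```python
import json

# Hypothetical (chapter, topic, weight) triples from the topic model output
doc_topics = [
    ("Chapter 1", "topic 0", 0.7),
    ("Chapter 1", "topic 1", 0.3),
    ("Chapter 2", "topic 1", 0.8),
]

def build_network(doc_topics, min_weight=0.0):
    """Return {'nodes': [...], 'links': [...]} for a bipartite chapter/topic graph."""
    nodes, index, links = [], {}, []

    def node_id(name, kind):
        # Register each node once and return its position in the nodes list
        if name not in index:
            index[name] = len(nodes)
            nodes.append({"name": name, "type": kind})
        return index[name]

    for chapter, topic, weight in doc_topics:
        if weight < min_weight:
            continue  # drop weak links to thin out the hairball
        links.append({
            "source": node_id(chapter, "chapter"),
            "target": node_id(topic, "topic"),
            "value": weight,
        })
    return {"nodes": nodes, "links": links}

network = build_network(doc_topics)
print(json.dumps(network, indent=2))
```

Raising `min_weight` is one cheap way to make the graph readable, since most chapters carry a small amount of nearly every topic.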
Let’s try a hairball!
Improving the network’s UI…
Adding strength, highlight effect, another variable, and informative tooltips.	

Demo: http://www.ghostweather.com/essays/talks/openvisconf/topic_docs_network/index_better.html
Tricks in D3 – scales:
Maybe I need One More Tool. Any word relations of interest?
Let’s try another hairball…
Demo: http://www.ghostweather.com/essays/talks/openvisconf/topic_words_network/index.html
Small “constellations” show shared words (an accident that’s useful!)
Filtered to only the “exciting” nodes…
Another tool: DaVinci Code topics to chapters mapping
“Excitement” rating color scale avg by chapter, ordered (obviously)
Topics (48ish) per chapter (108)
Chapter 1… to Chapter 108
Ah, but since it’s svg/d3…	

 var chart = chart.append("g").attr("translate","0," +
y).attr("transform","rotate(90 600 600)");	

But, maybe I need chapter summaries…. So I can relate them to the topics?
Add some topic-tooltips and fade-outs….

Demo: http://www.ghostweather.com/essays/talks/openvisconf/topic_arc_diagram/TopicArc.html
But what did this show?
Some topics are just neither exciting nor dull – topic clustering (as I did it) had little to do with action scenes. It’s slightly helpful for topics, though :)
These nodes are shaded from gray (dull) to red (exciting)
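The gray-to-red shading can be precomputed in Python rather than in the browser. This is only a sketch: the 1 to 5 rating range, the exact gray and red endpoints, and the function names are assumptions, and a D3 linear color scale would do the equivalent job on the JavaScript side.

```python
def lerp_color(t, start=(128, 128, 128), end=(255, 0, 0)):
    """Linearly interpolate between two RGB colors; t is clamped to [0, 1]."""
    t = max(0.0, min(1.0, t))
    rgb = [round(s + (e - s) * t) for s, e in zip(start, end)]
    return "#%02x%02x%02x" % tuple(rgb)

def excitement_color(rating, lo=1.0, hi=5.0):
    """Map an excitement rating on an assumed lo..hi scale to a hex color."""
    return lerp_color((rating - lo) / (hi - lo))

print(excitement_color(1))  # dullest end of the scale: #808080
print(excitement_color(5))  # most exciting end of the scale: #ff0000
```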
Coming soon…
Color words in texts by topic assignment, to help tune the stopwords and set up next steps:
•  Pre-process text for just the verbs?
•  Clean out a class of proper names
•  Extract sentences containing the topic words to help describe the topics/texts better
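The last of those next steps could be prototyped in a few lines. This sketch assumes a naive sentence splitter (split after `.`, `!`, or `?`) and exact word matching; the function name and the example topic words are hypothetical.

```python
import re

def sentences_with_topic_words(text, topic_words):
    """Return the sentences that contain at least one of the topic words."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    wanted = {w.lower() for w in topic_words}
    hits = []
    for sentence in sentences:
        tokens = set(re.findall(r"[a-z']+", sentence.lower()))
        if tokens & wanted:
            hits.append(sentence)
    return hits

text = "Silas spun and fired. Their eyes met. The officer's gun clattered away."
print(sentences_with_topic_words(text, ["gun", "fired"]))
# ['Silas spun and fired.', "The officer's gun clattered away."]
```

A real run would want a proper sentence tokenizer (NLTK has one), since abbreviations and dialogue punctuation trip up a regex split.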
Wrapping up…
• Python is great for the data munging and analysis
• Some analysis needs serious vis support
• Save yourself some work in javascript using Python before you get into js
• D3 is a great tool for iterative interactive exploration of your analysis results
THANKS!
@arnicas, Lynn@ghostweather.com

My thanks to…
Luminosity for help with Dan Brown summaries, Jim Vallandingham (@vlandham) for network parameter and coffeescript help.

Hey, I am a consultant for data analysis and visualization. Look me up!
A Few More References
• Applied Machine Learning with Scikit-Learn: http://scikit-learn.github.io/scikit-learn-tutorial/index.html
• Naïve Bayes for text in Scikit-Learn: http://scikit-learn.org/stable/modules/naive_bayes.html#naive-bayes
• Stochastic Gradient Descent in Scikit-Learn: http://scikit-learn.org/0.13/modules/sgd.html
• Nice tutorial overview of working with text data: http://scikit-learn.github.io/scikit-learn-tutorial/working_with_text_data.html
• Bearcart by Rob Story – Rickshaw timeseries graphs from python pandas data structures in 4 lines (https://github.com/wrobstory/bearcart)
• LDA topic modeling tool with UI - https://code.google.com/p/topic-modeling-tool/
• Scott Weingart’s nice overview of LDA Topic Modeling in Digital Humanities: http://www.scottbot.net/HIAL/?p=221
• Elijah Meeks’ lovely set of articles on LDA & Digital Humanities vis: https://dhs.stanford.edu/comprehending-the-digital-humanities/
• Jim Vallandingham’s tooltip code and a great demo/tutorial: http://flowingdata.com/2012/08/02/how-to-make-an-interactive-network-visualization/
• Rickshaw for timeseries graphs: https://github.com/shutterstock/rickshaw

Weitere ähnliche Inhalte

Ähnlich wie Bestseller Analysis: Visualization Fiction (for PyData Boston 2013)

machine learning and art hack day
machine learning and art hack daymachine learning and art hack day
machine learning and art hack dayjtoy
 
Stephen Downes on Personal Learning
Stephen Downes on Personal LearningStephen Downes on Personal Learning
Stephen Downes on Personal LearningAlec Couros
 
HarambeeNet: Data by the people, for the people
HarambeeNet: Data by the people, for the peopleHarambeeNet: Data by the people, for the people
HarambeeNet: Data by the people, for the peopleMichael Bernstein
 
Respond_HomeworkAstronomy.docxRespond to forum. 150 word mini.docx
Respond_HomeworkAstronomy.docxRespond to forum. 150 word mini.docxRespond_HomeworkAstronomy.docxRespond to forum. 150 word mini.docx
Respond_HomeworkAstronomy.docxRespond to forum. 150 word mini.docxronak56
 
The Science Of Social Networks
The Science Of Social NetworksThe Science Of Social Networks
The Science Of Social NetworksEhren Foss
 
Just the basics_strata_2013
Just the basics_strata_2013Just the basics_strata_2013
Just the basics_strata_2013Ken Mwai
 
Communication 200 Social Science
Communication 200 Social ScienceCommunication 200 Social Science
Communication 200 Social ScienceTiffini Travis
 
Semantic Web: A web that is not the Web
Semantic Web: A web that is not the WebSemantic Web: A web that is not the Web
Semantic Web: A web that is not the WebBruce Esrig
 
Deep Learning from Scratch - Building with Python from First Principles.pdf
Deep Learning from Scratch - Building with Python from First Principles.pdfDeep Learning from Scratch - Building with Python from First Principles.pdf
Deep Learning from Scratch - Building with Python from First Principles.pdfYungSang1
 
Kill All Mutants! (Intro to Mutation Testing), Code Europe (Poland), 2021
Kill All Mutants! (Intro to Mutation Testing), Code Europe (Poland), 2021Kill All Mutants! (Intro to Mutation Testing), Code Europe (Poland), 2021
Kill All Mutants! (Intro to Mutation Testing), Code Europe (Poland), 2021Dave Aronson
 
Retrospecting our Retrospectives
Retrospecting our RetrospectivesRetrospecting our Retrospectives
Retrospecting our RetrospectivesJessica DeVita
 
Huxley and the Flying Robot Monkeys
Huxley and the Flying Robot MonkeysHuxley and the Flying Robot Monkeys
Huxley and the Flying Robot MonkeysSean Moubry
 
Writing A Contrast Essay.pdf
Writing A Contrast Essay.pdfWriting A Contrast Essay.pdf
Writing A Contrast Essay.pdfLaura Cappabianca
 
Everything you always wanted to know about psychology and technical communica...
Everything you always wanted to know about psychology and technical communica...Everything you always wanted to know about psychology and technical communica...
Everything you always wanted to know about psychology and technical communica...Chris Atherton @finiteattention
 
Easy Topics To Write An Essay On
Easy Topics To Write An Essay OnEasy Topics To Write An Essay On
Easy Topics To Write An Essay OnSusan Souza
 
Keepler Data Tech | Entendiendo tus propios modelos predictivos
Keepler Data Tech | Entendiendo tus propios modelos predictivosKeepler Data Tech | Entendiendo tus propios modelos predictivos
Keepler Data Tech | Entendiendo tus propios modelos predictivosKeepler Data Tech
 

Ähnlich wie Bestseller Analysis: Visualization Fiction (for PyData Boston 2013) (20)

Online Learning
Online LearningOnline Learning
Online Learning
 
machine learning and art hack day
machine learning and art hack daymachine learning and art hack day
machine learning and art hack day
 
Stephen Downes on Personal Learning
Stephen Downes on Personal LearningStephen Downes on Personal Learning
Stephen Downes on Personal Learning
 
HarambeeNet: Data by the people, for the people
HarambeeNet: Data by the people, for the peopleHarambeeNet: Data by the people, for the people
HarambeeNet: Data by the people, for the people
 
Respond_HomeworkAstronomy.docxRespond to forum. 150 word mini.docx
Respond_HomeworkAstronomy.docxRespond to forum. 150 word mini.docxRespond_HomeworkAstronomy.docxRespond to forum. 150 word mini.docx
Respond_HomeworkAstronomy.docxRespond to forum. 150 word mini.docx
 
The Science Of Social Networks
The Science Of Social NetworksThe Science Of Social Networks
The Science Of Social Networks
 
Just the basics_strata_2013
Just the basics_strata_2013Just the basics_strata_2013
Just the basics_strata_2013
 
50 ton of Backdoors
50 ton of Backdoors50 ton of Backdoors
50 ton of Backdoors
 
Communication 200 Social Science
Communication 200 Social ScienceCommunication 200 Social Science
Communication 200 Social Science
 
Progressing and enhancing
Progressing and enhancingProgressing and enhancing
Progressing and enhancing
 
Semantic Web: A web that is not the Web
Semantic Web: A web that is not the WebSemantic Web: A web that is not the Web
Semantic Web: A web that is not the Web
 
Deep Learning from Scratch - Building with Python from First Principles.pdf
Deep Learning from Scratch - Building with Python from First Principles.pdfDeep Learning from Scratch - Building with Python from First Principles.pdf
Deep Learning from Scratch - Building with Python from First Principles.pdf
 
Kill All Mutants! (Intro to Mutation Testing), Code Europe (Poland), 2021
Kill All Mutants! (Intro to Mutation Testing), Code Europe (Poland), 2021Kill All Mutants! (Intro to Mutation Testing), Code Europe (Poland), 2021
Kill All Mutants! (Intro to Mutation Testing), Code Europe (Poland), 2021
 
Retrospecting our Retrospectives
Retrospecting our RetrospectivesRetrospecting our Retrospectives
Retrospecting our Retrospectives
 
Huxley and the Flying Robot Monkeys
Huxley and the Flying Robot MonkeysHuxley and the Flying Robot Monkeys
Huxley and the Flying Robot Monkeys
 
Writing A Contrast Essay.pdf
Writing A Contrast Essay.pdfWriting A Contrast Essay.pdf
Writing A Contrast Essay.pdf
 
Everything you always wanted to know about psychology and technical communica...
Everything you always wanted to know about psychology and technical communica...Everything you always wanted to know about psychology and technical communica...
Everything you always wanted to know about psychology and technical communica...
 
Easy Topics To Write An Essay On
Easy Topics To Write An Essay OnEasy Topics To Write An Essay On
Easy Topics To Write An Essay On
 
Keepler Data Tech | Entendiendo tus propios modelos predictivos
Keepler Data Tech | Entendiendo tus propios modelos predictivosKeepler Data Tech | Entendiendo tus propios modelos predictivos
Keepler Data Tech | Entendiendo tus propios modelos predictivos
 
sent_analysis_report
sent_analysis_reportsent_analysis_report
sent_analysis_report
 

Kürzlich hochgeladen

Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security ObservabilityGlenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security Observabilityitnewsafrica
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesBernd Ruecker
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...itnewsafrica
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesManik S Magar
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterMydbops
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxfnnc6jmgwh
 

Kürzlich hochgeladen (20)

Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security ObservabilityGlenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architectures
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
 

Bestseller Analysis: Visualization Fiction (for PyData Boston 2013)

  • 1. Bestseller Analysis: Visualizing Fiction Lynn Cherny @arnicas PyData Boston 2013
  • 4. THEVIDEO OF THAT TALK: http://blogger.ghostweather.com/2013/06/analysis-of-fiction- my-openvisconf-talk.html http://www.youtube.com/watch? v=f41U936WqPM BASED ON A PREVIOUS TALK: This talk focuses on some more technical details and more on topic analysis. The IPython notebook of code samples for this lives here: http://ghostweather.com/essays/talks/openvisconf/Pydata_Code.ipynb
  • 6. Text Classification (Commonly) § “Bag of words” – each document is considered a collection of words, independent of order § Frequencies of certain words are used to identify the texts Seems like this should work with sex scenes, right? Only so many body parts and behaviors, right?!
  • 7. Data Label Estdsgfd fdsatreatret dfds Yes Dsrdsf drerear ewrewtrew No Reret retdrtd rewrewrtew Yes Dsfgdg fdsfd Yes Algorithm Train Test New data in the wild
  • 8. Sex Scene Detection First Steps 1.  Buy 50 Shades on Amazon, unlock text in Calibre, save as TXT file. 2.  Cut up a doc into 500 “word” chunks using Python
  • 9. Cutting up the document
  • 10. “Would you like to sit?” He waves me toward an L-shaped white leather couch. His office is way too big for just one man. In front of the floor-to-ceiling windows, there’s a modern dark wood desk that six people could comfortably eat around. It matches the coffee table by the couch. Everything else is white—ceiling, floors, and walls, except for the wall by the door, where a mosaic of small paintings hang, thirty-six of them arranged in a square.They are exquisite—a series of mundane, forgotten objects painted in such precise detail they look like photographs. Displayed together, they are breathtaking. “A local artist.Trouton,” says Grey when he catches my gaze. “They’re lovely. Raising the ordinary to extraordinary,” I murmur, distracted both by him and the paintings. He cocks his head to one side and regards me intently. “I couldn’t agree more, Miss Steele,” he replies, his voice soft, and for some inexplicable reason I find myself blushing. Sample of 50 Shades of Grey
  • 13. WHAT’S A SEX SCENE, ANYWAY?
  • 16. Sexually Exxxplicit, but still a http://www.icts.uiowa.edu/sites/default/files/contract.jpg
  • 17.
  • 18. How’d the raters do? Sex Scenes Steamy Scenes
  • 21. On to the learning algorithm… So, the training data: - The text chunks - The score the raters gave it (averaged) as “truth” I started with Python’s NLTK (Natural Language Toolkit) and Naïve Bayes for classifying (working in an ipython notebook).
  • 22. Resources on NLTK Naïve Bayes § The NLTK book chapter: http://nltk.googlecode.com/svn/trunk/doc/ book/ch06.html § Jacob Perkins’ example of sentiment analysis with NLTK: http://streamhacker.com/2010/05/10/text- classification-sentiment-analysis-naive-bayes- classifier/
  • 23. Perkins’ NLTK code for this…
        import nltk.classify.util
        from nltk.classify import NaiveBayesClassifier
        from nltk.corpus import movie_reviews

        def word_feats(words):
            # Bag-of-words features: each word present in the text maps to True.
            return {word: True for word in words}

        negids = movie_reviews.fileids('neg')
        posids = movie_reviews.fileids('pos')

        # One (features, label) pair per review file.
        negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
        posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

        # Train on the first 3/4 of each class, test on the rest.
        negcutoff = len(negfeats) * 3 // 4
        poscutoff = len(posfeats) * 3 // 4
        trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
        testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]
        print('train on %d instances, test on %d instances' % (len(trainfeats), len(testfeats)))

        classifier = NaiveBayesClassifier.train(trainfeats)
        print('accuracy:', nltk.classify.util.accuracy(classifier, testfeats))
        classifier.show_most_informative_features()
  • 24. His movie sentiment output: 72% accuracy, trained on 1,500 inputs.
  • 25. My results on 50 Shades sex scenes: 82% accuracy!
  • 26. Previously, with less “pos” data: not so great at 68%. Note “packet” (they use a lot of condoms).
  • 27. Python’s sklearn (scikit-learn): lots of classifiers for sparse data like text! http://scikit-learn.org/0.13/auto_examples/document_classification_20newsgroups.html
  • 28. Using a lemmatizer step in the pipeline (to strip endings off words, since some fiction in my later samples was in present tense). Pipelines in sklearn make it incredibly easy to run lots of experiments. Fit the model, using training data and “target” answers (in this case, “50 Shades of Grey”). Test the model on new data (in this case, “50 Shades Darker”). Check how it did against the answers. Now we’re at 88%.
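The fit/test/check workflow above can be sketched with a scikit-learn `Pipeline`. This is an illustrative sketch only: the toy documents and labels are invented, the lemmatizer step is left out to keep the example self-contained, and the choice of `TfidfVectorizer` feeding `SGDClassifier` is an assumption rather than the pipeline from the talk.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Toy stand-ins for rated text chunks (hypothetical; 1 = sex scene, 0 = not).
train_docs = [
    "he kissed her neck and she gasped",
    "she signed the contract at the office desk",
    "his hands moved over her skin and she gasped",
    "they drove to the airport in the rain",
]
train_labels = [1, 0, 1, 0]

# "New" data the model has not seen, with its own answers for checking.
test_docs = ["she gasped as he kissed her", "the office desk was covered in paper"]
test_labels = [1, 0]

model = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True)),   # text -> sparse weighted word counts
    ("clf", SGDClassifier(random_state=0)),       # linear classifier trained with SGD
])

model.fit(train_docs, train_labels)               # fit on training data and targets
predicted = model.predict(test_docs)              # test on new data
accuracy = model.score(test_docs, test_labels)    # check against the answers
print(list(predicted), accuracy)
```

Because every step lives in one pipeline object, swapping the vectorizer or classifier for another experiment is a one-line change, and the same preprocessing is applied to training and test text.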
  • 29. Interpreting the results… Let’s make a tool! Demo: http://www.ghostweather.com/essays/talks/openvisconf/text_scores/rollover.html
  • 30. Really amazing P.S. here… I paid for coding of a bunch of fan-fiction for sex scenes too, and fed them into the sklearn SGD classifier. (Note that 50 Shades started life as Twilight fanfic.) 97% accuracy achieved!* (*Cross-validating with the entire set, not just the 50 Shades books.)
  • 31. TOPIC ANALYSIS Moving on to Dan Brown!
  • 32. Almost naked, Silas hurled his pale body down the staircase. He knew he had been betrayed, but by whom? When he reached the foyer, more officers were surging through the front door. Silas turned the other way and dashed deeper into the residence hall. The women's entrance. Every Opus Dei building has one. Winding down narrow hallways, Silas snaked through a kitchen, past terrified workers, who left to avoid the naked albino as he knocked over bowls and silverware, bursting into a dark hallway near the boiler room. He now saw the door he sought, an exit light gleaming at the end. Running full speed through the door out into the rain, Silas leapt off the low landing, not seeing the officer coming the other way until it was too late. The two men collided, Silas's broad, naked shoulder grinding into the man's sternum with crushing force. He drove the officer backward onto the pavement, landing hard on top of him. The officer's gun clattered away. Silas could hear men running down the hall shouting. Rolling, he grabbed the loose gun just as the officers emerged. A shot rang out on the stairs, and Silas felt a searing pain below his ribs. Filled with rage, he opened fire at all three officers, their blood spraying. A dark shadow loomed behind, coming out of nowhere. The angry hands that grabbed at his bare shoulders felt as if they were infused with the power of the devil himself. The man roared in his ear. SILAS, NO! Silas spun and fired. Their eyes met. Silas was already screaming in [Chapter 96, DaVinci Code]
  • 34. Resources for Topic Analysis
        - David Mimno’s Java Mallet is “the one everyone uses”: http://mallet.cs.umass.edu/index.php
        - The R mallet package is rather nice, too: http://www.cs.princeton.edu/~mimno/R/
        - This is a GUI wrapper for Mallet that outputs nice CSV and HTML pages: https://code.google.com/p/topic-modeling-tool/
        - Some pure Python (and C) implementations (toy code, primarily) are listed on Blei’s website: http://www.cs.princeton.edu/~blei/topicmodeling.html
  • 37. Pros/Cons vs CMD-Line Mallet
        Pros
        - Allows specifying a stopword file
        - Produces CSV and HTML output in a neat dir structure
        - Has a GUI (simpler to just get going)
        Cons
        - Runs with defaults, so no optimize-interval or other cmd-line options
        - No diagnostic output (a command-line option)
        - Not super-well doc’d
        Tutorial on cmd-line usage: http://programminghistorian.org/lessons/topic-modeling-and-mallet
  • 38. 2 of the 3 CSV Output files
  • 39. Notice a horrible thing here:
  • 40. My notebook has lots of code to process these files…
  • 41. A few pandas stats… 107 chapters, 10 topics “requested”… Topic proportion distribution…
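A few stats of this kind can be sketched with pandas as below. The table is hypothetical: real values would come from the tool's CSV output, whose exact column layout is not shown on the slides, so the column and index names here are assumptions.

```python
import pandas as pd

# Hypothetical chapter-by-topic proportions (each row sums to 1).
doc_topics = pd.DataFrame(
    {
        "topic_0": [0.70, 0.10, 0.25],
        "topic_1": [0.20, 0.60, 0.25],
        "topic_2": [0.10, 0.30, 0.50],
    },
    index=["chapter_1", "chapter_2", "chapter_3"],
)

dominant = doc_topics.idxmax(axis=1)   # strongest topic for each chapter
mean_share = doc_topics.mean()         # average proportion of each topic

print(dominant.value_counts())         # how many chapters each topic dominates
print(mean_share.round(2))
print(doc_topics.describe())           # distribution of topic proportions
```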
  • 42. The default HTML output is a little lacking… A bipartite graph of chapters and topics is an obvious vis method….
  • 44. Making the objects: Make objects of nodes, links, and any extra data values on each that you want…
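Building the node and link objects in Python before handing them to D3 might look like this sketch. The input data, the `min_weight` cutoff, and the field names (`name`, `type`, `source`, `target`, `value`) are illustrative assumptions; links refer to nodes by their position in the `nodes` list.

```python
import json

# Hypothetical chapter -> {topic: proportion} data.
chapter_topics = {
    "chapter_1": {"topic_0": 0.7, "topic_1": 0.2},
    "chapter_2": {"topic_1": 0.6, "topic_2": 0.3},
}


def build_graph(chapter_topics, min_weight=0.25):
    """Build a bipartite node/link structure, keeping links at or above min_weight."""
    nodes, index, links = [], {}, []

    def node_id(name, kind):
        # Register each node once and return its position in the nodes list.
        if name not in index:
            index[name] = len(nodes)
            nodes.append({"name": name, "type": kind})
        return index[name]

    for chapter, topics in chapter_topics.items():
        source = node_id(chapter, "chapter")
        for topic, weight in topics.items():
            if weight >= min_weight:
                target = node_id(topic, "topic")
                links.append({"source": source, "target": target, "value": weight})
    return {"nodes": nodes, "links": links}


graph = build_graph(chapter_topics)
print(json.dumps(graph, indent=2))  # JSON that a D3 page could load
```

Filtering weak links on the Python side keeps the JSON small and means the JavaScript only has to draw, not decide what to keep.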
  • 45. Let’s try a hairball!
  • 46. Improving the network’s UI… Adding strength, highlight effect, another variable, and informative tooltips. Demo: http://www.ghostweather.com/essays/talks/openvisconf/topic_docs_network/index_better.html
  • 47. Tricks in D3 – scales:
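For readers coming from Python, the idea behind a linear scale is a mapping from a data domain onto a visual range. The sketch below illustrates that idea in Python; it is not the D3 code from the slide, and the function name and the link-width use case are hypothetical.

```python
def linear_scale(domain, output_range):
    """Return a function mapping values in `domain` linearly onto `output_range`."""
    d0, d1 = domain
    r0, r1 = output_range

    def scale(value):
        # Linear interpolation between the two ends of the output range.
        return r0 + (value - d0) * (r1 - r0) / (d1 - d0)

    return scale


# Hypothetical use: map a topic proportion in [0, 1] to a link width of 1 to 10 pixels.
link_width = linear_scale((0.0, 1.0), (1.0, 10.0))
print(link_width(0.5))  # → 5.5
```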
  • 48. Maybe I need One More Tool. Any word relations of interest? Let’s try another hairball… Demo: http://www.ghostweather.com/essays/talks/openvisconf/topic_words_network/index.html
  • 49. Small “constellations” show shared words (an accident that’s useful!) Filtered to only the “exciting” nodes…
  • 50. Another tool: DaVinci Code topics-to-chapters mapping. (Figure labels: “Excitement” rating color scale, avg by chapter, ordered (obviously); Topics (48ish) per chapter (108); Chapter 1… to Chapter 108)
  • 51. Ah, but since it’s svg/d3… var chart = chart.append("g").attr("translate","0," + y).attr("transform","rotate(90 600 600)"); But, maybe I need chapter summaries…. So I can relate them to the topics?
  • 52. Add some topic-tooltips and fade-outs…. Demo: http://www.ghostweather.com/essays/talks/openvisconf/topic_arc_diagram/TopicArc.html
  • 53. But what did this show? Some topics are just neither exciting nor dull – topic clustering (as I did it) had little to do with action scenes. It’s slightly helpful for topics, though :) (These nodes are shaded from gray (dull) to red (exciting).)
  • 54. Coming soon… Color words in texts by topic assignment, to help tune the stopwords and set up next steps:
        - Pre-process text for just the verbs?
        - Clean out a class of proper names
        - Extract sentences containing the topic words to help describe the topics/texts better
  • 55. Wrapping up…
        - Python is great for the data munging and analysis
        - Some analysis needs serious vis support
        - Save yourself some work in JavaScript by using Python before you get into JS
        - D3 is a great tool for iterative interactive exploration of your analysis results
  • 56. THANKS! @arnicas, Lynn@ghostweather.com. My thanks to… Luminosity for help with Dan Brown summaries, and Jim Vallandingham (@vlandham) for network parameter and CoffeeScript help. Hey, I am a consultant for data analysis and visualization. Look me up!
  • 57. A Few More References
        - Applied Machine Learning with Scikit-Learn: http://scikit-learn.github.io/scikit-learn-tutorial/index.html
        - Naïve Bayes for text in Scikit-Learn: http://scikit-learn.org/stable/modules/naive_bayes.html#naive-bayes
        - Stochastic Gradient Descent in Scikit-Learn: http://scikit-learn.org/0.13/modules/sgd.html
        - Nice tutorial overview of working with text data: http://scikit-learn.github.io/scikit-learn-tutorial/working_with_text_data.html
        - Bearcart by Rob Story, Rickshaw timeseries graphs from a Python pandas data structure in 4 lines: https://github.com/wrobstory/bearcart
        - LDA topic modeling tool with UI: https://code.google.com/p/topic-modeling-tool/
        - Scott Weingart’s nice overview of LDA Topic Modeling in Digital Humanities: http://www.scottbot.net/HIAL/?p=221
        - Elijah Meeks’ lovely set of articles on LDA & Digital Humanities vis: https://dhs.stanford.edu/comprehending-the-digital-humanities/
        - Jim Vallandingham’s tooltip code and a great demo/tutorial: http://flowingdata.com/2012/08/02/how-to-make-an-interactive-network-visualization/
        - Rickshaw for timeseries graphs: https://github.com/shutterstock/rickshaw