Workshop:
Sentiment Analysis with
Python
Rob Fahey robfahey@fuji.waseda.jp @robfahey
Data Science Week at Waseda, January 2019
Sentiment Analysis
Also called “Tone Analysis” (Grimmer & Stewart 2013)
or “Opinion Mining” (Dave, Lawrence & Pennock 2003)
Whatever you call it, the question it aims to answer is
always the same:
“How does it make you feel?”
THE OBJECTIVE
• In the Internet age, humans create and publish billions of
pieces of content (text, movies, images etc.) every single day.
• Many of those data express a sentiment about a subject of some kind.
• By selecting data related to a subject (a person, a country, a
brand, etc.), we can measure public sentiment in a very detailed
way.
• We can even see how sentiment changes minute-by-minute, or
day-by-day – giving us unprecedented insights into political
trends, marketing campaigns or financial market movements.
THE CHALLENGE
• Sentiment Analysis is easy for humans, but hard for computers.
• Humans: can process complex texts, images or videos with an
understanding of cultural and social contexts, allowing us to
quickly and naturally judge the sentiment or emotion being
expressed.
• Computers: can count things really, really fast.
• Sentiment Analysis methodologies all try to overcome the
weaknesses of computers (no context, no understanding) by
using their strengths (counting very fast!).
TWO APPROACHES
UNSUPERVISED METHODS (No Training Data Required)
• Dictionary / Lexicon Methods
• Word Embeddings
SUPERVISED METHODS (Requires Training Data)
• Classification Algorithms
• Aggregate Algorithms
HOW A MACHINE LEARNS
• To carry out “Machine Learning”, the machine needs something
to learn from.
• In dictionary approaches, you teach the computer a lexicon –
a set of words that are associated with different sentiments.
• This approach can be improved (or at least complicated) by using
techniques like word embeddings, which try to estimate the sentiment of
unknown words by seeing how frequently they occur in proximity to
known words;
• Or by trying to consider the grammatical context in which a word
appears.
great +1
awful -1
HOW A MACHINE LEARNS (2)
• In supervised approaches, the computer instead learns from a
set of sample data which you have categorized by hand, using
human coding.
• There are lots of different algorithms and approaches for supervised
learning, but they all have this in common – you need to create training
data first.
• The algorithms try to learn the patterns which are associated with each
sentiment.
“This movie was terrible - why would Brad Pitt agree to star
in this rubbish? It’s not like he needs the money.”
Negative
“Just had a great time at the cinema, what a fantastic movie!
I don’t want to ruin the ending but it’s a crazy surprise. Well
worth the money.”
Positive
PREPARING YOUR DATA:
WORD SEGMENTATION
• The first challenge is how to divide sentences in your data into
words.
• In English or other European languages, this is fairly easy –
These / languages / have / spaces / between / the / words.
• But it’s not quite that simple – a process called stemming is often
used to reduce every word to its simplest form by stripping
plural endings, tense endings etc.
• Otherwise the computer won’t know that “dog” and “dogs”, or “go” and
“going”, express the same concept!
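As a rough illustration of the two steps on this slide, the sketch below splits an English sentence into words and then strips a few common suffixes. The `tokenize` and `toy_stem` helpers are hypothetical names, and the three-suffix rule set is a deliberately crude assumption; real stemmers such as the Porter stemmer apply many more rules.

```python
import re

def tokenize(text):
    """Lower-case the text and return runs of letters/apostrophes as words."""
    return re.findall(r"[a-z']+", text.lower())

def toy_stem(word):
    """Strip one common English suffix (toy rule set, not a real stemmer)."""
    for suffix in ("ing", "es", "s"):
        # Keep at least two characters so short words such as "is" survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word

words = tokenize("These languages have spaces between the words.")
print(words)
print([toy_stem(w) for w in ["dog", "dogs", "go", "going"]])
```

With these rules "dogs" and "going" collapse onto "dog" and "go", which is all the slide asks for; irregular forms such as "went" would need a proper stemmer or lemmatizer.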
PREPARING YOUR DATA:
WORD SEGMENTATION (IN OTHER
LANGUAGES)
• In other languages like Japanese, word segmentation is more
challenging.
• 日本語の文書は言葉と言葉の間にスペースがないから、形態素解析をしないといけない。
(“Japanese documents have no spaces between words, so morphological
analysis is required.”) Where do the words begin and end in that
sentence?
• Thankfully there is software to help with this process in many
languages.
• Japanese: MeCab, ChaSen, Janome (Python package)
• Chinese (and Arabic): Stanford Word Segmenter
• Korean: Open-Korean-Text (looks good, but I haven’t tried it)
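To give a feel for what a segmenter has to do, here is a toy longest-match segmenter applied to part of the Japanese sentence above. This is only a sketch: the `segment` function and the tiny `VOCAB` set are hypothetical, and tools such as MeCab or Janome rely on large dictionaries and statistical models rather than this simple greedy rule.

```python
# Hypothetical mini-vocabulary; a real segmenter ships a large dictionary.
VOCAB = {"日本語", "の", "文書", "は", "言葉", "と", "間", "に",
         "スペース", "が", "ない"}

def segment(text, vocab):
    """Greedy forward longest-match segmentation."""
    max_len = max(len(word) for word in vocab)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Unknown single characters fall through as one-character tokens.
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens

tokens = segment("日本語の文書は言葉と言葉の間にスペースがない", VOCAB)
print(" / ".join(tokens))
```

Greedy matching breaks down quickly on ambiguous text, which is one reason to use an established tool instead of writing your own.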
DICTIONARY APPROACHES
• To use a dictionary approach, you need to start by acquiring a
dictionary (or “lexicon”) which you’ll use to calculate sentiment.
• There are many of these available for the English language and
other major languages. In minority languages, however, these
resources might not be available – or might be of very dubious
quality.
• Your dictionary needs to be appropriate to your text. Using a
dictionary full of Twitter slang on newspaper texts will yield
bad results – and vice versa.
A SIMPLE EXAMPLE
“Just had a great time at the cinema, what a
fantastic movie! I don’t want to ruin the
ending but it’s a crazy surprise. Well worth
the money.”
“This movie was terrible - why would Brad
Pitt agree to star in this rubbish? It’s not like
he needs the money.”
A SIMPLE EXAMPLE…?
This movie has a fantastic cast, an
interesting concept and amazing special
effects – but the end result is utterly
boring.
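The three example reviews can be scored with a few lines of Python. This is a minimal sketch, assuming a hypothetical nine-word lexicon (`LEXICON`) with the +1 / -1 weights shown earlier; a real analysis would load a published dictionary instead.

```python
import re

# Hypothetical mini-lexicon: +1 for positive words, -1 for negative words.
LEXICON = {
    "great": 1, "fantastic": 1, "amazing": 1, "interesting": 1, "worth": 1,
    "awful": -1, "terrible": -1, "rubbish": -1, "boring": -1,
}

def sentiment_score(text):
    """Sum the lexicon weights of every word in the text (unknown words = 0)."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(LEXICON.get(word, 0) for word in words)

reviews = [
    "Just had a great time at the cinema, what a fantastic movie! I don't "
    "want to ruin the ending but it's a crazy surprise. Well worth the money.",
    "This movie was terrible - why would Brad Pitt agree to star in this "
    "rubbish? It's not like he needs the money.",
    "This movie has a fantastic cast, an interesting concept and amazing "
    "special effects - but the end result is utterly boring.",
]
scores = [sentiment_score(review) for review in reviews]
print(scores)
```

The third review comes out positive even though its verdict is negative: three praise words outvote the single word "boring". That is the weakness the "SIMPLE EXAMPLE…?" slide points at.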
DICTIONARY APPROACHES
PLEASE OPEN JUPYTER LAB!
THE BAG OF WORDS
• You may have noticed something about the examples we
looked at – the order of the words doesn’t matter.
• This is actually true of (almost) every
sentiment analysis approach (and text
mining approaches in general).
• It’s counter-intuitive, but computers are much
better at treating texts as a “bag of words”
than they are at understanding grammar,
word order etc.
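A short sketch of what "bag of words" means in practice, using only the standard library. The `bag_of_words` helper and the two sentences are made up for illustration.

```python
from collections import Counter

def bag_of_words(text):
    """Count how often each word occurs, discarding word order entirely."""
    return Counter(text.lower().split())

a = bag_of_words("the movie was great not boring")
b = bag_of_words("the movie was boring not great")

# Same words, same counts: the two bags are identical.
print(a == b)
```

The two sentences mean opposite things, yet a bag-of-words model cannot tell them apart, which is the trade-off this slide describes.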
VECTOR REPRESENTATIONS
• Often, after dividing the sentence into words, we represent it
using a vector of word frequencies. An entire corpus of
documents can be represented in a single matrix: the
term-document matrix (TDM).
Document | I | Like | To | Eat | Sushi | You | Burgers | She | Doesn’t
I like to eat sushi | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0
You like to eat burgers | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0
She doesn’t like sushi | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1
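The matrix on this slide can be rebuilt in plain Python. This is a minimal sketch: `build_matrix` is a hypothetical helper, and in practice a library vectorizer would normally do this job.

```python
docs = [
    "I like to eat sushi",
    "You like to eat burgers",
    "She doesn't like sushi",
]

def build_matrix(documents):
    """Return (vocabulary, matrix) with one row of word counts per document."""
    tokenized = [doc.lower().split() for doc in documents]
    vocab = []
    for tokens in tokenized:
        for token in tokens:
            if token not in vocab:  # keep words in order of first appearance
                vocab.append(token)
    matrix = [[tokens.count(term) for term in vocab] for tokens in tokenized]
    return vocab, matrix

vocab, matrix = build_matrix(docs)
print(vocab)
for row in matrix:
    print(row)
```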
FEATURE SELECTION
• A term-document matrix could easily get VERY big –
overwhelming a computer’s memory and taking a very long
time to process. We often need to focus somehow on the most
relevant terms in the vocabulary. How?
• Stopwords: Very commonly used words are of little value in
distinguishing documents, so we can remove them.
• Document Frequency: Ignoring words which appear in too many or too
few documents allows us to focus only on words useful to our research.
• TF-IDF: “Term Frequency / Inverse Document Frequency” weighting
highlights words that are especially good at distinguishing between
texts, though it is less useful for short documents (e.g. tweets).
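The three ideas above can be sketched on the sushi/burgers corpus. The stopword list is a hypothetical hand-picked set, and the TF-IDF formula used here (relative term frequency times the log of inverse document frequency) is only one of several common variants.

```python
import math

STOPWORDS = {"i", "you", "she", "to"}  # hypothetical hand-picked list

docs = [
    "I like to eat sushi",
    "You like to eat burgers",
    "She doesn't like sushi",
]

# Stopword removal: drop very common words before counting anything.
tokenized = [[w for w in doc.lower().split() if w not in STOPWORDS]
             for doc in docs]

def document_frequency(term):
    """Number of documents that contain the term at least once."""
    return sum(term in tokens for tokens in tokenized)

def tf_idf(term, tokens):
    """Plain TF-IDF; the term must occur somewhere in the corpus."""
    tf = tokens.count(term) / len(tokens)
    idf = math.log(len(tokenized) / document_frequency(term))
    return tf * idf

for term in ("like", "eat", "burgers"):
    print(term, document_frequency(term), round(tf_idf(term, tokenized[1]), 3))
```

"like" appears in every document, so its weight drops to zero, while "burgers" appears in only one and gets the highest weight. A document-frequency filter would discard "like" for the same reason.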
CLASSIFICATION ALGORITHMS
• Classification algorithms are the most commonly used tool in
machine learning – not just in text mining, but also in fields
like voice recognition, computer vision or predicting behaviour.
• They are essentially tools for pattern recognition – you show
them a number of labelled examples of vector representations
(in our case, term-document matrices) and they try to find the
patterns which maximise the probability of a vector belonging
to a certain label.
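To make "learning patterns from labelled examples" concrete, here is a small multinomial Naïve Bayes classifier written from scratch. It is a sketch only: the four training sentences are invented, the `train` and `predict` helpers are hypothetical names, and a real project would normally use a library implementation with far more training data.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """Count documents per label and word occurrences per label."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        class_docs[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_docs, word_counts, vocab

def predict(text, model):
    """Return the label with the highest log-probability for the text."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    scores = {}
    for label in class_docs:
        score = math.log(class_docs[label] / total_docs)  # class prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are ignored
                # Add-one (Laplace) smoothing avoids log(0).
                count = word_counts[label][word] + 1
                score += math.log(count / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical hand-labelled training data.
model = train([
    ("what a fantastic movie", "positive"),
    ("great time well worth the money", "positive"),
    ("this movie was terrible", "negative"),
    ("utterly boring rubbish", "negative"),
])
print(predict("a fantastic time", model))
print(predict("terrible rubbish", model))
```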
CHOOSING AN ALGORITHM
• There are many kinds of classification algorithm – from simple
statistical methods like Naïve Bayes, to evolutions of
regression-based approaches like Support Vector Machines, to
science-fiction-sounding approaches like Random Forest (which
constructs a “forest” of “decision trees” and lets them vote
on the classification) and Neural Networks (which were designed to
emulate the decision-making behavior of neurons in the human
brain).
• How do you pick the best one for your research?
• Simple answer: try them all and see what works best. Luckily,
CLASSIFICATION APPROACHES
PLEASE GO BACK TO JUPYTER LAB!
AGGREGATE ALGORITHMS
• There is one final group of sentiment analysis approaches
which has been gaining in popularity in recent years.
• Aggregate algorithms are similar to classification algorithms in
many ways (they need training data and function on pattern
recognition), but different in one crucial way – they do not
classify individual documents, but instead aim to give an
accurate measurement of the distribution of classes in the
overall corpus.
AGGREGATE ALGORITHMS (2)
• This has some serious advantages! Aggregate algorithms tend
to be able to give accurate results with a much smaller amount
of training data, for example.
• Aggregate algorithms are also really good at handling data with
a lot of “off-topic” texts.
• Classification algorithms have a statistical problem with this data – when
the “off-topic” category is very common, there is a bias towards mis-
classifying a lot of texts as off-topic.
• But… You can’t see classifications for individual texts, so
they’re not appropriate for every kind of research.
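The difference between counting individual classifications and estimating the overall distribution can be shown with a little arithmetic. The sketch below uses the "adjusted count" correction, one simple quantification technique chosen here purely for illustration; it is not necessarily the aggregate algorithm used in this workshop, and the error rates are invented numbers.

```python
def expected_observed_share(true_share, tpr, fpr):
    """Share a classifier would label 'positive', given its error rates."""
    return true_share * tpr + (1 - true_share) * fpr

def adjusted_count(observed_share, tpr, fpr):
    """Correct an observed share using known true/false positive rates."""
    corrected = (observed_share - fpr) / (tpr - fpr)
    return min(1.0, max(0.0, corrected))  # clamp to a valid proportion

# Assumed values: 10% of texts are truly positive; the classifier finds
# 80% of them (tpr) but also flags 10% of all other texts (fpr).
true_share, tpr, fpr = 0.10, 0.80, 0.10

observed = expected_observed_share(true_share, tpr, fpr)
estimate = adjusted_count(observed, tpr, fpr)
print(round(observed, 2), round(estimate, 2))
```

Simply counting the classifier's labels would report 17% positive, well above the true 10%, because the large "other" category contributes so many false positives. The corrected figure recovers the true share, but says nothing about which individual texts are positive.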
AGGREGATE APPROACHES
PLEASE GO BACK TO JUPYTER LAB!
PITFALLS AND WARNINGS
• Clean your Data! Data accessed from the internet often includes
a lot of texts you didn’t actually mean to analyse – check
carefully to make sure your data isn’t full of bots reposting
garbage, or posts about a totally different topic.
• Read your Data! Don’t just take the results of any algorithm to
be accurate – even if it agrees with your hypothesis. At some
point you’re going to need to dive in and read samples of the
data you’ve collected, to confirm that you’re really observing
WRAPPING UP
• This workshop can really only introduce a few of the most
commonly used approaches in sentiment analysis. This is a
rapidly changing field and new algorithms and approaches are
being developed all the time.
• There are some approaches which require a lot more technical
skill than the ones we looked at today – for example, creating
your own sentiment dictionary and analyser that’s perfectly
appropriate for your corpus of texts is possible, but difficult
unless you’re a skilled programmer.
• The approaches we looked at today are very mainstream and
commonly used in a lot of academic studies – I hope they’ll be
THANK YOU!
• Questions, ideas or feedback?
• Email: robfahey@fuji.waseda.jp
• Twitter: @robfahey
• Website: robfahey.co.uk

Sentiment Analysis in Python - Waseda Data Science Week 2019

  • 1. Workshop: Sentiment Analysis with Python Rob Fahey robfahey@fuji.waseda.jp @robfahey Data Science Week at Waseda, January 2019
  • 2. ”How does it make you feel?” Sentiment Analysis Also called “Tone Analysis” (Grimmer & Stewart 2013) or “Opinion Mining” (Dave, Lawrence & Pennock 2003) Whatever you call it, the question it aims to answer is always the same:
  • 3. THE OBJECTIVE • In the Internet age, humans create and publish billions of pieces of content (text, movies, images etc.) every single day. • Many of those data express a sentiment about a subject of some kind. • By selecting data related to a subject (a person, a country, a brand, etc.), we can measure public sentiment in a very detailed way. • We can even see how sentiment changes minute-by-minute, or day-by-day – giving us unprecedented insights into political trends, marketing campaigns or financial market movements.
  • 4. THE CHALLENGE • Sentiment Analysis is easy for humans, but hard for computers. • Humans: can process complex texts, images or videos with an understanding of cultural and social contexts, allowing us to quickly and naturally judge the sentiment or emotion being expressed. • Computers: can count things really, really fast. • Sentiment Analysis methodologies all try to overcome the weaknesses of computers (no context, no understanding) by using their strengths (counting very fast!).
  • 5. TWO APPROACHES UNSUPERVISED METHODS • Dictionary / Lexicon Methods • Word Embeddings SUPERVISED METHODS • Classification Algorithms • Aggregate Algorithms Requires Training DataNo Training Data Required
  • 6. HOW A MACHINE LEARNS • To carry out “Machine Learning”, the machine needs something to learn from. • In dictionary approaches, you teach the computer a lexicon – a set of words that are associated with different sentiments. • This approach can be improved (or at least complicated) by using techniques like word embeddings, which try to estimate the sentiment of unknown words by seeing how frequently they occur in proximity to known words; • Or by trying to consider the grammatical context in which a word appears. great +1 awful -1
  • 7. HOW A MACHINE LEARNS (2) • In supervised approaches, the computer instead learns from a set of sample data which you have categorized by hand, using human coding. • There are lots of different algorithms and approaches for supervised learning, but they all have this in common – you need to create training data first. • The algorithms try to learn the patterns which are associated with each sentiment. “This movie was terrible - why would Brad Pitt agree to star in this rubbish? It’s not like he needs the money.” Negativ e “Just had a great time at the cinema, what a fantastic movie! I don’t want to ruin the ending but it’s a crazy surprise. Well worth the money.” Positive
  • 8. PREPARING YOUR DATA: WORD SEGMENTATION • The first challenge is how to divide sentences in your data into words. • In English or other European languages, this is fairly easy – These / languages / have / spaces / between / the / words. • It’s not quite that simple – a process called stemming is often used to change every word back to its most simple form by removing plurals, tenses etc. • Otherwise the computer won’t know that ”dog” and “dogs”, or “go” and “going”, express the same concept!
  • 9. PREPARING YOUR DATA: WORD SEGMENTATION (IN OTHER LANGUAGES) • In other languages like Japanese, word segmentation is more challenging. • 日本語の文書は言葉と言葉の間にスペースがないから、形態素解析をし ないといけない。 Where do the words begin and end in that sentence? • Thankfully there is software to help with this process in many languages. • Japanese: MeCab, ChaSen, Janome (Python package) • Chinese (and Arabic): Stanford Word Segmenter • Korean: Open-Korean-Text (looks good, but I haven’t tried it)
  • 10. DICTIONARY APPROACHES • To use a dictionary approach, you need to start by acquiring a dictionary (or “lexicon”) which you’ll use to calculate sentiment. • There are many of these available for the English language and other major languages. In minority languages, however, these resources might not be available – or might be of very dubious quality. • Your dictionary needs to be appropriate to your text. Using a dictionary full of Twitter slang on newspaper texts will yield bad results – and vice versa.
  • 11. A SIMPLE EXAMPLE Just had a great time at the cinema, what a fantastic movie! I don’t want to ruin the ending but it’s a crazy surprise. Well worth the money. “This movie was terrible - why would Brad Pitt agree to star in this rubbish? It’s not like he needs the money.”
  • 12. A SIMPLE EXAMPLE…? This movie has a fantastic cast, an interesting concept and amazing special effects – but the end result is utterly boring.
  • 14. THE BAG OF WORDS • You may have noticed something about the examples we looked at – the order of the words doesn’t matter. • This is actually true of (almost) every sentiment analysis approach (and text mining approaches in general). • It’s counter-intuitive, but computers are much better at treating texts as a ”bag of words” than they are at understanding grammar, word order etc.
  • 15. VECTOR REPRESENTATIONS • Often, after dividing the sentence into words, we represent it using a vector of word frequencies. An entire corpus of documents can be represented in a single matrix: the term- document matrix (TDM). I like to eat sushi You like to eat burgers She doesn’t like sushi I Like To Eat Sushi You Burgers She Doesn’t 1 1 1 1 1 0 0 0 0 0 1 1 1 0 1 1 0 0 0 1 0 0 1 0 0 1 1
  • 16. FEATURE SELECTION • A term-document matrix can easily get VERY big – overwhelming a computer’s memory and taking a very long time to process. We often need to focus on the most relevant terms in the vocabulary. How? • Stopwords: Very commonly used words are of little value in distinguishing documents, so we can remove them. • Document Frequency: Ignoring words which appear in too many or too few documents allows us to focus only on words useful to our research. • TF-IDF: “Term Frequency / Inverse Document Frequency” weights each word by how often it appears in a document and how rare it is across the corpus, highlighting words that are especially good at distinguishing between texts. It is less useful for short documents (e.g. Twitter).
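The three filters above can be sketched together in plain Python. The names `document_frequency` and `tf_idf`, the parameters `min_df` and `max_df_ratio`, and the stopword list are all assumptions made for this example. The weighting used is tf × log(N / df), which is one common variant; libraries often add smoothing terms, so their numbers will differ.

```python
import math


def document_frequency(tokenized_docs):
    """Number of documents in which each term appears at least once."""
    df = {}
    for tokens in tokenized_docs:
        for term in set(tokens):
            df[term] = df.get(term, 0) + 1
    return df


def tf_idf(tokenized_docs, stopwords=frozenset(), min_df=1, max_df_ratio=1.0):
    """Return one {term: weight} dict per document after filtering the vocabulary."""
    n_docs = len(tokenized_docs)
    df = document_frequency(tokenized_docs)
    kept = {
        term for term, count in df.items()
        if term not in stopwords                           # stopword removal
        and min_df <= count and count / n_docs <= max_df_ratio  # document-frequency cut
    }
    weights = []
    for tokens in tokenized_docs:
        doc_weights = {}
        for term in kept:
            tf = tokens.count(term)
            if tf:
                doc_weights[term] = tf * math.log(n_docs / df[term])
        weights.append(doc_weights)
    return weights


docs = [
    "i like to eat sushi".split(),
    "you like to eat burgers".split(),
    "she doesn't like sushi".split(),
]
weights = tf_idf(docs, stopwords={"i", "you", "she", "to"})
for doc_weights in weights:
    print(doc_weights)
```

In this toy corpus “like” occurs in every document, so its weight is zero: it tells the three sentences apart no better than a stopword would. “burgers” occurs in only one document and receives the highest weight.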
  • 17. CLASSIFICATION ALGORITHMS • Classification algorithms are the most commonly used tool in machine learning – not just in text mining, but also in fields like voice recognition, computer vision or predicting behaviour. • They are essentially tools for pattern recognition – you show them a number of labelled examples of vector representations (in our case, rows of a term-document matrix) and they try to find the patterns which maximise the probability of a vector belonging to a certain label.
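As a concrete instance of learning from labelled examples, here is a minimal sketch of one simple classifier, multinomial Naive Bayes with add-one smoothing, written from scratch. The function names and the four training documents are invented for this illustration; a real study would use a library implementation and far more training data.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labelled_docs):
    """labelled_docs: list of (tokens, label) pairs. Returns the counts the model needs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in labelled_docs:
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return label_counts, word_counts, vocabulary


def predict(tokens, model):
    """Return the label with the highest log-probability for the given tokens."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        score = math.log(n_docs / total_docs)  # prior probability of the label
        total_words = sum(word_counts[label].values())
        for token in tokens:
            if token in vocabulary:  # words never seen in training are skipped
                # Add-one smoothing keeps unseen word/label pairs from zeroing the product.
                score += math.log(
                    (word_counts[label][token] + 1) / (total_words + len(vocabulary))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy training set.
training = [
    ("great fantastic movie".split(), "pos"),
    ("amazing cast great fun".split(), "pos"),
    ("terrible rubbish movie".split(), "neg"),
    ("boring terrible plot".split(), "neg"),
]
model = train_naive_bayes(training)
print(predict("great fun".split(), model))
print(predict("terrible plot".split(), model))
```

Note that the classifier only ever sees word counts, so it inherits the bag-of-words assumption: word order plays no part in its decision.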
  • 18. CHOOSING AN ALGORITHM • There are many kinds of classification algorithm – from simple statistical methods like Naïve Bayes, to evolutions of regression-based approaches like Support Vector Machines, to science-fiction sounding approaches like Random Forest (which constructs a “forest” of “decision trees” and has them vote on the classification) and Neural Networks (which were designed to emulate the decision-making behaviour of neurons in the human brain). • How do you pick the best one for your research? • Simple answer: try them all and see what works best. Luckily,
  • 19. CLASSIFICATION APPROACHES PLEASE GO BACK TO JUPYTER LAB!
  • 20. AGGREGATE ALGORITHMS • There is one final group of sentiment analysis approaches which has been gaining in popularity in recent years. • Aggregate algorithms are similar to classification algorithms in many ways (they need training data and work by pattern recognition), but different in one crucial way – they do not classify individual documents, but instead aim to give an accurate measurement of the distribution of classes in the overall corpus.
  • 21. AGGREGATE ALGORITHMS • This has some serious advantages! Aggregate algorithms tend to be able to give accurate results with a much smaller amount of training data, for example. • Aggregate algorithms are also really good at handling data with a lot of “off-topic” texts. • Classification algorithms have a statistical problem with this data – when the “off-topic” category is very common, there is a bias towards misclassifying a lot of texts as off-topic. • But… You can’t see classifications for individual texts, so they’re not appropriate for every kind of research.
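The idea of measuring class proportions without labelling individual documents can be illustrated with a deliberately simplified sketch. It is not necessarily the algorithm used in the workshop notebooks. It assumes two classes and a single yes/no feature (whether a document contains any word from a hypothetical cue list), and solves the mixture equation observed = rate_pos × p + rate_neg × (1 − p) for the positive share p. All names and data below are invented for the example.

```python
def feature_rate(docs, cue_words):
    """Share of documents containing at least one cue word."""
    hits = sum(1 for tokens in docs if cue_words & set(tokens))
    return hits / len(docs)


def estimate_positive_share(labelled_pos, labelled_neg, unlabelled, cue_words):
    """Estimate the proportion of positive documents in the unlabelled corpus."""
    rate_pos = feature_rate(labelled_pos, cue_words)  # P(feature | positive)
    rate_neg = feature_rate(labelled_neg, cue_words)  # P(feature | negative)
    if rate_pos == rate_neg:
        raise ValueError("cue words do not separate the classes")
    observed = feature_rate(unlabelled, cue_words)    # P(feature) in the corpus
    # Solve observed = rate_pos * p + rate_neg * (1 - p) for p.
    share = (observed - rate_neg) / (rate_pos - rate_neg)
    return min(1.0, max(0.0, share))  # clip to a valid proportion


# Hypothetical toy data.
cue_words = {"great", "fantastic"}
labelled_pos = [["great", "film"], ["fantastic", "cast"], ["great", "fun"], ["nice", "ending"]]
labelled_neg = [["terrible", "plot"], ["boring"], ["not", "great"], ["rubbish"]]
unlabelled = [["great"]] * 6 + [["dull"]] * 4

share = estimate_positive_share(labelled_pos, labelled_neg, unlabelled, cue_words)
print(round(share, 2))
```

Here 60% of the unlabelled documents contain a cue word, but because the cue is imperfect (it fires on 75% of positive and 25% of negative training documents), the corrected estimate of the positive share is about 0.7. No individual document is ever assigned a label, which is exactly the trade-off described on the slide.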
  • 22. AGGREGATE APPROACHES PLEASE GO BACK TO JUPYTER LAB!
  • 23. PITFALLS AND WARNINGS • Clean your Data! Data accessed from the internet often includes a lot of texts you didn’t actually mean to analyse – check carefully to make sure your data isn’t full of bots reposting garbage, or posts about a totally different topic. • Read your Data! Don’t just take the results of any algorithm to be accurate – even if it agrees with your hypothesis. At some point you’re going to need to dive in and read samples of the data you’ve collected, to confirm that you’re really observing
  • 24. WRAPPING UP • This workshop can really only introduce a few of the most commonly used approaches in sentiment analysis. This is a rapidly changing field and new algorithms and approaches are being developed all the time. • There are some approaches which require a lot more technical skill than the ones we looked at today – for example, creating your own sentiment dictionary and analyser that’s perfectly appropriate for your corpus of texts is possible, but difficult unless you’re a skilled programmer. • The approaches we looked at today are very mainstream and commonly used in a lot of academic studies – I hope they’ll be
  • 25. THANK YOU! • Questions, ideas or feedback? • Email: robfahey@fuji.waseda.jp • Twitter: @robfahey • Website: robfahey.co.uk