Twitter Author Prediction
from Tweets using
Bayesian Network
Hendy Irawan
23214344
TMDG 9 – Electrical Engineering - STEI ITB
Can We Predict the Author from a
Tweet?
 Most authors have a distinct writing style
 ... And unique topics to talk about
 ... And signature distribution of words used to tweet
 Can we train a Bayesian network so that the occurrence of words in a tweet can be
used to infer the author of that tweet?
 In summary: YES!
 Disclaimer: Accuracy varies
 In a test suite with @dakwatuna vs @farhatabbaslaw (very different tweet topics)
– 100% prediction accuracy is achieved
Analysis & Implementation Plan
 Visualize Word Distribution in Tweets with Word Clouds
 Using R Statistical Language in RStudio
 Implement in Java
 Natural Language Preprocessing
 Train Bayesian Network
 Predict Tweet Author
Visualize Word Distribution in Tweets
with Word Clouds
Using R Statistical Language in RStudio
All documentation and sources (open
source) available at:
http://ceefour.github.io/r-tutorials/
 Install R Packages
 libcurl4-openssl-dev (system library), twitteR, httpuv, tm, wordcloud, RColorBrewer
 Set up Twitter OAuth
 Grab Data
 Prepare Stop Words
 Make A Corpus
 Word Cloud
1. Install R Packages
2. Set up Twitter OAuth
3. Grab Data
4. Prepare Stop Words
5. Make A Corpus
6. Visualize Word Cloud: @dakwatuna
Word Clouds (2)
@suaradotcom @kompascom
Word Clouds (3)
@VIVAnews @liputan6dotcom
Word Clouds (4)
@pkspiyungan @MTlovenhoney
Word Clouds (5)
@hidcom @farhatabbaslaw
Java Implementation
 Natural Language Preprocessing
 Read tweets from CSV
 Lower case
 Remove http(s) links
 Remove punctuation symbols
 Remove numbers
 Canonicalize different word forms
 Remove stop words
 Train Bayesian Network
 Predict Tweet Author
 Initial experiments and dataset validation available at:
http://ceefour.github.io/r-tutorials/
 Java application source code (open source) available on GitHub at:
https://github.com/lumenitb/nlu-sentiment
1. Read Tweets from CSV
/**
 * Read CSV file {@code f} and put its contents into {@link #rows},
 * {@link #texts}, and {@link #origTexts}.
 * @param f CSV file with a header row; column 0 is used as the key and column 1 as the text
 */
public void readCsv(File f) {
    try (final CSVReader csv = new CSVReader(new FileReader(f))) {
        headerNames = csv.readNext(); // header
        rows = csv.readAll();
        texts = rows.stream().map(it -> Maps.immutableEntry(it[0], it[1]))
            .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
        origTexts = ImmutableMap.copyOf(texts);
    } catch (Exception e) {
        throw new RuntimeException("Cannot read " + f, e);
    }
}
2. Lower Case
/**
 * Lower case all texts.
 */
public void lowerCaseAll() {
    texts = Maps.transformValues(texts, String::toLowerCase);
}
3. Remove Links
/**
 * Remove http(s) links from texts.
 */
public void removeLinks() {
    // \\S+ consumes the rest of the URL up to the next whitespace
    texts = Maps.transformValues(texts, it -> it.replaceAll("https?://\\S+", " "));
}
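The link pattern can be tried in isolation. The following is a minimal sketch using only the JDK; the class name `LinkRemovalSketch` and the sample tweet are made up for illustration and are not part of the original source.

```java
import java.util.regex.Pattern;

final class LinkRemovalSketch {
    // http:// or https:// followed by one or more non-whitespace characters
    private static final Pattern LINK = Pattern.compile("https?://\\S+");

    /** Replaces every http(s) link in the text with a single space. */
    static String removeLinks(String text) {
        return LINK.matcher(text).replaceAll(" ");
    }

    public static void main(String[] args) {
        // Hypothetical sample tweet, after lower casing
        System.out.println(removeLinks("baca di http://t.co/abc123 sekarang"));
    }
}
```

The replacement is a space rather than an empty string so that the words on either side of a link are not glued together.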
4. Remove Punctuation Symbols
/**
 * Remove punctuation symbols from texts.
 */
public void removePunctuation() {
    texts = Maps.transformValues(texts, it -> it.replaceAll("[^a-zA-Z0-9]+", " "));
}
5. Remove Numbers
/**
 * Remove numbers from texts.
 */
public void removeNumbers() {
    texts = Maps.transformValues(texts, it -> it.replaceAll("[0-9]+", ""));
}
6. Canonicalize Words
/**
 * Canonicalize different word forms using {@link #CANONICAL_WORDS}.
 */
public void canonicalizeWords() {
    log.info("Canonicalize {} words for {} texts: {}",
        CANONICAL_WORDS.size(), texts.size(), CANONICAL_WORDS);
    // Match the variant only as a whole word; $1 and $2 restore the surrounding delimiters
    CANONICAL_WORDS.entries().forEach(entry ->
        texts = Maps.transformValues(texts,
            it -> it.replaceAll("(\\W|^)" + Pattern.quote(entry.getValue()) + "(\\W|$)",
                "$1" + entry.getKey() + "$2"))
    );
}
// Define contents of CANONICAL_WORDS
final ImmutableMultimap.Builder<String, String> mmb =
ImmutableMultimap.builder();
mmb.putAll("yang", "yg", "yng");
mmb.putAll("dengan", "dg", "dgn");
mmb.putAll("saya", "sy");
mmb.putAll("punya", "pny");
mmb.putAll("ya", "iya");
mmb.putAll("tidak", "tak", "tdk");
mmb.putAll("jangan", "jgn", "jngn");
mmb.putAll("jika", "jika", "bila");
mmb.putAll("sudah", "udah", "sdh", "dah", "telah", "tlh");
mmb.putAll("hanya", "hny");
mmb.putAll("banyak", "byk", "bnyk");
mmb.putAll("juga", "jg");
mmb.putAll("mereka", "mrk", "mereka");
mmb.putAll("gue", "gw", "gwe", "gua", "gwa");
mmb.putAll("sebagai", "sbg", "sbgai");
mmb.putAll("silaturahim", "silaturrahim", "silaturahmi",
"silaturrahmi");
mmb.putAll("shalat", "sholat", "salat", "solat");
mmb.putAll("harus", "hrs");
mmb.putAll("oleh", "olh");
mmb.putAll("tentang", "ttg", "tntg");
mmb.putAll("dalam", "dlm");
mmb.putAll("banget", "bngt", "bgt", "bingit", "bingits");
CANONICAL_WORDS = mmb.build();
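To see the whole-word replacement at work outside the analyzer class, here is a small sketch with plain JDK collections. The class name `CanonicalizeSketch` and the three-entry table are assumptions for illustration (a subset of the table above, stored variant-to-canonical instead of as a Guava multimap).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CanonicalizeSketch {
    // variant -> canonical form (illustrative subset of the slide's table)
    static final Map<String, String> VARIANTS = new LinkedHashMap<>();
    static {
        VARIANTS.put("yg", "yang");
        VARIANTS.put("tdk", "tidak");
        VARIANTS.put("bgt", "banget");
    }

    /** Replaces each variant with its canonical form, matching whole words only. */
    static String canonicalize(String text) {
        String result = text;
        for (Map.Entry<String, String> e : VARIANTS.entrySet()) {
            Pattern p = Pattern.compile("(\\W|^)" + Pattern.quote(e.getKey()) + "(\\W|$)");
            // $1 and $2 put back whatever delimiter surrounded the variant
            result = p.matcher(result)
                .replaceAll("$1" + Matcher.quoteReplacement(e.getValue()) + "$2");
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(canonicalize("yg tdk suka bgt"));
    }
}
```

Because the pattern requires a non-word character or a string boundary on both sides, a variant that is only a prefix of a longer word (for example "ygx") is left alone.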
7. Remove Stop Words
/**
 * Remove stop words using {@link #STOP_WORDS_ID} and {@code additions}.
 * @param additions extra stop words to remove in addition to {@link #STOP_WORDS_ID}
 */
public void removeStopWords(String... additions) {
    final Sets.SetView<String> stopWords = Sets.union(
        STOP_WORDS_ID, ImmutableSet.copyOf(additions));
    log.info("Removing {} stop words for {} texts: {}",
        stopWords.size(), texts.size(), stopWords);
    // Match the stop word only as a whole word; $1$2 keeps the surrounding delimiters
    stopWords.forEach(stopWord ->
        texts = Maps.transformValues(texts, it ->
            it.replaceAll("(\\W|^)" + Pattern.quote(stopWord) + "(\\W|$)", "$1$2"))
    );
}
/**
* Indonesian stop words.
*/
public static final Set<String> STOP_WORDS_ID = ImmutableSet.of(
"di", "ke", "ini", "dengan", "untuk", "yang", "tak", "tidak",
"gak",
"dari", "dan", "atau", "bisa", "kita", "ada", "itu",
"akan", "jadi", "menjadi", "tetap", "per", "bagi", "saat",
"tapi", "bukan", "adalah", "pula", "aja", "saja",
"kalo", "kalau", "karena", "pada", "kepada", "terhadap",
"amp", // &amp;
"rt" // RT:
);
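As an alternative to the regex-based removal, stop words can also be filtered after tokenizing. This is a sketch of a different technique, not the method used in the slides; the class name `StopWordFilterSketch` and the short stop-word set are assumptions for illustration. One edge case it sidesteps: with the delimiter-consuming pattern, the same stop word repeated back to back (for example "di di") shares its separating space, so the second occurrence can survive a single pass.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Collectors;

final class StopWordFilterSketch {
    // Illustrative subset of the Indonesian stop words listed above
    static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("di", "ke", "yang", "rt"));

    /** Splits on whitespace, drops stop words, and joins the rest with single spaces. */
    static String removeStopWords(String text) {
        return Arrays.stream(text.trim().split("\\s+"))
            .filter(token -> !token.isEmpty() && !STOP_WORDS.contains(token))
            .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        System.out.println(removeStopWords("rt di di rumah yang besar"));
    }
}
```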
8. Split Text into Words
/**
 * Split texts into {@link #words}.
 */
public void splitWords() {
    Splitter whitespace = Splitter.on(
        Pattern.compile("\\s+")).omitEmptyStrings().trimResults();
    words = Maps.transformValues(texts,
        it -> whitespace.splitToList(it));
}
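The order of the preprocessing steps matters: links have to be removed before punctuation, otherwise a URL is broken into stray tokens such as "http" and "t". The sketch below chains the steps on a single string using only the JDK; the class name `PreprocessPipelineSketch` and the sample tweet are hypothetical, and canonicalization and stop-word removal are left out for brevity.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

final class PreprocessPipelineSketch {
    /** Lower case, strip links, punctuation and numbers, then split into words. */
    static List<String> preprocess(String tweet) {
        String text = tweet.toLowerCase(Locale.ROOT);
        text = text.replaceAll("https?://\\S+", " "); // links first, while "://" is intact
        text = text.replaceAll("[^a-z0-9]+", " ");    // punctuation and other symbols
        text = text.replaceAll("[0-9]+", "");         // numbers
        return Arrays.stream(text.trim().split("\\s+"))
            .filter(token -> !token.isEmpty())
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical sample tweet
        System.out.println(preprocess("Berita NASIONAL 2014: http://t.co/Ab1 #pemilu!"));
    }
}
```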
Train Bayesian Network
Figures: BN graph model; prior probabilities
Train Bayesian Network: Java (1)
/**
 * Creates a {@link SentimentAnalyzer} then analyzes the file {@code f},
 * limiting words to {@code wordLimit} (based on top word frequency),
 * with additional stop words {@code moreStopWords}
 * (base stop words are {@link SentimentAnalyzer#STOP_WORDS_ID}).
 * @param f CSV file of tweets to analyze
 * @param wordLimit maximum number of most frequent words to keep
 * @param moreStopWords stop words to remove in addition to the base set
 * @return the analyzer, with {@code normWordCounts} populated
 */
protected SentimentAnalyzer analyze(File f, int wordLimit, Set<String> moreStopWords) {
    final SentimentAnalyzer sentimentAnalyzer = new SentimentAnalyzer();
    sentimentAnalyzer.readCsv(f);
    sentimentAnalyzer.lowerCaseAll();
    sentimentAnalyzer.removeLinks();
    sentimentAnalyzer.removePunctuation();
    sentimentAnalyzer.removeNumbers();
    sentimentAnalyzer.canonicalizeWords();
    sentimentAnalyzer.removeStopWords(moreStopWords.toArray(new String[] {}));
    log.info("Preprocessed text: {}",
        sentimentAnalyzer.texts.entrySet().stream().limit(10)
            .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue)));
    sentimentAnalyzer.splitWords();
    log.info("Words: {}",
        sentimentAnalyzer.words.entrySet().stream().limit(10)
            .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue)));
    final ImmutableMultiset<String> wordMultiset =
        Multisets.copyHighestCountFirst(HashMultiset.create(
            sentimentAnalyzer.words.values().stream()
                .flatMap(it -> it.stream()).collect(Collectors.toList())));
    final Map<String, Integer> wordCounts = new LinkedHashMap<>();
    // only the N most used words
    wordMultiset.elementSet().stream().limit(wordLimit)
        .forEach(it -> wordCounts.put(it, wordMultiset.count(it)));
    log.info("Word counts (orig): {}", wordCounts);
    // Normalize the twitterUser "vector" to length 1.0
    // Note that this "vector" is actually user-specific,
    // i.e. it's not a user-independent vector
    long origSumSqrs = 0;
    for (final Integer it : wordCounts.values()) {
        origSumSqrs += it * it;
    }
    double origLength = Math.sqrt(origSumSqrs);
    final Map<String, Double> normWordCounts =
        Maps.transformValues(wordCounts, it -> it / origLength);
    log.info("Word counts (normalized): {}", normWordCounts);
    sentimentAnalyzer.normWordCounts = normWordCounts;
    return sentimentAnalyzer;
}
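The normalization step divides every word count by the Euclidean length of the count vector, so the resulting vector has length 1.0. A minimal standalone sketch follows, assuming plain JDK maps instead of Guava views; the class name `NormalizeCountsSketch` and the counts in `main` are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class NormalizeCountsSketch {
    /** Scales the counts so that the sum of squared values equals 1.0. */
    static Map<String, Double> normalize(Map<String, Integer> counts) {
        long sumSquares = 0;
        for (int count : counts.values()) {
            sumSquares += (long) count * count;
        }
        double length = Math.sqrt(sumSquares);
        Map<String, Double> normalized = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            // Guard against an empty or all-zero vector
            normalized.put(e.getKey(), length == 0 ? 0.0 : e.getValue() / length);
        }
        return normalized;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("nasional", 3); // hypothetical counts
        counts.put("olga", 4);
        System.out.println(normalize(counts));
    }
}
```

With counts 3 and 4 the vector length is 5, so the normalized values are 3/5 and 4/5.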
Train Bayesian Network: Java (2)
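/**
 * Train Bayesian network {@code bn}, with help of {@link #analyze(File, int, Set)}.
 * @param bn the Bayesian network being trained (not referenced in the body shown here)
 * @param f CSV file with the author's training tweets
 * @param screenName the author's Twitter screen name, also added as a stop word
 * @return the analyzer holding the author's normalized word counts
 */
protected SentimentAnalyzer train(BayesianNetwork bn, File f, String screenName) {
    // Keep the 100 most frequent words per author
    final SentimentAnalyzer analyzer = analyze(f, 100, ImmutableSet.of(screenName));
    allWords.addAll(analyzer.normWordCounts.keySet());
    for (final Map.Entry<String, Double> entry : analyzer.normWordCounts.entrySet()) {
        wordNormLengthByScreenName.put(screenName + "/" + entry.getKey(), entry.getValue());
    }
    return analyzer;
}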
Predict Twitter Author:
“nasional” found
“nasional” found ->
85.37% probability of @dakwatuna
“nasional” found, “olga” missing ->
89.29% probability of @dakwatuna
Predict Twitter Author:
“olga” found
 In the training data, @dakwatuna never tweets about “olga”
 Not even once
 Therefore, the BN assigns 100% probability that @farhatabbaslaw is the author
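The 100% result follows directly from Bayes' rule when one author's likelihood for a word is zero. The sketch below assumes a network with a single author node and the word as its child, which the slides do not spell out; the class name `PosteriorSketch` and all probabilities in `main` are hypothetical, not values from the trained network.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class PosteriorSketch {
    /**
     * Computes P(author | word found) from the priors P(author) and the
     * likelihoods P(word found | author). Both maps must share the same keys.
     */
    static Map<String, Double> posterior(Map<String, Double> prior,
                                         Map<String, Double> likelihood) {
        double evidence = 0.0;
        for (String author : prior.keySet()) {
            evidence += prior.get(author) * likelihood.get(author);
        }
        Map<String, Double> result = new LinkedHashMap<>();
        for (String author : prior.keySet()) {
            // Result is NaN if no author ever used the word (evidence == 0)
            result.put(author, prior.get(author) * likelihood.get(author) / evidence);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Double> prior = new LinkedHashMap<>();
        prior.put("@dakwatuna", 0.5);      // hypothetical prior
        prior.put("@farhatabbaslaw", 0.5); // hypothetical prior
        Map<String, Double> likelihood = new LinkedHashMap<>();
        likelihood.put("@dakwatuna", 0.0);       // word never seen for this author
        likelihood.put("@farhatabbaslaw", 0.02); // hypothetical likelihood
        System.out.println(posterior(prior, likelihood));
    }
}
```

A zero likelihood removes that author from the numerator entirely, so the other author receives all of the probability mass regardless of the priors.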
Predict Twitter Author
 Initial corpus:
 @dakwatuna: 3200 tweets
 @farhatabbaslaw: 3172 tweets
 Split into:
 @dakwatuna
 1000 training tweets
 2200 test tweets
 @farhatabbaslaw:
 1000 training tweets
 2172 test tweets
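A split like this can be sketched as two list views. The slides do not say how the training tweets were chosen, so taking the first 1000 rows is an assumption here, and the class name `TrainTestSplitSketch` is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class TrainTestSplitSketch {
    /** The first trainSize items (or all items if there are fewer). */
    static <T> List<T> training(List<T> all, int trainSize) {
        return all.subList(0, Math.min(trainSize, all.size()));
    }

    /** Everything after the first trainSize items. */
    static <T> List<T> test(List<T> all, int trainSize) {
        return all.subList(Math.min(trainSize, all.size()), all.size());
    }

    public static void main(String[] args) {
        List<Integer> tweetIds = new ArrayList<>();
        for (int i = 0; i < 3200; i++) {
            tweetIds.add(i); // stand-ins for 3200 tweets
        }
        System.out.println(training(tweetIds, 1000).size() + " training, "
            + test(tweetIds, 1000).size() + " test");
    }
}
```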
Twitter Author Prediction Test:
@dakwatuna
Classification of 2200 tweets took 7855 ms
~ 3.57 ms per tweet classification
100% accuracy of prediction
Twitter Author Prediction Test:
@farhatabbaslaw
Classification of 2172 tweets took 7353 ms
~ 3.38 ms per tweet classification
100% accuracy of prediction
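The two figures reported per test run, accuracy and time per classification, can be computed as in the following sketch. The class name `EvaluationSketch` is hypothetical and the classifier is passed in as a plain function, since the actual prediction code is not shown in the slides.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

final class EvaluationSketch {
    /** Fraction of tweets for which the classifier returns the expected author. */
    static double accuracy(List<String> tweets, String expectedAuthor,
                           Function<String, String> classifier) {
        if (tweets.isEmpty()) {
            return 0.0;
        }
        long correct = tweets.stream()
            .filter(tweet -> expectedAuthor.equals(classifier.apply(tweet)))
            .count();
        return (double) correct / tweets.size();
    }

    /** Average wall-clock time per classified item. */
    static double millisPerItem(long totalMillis, int items) {
        return (double) totalMillis / items;
    }

    public static void main(String[] args) {
        // Timing figures taken from the @dakwatuna test run above
        System.out.printf("%.2f ms per tweet%n", millisPerItem(7855, 2200));
        // Dummy classifier that always answers "@dakwatuna", for illustration only
        List<String> tweets = Arrays.asList("tweet satu", "tweet dua");
        System.out.println(accuracy(tweets, "@dakwatuna", tweet -> "@dakwatuna"));
    }
}
```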
Conclusion
 Initial results are promising
 A Bayesian network is able to predict the tweet author with “very good” accuracy
 Note that accuracy depends largely on:
 Twitter author’s writing style
 Twitter author’s topics of interest
 Twitter author’s distribution of words
 In other words, two different authors with a similar writing style or similar topics will
have a greater chance of a “false positive” prediction (one author's tweet attributed to the other)

Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 

Kürzlich hochgeladen (20)

🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 

Twitter Author Prediction from Tweets using Bayesian Network

  • 1. Twitter Author Prediction from Tweets using Bayesian Network Hendy Irawan 23214344 TMDG 9 – Electrical Engineering - STEI ITB
  • 2. Can We Predict the Author from a Tweet?
      - Most authors have a distinct writing style
      - ... And unique topics to talk about
      - ... And a signature distribution of words used to tweet
      - Can we train a Bayesian Network so that the occurrence of words in a tweet can be used to infer the author of that tweet?
      - In summary: YES!
      - Disclaimer: Accuracy varies
      - In a test suite with @dakwatuna vs @farhatabbaslaw (very different tweet topics), 100% prediction accuracy is achieved
  • 3. Analysis & Implementation Plan
      - Visualize Word Distribution in Tweets with Word Clouds
      - Using R Statistical Language in RStudio
      - Implement in Java
      - Natural Language Preprocessing
      - Train Bayesian Network
      - Predict Tweet Author
  • 4. Visualize Word Distribution in Tweets with Word Clouds Using R Statistical Language in RStudio
      All documentation and sources (open source) available at: http://ceefour.github.io/r-tutorials/
      - Install R Packages: libcurl4-openssl-dev, twitteR, httpuv, tm, wordcloud, RColorBrewer
      - Setup Twitter OAuth
      - Grab Data
      - Prepare Stop Words
      - Make A Corpus
      - Word Cloud
  • 5. 1. Install R Packages
  • 9. 5. Make A Corpus
  • 10. 6. Visualize Word Cloud: @dakwatuna
  • 12. Word Clouds (3) @VIVAnews @liputan6dotcom
  • 14. Word Clouds (4) @hidcom @farhatabbaslaw
  • 15. Java Implementation
      - Natural Language Preprocessing
          - Read tweets from CSV
          - Lower case
          - Remove http(s) links
          - Remove punctuation symbols
          - Remove numbers
          - Canonicalize different word forms
          - Remove stop words
      - Train Bayesian Network
      - Predict Tweet Author
      - Initial experiments and dataset validation available at: http://ceefour.github.io/r-tutorials/
      - Java application source code (open source) available on GitHub at: https://github.com/lumenitb/nlu-sentiment
  • 16. 1. Read Tweets from CSV
      /**
       * Read CSV file {@code f} and put its contents into {@link #rows},
       * {@link #texts}, and {@link #origTexts}.
       * @param f CSV file to read; the first row is treated as the header
       */
      public void readCsv(File f) {
          try (final CSVReader csv = new CSVReader(new FileReader(f))) {
              headerNames = csv.readNext(); // header
              rows = csv.readAll();
              texts = rows.stream().map(it -> Maps.immutableEntry(it[0], it[1]))
                      .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
              origTexts = ImmutableMap.copyOf(texts);
          } catch (Exception e) {
              throw new RuntimeException("Cannot read " + f, e);
          }
      }
  • 17. 2. Lower Case
      /**
       * Lower case all texts.
       */
      public void lowerCaseAll() {
          texts = Maps.transformValues(texts, String::toLowerCase);
      }
  • 18. 3. Remove Links
      /**
       * Remove http(s) links from texts.
       */
      public void removeLinks() {
          // \\S+ matches the rest of the URL up to the next whitespace
          texts = Maps.transformValues(texts, it -> it.replaceAll("http(s?)://(\\S+)", " "));
      }
  • 19. 4. Remove Punctuation Symbols
      /**
       * Remove punctuation symbols from texts.
       */
      public void removePunctuation() {
          texts = Maps.transformValues(texts, it -> it.replaceAll("[^a-zA-Z0-9]+", " "));
      }
  • 20. 5. Remove Numbers
      /**
       * Remove numbers from texts.
       */
      public void removeNumbers() {
          texts = Maps.transformValues(texts, it -> it.replaceAll("[0-9]+", ""));
      }
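Steps 2 to 5 can be tried on a single tweet without the surrounding class. The sketch below is only an illustration under stated assumptions: the class name `TweetCleaner`, the sample tweet, and the final whitespace collapse are hypothetical additions, not part of the original application.

```java
import java.util.Locale;

final class TweetCleaner {
    /** Applies lower-casing, link removal, punctuation removal and number removal to one tweet. */
    static String clean(String tweet) {
        String s = tweet.toLowerCase(Locale.ROOT);   // step 2: lower case
        s = s.replaceAll("https?://\\S+", " ");      // step 3: remove http(s) links
        s = s.replaceAll("[^a-z0-9]+", " ");         // step 4: remove punctuation symbols
        s = s.replaceAll("[0-9]+", "");              // step 5: remove numbers
        // Assumption: collapse leftover whitespace so the result is easy to read.
        return s.trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        // Hypothetical sample tweet
        System.out.println(clean("RT @user: Harga BBM naik 10%! http://t.co/abc123"));
    }
}
```

The order matters: links are removed before punctuation, because stripping punctuation first would break a URL into ordinary-looking words.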
  • 21. 6. Canonicalize Words
      /**
       * Canonicalize different word forms using {@link #CANONICAL_WORDS}.
       */
      public void canonicalizeWords() {
          log.info("Canonicalize {} words for {} texts: {}",
                  CANONICAL_WORDS.size(), texts.size(), CANONICAL_WORDS);
          // Match each variant only as a whole word; $1 and $2 restore the surrounding delimiters
          CANONICAL_WORDS.entries().forEach(entry ->
              texts = Maps.transformValues(texts,
                  it -> it.replaceAll("(\\W|^)" + Pattern.quote(entry.getValue()) + "(\\W|$)",
                          "$1" + entry.getKey() + "$2"))
          );
      }

      // Define contents of CANONICAL_WORDS (canonical form first, then its variants)
      final ImmutableMultimap.Builder<String, String> mmb = ImmutableMultimap.builder();
      mmb.putAll("yang", "yg", "yng");
      mmb.putAll("dengan", "dg", "dgn");
      mmb.putAll("saya", "sy");
      mmb.putAll("punya", "pny");
      mmb.putAll("ya", "iya");
      mmb.putAll("tidak", "tak", "tdk");
      mmb.putAll("jangan", "jgn", "jngn");
      mmb.putAll("jika", "jika", "bila");
      mmb.putAll("sudah", "udah", "sdh", "dah", "telah", "tlh");
      mmb.putAll("hanya", "hny");
      mmb.putAll("banyak", "byk", "bnyk");
      mmb.putAll("juga", "jg");
      mmb.putAll("mereka", "mrk", "mereka");
      mmb.putAll("gue", "gw", "gwe", "gua", "gwa");
      mmb.putAll("sebagai", "sbg", "sbgai");
      mmb.putAll("silaturahim", "silaturrahim", "silaturahmi", "silaturrahmi");
      mmb.putAll("shalat", "sholat", "salat", "solat");
      mmb.putAll("harus", "hrs");
      mmb.putAll("oleh", "olh");
      mmb.putAll("tentang", "ttg", "tntg");
      mmb.putAll("dalam", "dlm");
      mmb.putAll("banget", "bngt", "bgt", "bingit", "bingits");
      CANONICAL_WORDS = mmb.build();
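The whole-word replacement in this step can be illustrated with the JDK alone. This is a minimal sketch, not the original implementation: the class name `Canonicalizer` and the three-entry map are hypothetical, and a plain `LinkedHashMap` from variant to canonical form stands in for the Guava multimap.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class Canonicalizer {
    // Hypothetical subset of the abbreviation table: variant -> canonical form
    static final Map<String, String> VARIANT_TO_CANONICAL = new LinkedHashMap<>();
    static {
        VARIANT_TO_CANONICAL.put("yg", "yang");
        VARIANT_TO_CANONICAL.put("sy", "saya");
        VARIANT_TO_CANONICAL.put("tdk", "tidak");
    }

    /** Replaces each variant with its canonical form, but only where it stands as a whole word. */
    static String canonicalize(String text) {
        String result = text;
        for (Map.Entry<String, String> e : VARIANT_TO_CANONICAL.entrySet()) {
            // Group 1 and group 2 capture the delimiters so they can be put back.
            String regex = "(\\W|^)" + Pattern.quote(e.getKey()) + "(\\W|$)";
            String replacement = "$1" + Matcher.quoteReplacement(e.getValue()) + "$2";
            result = result.replaceAll(regex, replacement);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(canonicalize("sy tdk punya"));
    }
}
```

Anchoring on the delimiters is what keeps a short variant such as "sy" from being rewritten inside a longer word.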
  • 22. 7. Remove Stop Words
      /**
       * Remove stop words using {@link #STOP_WORDS_ID} and {@code additions}.
       * @param additions extra stop words to remove on top of {@link #STOP_WORDS_ID}
       */
      public void removeStopWords(String... additions) {
          final Sets.SetView<String> stopWords = Sets.union(
                  STOP_WORDS_ID, ImmutableSet.copyOf(additions));
          log.info("Removing {} stop words for {} texts: {}",
                  stopWords.size(), texts.size(), stopWords);
          // Drop the stop word but keep the delimiters captured on either side
          stopWords.forEach(stopWord ->
              texts = Maps.transformValues(texts,
                  it -> it.replaceAll("(\\W|^)" + Pattern.quote(stopWord) + "(\\W|$)", "$1$2"))
          );
      }

      /**
       * Indonesian stop words.
       */
      public static final Set<String> STOP_WORDS_ID = ImmutableSet.of(
              "di", "ke", "ini", "dengan", "untuk", "yang", "tak", "tidak", "gak",
              "dari", "dan", "atau", "bisa", "kita", "ada", "itu",
              "akan", "jadi", "menjadi", "tetap", "per", "bagi", "saat",
              "tapi", "bukan", "adalah", "pula", "aja", "saja",
              "kalo", "kalau", "karena", "pada", "kepada", "terhadap",
              "amp", // &amp;
              "rt" // RT:
      );
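An alternative worth noting, and not what the slides do, is to filter stop words after the text has been split into tokens. A pattern that consumes the delimiter on both sides of a word can miss the second of two adjacent occurrences of the same word, whereas a set lookup per token has no such edge case. The class name `StopWordFilter` and the sample tokens below are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

final class StopWordFilter {
    /** Returns the tokens that are not in the stop-word set, preserving order. */
    static List<String> removeStopWords(List<String> tokens, Set<String> stopWords) {
        return tokens.stream()
                .filter(token -> !stopWords.contains(token))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("di", "ke", "yang"));
        // Two adjacent "di" tokens are both removed.
        System.out.println(removeStopWords(Arrays.asList("harga", "di", "di", "pasar"), stopWords));
    }
}
```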
  • 23. 8. Split Text into Words
      /**
       * Split texts into {@link #words}.
       */
      public void splitWords() {
          // Split on runs of whitespace and drop empty tokens
          Splitter whitespace = Splitter.on(
                  Pattern.compile("\\s+")).omitEmptyStrings().trimResults();
          words = Maps.transformValues(texts, it -> whitespace.splitToList(it));
      }
  • 24. Train Bayesian Network BN Graph model Prior probabilities
  • 25. Train Bayesian Network: Java (1)
      /**
       * Creates a {@link SentimentAnalyzer} then analyzes the file {@code f},
       * limiting words to {@code wordLimit} (based on top word frequency),
       * with additional stop words {@code moreStopWords} (base stop words
       * are {@link SentimentAnalyzer#STOP_WORDS_ID}).
       * @param f CSV file of tweets
       * @param wordLimit number of most frequent words to keep
       * @param moreStopWords additional stop words
       * @return the analyzer, with {@code normWordCounts} populated
       */
      protected SentimentAnalyzer analyze(File f, int wordLimit, Set<String> moreStopWords) {
          final SentimentAnalyzer sentimentAnalyzer = new SentimentAnalyzer();
          sentimentAnalyzer.readCsv(f);
          sentimentAnalyzer.lowerCaseAll();
          sentimentAnalyzer.removeLinks();
          sentimentAnalyzer.removePunctuation();
          sentimentAnalyzer.removeNumbers();
          sentimentAnalyzer.canonicalizeWords();
          sentimentAnalyzer.removeStopWords(moreStopWords.toArray(new String[] {}));
          log.info("Preprocessed text: {}",
                  sentimentAnalyzer.texts.entrySet().stream().limit(10)
                          .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue)));
          sentimentAnalyzer.splitWords();
          log.info("Words: {}",
                  sentimentAnalyzer.words.entrySet().stream().limit(10)
                          .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue)));
          final ImmutableMultiset<String> wordMultiset =
                  Multisets.copyHighestCountFirst(HashMultiset.create(
                          sentimentAnalyzer.words.values().stream()
                                  .flatMap(it -> it.stream()).collect(Collectors.toList())));
          final Map<String, Integer> wordCounts = new LinkedHashMap<>();
          // only the N most used words
          wordMultiset.elementSet().stream().limit(wordLimit).forEach(
                  it -> wordCounts.put(it, wordMultiset.count(it)));
          log.info("Word counts (orig): {}", wordCounts);
          // Normalize the twitterUser "vector" to length 1.0
          // Note that this "vector" is actually user-specific, i.e. it's not a user-independent vector
          long origSumSqrs = 0;
          for (final Integer it : wordCounts.values()) {
              origSumSqrs += it * it;
          }
          double origLength = Math.sqrt(origSumSqrs);
          final Map<String, Double> normWordCounts =
                  Maps.transformValues(wordCounts, it -> it / origLength);
          log.info("Word counts (normalized): {}", normWordCounts);
          sentimentAnalyzer.normWordCounts = normWordCounts;
          return sentimentAnalyzer;
      }
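The normalization at the end of this method divides every word count by the Euclidean length of the count vector, so the normalized counts have length 1.0. A small worked sketch follows, using only the JDK; the class name `WordVector` and the counts 3 and 4 are hypothetical and chosen so that the length is exactly 5.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class WordVector {
    /** Scales the counts so that the square root of the sum of squares equals 1.0. */
    static Map<String, Double> normalize(Map<String, Integer> counts) {
        long sumSquares = 0;
        for (int count : counts.values()) {
            sumSquares += (long) count * count;
        }
        double length = Math.sqrt(sumSquares);
        Map<String, Double> normalized = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            normalized.put(e.getKey(), e.getValue() / length);
        }
        return normalized;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("nasional", 3); // hypothetical count
        counts.put("olga", 4);     // hypothetical count
        // Length is sqrt(3*3 + 4*4) = 5, so the values are 3/5 and 4/5.
        System.out.println(normalize(counts));
    }
}
```

Because each author's vector is normalized separately, the resulting weights are comparable across authors who tweet at different volumes. Note that an input whose counts are all zero would divide by zero and yield NaN values in this sketch.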
  • 26. Train Bayesian Network: Java (2)
      /**
       * Train Bayesian network {@code bn}, with help of {@link #analyze(File, int, Set)}.
       * @param bn Bayesian network to train
       * @param f CSV file of tweets by {@code screenName}
       * @param screenName Twitter screen name of the author
       * @return the analyzer used for this author
       */
      protected SentimentAnalyzer train(BayesianNetwork bn, File f, String screenName) {
          final SentimentAnalyzer analyzer = analyze(f, 100, ImmutableSet.of(screenName));
          allWords.addAll(analyzer.normWordCounts.keySet());
          for (final Map.Entry<String, Double> entry : analyzer.normWordCounts.entrySet()) {
              wordNormLengthByScreenName.put(screenName + "/" + entry.getKey(), entry.getValue());
          }
          return analyzer;
      }
  • 27. Predict Twitter Author: “nasional” found
      - “nasional” found -> 85.37% probability of @dakwatuna
      - “nasional” found, “olga” missing -> 89.29% probability of @dakwatuna
  • 28. Predict Twitter Author: “olga” found
      - @dakwatuna never tweets about “olga”
      - Not even once
      - Therefore, BN assumes 100% probability that @farhatabbaslaw is the author
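The 100% result follows from Bayes' rule: if a word has zero likelihood under one author, any tweet containing it gets zero posterior probability for that author. The sketch below shows the arithmetic for two candidate authors and a single word. It is an illustration only: the class name `AuthorPosterior`, the equal priors, and the likelihood 0.02 are hypothetical values, not figures from the trained network.

```java
final class AuthorPosterior {
    /**
     * Posterior probability of author A given one piece of evidence,
     * for exactly two candidate authors A and B (Bayes' rule).
     */
    static double posteriorA(double priorA, double likelihoodA,
                             double priorB, double likelihoodB) {
        double jointA = priorA * likelihoodA;
        double jointB = priorB * likelihoodB;
        double evidence = jointA + jointB;
        if (evidence == 0.0) {
            throw new IllegalArgumentException("evidence has zero probability under both authors");
        }
        return jointA / evidence;
    }

    public static void main(String[] args) {
        // Hypothetical numbers: author A never used the word, author B used it in 2% of tweets.
        double pA = posteriorA(0.5, 0.0, 0.5, 0.02);
        System.out.println("P(A | word) = " + pA + ", P(B | word) = " + (1.0 - pA));
    }
}
```

A likelihood of exactly zero makes the prediction insensitive to every other word in the tweet, which is why a single word such as “olga” can decide the outcome on its own.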
  • 29. Predict Twitter Author
      - Initial corpus:
          - @dakwatuna: 3200 tweets
          - @farhatabbaslaw: 3172 tweets
      - Split into:
          - @dakwatuna: 1000 training tweets, 2200 test tweets
          - @farhatabbaslaw: 1000 training tweets, 2172 test tweets
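A split of this shape can be sketched as follows. The slides do not say how the 1000 training tweets were chosen, so taking the first 1000 in file order is an assumption here, and the class name `TrainTestSplit` is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class TrainTestSplit {
    final List<String> training;
    final List<String> test;

    /** Assumption: the first {@code trainingSize} tweets are used for training, the rest for testing. */
    TrainTestSplit(List<String> tweets, int trainingSize) {
        int cut = Math.min(trainingSize, tweets.size());
        this.training = new ArrayList<>(tweets.subList(0, cut));
        this.test = new ArrayList<>(tweets.subList(cut, tweets.size()));
    }

    public static void main(String[] args) {
        List<String> tweets = new ArrayList<>();
        for (int i = 0; i < 3200; i++) {
            tweets.add("tweet " + i); // placeholder texts
        }
        TrainTestSplit split = new TrainTestSplit(tweets, 1000);
        System.out.println(split.training.size() + " training, " + split.test.size() + " test");
    }
}
```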
  • 30. Twitter Author Prediction Test: @dakwatuna
      - Classification of 2200 tweets took 7855 ms (~ 3.57 ms per tweet classification)
      - 100% accuracy of prediction
  • 31. Twitter Author Prediction Test: @farhatabbaslaw
      - Classification of 2172 tweets took 7353 ms (~ 3.38 ms per tweet classification)
      - 100% accuracy of prediction
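The per-tweet figures are the total time divided by the number of classified tweets. A minimal sketch of that arithmetic and of the accuracy ratio is given below; the class name `ClassificationStats` and its method names are hypothetical.

```java
final class ClassificationStats {
    /** Average classification time in milliseconds per item. */
    static double msPerItem(long totalMillis, int items) {
        return (double) totalMillis / items;
    }

    /** Fraction of correctly predicted items, between 0.0 and 1.0. */
    static double accuracy(int correct, int total) {
        return (double) correct / total;
    }

    public static void main(String[] args) {
        // Totals as reported on the two test slides
        System.out.printf("@dakwatuna: %.2f ms per tweet%n", msPerItem(7855, 2200));
        System.out.printf("@farhatabbaslaw: %.2f ms per tweet%n", msPerItem(7353, 2172));
        System.out.printf("accuracy: %.0f%%%n", 100.0 * accuracy(2200, 2200));
    }
}
```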
  • 32. Conclusion
      - Initial results are promising
      - A Bayesian network is able to predict the tweet author with “very good” accuracy
      - Note that accuracy depends largely on:
          - The Twitter author’s writing style
          - The Twitter author’s topics of interest
          - The Twitter author’s distribution of words
      - In other words, two different authors with similar writing styles or topics will have a greater chance of a “false positive” prediction