Is that Dothraki or Valyrian?
and other NLP tasks with Python and NLTK
Charlie Redmon | SupStat, Inc.
August 18, 2014
Dothraki
Astapori Valyrian
High Valyrian
Importing raw text
import codecs

dothraki_f = codecs.open(
    "/home/cr/Python/westeros/dothraki.txt",
    encoding='utf-8')
dothraki_raw = dothraki_f.read()
print dothraki_raw
Athchomar chomakaan, [zhey] khal vezhven. Azha
anhaan asshilat... Itte oakah! Jadi, zhey Jora
Andahli. Khal vezhven. Ajjalan anha zalat vitiherat
yer hatif. Kash qoy qoyi thira disse. Hash shafka
zali addrivat mae, zhey Khaleesi? Ishish chare
...
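For readers on Python 3, the same import step can be sketched with the built-in `open`, which takes an `encoding` argument directly. This is a minimal sketch under stated assumptions: `read_utf8` is a hypothetical helper name, and the temporary file is a stand-in for `dothraki.txt` so the example runs anywhere.

```python
import os
import tempfile

def read_utf8(path):
    """Return the full contents of a UTF-8 text file as a str."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()

# Hypothetical stand-in for dothraki.txt: one sentence from the sample text.
sample_text = "Athchomar chomakaan, [zhey] khal vezhven."
with tempfile.NamedTemporaryFile("w", encoding="utf-8",
                                 suffix=".txt", delete=False) as tmp:
    tmp.write(sample_text)

dothraki_raw = read_utf8(tmp.name)
os.remove(tmp.name)  # clean up the stand-in file
print(dothraki_raw)
```

The `with` block closes the file handle even if reading fails, which the slide's `codecs.open` call leaves to the caller.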
Text processing: Cleaning
import re

punct_re = re.compile(
    ur'[.,;:?!\u2014\u2019\u2026\[\]]',
    re.UNICODE)
dothraki_proc = punct_re.sub('', dothraki_raw)
dothraki_proc = dothraki_proc.lower()
print dothraki_proc
athchomar chomakaan zhey khal vezhven azha anhaan
asshilat itte oakah jadi zhey jora andahli khal
vezhven ajjalan anha zalat vitiherat yer hatif kash
qoy qoyi thira disse
...
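The cleaning step can be sketched as a small reusable function. The version below is written for Python 3 (an assumption; the slides use Python 2), and `clean` is a hypothetical name. The character class mirrors the one on the slide: ASCII punctuation, em dash, right single quote, ellipsis, and square brackets.

```python
import re

# Same character class as the slide; the re module resolves the \uXXXX escapes.
PUNCT_RE = re.compile(r"[.,;:?!\u2014\u2019\u2026\[\]]")

def clean(text):
    """Strip the listed punctuation marks and lowercase the text."""
    return PUNCT_RE.sub("", text).lower()

raw = "Athchomar chomakaan, [zhey] khal vezhven. Itte oakah!"
print(clean(raw))
```

Because punctuation is deleted rather than replaced with a space, two words joined only by a dash or ellipsis would be fused into one token. Substituting a space instead is one possible alternative if the source text contains such cases.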
Text processing: Tokenizing
dothraki_tokens = re.split(ur'\s+', dothraki_proc)
dothraki_types = set(dothraki_tokens)
print dothraki_types
set([u'izzi', u'ale', u'morea', u'vesazhao',
u'yeri', u'ishish', u'dalen', u'vesazhae', u'yera',
u'afisi', u'rhae', u'mawizzi', u'vee', u'arrisse',
u'ti', u'ven', u'rizh', u'afichak', u'gache',
u'zigerek', u'zigereo', u'drivoe', u'maaz',
u'zigeree', u'ayyeyoon', u'maan', u'mahrazhi',
u'ma', u'vos', u'movekkhi', u'mahrazhis',
u'meshafka', u'qisi', u'sani', u'ville', u'vikeesi',
u'ifak', u'javrathi', u'zisa', u'chek', u'nem',
...
])
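One detail of splitting on whitespace is worth a short sketch: `re.split` returns an empty string when the text starts or ends with whitespace, and that empty string would then be counted as a token and a type. The Python 3 example below (an assumption; the slides use Python 2) filters it out. `tokenize` is a hypothetical helper, and the input is the first two lines of the cleaned output shown earlier.

```python
import re

def tokenize(text):
    """Split on runs of whitespace, dropping empty strings at the edges."""
    return [tok for tok in re.split(r"\s+", text) if tok]

proc = ("athchomar chomakaan zhey khal vezhven azha anhaan\n"
        "asshilat itte oakah jadi zhey jora andahli khal\n")

tokens = tokenize(proc)   # every running word
types = set(tokens)       # distinct word forms
print(len(tokens), len(types))
```

Here `zhey` and `khal` each occur twice, so there are fewer types than tokens.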
Inspecting the lexical distribution in a text
from nltk import FreqDist

dothraki_freqdist = FreqDist(dothraki_tokens)
print dothraki_freqdist
<FreqDist: u'anha': 50, u'vos': 40, u'me': 39,
u'ma': 38, u'zhey': 29, u'mae': 27, u'anni': 26,
u'hash': 23, u'yer': 23, u'khal': 16,
u'khaleesi': 16, u'mori': 15, u'jin': 13,
u'kisha': 12, u'nem': 11, u'vo': 11, u'che': 10,
u'jini': 10, u'she': 10, ...>
dothraki_freqdist.plot(20, cumulative=True)
CFD of Dothraki words
Top 10: anha, vos, me, ma, zhey, mae, anni, hash, yer, khal
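What the cumulative plot draws can be sketched without NLTK, using the standard library's `collections.Counter` as a stand-in for `FreqDist`. The counts below are copied from the printed output for the ten most frequent Dothraki words. The Python 3 syntax and the variable names are assumptions for illustration.

```python
from collections import Counter
from itertools import accumulate

# Counts copied from the FreqDist output (ten most frequent words only).
counts = Counter({"anha": 50, "vos": 40, "me": 39, "ma": 38, "zhey": 29,
                  "mae": 27, "anni": 26, "hash": 23, "yer": 23, "khal": 16})

top = counts.most_common(5)
words = [word for word, _ in top]
# Running total of token counts, i.e. the height of the cumulative curve.
cumulative = list(accumulate(n for _, n in top))

for word, running in zip(words, cumulative):
    print(word, running)
```

Each point on the cumulative curve is the number of tokens accounted for by that word and every more frequent word.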
Valyrian vocabulary distribution
Astapori Valyrian (Top 10):
ji, me, do, espo, si, mysa, eji, ez, ivetrá, sa
High Valyrian (Top 10):
daor, se, issa, syt, ziry, hen, jemēle, lue, yne, avy
Feature 1: Consonant proportion
def c_prop(word):
    c_num = 0
    for letter in u'bcdfgjklmnpqrstvxz\u00f1':
        c_num += word.count(letter)
    # float() avoids integer division under Python 2
    return float(c_num) / len(word)
c_prop(u'z\u016bgusy')
0.5
Word-internal consonant proportions across languages
Feature 2: Obstruent proportion
def obstruent_prop(word):
    obstruent_num = 0
    for letter in u'bcdfgjkpqstvxz':
        obstruent_num += word.count(letter)
    # float() avoids integer division under Python 2
    return float(obstruent_num) / len(word)
obstruent_prop(u'\u012blvi')
0.25
Word-internal obstruent proportions across languages
Feature 3: Coda presence
def c_coda(word):
    if word[-1] in u'bcdfgjklmnpqrstvxz\u00f1':
        return 1
    else:
        return 0
def obstruent_coda(word):
    if word[-1] in u'bcdfgjkpqstvxz':
        return 1
    else:
        return 0
c_coda(u'lysoon')
1
obstruent_coda(u'lysoon')
0
Mean coda consonant presence across languages
Mean coda obstruent presence across languages
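The per-language means in these plots come from averaging a per-word feature over a word list. A minimal Python 3 sketch of that step follows (an assumption; the slides use Python 2). `mean_feature` is a hypothetical helper, and the word lists are only the top-10 lists from the vocabulary slides, so the numbers printed are illustrative and are not the corpus-wide means behind the figures.

```python
# Consonant inventory copied from c_coda on the slide.
CONSONANTS = "bcdfgjklmnpqrstvxz\u00f1"

def c_coda(word):
    """Return 1 if the word ends in a listed consonant, else 0."""
    return 1 if word[-1] in CONSONANTS else 0

def mean_feature(feature, words):
    """Average a per-word feature over a list of words."""
    return sum(feature(w) for w in words) / len(words)

# Top-10 lists from the vocabulary slides (tiny illustrative sample).
top10 = {
    "Dothraki": ["anha", "vos", "me", "ma", "zhey",
                 "mae", "anni", "hash", "yer", "khal"],
    "Astapori Valyrian": ["ji", "me", "do", "espo", "si",
                          "mysa", "eji", "ez", "ivetr\u00e1", "sa"],
    "High Valyrian": ["daor", "se", "issa", "syt", "ziry",
                      "hen", "jem\u0113le", "lue", "yne", "avy"],
}

means = {lang: mean_feature(c_coda, words) for lang, words in top10.items()}
for lang, value in means.items():
    print(lang, value)
```

Passing `obstruent_coda` or any other per-word function in place of `c_coda` would give the corresponding mean for that feature.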
Feature 4: Consonant clusters
regex = (ur'[bcdfghjklmnpqrstvxz\u00f1]'
         ur'[bcdfghjklmnpqrstvxz\u00f1]+')
def c_cluster(word):
    cc_set = re.findall(regex, word, re.UNICODE)
    return len(cc_set)
c_cluster(u'avvirsosh')
3
Mean consonant cluster frequency across languages
Feature 5: Obstruent clusters
regex1 = ur'[bcdfghjkpqstvxz][bcdfghjkpqstvxz]+'
def obs_cluster(word):
    oo_set = re.findall(regex1, word, re.UNICODE)
    return len(oo_set)
obs_cluster(u'avvirsosh')
2
Mean obstruent cluster frequency across languages
Feature 6: Vowel clusters
regex2 = ur'[bcdfghjklmnpqrstvxz\u00f1]+'
def v_cluster(word):
    # flags must be passed by keyword: the third positional
    # argument of re.split is maxsplit
    v_set = re.split(regex2, word, flags=re.UNICODE)
    vv_set = [v for v in v_set if len(v) > 1]
    return len(vv_set)
v_cluster(u'haeshi')
1
Mean vowel cluster frequency across languages
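The word-internal features can be gathered into a single dictionary per word, which is the general shape of input that dictionary-based classifiers expect. The Python 3 sketch below is an illustration under assumptions: `word_features` and the key names are hypothetical, and the letter inventories are copied from the slides, where the proportion and coda features omit "h" while the cluster patterns include it.

```python
import re

# Inventories as they appear on the slides.
C_SET = "bcdfgjklmnpqrstvxz\u00f1"        # proportion and coda features
OBS_SET = "bcdfgjkpqstvxz"
C_CLUSTER = "bcdfghjklmnpqrstvxz\u00f1"   # cluster patterns (include h)
OBS_CLUSTER = "bcdfghjkpqstvxz"

def word_features(word):
    """Collect the word-internal features into one dict."""
    # Splitting on consonant runs leaves the vowel runs behind.
    vowel_runs = re.split("[%s]+" % C_CLUSTER, word)
    return {
        "c_prop": sum(word.count(c) for c in C_SET) / len(word),
        "obstruent_prop": sum(word.count(c) for c in OBS_SET) / len(word),
        "c_coda": int(word[-1] in C_SET),
        "obstruent_coda": int(word[-1] in OBS_SET),
        # [X]{2,} is equivalent to the slide's [X][X]+ pattern.
        "c_clusters": len(re.findall("[%s]{2,}" % C_CLUSTER, word)),
        "obs_clusters": len(re.findall("[%s]{2,}" % OBS_CLUSTER, word)),
        "v_clusters": sum(1 for run in vowel_runs if len(run) > 1),
    }

print(word_features("avvirsosh"))
```

Keeping all features in one function makes it easier to apply the same inventories consistently across the three languages.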
Data from real languages
TDIL Assamese Corpus
Assamese corpus files
import os

directory = ("/home/cr/Documents/NLPwP_pres/"
             "TDIL_assamese_corpus_data")
os.listdir(directory)
['subj_art2.txt', 'subj_politics1.txt', 'lit3.txt',
'drama.txt', 'religion2.txt', 'criticism2.txt',
'criticism1.txt', 'subj_science3.txt',
'ref_encyclopaedia-entry.txt', 'subj_science2.txt',
'subj_social-studies.txt', 'music.txt', 'subj_art1.txt
'subj_science1.txt', 'subj_art3.txt', 'news.txt',
'subj_sociology.txt', 'criticism3.txt', 'lit8.txt',
'subj_history1.txt', 'lit4.txt', 'lit6.txt', 'religion
'subj_law.txt', 'lit7.txt', 'religion1.txt', 'criticis
'lit5.txt', 'subj_math.txt', 'lit1.txt', 'subj_science
'subj_science_5.txt', 'subj_history2.txt', 'lit2.txt',
'subj_science4.txt', 'letter.txt']
Assamese sample: 'lit5.txt'
Frequency of the sound /x/ in 'lit5.txt'
len(re.findall(ur'[\u09b6\u09b7\u09b8]',
               assamese_sample_raw, re.UNICODE))
1313
len(re.findall(ur'\u09b6', assamese_sample_raw,
               re.UNICODE))
298
len(re.findall(ur'\u09b7', assamese_sample_raw,
               re.UNICODE))
195
len(re.findall(ur'\u09b8', assamese_sample_raw,
               re.UNICODE))
820
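The three per-letter counts can also be gathered in one pass, with the combined total as a consistency check. The Python 3 sketch below is an illustration under assumptions: `sibilant_counts` is a hypothetical helper, and the sample string is synthetic, built from code points and not taken from the corpus.

```python
from collections import Counter

# The three letters counted on the slide: U+09B6, U+09B7, U+09B8.
SIBILANTS = "\u09b6\u09b7\u09b8"

def sibilant_counts(text):
    """Count each of the three letters separately in one pass."""
    return Counter(ch for ch in text if ch in SIBILANTS)

# Synthetic string built from code points, not real corpus text.
sample = "\u09b8\u09be\u09b6 \u09b7\u09bf\u09b8 \u09b8"
counts = sibilant_counts(sample)
total = sum(counts.values())
print(total, dict(counts))

# The slide's per-letter counts add up to its combined count.
slide_total = 298 + 195 + 820
print(slide_total)
```

On the slide the same check holds: 298 + 195 + 820 = 1313.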
Positional restrictions
Beginning a word:
len(re.findall(ur'\b[\u09b6\u09b7\u09b8]',
               assamese_sample_raw, re.UNICODE))
1129
Ending a word:
len(re.findall(ur'[\u09b6\u09b7\u09b8]\b',
               assamese_sample_raw, re.UNICODE))
895
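The two positional counts add up to 2024, which exceeds the 1313 total occurrences, so at least 711 matches were counted as both word-initial and word-final. One possible cause (an assumption, not something the slides state) is that `\b` treats dependent vowel signs as non-word characters and so reports a boundary inside a word. A token-based count can serve as a cross-check. The Python 3 sketch below uses a hypothetical helper, `positional_counts`, and synthetic tokens built from code points.

```python
SIBILANTS = "\u09b6\u09b7\u09b8"

def positional_counts(text):
    """Count whitespace-delimited tokens that start or end with a sibilant."""
    tokens = text.split()
    initial = sum(1 for tok in tokens if tok[0] in SIBILANTS)
    final = sum(1 for tok in tokens if tok[-1] in SIBILANTS)
    return initial, final

# Synthetic tokens (not real Assamese words):
#   sibilant + vowel sign, consonant + sibilant, lone sibilant
sample = "\u09b8\u09be \u0995\u09b6 \u09b7"
print(positional_counts(sample))
```

This approach assumes punctuation has already been stripped, since a token ending in a punctuation mark would not be counted as sibilant-final.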
Positional restrictions
Following /a/:
len(re.findall(ur'\u09be[\u09b6\u09b7\u09b8]',
               assamese_sample_raw, re.UNICODE))
57
Following /i/:
len(re.findall(ur'[\u09bf\u09c0][\u09b6\u09b7\u09b8]',
               assamese_sample_raw, re.UNICODE))
70
Following /u/:
len(re.findall(ur'[\u09c1\u09c2][\u09b6\u09b7\u09b8]',
               assamese_sample_raw, re.UNICODE))
10
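The three vowel-context queries share one pattern shape, so they can be driven from a small table. The Python 3 sketch below is an illustration under assumptions: `following_vowel_counts` and the dictionary keys are hypothetical names, the vowel-sign groupings are copied from the slides, and the sample string is synthetic.

```python
import re

SIBILANTS = "\u09b6\u09b7\u09b8"
# Vowel-sign classes as grouped on the slides.
VOWEL_SIGNS = {
    "a": "\u09be",
    "i": "\u09bf\u09c0",
    "u": "\u09c1\u09c2",
}

def following_vowel_counts(text):
    """Count sibilants immediately preceded by each class of vowel sign."""
    return {
        vowel: len(re.findall("[%s][%s]" % (signs, SIBILANTS), text))
        for vowel, signs in VOWEL_SIGNS.items()
    }

# Synthetic string built from code points, not real corpus text.
sample = ("\u0995\u09be\u09b8 \u0995\u09bf\u09b6 "
          "\u0995\u09c0\u09b7 \u0995\u09c1\u09b8")
print(following_vowel_counts(sample))
```

Adding another vowel context then means adding one entry to the table instead of writing another query.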
Further work
Incorporate segmental parameters into classifier (fix Unicode issues with NLTK’s classify module)
Use classifier to predict assignment of random words from Westeros to Dothraki, Astapori Valyrian, and High Valyrian languages
Isolate most important word-internal parameters in classification model (log-likelihood ranking in Naive Bayes model)
Use full distributional account of select Assamese consonants as priors in acoustic classification model
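To make the planned classification step concrete, here is a hand-rolled, single-feature Naive Bayes sketch in Python 3. It is not NLTK's classify module and not the author's model: the function names, the add-one smoothing, the use of only the coda feature, and the training data (the three top-10 lists) are all assumptions chosen to keep the illustration small. The test word `tala` is a made-up string.

```python
import math
from collections import Counter

CONSONANTS = "bcdfgjklmnpqrstvxz\u00f1"

def c_coda(word):
    """1 if the word ends in a listed consonant, else 0."""
    return int(word[-1] in CONSONANTS)

def train(labelled_words):
    """Count labels and (label, coda value) pairs."""
    label_counts = Counter(label for _, label in labelled_words)
    coda_counts = Counter((label, c_coda(w)) for w, label in labelled_words)
    return label_counts, coda_counts

def posteriors(word, label_counts, coda_counts):
    """Posterior over labels from one binary feature, add-one smoothed."""
    total = sum(label_counts.values())
    value = c_coda(word)
    log_scores = {}
    for label, n in label_counts.items():
        prior = n / total
        likelihood = (coda_counts[(label, value)] + 1) / (n + 2)
        log_scores[label] = math.log(prior) + math.log(likelihood)
    norm = math.log(sum(math.exp(s) for s in log_scores.values()))
    return {label: math.exp(s - norm) for label, s in log_scores.items()}

# Training data: the top-10 lists from the vocabulary slides.
top10 = {
    "Dothraki": ["anha", "vos", "me", "ma", "zhey",
                 "mae", "anni", "hash", "yer", "khal"],
    "Astapori Valyrian": ["ji", "me", "do", "espo", "si",
                          "mysa", "eji", "ez", "ivetr\u00e1", "sa"],
    "High Valyrian": ["daor", "se", "issa", "syt", "ziry",
                      "hen", "jem\u0113le", "lue", "yne", "avy"],
}
data = [(w, lang) for lang, words in top10.items() for w in words]
label_counts, coda_counts = train(data)

post = posteriors("tala", label_counts, coda_counts)  # made-up test word
print(max(post, key=post.get), post)
```

With a single binary feature the posteriors stay close together, which suggests why combining several word-internal parameters would be needed before the ranking of features becomes informative.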
Thank you

 
complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...
 
Work Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvvWork Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvv
 
Concrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxConcrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptx
 
Indian Dairy Industry Present Status and.ppt
Indian Dairy Industry Present Status and.pptIndian Dairy Industry Present Status and.ppt
Indian Dairy Industry Present Status and.ppt
 
Vishratwadi & Ghorpadi Bridge Tender documents
Vishratwadi & Ghorpadi Bridge Tender documentsVishratwadi & Ghorpadi Bridge Tender documents
Vishratwadi & Ghorpadi Bridge Tender documents
 
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Serviceyoung call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
 
Arduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.pptArduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.ppt
 
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptxExploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
 
Introduction to Machine Learning Unit-3 for II MECH
Introduction to Machine Learning Unit-3 for II MECHIntroduction to Machine Learning Unit-3 for II MECH
Introduction to Machine Learning Unit-3 for II MECH
 
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfCCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
 
Call Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call GirlsCall Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call Girls
 
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionSachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
 

NLP Tasks with Python and NLTK for Dothraki, Valyrian, and Assamese

  • 1. Is that Dothraki or Valyrian? and other NLP tasks with Python and NLTK Charlie Redmon | SupStat, Inc. August 18, 2014
  • 5. Importing raw text
      import codecs
      dothraki_f = codecs.open(
          "/home/cr/Python/westeros/dothraki.txt",
          encoding='utf-8')
      dothraki_raw = dothraki_f.read()
      print dothraki_raw
      Athchomar chomakaan, [zhey] khal vezhven. Azha anhaan asshilat... Itte oakah! Jadi, zhey Jora Andahli. Khal vezhven. Ajjalan anha zalat vitiherat yer hatif. Kash qoy qoyi thira disse. Hash shafka zali addrivat mae, zhey Khaleesi? Ishish chare ...
  • 6. Text processing: Cleaning
      import re
      # Strip ASCII punctuation, em dash, right single quote, ellipsis, and square brackets
      punct_re = re.compile(
          ur'[.,;:?!\u2014\u2019\u2026\[\]]',
          re.UNICODE)
      dothraki_proc = punct_re.sub('', dothraki_raw)
      dothraki_proc = dothraki_proc.lower()
      print dothraki_proc
      athchomar chomakaan zhey khal vezhven azha anhaan asshilat itte oakah jadi zhey jora andahli khal vezhven ajjalan anha zalat vitiherat yer hatif kash qoy qoyi thira disse ...
  • 7. Text processing: Tokenizing
      # Split on runs of whitespace; the set of tokens gives the word types
      dothraki_tokens = re.split(ur'\s+', dothraki_proc)
      dothraki_types = set(dothraki_tokens)
      print dothraki_types
      set([u'izzi', u'ale', u'morea', u'vesazhao', u'yeri', u'ishish', u'dalen', u'vesazhae', u'yera', u'afisi', u'rhae', u'mawizzi', u'vee', u'arrisse', u'ti', u'ven', u'rizh', u'afichak', u'gache', u'zigerek', u'zigereo', u'drivoe', u'maaz', u'zigeree', u'ayyeyoon', u'maan', u'mahrazhi', u'ma', u'vos', u'movekkhi', u'mahrazhis', u'meshafka', u'qisi', u'sani', u'ville', u'vikeesi', u'ifak', u'javrathi', u'zisa', u'chek', u'nem', ... ])
  • 8. Inspecting the lexical distribution in a text
      from nltk import FreqDist
      dothraki_freqdist = FreqDist(dothraki_tokens)
      print dothraki_freqdist
      <FreqDist: u'anha': 50, u'vos': 40, u'me': 39, u'ma': 38, u'zhey': 29, u'mae': 27, u'anni': 26, u'hash': 23, u'yer': 23, u'khal': 16, u'khaleesi': 16, u'mori': 15, u'jin': 13, u'kisha': 12, u'nem': 11, u'vo': 11, u'che': 10, u'jini': 10, u'she': 10, ... >
      dothraki_freqdist.plot(20, cumulative=True)
  • 9. CFD of Dothraki words Top 10: anha, vos, me, ma, zhey, mae, anni, hash, yer, khal
  • 10. Valyrian vocabulary distribution
      Astapori Valyrian (Top 10): ji, me, do, espo, si, mysa, eji, ez, ivetrá, sa
      High Valyrian (Top 10): daor, se, issa, syt, ziry, hen, jemēle, lue, yne, avy
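  • 11. Feature 1: Consonant proportion
      def c_prop(word):
          # Count consonant letters (including n with tilde, \u00f1) in the word
          c_num = 0
          for letter in u'bcdfgjklmnpqrstvxz\u00f1':
              c_num += word.count(letter)
          # float() avoids Python 2 integer division, which would return 0
          return c_num / float(len(word))
      c_prop(u'z\u016bgusy')
      0.5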
  • 13. Feature 2: Obstruent proportion
      def obstruent_prop(word):
          # Count obstruent letters in the word
          obstruent_num = 0
          for letter in u'bcdfgjkpqstvxz':
              obstruent_num += word.count(letter)
          # float() avoids Python 2 integer division, which would return 0
          return obstruent_num / float(len(word))
      obstruent_prop(u'\u012blvi')
      0.25
  • 15. Feature 3: Coda presence
      def c_coda(word):
          if word[-1] in u'bcdfgjklmnpqrstvxz\u00f1':
              return 1
          else:
              return 0
      def obstruent_coda(word):
          if word[-1] in u'bcdfgjkpqstvxz':
              return 1
          else:
              return 0
      c_coda(u'lysoon')
      1
      obstruent_coda(u'lysoon')
      0
  • 16. Mean coda consonant presence across languages
  • 17. Mean coda obstruent presence across languages
  • 18. Feature 4: Consonant clusters
      regex = ur'[bcdfghjklmnpqrstvxz\u00f1][bcdfghjklmnpqrstvxz\u00f1]+'
      def c_cluster(word):
          cc_set = re.findall(regex, word, re.UNICODE)
          return len(cc_set)
      c_cluster(u'avvirsosh')
      3
  • 19. Mean consonant cluster frequency across languages
  • 20. Feature 5: Obstruent clusters
      regex1 = ur'[bcdfghjkpqstvxz][bcdfghjkpqstvxz]+'
      def obs_cluster(word):
          oo_set = re.findall(regex1, word, re.UNICODE)
          return len(oo_set)
      obs_cluster(u'avvirsosh')
      2
  • 21. Mean obstruent cluster frequency across languages
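  • 22. Feature 6: Vowel clusters
      regex2 = ur'[bcdfghjklmnpqrstvxz\u00f1]+'
      def v_cluster(word):
          # Split on consonant runs; flags must be passed by keyword,
          # because the third positional argument of re.split is maxsplit
          v_set = re.split(regex2, word, flags=re.UNICODE)
          vv_set = [v for v in v_set if len(v) > 1]
          return len(vv_set)
      v_cluster(u'haeshi')
      1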
  • 23. Mean vowel cluster frequency across languages
  • 24. Data from real languages
  • 27. Assamese corpus files
      directory = "/home/cr/Documents/NLPwP_pres/TDIL_assamese_corpus_data"
      os.listdir(directory)
      ['subj_art2.txt', 'subj_politics1.txt', 'lit3.txt', 'drama.txt',
      'religion2.txt', 'criticism2.txt', 'criticism1.txt', 'subj_science3.txt',
      'ref_encyclopaedia-entry.txt', 'subj_science2.txt',
      'subj_social-studies.txt', 'music.txt', 'subj_art1.txt
      'subj_science1.txt', 'subj_art3.txt', 'news.txt',
      'subj_sociology.txt', 'criticism3.txt', 'lit8.txt',
      'subj_history1.txt', 'lit4.txt', 'lit6.txt', 'religion
      'subj_law.txt', 'lit7.txt', 'religion1.txt', 'criticis
      'lit5.txt', 'subj_math.txt', 'lit1.txt', 'subj_science
      'subj_science_5.txt', 'subj_history2.txt', 'lit2.txt',
      'subj_science4.txt', 'letter.txt']
  • 29. Frequency of the sound /x/ in 'lit5.txt'
      len(re.findall(ur'[\u09b6\u09b7\u09b8]', assamese_sample_raw, re.UNICODE))
      1313
      len(re.findall(ur'\u09b6', assamese_sample_raw, re.UNICODE))
      298
      len(re.findall(ur'\u09b7', assamese_sample_raw, re.UNICODE))
      195
      len(re.findall(ur'\u09b8', assamese_sample_raw, re.UNICODE))
      820
  • 30. Positional restrictions
      Beginning a word:
      len(re.findall(ur'\b[\u09b6\u09b7\u09b8]', assamese_sample_raw, re.UNICODE))
      1129
      Ending a word:
      len(re.findall(ur'[\u09b6\u09b7\u09b8]\b', assamese_sample_raw, re.UNICODE))
      895
  • 31. Positional restrictions
      Following /a/:
      len(re.findall(ur'\u09be[\u09b6\u09b7\u09b8]', assamese_sample_raw, re.UNICODE))
      57
      Following /i/:
      len(re.findall(ur'[\u09bf\u09c0][\u09b6\u09b7\u09b8]', assamese_sample_raw, re.UNICODE))
      70
      Following /u/:
      len(re.findall(ur'[\u09c1\u09c2][\u09b6\u09b7\u09b8]', assamese_sample_raw, re.UNICODE))
      10
  • 32. Further work
      • Incorporate segmental parameters into classifier (fix Unicode issues with NLTK's classify module)
      • Use classifier to predict assignment of random words from Westeros to Dothraki, Astapori Valyrian, and High Valyrian languages
      • Isolate most important word-internal parameters in classification model (log-likelihood ranking in Naive Bayes model)
      • Use full distributional account of select Assamese consonants as priors in acoustic classification model
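
The word-level features from slides 11 to 22 could be gathered into a single feature dictionary, the shape a classifier would typically consume. The following is only a sketch under stated assumptions, not the presenter's code: it is written for Python 3 (the slides use Python 2), the name `word_features` and the upper-case constants are hypothetical, and the letter inventories are copied from the slides as transcribed (the cluster patterns include "h", the proportion and coda inventories do not).

```python
import re

# Letter inventories as they appear in the slides.
CONSONANTS = "bcdfgjklmnpqrstvxz\u00f1"
OBSTRUENTS = "bcdfgjkpqstvxz"
# The cluster patterns in the slides additionally include "h".
CLUSTER_CONSONANTS = "bcdfghjklmnpqrstvxz\u00f1"
CLUSTER_OBSTRUENTS = "bcdfghjkpqstvxz"

# "{2,}" matches the same runs as "[...][...]+" in the slides.
C_CLUSTER_RE = re.compile("[%s]{2,}" % CLUSTER_CONSONANTS)
O_CLUSTER_RE = re.compile("[%s]{2,}" % CLUSTER_OBSTRUENTS)
C_RUN_RE = re.compile("[%s]+" % CLUSTER_CONSONANTS)


def word_features(word):
    """Return the slide features for one non-empty, lowercased word."""
    n = len(word)
    # Splitting on consonant runs leaves the vowel runs (and empty strings).
    vowel_runs = C_RUN_RE.split(word)
    return {
        "c_prop": sum(word.count(c) for c in CONSONANTS) / n,
        "obstruent_prop": sum(word.count(c) for c in OBSTRUENTS) / n,
        "c_coda": int(word[-1] in CONSONANTS),
        "obstruent_coda": int(word[-1] in OBSTRUENTS),
        "c_clusters": len(C_CLUSTER_RE.findall(word)),
        "obstruent_clusters": len(O_CLUSTER_RE.findall(word)),
        "v_clusters": sum(1 for v in vowel_runs if len(v) > 1),
    }


features = word_features("avvirsosh")
print(features["c_clusters"], features["obstruent_clusters"])
```

Because whitespace splitting can yield empty tokens at the edges of a text, empty strings would need to be filtered out before calling the function; an empty word raises `ZeroDivisionError` here.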