Linguistic Component: Tokenizer for the
Russian language
Technical description
SemanticAnalyzer Group, 2013-08-29
www.semanticanalyzer.info
This document describes the technical details of a tokenizer for the Russian language. The component has two
modes of operation:
• Processing of generic texts: news, technical articles, etc.
• Processing of Twitter messages
The demo package, sent upon request, contains the following:
• A Java library of the tokenizer in binary form
• The run_tokenizer.sh script for quickly checking the functionality of the module
• The messages_to_tokenize.txt file containing examples of generic text and tweets for tokenization
using the run_tokenizer.sh script
The algorithm is based on a set of rules, implemented using Flex (JFlex), which extract
individual tokens from a text stream.
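To illustrate what rule-based tokenization means here, the following is a minimal sketch that uses java.util.regex in place of JFlex. The class name RuleTokenizerSketch and every pattern in it are assumptions made for illustration; they are not the rules shipped in the library. Only the token type names are taken from this document. Emoticons are not handled by this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative rule-based tokenizer. The rules are assumptions, not the library's JFlex grammar. */
final class RuleTokenizerSketch {

    // Token type names follow the types shown in this document.
    private static final String[] TYPES = {
        "HYPERLINK", "TWITTER_HASHTAG", "TWITTER_USERNAME", "ALPHANUM", "PUNCT"
    };

    // One hypothetical pattern per type, in priority order (first match wins).
    private static final Pattern[] PATTERNS = {
        Pattern.compile("(?:https?://|www\\.)\\S+"),
        Pattern.compile("#[\\p{L}\\p{N}_]+"),
        Pattern.compile("@[\\p{L}\\p{N}_]+"),
        Pattern.compile("[\\p{L}\\p{N}]+"),
        Pattern.compile("\\p{Punct}")
    };

    /** Returns tokens formatted as "text, type: TYPE". */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            boolean matched = false;
            for (int i = 0; i < PATTERNS.length && !matched; i++) {
                // Try rule i exactly at the current position.
                Matcher m = PATTERNS[i].matcher(text).region(pos, text.length());
                if (m.lookingAt()) {
                    tokens.add(m.group() + ", type: " + TYPES[i]);
                    pos = m.end();
                    matched = true;
                }
            }
            if (!matched) {
                pos++; // whitespace or a character no rule covers: skip it
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : tokenize("#ht. done!")) {
            System.out.println(token);
        }
    }
}
```

The order of the rules matters: the hyperlink rule is tried before the alphanumeric rule, otherwise www.test.com/x?y would be split into several ALPHANUM and PUNCT tokens. Unlike the library, this sketch always emits punctuation as separate PUNCT tokens.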
Speed of processing
Server: Intel(R) Xeon(R) CPU X3363 @ 2.83GHz
Operating system: Ubuntu 10.04; Java 1.7.0_21, 64-bit server
38497 characters/ms
5158 tokens/ms
Tests were conducted in a single thread.
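Taken together, the two figures above imply roughly 7.5 characters per token (38497 / 5158). The following is a minimal sketch of how per-millisecond figures of this kind can be measured in a single thread. The class ThroughputSketch, the sample text and the whitespace split used as a stand-in for the tokenizer are all assumptions; this is not the benchmark that produced the numbers above.

```java
/** Sketch of a single-threaded throughput measurement; the workload is a stand-in. */
final class ThroughputSketch {

    /** Items processed per millisecond for a given count and elapsed time in nanoseconds. */
    static double perMillisecond(long count, long elapsedNanos) {
        return count / (elapsedNanos / 1_000_000.0);
    }

    public static void main(String[] args) {
        // Hypothetical input: a short sample repeated to get a measurable run.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 200_000; i++) {
            sb.append("this is it! ");
        }
        String text = sb.toString();

        long start = System.nanoTime();
        // Stand-in for the real tokenizer: count whitespace-separated chunks.
        long tokens = text.trim().split("\\s+").length;
        long elapsed = Math.max(1, System.nanoTime() - start);

        System.out.printf("%.0f characters/ms%n", perMillisecond(text.length(), elapsed));
        System.out.printf("%.0f tokens/ms%n", perMillisecond(tokens, elapsed));
    }
}
```

A single cold run like this includes JIT warm-up, so the printed values will vary between runs and machines.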
Format of the messages_to_tokenize.txt file
This file provides the input data for the tokenizer module for demo purposes.
Format:
Text\tText type
Text: textual data in Russian to be tokenized
\t – tab character
Text type: supported values are GENERAL_TEXT and TWITTER.
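The line format described above (the text, a tab character, then the text type) can be read with a few lines of Java. This is a minimal sketch: the class InputLineSketch, its TextType enum and the choice to split on the last tab are assumptions for illustration and are not part of the tokenizer library.

```java
/** Parses one line of a file in the "Text<TAB>Text type" format. Names here are illustrative. */
final class InputLineSketch {

    /** The two text types named in this document. */
    enum TextType { GENERAL_TEXT, TWITTER }

    final String text;
    final TextType type;

    private InputLineSketch(String text, TextType type) {
        this.text = text;
        this.type = type;
    }

    static InputLineSketch parse(String line) {
        // Assumption: the LAST tab separates text from type, so the text itself may contain tabs.
        int tab = line.lastIndexOf('\t');
        if (tab < 0) {
            throw new IllegalArgumentException("Expected <text>\\t<text type>: " + line);
        }
        String text = line.substring(0, tab);
        // valueOf throws IllegalArgumentException for an unsupported type.
        TextType type = TextType.valueOf(line.substring(tab + 1).trim());
        return new InputLineSketch(text, type);
    }

    public static void main(String[] args) {
        InputLineSketch parsed = parse("this is it! www.test.com/x?y\tTWITTER");
        System.out.println(parsed.type + ": " + parsed.text);
    }
}
```

Splitting on the last tab rather than the first keeps any tab characters inside the message as part of the text.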
Examples of tokenization
The run_tokenizer.sh script will generate the following file: messages_to_tokenize.out.
For the following input file messages_to_tokenize.txt:
:)this is it! #По_русски @dm ;-D www.test.com/x?y TWITTER
The following output is generated:
:)this is it! #По_русски @dm ;-D www.test.com/x?y TWITTER
emopostkn, type: ALPHANUM
this, type: ALPHANUM
is, type: ALPHANUM
it, type: ALPHANUM
!, type: PUNCT
#По_русски, type: TWITTER_HASHTAG
@dm, type: TWITTER_USERNAME
emopostkn, type: ALPHANUM
www.test.com/x?y, type: HYPERLINK
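Example of using the library from Java code
// Requires java.io.StringReader; Tokenizer, TwitterFlexTokenizer and Token come from the tokenizer library.
Tokenizer twitterTokenizer = new TwitterFlexTokenizer(new StringReader("#ht. done!"), true);
Token reusableToken = Token.newReusableToken();
// Loop until getNextToken returns null, i.e. no tokens remain.
while ((reusableToken = twitterTokenizer.getNextToken(reusableToken)) != null) {
    System.out.println(reusableToken);
}
Output: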
Token[text=#ht,type=TWITTER_HASHTAG]
Token[text=done,type=ALPHANUM]
