Linguistic component Tokenizer for the Russian language
Technical description
SemanticAnalyzer Group, 2013-08-29
www.semanticanalyzer.info
This document describes the technical details of a tokenizer for the Russian language. The component has two
modes of operation:
- Processing of generic texts: news, technical articles, etc.
- Processing of Twitter messages
The demo package, sent upon request, contains the following:
- A Java library of the tokenizer in binary form
- The run_tokenizer.sh script for quickly checking the functionality of the module
- The messages_to_tokenize.txt file, containing examples of generic text and tweets to tokenize
  with the run_tokenizer.sh script
The algorithm is based on a set of rules, implemented using Flex (JFlex), that extract
individual tokens from a text stream.
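To make the rule-based idea concrete, here is a minimal sketch that approximates an ordered rule set with java.util.regex. It is not the library's JFlex grammar: the class name RuleTokenizerSketch, the patterns, and the rule order are assumptions made for illustration, and emoticon handling is left out. Only the token type names are taken from this document.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: a regex-based stand-in for a rule-driven lexer.
// The patterns below are assumptions, not the rules shipped with the library.
class RuleTokenizerSketch {

    // Alternatives are tried left to right at each position, so the more
    // specific rules (hashtags, usernames, links) come before the generic ones.
    private static final Pattern RULES = Pattern.compile(
          "(?<HASHTAG>#[\\p{L}\\p{N}_]+)"
        + "|(?<USERNAME>@[\\p{L}\\p{N}_]+)"
        + "|(?<HYPERLINK>(?:https?://|www\\.)\\S+)"
        + "|(?<ALPHANUM>[\\p{L}\\p{N}]+)"
        + "|(?<PUNCT>\\p{Punct})");

    // Returns one "text, type: TYPE" string per token.
    // Characters that match no rule (for example whitespace) are skipped.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = RULES.matcher(text);
        while (m.find()) {
            String type;
            if (m.group("HASHTAG") != null) type = "TWITTER_HASHTAG";
            else if (m.group("USERNAME") != null) type = "TWITTER_USERNAME";
            else if (m.group("HYPERLINK") != null) type = "HYPERLINK";
            else if (m.group("ALPHANUM") != null) type = "ALPHANUM";
            else type = "PUNCT";
            tokens.add(m.group() + ", type: " + type);
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : tokenize("it! #По_русски @dm www.test.com/x?y")) {
            System.out.println(token);
        }
    }
}
```

A generated lexer such as one produced by JFlex typically compiles its rules into a single automaton and scans the input in one pass, which a regex alternation only imitates; the sketch is meant to show how rule order decides the token type.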
Speed of processing
Server: Intel(R) Xeon(R) CPU X3363 @ 2.83GHz
Operating system: Ubuntu 10.04, Java 1.7.0_21 64-bit server
38497 characters/ms
5158 tokens/ms
Tests were conducted in a single thread.
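Taken together, the two figures imply roughly 7.5 characters per token (38497 / 5158). For readers who want to take comparable single-threaded measurements, the following is a rough sketch of a timing loop. The class ThroughputSketch and its countTokens method are hypothetical stand-ins; in a real measurement the call to countTokens would be replaced by a pass of the actual tokenizer over the text. This is not the benchmark used to obtain the numbers above.

```java
// Illustrative timing harness; all names here are assumptions.
class ThroughputSketch {

    // Hypothetical stand-in for the tokenizer: counts whitespace-separated chunks.
    static int countTokens(String text) {
        int count = 0;
        boolean inToken = false;
        for (int i = 0; i < text.length(); i++) {
            boolean whitespace = Character.isWhitespace(text.charAt(i));
            if (!whitespace && !inToken) {
                count++;
            }
            inToken = !whitespace;
        }
        return count;
    }

    // Returns {characters per ms, tokens per ms} over the given number of passes.
    static double[] measure(String text, int rounds) {
        long chars = 0;
        long tokens = 0;
        long start = System.nanoTime();
        for (int i = 0; i < rounds; i++) {
            tokens += countTokens(text);
            chars += text.length();
        }
        // Guard against a zero interval on very fast runs.
        double elapsedMs = Math.max(System.nanoTime() - start, 1) / 1_000_000.0;
        return new double[] { chars / elapsedMs, tokens / elapsedMs };
    }

    public static void main(String[] args) {
        String sample = "this is a small sample text for timing ";
        measure(sample, 10_000); // warm-up so the JIT compiles the hot loop
        double[] result = measure(sample, 100_000);
        System.out.printf("%.0f characters/ms, %.0f tokens/ms%n", result[0], result[1]);
    }
}
```

Results from such a loop depend heavily on the JVM, warm-up, and input text, so they are only comparable when measured under the same conditions.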
Format of the messages_to_tokenize.txt file
This file contains the input data for the tokenizer module, for demo purposes.
Format:
Text\tText type
Text: textual data in Russian to be tokenized
\t – tab symbol
Text type: supported values are GENERAL_TEXT and TWITTER.
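As an illustration of this format, the sketch below parses one input line into its text and text type. The names InputLineSketch, Message, and TextType are hypothetical and are not part of the library; splitting on the last tab (so that tabs inside the text survive) is likewise an assumption, since the document does not say how the demo script handles such lines.

```java
// Illustrative parser for one line of messages_to_tokenize.txt.
// All class and method names here are assumptions, not library API.
class InputLineSketch {

    // The two text types named in this document.
    enum TextType { GENERAL_TEXT, TWITTER }

    static final class Message {
        final String text;
        final TextType type;

        Message(String text, TextType type) {
            this.text = text;
            this.type = type;
        }
    }

    // Splits on the LAST tab so that any tabs inside the text are preserved.
    static Message parse(String line) {
        int tab = line.lastIndexOf('\t');
        if (tab < 0) {
            throw new IllegalArgumentException("Expected Text<TAB>Text type: " + line);
        }
        String text = line.substring(0, tab);
        // valueOf throws IllegalArgumentException for an unsupported text type.
        TextType type = TextType.valueOf(line.substring(tab + 1).trim());
        return new Message(text, type);
    }

    public static void main(String[] args) {
        Message message = parse("this is it! @dm\tTWITTER");
        System.out.println(message.type + " -> " + message.text);
    }
}
```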
Examples of tokenization
The run_tokenizer.sh script will generate the following file: messages_to_tokenize.out.
For the following input file messages_to_tokenize.txt:
:)this is it! #По_русски @dm ;-D www.test.com/x?y TWITTER
the following output is generated:
:)this is it! #По_русски @dm ;-D www.test.com/x?y TWITTER
emopostkn, type: ALPHANUM
this, type: ALPHANUM
is, type: ALPHANUM
it, type: ALPHANUM
!, type: PUNCT
#По_русски, type: TWITTER_HASHTAG
@dm, type: TWITTER_USERNAME
emopostkn, type: ALPHANUM
www.test.com/x?y, type: HYPERLINK
Note that both emoticons in the input, :) and ;-D, appear in the output as the placeholder token emopostkn.
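Example of using the library from Java code
import java.io.StringReader;
// Tokenizer, TwitterFlexTokenizer and Token are provided by the tokenizer library.

Tokenizer twitterTokenizer = new TwitterFlexTokenizer(new StringReader("#ht. done!"), true);
Token reusableToken = Token.newReusableToken();
// getNextToken returns null once the input is exhausted.
while ((reusableToken = twitterTokenizer.getNextToken(reusableToken)) != null) {
    System.out.println(reusableToken);
}
Output: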
Token[text=#ht,type=TWITTER_HASHTAG]
Token[text=done,type=ALPHANUM]