STAT Technical Report                 Version 0.1


                                                      by the Stat Team

                                                        Mehrbod Sharifi
                                                             Jing Yang




               The Stat Project, guided by

          Professor Eric Nyberg and Anthony Tomasic




                     March 5, 2009
Chapter 1

Introduction to STAT

In this chapter, we give a brief introduction to the Stat project. We explain the background,
the motivation, the scope, and the stakeholders of this project so that readers can understand
why we are undertaking it, what we plan to do, and who may be interested in our project.


1.1    Overview
Stat is an open source machine learning framework in Java for text analysis, with a focus on
semi-supervised learning algorithms. Its main goal is to facilitate common textual data analysis
tasks for researchers and engineers, so that they can get their work done straightforwardly and
efficiently.

    Applying machine learning approaches to extract information and uncover patterns from tex-
tual data has become extremely popular in recent years. Accordingly, many software packages
have been developed to let people apply machine learning to text analytics and automate the
process. Users, however, find many of these existing packages difficult to use, even for a simple
experiment; they have to spend much time learning them, and may finally find that they still
need to write their own programs to preprocess data before their target software will run.

    We notice this situation and observe that much of it can be simplified. A new software
framework should be developed to ease the process of doing text analytics; we believe researchers
and engineers using our framework for textual data analysis would find the process convenient,
comfortable, and, probably, enjoyable.


1.2    Purpose
Existing software for applying machine learning to linguistic analysis has tremendously helped
researchers and engineers make new discoveries from textual data, which is unarguably one of
the most common forms of data in the real world.

    As a result, many more researchers, engineers, and students are increasingly interested in
using machine learning approaches in their text analytics. However, the bar for entering this
area is not low. These people, some of them even experienced users, find that existing software
packages are generally not easy to learn or convenient to use.



For example, although Weka has a comprehensive suite of machine learning algorithms, it is
not designed for text analysis and lacks natural support for representing and processing linguistic
concepts. MinorThird, on the other hand, though designed specifically as a package for text
analysis, turns out to be rather complicated and difficult to learn. It also does not support
semi-supervised and unsupervised learning, which are becoming increasingly important machine
learning approaches.

   Another problem with many existing packages is that they often adopt their own specific input
and output formats. Real-world textual data, however, are generally in other formats that are not
readily understood by those packages. Researchers and engineers who want to make use of those
packages often find themselves spending much time seeking or writing ad hoc format conversion
code. This ad hoc code, which could have been reusable, is often written over and over again
by different users.

    Researchers and engineers, when presented with common text analysis tasks, usually want a
text-specific, lightweight, reusable, understandable, and easy-to-learn package that helps them get
their work done efficiently and straightforwardly. Stat is designed to meet these requirements.
Motivated by the needs of users who want to simplify their work and experiments with textual
data learning, we initiated the Stat project, dedicated to providing them suitable toolkits that
facilitate their analytics tasks on textual data.
      In a nutshell, Stat is an open source framework aimed at providing researchers and
      engineers with an integrated set of simplified, reusable, and convenient toolkits for textual
      data analysis. Based on this framework, researchers can carry out their machine learning
      experiments on textual data conveniently and comfortably, and engineers can build their
      own small applications for text analytics straightforwardly and efficiently.


1.3     Scope
This project involves developing a simplified and reusable framework (a collection of foundation
classes) in Java that provides basic and common capabilities for people to easily perform machine
learning analysis on various kinds of textual data.

    The previous section may give the impression of an impossible task. In this section, we clearly
state what is and is not included in this project.

   The main deliverable for this project is a set of specifications, which define a simplified frame-
work for text analysis based on NLP and machine learning. We explain how succinctly the frame-
work can be used and how easily it can be extended.

    We also provide introductory implementations of the framework, including tools and packages
serving as the foundation classes of the framework. They are:
   • Dataset and framework object adaptors: A set of classes that allow reading and
     writing files in various formats, supporting importing and exporting datasets as well as
     loading and saving framework objects.
   • Linguistic and machine learning package wrappers: A set of classes that integrate
     existing tools for NLP and machine learning so they can be used within the framework.
     These wrappers hide the implementation and variation details of those packages to provide
     a set of simplified and unified interfaces to framework users.

   • Semi-supervised algorithms: Implementations of certain semi-supervised learning algo-
     rithms that are not available in existing packages.

   Finally, we provide a set of documents that reflect our development process and give users
guidance on how to use our framework. These documents are:

   • Technical report: A report summarizing our major artifacts and documenting the vision,
     goals, motivation, and decisions. It also includes the results of the requirements and design
     phases and the results of the final evaluation. This report gives an overall picture to anyone
     who wants to understand our package as well as our software development process.

   • Executive summary: A summary that gives a brief introduction to users of our framework,
     explaining the benefits that this framework can bring to them in text analysis.

   • JavaDocs, tutorials, and examples: JavaDocs of the API specifications, extracted from
     comments in the code. Tutorials and concrete examples are also provided to ease the process
     of learning this framework.




1.4    Stakeholders
Below is the list of stakeholders and how this project will affect them:

   • Researchers, particularly in language technology but also in other fields, would be able
     to save time by focusing on their experiments instead of dealing with the various input/output
     formats that are routinely necessary in text processing. They can also easily switch between
     the various tools available and even contribute to STAT so that others can save time by
     using their adaptors and algorithms.

   • Software engineers, who are not familiar with machine learning, can start using the
     package in their programs after a very short learning phase. STAT can help them develop
     clear concepts of machine learning quickly. They can easily build their applications using
     functionality provided by STAT and achieve a high level of performance.

   • Developers of learning packages can provide plug-ins for STAT to allow easy integration
     of their packages. They can also delegate some of their interoperability needs to this
     program (some of which may be more time-consuming to address within their own
     packages).

   • Beginners to text processing and mining, who want fundamental and easy-to-learn
     capabilities for discovering patterns in text. They will benefit from this project through
     the time it saves them, the way it eases their learning process, and the interest it sparks
     in the area of language technology.




Chapter 2

Survey Analysis

This project faced many challenges from the beginning. There are many questions, some of
a subjective nature, that really need to be answered by our target audience. For this reason, we
designed a survey to obtain a better understanding and provide a more suitable solution to this
problem. In this chapter, we explain the process of designing the survey, collecting information,
and some analysis of the collected data.


2.1    Designing the Survey
The primary goals of conducting the survey were the following:

   • Understanding the potential users of the package: their programming habits, problem solving
     strategies, experience in various areas and tools, etc.

   • Setting priority for which criteria to focus on for our design and implementation

   The survey needed to be short, and the questions needed to be very specific to get better
responses. The maximum number of questions was set at 10. Several drafts of the questions were
reviewed within the STAT group and by the software engineering class students and instructors
until finalized. We also obtained and incorporated advice from other departments. The final
survey was built on SurveyMonkey.com.


2.2    Distribution
The target users of STAT fall into two main groups with different needs: researchers and industry
programmers. The survey contains questions to distinguish these two groups, but the final
framework should address the needs of both. After conducting a test run with the STAT group
and the class, we sent the survey out to the Language Technologies Institute student mailing list
(representing researchers) and also to students in the iLab (Prof. Ramayya Krishnan, Heinz School
of Business), representing industry programmers.


2.3    Analysis of Results
As of 2/25/09, we had received 23 responses, which were reviewed by STAT members both
individually and in aggregate. Below we summarize the findings of the survey with some charts:




 • While many different programming languages are used (Python, R, C++), over 90% of
   respondents mentioned Java as one of their languages (25% consider themselves experts).
   Reported programming hours range from 2 to 60.

 • Users do not seem to distinguish much between industry and research applications; perhaps
   more research is needed for the difference to become apparent.

 • Most users are not familiar with Operations Research, but everyone is somewhat familiar
   with Machine Learning (if not specifically text classification or data mining).

 • Data types, as expected, were mostly textual (plain, XML, HTML, etc., as opposed to
   Excel, though it was mentioned) and sources were files, databases, and the web.

 • Over 50% chose: “I write a program to preprocess data and then use an external machine
   learning package.”

 • Ease of API use, performance, and extensibility were the top three design choices, but
   beyond those, in the free-text descriptions users mostly pointed out problems with input
   and output formats.




                  Figure 2.1: Distribution of familiarity with packages




Figure 2.2: Distribution of design preference




Chapter 3

Analysis of Related Packages

In this chapter, we analyze a few main competitors of our project. We focus on two academic
toolkits, Weka and MinorThird. We comment on their strengths, explore their limitations, and
discuss why and how we can do better than these competitors.


3.1        Weka
Weka is a comprehensive collection of machine learning algorithms for solving data mining
problems, implemented in Java and open sourced under the GPL.

3.1.1       Strengths of Weka
Weka is a very popular machine learning package due to its main strengths:

       • Provides comprehensive machine learning algorithms. Weka supports most current
         machine learning approaches for classification, clustering, regression, and association rules.
       • Covers most aspects of a full data mining process. In addition to learning,
         Weka supports common data preprocessing methods, feature selection, and visualization.
       • Freely available. Weka is open source, released under the GNU General Public License.
       • Cross-platform. Weka is fully implemented in Java and therefore cross-platform.

Because of its comprehensive support for machine learning algorithms, Weka is often used for
analytics on many forms of data, including textual data.

3.1.2       Limitations of using Weka for text analysis
However, Weka is not designed specifically for textual data analysis. The most critical drawback
of using Weka for processing text is that Weka does not provide “built-in” constructs for natural
representation of linguistic concepts1 . Users interested in using Weka for text analysis often find
themselves needing to write ad hoc programs for text preprocessing and conversion to Weka's
representation.

   • Not good at understanding various text formats. Weka is good at understanding its
     standard .arff format, which is, however, not a convenient representation of text. Users
     have to worry about how to convert textual data from various original formats, such as
     raw plain text, XML, HTML, CSV, Excel, PDF, MS Word, Open Office documents, etc.,
     into a form understandable by Weka. As a result, they need to spend time seeking or
     writing external tools to complete this task before performing their actual analysis.
   1
     Though there are classes in Weka supporting basic natural language processing, they are viewed as auxiliary
utilities. They make performing basic textual data processing using Weka possible, but not conveniently or
straightforwardly.
   • Unnecessary data type conversion. Weka is superior at processing nominal (a.k.a. cate-
     gorical) and numerical attributes, but not string attributes. In Weka, non-numerical attributes
     are by default imported as nominal attributes, which is usually not a desirable type for text
     (imagine treating different chunks of text as different values of a categorical attribute). One
     has to explicitly use filters to do a conversion, which could have been done automatically if
     Weka knew you were importing text.
   • Lack of specialized support for linguistic preprocessing. Linguistic preprocessing
     is a very important aspect of textual data analysis but not a concern of Weka, which does
     not take this issue very seriously for users (or at least is not dedicated to it). Weka has a
     StringToWordVector class that performs all-in-one basic linguistic preprocessing, including
     tokenization, stemming, stopword removal, tf-idf transformation, etc. However, it is less
     flexible and lacks other techniques (such as part-of-speech tagging and n-gram processing)
     for users who want fine-grained and advanced linguistic control.
   • Unnatural representation of textual data learning concepts. Weka is designed for
     general-purpose machine learning tasks and therefore has to accommodate many variations.
     As a result, domain concepts in Weka are abstract and high-level, the package hierarchy is
     deep, and the number of classes explodes. For example, one has to use Instance rather than
     Document and Instances rather than Corpus. Concepts in Weka such as Attribute are
     obscure in meaning for text processing. Adding many Attributes to a cryptic FastVector,
     which is then passed to an Instances object in order to construct a dataset, appears very
     awkward to users processing text. Filters categorized first by attribute/instance and then
     by supervised/unsupervised leave non-expert users confused and unable to find the right
     filter. Many users may feel uncomfortable programmatically using Weka to carry out their
     text-related experiments.
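To make the preprocessing point above concrete, the kind of transformation that StringToWordVector bundles can be sketched in a few lines of plain Java. This is our own minimal illustration of raw-count tf-idf (weight = tf × log(N/df)); it is not Weka's implementation, and the class and method names are ours:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Minimal tf-idf sketch: raw term frequency times log(N / df).
// Illustrative only; real toolkits add configurable tokenization,
// stemming, stopword removal, normalization, and smoothing.
class TfIdfSketch {

    /** Compute a tf-idf weight map for each tokenized document. */
    static List<Map<String, Double>> tfIdf(List<List<String>> docs) {
        int n = docs.size();
        // Document frequency: in how many documents each term occurs.
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs)
            for (String term : new HashSet<>(doc))
                df.merge(term, 1, Integer::sum);

        List<Map<String, Double>> result = new ArrayList<>();
        for (List<String> doc : docs) {
            Map<String, Integer> tf = new HashMap<>();
            for (String term : doc) tf.merge(term, 1, Integer::sum);
            Map<String, Double> weights = new HashMap<>();
            for (Map.Entry<String, Integer> e : tf.entrySet())
                weights.put(e.getKey(),
                        e.getValue() * Math.log((double) n / df.get(e.getKey())));
            result.add(weights);
        }
        return result;
    }
}
```

Note that a term occurring in every document gets weight zero under this scheme, which is exactly the kind of behavior users must understand before trusting any all-in-one filter.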

    In summary, users who want an enjoyable experience performing text analysis need built-in
capabilities that naturally support representing and processing text. They need specialized and
convenient tools that can help them finish the most common text analysis tasks straightforwardly
and efficiently. Weka cannot provide this, due to its general-purpose nature, despite its compre-
hensive tools.

   Figure 3.1 shows the domain model we extracted for Weka for basic text analysis.




      [Figure omitted: partial UML domain model of Weka (preliminary), relating
      Evaluation, Classifier, Instances, Instance, Attribute, StringToWordVector,
      and NominalToString. Note: when ClassA "contains" a number of ClassB, Weka
      typically implements this as ClassA maintaining a "FastVector" whose
      elements are instances of ClassB.]

                Figure 3.1: Partial domain model for Weka for basic text analysis


3.2    MinorThird
Figure 3.2 shows the domain model we extracted for MinorThird for basic text analysis.




Figure 3.2: Partial domain model for MinorThird for basic text analysis




Chapter 4

Requirements specifications

Here we first explain in detail the major features of our framework.

   • Simplified. APIs are clear, consistent, and straightforward. Users with reasonable Java
     programming knowledge can learn our package without much effort, understand its logical
     flow quickly, get started within a small amount of time, and finish the most common tasks
     with a few lines of code. Since our framework is not designed for general purposes or for
     comprehensive features, there is room for us to simplify the APIs and optimize for the most
     typical and frequent operations.

   • Extensible and Reusable. Built-in modular support is provided for the core routines
     across the various phases of text analysis, including text format transformation, linguistic
     processing, machine learning, and experimental evaluation. Additional functionality can be
     extended on top of the core framework easily, and user-defined specifications are pluggable.
     Existing code can be reused across environments and interoperate with related external
     packages, such as Weka, MinorThird, and OpenNLP.

   • High performance. The speed of the algorithms we wrap and implement should be
     acceptable for typical experiments and datasets. Specifically, there should be no significant
     performance degradation when using capabilities provided by external packages through
     our framework, and the performance of algorithms we implement should not degrade much
     from their best complexity because of implementation flaws.
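The wrapper idea in the extensibility point above suggests a classic adapter design. As a sketch, the adapter below hides a stand-in "external" learner behind one unified interface; every name here (StatClassifier, ExternalMajorityLearner, ExternalLearnerAdapter) is a hypothetical illustration, not Stat's actual API:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical unified interface the framework would expose to users.
interface StatClassifier {
    void train(List<Map<String, Double>> instances, List<String> labels);
    String classify(Map<String, Double> instance);
}

// Stand-in for an external package with its own calling convention.
class ExternalMajorityLearner {
    private String majority;
    void buildFromLabels(String[] labels) {
        Map<String, Integer> counts = new HashMap<>();
        for (String l : labels) counts.merge(l, 1, Integer::sum);
        majority = Collections.max(counts.entrySet(),
                Map.Entry.comparingByValue()).getKey();
    }
    String predictAnything() { return majority; }
}

// Adapter: hides the external API behind the unified interface, so user
// code depends only on StatClassifier, never on the wrapped package.
class ExternalLearnerAdapter implements StatClassifier {
    private final ExternalMajorityLearner delegate = new ExternalMajorityLearner();
    public void train(List<Map<String, Double>> instances, List<String> labels) {
        delegate.buildFromLabels(labels.toArray(new String[0]));
    }
    public String classify(Map<String, Double> instance) {
        return delegate.predictAnything();
    }
}
```

Because the adapter only forwards calls, the performance requirement above reduces to keeping this forwarding layer thin.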


4.1    Functional Requirements
In this section, we define most common use cases of our framework and address them in the degree
of detail of casual use case. The “functional requirements” of this project are that the users can
use libraries provided by our framework to complete these use cases more easily and comfortably
than not use.

Actors
Since our framework assumes that all users of interest program against our APIs, there is
only one human actor role, namely the programmer. This human actor is always the primary
actor. There are some possible secondary and system actors, namely the external packages our
framework integrates, depending on which specific use case the primary actor is performing.




Fully-dressed Use Cases

               Use Case UC1: Document Classification Experiment

Scope: Text analysis application using STAT framework

Level: User goal

Primary Actor: Researcher

Stakeholders and Interests:

       • Researcher: Wants to test and evaluate a classification algorithm (supervised, semi-
         supervised, or unsupervised) by applying it to a (probably well-known) corpus; the task
         needs to be done efficiently with easy and straightforward coding

Preconditions:

       • STAT framework is correctly installed and configured
       • The corpus is placed on a source readable by STAT framework

Postconditions:

       • A model is trained and test documents in the corpus are classified. Evaluation results
         are displayed

Main Success Scenario:

       1. Researcher imports the corpus from its source into memory. Specifically, the system
          reads data from the source, parses the raw format, extracts information according to
          the schema, and constructs an in-memory object to store the corpus
       2. Researcher performs preprocessing on the corpus. Specifically, for each document, the
          researcher tokenizes the text, removes stopwords, performs stemming on the tokens,
          applies filtering, and/or other potential preprocessing on body text and metadata
       3. Researcher converts the corpus into the feature vectors needed for machine learning.
          The feature vectors are created by analyzing the documents in the corpus, deriving or
          filtering features, adding or removing documents, sampling documents, handling missing
          entries, normalizing features, selecting features, and/or other potential processing
       4. Researcher splits the processed corpus into training and testing set
       5. Researcher chooses a machine learning algorithm, sets its parameters, and uses it to
          train a model from the training set
       6. Researcher classifies the documents in the test set based on the model trained
       7. Researcher evaluates the classification based on the classification results obtained on
          the test set and its true labels. Classification is evaluated mainly on classification
          accuracy and classification time, or, if it is unsupervised, on other unsupervised metrics
          such as the Adjusted Rand Index.
       8. Researcher displays the final evaluation result
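Step 7 names the Adjusted Rand Index as an unsupervised evaluation metric. For reference, the standard formula, ARI = (Index − ExpectedIndex) / (MaxIndex − ExpectedIndex) over pair counts, can be computed as below; this is a generic sketch, not Stat's code:

```java
import java.util.HashMap;
import java.util.Map;

// Adjusted Rand Index between two flat partitions given as label arrays.
// 1.0 means identical partitions; values near 0 mean chance-level agreement.
// Degenerate inputs (e.g., both partitions all singletons) make the
// denominator zero and would need special handling in production code.
class AriSketch {

    private static double comb2(long n) { return n * (n - 1) / 2.0; }

    static double adjustedRandIndex(int[] a, int[] b) {
        int n = a.length;
        Map<Long, Integer> cells = new HashMap<>();   // contingency table n_ij
        Map<Integer, Integer> rows = new HashMap<>(); // cluster sizes in a
        Map<Integer, Integer> cols = new HashMap<>(); // cluster sizes in b
        for (int i = 0; i < n; i++) {
            // Pack the (a[i], b[i]) cell coordinates into one long key.
            cells.merge(((long) a[i] << 32) | (b[i] & 0xffffffffL), 1, Integer::sum);
            rows.merge(a[i], 1, Integer::sum);
            cols.merge(b[i], 1, Integer::sum);
        }
        double sumCells = 0, sumRows = 0, sumCols = 0;
        for (int c : cells.values()) sumCells += comb2(c);
        for (int r : rows.values()) sumRows += comb2(r);
        for (int c : cols.values()) sumCols += comb2(c);
        double expected = sumRows * sumCols / comb2(n);
        double max = (sumRows + sumCols) / 2.0;
        return (sumCells - expected) / (max - expected);
    }
}
```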




Use Case UC1: Document Classification Experiment (cont.)

Extensions:
    1a. The framework is unable to find the specified source.
       1. Throw source not found exception
    1b. Researcher loads a previously saved corpus in native format from a file on disk directly
    into a memory object; thus the researcher does not handle source, format, or schema explicitly.
       1a. File not found:
          1. Throw file not found exception
       1b. Malformed native format:
          1. Throw malformed native format exception
    4a. Researcher specifies a parameter k larger than the number of documents or smaller than 1
       1. Throw invalid argument exception
    1-3, 5a. Researcher saves the in-memory objects of the corpus representation at different
    levels of processing to disk in native format, which can be loaded back later, after finishing
    each step.
    1-3, 5b. Researcher exports the in-memory objects of different processed corpus representa-
    tions to disk in external formats (e.g., Weka arff, CSV) which can be processed by external
    software.
    6a. Researcher saves the in-memory model object to disk, which can be loaded back later.
    6b. Researcher loads a previously saved model in native format from a file on disk directly
    into a memory object.
       1a. File not found:
          1. Throw file not found exception
       1b. Malformed native format:
          1. Throw malformed native format exception
    4-8b. To perform k-fold cross validation, the corpus is split into k parts in step 4, and steps
    5-8 are repeated k times, taking each split in turn as the testing split and the rest as training.
    Researcher combines the evaluations on the different test sets obtained in the previous steps
    and forms a final classification evaluation.
    6c. Unsupported learning parameters (the learning algorithm cannot handle the combination
    of parameters the researcher specifies)
       1. Throw unsupported learning parameters exception
    6d. Unsupported learning capability (the learning algorithm cannot handle the format or
    data in the training set, potentially caused by unsupported feature types, class types, missing
    values, etc.)
       1. Identify exception cause(s)
       2. Throw corresponding exception(s)
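The k-fold extension above implies some index bookkeeping. One way it might be sketched, using contiguous folds for clarity (a real implementation would shuffle, could stratify by class, and the names here are ours, not the framework's):

```java
import java.util.ArrayList;
import java.util.List;

// Split the indices 0..n-1 into k contiguous test folds of near-equal size.
// For each fold, the training indices are simply the complement, so no
// extra copies of the corpus itself are needed.
class KFoldSketch {
    static List<List<Integer>> testFolds(int n, int k) {
        List<List<Integer>> folds = new ArrayList<>();
        int base = n / k, extra = n % k, start = 0;
        for (int f = 0; f < k; f++) {
            int size = base + (f < extra ? 1 : 0); // first n%k folds get one more
            List<Integer> fold = new ArrayList<>();
            for (int i = start; i < start + size; i++) fold.add(i);
            folds.add(fold);
            start += size;
        }
        return folds;
    }
}
```

Splitting at the index level rather than copying documents also speaks to the open issue below about performing corpus splits without creating extra objects.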




Use Case UC1: Document Classification Experiment (cont.)

    8a. Incompatibility between the test set and the classifier (potentially caused by a difference
    in schema between training set and test set)
       1. Throw incompatible evaluation exception
    10a. The researcher customizes the display instead of using the default display format.
       1. The researcher obtains specific fields of the evaluations via the provided interfaces
       2. The researcher constructs a customized format using the fields he/she extracts
       3. The researcher displays the customized format and/or writes it to a destination

Special Requirements:

        • Pluggable preprocessors in step 2-3
        • Pluggable learning algorithm in step 6
        • Learning algorithms should be scalable to deal with large corpora
        • Researcher should be able to visualize results after various steps to trace the state of
          different objects (e.g., preprocessed corpus, models, classifications, evaluations)
        • Researcher should be able to customize the visualization output

Open Issues:

        • How to address the variation issues in reading different sources
        • How to (in what form) let researchers specify parameters for different learning algorithms
        • What specifically needs to be exportable, persistable, and visualizable
        • How to implement corpus splitting efficiently (without creating extra objects)
        • How to deal with the performance issues of storing large corpora in memory
        • How to represent the dataset internally with efficient data structures



4.2    Non-functional Requirements
  • Open source. It should be made available for public collaboration, allowing users to use,
    change, improve, and redistribute the software.

  • Portability. It should be consistently installed, configured, and run independently of the
    platform, given its design and implementation on the Java runtime environment.

  • Performance. It should not be the bottleneck of machine learning analysis. That is, in
    wrapping other existing packages, no significant reduction in performance should be observed.
    For algorithms we implement, our code should achieve performance matching the best
    complexity the algorithm allows.

  • Documentation. Its code should be readable, self-explanatory, and documented clearly and
    unambiguously in critical or tricky parts. It should include an introductory guide for users
    to get started and, preferably, provide sample datasets, tutorials, and demos for users to run
    examples out of the box.




4.3     Domain model
In this section, we present the domain model diagram and some explanation of it. A lot of time
was spent on this domain model, and it has evolved into a relatively stable one, which will guide
our design in iteration I.

4.3.1   Domain model diagram
Figure 4.1 shows the domain model diagram of the STAT project (for the first iteration). This
domain model is intended to give a top-level understanding of the concepts in the project rather
than a comprehensive one that includes every detail. A number of concepts, such as
“Annotation”, “Label”, “ProbabilityDistribution”, “DistanceMetric”, and “Partition” (or
maybe “Split”), are not shown in this diagram.




               Figure 4.1: Partial domain model of STAT for basic text analysis

    Note that the diagram also lacks some top-level concepts related to unsupervised learning.
These are topics for the milestone requirements and design iteration II. A new domain model
incorporating these concepts will be proposed at that phase. For now, we focus on this domain
model and clarify in detail what the current concepts are.

4.3.2   Domain concepts clarifications

   • CorpusReader. CorpusReader reads text from a source into a Corpus. No content trans-
     formation is done and everything (label, metadata, body, etc.) stays in text format.

   • Corpus. Corpus is a set of Documents in text format.

   • Annotator. Annotator transforms a Corpus to another Corpus by adding annotations.

• FeatureExtractor. FeatureExtractor transforms a Corpus into a Dataset by converting
  Documents in text format into Instances in feature representation.

• Dataset. Dataset is a set of Instances, which are feature representations of Document
  text.

• Learner. Learner learns a Model from a Dataset.

• Model. Model is a collection of parameters learned from a Dataset by the Learner.

• Classifier. Classifier uses the model to assign classes to Instances in a Dataset and
  produces a Classification with respect to the Dataset.

• Classification. A Classification contains the classification results and descriptive informa-
  tion about the classification process, e.g., which Model and Classifier were used to produce
  the classification results.

• ClassificationEvaluator. A ClassificationEvaluator computes the evaluation metrics for a
  Classification and produces a ClassificationEvaluation.

• ClassificationEvaluation. ClassificationEvaluation contains the evaluation results
  and descriptive information about the classification and evaluation process.
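Taken together, these concepts form a linear pipeline: a CorpusReader produces a Corpus, a FeatureExtractor turns it into a Dataset, a Learner trains a Model, a Classifier produces a Classification, and a ClassificationEvaluator scores it. The sketch below illustrates this flow in plain Java. The class names come from the domain model above, but every method signature, the toy word-count feature, and the threshold "learner" are our own illustrative assumptions, not the actual STAT API.

```java
import java.util.ArrayList;
import java.util.List;

// Domain concepts modeled as plain records (illustrative only).
record Document(String body, String label) {}
record Corpus(List<Document> documents) {}
record Instance(double feature, String label) {}
record Dataset(List<Instance> instances) {}
record Model(double threshold) {}
record Classification(List<String> predictedLabels) {}

public class PipelineSketch {

    // FeatureExtractor: Corpus -> Dataset, using a single word-count feature.
    static Dataset extractFeatures(Corpus corpus) {
        List<Instance> out = new ArrayList<>();
        for (Document d : corpus.documents()) {
            out.add(new Instance(d.body().split("\\s+").length, d.label()));
        }
        return new Dataset(out);
    }

    // Learner: Dataset -> Model (threshold = mean feature value).
    static Model learn(Dataset train) {
        double sum = 0;
        for (Instance i : train.instances()) sum += i.feature();
        return new Model(sum / train.instances().size());
    }

    // Classifier: uses a Model to assign a class to each Instance.
    static Classification classify(Model model, Dataset test) {
        List<String> labels = new ArrayList<>();
        for (Instance i : test.instances()) {
            labels.add(i.feature() > model.threshold() ? "long" : "short");
        }
        return new Classification(labels);
    }

    // ClassificationEvaluator: compares predictions with true labels.
    static double evaluate(Classification c, Dataset test) {
        int correct = 0;
        for (int i = 0; i < test.instances().size(); i++) {
            if (c.predictedLabels().get(i).equals(test.instances().get(i).label())) correct++;
        }
        return (double) correct / test.instances().size();
    }

    public static void main(String[] args) {
        Corpus corpus = new Corpus(List.of(
            new Document("the cat sat", "short"),
            new Document("a considerably longer example document body text", "long")));
        Dataset data = extractFeatures(corpus);
        Model model = learn(data);
        Classification result = classify(model, data);
        System.out.println(evaluate(result, data)); // prints 1.0
    }
}
```

Each stage consumes exactly the artifact the previous stage produced, which is the property that lets users swap in their own FeatureExtractor or Learner without touching the rest of the pipeline.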





Weitere ähnliche Inhalte

Was ist angesagt?

Industry-Academia Communication In Empirical Software Engineering
Industry-Academia Communication In Empirical Software EngineeringIndustry-Academia Communication In Empirical Software Engineering
Industry-Academia Communication In Empirical Software Engineering
Per Runeson
 
Presentation: Tool Support for Essential Use Cases to Better Capture Software...
Presentation: Tool Support for Essential Use Cases to Better Capture Software...Presentation: Tool Support for Essential Use Cases to Better Capture Software...
Presentation: Tool Support for Essential Use Cases to Better Capture Software...
Naelah AlAgeel
 
EVALUATION OF SINGLE-SPAN MODELS ON EXTRACTIVE MULTI-SPAN QUESTION-ANSWERING
EVALUATION OF SINGLE-SPAN MODELS ON EXTRACTIVE MULTI-SPAN QUESTION-ANSWERINGEVALUATION OF SINGLE-SPAN MODELS ON EXTRACTIVE MULTI-SPAN QUESTION-ANSWERING
EVALUATION OF SINGLE-SPAN MODELS ON EXTRACTIVE MULTI-SPAN QUESTION-ANSWERING
dannyijwest
 
Finding Bad Code Smells with Neural Network Models
Finding Bad Code Smells with Neural Network Models Finding Bad Code Smells with Neural Network Models
Finding Bad Code Smells with Neural Network Models
IJECEIAES
 
Crocodile physics2
Crocodile physics2Crocodile physics2
Crocodile physics2
haitham95
 

Was ist angesagt? (11)

Icsoc12 tooldemo.ppt
Icsoc12 tooldemo.pptIcsoc12 tooldemo.ppt
Icsoc12 tooldemo.ppt
 
Industry-Academia Communication In Empirical Software Engineering
Industry-Academia Communication In Empirical Software EngineeringIndustry-Academia Communication In Empirical Software Engineering
Industry-Academia Communication In Empirical Software Engineering
 
Presentation: Tool Support for Essential Use Cases to Better Capture Software...
Presentation: Tool Support for Essential Use Cases to Better Capture Software...Presentation: Tool Support for Essential Use Cases to Better Capture Software...
Presentation: Tool Support for Essential Use Cases to Better Capture Software...
 
Machine Learning in Static Analysis of Program Source Code
Machine Learning in Static Analysis of Program Source CodeMachine Learning in Static Analysis of Program Source Code
Machine Learning in Static Analysis of Program Source Code
 
Second generation semantic technologies for patent analysis
Second generation semantic technologies for patent analysisSecond generation semantic technologies for patent analysis
Second generation semantic technologies for patent analysis
 
PhD Thesis Defense - Enhancing Software Quality and Quality of Experience thr...
PhD Thesis Defense - Enhancing Software Quality and Quality of Experience thr...PhD Thesis Defense - Enhancing Software Quality and Quality of Experience thr...
PhD Thesis Defense - Enhancing Software Quality and Quality of Experience thr...
 
EVALUATION OF SINGLE-SPAN MODELS ON EXTRACTIVE MULTI-SPAN QUESTION-ANSWERING
EVALUATION OF SINGLE-SPAN MODELS ON EXTRACTIVE MULTI-SPAN QUESTION-ANSWERINGEVALUATION OF SINGLE-SPAN MODELS ON EXTRACTIVE MULTI-SPAN QUESTION-ANSWERING
EVALUATION OF SINGLE-SPAN MODELS ON EXTRACTIVE MULTI-SPAN QUESTION-ANSWERING
 
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniquesImproving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
 
Finding Bad Code Smells with Neural Network Models
Finding Bad Code Smells with Neural Network Models Finding Bad Code Smells with Neural Network Models
Finding Bad Code Smells with Neural Network Models
 
Crocodile physics2
Crocodile physics2Crocodile physics2
Crocodile physics2
 
IntelliSemantc - Second generation semantic technologies for patents
IntelliSemantc - Second generation semantic technologies for patentsIntelliSemantc - Second generation semantic technologies for patents
IntelliSemantc - Second generation semantic technologies for patents
 

Andere mochten auch

Andere mochten auch (7)

Organi-Deviance Part I
Organi-Deviance Part IOrgani-Deviance Part I
Organi-Deviance Part I
 
Summary Of Dissertation Presentation
Summary Of Dissertation PresentationSummary Of Dissertation Presentation
Summary Of Dissertation Presentation
 
Is A Corporate Criminal Profile Possible
Is A Corporate Criminal Profile PossibleIs A Corporate Criminal Profile Possible
Is A Corporate Criminal Profile Possible
 
我愛上攝影
我愛上攝影我愛上攝影
我愛上攝影
 
STAT Requirement Analysis
STAT Requirement AnalysisSTAT Requirement Analysis
STAT Requirement Analysis
 
Op weg naar de grote wereld
Op weg naar de grote wereldOp weg naar de grote wereld
Op weg naar de grote wereld
 
Stat2 25 09
Stat2 25 09Stat2 25 09
Stat2 25 09
 

Ähnlich wie Stat Tech Reportv1

Integrated Analysis of Traditional Requirements Engineering Process with Agil...
Integrated Analysis of Traditional Requirements Engineering Process with Agil...Integrated Analysis of Traditional Requirements Engineering Process with Agil...
Integrated Analysis of Traditional Requirements Engineering Process with Agil...
zillesubhan
 
MK_MSc_Degree_Project_Report ver 5_updated
MK_MSc_Degree_Project_Report ver 5_updatedMK_MSc_Degree_Project_Report ver 5_updated
MK_MSc_Degree_Project_Report ver 5_updated
Mohammed Ali Khan
 

Ähnlich wie Stat Tech Reportv1 (20)

Exploring the Efficiency of the Program using OOAD Metrics
Exploring the Efficiency of the Program using OOAD MetricsExploring the Efficiency of the Program using OOAD Metrics
Exploring the Efficiency of the Program using OOAD Metrics
 
Automatic Term Recognition with Apache Solr
Automatic Term Recognition with Apache SolrAutomatic Term Recognition with Apache Solr
Automatic Term Recognition with Apache Solr
 
Guia 2-examen-de-ingles
Guia 2-examen-de-inglesGuia 2-examen-de-ingles
Guia 2-examen-de-ingles
 
Vol 1 issue 2 june 2015
Vol 1 issue 2 june 2015Vol 1 issue 2 june 2015
Vol 1 issue 2 june 2015
 
Integrated Analysis of Traditional Requirements Engineering Process with Agil...
Integrated Analysis of Traditional Requirements Engineering Process with Agil...Integrated Analysis of Traditional Requirements Engineering Process with Agil...
Integrated Analysis of Traditional Requirements Engineering Process with Agil...
 
Big data analytics fas trak solution overview
Big data analytics fas trak solution overviewBig data analytics fas trak solution overview
Big data analytics fas trak solution overview
 
Msr2021 tutorial-di penta
Msr2021 tutorial-di pentaMsr2021 tutorial-di penta
Msr2021 tutorial-di penta
 
International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)
 
ESSENSE
ESSENSEESSENSE
ESSENSE
 
Text Summarization and Conversion of Speech to Text
Text Summarization and Conversion of Speech to TextText Summarization and Conversion of Speech to Text
Text Summarization and Conversion of Speech to Text
 
IRJET - Optical Character Recognition and Translation
IRJET -  	  Optical Character Recognition and TranslationIRJET -  	  Optical Character Recognition and Translation
IRJET - Optical Character Recognition and Translation
 
IRJET- Voice to Code Editor using Speech Recognition
IRJET- Voice to Code Editor using Speech RecognitionIRJET- Voice to Code Editor using Speech Recognition
IRJET- Voice to Code Editor using Speech Recognition
 
Unit 1 OOSE
Unit 1 OOSEUnit 1 OOSE
Unit 1 OOSE
 
IET~DAVV STUDY MATERIALS SRS.docx
IET~DAVV STUDY MATERIALS SRS.docxIET~DAVV STUDY MATERIALS SRS.docx
IET~DAVV STUDY MATERIALS SRS.docx
 
A Survey on Knowledge Base: An Internal Platform to Exchange Technical Questi...
A Survey on Knowledge Base: An Internal Platform to Exchange Technical Questi...A Survey on Knowledge Base: An Internal Platform to Exchange Technical Questi...
A Survey on Knowledge Base: An Internal Platform to Exchange Technical Questi...
 
MK_MSc_Degree_Project_Report ver 5_updated
MK_MSc_Degree_Project_Report ver 5_updatedMK_MSc_Degree_Project_Report ver 5_updated
MK_MSc_Degree_Project_Report ver 5_updated
 
IRJET- Factoid Question and Answering System
IRJET-  	  Factoid Question and Answering SystemIRJET-  	  Factoid Question and Answering System
IRJET- Factoid Question and Answering System
 
IRJET - Mobile Chatbot for Information Search
 IRJET - Mobile Chatbot for Information Search IRJET - Mobile Chatbot for Information Search
IRJET - Mobile Chatbot for Information Search
 
Final_version_SAI_ST_projectenboekje_2015
Final_version_SAI_ST_projectenboekje_2015Final_version_SAI_ST_projectenboekje_2015
Final_version_SAI_ST_projectenboekje_2015
 
Sd Revision
Sd RevisionSd Revision
Sd Revision
 

Kürzlich hochgeladen

EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
Earley Information Science
 

Kürzlich hochgeladen (20)

Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 

Stat Tech Reportv1

  • 1. STAT Technical Report Version 0.1 by the Stat Team Mehrbod Sharifi Jing Yang The Stat Project, guided by Professor Eric Nyberg and Anthony Tomasic March. 5, 2009
  • 2. Chapter 1 Introduction to STAT In this chapter, we give an brief introduction to the Stat project to audience reading this document. We explain the background, the motivation, the scope, and the stakeholders of this project so that audience can understand why we are doing so, what we are going to do, and who may be interested in our project. 1.1 Overview Stat is an open source machine learning framework in Java for text analysis with focus on semi- supervised learning algorithms. Its main goal is to facilitate common textual data analysis tasks for researcher and engineers, so that they can get their works done straightforwardly and efficiently. Applying machine learning approaches to extract information and uncover patterns from tex- tual data has become extremely popular in recent years. Accordingly, many software have been developed to enable people to utilize machine learning for text analytics and automate such pro- cess. Users, however, find many of these existing software difficult to use, even if they just want to carry out a simple experiment; they have to spend much time learning those software and may finally find out they still need to write their own programs to preprocess data to get their target software running. We notice this situation and observe that many of these can be simplified. A new software framework should be developed to ease the process of doing text analytics; we believe researchers or engineering using our framework for textual data analysis would feel the process convenient, conformable, and probably, enjoyable. 1.2 Purpose Existing software with regard to using machine learning for linguistic analysis have tremendously helped researchers and engineers make new discoveries based on textual data, which is unarguably one of the most form of data in the real world. As a result, many more researchers, engineers, and possibly students are increasingly interested in using machine learning approaches in their text analytics. 
However, the bar for entering this area is not low. Those people, some of which even being experienced users, find existing software packages are not generally easy to learn and convenient to use. 1
  • 3. For example, although Weka has a comprehensive suite of machine learning algorithms, it is not designed for text analysis, lacking of naturally supported capabilities for linguistic concepts representation and processing. MinorThird, on the other hand, though designed specifically as a package for text analysis, turns out to be rather complicated and difficult to learn. It also does not support semi-supervised and unsupervised learning, which are becoming increasingly important machine learning approaches. Another problem for many existing packages is that they often adopt their own specific input and output format. Real-world textual data, however, are generally in other formats that are not readily understood by those packages. Researchers and engineers who want to make use of those packages often find themselves spending much time seeking or writing ad hoc format conversion code. These ad hoc code, which could have been reusable, are often written over and over again by different users. Researchers and engineers, when presented common text analysis tasks, usually want a text- specific, lightweight, reusable, understandable, and easy-to-learn package that help them get their works done efficiently and straightforwardly. Stat is designed to meet their requirements. Moti- vated by the needs of users who want to simplify their work and experiment related to textual data learning, we initiate the Stat project, dedicating to provide them suitable toolkits to facilitate their analytics task on textual data. In a nutshell, Stat is an open source framework aimed at providing researchers and en- gineers with a integrated set of simplified, reusable, and convenient toolkits for textual data analysis. Based on this framework, researchers can carry out their machine learning experiments on textual data conveniently and comfortably, and engineers can build their own small applications for text analytics straightforwardly and efficiently. 
1.3 Scope This project involves developing a simplified and reusable framework (a collection of foundation classes) in Java that provides basic and common capabilities for people to easily perform machine learning analysis on various kind of textual data. The previous section may give an impression for an impossible task. In this section, we clearly state what is and is not included in this project. The main deliverable for this project is a set of specifications, which defines a simplified frame- work for text analysis based on NLP and machine learning. We explain how succinctly the frame- work should be used and how easily it can be extended. We also provide introductory implementations of the framework, including tools and packages serving foundation classes of the framework. They are • Dataset and framework object adaptors: A set of classes that will allow reading and writing files in various formats, supporting importing and exporting dataset as well as loading and saving framework objects. • Linguistic and machine learning packages wrappers: A set of classes that integrate existing tools for NLP and Machine Learning and can be used within the framework. These wrappers hides the implementation and variation details of these packages to provide a set of simplified and unified interfaces to framework users. 2
  • 4. • Semi-Supervised algorithms: Implementation of certain Semi-Supervised learning algo- rithms that are not available from the existing packages. Finally, we provide a set documents that reflect our development process and give guidance to users of how to use our framework. These documents are • Technical report: An report summarizing major artifacts we have and documenting vision, goals, motivation, and decisions. It also includes result of requirements and design phase, and results of final evaluation. This report gives overall comprehension to whom want to understand our package as well as software development process well. • Executive summary: An summary that gives brief introduction to users of our framework, explaining benefits that this framework can bring to them in text analysis. • JavaDocs, tutorials, and examples: JavaDocs about APIs specifications extracted from comments in the code. Tutorials and concretes examples are also provided to ease the process learning this framework. 3
  • 5. 1.4 Stakeholders Below is the list of stakeholder and how this project will affect them: • Researchers, particularly in language technology but also in other fields, would be able to save time by focusing on their experiments instead of dealing with various input/output format which is routinely necessary in text processing. They can also easily switch between various tools available and even contribute to STAT so that others can save time by using their adaptors and algorithms. • Software engineers, who are not familiar with the machine learning can start using the package in their program with a very short learning phase. STAT can help them develop clear concepts of machine learning quickly. They can build their applications using functionality provided STAT easily and achieve high level performance. • Developers of learning package, can provide plug-ins for STAT to allow ease of integration of their package. They can also delegate some of the interoperability needs through this program (some of which may be more time consuming to be addressed within their own package). • Beginners to text processing and mining, who want fundamental and easy to learn capabilities involving discovering patterns from text. They will be benefited from this project by saving their time, facilitating their learning process, and sparking their interests to the area of language technology. 4
  • 6. Chapter 2 Survey Analysis This project was faced with many challenges from the beginning. There are many question, some of subjective nature, that really needs to be addresses by our target audience. For this reason, we designed a survey to obtain a better understanding and provide a more suitable solution to this problem. In this chapter, we explain the process of designing the survey, collecting information and some analysis of the collected data. 2.1 Designing the Survey The primary goals of doing a survey was the following: • Understanding the potential users of the package: their programming habit, problem solving strategies, experience in various area and tools, etc. • Setting priority for which criteria to focus on for our design and implementation The survey needed to be short and question to be very specific to get better responses. The maximum number of question was set at 10 questions. Several draft of the questions was reviewing within the STAT group and the software engineering class students and instructors several times until finalize. We also obtained and incorporate some advices from other departments. The final survey was designed on the SurveyMonkey.com. 2.2 Distribution The target users of STAT are two main groups with different needs: researchers and industry programmer. The survey contains questions to distinguish there two group but the final framework should address the needs from both groups. After conducting a test run with this the STAT group and the class, we sent the survey out to the Language Technology Institute student mailing list (representing researchers) and also to student in iLab (Prof, Ramayya Krishnan, Heinz School of Business) representing industry programmers. 2.3 Analysis of Results As of 2/25/09, we have received 23 responses and they are individually reviewed by STAT members and also in aggregate. Below we summarized the finding of the survey result and some charts: 5
  • 7. • While many different programming language are used (Python, R, C++) but over 90% mentioned Java as one of the languages (25% consider themselves expert). The programming hours range from 2-60. • Users don’t seem to distinguish much between industry and research applications and this is perhaps more research for the different to be transparent. • Most users are not familiar with Operation Research but everyone is somewhat familiar with Machine Learning (if not specifically text classification or data mining). • Data type expectedly were mostly textual (plain, XML, HTML, etc. as opposed to Excel, though it was mentioned) and sources were files, databases and web. • Over 50% chose: ”I write a program to preprocess data and then use an external machine learning package.” • Easy of API use, Performance and Extensibility were the top three choice in design but in addition to those in textual descriptions user pointed out mostly problems with input and output formats. Figure 2.1: Distribution of familiarity with packages 6
  • 8. Figure 2.2: Distribution of design preference 7
  • 9. Chapter 3 Analysis of Related Packages In this chapter, we analyze a few main competitors of our projects. We focus on two academic toolkits – Weka and MinorThird. We comment on their strengths and explore their limitations, and discuss why and how we can do better than these competitors. 3.1 Weka Weka is a comprehensive collection of machine learning algorithms for solving data mining problems in Java and open sourced under the GPL. 3.1.1 Strengths of Weka Weka is a very popular software for machine learning, due to the its main strengths: • Provide comprehensive machine learning algorithms. Weka supports most current machine learning approaches for classification, clustering, regression, and association rules. • Cover most aspects for performing a full data mining process. In addition to learn- ing, Weka supports common data preprocessing methods, feature selection, and visualization. • Freely available. Weka is open source released under GNU General Public License. • Cross-platform. Weka is cross-platform fully implemented in Java. Because of its supports of comprehensive machine learning algorithm, Weka is often used for analytics in many form of data, including textual data. 3.1.2 Limitations of using Weka for text analysis However, Weka is not designed specifically for textual data analysis. The most critical drawback of using Weka for processing text is that Weka does not provide “built-in” constructs for natural representation of linguistics concepts1 . Users interested in using Weka for text analysis often find themselves need to write some ad-hoc programs for text preprocessing and conversion to Weka representation. • Not good at understanding various text format. Weka is good at understanding its standard .arff format, which is however not a convenient way of representation text. 
Users have to worry about how can they convert textual data in various original format such as 1 Though there are classes in Weka supporting basic natural language processing, they are viewed as auxiliary utilities. They make performing basic textual data processing using Weka possible, but not conveniently and straight- forwardly 8
  • 10. raw plain text, XML, HTML, CSV, Excel, PDF, MS Word, Open Office document, etc. to be understandable by Weka. As a result, they need to spend time seeking or writing external tools to complete this task before performing their actual analysis. • Unnecessary data type conversion. Weka is superior in processing nominal (aka, categor- ical) and numerical type attributes, but not string type. In Weka, non-numerical attributes are by default imported as nominal attributes, which usually is not a desirable type for text (imagine treating different chunks of text as different values of a categorical attribute). One have to explicitly use filters to do a conversion, which could have been done automatically if it knows you are importing text. • Lack of specialized supported for linguistics preprocessing. Linguistics preprocessing is a very important aspect of textual data analysis but not a concern of Weka. Weka does not (at least, not dedicated to) take care this issue very seriously for users. Weka has a StringToWordVector class that performs all-in-one basic linguistics preprocessing, including tokenization, stemming, stopword removal, tf-idf transformation, etc. However, it is less flexible and lack of other techniques (such as part-of-speech tagging and n-gram processing) for users who want fined grain and advanced linguistics controls. • Unnatural representation of textual data learning concepts. Weka is designed for general purpose machine learning tasks so have to protect too many variations. As a results, domain concepts in Weka are abstract and high-level, package hierarchy is deep, and the number of classes explodes. For example, we have to use Instance rather than Document and Instances rather than Corpus. Concepts in Weka such as Attribute is obscure in meaning for text processing. First adding many Attribute to a cryptic FastVector which then passed to a Instances in order to construct a dataset appears very awkward to users processing text. 
Categorize filters first according to attribute/instance then supervised /unsupervised make non-expert users feel confusing and hard to find their right filters. Many users may feel unconformable programmatically using Weka to carry out their experiments related to text. In summary, for users who want enjoyable experience at performing text analysis, they need built-in capabilities to naturally support representing and processing text. They need specialized and convenient tools that can help them finish most common text analysis tasks straightforwardly and efficiently. This cannot be done by Weka due to its general-purpose nature, despite its com- prehensive tools. Figure 3.1 shows the domain model we extracted for Weka for basic text analysis. 9
  • 11. Partial UML Domain Model of Weka (Preliminary) evaluate 1 1 Evaluation Classifier 1 1 1 built-from classify evaluate-on 1 1 1 1 tranform-attribute contain StringToWordVector Instances Instance 1 1 attributeValues * 1 contain * Attribute NominalToString transform-type 1 1 possibleValues Note: when you see ClassA quot;containsquot; a number of ClassB, it is probably that Weka implements it as ClassA maintains a quot;FastVectorquot; whose elements are instances of ClassB. Figure 3.1: Partial domain model for Weka for basic text analysis 3.2 MinorThird Figure 3.2 shows the domain model we extracted for MinorThird for basic text analysis. 10
  • 12. Figure 3.2: Partial domain model for MinorThird for basic text analysis 11
  • 13. Chapter 4 Requirements specifications Here we first explain in detail the major features of our framework. • Simplified. APIs are clear, consistent, and straightforward. Users with reasonable Java programming knowledge can learn our package without much efforts, understand its logical flow quickly, be able to get started within a small amount of time, and finish the most common tasks with a few lines of code. Since our framework is not designed for general purposes and for including comprehensive features, there are space for us to simplify the APIs to optimize for those most typical and frequent operations. • Extensible and Reusable. Built-in modular supports are provided the core routines across various phases in text analysis, including text format transformation, linguistic processing, machine learning, and experimental evaluation. Additional functionalities can be extended on top of the core framework easily and user-defined specifications are pluggable. Existing code can be used cross environment and interoperate with external related packages, such as Weka, MinorThird, and OpenNLP. • High performance. Performance in terms of speed of algorithms we wrap and we imple- ment should acceptable for typical experiment and dataset. Specifically, there should not be significant degrade in performance in using our framework with capabilities provided by ex- ternal packages; performance of algorithm we implemented should be should not be degrade from its best complexity much because of our implementation flaws. 4.1 Functional Requirements In this section, we define most common use cases of our framework and address them in the degree of detail of casual use case. The “functional requirements” of this project are that the users can use libraries provided by our framework to complete these use cases more easily and comfortably than not use. 
Actors

Since our framework assumes that all users of interest program against our APIs, there is only one human actor role, namely the programmer. This human actor is always the primary actor. Depending on which specific use case the primary actor is performing, there are possible secondary and system actors, namely the external packages our framework integrates.
Fully-dressed Use Cases

Use Case UC1: Document Classification Experiment

Scope: Text analysis application using the STAT framework
Level: User goal
Primary Actor: Researcher

Stakeholders and Interests:

• Researcher: wants to test and evaluate a classification algorithm (supervised, semi-supervised, or unsupervised) by applying it to a (probably well-known) corpus; the task needs to be done efficiently with easy and straightforward coding

Preconditions:

• The STAT framework is correctly installed and configured
• The corpus is placed on a source readable by the STAT framework

Postconditions:

• A model is trained and the test documents in the corpus are classified. Evaluation results are displayed

Main Success Scenario:

1. Researcher imports the corpus from its source into memory. Specifically, the system reads data from the source, parses the raw format, extracts information according to the schema, and constructs an in-memory object to store the corpus
2. Researcher performs preprocessing on the corpus. Specifically, for each document, the researcher tokenizes the text, removes stopwords, performs stemming on the tokens, performs filtering, and/or other potential preprocessing on the body text and metadata
3. Researcher converts the corpus into the feature vectors needed for machine learning. The feature vectors are created by analyzing the documents in the corpus, deriving or filtering features, adding or removing documents, sampling documents, handling missing entries, normalizing features, selecting features, and/or other potential processing
4. Researcher splits the processed corpus into a training set and a testing set
5. Researcher chooses a machine learning algorithm, sets its parameters, and uses it to train a model from the training set
6. Researcher classifies the documents in the test set based on the trained model
7. Researcher evaluates the classification based on the classification results obtained on the test set and its true labels.
Classification is evaluated mainly on classification accuracy and classification time, or, if the algorithm is unsupervised, on other unsupervised metrics such as the Adjusted Rand Index.
8. Researcher displays the final evaluation results
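To make the unsupervised metric mentioned in step 7 concrete, the Adjusted Rand Index compares two labelings of the same documents via their contingency table. The following is a minimal self-contained sketch of the computation; the class name `AdjustedRandIndex` is illustrative and is not part of the STAT API, whose interfaces are still being designed:

```java
import java.util.HashMap;
import java.util.Map;

public class AdjustedRandIndex {
    /** n-choose-2, returned as a double for the final ratio. */
    static double comb2(long n) { return n * (n - 1) / 2.0; }

    /** Adjusted Rand Index between two labelings of the same items. */
    public static double ari(int[] trueLabels, int[] predLabels) {
        int n = trueLabels.length;
        Map<String, Long> cell = new HashMap<>();   // contingency table counts
        Map<Integer, Long> rows = new HashMap<>();  // true-label marginals
        Map<Integer, Long> cols = new HashMap<>();  // predicted-label marginals
        for (int i = 0; i < n; i++) {
            cell.merge(trueLabels[i] + "," + predLabels[i], 1L, Long::sum);
            rows.merge(trueLabels[i], 1L, Long::sum);
            cols.merge(predLabels[i], 1L, Long::sum);
        }
        double index = 0, rowSum = 0, colSum = 0;
        for (long c : cell.values()) index += comb2(c);
        for (long a : rows.values()) rowSum += comb2(a);
        for (long b : cols.values()) colSum += comb2(b);
        double expected = rowSum * colSum / comb2(n); // expectation under chance
        double max = (rowSum + colSum) / 2.0;
        return (index - expected) / (max - expected);
    }
}
```

Two labelings that induce the same partition score 1.0 regardless of how the cluster ids are named, e.g. `ari(new int[]{0,0,1,1}, new int[]{1,1,0,0})` returns 1.0.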
Use Case UC1: Document Classification Experiment (cont.)

Extensions:

1a. The framework is unable to find the specified source.
    1. Throw a source-not-found exception
1b. Researcher loads a previously saved corpus in native format from a file on disk directly into a memory object, so the researcher does not handle the source, format, or schema explicitly.
    1a. File not found:
        1. Throw a file-not-found exception
    1b. Malformed native format:
        1. Throw a malformed-native-format exception
4a. Researcher specifies a parameter k larger than the number of documents or smaller than 1
    1. Throw an invalid-argument exception
1-3, 5a. After finishing each step, the researcher saves the in-memory objects of the corpus representation, at its different levels of processing, to disk in native format, which can be loaded back later.
1-3, 5b. Researcher exports the in-memory objects of the corpus representation, at its different levels of processing, to disk in external formats (e.g., Weka ARFF, CSV) which can be processed by external software.
6a. Researcher saves the in-memory model object to disk, which can be loaded back later.
6b. Researcher loads a previously saved model in native format from a file on disk directly into a memory object.
    1a. File not found:
        1. Throw a file-not-found exception
    1b. Malformed native format:
        1. Throw a malformed-native-format exception
4-8b. To perform k-fold cross validation, the corpus is split into k parts in step 4, and steps 5-8 are repeated k times, using each split in turn as the testing split and the rest as training. Researcher combines the evaluations on the different test sets obtained in the previous steps and forms a final classification evaluation
6c. Unsupported learning parameters (the learning algorithm cannot handle the combination of parameters the researcher specifies)
    1. Throw an unsupported-learning-parameters exception
6d.
Unsupported learning capability (the learning algorithm cannot handle the format and data in the training set, potentially caused by an unsupported feature type, class type, missing values, etc.)
    1. Identify the exception cause(s)
    2. Throw the corresponding exception(s)
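The splitting step of the k-fold cross-validation described in extension 4-8b can be sketched as follows. This is a standalone illustration, not the STAT API (the class name `KFoldSplit` is hypothetical); it splits by document index rather than by copying documents, which is one way to avoid creating extra objects. It also throws the invalid-argument exception from extension 4a when k is out of range:

```java
import java.util.ArrayList;
import java.util.List;

public class KFoldSplit {
    /**
     * Partitions the indices 0..n-1 into k folds of near-equal size.
     * Each fold is just a list of positions into the corpus, so no
     * document objects are duplicated.
     */
    public static List<List<Integer>> folds(int n, int k) {
        if (k < 1 || k > n)
            throw new IllegalArgumentException("k must be between 1 and n");
        List<List<Integer>> folds = new ArrayList<>();
        for (int f = 0; f < k; f++) folds.add(new ArrayList<>());
        for (int i = 0; i < n; i++) folds.get(i % k).add(i); // round-robin
        return folds;
    }
}
```

Steps 5-8 would then be repeated k times, treating `folds.get(f)` as the test indices and the remaining folds as training. In practice the corpus would typically be shuffled before splitting; that is omitted here for brevity.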
Use Case UC1: Document Classification Experiment (cont.)

8a. Incompatibility between the test set and the classification (potentially caused by a difference in schema between the training set and the test set)
    1. Throw an incompatible-evaluation exception
10a. The researcher customizes the display instead of using the default display format.
    1. The researcher obtains specific fields of the evaluations via the interfaces provided
    2. The researcher constructs a customized format using the fields he/she extracts
    3. The researcher displays the customized format and/or writes it to a destination

Special Requirements:

• Pluggable preprocessors in steps 2-3
• Pluggable learning algorithm in step 6
• The learning algorithm should be scalable enough to deal with large corpora
• Researcher should be able to visualize results after various steps to trace the state of different objects (e.g., preprocessed corpus, models, classifications, evaluations)
• Researcher should be able to customize the visualization output

Open Issues:

• How to address the variation issues in reading different sources
• How (in what form) to let the researcher specify parameters for different learning algorithms
• What specifically needs to be exportable, persistable, and visualizable?
• How to implement the corpus splitting in an efficient way (without creating extra objects)
• How to deal with the performance issues of storing large corpora in memory
• How to represent the dataset internally in an efficient data structure

4.2 Non-functional Requirements

• Open source. It should be made available for public collaboration, allowing users to use, change, improve, and redistribute the software.
• Portability. It should install, configure, and run consistently across different platforms, given its design and implementation on the Java runtime environment.
• Performance. It should not be the bottleneck in terms of machine learning analysis.
That is, when wrapping other existing packages, no significant reduction in performance should be observed. For algorithms we implement ourselves, our code should achieve the best complexity the algorithm allows.
• Documentation. Its code should be readable, self-explanatory, and documented clearly and unambiguously for critical or tricky parts. It should include an introductory guide for users
to get started and, preferably, provide sample datasets, tutorials, and demos so that users can run examples out of the box.
4.3 Domain model

In this section, we present the domain model diagram and some explanations of it. A lot of time was spent on this domain model, and it has evolved into a relatively stable one that will guide our design in iteration I.

4.3.1 Domain model diagram

Figure 4.1 shows the domain model diagram of the STAT project (for the first iteration). This domain model is intended to give a top-level understanding of the concepts in the project rather than a comprehensive one that includes every detail. A number of concepts, such as "Annotation", "Label", "ProbabilityDistribution", "DistanceMetric", and "Partition" (or maybe "Split"), are not shown in this diagram.

Figure 4.1: Partial domain model of STAT for basic text analysis

Note that the diagram also lacks some top-level concepts related to unsupervised learning. These are topics of the milestone requirements and design iteration II. A new domain model incorporating these concepts will be proposed at that phase. For now, we focus on this domain model and give detailed clarifications of what the current concepts are.

4.3.2 Domain concepts clarifications

Mehrbod: Revise the definitions of these objects here.

• CorpusReader. A CorpusReader reads text from a source into a Corpus. No content transformation is done, and everything (labels, metadata, body, etc.) stays in text format.
• Corpus. A Corpus is a set of Documents in text format.
• Annotator. An Annotator transforms a Corpus into another Corpus by adding annotations.
• FeatureExtractor. A FeatureExtractor transforms a Corpus into a Dataset by converting Documents in text format into Instances in feature representation.
• Dataset. A Dataset is a set of Instances, which are feature representations of Document text.
• Learner. A Learner learns a Model from a Dataset.
• Model. A Model is a collection of parameters learned from a Dataset by a Learner.
• Classifier. A Classifier uses the Model to assign classes to the Instances in a Dataset and produces a Classification with respect to that Dataset.
• Classification. A Classification contains the classification results and descriptive information about the classification process, e.g., which model and classifier were used to produce the classification results.
• ClassificationEvaluator. A ClassificationEvaluator computes the evaluation metrics for a Classification and produces a ClassificationEvaluation.
• ClassificationEvaluation. A ClassificationEvaluation contains the evaluation results and descriptive information about the classification and evaluation process.
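To make the FeatureExtractor and Dataset concepts above concrete, here is a minimal self-contained sketch of a bag-of-words extractor that turns raw document text into term-frequency vectors. The class name, method names, and tokenization scheme are illustrative only; they are not the actual STAT interfaces, which are still being designed:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class BagOfWordsExtractor {
    // Vocabulary: token -> feature index, in first-seen order.
    private final Map<String, Integer> vocab = new LinkedHashMap<>();

    /** Builds the vocabulary from a corpus of raw documents. */
    public void fit(List<String> corpus) {
        for (String doc : corpus)
            for (String tok : tokenize(doc))
                vocab.putIfAbsent(tok, vocab.size());
    }

    /** Converts one document into a term-frequency feature vector. */
    public int[] transform(String doc) {
        int[] vector = new int[vocab.size()];
        for (String tok : tokenize(doc)) {
            Integer idx = vocab.get(tok);
            if (idx != null) vector[idx]++; // tokens outside the vocabulary are ignored
        }
        return vector;
    }

    /** Lowercases and splits on runs of non-letter characters. */
    private static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z]+"))
            if (!t.isEmpty()) tokens.add(t);
        return tokens;
    }

    public int vocabularySize() { return vocab.size(); }
}
```

In terms of the domain model, `fit` plus `transform` plays the role of the FeatureExtractor, and the resulting collection of vectors corresponds to a Dataset of Instances. A real extractor would also plug in the stopword removal and stemming performed by the Annotator stage.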