MUD 2010
Workshop on Mining Unstructured Data

Nicolas Bettenburg, Bram Adams
Software Analysis & Intelligence Lab
http://sailhome.cs.queensu.ca/mud/
Unstructured Data?

EXAMPLE OF STRUCTURED DATA
<bug>
  <bug_id>45411</bug_id>
  <creation_ts>2000-07-13 13:46:00 -0700</creation_ts>
  <short_desc>Drag, hover over tab should open tab</short_desc>
  <delta_ts>2009-12-04 13:03:48 -0800</delta_ts>
  <reporter_accessible>1</reporter_accessible>
  <cclist_accessible>1</cclist_accessible>
  <classification_id>2</classification_id>
  <classification>Client Software</classification>
  <product>SeaMonkey</product>
  <component>Tabbed Browser</component>
  <version>Trunk</version>
  <rep_platform>All</rep_platform>
  <op_sys>All</op_sys>
  <bug_status>RESOLVED</bug_status>
  <resolution>WONTFIX</resolution>
  <priority>--</priority>
  <bug_severity>enhancement</bug_severity>
  <target_milestone>---</target_milestone>
  <blocked>121292</blocked>
  ...
</bug>
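
Because every field in such a record sits in a named element, a program can read it without any text mining. The following is a minimal sketch of that idea, assuming Python's standard xml.etree.ElementTree module; the helper name bug_fields and the abridged BUG_XML string are illustrative and not part of the original slides.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the bug record shown on the slide; the fields the
# slide elides with "..." are left out rather than guessed.
BUG_XML = """\
<bug>
  <bug_id>45411</bug_id>
  <short_desc>Drag, hover over tab should open tab</short_desc>
  <product>SeaMonkey</product>
  <component>Tabbed Browser</component>
  <bug_status>RESOLVED</bug_status>
  <resolution>WONTFIX</resolution>
  <bug_severity>enhancement</bug_severity>
</bug>
"""


def bug_fields(xml_text):
    """Return a dict mapping each child tag of <bug> to its text."""
    root = ET.fromstring(xml_text)
    return {child.tag: (child.text or "").strip() for child in root}


fields = bug_fields(BUG_XML)
print(fields["product"], fields["bug_status"], fields["resolution"])
# prints: SeaMonkey RESOLVED WONTFIX
```

Each value is addressed by its tag name, which is exactly what the unstructured examples on the following slides lack.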
So What?
EXAMPLES OF UNSTRUCTURED DATA

web-sites, diagrams, requirements documents,
social media, documentation, help files, IRC chat,
source code comments, bug reports, captchas,
commit logs, email, system logs

SE data without explicit format

COMPLEXITY   DIVERSITY   IMPERFECTION

Unstructured Data is
COMPLEX ...

S10000: The SQLite library shall translate high-level SQL statements
into low-level I/O calls to persistent storage.

The essential task of every SQL database engine is to translate
human-readable SQL statements into sequences of I/O operations.

Bonjour,
ces deux problèmes sont reliés. En effet, les paquets Ubuntu ne
comportent pas les dépendances (e.g. libpng, libjpeg, libglew, ...).
Si Tulip ne peut afficher les fichiers PNG, c'est sans doute car le
paquet libpng est manquant sur le système. Nous travaillons à ajouter
les dépendances sur les paquets, mais ceci n'arrivera probablement
pas avant Tulip 3.5.
Cordialement,
Charles.

[English: Hello, these two problems are related. Indeed, the Ubuntu
packages do not include the dependencies (e.g. libpng, libjpeg,
libglew, ...). If Tulip cannot display PNG files, it is probably
because the libpng package is missing from the system. We are working
on adding the dependencies to the packages, but this will probably
not happen before Tulip 3.5. Regards, Charles.]

natural language
rich semantics
no authoritative formats

... AND DIVERSE
Eclipse #150222

In this report, you have defined a parameter named blocksize,
which is given a value of "7|D|1|D". In open script of data set,
there are below lines code:

<script begin>
token=Packages.java.util.StringTokenizer(params["blocksize"],"|");
vec=new Packages.java.util.Vector();
while(token.hasMoreTokens()){
   vec.addElement(token.nextToken());
}
params["DateRange"]=java.lang.Integer.parseInt(vec.elementAt(0));
</script end>

Since the value of params["blocksize"] is "7|D|1|D", vec.elementAt(0)
is "7", and then it can not be parsed to int value. In 1.0.1,
the value of params["blocksize"] might be 7|D|1|D, so it can be
parsed to int value of 7.

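
A report like this one interleaves prose and source code with no markup separating the two. As a rough sketch of what telling them apart involves, the Python snippet below sorts lines into prose and code with a deliberately crude regular expression. The pattern CODE_HINT and the function split_report are hypothetical names chosen for illustration; this is not the method proposed by the workshop or its organizers.

```python
import re

# Hypothetical heuristic: a line "looks like code" if it ends in ; { or },
# or contains a call or index expression such as foo( or params[ .
CODE_HINT = re.compile(r"[;{}]\s*$|\w+\(|\w\[")


def split_report(text):
    """Partition the non-blank lines of text into (prose, code) lists."""
    prose, code = [], []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        (code if CODE_HINT.search(stripped) else prose).append(stripped)
    return prose, code


# A few lines taken from the bug report quoted on the slide.
REPORT = '''In this report, you have defined a parameter named blocksize,
which is given a value of "7|D|1|D".
token=Packages.java.util.StringTokenizer(params["blocksize"],"|");
while(token.hasMoreTokens()){
}
parsed to int value of 7.'''

prose, code = split_report(REPORT)
print(len(prose), "prose lines,", len(code), "code lines")
```

The heuristic is easy to fool: a sentence from the same report that quotes params["blocksize"] or vec.elementAt(0) in running text would be counted as code, which illustrates why mixed content of this kind is hard to mine.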
... AND IMPERFECT

From: john.doe@gmail.com
To: devlist@sourceforge.net
Subject: BSOD WTF!!??!!

Hi devs,

found a bug in JDBC-RPC
very badass lol. OMG can’t
believe you missed that. I
get a bsod after incorrect
pw, pls fix :'(

JD $$$

inconsistency
ambiguity
informal language

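
Even a message this informal carries a small structured part: its headers. The sketch below, assuming Python's standard email package, separates the header fields from the free-text body. The RAW string is an abridged copy of the email on the slide as it reads once the scattered fragments are joined, so its exact wording is an assumption.

```python
from email import message_from_string

# Abridged copy of the example email. The header block must start on the
# first line, so the opening quotes are followed by a line continuation.
RAW = """\
From: john.doe@gmail.com
To: devlist@sourceforge.net
Subject: BSOD WTF!!??!!

Hi devs,
found a bug in JDBC-RPC
get a bsod after incorrect
pw, pls fix :'(
"""

msg = message_from_string(RAW)

# Headers are addressable by name, like the fields of the XML bug record.
headers = {name: msg[name] for name in ("From", "To", "Subject")}

# The body is returned as one uninterpreted string.
body = msg.get_payload()

print(headers["Subject"])
print(len(body.splitlines()), "body lines of free text")
```

The parser recovers sender, recipient and subject reliably, but says nothing about what "bsod" or "pw" mean; the inconsistency, ambiguity and informal language the slide points to all live in the body.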
So What?
EXAMPLES OF UNSTRUCTURED DATA

web-sites, diagrams, requirements documents,
social media, documentation, help files, IRC chat,
source code comments, bug reports, captchas,
commit logs, email, system logs
