Augustus Overview Open Source Analytics

Introduction to Augustus OVERVIEW Open Data Group September 17, 2009

Website and Community Augustus is an open source scoring engine for statistical and data mining models based on the Predictive Model Markup Language (PMML). It is written in Python and is freely available. http://augustus.googlecode.com

Getting Augustus

Source

Documentation and Community

Using Augustus

Development and Use Cycle

Development and Use Cycle
1. Data inputs
2. Model schema

Running Augustus
3. Obtain new model with Producer
4. Score with Consumer

Work Flows

Components

Producers and Consumers

Post Processing

Segments

Case Study: Auto

Auto: Weighted Batch
Using the Baseline for Training:
    $ cd WeightedBatch
    `-- scripts
        |-- consume.py
        |-- postprocess.py
        `-- produce.py
http://code.google.com/p/augustus/source/browse/#svn/trunk/examples/auto/WeightedBatch

Input for the Producer
The Producer takes the training data set. In the code, we have declared how we want to test the data:
    import augustus.modellib.baseline.producer.Producer as Producer
    def makeConfigs(inFile, outFile, inPMML, outPMML):
        # open data file
        inf = uni.UniTable().fromfile(inFile)
        # start the configuration file
        test = ET.SubElement(root, "test")
        test.set("field", "Automaker")
        test.set("weightField", "Count")
        test.set("testStatistic", "dDist")
        test.set("testType", "threshold")
        test.set("threshold", "0.475")

Input for the Producer (Continued)
        # use a discrete distribution model for test
        baseline = ET.SubElement(test, "baseline")
        baseline.set("dist", "discrete")
        baseline.set("file", str(inFile))
        baseline.set("type", "UniTable")
        # create the segmentation declarations for the two fields at this level
        '''
        Taken out for the example, other Use Cases will focus on Segments
        segmentation = ET.SubElement(test, "segmentation")
        makeSegment(inf, segmentation, "Color")
        '''
        # output the configuration file
        tree = ET.ElementTree(root)
        tree.write(outFile)
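The configuration step above can be tried without Augustus installed. The following is a minimal sketch, assuming that `ET` on the slide refers to the standard library's `xml.etree.ElementTree`. The helper name `build_test_config` and the root tag `config_root` are hypothetical placeholders, because the slide does not show how `root` is created. The sketch targets Python 3 rather than the Python 2.5 used in the slides.

```python
import xml.etree.ElementTree as ET


def build_test_config(training_file, root_tag="config_root"):
    """Build the <test> declaration from the slides under a placeholder root.

    root_tag is a hypothetical name: the slides do not show the real root element.
    """
    root = ET.Element(root_tag)

    # Test declaration, using the attribute values shown on the slide.
    test = ET.SubElement(root, "test")
    test.set("field", "Automaker")
    test.set("weightField", "Count")
    test.set("testStatistic", "dDist")
    test.set("testType", "threshold")
    test.set("threshold", "0.475")

    # Baseline: a discrete distribution built from the training file.
    baseline = ET.SubElement(test, "baseline")
    baseline.set("dist", "discrete")
    baseline.set("file", str(training_file))
    baseline.set("type", "UniTable")

    return ET.ElementTree(root)


# "wtraining.nab" is the training file name used in the slides.
tree = build_test_config("wtraining.nab")
config_text = ET.tostring(tree.getroot(), encoding="unicode")
print(config_text)
```

Calling `tree.write(outFile)` on the returned object would save the configuration to disk, as the slide does.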

Running the Producer (Training)
    $ cd scripts
    $ python2.5 produce.py -f wtraining.nab -t20
    (0.000 secs) Beginning timing
    (0.000 secs) Creating configuration file
    (0.001 secs) Creating input PMML file
    (0.001 secs) Starting producer
    (0.000 secs) Inputting configurations
    (0.001 secs) Inputting model
    (0.008 secs) Collecting stats for baseline distribution
    (0.011 secs) Events 20.067% processed
    (0.009 secs) Events 40.134% processed
    (0.009 secs) Events 60.201% processed
    (0.009 secs) Events 80.268% processed
    (0.009 secs) Events 100.000% processed
    (0.000 secs) Making test distributions from statistics
    (0.002 secs) Outputting PMML
    (0.062 secs) Lifetime of timer

Model generated by the Producer
    <PMML version="3.1">
      <Header copyright=" " />
      <DataDictionary>
        <DataField dataType="string" name="Automaker" optype="categorical" />
        <DataField dataType="string" name="Color" optype="categorical" />
        <DataField dataType="float" name="Count" optype="continuous" />
      </DataDictionary>
      <BaselineModel functionName="baseline">
        <MiningSchema>
          <MiningField name="Automaker" />
          <MiningField name="Color" />
          <MiningField name="Count" />
        </MiningSchema>
      </BaselineModel>
    </PMML>
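To check which fields a generated model declares, the PMML skeleton can be read with the Python standard library. This is a minimal sketch that embeds the snippet shown on the slide as a string. It assumes the file carries no XML namespace, as in the slide; a file with a namespace would need qualified tag names. The variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# The PMML skeleton from the slide, embedded so the example is self-contained.
PMML_TEXT = """<PMML version="3.1">
  <Header copyright=" " />
  <DataDictionary>
    <DataField dataType="string" name="Automaker" optype="categorical" />
    <DataField dataType="string" name="Color" optype="categorical" />
    <DataField dataType="float" name="Count" optype="continuous" />
  </DataDictionary>
  <BaselineModel functionName="baseline">
    <MiningSchema>
      <MiningField name="Automaker" />
      <MiningField name="Color" />
      <MiningField name="Count" />
    </MiningSchema>
  </BaselineModel>
</PMML>"""

root = ET.fromstring(PMML_TEXT)

# Map each declared field name to its (dataType, optype) pair.
fields = {
    f.get("name"): (f.get("dataType"), f.get("optype"))
    for f in root.iter("DataField")
}

# Field names used by the model, in document order.
mining_fields = [m.get("name") for m in root.iter("MiningField")]

for name in mining_fields:
    data_type, optype = fields[name]
    print(name, data_type, optype)
```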

Model generated by the Producer (Cont)

Producer Output
The training step used the code in producer.py to generate a model and get expected results. Training generated the following files:
    .
    |-- consumer
    |   `-- wtraining.nab.pmml    MODEL WITH EXPECTED VALUES BASED ON THE TRAINING DATA
    `-- producer
        |-- wtraining.nab.pmml    BASELINE DATA, DATA DICTIONARY, MINING SCHEMA
        `-- wtraining.nab.xml     MODEL FILE USED FOR TRAINING

Training XML

Unitable

Running the Consumer
    $ cd scripts
    $ python2.5 consume.py -b wtraining.nab -f wscoring.nab
    Ready to score
    .
    |-- consumer
    |   |-- wscoring.nab.wtraining.nab.xml
    |   `-- wtraining.nab.pmml
    |-- postprocess
    |   `-- wscoring.nab.wtraining.nab.xml
    `-- producer
        |-- wtraining.nab.pmml
        `-- wtraining.nab.xml
This example generates a report in the postprocess directory.

Consumer (Scoring) output
    $ cat consumer/wscoring.nab.wtraining.nab.xml
    <pmmlDeployment>
      <inputData>
        <readOnce />
        <batchScoring />
        <fromFile name="../data/wscoring.nab" type="UniTable" />
      </inputData>
      <inputModel>
        <fromFile name="../consumer/wtraining.nab.pmml" />
      </inputModel>
      <output>
        <report name="report">
          <toFile name="../postprocess/wscoring.nab.wtraining.nab.xml" />
          <outputRow name="event">
            <score name="score" />
            <alert name="alert" />
            <segments name="segments" />
          </outputRow>
        </report>
      </output>
    </pmmlDeployment>
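The deployment file above ties together three paths: the data to score, the model to use, and the report to write. As a minimal sketch, they can be extracted with the standard library to confirm that a run is wired as intended. The XML is copied from the slide; the variable names are illustrative and are not part of Augustus.

```python
import xml.etree.ElementTree as ET

# Consumer configuration copied from the slide.
CONFIG_TEXT = """<pmmlDeployment>
  <inputData>
    <readOnce />
    <batchScoring />
    <fromFile name="../data/wscoring.nab" type="UniTable" />
  </inputData>
  <inputModel>
    <fromFile name="../consumer/wtraining.nab.pmml" />
  </inputModel>
  <output>
    <report name="report">
      <toFile name="../postprocess/wscoring.nab.wtraining.nab.xml" />
      <outputRow name="event">
        <score name="score" />
        <alert name="alert" />
        <segments name="segments" />
      </outputRow>
    </report>
  </output>
</pmmlDeployment>"""

root = ET.fromstring(CONFIG_TEXT)

# Paths declared in the configuration.
data_file = root.find("inputData/fromFile").get("name")
model_file = root.find("inputModel/fromFile").get("name")
report_file = root.find("output/report/toFile").get("name")

# Elements that each output row of the report will contain.
row_fields = [child.tag for child in root.find("output/report/outputRow")]

print("data:  ", data_file)
print("model: ", model_file)
print("report:", report_file)
print("row fields:", row_fields)
```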

Scoring Report
    $ cat postprocess/wscoring.nab.wtraining.nab.xml
    <report>
      <event>
        <score>0.471458430077</score>
        <alert>True</alert>
        <Segments></Segments>
      </event>
    </report>
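A report in this shape is easy to summarize in a post-processing step. The sketch below is an illustration under stated assumptions, not the `postprocess.py` script from the example directory: the helper `read_events` is hypothetical, and it assumes every `event` element holds one `score` and one `alert` child, as in the report shown on the slide.

```python
import xml.etree.ElementTree as ET

# Scoring report copied from the slide.
REPORT_TEXT = """<report>
  <event>
    <score>0.471458430077</score>
    <alert>True</alert>
    <Segments></Segments>
  </event>
</report>"""


def read_events(report_text):
    """Return a list of (score, alert) tuples, one per <event> element."""
    events = []
    for event in ET.fromstring(report_text).iter("event"):
        score = float(event.findtext("score"))
        alert = event.findtext("alert").strip() == "True"
        events.append((score, alert))
    return events


events = read_events(REPORT_TEXT)
alerts = [e for e in events if e[1]]
print("%d event(s), %d alert(s)" % (len(events), len(alerts)))
```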

Unitable

Key Features of Unitable

Key Features of Unitable (cont)

For more information
