Data Mining Open Ap Is

Data Mining and Open APIs

Toby Segaran

About Me
Software Developer at Genstruct
Work directly with scientists
Design algorithms to aid in drug testing
“Programming Collective Intelligence”
Published by O’Reilly
Due out in August
Consult with open-source projects and other
companies
http://kiwitobes.com

Presentation Goals

Look at some Open APIs
Get some data
Visualize algorithms for data-mining
Work through some Python code
Variety of techniques and sources

Advocacy (why you should care)

Open data APIs

Zillow Yahoo Answers
eBay Amazon
Facebook Technorati
del.icio.us Twitter
HotOrNot Google News
Upcoming

programmableweb.com/apis for more…

Open API uses

Mashups
Integration
Automation
Command-line tools
Most importantly, creating datasets!

What is data mining?

From a large dataset find the:
Implicit
Unknown
Useful
Data could be:
Tabular, e.g. Price lists
Free text
Pictures

Why it’s important now

More devices produce more data
People share more data
The internet is vast
Products are more customized
Advertising is targeted
Human cognition is limited

Traditional Applications

Computational Biology
Financial Markets
Retail Markets
Fraud Detection
Surveillance
Supply Chain Optimization
National Security

Traditional = Inaccessible

Real applications are esoteric
Tutorial examples are trivial
Generally lacking in “interest value”

Fun, Accessible Applications

Home price modeling
Where are the hottest people?
Which bloggers are similar?
Important attributes on eBay
Predicting fashion trends
Movie popularity

The Zillow API

Allows querying by address
Returns information about the property
Bedrooms
Bathrooms
Zip Code
Price Estimate
Last Sale Price
Requires registration key
http://www.zillow.com/howto/api/PropertyDetailsAPIOverview.htm

The Zillow API

REST Request

http://www.zillow.com/webservice/GetDeepSearchResults.htm?
zws-id=key&address=address&citystatezip=citystateszip

The Zillow API
<SearchResults:searchresults xmlns:SearchResults=quot;http://www.
zillow.com/vstatic/3/static/xsd/SearchResults.xsdquot;>
…
<response>
<results>
<result>
<zpid>48749425</zpid>
<links>
…
</links>
<address>
<street>2114 Bigelow Ave N</street>
<zipcode>98109</zipcode>
<city>SEATTLE</city>
<state>WA</state>
<latitude>47.637934</latitude> <longitude>-122.347936</longitude>
</address>
<yearBuilt>1924</yearBuilt>
<lotSizeSqFt>4680</lotSizeSqFt>
<finishedSqFt>3290</finishedSqFt>
<bathrooms>2.75</bathrooms>
<bedrooms>4</bedrooms>
<lastSoldDate>06/18/2002</lastSoldDate>
<lastSoldPrice currency=quot;USDquot;>770000</lastSoldPrice>
<valuation>
<amount currency=quot;USDquot;>1091061</amount>
</result>
</results>
</response>

The Zillow API
<SearchResults:searchresults xmlns:SearchResults=quot;http://www.
zillow.com/vstatic/3/static/xsd/SearchResults.xsdquot;>
…
<response>
<results>
<result>
<state>WA</state>
<zpid>48749425</zpid>
<links>
<latitude>47.637934</latitude>
…
<longitude>-122.347936</longitude>
</links>
<address>
</address>Bigelow Ave N</street>
<street>2114
<state>WA</state>
<latitude>47.637934</latitude> <longitude>-122.347936</longitude>
</address>
<lastSoldPrice currency=quot;USDquot;>770000</lastSoldPrice>
<valuation> currency=quot;USDquot;>770000</lastSoldPrice>
<lastSoldPrice
<valuation>
<amountcurrency=quot;USDquot;>1091061</amount>
currency=quot;USDquot;>1091061</amount>
<amount
</result>
</results>
</response>

Zillow from Python
def getaddressdata(address,city):
escad=address.replace(' ','+')

# Construct the URL
url='http://www.zillow.com/webservice/GetDeepSearchResults.htm?'
url+='zws-id=%s&address=%s&citystatezip=%s' % (zwskey,escad,city)

# Parse resulting XML
doc=xml.dom.minidom.parseString(urllib2.urlopen(url).read())
code=doc.getElementsByTagName('code')[0].firstChild.data

# Code 0 means success, otherwise there was an error
if code!='0': return None

# Extract the info about this property
try:
zipcode=doc.getElementsByTagName('zipcode')[0].firstChild.data
use=doc.getElementsByTagName('useCode')[0].firstChild.data
year=doc.getElementsByTagName('yearBuilt')[0].firstChild.data
bath=doc.getElementsByTagName('bathrooms')[0].firstChild.data
bed=doc.getElementsByTagName('bedrooms')[0].firstChild.data
rooms=doc.getElementsByTagName('totalRooms')[0].firstChild.data
price=doc.getElementsByTagName('amount')[0].firstChild.data
except:
return None

return (zipcode,use,int(year),float(bath),int(bed),int(rooms),price)

Zillow from Python

# Construct the URL
# Construct the URL


try:
except:
return None


Zillow from Python

# Construct the URL

# Parse resulting
XML

try:
except:
return None


Zillow from Python

# Construct the URL



try:
except:
return None


A home price dataset

House Zip Bathrooms Bedrooms Built Type Price

A 02138 1.5 2 1847 Single 505296

B 02139 3.5 9 1916 Triplex 776378

C 02140 3.5 4 1894 Duplex 595027

D 02139 2.5 4 1854 Duplex 552213

E 02138 3.5 5 1909 Duplex 947528

F 02138 3.5 4 1930 Single 2107871

etc..

What can we learn?

A made-up houses price
How important is Zip Code?
What are the important attributes?

Can we do better than averages?

Introducing Regression Trees
A B Value
10 Circle 20
11 Square 22
22 Square 8
18 Circle 6

Minimizing deviation
Standard deviation is the “spread” of results
Try all possible divisions
Choose the division that decreases deviation the most

Initially
A B Value
Average = 14
10 Circle 20
Standard Deviation = 8.2
11 Square 22

22 Square 8

18 Circle 6


B = Circle
A B Value
Average = 13
10 Circle 20
11 Square 22

22 Square 8
B = Square
18 Circle 6
Average = 15


A > 18
A B Value
Average = 8
10 Circle 20
Standard Deviation = 0
11 Square 22

22 Square 8
A <= 20
18 Circle 6
Average = 16


A > 11
A B Value
Average = 7
10 Circle 20
11 Square 22

22 Square 8
A <= 11
18 Circle 6
Average = 21

Python Code
def variance(rows):
if len(rows)==0: return 0
data=[float(row[len(row)-1]) for row in rows]
mean=sum(data)/len(data)
variance=sum([(d-mean)**2 for d in data])/len(data)
return variance

def divideset(rows,column,value):
# Make a function that tells us if a row is in
# the first group (true) or the second group (false)
split_function=None
if isinstance(value,int) or isinstance(value,float):
split_function=lambda row:row[column]>=value
else:
split_function=lambda row:row[column]==value

# Divide the rows into two sets and return them
set1=[row for row in rows if split_function(row)]
set2=[row for row in rows if not split_function(row)]
return (set1,set2)

Python Code
def variance(rows):
def variance(rows):
if len(rows)==0: return for row in rows]
data=[float(row[len(row)-1]) 0
mean=sum(data)/len(data)d in data])/len(data)
variance=sum([(d-mean)**2 for
return variance
return variance
split_function=None
else:

return (set1,set2)

Python Code
def variance(rows):
return variance

# def divideset(rows,column,value): us if a row is in
Make a function that tells
# the Make a function (true) or the asecond in
# first group that tells us if row is group (false)
split_function=None
split_function=None
else:
else:
return (set1,set2)

Python Code
def variance(rows):
return variance

split_function=None
else:

# Divide the rows into two sets and returnreturn them
# Divide the rows into two sets and them
in rows if split_function(row)]
set2=[row for row
set2=[row(set1,set2) in rows if not split_function(row)]
for row
return
return (set1,set2)

CART Algoritm
A B Value
10 Circle 20
11 Square 22
22 Square 8
18 Circle 6

CART Algoritm

22 Square 8
10 Circle 20
18 Circle 6
11 Square 22

Python Code
def buildtree(rows,scoref=variance):
if len(rows)==0: return decisionnode()
current_score=scoref(rows)
# Set up some variables to track the best criteria
best_gain=0.0
best_criteria=None
best_sets=None
column_count=len(rows[0])-1
for col in range(0,column_count):
# Generate the list of different values in
# this column
column_values={}
for row in rows:
column_values[row[col]]=1
# Now try dividing the rows up for each value
# in this column
for value in column_values.keys():
(set1,set2)=divideset(rows,col,value)
# Information gain
p=float(len(set1))/len(rows)
gain=current_score-p*scoref(set1)-(1-p)*scoref(set2)
if gain>best_gain and len(set1)>0 and len(set2)>0:
best_gain=gain
best_criteria=(col,value)
best_sets=(set1,set2)
# Create the sub branches
if best_gain>0:
trueBranch=buildtree(best_sets[0])
falseBranch=buildtree(best_sets[1])
return decisionnode(col=best_criteria[0],value=best_criteria[1],tb=trueBranch,fb=falseBranch)
else:
return decisionnode(results=uniquecounts(rows))

Python Code
current_score=scoref(rows) criteria
# Set up some variables to track the best
#best_gain=0.0some variables to track the best criteria
Set up
best_criteria=None
best_gain=0.0
best_sets=None
best_criteria=None
best_sets=None of different values in
# Generate the list
# this column
column_values={}
for row in rows:
# in this column
# Information gain
best_gain=gain
if best_gain>0:
else:

Python Code
best_gain=0.0
best_criteria=None
best_sets=None
# this column
column_values={}
for row in rows:
for try dividing the rows up for each value
# Now value in column_values.keys():
# in this column
# Information gain
# Information gain
best_gain=gain
best_gain=gain
if best_gain>0:
else:

Python Code
best_gain=0.0
best_criteria=None
best_sets=None
# this column
column_values={}
for row in rows:
# in this column
# Information gain
if best_gain>0: and len(set1)>0 and len(set2)>0:
if gain>best_gain
best_gain=gain
if best_gain>0:
return decisionnode(col=best_criteria[0],value=best_criteria[1],
tb=trueBranch,fb=falseBranch)
else:
else:

Zillow Results

Bathrooms > 3

Zip: 02139? After 1903?

Zip: 02140? Bedrooms > 4? Duplex? Triplex?

Supervised and Unsupervised

Regression trees are supervised
“answers” are in the dataset
Tree models predict answers
Some methods are unsupervised
There are no answers
Methods just characterize the data
Show interesting patterns

Next challenge - Bloggers

Millions of blogs online
Usually focus on a subject area
Can they be characterized
automatically?
… using only the words in the posts?

Getting the content

Use Mark Pilgrim’s Universal Feed
Reader
Retrieve the post titles and text
Split up the words
Count occurrence of each word

Python Code
import feedparser
import re
# Returns title and dictionary of word counts for an RSS feed
def getwordcounts(url):
# Parse the feed
d=feedparser.parse(url)
wc={}
# Loop over all the entries
for e in d.entries:
if 'summary' in e: summary=e.summary
else: summary=e.description
# Extract a list of words
words=getwords(e.title+' '+summary)
for word in words:
wc.setdefault(word,0)
wc[word]+=1
return d.feed.title,wc

def getwords(html):
# Remove all the HTML tags
txt=re.compile(r'<[^>]+>').sub('',html)
# Split words by all non-alpha characters
words=re.compile(r'[^A-Z^a-z]+').split(txt)
# Convert to lowercase
return [word.lower() for word in words if word!='']

Python Code
import feedparser
import re
# Parse the feed
wc={}
for e in d.entries:
for e in d.entries:
else: summary=e.description words
# Extract a list of
for word in words:
for word in words:
wc[word]+=1
wc[word]+=1

def getwords(html):
# Split words by all non-alpha characters
# Convert to lowercase

Python Code
import feedparser
import re
# Parse the feed
wc={}
for e in d.entries:
for word in words:
wc[word]+=1
def getwords(html):
# Remove def getwords(html):
all the HTML tags
# Split words bywords by all non-alpha characters
# Split all non-alpha characters
# Convert # Convert to lowercase
to lowercase

Building a Word Matrix

Build a matrix of word counts
Blogs are rows, words are columns
Eliminate words that are:
Too common
Too rare

Python Code
apcount={}
wordcounts={}
for feedurl in file('feedlist.txt'):
title,wc=getwordcounts(feedurl)
wordcounts[title]=wc
for word,count in wc.items():
apcount.setdefault(word,0)
if count>1:
apcount[word]+=1

wordlist=[]
for w,bc in apcount.items():
frac=float(bc)/len(feedlist)
if frac>0.1 and frac<0.5: wordlist.append(w)

out=file('blogdata.txt','w')
out.write('Blog')
for word in wordlist: out.write('t%s' % word)
out.write('n')
for blog,wc in wordcounts.items():
out.write(blog)
for word in wordlist:
if word in wc: out.write('t%d' % wc[word])
else: out.write('t0')
out.write('n')

Python Code
apcount={}
wordcounts={}
for feedurlinin file('feedlist.txt'):
for feedurl file('feedlist.txt'):
forapcount.setdefault(word,0)
word,count in wc.items():
if count>1:
if apcount[word]+=1
count>1:
apcount[word]+=1
wordlist=[]

out.write('Blog')
out.write('n')
out.write(blog)
out.write('n')

Python Code
apcount={}
wordcounts={}
if count>1:
apcount[word]+=1
wordlist=[]
wordlist=[]
out.write('Blog')
out.write('n')
out.write(blog)
out.write('n')

Python Code
apcount={}
wordcounts={}
if count>1:
apcount[word]+=1

wordlist=[]
out.write('Blog')
out.write('Blog')
out.write('n')
out.write('n')
for blog,wcinin wordcounts.items():
for blog,wc wordcounts.items():
out.write(blog)
out.write(blog)
for wordin wordlist:
out.write('n')
out.write('n')

The Word Matrix
“china” “kids” “music” “yahoo”

Gothamist 0 3 3 0

GigaOM 6 0 1 2

Quick Online Tips 0 2 2 12

Determining distance
“china” “kids” “music” “yahoo”

Gothamist 0 3 3 0

GigaOM 6 0 1 2

Quick Online Tips 0 2 2 12

Euclidean “as the crow flies”

(6 − 0) 2 + (0 − 2) 2 + (1 − 2) 2 + (2 − 12) 2
= 12 (approx)

Other Distance Metrics

Manhattan
Tanamoto
Pearson Correlation
Chebychev
Spearman

Hierarchical Clustering

Find the two closest item
Combine them into a single item
Repeat…

Python Code

class bicluster:
def
__init__(self,vec,left=None,right=None,distance=0.0,id=None):
self.left=left
self.right=right
self.vec=vec
self.id=id
self.distance=distance

Python Code
def hcluster(rows,distance=pearson):
distances={}
currentclustid=-1
# Clusters are initially just the rows
clust=[bicluster(rows[i],id=i) for i in range(len(rows))]
while len(clust)>1:
lowestpair=(0,1)
closest=distance(clust[0].vec,clust[1].vec)
# loop through every pair looking for the smallest distance
for i in range(len(clust)):
for j in range(i+1,len(clust)):
# distances is the cache of distance calculations
if (clust[i].id,clust[j].id) not in distances:
distances[(clust[i].id,clust[j].id)]=distance(clust[i].vec,clust[j].vec)
d=distances[(clust[i].id,clust[j].id)]
if d<closest:
closest=d
lowestpair=(i,j)
# calculate the average of the two clusters
mergevec=[
(clust[lowestpair[0]].vec[i]+clust[lowestpair[1]].vec[i])/2.0
for i in range(len(clust[0].vec))]
# create the new cluster
newcluster=bicluster(mergevec,left=clust[lowestpair[0]],
right=clust[lowestpair[1]],
distance=closest,id=currentclustid)
# cluster ids that weren’t in the original set are negative
currentclustid-=1
del clust[lowestpair[1]]
clust.append(newcluster)
return clust[0]

Python Code
distances={}
distances={}
currentclustid=-1
currentclustid=-1
while len(clust)>1:
lowestpair=(0,1)
if d<closest:
closest=d
lowestpair=(i,j)
mergevec=[
currentclustid-=1
return clust[0]

Python Code
distances={}
while len(clust)>1:
currentclustid=-1
lowestpair=(0,1)
while len(clust)>1:
lowestpair=(0,1)
# loop closest=distance(clust[0].vec,clust[1].vec) for the smallest distance
through every pair looking
for i inloopin range(len(clust)):
#
range(len(clust)): for the smallest distance
through every pair looking
for i
for j for j range(i+1,len(clust)):
in in range(i+1,len(clust)):
# distances is the cache of distances:
if (clust[i].id,clust[j].id) not in distance calculations
distances[(clust[i].id,clust[j].id)]=
if d<closest:
closest=d
distance(clust[i].vec,clust[j].vec)
lowestpair=(i,j)
mergevec=[
if (clust[lowestpair[0]].vec[i]+clust[lowestpair[1]].vec[i])/2.0
d<closest:
closest=d
lowestpair=(i,j)
currentclustid-=1
return clust[0]

Python Code
distances={}
currentclustid=-1
while len(clust)>1:
lowestpair=(0,1)
# calculate distances[(clust[i].id,clust[j].id)]=distance(clust[i].vec,clust[j].vec)
the average of the two clusters
mergevec=[ if d<closest:
closest=d
lowestpair=(i,j)
#in range(len(clust[0].vec))
calculate the average of the two clusters
for i mergevec=[
] (clust[lowestpair[0]].vec[i]+clust[lowestpair[1]].vec[i])/2.0
# create for i in range(len(clust[0].vec))]
#the new new cluster
create the cluster
currentclustid-=1
return clust[0]

Rotating the Matrix

Words in a blog -> blogs containing each word

Gothamist GigaOM Quick Onl
china 0 6 0
kids 3 0 2
music 3 1 2
Yahoo 0 2 12

K-Means Clustering

Divides data into distinct clusters
User determines how many
Algorithm
Start with arbitrary centroids
Assign points to centroids
Move the centroids
Repeat

Python Code
import random
def kcluster(rows,distance=pearson,k=4):
# Determine the minimum and maximum values for each point
ranges=[(min([row[i] for row in rows]),max([row[i] for row in rows]))
for i in range(len(rows[0]))]
# Create k randomly placed centroids
clusters=[[random.random()*(ranges[i][1]-ranges[i][0])+ranges[i][0]
for i in range(len(rows[0]))] for j in range(k)]

lastmatches=None
for t in range(100):
print 'Iteration %d' % t
bestmatches=[[] for i in range(k)]

# Find which centroid is the closest for each row
for j in range(len(rows)):
row=rows[j]
bestmatch=0
for i in range(k):
d=distance(clusters[i],row)
if d<distance(clusters[bestmatch],row): bestmatch=i
bestmatches[bestmatch].append(j)
# If the results are the same as last time, this is complete
if bestmatches==lastmatches: break
lastmatches=bestmatches

# Move the centroids to the average of their members
for i in range(k):
avgs=[0.0]*len(rows[0])
if len(bestmatches[i])>0:
for rowid in bestmatches[i]:
for m in range(len(rows[rowid])):
avgs[m]+=rows[rowid][m]
for j in range(len(avgs)):
avgs[j]/=len(bestmatches[i])
clusters[i]=avgs

return bestmatches

Python Code
import random
ranges=[(min([row[i] for row in rows]),
max([row[i] for row in rows]))
lastmatches=None
clusters=[[random.random()*
for j(ranges[i][1]-ranges[i][0])+ranges[i][0]
in range(len(rows)):
row=rows[j]
bestmatch=0
for i in range(k):
for j in range(k)]

for i in range(k):
clusters[i]=avgs

return bestmatches

Python Code
import random
lastmatches=None
row=rows[j]
row=rows[j]
bestmatch=0
bestmatch=0
for for iin range(k):
i in range(k):

for i in range(k):
clusters[i]=avgs

return bestmatches

Python Code
import random

lastmatches=None

row=rows[j]
bestmatch=0
for i in range(k):
for i in range(k):
clusters[i]=avgs

return bestmatches

Python Code
import random

lastmatches=None

row=rows[j]
bestmatch=0
for i in range(k):
for i in range(k):
# Move the centroids toin average of their members
for rowid the bestmatches[i]:
for i in range(k):
avgs=[0.0]*len(rows[0])range(len(rows[rowid])):
for m in
clusters[i]=avgs
clusters[i]=avgs

return bestmatches

K-Means Results

>> [rownames[r] for r in k[0]]
['The Viral Garden', 'Copyblogger', 'Creating Passionate Users',
'Oilman', 'ProBlogger Blog Tips', quot;Seth's Blogquot;]

>> [rownames[r] for r in k[1]]
['Wonkette', 'Gawker', 'Gothamist', 'Huffington Post']

2D Visualizations

Instead of Clusters, a 2D Map
Goals
Preserve distances as much as possible
Draw in two dimensions
Dimension Reduction
Principal Components Analysis
Multidimensional Scaling

def scaledown(data,distance=pearson,rate=0.01):
n=len(data)
# The real distances between every pair of items
realdist=[[distance(data[i],data[j]) for j in range(n)]
for i in range(0,n)]
outersum=0.0

# Randomly initialize the starting points of the locations in 2D
loc=[[random.random(),random.random()] for i in range(n)]
fakedist=[[0.0 for j in range(n)] for i in range(n)]

lasterror=None
for m in range(0,1000):
# Find projected distances
for i in range(n):
for j in range(n):
fakedist[i][j]=sqrt(sum([pow(loc[i][x]-loc[j][x],2)
for x in range(len(loc[i]))]))

# Move points
grad=[[0.0,0.0] for i in range(n)]

totalerror=0
for k in range(n):
for j in range(n):
if j==k: continue
# The error is percent difference between the distances
errorterm=(fakedist[j][k]-realdist[j][k])/realdist[j][k]

# Each point needs to be moved away from or towards the other
# point in proportion to how much error it has
grad[k][0]+=((loc[k][0]-loc[j][0])/fakedist[j][k])*errorterm
# Keep track of the total error
totalerror+=abs(errorterm)
print totalerror
# If the answer got worse by moving the points, we are done
if lasterror and lasterror<totalerror: break
lasterror=totalerror

# Move each of the points by the learning rate times the gradient
for k in range(n):
loc[k][0]-=rate*grad[k][0]
return loc

n=len(data) The real distances between every pair of items
n=len(data)
#
# The realrealdist=[[distance(data[i],data[j]) for j inpair of items
distances between every range(n)]
outersum=0.0

# for i initialize the starting points of the locations in 2D
in range(0,n)]
Randomly
outersum=0.0

lasterror=None
for i in range(n):
for j in range(n):

# Move points

totalerror=0
for k in range(n):
for j in range(n):
if j==k: continue

print totalerror

for k in range(n):
return loc

n=len(data)
outersum=0.0

# RandomlyRandomly initialize the startingof the locations in 2D in
# initialize the starting points points of the locations 2D

lasterror=None
for i in range(n):
for j in range(n):

# Move points

totalerror=0
for k in range(n):
for j in range(n):
if j==k: continue

print totalerror

for k in range(n):
return loc

n=len(data)
outersum=0.0


lasterror=None
lasterror=None
for m in # Find projected distances
range(0,1000):
for i in range(n):
for j in range(n):
for i in range(n): for x in range(len(loc[i]))]))
for j in range(n):
# Move points
totalerror=0
for k in range(n):
for j in range(n):
if j==k: continue

print totalerror

for k in range(n):
return loc

n=len(data)
outersum=0.0


# Move points lasterror=None
grad=[[0.0,0.0]# m in range(0,1000):
for
for i in range(n)]
Find projected distances
for i in range(n):
for j in range(n):
totalerror=0 fakedist[i][j]=sqrt(sum([pow(loc[i][x]-loc[j][x],2)
for k in range(n):
# Move points
for j in range(n):
if j==k: continue
totalerror=0
# The errorfor k inpercent difference between the distances
is range(n):
for j in range(n):
if j==k: continue

# Each point needs to be moved away from or towards the towards the
# Each point needs to be moved away from or other
# other point# in proportionhow much error much error it has
point in proportion to
to how it has
print totalerror
# Keep trackIfof answer got worse by moving the points, we are done
the the total error
#

for k in range(n):
return loc

n=len(data)
outersum=0.0


lasterror=None
for i in range(n):
for j in range(n):

# Move points

totalerror=0
for k in range(n):
for j in range(n):
if j==k: continue

print totalerror
for k in range(n):
return loc

n=len(data)
outersum=0.0


lasterror=None
for i in range(n):
for j in range(n):

# Move points

totalerror=0
for k in range(n):
for j in range(n):
if j==k: continue

print totalerror
for k in range(n): of the points by the learning rate times the gradient
# Move each
for k in range(n):
return loc

Numerical Predictions

Back to “supervised” learning
We have a set of numerical attributes
Specs for a laptop
Age and rating for wine
Ratios for a stock
Want to predict another attribute
Formula/model is unknown
e.g. price

Regression Trees?

Regression trees find hard boundaries
Can’t deal with complex formulae

Statistical regression

Requires specification of a model
Usually linear
Doesn’t handle context

Alternative - Interpolation

Find “similar” items
Guess price based on similar items
Need to determine:
What is similar?
How should we aggregate prices?

The eBay API

XML API
Send XML over HTTPS
Receive results in XML

http://developer.ebay.com/quickstartguide.

Some Python Code
def getHeaders(apicall,siteID=quot;0quot;,compatabilityLevel = quot;433quot;):
headers = {quot;X-EBAY-API-COMPATIBILITY-LEVELquot;: compatabilityLevel,
quot;X-EBAY-API-DEV-NAMEquot;: devKey,
quot;X-EBAY-API-APP-NAMEquot;: appKey,
quot;X-EBAY-API-CERT-NAMEquot;: certKey,
quot;X-EBAY-API-CALL-NAMEquot;: apicall,
quot;X-EBAY-API-SITEIDquot;: siteID,
quot;Content-Typequot;: quot;text/xmlquot;}
return headers

def sendRequest(apicall,xmlparameters):
connection = httplib.HTTPSConnection(serverUrl)
connection.request(quot;POSTquot;, '/ws/api.dll', xmlparameters, getHeaders(apicall))
response = connection.getresponse()
if response.status != 200:
print quot;Error sending request:quot; + response.reason
else:
data = response.read()
connection.close()
return data

Some Python Code
def getItem(itemID):
xml = quot;<?xml version='1.0' encoding='utf-8'?>quot;+
quot;<GetItemRequest xmlns=quot;urn:ebay:apis:eBLBaseComponentsquot;>quot;+
quot;<RequesterCredentials><eBayAuthToken>quot; +
userToken +
quot;</eBayAuthToken></RequesterCredentials>quot; +
quot;<ItemID>quot; + str(itemID) + quot;</ItemID>quot;+
quot;<DetailLevel>ItemReturnAttributes</DetailLevel>quot;+
quot;</GetItemRequest>quot;
data=sendRequest('GetItem',xml)
result={}
response=parseString(data)
result['title']=getSingleValue(response,'Title')
sellingStatusNode = response.getElementsByTagName('SellingStatus')[0];
result['price']=getSingleValue(sellingStatusNode,'CurrentPrice')
result['bids']=getSingleValue(sellingStatusNode,'BidCount')
seller = response.getElementsByTagName('Seller')
result['feedback'] = getSingleValue(seller[0],'FeedbackScore')
attributeSet=response.getElementsByTagName('Attribute');
attributes={}
for att in attributeSet:
attID=att.attributes.getNamedItem('attributeID').nodeValue
attValue=getSingleValue(att,'ValueLiteral')
attributes[attID]=attValue
result['attributes']=attributes
return result

Building an item table
RAM CPU HDD Screen DVD Price

D600 512 1400 40 14 1 $350

Lenovo 160 300 5 13 0 $80

T22 256 900 20 14 1 $200

Pavillion 1024 1600 120 17 1 $800

etc..

Distance between items

New 512 1400 40 14 1 ???

T22 256 900 20 14 1 $200

Euclidean, just like in clustering

(512 − 256) 2 + (1400 − 900) 2 + (40 − 20) 2 + (14 − 14) 2 + (1 − 1) 2

Idea 1 – use the closest item

With the item whose price I want to
guess:
Calculate the distance for every item in
my dataset
Guess that the price is the same as the
closest
This is called kNN with k=1

Problems with “outliers”

The closest item may be anomalous
Why?
Exceptional deal that won’t occur again
Something missing from the dataset
Data errors

Using an average

New 512 1400 40 14 1 ???

No. 1 512 1400 30 13 1 $360

No. 2 512 1400 60 14 1 $400

No. 3 1024 1600 120 15 0 $325

k=3, estimate = $361

Using a weighted average
RAM CPU HDD Screen DVD Price Weight

New 512 1400 40 14 1 ???

No. 1 512 1400 30 13 1 $360 3

No. 2 512 1400 60 14 1 $400 2

No. 3 1024 1600 120 15 0 $325 1

Estimate = $367

Python code
def getdistances(data,vec1):
distancelist=[]
for i in range(len(data)):
vec2=data[i]['input']
distancelist.append((euclidean(vec1,vec2),i))
distancelist.sort()
return distancelist
def weightedknn(data,vec1,k=5,weightf=gaussian):
# Get distances
dlist=getdistances(data,vec1)
avg=0.0
totalweight=0.0

# Get weighted average
for i in range(k):
dist=dlist[i][0]
idx=dlist[i][1]
weight=weightf(dist)
avg+=weight*data[idx]['result']
totalweight+=weight
avg=avg/totalweight
return avg

Python code
def getdistances(data,vec1):
distancelist=[]
for i in range(len(data)):
defvec2=data[i]['input']
weightedknn(data,vec1,k=5,weightf=gaussian):
distancelist.append((euclidean(vec1,vec2),i))
# Get distances
distancelist.sort()
dlist=getdistances(data,vec1)
return distancelist
avg=0.0
totalweight=0.0

# Get weighted average
for i in range(k):
dist=dlist[i][0]
idx=dlist[i][1]
weight=weightf(dist)
avg+=weight*data[idx]['result']
totalweight+=weight
avg=avg/totalweight
return avg

Determining the best k

Divide the dataset up
Training set
Test set
Guess the prices for the test set using
the training set
See how good the guesses are for
different values of k
Known as “cross-validation”

Determining the best k
Test set

Attribute Price
Attribute Price 10 20
10 20
Training set
11 30
Attribute Price
8 10
11 30
6 0
8 10
6 0
For k = 1, guess = 30, error = 10

Repeat with different test sets, average the error

Python code
def dividedata(data,test=0.05):
trainset=[]
testset=[]
for row in data:
if random()<test:
testset.append(row)
else:
trainset.append(row)
return trainset,testset

def testalgorithm(algf,trainset,testset):
error=0.0
for row in testset:
guess=algf(trainset,row['input'])
error+=(row['result']-guess)**2
return error/len(testset)

def crossvalidate(algf,data,trials=100,test=0.05):
error=0.0
for i in range(trials):
trainset,testset=dividedata(data,test)
error+=testalgorithm(algf,trainset,testset)
return error/trials

Python code
trainset=[]
testset=[]
for row in data:
if random()<test:
testset.append(row)
else:
error=0.0
for row in testset:

error=0.0
return error/trials

Python code
trainset=[]
testset=[]
for row in data:
if random()<test:
testset.append(row)
else:
def
testalgorithm(algf,trainset,testset):
error=0.0
for row in testset:

error=0.0
return error/trials

Python code
trainset=[]
testset=[]
for row in data:
if random()<test:
testset.append(row)
else:

error=0.0
for row in testset:
error=0.0
return error/trials

Determining the best scale

Try different weights
Use the “cross-validation” method
Different ways of choosing a scale:
Range-scaling
Intuitive guessing
Optimization

Methods covered

Regression trees
Hierarchical clustering
k-means clustering
Multidimensional scaling
Weight k-nearest neighbors

New projects

Openads
An open-source ad server
Users can share impression/click data
Matrix of what hits based on
Page Text
Ad
Ad placement
Search query
Can we improve targeting?

New Projects

Finance
Analysts already drowning in info
Stories sometimes broken on blogs
Message boards show sentiment

Extremely low signal-to-noise ratio

New Projects

Entertainment
How much buzz is a movie generating?
What psychographic profiles like this type
of movie?

Of interest to studios and media investors

Data Mining Open Ap Is

Empfohlen

Empfohlen

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (10)

Andere mochten auch

Andere mochten auch (6)

Ähnlich wie Data Mining Open Ap Is

Ähnlich wie Data Mining Open Ap Is (20)

Mehr von oscon2007

Mehr von oscon2007 (20)

Kürzlich hochgeladen

Kürzlich hochgeladen (20)

Data Mining Open Ap Is