2. About Me
Software Developer at Genstruct
Work directly with scientists
Design algorithms to aid in drug testing
“Programming Collective Intelligence”
Published by O’Reilly
Due out in August
Consult with open-source projects and other companies
http://kiwitobes.com
3. Presentation Goals
Look at some Open APIs
Get some data
Visualize algorithms for data-mining
Work through some Python code
Variety of techniques and sources
Advocacy (why you should care)
4. Open data APIs
Zillow
Yahoo Answers
eBay
Amazon
Facebook
Technorati
del.icio.us
Twitter
HotOrNot
Google News
Upcoming
programmableweb.com/apis for more…
5. Open API uses
Mashups
Integration
Automation
Command-line tools
Most importantly, creating datasets!
6. What is data mining?
From a large dataset find the:
Implicit
Unknown
Useful
Data could be:
Tabular, e.g. Price lists
Free text
Pictures
7. Why it’s important now
More devices produce more data
People share more data
The internet is vast
Products are more customized
Advertising is targeted
Human cognition is limited
8. Traditional Applications
Computational Biology
Financial Markets
Retail Markets
Fraud Detection
Surveillance
Supply Chain Optimization
National Security
9. Traditional = Inaccessible
Real applications are esoteric
Tutorial examples are trivial
Generally lacking in “interest value”
10. Fun, Accessible Applications
Home price modeling
Where are the hottest people?
Which bloggers are similar?
Important attributes on eBay
Predicting fashion trends
Movie popularity
12. The Zillow API
Allows querying by address
Returns information about the property
Bedrooms
Bathrooms
Zip Code
Price Estimate
Last Sale Price
Requires registration key
http://www.zillow.com/howto/api/PropertyDetailsAPIOverview.htm
13. The Zillow API
REST Request
http://www.zillow.com/webservice/GetDeepSearchResults.htm?
zws-id=key&address=address&citystatezip=citystateszip
16. Zillow from Python
import urllib2
import xml.dom.minidom

# Assumes zwskey is set to your Zillow API key (from registration)
def getaddressdata(address,city):
  escad=address.replace(' ','+')
  # Construct the URL
  url='http://www.zillow.com/webservice/GetDeepSearchResults.htm?'
  url+='zws-id=%s&address=%s&citystatezip=%s' % (zwskey,escad,city)
  # Parse resulting XML
  doc=xml.dom.minidom.parseString(urllib2.urlopen(url).read())
  code=doc.getElementsByTagName('code')[0].firstChild.data
  # Code 0 means success, otherwise there was an error
  if code!='0': return None
  # Extract the info about this property
  try:
    zipcode=doc.getElementsByTagName('zipcode')[0].firstChild.data
    use=doc.getElementsByTagName('useCode')[0].firstChild.data
    year=doc.getElementsByTagName('yearBuilt')[0].firstChild.data
    bath=doc.getElementsByTagName('bathrooms')[0].firstChild.data
    bed=doc.getElementsByTagName('bedrooms')[0].firstChild.data
    rooms=doc.getElementsByTagName('totalRooms')[0].firstChild.data
    price=doc.getElementsByTagName('amount')[0].firstChild.data
  except:
    # Any missing field means we can't use this property
    return None
  return (zipcode,use,int(year),float(bath),int(bed),int(rooms),price)
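A usage sketch for building a dataset from the function above. The street addresses here are hypothetical, and zwskey must hold a valid key:

# Hypothetical usage: collect rows for the dataset on the next slide
addresslist=['123 Main St','45 Elm St']   # made-up addresses
housedata=[]
for address in addresslist:
  row=getaddressdata(address,'Cambridge,MA')
  if row is not None: housedata.append(row)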
20. A home price dataset
House  Zip    Bathrooms  Bedrooms  Built  Type     Price
A      02138  1.5        2         1847   Single   505296
B      02139  3.5        9         1916   Triplex  776378
C      02140  3.5        4         1894   Duplex   595027
D      02139  2.5        4         1854   Duplex   552213
E      02138  3.5        5         1909   Duplex   947528
F      02138  3.5        4         1930   Single   2107871
etc..
21. What can we learn?
A made-up house's price
How important is Zip Code?
What are the important attributes?
Can we do better than averages?
24. Minimizing deviation
Standard deviation is the “spread” of results
Try all possible divisions
Choose the division that decreases deviation the most

A   B       Value
10  Circle  20
11  Square  22
22  Square  8
18  Circle  6

Initially: Average = 14, Standard Deviation = 8.2
25. Minimizing deviation
Standard deviation is the “spread” of results
Try all possible divisions
Choose the division that decreases deviation the most

A   B       Value
10  Circle  20
11  Square  22
22  Square  8
18  Circle  6

B = Circle: Average = 13, Standard Deviation = 9.9
B = Square: Average = 15, Standard Deviation = 9.9
26. Minimizing deviation
Standard deviation is the “spread” of results
Try all possible divisions
Choose the division that decreases deviation the most

A   B       Value
10  Circle  20
11  Square  22
22  Square  8
18  Circle  6

A > 18:  Average = 8, Standard Deviation = 0
A <= 18: Average = 16, Standard Deviation = 8.7
27. Minimizing deviation
Standard deviation is the “spread” of results
Try all possible divisions
Choose the division that decreases deviation the most

A   B       Value
10  Circle  20
11  Square  22
22  Square  8
18  Circle  6

A > 11:  Average = 7, Standard Deviation = 1.4
A <= 11: Average = 21, Standard Deviation = 1.4
28. Python Code
def variance(rows):
  if len(rows)==0: return 0
  data=[float(row[len(row)-1]) for row in rows]
  mean=sum(data)/len(data)
  variance=sum([(d-mean)**2 for d in data])/len(data)
  return variance

def divideset(rows,column,value):
  # Make a function that tells us if a row is in
  # the first group (true) or the second group (false)
  split_function=None
  if isinstance(value,int) or isinstance(value,float):
    split_function=lambda row:row[column]>=value
  else:
    split_function=lambda row:row[column]==value
  # Divide the rows into two sets and return them
  set1=[row for row in rows if split_function(row)]
  set2=[row for row in rows if not split_function(row)]
  return (set1,set2)
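To connect this code to the toy table from the previous slides, a small sketch that tries every division and keeps the one with the lowest weighted variance (the same idea buildtree uses later):

rows=[[10,'Circle',20],[11,'Square',22],[22,'Square',8],[18,'Circle',6]]
best=None
for col in range(2):                      # columns A and B
  for value in set([row[col] for row in rows]):
    set1,set2=divideset(rows,col,value)
    p=float(len(set1))/len(rows)
    score=p*variance(set1)+(1-p)*variance(set2)
    if best is None or score<best[0]:
      best=(score,col,value)
print best   # lowest weighted variance, and the (column, value) that got it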
32. CART Algorithm
A   B       Value
10  Circle  20
11  Square  22
22  Square  8
18  Circle  6
34. CART Algorithm
A   B       Value
22  Square  8
10  Circle  20
18  Circle  6
11  Square  22
36. Python Code
def buildtree(rows,scoref=variance):
  if len(rows)==0: return decisionnode()
  current_score=scoref(rows)

  # Set up some variables to track the best criteria
  best_gain=0.0
  best_criteria=None
  best_sets=None

  column_count=len(rows[0])-1
  for col in range(0,column_count):
    # Generate the list of different values in this column
    column_values={}
    for row in rows:
      column_values[row[col]]=1
    # Now try dividing the rows up for each value in this column
    for value in column_values.keys():
      (set1,set2)=divideset(rows,col,value)

      # Information gain
      p=float(len(set1))/len(rows)
      gain=current_score-p*scoref(set1)-(1-p)*scoref(set2)
      if gain>best_gain and len(set1)>0 and len(set2)>0:
        best_gain=gain
        best_criteria=(col,value)
        best_sets=(set1,set2)

  # Create the sub branches
  if best_gain>0:
    trueBranch=buildtree(best_sets[0])
    falseBranch=buildtree(best_sets[1])
    return decisionnode(col=best_criteria[0],value=best_criteria[1],
                        tb=trueBranch,fb=falseBranch)
  else:
    return decisionnode(results=uniquecounts(rows))
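buildtree refers to decisionnode and uniquecounts, which the deck doesn't show; a minimal sketch matching the structure of the book's tree code:

class decisionnode:
  def __init__(self,col=-1,value=None,results=None,tb=None,fb=None):
    self.col=col          # column index of the criterion being tested
    self.value=value      # value the column must match or exceed
    self.results=results  # dict of results for a leaf node, else None
    self.tb=tb            # subtree where the criterion is true
    self.fb=fb            # subtree where the criterion is false

def uniquecounts(rows):
  # Count how often each result (last column) appears
  results={}
  for row in rows:
    r=row[len(row)-1]
    results.setdefault(r,0)
    results[r]+=1
  return results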
43. Supervised and Unsupervised
Regression trees are supervised
“answers” are in the dataset
Tree models predict answers
Some methods are unsupervised
There are no answers
Methods just characterize the data
Show interesting patterns
44. Next challenge - Bloggers
Millions of blogs online
Usually focus on a subject area
Can they be characterized automatically?
… using only the words in the posts?
47. Getting the content
Use Mark Pilgrim’s Universal Feed Parser (feedparser)
Retrieve the post titles and text
Split up the words
Count occurrence of each word
48. Python Code
import feedparser
import re

# Returns title and dictionary of word counts for an RSS feed
def getwordcounts(url):
  # Parse the feed
  d=feedparser.parse(url)
  wc={}
  # Loop over all the entries
  for e in d.entries:
    if 'summary' in e: summary=e.summary
    else: summary=e.description
    # Extract a list of words
    words=getwords(e.title+' '+summary)
    for word in words:
      wc.setdefault(word,0)
      wc[word]+=1
  return d.feed.title,wc

def getwords(html):
  # Remove all the HTML tags
  txt=re.compile(r'<[^>]+>').sub('',html)
  # Split words by all non-alpha characters
  words=re.compile(r'[^A-Za-z]+').split(txt)
  # Convert to lowercase
  return [word.lower() for word in words if word!='']
51. Building a Word Matrix
Build a matrix of word counts
Blogs are rows, words are columns
Eliminate words that are:
Too common
Too rare
52. Python Code
apcount={}
wordcounts={}
# Read the list of feed URLs so len(feedlist) works below
feedlist=[line for line in file('feedlist.txt')]
for feedurl in feedlist:
  title,wc=getwordcounts(feedurl)
  wordcounts[title]=wc
  for word,count in wc.items():
    apcount.setdefault(word,0)
    if count>1:
      apcount[word]+=1

# Keep words that are neither too rare nor too common
wordlist=[]
for w,bc in apcount.items():
  frac=float(bc)/len(feedlist)
  if frac>0.1 and frac<0.5: wordlist.append(w)

# Write the blog/word matrix as a tab-separated file
out=file('blogdata.txt','w')
out.write('Blog')
for word in wordlist: out.write('\t%s' % word)
out.write('\n')
for blog,wc in wordcounts.items():
  out.write(blog)
  for word in wordlist:
    if word in wc: out.write('\t%d' % wc[word])
    else: out.write('\t0')
  out.write('\n')
67. Python Code
def hcluster(rows,distance=pearson):
  distances={}
  currentclustid=-1

  # Clusters are initially just the rows
  clust=[bicluster(rows[i],id=i) for i in range(len(rows))]

  while len(clust)>1:
    lowestpair=(0,1)
    closest=distance(clust[0].vec,clust[1].vec)

    # loop through every pair looking for the smallest distance
    for i in range(len(clust)):
      for j in range(i+1,len(clust)):
        # distances is the cache of distance calculations
        if (clust[i].id,clust[j].id) not in distances:
          distances[(clust[i].id,clust[j].id)]=distance(clust[i].vec,clust[j].vec)
        d=distances[(clust[i].id,clust[j].id)]
        if d<closest:
          closest=d
          lowestpair=(i,j)

    # calculate the average of the two clusters
    mergevec=[
      (clust[lowestpair[0]].vec[i]+clust[lowestpair[1]].vec[i])/2.0
      for i in range(len(clust[0].vec))]

    # create the new cluster
    newcluster=bicluster(mergevec,left=clust[lowestpair[0]],
                         right=clust[lowestpair[1]],
                         distance=closest,id=currentclustid)

    # cluster ids that weren’t in the original set are negative
    currentclustid-=1
    del clust[lowestpair[1]]
    del clust[lowestpair[0]]
    clust.append(newcluster)

  return clust[0]
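hcluster relies on a bicluster class and a pearson distance that the deck omits; a sketch along the lines of the book's clustering code (it returns 1 − r, so smaller means more similar):

from math import sqrt

class bicluster:
  def __init__(self,vec,left=None,right=None,distance=0.0,id=None):
    self.vec=vec            # the word-count vector for this cluster
    self.left=left          # merged child clusters, None for leaves
    self.right=right
    self.distance=distance  # how far apart the children were
    self.id=id              # positive for original rows, negative for merges

def pearson(v1,v2):
  # Pearson correlation turned into a distance
  n=float(len(v1))
  sum1,sum2=sum(v1),sum(v2)
  sum1Sq=sum([v*v for v in v1])
  sum2Sq=sum([v*v for v in v2])
  pSum=sum([v1[i]*v2[i] for i in range(len(v1))])
  num=pSum-(sum1*sum2/n)
  den=sqrt((sum1Sq-sum1**2/n)*(sum2Sq-sum2**2/n))
  if den==0: return 0
  return 1.0-num/den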
76. K-Means Clustering
Divides data into distinct clusters
User determines how many
Algorithm
Start with arbitrary centroids
Assign points to centroids
Move the centroids
Repeat
82. Python Code
import random

def kcluster(rows,distance=pearson,k=4):
  # Determine the minimum and maximum values for each point
  ranges=[(min([row[i] for row in rows]),max([row[i] for row in rows]))
          for i in range(len(rows[0]))]

  # Create k randomly placed centroids
  clusters=[[random.random()*(ranges[i][1]-ranges[i][0])+ranges[i][0]
             for i in range(len(rows[0]))] for j in range(k)]

  lastmatches=None
  for t in range(100):
    print 'Iteration %d' % t
    bestmatches=[[] for i in range(k)]

    # Find which centroid is the closest for each row
    for j in range(len(rows)):
      row=rows[j]
      bestmatch=0
      for i in range(k):
        d=distance(clusters[i],row)
        if d<distance(clusters[bestmatch],row): bestmatch=i
      bestmatches[bestmatch].append(j)

    # If the results are the same as last time, this is complete
    if bestmatches==lastmatches: break
    lastmatches=bestmatches

    # Move the centroids to the average of their members
    for i in range(k):
      avgs=[0.0]*len(rows[0])
      if len(bestmatches[i])>0:
        for rowid in bestmatches[i]:
          for m in range(len(rows[rowid])):
            avgs[m]+=rows[rowid][m]
        for j in range(len(avgs)):
          avgs[j]/=len(bestmatches[i])
        clusters[i]=avgs

  return bestmatches
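To get from blogdata.txt to the results on the next slide, the word matrix has to be loaded back in; a loader sketch in the same style (readfile is the book's name for it):

def readfile(filename):
  lines=[line for line in file(filename)]
  # First line is the header: blog name column plus the words
  colnames=lines[0].strip().split('\t')[1:]
  rownames=[]
  data=[]
  for line in lines[1:]:
    p=line.strip().split('\t')
    rownames.append(p[0])                    # blog title
    data.append([float(x) for x in p[1:]])   # word counts
  return rownames,colnames,data

# Usage:
# rownames,words,data=readfile('blogdata.txt')
# k=kcluster(data,k=10)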
87. K-Means Results
>> [rownames[r] for r in k[0]]
['The Viral Garden', 'Copyblogger', 'Creating Passionate Users',
 'Oilman', 'ProBlogger Blog Tips', "Seth's Blog"]
>> [rownames[r] for r in k[1]]
['Wonkette', 'Gawker', 'Gothamist', 'Huffington Post']
88. 2D Visualizations
Instead of Clusters, a 2D Map
Goals
Preserve distances as much as possible
Draw in two dimensions
Dimension Reduction
Principal Components Analysis
Multidimensional Scaling
92. Python Code
import random
from math import sqrt

def scaledown(data,distance=pearson,rate=0.01):
  n=len(data)

  # The real distances between every pair of items
  realdist=[[distance(data[i],data[j]) for j in range(n)]
            for i in range(0,n)]

  # Randomly initialize the starting points of the locations in 2D
  loc=[[random.random(),random.random()] for i in range(n)]
  fakedist=[[0.0 for j in range(n)] for i in range(n)]

  lasterror=None
  for m in range(0,1000):
    # Find projected distances
    for i in range(n):
      for j in range(n):
        fakedist[i][j]=sqrt(sum([pow(loc[i][x]-loc[j][x],2)
                                 for x in range(len(loc[i]))]))

    # Move points
    grad=[[0.0,0.0] for i in range(n)]

    totalerror=0
    for k in range(n):
      for j in range(n):
        if j==k: continue
        # The error is percent difference between the distances
        errorterm=(fakedist[j][k]-realdist[j][k])/realdist[j][k]

        # Each point needs to be moved away from or towards the other
        # point in proportion to how much error it has
        grad[k][0]+=((loc[k][0]-loc[j][0])/fakedist[j][k])*errorterm
        grad[k][1]+=((loc[k][1]-loc[j][1])/fakedist[j][k])*errorterm

        # Keep track of the total error
        totalerror+=abs(errorterm)
    print totalerror

    # If the answer got worse by moving the points, we are done
    if lasterror and lasterror<totalerror: break
    lasterror=totalerror

    # Move each of the points by the learning rate times the gradient
    for k in range(n):
      loc[k][0]-=rate*grad[k][0]
      loc[k][1]-=rate*grad[k][1]

  return loc
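The 2D coordinates returned by scaledown are usually dumped to an image (the map slides that followed here). A drawing sketch in the book's style, assuming the Python Imaging Library (PIL) is installed:

from PIL import Image,ImageDraw

def draw2d(data,labels,jpeg='mds2d.jpg'):
  img=Image.new('RGB',(2000,2000),(255,255,255))
  draw=ImageDraw.Draw(img)
  for i in range(len(data)):
    # Shift and scale the unit-square coordinates onto the canvas
    x=(data[i][0]+0.5)*1000
    y=(data[i][1]+0.5)*1000
    draw.text((x,y),labels[i],(0,0,0))
  img.save(jpeg,'JPEG')

# Usage:
# coords=scaledown(data)
# draw2d(coords,rownames)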
103. Numerical Predictions
Back to “supervised” learning
We have a set of numerical attributes
Specs for a laptop
Age and rating for wine
Ratios for a stock
Want to predict another attribute, e.g. price
Formula/model is unknown
105. Statistical regression
Requires specification of a model
Usually linear
Doesn’t handle context
106. Alternative - Interpolation
Find “similar” items
Guess price based on similar items
Need to determine:
What is similar?
How should we aggregate prices?
110. Some Python Code
from xml.dom.minidom import parseString

# sendRequest, getSingleValue and userToken come from the surrounding
# eBay API helper code (not shown in the deck)
def getItem(itemID):
  xml=("<?xml version='1.0' encoding='utf-8'?>"+
       "<GetItemRequest xmlns=\"urn:ebay:apis:eBLBaseComponents\">"+
       "<RequesterCredentials><eBayAuthToken>"+userToken+
       "</eBayAuthToken></RequesterCredentials>"+
       "<ItemID>"+str(itemID)+"</ItemID>"+
       "<DetailLevel>ItemReturnAttributes</DetailLevel>"+
       "</GetItemRequest>")
  data=sendRequest('GetItem',xml)
  result={}
  response=parseString(data)
  result['title']=getSingleValue(response,'Title')
  sellingStatusNode=response.getElementsByTagName('SellingStatus')[0]
  result['price']=getSingleValue(sellingStatusNode,'CurrentPrice')
  result['bids']=getSingleValue(sellingStatusNode,'BidCount')
  seller=response.getElementsByTagName('Seller')
  result['feedback']=getSingleValue(seller[0],'FeedbackScore')
  attributeSet=response.getElementsByTagName('Attribute')
  attributes={}
  for att in attributeSet:
    attID=att.attributes.getNamedItem('attributeID').nodeValue
    attValue=getSingleValue(att,'ValueLiteral')
    attributes[attID]=attValue
  result['attributes']=attributes
  return result
111. Building an item table
          RAM   CPU   HDD  Screen  DVD  Price
D600      512   1400  40   14      1    $350
Lenovo    160   300   5    13      0    $80
T22       256   900   20   14      1    $200
Pavilion  1024  1600  120  17      1    $800
etc..
112. Distance between items
     RAM  CPU   HDD  Screen  DVD  Price
New  512  1400  40   14      1    ???
T22  256  900   20   14      1    $200

Euclidean, just like in clustering:
sqrt((512 − 256)² + (1400 − 900)² + (40 − 20)² + (14 − 14)² + (1 − 1)²)
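The kNN code later calls a euclidean function that isn't shown in the deck; a minimal sketch of it:

from math import sqrt

def euclidean(v1,v2):
  # Straight-line distance between two equal-length numeric vectors
  return sqrt(sum([(v1[i]-v2[i])**2 for i in range(len(v1))]))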
113. Idea 1 – use the closest item
With the item whose price I want to guess:
Calculate the distance for every item in my dataset
Guess that the price is the same as the closest
This is called kNN with k=1
114. Problems with “outliers”
The closest item may be anomalous
Why?
Exceptional deal that won’t occur again
Something missing from the dataset
Data errors
115. Using an average
       RAM   CPU   HDD  Screen  DVD  Price
New    512   1400  40   14      1    ???
No. 1  512   1400  30   13      1    $360
No. 2  512   1400  60   14      1    $400
No. 3  1024  1600  120  15      0    $325

k=3, estimate = $361
116. Using a weighted average
       RAM   CPU   HDD  Screen  DVD  Price  Weight
New    512   1400  40   14      1    ???
No. 1  512   1400  30   13      1    $360   3
No. 2  512   1400  60   14      1    $400   2
No. 3  1024  1600  120  15      0    $325   1

Estimate = $367
117. Python code
def getdistances(data,vec1):
  distancelist=[]
  for i in range(len(data)):
    vec2=data[i]['input']
    distancelist.append((euclidean(vec1,vec2),i))
  distancelist.sort()
  return distancelist

def weightedknn(data,vec1,k=5,weightf=gaussian):
  # Get distances
  dlist=getdistances(data,vec1)
  avg=0.0
  totalweight=0.0

  # Get weighted average
  for i in range(k):
    dist=dlist[i][0]
    idx=dlist[i][1]
    weight=weightf(dist)
    avg+=weight*data[idx]['result']
    totalweight+=weight
  avg=avg/totalweight
  return avg
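weightedknn defaults to a gaussian weight function that the deck doesn't show; a sketch matching the book's version, where weight falls off smoothly with distance but never reaches zero (so totalweight stays positive):

import math

def gaussian(dist,sigma=10.0):
  # Bell-curve weighting: nearby items count for more
  return math.e**(-dist**2/(2*sigma**2))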
121. Determining the best k
Divide the dataset up
Training set
Test set
Guess the prices for the test set using the training set
See how good the guesses are for different values of k
Known as “cross-validation”
122. Determining the best k
Test set:
Attribute  Price
10         20

Training set:
Attribute  Price
11         30
8          10
6          0

For k = 1, guess = 30, error = 10
For k = 2, guess = 20, error = 0
For k = 3, guess = 13, error = 7
Repeat with different test sets, average the error
123. Python code
from random import random

def dividedata(data,test=0.05):
  trainset=[]
  testset=[]
  for row in data:
    if random()<test:
      testset.append(row)
    else:
      trainset.append(row)
  return trainset,testset

def testalgorithm(algf,trainset,testset):
  error=0.0
  for row in testset:
    guess=algf(trainset,row['input'])
    error+=(row['result']-guess)**2
  return error/len(testset)

def crossvalidate(algf,data,trials=100,test=0.05):
  error=0.0
  for i in range(trials):
    trainset,testset=dividedata(data,test)
    error+=testalgorithm(algf,trainset,testset)
  return error/trials
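A usage sketch for comparing values of k with the code above. Here data is the list of {'input':…, 'result':…} items built earlier, and the k values are just examples:

def knn(k):
  # Wrap weightedknn so crossvalidate can try different k values
  return lambda d,v: weightedknn(d,v,k=k)

for k in (1,3,5):
  print k, crossvalidate(knn(k),data)   # lower error = better k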
130. Determining the best scale
Try different weights
Use the “cross-validation” method
Different ways of choosing a scale:
Range-scaling
Intuitive guessing
Optimization
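Range-scaling can be folded into the same cross-validation loop by multiplying each attribute by a per-column weight; a sketch in the book's style (the weights in the usage line are illustrative, not recommendations):

def rescale(data,scale):
  # Multiply every input attribute by its column's weight
  scaleddata=[]
  for row in data:
    scaled=[scale[i]*row['input'][i] for i in range(len(scale))]
    scaleddata.append({'input':scaled,'result':row['result']})
  return scaleddata

# Usage: cross-validate a candidate scale, then let an optimizer search
# print crossvalidate(knn(3),rescale(data,[1.0,1.0,4.0,2.0,1.0]))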
132. New projects
Openads
An open-source ad server
Users can share impression/click data
Matrix of what gets clicked, based on:
Page Text
Ad
Ad placement
Search query
Can we improve targeting?
133. New Projects
Finance
Analysts already drowning in info
Stories sometimes broken on blogs
Message boards show sentiment
Extremely low signal-to-noise ratio
134. New Projects
Entertainment
How much buzz is a movie generating?
Which psychographic profiles like this type of movie?
Of interest to studios and media investors