Data Mining and Open APIs


     Toby Segaran
About Me
 Software Developer at Genstruct
   Work directly with scientists
   Design algorithms to aid in drug testing
 “Programming Collective Intelligence”
   Published by O’Reilly
   Due out in August
 Consult with open-source projects and other companies
 http://kiwitobes.com
Presentation Goals

 Look at some Open APIs
 Get some data
 Visualize algorithms for data-mining
 Work through some Python code
 Variety of techniques and sources

 Advocacy (why you should care)
Open data APIs

 Zillow               Yahoo Answers
 eBay                 Amazon
 Facebook             Technorati
 del.icio.us          Twitter
 HotOrNot             Google News
 Upcoming

 programmableweb.com/apis for more…
Open API uses

 Mashups
 Integration
 Automation
 Command-line tools
 Most importantly, creating datasets!
What is data mining?

 From a large dataset find the:
   Implicit
   Unknown
   Useful
 Data could be:
   Tabular, e.g. Price lists
   Free text
   Pictures
Why it’s important now

  More devices produce more data
  People share more data
  The internet is vast
  Products are more customized
  Advertising is targeted
  Human cognition is limited
Traditional Applications

  Computational Biology
  Financial Markets
  Retail Markets
  Fraud Detection
  Surveillance
  Supply Chain Optimization
  National Security
Traditional = Inaccessible

  Real applications are esoteric
  Tutorial examples are trivial
  Generally lacking in “interest value”
Fun, Accessible Applications

  Home price modeling
  Where are the hottest people?
  Which bloggers are similar?
  Important attributes on eBay
  Predicting fashion trends
  Movie popularity
Zillow
The Zillow API

 Allows querying by address
 Returns information about the property
     Bedrooms
     Bathrooms
     Zip Code
     Price Estimate
     Last Sale Price
 Requires registration key
 http://www.zillow.com/howto/api/PropertyDetailsAPIOverview.htm
The Zillow API

REST Request

http://www.zillow.com/webservice/GetDeepSearchResults.htm?
zws-id=key&address=address&citystatezip=citystateszip
The Zillow API
<SearchResults:searchresults xmlns:SearchResults="http://www.
zillow.com/vstatic/3/static/xsd/SearchResults.xsd">
…
<response>
<results>
<result>
<zpid>48749425</zpid>
<links>
…
</links>
<address>
<street>2114 Bigelow Ave N</street>
<zipcode>98109</zipcode>
<city>SEATTLE</city>
<state>WA</state>
<latitude>47.637934</latitude> <longitude>-122.347936</longitude>
</address>
<yearBuilt>1924</yearBuilt>
<lotSizeSqFt>4680</lotSizeSqFt>
<finishedSqFt>3290</finishedSqFt>
<bathrooms>2.75</bathrooms>
<bedrooms>4</bedrooms>
<lastSoldDate>06/18/2002</lastSoldDate>
<lastSoldPrice currency="USD">770000</lastSoldPrice>
<valuation>
<amount currency="USD">1091061</amount>
</valuation>
</result>
</results>
</response>
Zillow from Python
import urllib2
import xml.dom.minidom

zwskey='YOUR-ZWS-ID'   # Zillow Web Service ID from registration

def getaddressdata(address,city):
  escad=address.replace(' ','+')

  # Construct the URL
  url='http://www.zillow.com/webservice/GetDeepSearchResults.htm?'
  url+='zws-id=%s&address=%s&citystatezip=%s' % (zwskey,escad,city)

  # Parse resulting XML
  doc=xml.dom.minidom.parseString(urllib2.urlopen(url).read())
  code=doc.getElementsByTagName('code')[0].firstChild.data

  # Code 0 means success, otherwise there was an error
  if code!='0': return None

  # Extract the info about this property
  try:
    zipcode=doc.getElementsByTagName('zipcode')[0].firstChild.data
    use=doc.getElementsByTagName('useCode')[0].firstChild.data
    year=doc.getElementsByTagName('yearBuilt')[0].firstChild.data
    bath=doc.getElementsByTagName('bathrooms')[0].firstChild.data
    bed=doc.getElementsByTagName('bedrooms')[0].firstChild.data
    rooms=doc.getElementsByTagName('totalRooms')[0].firstChild.data
    price=doc.getElementsByTagName('amount')[0].firstChild.data
  except:
    return None

  return (zipcode,use,int(year),float(bath),int(bed),int(rooms),price)
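
A quick usage sketch (not from the original slides; the address is just the one
from the sample XML, and zwskey must already hold a valid key):

row=getaddressdata('2114 Bigelow Ave N','Seattle, WA')
# row is (zipcode, use, year, bath, bed, rooms, price), or None if the lookup failed
if row is not None: print row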
A home price dataset

House    Zip     Bathrooms   Bedrooms   Built   Type      Price
A        02138   1.5         2          1847    Single    505296
B        02139   3.5         9          1916    Triplex   776378
C        02140   3.5         4          1894    Duplex    595027
D        02139   2.5         4          1854    Duplex    552213
E        02138   3.5         5          1909    Duplex    947528
F        02138   3.5         4          1930    Single    2107871
etc.
What can we learn?

 A made-up house’s price
 How important is Zip Code?
 What are the important attributes?

 Can we do better than averages?
Introducing Regression Trees
A     B        Value
10    Circle   20
11    Square   22
22    Square   8
18    Circle   6
Minimizing deviation
 Standard deviation is the “spread” of results
 Try all possible divisions
 Choose the division that decreases deviation the most

A        B          Value
10       Circle     20
11       Square     22
22       Square     8
18       Circle     6

Initially:
 Average = 14
 Standard Deviation = 8.2
Minimizing deviation
 Standard deviation is the “spread” of results
 Try all possible divisions
 Choose the division that decreases deviation the most

A        B          Value
10       Circle     20
11       Square     22
22       Square     8
18       Circle     6

B = Circle:
 Average = 13
 Standard Deviation = 9.9

B = Square:
 Average = 15
 Standard Deviation = 9.9
Minimizing deviation
 Standard deviation is the “spread” of results
 Try all possible divisions
 Choose the division that decreases deviation the most

A        B          Value
10       Circle     20
11       Square     22
22       Square     8
18       Circle     6

A > 18:
 Average = 8
 Standard Deviation = 0

A <= 18:
 Average = 16
 Standard Deviation = 8.7
Minimizing deviation
 Standard deviation is the “spread” of results
 Try all possible divisions
 Choose the division that decreases deviation the most

A        B          Value
10       Circle     20
11       Square     22
22       Square     8
18       Circle     6

A > 11:
 Average = 7
 Standard Deviation = 1.4

A <= 11:
 Average = 21
 Standard Deviation = 1.4
Python Code
# Variance of the last column (the value to be predicted) across a set of rows
def variance(rows):
  if len(rows)==0: return 0
  data=[float(row[len(row)-1]) for row in rows]
  mean=sum(data)/len(data)
  variance=sum([(d-mean)**2 for d in data])/len(data)
  return variance

def divideset(rows,column,value):
   # Make a function that tells us if a row is in
   # the first group (true) or the second group (false)
   split_function=None
   if isinstance(value,int) or isinstance(value,float):
      split_function=lambda row:row[column]>=value
   else:
      split_function=lambda row:row[column]==value

   # Divide the rows into two sets and return them
   set1=[row for row in rows if split_function(row)]
   set2=[row for row in rows if not split_function(row)]
   return (set1,set2)
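
A small worked example on the toy table above (a sketch; note that variance()
divides by n, so these numbers are population variances and differ slightly from
the sample standard deviations shown on the slides):

rows=[[10,'Circle',20],
      [11,'Square',22],
      [22,'Square',8],
      [18,'Circle',6]]

print variance(rows)             # 50.0 for the whole set
set1,set2=divideset(rows,0,18)   # split on column A at 18 (A>=18 vs A<18)
print variance(set1)             # 1.0 for {8, 6}
print variance(set2)             # 1.0 for {20, 22}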
CART Algorithm
A     B        Value
10    Circle   20
11    Square   22
22    Square   8
18    Circle   6
CART Algorithm

   A <= 11:                 A > 11:
   10   Circle   20         22   Square   8
   11   Square   22         18   Circle   6
CART Algorithm
Python Code
def buildtree(rows,scoref=variance):
  if len(rows)==0: return decisionnode()
  current_score=scoref(rows)
  # Set up some variables to track the best criteria
  best_gain=0.0
  best_criteria=None
  best_sets=None
  column_count=len(rows[0])-1
  for col in range(0,column_count):
    # Generate the list of different values in
    # this column
    column_values={}
    for row in rows:
       column_values[row[col]]=1
    # Now try dividing the rows up for each value
    # in this column
    for value in column_values.keys():
      (set1,set2)=divideset(rows,col,value)
      # Information gain
      p=float(len(set1))/len(rows)
      gain=current_score-p*scoref(set1)-(1-p)*scoref(set2)
      if gain>best_gain and len(set1)>0 and len(set2)>0:
        best_gain=gain
        best_criteria=(col,value)
        best_sets=(set1,set2)
  # Create the sub branches
  if best_gain>0:
    trueBranch=buildtree(best_sets[0])
    falseBranch=buildtree(best_sets[1])
    return decisionnode(col=best_criteria[0],value=best_criteria[1],tb=trueBranch,fb=falseBranch)
  else:
    return decisionnode(results=uniquecounts(rows))
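
buildtree refers to a decisionnode class and a uniquecounts helper that are not
shown on the slide; a minimal sketch of what they could look like (names and
defaults assumed):

class decisionnode:
  def __init__(self,col=-1,value=None,results=None,tb=None,fb=None):
    self.col=col          # column index tested at this node
    self.value=value      # value the column is compared against
    self.results=results  # for leaf nodes, a dict of value->count
    self.tb=tb            # branch followed when the test is true
    self.fb=fb            # branch followed when the test is false

def uniquecounts(rows):
  # Count how often each value in the last column appears
  results={}
  for row in rows:
    r=row[len(row)-1]
    results.setdefault(r,0)
    results[r]+=1
  return results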
Zillow Results

                         Bathrooms > 3
                        /             \
             Zip: 02139?               After 1903?
             /        \                /         \
    Zip: 02140?   Bedrooms > 4?     Duplex?    Triplex?
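
A sketch of tying the pieces together (the addresses.txt file and its
"street|city, state" line format are assumptions, not part of the talk):

# Hypothetical input file: one property per line, e.g.
#   2114 Bigelow Ave N|Seattle, WA
housedata=[]
for line in file('addresses.txt'):
  address,citystatezip=line.strip().split('|')
  row=getaddressdata(address,citystatezip)
  if row is not None: housedata.append(list(row))

# The last column of each row is the price estimate, so buildtree with the
# variance score grows a regression tree like the one pictured above.
tree=buildtree(housedata,scoref=variance)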
Just for Fun… Hot or Not
Supervised and Unsupervised

 Regression trees are supervised
   “answers” are in the dataset
   Tree models predict answers
 Some methods are unsupervised
   There are no answers
   Methods just characterize the data
   Show interesting patterns
Next challenge - Bloggers

  Millions of blogs online
  Usually focus on a subject area
  Can they be characterized automatically?
  … using only the words in the posts?
The Technorati Top 100
A single blog
Getting the content

  Use Mark Pilgrim’s Universal Feed Parser
  Retrieve the post titles and text
  Split up the words
  Count occurrence of each word
Python Code
import feedparser
import re
# Returns title and dictionary of word counts for an RSS feed
def getwordcounts(url):
  # Parse the feed
  d=feedparser.parse(url)
  wc={}
  # Loop over all the entries
  for e in d.entries:
    if 'summary' in e: summary=e.summary
    else: summary=e.description
    # Extract a list of words
    words=getwords(e.title+' '+summary)
    for word in words:
      wc.setdefault(word,0)
      wc[word]+=1
  return d.feed.title,wc

def getwords(html):
  # Remove all the HTML tags
  txt=re.compile(r'<[^>]+>').sub('',html)
  # Split words by all non-alpha characters
  words=re.compile(r'[^A-Z^a-z]+').split(txt)
  # Convert to lowercase
  return [word.lower() for word in words if word!='']
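
A quick sanity check of the counter above (the feed URL is only a placeholder):

# Hypothetical feed; prints the blog's title and how many distinct words it used
title,wc=getwordcounts('http://example.com/atom.xml')
print title,len(wc)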
Building a Word Matrix

  Build a matrix of word counts
  Blogs are rows, words are columns
  Eliminate words that are:
    Too common
    Too rare
Python Code
apcount={}
wordcounts={}
feedlist=[line for line in file('feedlist.txt')]
for feedurl in feedlist:
  title,wc=getwordcounts(feedurl)
  wordcounts[title]=wc
  for word,count in wc.items():
    apcount.setdefault(word,0)
    if count>1:
      apcount[word]+=1

wordlist=[]
for w,bc in apcount.items():
  frac=float(bc)/len(feedlist)
  if frac>0.1 and frac<0.5: wordlist.append(w)

out=file('blogdata.txt','w')
out.write('Blog')
for word in wordlist: out.write('\t%s' % word)
out.write('\n')
for blog,wc in wordcounts.items():
  out.write(blog)
  for word in wordlist:
    if word in wc: out.write('\t%d' % wc[word])
    else: out.write('\t0')
  out.write('\n')
The Word Matrix
                    “china”   “kids”   “music”   “yahoo”
Gothamist           0         3        3         0
GigaOM              6         0        1         2
Quick Online Tips   0         2        2         12
Determining distance
                    “china”   “kids”   “music”   “yahoo”
Gothamist           0         3        3         0
GigaOM              6         0        1         2
Quick Online Tips   0         2        2         12

Euclidean “as the crow flies”:

  √((6 − 0)² + (0 − 2)² + (1 − 2)² + (2 − 12)²) = 12 (approx)
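
The same calculation as a Python function (a sketch, not from the original
slides, written in the style of the other code here):

from math import sqrt

# Euclidean distance between two equal-length word-count vectors
def euclidean(v1,v2):
  return sqrt(sum([(v1[i]-v2[i])**2 for i in range(len(v1))]))

# e.g. euclidean([6,0,1,2],[0,2,2,12]) is about 11.9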
Other Distance Metrics

 Manhattan
 Tanimoto
 Pearson Correlation (sketched below)
 Chebyshev
 Spearman
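
The clustering code below defaults to distance=pearson, which is not defined on
the slides; a sketch of a Pearson-correlation distance (1 minus the correlation,
so strongly correlated blogs end up close together):

from math import sqrt

def pearson(v1,v2):
  n=float(len(v1))
  sum1,sum2=sum(v1),sum(v2)
  sum1Sq=sum([v*v for v in v1])
  sum2Sq=sum([v*v for v in v2])
  pSum=sum([v1[i]*v2[i] for i in range(len(v1))])
  # Pearson correlation coefficient, then converted to a distance
  num=pSum-(sum1*sum2/n)
  den=sqrt((sum1Sq-sum1**2/n)*(sum2Sq-sum2**2/n))
  if den==0: return 0
  return 1.0-num/den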
Hierarchical Clustering

  Find the two closest items
  Combine them into a single item
  Repeat…
Hierarchical Algorithm
Dendrogram
Python Code


class bicluster:
  def __init__(self,vec,left=None,right=None,distance=0.0,id=None):
    self.left=left
    self.right=right
    self.vec=vec
    self.id=id
    self.distance=distance
Python Code
 def hcluster(rows,distance=pearson):
   distances={}
   currentclustid=-1
   # Clusters are initially just the rows
   clust=[bicluster(rows[i],id=i) for i in range(len(rows))]
   while len(clust)>1:
     lowestpair=(0,1)
     closest=distance(clust[0].vec,clust[1].vec)
     # loop through every pair looking for the smallest distance
     for i in range(len(clust)):
       for j in range(i+1,len(clust)):
         # distances is the cache of distance calculations
         if (clust[i].id,clust[j].id) not in distances:
           distances[(clust[i].id,clust[j].id)]=distance(clust[i].vec,clust[j].vec)
         d=distances[(clust[i].id,clust[j].id)]
         if d<closest:
           closest=d
           lowestpair=(i,j)
     # calculate the average of the two clusters
     mergevec=[
     (clust[lowestpair[0]].vec[i]+clust[lowestpair[1]].vec[i])/2.0
     for i in range(len(clust[0].vec))]
     # create the new cluster
     newcluster=bicluster(mergevec,left=clust[lowestpair[0]],
                          right=clust[lowestpair[1]],
                          distance=closest,id=currentclustid)
     # cluster ids that weren’t in the original set are negative
     currentclustid-=1
     del clust[lowestpair[1]]
     del clust[lowestpair[0]]
     clust.append(newcluster)
   return clust[0]
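
A sketch of running it end to end on the blogdata.txt file written earlier (the
readfile helper is an assumption about how to parse that tab-separated matrix):

def readfile(filename):
  lines=[line for line in file(filename)]
  colnames=lines[0].strip().split('\t')[1:]    # the words
  rownames=[]
  data=[]
  for line in lines[1:]:
    p=line.strip().split('\t')
    rownames.append(p[0])                      # the blog title
    data.append([float(x) for x in p[1:]])     # that blog's word counts
  return rownames,colnames,data

rownames,colnames,data=readfile('blogdata.txt')
clust=hcluster(data)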
Hierarchical Blog Clusters
Rotating the Matrix

  Words in a blog -> blogs containing each word


            Gothamist     GigaOM        Quick Online Tips
china       0             6             0
kids        3             0             2
music       3             1             2
yahoo       0             2             12
Hierarchical Word Clusters
K-Means Clustering

 Divides data into distinct clusters
 User determines how many
 Algorithm
   Start with arbitrary centroids
   Assign points to centroids
   Move the centroids
   Repeat
K-Means Algorithm
Python Code
import random
def kcluster(rows,distance=pearson,k=4):
  # Determine the minimum and maximum values for each point
  ranges=[(min([row[i] for row in rows]),max([row[i] for row in rows]))
  for i in range(len(rows[0]))]
  # Create k randomly placed centroids
  clusters=[[random.random()*(ranges[i][1]-ranges[i][0])+ranges[i][0]
  for i in range(len(rows[0]))] for j in range(k)]

  lastmatches=None
  for t in range(100):
    print 'Iteration %d' % t
    bestmatches=[[] for i in range(k)]

    # Find which centroid is the closest for each row
    for j in range(len(rows)):
      row=rows[j]
      bestmatch=0
      for i in range(k):
        d=distance(clusters[i],row)
        if d<distance(clusters[bestmatch],row): bestmatch=i
      bestmatches[bestmatch].append(j)
    # If the results are the same as last time, this is complete
    if bestmatches==lastmatches: break
    lastmatches=bestmatches

    # Move the centroids to the average of their members
    for i in range(k):
      avgs=[0.0]*len(rows[0])
      if len(bestmatches[i])>0:
        for rowid in bestmatches[i]:
          for m in range(len(rows[rowid])):
            avgs[m]+=rows[rowid][m]
        for j in range(len(avgs)):
          avgs[j]/=len(bestmatches[i])
        clusters[i]=avgs

  return bestmatches
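
A usage sketch (assuming the blog matrix loaded into rownames and data as
before); the results on the next slide come from calls like these:

# Cluster the blogs into 4 groups; kcluster returns, for each centroid,
# the list of row indices assigned to it
k=kcluster(data,k=4)
print [rownames[r] for r in k[0]]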
K-Means Results

>> [rownames[r] for r in k[0]]
['The Viral Garden', 'Copyblogger', 'Creating Passionate Users',
 'Oilman', 'ProBlogger Blog Tips', "Seth's Blog"]

>> [rownames[r] for r in k[1]]
['Wonkette', 'Gawker', 'Gothamist', 'Huffington Post']
2D Visualizations

  Instead of Clusters, a 2D Map
  Goals
    Preserve distances as much as possible
    Draw in two dimensions
  Dimension Reduction
    Principal Components Analysis
    Multidimensional Scaling
Multidimensional Scaling
from math import sqrt
import random

def scaledown(data,distance=pearson,rate=0.01):
  n=len(data)
  # The real distances between every pair of items
  realdist=[[distance(data[i],data[j]) for j in range(n)]
              for i in range(0,n)]
  outersum=0.0

  # Randomly initialize the starting points of the locations in 2D
  loc=[[random.random(),random.random()] for i in range(n)]
  fakedist=[[0.0 for j in range(n)] for i in range(n)]

  lasterror=None
  for m in range(0,1000):
    # Find projected distances
    for i in range(n):
      for j in range(n):
        fakedist[i][j]=sqrt(sum([pow(loc[i][x]-loc[j][x],2)
                                 for x in range(len(loc[i]))]))

    # Move points
    grad=[[0.0,0.0] for i in range(n)]

    totalerror=0
    for k in range(n):
      for j in range(n):
        if j==k: continue
        # The error is percent difference between the distances
        errorterm=(fakedist[j][k]-realdist[j][k])/realdist[j][k]

        # Each point needs to be moved away from or towards the other
        # point in proportion to how much error it has
        grad[k][0]+=((loc[k][0]-loc[j][0])/fakedist[j][k])*errorterm
        grad[k][1]+=((loc[k][1]-loc[j][1])/fakedist[j][k])*errorterm
        # Keep track of the total error
        totalerror+=abs(errorterm)
    print totalerror
    # If the answer got worse by moving the points, we are done
    if lasterror and lasterror<totalerror: break
    lasterror=totalerror

    # Move each of the points by the learning rate times the gradient
    for k in range(n):
      loc[k][0]-=rate*grad[k][0]
      loc[k][1]-=rate*grad[k][1]
  return loc
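
A sketch of running the scaling on the blog matrix and dumping the coordinates
(a real version would draw them, e.g. with PIL or matplotlib, instead):

coords=scaledown(data)
for i in range(len(rownames)):
  # each blog gets an (x, y) position chosen to preserve the pairwise
  # Pearson distances as well as possible
  print rownames[i],coords[i]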
                 for j in range(n):
                   if j==k: continue
                   # The error is percent difference between the distances
                   errorterm=(fakedist[j][k]-realdist[j][k])/realdist[j][k]

                   # Each point needs to be moved away from or towards the other
                   # point in proportion to how much error it has
                   grad[k][0]+=((loc[k][0]-loc[j][0])/fakedist[j][k])*errorterm
                   grad[k][1]+=((loc[k][1]-loc[j][1])/fakedist[j][k])*errorterm
                   # Keep track of the total error
                   totalerror+=abs(errorterm)
               print totalerror
               # If the answer got worse by moving the points, we are done
               if lasterror and lasterror<totalerror: break
# Move each of the points by the learning rate times the gradient
               lasterror=totalerror
for k in range(n): of the points by the learning rate times the gradient
             # Move each
  loc[k][0]-=rate*grad[k][0]
             for k in range(n):
               loc[k][0]-=rate*grad[k][0]
  loc[k][1]-=rate*grad[k][1]
               loc[k][1]-=rate*grad[k][1]
             return loc
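
To see the layout this produces, here is a minimal usage sketch (not from the slides): it assumes the rows of the blog word matrix built earlier are in a list called data, the matching blog names in blognames, and that matplotlib is installed.

# Sketch only: plot the 2D coordinates returned by scaledown() and
# label each point with its blog name. data, blognames and matplotlib
# are assumptions, not part of the original slides.
import matplotlib.pyplot as plt

coords=scaledown(data)                # one [x, y] pair per blog
xs=[c[0] for c in coords]
ys=[c[1] for c in coords]

plt.scatter(xs,ys)
for name,(x,y) in zip(blognames,coords):
  plt.annotate(name,(x,y),fontsize=8)
plt.show()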
Numerical Predictions

 Back to “supervised” learning
 We have a set of numerical attributes
   Specs for a laptop
   Age and rating for wine
   Ratios for a stock
 Want to predict another attribute
   Formula/model is unknown
   e.g. price
Regression Trees?

 Regression trees find hard boundaries
 Can’t deal with complex formulae
Statistical regression

  Requires specification of a model
  Usually linear
  Doesn’t handle context
Alternative - Interpolation

  Find “similar” items
  Guess price based on similar items
  Need to determine:
    What is similar?
    How should we aggregate prices?
Price Data from eBay
The eBay API

 XML API
 Send XML over HTTPS
 Receive results in XML

 http://developer.ebay.com/quickstartguide.
Some Python Code
# Assumes: import httplib, and devKey/appKey/certKey/serverUrl set to
# your eBay developer credentials and the API endpoint
def getHeaders(apicall,siteID="0",compatabilityLevel="433"):
  headers = {"X-EBAY-API-COMPATIBILITY-LEVEL": compatabilityLevel,
             "X-EBAY-API-DEV-NAME": devKey,
             "X-EBAY-API-APP-NAME": appKey,
             "X-EBAY-API-CERT-NAME": certKey,
             "X-EBAY-API-CALL-NAME": apicall,
             "X-EBAY-API-SITEID": siteID,
             "Content-Type": "text/xml"}
  return headers


def sendRequest(apicall,xmlparameters):
  connection = httplib.HTTPSConnection(serverUrl)
  connection.request("POST", '/ws/api.dll', xmlparameters, getHeaders(apicall))
  response = connection.getresponse()
  if response.status != 200:
    print "Error sending request:" + response.reason
    data = None
  else:
    data = response.read()
    connection.close()
  return data
Some Python Code
def getItem(itemID):
  # Build the GetItem request body (userToken is your eBay auth token)
  xml = "<?xml version='1.0' encoding='utf-8'?>"+\
        "<GetItemRequest xmlns='urn:ebay:apis:eBLBaseComponents'>"+\
        "<RequesterCredentials><eBayAuthToken>" +\
        userToken +\
        "</eBayAuthToken></RequesterCredentials>" +\
        "<ItemID>" + str(itemID) + "</ItemID>"+\
        "<DetailLevel>ItemReturnAttributes</DetailLevel>"+\
        "</GetItemRequest>"
  data=sendRequest('GetItem',xml)
  result={}
  # parseString comes from xml.dom.minidom; getSingleValue is a small
  # helper (not shown) that returns the text of the named child tag
  response=parseString(data)
  result['title']=getSingleValue(response,'Title')
  sellingStatusNode = response.getElementsByTagName('SellingStatus')[0]
  result['price']=getSingleValue(sellingStatusNode,'CurrentPrice')
  result['bids']=getSingleValue(sellingStatusNode,'BidCount')
  seller = response.getElementsByTagName('Seller')
  result['feedback'] = getSingleValue(seller[0],'FeedbackScore')
  attributeSet=response.getElementsByTagName('Attribute')
  attributes={}
  for att in attributeSet:
    attID=att.attributes.getNamedItem('attributeID').nodeValue
    attValue=getSingleValue(att,'ValueLiteral')
    attributes[attID]=attValue
  result['attributes']=attributes
  return result
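
Before items can be compared, the dictionary getItem returns has to be flattened into a numeric row in the format the later kNN code expects (an 'input' vector plus a 'result' price). A rough sketch of that step follows; the attribute IDs used here are hypothetical placeholders, not real eBay attribute IDs.

# Sketch only: turn a getItem() result into {'input': ..., 'result': ...}.
# The attribute IDs ('10244', etc.) are made-up placeholders standing in
# for RAM, CPU, HDD, Screen and DVD; real listings use eBay's own IDs.
def itemrow(item,attrids=('10244','10245','10246','10247','10248')):
  atts=item['attributes']
  def num(aid):
    # Missing or non-numeric attributes simply become 0
    try: return float(atts.get(aid,0))
    except ValueError: return 0.0
  return {'input':[num(aid) for aid in attrids],
          'result':float(item['price'])}

# data=[itemrow(getItem(itemid)) for itemid in itemids]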
Building an item table

             RAM    CPU    HDD   Screen   DVD   Price
 D600        512    1400   40    14       1     $350
 Lenovo      160    300    5     13       0     $80
 T22         256    900    20    14       1     $200
 Pavilion    1024   1600   120   17       1     $800
 etc..
Distance between items

         RAM    CPU    HDD   Screen   DVD   Price
 New     512    1400   40    14       1     ???
 T22     256    900    20    14       1     $200

 Euclidean, just like in clustering:

   √((512 − 256)² + (1400 − 900)² + (40 − 20)² + (14 − 14)² + (1 − 1)²)
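
The kNN code below relies on a euclidean(vec1,vec2) helper that the slides don't show; a minimal sketch of it, matching the formula above:

from math import sqrt

# Sketch of the euclidean() helper used by getdistances() below:
# straight-line distance between two equal-length numeric vectors.
def euclidean(v1,v2):
  return sqrt(sum((a-b)**2 for a,b in zip(v1,v2)))

# euclidean((512,1400,40,14,1),(256,900,20,14,1)) -> about 562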
Idea 1 – use the closest item

  With the item whose price I want to
  guess:
    Calculate the distance for every item in
    my dataset
    Guess that the price is the same as the
    closest
  This is called kNN with k=1
Problems with “outliers”

  The closest item may be anomalous
  Why?
    Exceptional deal that won’t occur again
    Something missing from the dataset
    Data errors
Using an average

         RAM    CPU    HDD   Screen   DVD   Price
 New     512    1400   40    14       1     ???
 No. 1   512    1400   30    13       1     $360
 No. 2   512    1400   60    14       1     $400
 No. 3   1024   1600   120   15       0     $325

                                 k=3, estimate = $361
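
In code, the plain (unweighted) version of this estimate is only a few lines; here is a sketch that uses the getdistances() helper defined on the "Python code" slide below.

# Sketch of plain k-nearest-neighbor estimation: average the known
# prices of the k closest items.
def knnestimate(data,vec1,k=3):
  dlist=getdistances(data,vec1)   # sorted (distance, index) pairs
  avg=0.0
  for i in range(k):
    idx=dlist[i][1]
    avg+=data[idx]['result']
  return avg/k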
Using a weighted average

         RAM    CPU    HDD   Screen   DVD   Price   Weight
 New     512    1400   40    14       1     ???
 No. 1   512    1400   30    13       1     $360    3
 No. 2   512    1400   60    14       1     $400    2
 No. 3   1024   1600   120   15       0     $325    1

                                            Estimate = $367
Python code
def getdistances(data,vec1):
  distancelist=[]
  for i in range(len(data)):
     vec2=data[i]['input']
     distancelist.append((euclidean(vec1,vec2),i))
  distancelist.sort()
  return distancelist
def weightedknn(data,vec1,k=5,weightf=gaussian):
  # Get distances
  dlist=getdistances(data,vec1)
  avg=0.0
  totalweight=0.0

  # Get weighted average
  for i in range(k):
    dist=dlist[i][0]
    idx=dlist[i][1]
    weight=weightf(dist)
    avg+=weight*data[idx]['result']
    totalweight+=weight
  avg=avg/totalweight
  return avg
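
weightedknn() defaults its weightf parameter to gaussian, which the slides don't show; a minimal sketch of such a weighting function (sigma=10.0 is an arbitrary choice for illustration):

from math import e

# Sketch of a gaussian() weight function for weightedknn(): the weight
# falls off smoothly with distance and never quite reaches zero.
def gaussian(dist,sigma=10.0):
  return e**(-dist**2/(2*sigma**2))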
Too few – k too low
Too many – k too high
Determining the best k

  Divide the dataset up
    Training set
    Test set
  Guess the prices for the test set using
  the training set
  See how good the guesses are for
  different values of k
  Known as “cross-validation”
Determining the best k

 Full dataset:
   Attribute   Price
   10          20
   11          30
   8           10
   6           0

 Test set:
   Attribute   Price
   10          20

 Training set:
   Attribute   Price
   11          30
   8           10
   6           0

     For k = 1, guess = 30, error = 10
     For k = 2, guess = 20, error = 0
     For k = 3, guess = 13, error = 7

  Repeat with different test sets, average the error
Python code
   def dividedata(data,test=0.05):
     trainset=[]
     testset=[]
     for row in data:
       if random()<test:
         testset.append(row)
       else:
         trainset.append(row)
     return trainset,testset


   def testalgorithm(algf,trainset,testset):
     error=0.0
     for row in testset:
       guess=algf(trainset,row['input'])
       error+=(row['result']-guess)**2
     return error/len(testset)



   def crossvalidate(algf,data,trials=100,test=0.05):
     error=0.0
     for i in range(trials):
       trainset,testset=dividedata(data,test)
       error+=testalgorithm(algf,trainset,testset)
     return error/trials
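
Putting it together, k (or any other choice, such as plain vs. weighted kNN) can be picked by comparing cross-validation errors. A small sketch, assuming the code above has "from random import random" available and the item rows are in a list called data:

# Sketch: compare cross-validation error for a few values of k.
# crossvalidate() expects algf(trainset, vec) -> guess, so wrap
# weightedknn with the k being tested.
def knn_with_k(k):
  def algf(trainset,vec):
    return weightedknn(trainset,vec,k=k)
  return algf

for k in (1,3,5,7):
  err=crossvalidate(knn_with_k(k),data,trials=100)
  print 'k=%d error=%f' % (k,err)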
Problems with scale
Scaling the data
Scaling to zero
Determining the best scale

  Try different weights
  Use the “cross-validation” method
  Different ways of choosing a scale:
    Range-scaling
    Intuitive guessing
    Optimization
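
The slides treat the per-column scale factors as just one more thing to evaluate with cross-validation. A sketch of that idea (rescale() and the example weights here are illustrative, not from the slides):

# Sketch: multiply each input column by a weight, then reuse
# crossvalidate() to judge how good that weighting is.
def rescale(data,scale):
  scaleddata=[]
  for row in data:
    scaled=[v*s for v,s in zip(row['input'],scale)]
    scaleddata.append({'input':scaled,'result':row['result']})
  return scaleddata

def scalecost(scale,data,algf=weightedknn):
  return crossvalidate(algf,rescale(data,scale),trials=20)

# e.g. damp screen size and ignore the DVD flag entirely:
# scalecost([1.0,1.0,1.0,0.5,0.0],data)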
Methods covered

 Regression trees
 Hierarchical clustering
 k-means clustering
 Multidimensional scaling
 Weighted k-nearest neighbors
New projects

 Openads
   An open-source ad server
   Users can share impression/click data
   Matrix of what hits based on
     Page Text
     Ad
     Ad placement
     Search query
   Can we improve targeting?
New Projects

 Finance
   Analysts already drowning in info
   Stories sometimes broken on blogs
   Message boards show sentiment

   Extremely low signal-to-noise ratio
New Projects

 Entertainment
   How much buzz is a movie generating?
   Which psychographic profiles like this type of movie?

   Of interest to studios and media investors

Weitere ähnliche Inhalte

Was ist angesagt?

Be lazy, be ESI: HTTP caching and Symfony2 @ PHPDay 2011 05-13-2011
 Be lazy, be ESI: HTTP caching and Symfony2 @ PHPDay 2011 05-13-2011 Be lazy, be ESI: HTTP caching and Symfony2 @ PHPDay 2011 05-13-2011
Be lazy, be ESI: HTTP caching and Symfony2 @ PHPDay 2011 05-13-2011Alessandro Nadalin
 
jQuery Data Manipulate API - A source code dissecting journey
jQuery Data Manipulate API - A source code dissecting journeyjQuery Data Manipulate API - A source code dissecting journey
jQuery Data Manipulate API - A source code dissecting journeyHuiyi Yan
 
Passwords suck, but centralized proprietary services are not the answer
Passwords suck, but centralized proprietary services are not the answerPasswords suck, but centralized proprietary services are not the answer
Passwords suck, but centralized proprietary services are not the answerFrancois Marier
 
Coffeescript a z
Coffeescript a zCoffeescript a z
Coffeescript a zStarbuildr
 
jQuery from the very beginning
jQuery from the very beginningjQuery from the very beginning
jQuery from the very beginningAnis Ahmad
 
Write Less Do More
Write Less Do MoreWrite Less Do More
Write Less Do MoreRemy Sharp
 
SenchaCon 2016: Want to Use Ext JS Components with Angular 2? Here’s How to I...
SenchaCon 2016: Want to Use Ext JS Components with Angular 2? Here’s How to I...SenchaCon 2016: Want to Use Ext JS Components with Angular 2? Here’s How to I...
SenchaCon 2016: Want to Use Ext JS Components with Angular 2? Here’s How to I...Sencha
 

Was ist angesagt? (10)

Be lazy, be ESI: HTTP caching and Symfony2 @ PHPDay 2011 05-13-2011
 Be lazy, be ESI: HTTP caching and Symfony2 @ PHPDay 2011 05-13-2011 Be lazy, be ESI: HTTP caching and Symfony2 @ PHPDay 2011 05-13-2011
Be lazy, be ESI: HTTP caching and Symfony2 @ PHPDay 2011 05-13-2011
 
jQuery in 15 minutes
jQuery in 15 minutesjQuery in 15 minutes
jQuery in 15 minutes
 
jQuery
jQueryjQuery
jQuery
 
jQuery Data Manipulate API - A source code dissecting journey
jQuery Data Manipulate API - A source code dissecting journeyjQuery Data Manipulate API - A source code dissecting journey
jQuery Data Manipulate API - A source code dissecting journey
 
Passwords suck, but centralized proprietary services are not the answer
Passwords suck, but centralized proprietary services are not the answerPasswords suck, but centralized proprietary services are not the answer
Passwords suck, but centralized proprietary services are not the answer
 
Coffeescript a z
Coffeescript a zCoffeescript a z
Coffeescript a z
 
jQuery from the very beginning
jQuery from the very beginningjQuery from the very beginning
jQuery from the very beginning
 
Write Less Do More
Write Less Do MoreWrite Less Do More
Write Less Do More
 
JQuery introduction
JQuery introductionJQuery introduction
JQuery introduction
 
SenchaCon 2016: Want to Use Ext JS Components with Angular 2? Here’s How to I...
SenchaCon 2016: Want to Use Ext JS Components with Angular 2? Here’s How to I...SenchaCon 2016: Want to Use Ext JS Components with Angular 2? Here’s How to I...
SenchaCon 2016: Want to Use Ext JS Components with Angular 2? Here’s How to I...
 

Andere mochten auch

Os Vanlindberg
Os VanlindbergOs Vanlindberg
Os Vanlindbergoscon2007
 
Javascriptbootcamp
JavascriptbootcampJavascriptbootcamp
Javascriptbootcamposcon2007
 
Cyber Recomendaciones
Cyber RecomendacionesCyber Recomendaciones
Cyber Recomendacionesjosemorales
 
EL MUNDO DE LOS COPLEROS II
EL MUNDO DE LOS COPLEROS IIEL MUNDO DE LOS COPLEROS II
EL MUNDO DE LOS COPLEROS IIliliC
 
Beneficiosdelsexo
BeneficiosdelsexoBeneficiosdelsexo
Beneficiosdelsexojosemorales
 

Andere mochten auch (6)

Os Vanlindberg
Os VanlindbergOs Vanlindberg
Os Vanlindberg
 
Javascriptbootcamp
JavascriptbootcampJavascriptbootcamp
Javascriptbootcamp
 
Cyber Recomendaciones
Cyber RecomendacionesCyber Recomendaciones
Cyber Recomendaciones
 
EL MUNDO DE LOS COPLEROS II
EL MUNDO DE LOS COPLEROS IIEL MUNDO DE LOS COPLEROS II
EL MUNDO DE LOS COPLEROS II
 
態度
態度態度
態度
 
Beneficiosdelsexo
BeneficiosdelsexoBeneficiosdelsexo
Beneficiosdelsexo
 

Ähnlich wie Data Mining Open Ap Is

Native Phone Development 101
Native Phone Development 101Native Phone Development 101
Native Phone Development 101Sasmito Adibowo
 
Beyond php it's not (just) about the code
Beyond php   it's not (just) about the codeBeyond php   it's not (just) about the code
Beyond php it's not (just) about the codeWim Godden
 
Beyond PHP - It's not (just) about the code
Beyond PHP - It's not (just) about the codeBeyond PHP - It's not (just) about the code
Beyond PHP - It's not (just) about the codeWim Godden
 
Jquery optimization-tips
Jquery optimization-tipsJquery optimization-tips
Jquery optimization-tipsanubavam-techkt
 
PHPUnit Episode iv.iii: Return of the tests
PHPUnit Episode iv.iii: Return of the testsPHPUnit Episode iv.iii: Return of the tests
PHPUnit Episode iv.iii: Return of the testsMichelangelo van Dam
 
Ruby on Rails For Java Programmers
Ruby on Rails For Java ProgrammersRuby on Rails For Java Programmers
Ruby on Rails For Java Programmerselliando dias
 
Beyond php - it's not (just) about the code
Beyond php - it's not (just) about the codeBeyond php - it's not (just) about the code
Beyond php - it's not (just) about the codeWim Godden
 
Machine Learning for Web Developers
Machine Learning for Web DevelopersMachine Learning for Web Developers
Machine Learning for Web DevelopersRiza Fahmi
 
Cross Domain Web
Mashups with JQuery and Google App Engine
Cross Domain Web
Mashups with JQuery and Google App EngineCross Domain Web
Mashups with JQuery and Google App Engine
Cross Domain Web
Mashups with JQuery and Google App EngineAndy McKay
 
Constance et qualité du code dans une équipe - Rémi Prévost
Constance et qualité du code dans une équipe - Rémi PrévostConstance et qualité du code dans une équipe - Rémi Prévost
Constance et qualité du code dans une équipe - Rémi PrévostWeb à Québec
 
From Knowledge Graphs to AI-powered SEO: Using taxonomies, schemas and knowle...
From Knowledge Graphs to AI-powered SEO: Using taxonomies, schemas and knowle...From Knowledge Graphs to AI-powered SEO: Using taxonomies, schemas and knowle...
From Knowledge Graphs to AI-powered SEO: Using taxonomies, schemas and knowle...Connected Data World
 
Spiffy Applications With JavaScript
Spiffy Applications With JavaScriptSpiffy Applications With JavaScript
Spiffy Applications With JavaScriptMark Casias
 
Micro app-framework - NodeLive Boston
Micro app-framework - NodeLive BostonMicro app-framework - NodeLive Boston
Micro app-framework - NodeLive BostonMichael Dawson
 
Scaling business app development with Play and Scala
Scaling business app development with Play and ScalaScaling business app development with Play and Scala
Scaling business app development with Play and ScalaPeter Hilton
 
Beeline Firebase talk - Firebase event Jun 2017
Beeline Firebase talk - Firebase event Jun 2017Beeline Firebase talk - Firebase event Jun 2017
Beeline Firebase talk - Firebase event Jun 2017Chetan Padia
 
Crafting Evolvable Api Responses
Crafting Evolvable Api ResponsesCrafting Evolvable Api Responses
Crafting Evolvable Api Responsesdarrelmiller71
 
How to Leverage APIs for SEO #TTTLive2019
How to Leverage APIs for SEO #TTTLive2019How to Leverage APIs for SEO #TTTLive2019
How to Leverage APIs for SEO #TTTLive2019Paul Shapiro
 

Ähnlich wie Data Mining Open Ap Is (20)

Native Phone Development 101
Native Phone Development 101Native Phone Development 101
Native Phone Development 101
 
Beyond php it's not (just) about the code
Beyond php   it's not (just) about the codeBeyond php   it's not (just) about the code
Beyond php it's not (just) about the code
 
Beyond PHP - It's not (just) about the code
Beyond PHP - It's not (just) about the codeBeyond PHP - It's not (just) about the code
Beyond PHP - It's not (just) about the code
 
Playing With The Web
Playing With The WebPlaying With The Web
Playing With The Web
 
Jquery optimization-tips
Jquery optimization-tipsJquery optimization-tips
Jquery optimization-tips
 
PHPUnit Episode iv.iii: Return of the tests
PHPUnit Episode iv.iii: Return of the testsPHPUnit Episode iv.iii: Return of the tests
PHPUnit Episode iv.iii: Return of the tests
 
Ruby on Rails For Java Programmers
Ruby on Rails For Java ProgrammersRuby on Rails For Java Programmers
Ruby on Rails For Java Programmers
 
Beyond php - it's not (just) about the code
Beyond php - it's not (just) about the codeBeyond php - it's not (just) about the code
Beyond php - it's not (just) about the code
 
Machine Learning for Web Developers
Machine Learning for Web DevelopersMachine Learning for Web Developers
Machine Learning for Web Developers
 
Cross Domain Web
Mashups with JQuery and Google App Engine
Cross Domain Web
Mashups with JQuery and Google App EngineCross Domain Web
Mashups with JQuery and Google App Engine
Cross Domain Web
Mashups with JQuery and Google App Engine
 
Constance et qualité du code dans une équipe - Rémi Prévost
Constance et qualité du code dans une équipe - Rémi PrévostConstance et qualité du code dans une équipe - Rémi Prévost
Constance et qualité du code dans une équipe - Rémi Prévost
 
Api
ApiApi
Api
 
From Knowledge Graphs to AI-powered SEO: Using taxonomies, schemas and knowle...
From Knowledge Graphs to AI-powered SEO: Using taxonomies, schemas and knowle...From Knowledge Graphs to AI-powered SEO: Using taxonomies, schemas and knowle...
From Knowledge Graphs to AI-powered SEO: Using taxonomies, schemas and knowle...
 
Spiffy Applications With JavaScript
Spiffy Applications With JavaScriptSpiffy Applications With JavaScript
Spiffy Applications With JavaScript
 
Micro app-framework - NodeLive Boston
Micro app-framework - NodeLive BostonMicro app-framework - NodeLive Boston
Micro app-framework - NodeLive Boston
 
Micro app-framework
Micro app-frameworkMicro app-framework
Micro app-framework
 
Scaling business app development with Play and Scala
Scaling business app development with Play and ScalaScaling business app development with Play and Scala
Scaling business app development with Play and Scala
 
Beeline Firebase talk - Firebase event Jun 2017
Beeline Firebase talk - Firebase event Jun 2017Beeline Firebase talk - Firebase event Jun 2017
Beeline Firebase talk - Firebase event Jun 2017
 
Crafting Evolvable Api Responses
Crafting Evolvable Api ResponsesCrafting Evolvable Api Responses
Crafting Evolvable Api Responses
 
How to Leverage APIs for SEO #TTTLive2019
How to Leverage APIs for SEO #TTTLive2019How to Leverage APIs for SEO #TTTLive2019
How to Leverage APIs for SEO #TTTLive2019
 

Mehr von oscon2007

J Ruby Whirlwind Tour
J Ruby Whirlwind TourJ Ruby Whirlwind Tour
J Ruby Whirlwind Touroscon2007
 
Solr Presentation5
Solr Presentation5Solr Presentation5
Solr Presentation5oscon2007
 
Os Fitzpatrick Sussman Wiifm
Os Fitzpatrick Sussman WiifmOs Fitzpatrick Sussman Wiifm
Os Fitzpatrick Sussman Wiifmoscon2007
 
Performance Whack A Mole
Performance Whack A MolePerformance Whack A Mole
Performance Whack A Moleoscon2007
 
Os Lanphier Brashears
Os Lanphier BrashearsOs Lanphier Brashears
Os Lanphier Brashearsoscon2007
 
Os Fitzpatrick Sussman Swp
Os Fitzpatrick Sussman SwpOs Fitzpatrick Sussman Swp
Os Fitzpatrick Sussman Swposcon2007
 
Os Berlin Dispelling Myths
Os Berlin Dispelling MythsOs Berlin Dispelling Myths
Os Berlin Dispelling Mythsoscon2007
 
Os Keysholistic
Os KeysholisticOs Keysholistic
Os Keysholisticoscon2007
 
Os Jonphillips
Os JonphillipsOs Jonphillips
Os Jonphillipsoscon2007
 
Os Urnerupdated
Os UrnerupdatedOs Urnerupdated
Os Urnerupdatedoscon2007
 

Mehr von oscon2007 (20)

J Ruby Whirlwind Tour
J Ruby Whirlwind TourJ Ruby Whirlwind Tour
J Ruby Whirlwind Tour
 
Solr Presentation5
Solr Presentation5Solr Presentation5
Solr Presentation5
 
Os Borger
Os BorgerOs Borger
Os Borger
 
Os Harkins
Os HarkinsOs Harkins
Os Harkins
 
Os Fitzpatrick Sussman Wiifm
Os Fitzpatrick Sussman WiifmOs Fitzpatrick Sussman Wiifm
Os Fitzpatrick Sussman Wiifm
 
Os Bunce
Os BunceOs Bunce
Os Bunce
 
Yuicss R7
Yuicss R7Yuicss R7
Yuicss R7
 
Performance Whack A Mole
Performance Whack A MolePerformance Whack A Mole
Performance Whack A Mole
 
Os Fogel
Os FogelOs Fogel
Os Fogel
 
Os Lanphier Brashears
Os Lanphier BrashearsOs Lanphier Brashears
Os Lanphier Brashears
 
Os Tucker
Os TuckerOs Tucker
Os Tucker
 
Os Fitzpatrick Sussman Swp
Os Fitzpatrick Sussman SwpOs Fitzpatrick Sussman Swp
Os Fitzpatrick Sussman Swp
 
Os Furlong
Os FurlongOs Furlong
Os Furlong
 
Os Berlin Dispelling Myths
Os Berlin Dispelling MythsOs Berlin Dispelling Myths
Os Berlin Dispelling Myths
 
Os Kimsal
Os KimsalOs Kimsal
Os Kimsal
 
Os Pruett
Os PruettOs Pruett
Os Pruett
 
Os Alrubaie
Os AlrubaieOs Alrubaie
Os Alrubaie
 
Os Keysholistic
Os KeysholisticOs Keysholistic
Os Keysholistic
 
Os Jonphillips
Os JonphillipsOs Jonphillips
Os Jonphillips
 
Os Urnerupdated
Os UrnerupdatedOs Urnerupdated
Os Urnerupdated
 

Kürzlich hochgeladen

A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 

Kürzlich hochgeladen (20)

A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 

Data Mining Open Ap Is

  • 1. Data Mining and Open APIs Toby Segaran
  • 2. About Me Software Developer at Genstruct Work directly with scientists Design algorithms to aid in drug testing “Programming Collective Intelligence” Published by O’Reilly Due out in August Consult with open-source projects and other companies http://kiwitobes.com
  • 3. Presentation Goals Look at some Open APIs Get some data Visualize algorithms for data-mining Work through some Python code Variety of techniques and sources Advocacy (why you should care)
  • 4. Open data APIs Zillow Yahoo Answers eBay Amazon Facebook Technorati del.icio.us Twitter HotOrNot Google News Upcoming programmableweb.com/apis for more…
  • 5. Open API uses Mashups Integration Automation Command-line tools Most importantly, creating datasets!
  • 6. What is data mining? From a large dataset find the: Implicit Unknown Useful Data could be: Tabular, e.g. Price lists Free text Pictures
  • 7. Why it’s important now More devices produce more data People share more data The internet is vast Products are more customized Advertising is targeted Human cognition is limited
  • 8. Traditional Applications Computational Biology Financial Markets Retail Markets Fraud Detection Surveillance Supply Chain Optimization National Security
  • 9. Traditional = Inaccessible Real applications are esoteric Tutorial examples are trivial Generally lacking in “interest value”
  • 10. Fun, Accessible Applications Home price modeling Where are the hottest people? Which bloggers are similar? Important attributes on eBay Predicting fashion trends Movie popularity
  • 12. The Zillow API Allows querying by address Returns information about the property Bedrooms Bathrooms Zip Code Price Estimate Last Sale Price Requires registration key http://www.zillow.com/howto/api/PropertyDetailsAPIOverview.htm
  • 13. The Zillow API REST Request http://www.zillow.com/webservice/GetDeepSearchResults.htm? zws-id=key&address=address&citystatezip=citystateszip
  • 14. The Zillow API <SearchResults:searchresults xmlns:SearchResults=quot;http://www. zillow.com/vstatic/3/static/xsd/SearchResults.xsdquot;> … <response> <results> <result> <zpid>48749425</zpid> <links> … </links> <address> <street>2114 Bigelow Ave N</street> <zipcode>98109</zipcode> <city>SEATTLE</city> <state>WA</state> <latitude>47.637934</latitude> <longitude>-122.347936</longitude> </address> <yearBuilt>1924</yearBuilt> <lotSizeSqFt>4680</lotSizeSqFt> <finishedSqFt>3290</finishedSqFt> <bathrooms>2.75</bathrooms> <bedrooms>4</bedrooms> <lastSoldDate>06/18/2002</lastSoldDate> <lastSoldPrice currency=quot;USDquot;>770000</lastSoldPrice> <valuation> <amount currency=quot;USDquot;>1091061</amount> </result> </results> </response>
  • 15. The Zillow API <SearchResults:searchresults xmlns:SearchResults=quot;http://www. zillow.com/vstatic/3/static/xsd/SearchResults.xsdquot;> … <zipcode>98109</zipcode> <response> <results> <city>SEATTLE</city> <result> <state>WA</state> <zpid>48749425</zpid> <links> <latitude>47.637934</latitude> … <longitude>-122.347936</longitude> </links> <address> </address>Bigelow Ave N</street> <street>2114 <yearBuilt>1924</yearBuilt> <zipcode>98109</zipcode> <city>SEATTLE</city> <lotSizeSqFt>4680</lotSizeSqFt> <state>WA</state> <finishedSqFt>3290</finishedSqFt> <latitude>47.637934</latitude> <longitude>-122.347936</longitude> </address> <bathrooms>2.75</bathrooms> <yearBuilt>1924</yearBuilt> <lotSizeSqFt>4680</lotSizeSqFt> <bedrooms>4</bedrooms> <finishedSqFt>3290</finishedSqFt> <lastSoldDate>06/18/2002</lastSoldDate> <bathrooms>2.75</bathrooms> <bedrooms>4</bedrooms> <lastSoldPrice currency=quot;USDquot;>770000</lastSoldPrice> <lastSoldDate>06/18/2002</lastSoldDate> <valuation> currency=quot;USDquot;>770000</lastSoldPrice> <lastSoldPrice <valuation> <amountcurrency=quot;USDquot;>1091061</amount> currency=quot;USDquot;>1091061</amount> <amount </result> </results> </response>
  • 16. Zillow from Python def getaddressdata(address,city): escad=address.replace(' ','+') # Construct the URL url='http://www.zillow.com/webservice/GetDeepSearchResults.htm?' url+='zws-id=%s&address=%s&citystatezip=%s' % (zwskey,escad,city) # Parse resulting XML doc=xml.dom.minidom.parseString(urllib2.urlopen(url).read()) code=doc.getElementsByTagName('code')[0].firstChild.data # Code 0 means success, otherwise there was an error if code!='0': return None # Extract the info about this property try: zipcode=doc.getElementsByTagName('zipcode')[0].firstChild.data use=doc.getElementsByTagName('useCode')[0].firstChild.data year=doc.getElementsByTagName('yearBuilt')[0].firstChild.data bath=doc.getElementsByTagName('bathrooms')[0].firstChild.data bed=doc.getElementsByTagName('bedrooms')[0].firstChild.data rooms=doc.getElementsByTagName('totalRooms')[0].firstChild.data price=doc.getElementsByTagName('amount')[0].firstChild.data except: return None return (zipcode,use,int(year),float(bath),int(bed),int(rooms),price)
  • 17. Zillow from Python def getaddressdata(address,city): escad=address.replace(' ','+') # Construct the URL # Construct the URL url='http://www.zillow.com/webservice/GetDeepSearchResults.htm?' url='http://www.zillow.com/webservice/GetDeepSearchResults.htm?' url+='zws-id=%s&address=%s&citystatezip=%s' % (zwskey,escad,city) url+='zws-id=%s&address=%s&citystatezip=%s' % (zwskey,escad,city) # Parse resulting XML doc=xml.dom.minidom.parseString(urllib2.urlopen(url).read()) code=doc.getElementsByTagName('code')[0].firstChild.data # Code 0 means success, otherwise there was an error if code!='0': return None # Extract the info about this property try: zipcode=doc.getElementsByTagName('zipcode')[0].firstChild.data use=doc.getElementsByTagName('useCode')[0].firstChild.data year=doc.getElementsByTagName('yearBuilt')[0].firstChild.data bath=doc.getElementsByTagName('bathrooms')[0].firstChild.data bed=doc.getElementsByTagName('bedrooms')[0].firstChild.data rooms=doc.getElementsByTagName('totalRooms')[0].firstChild.data price=doc.getElementsByTagName('amount')[0].firstChild.data except: return None return (zipcode,use,int(year),float(bath),int(bed),int(rooms),price)
  • 18. Zillow from Python def getaddressdata(address,city): escad=address.replace(' ','+') # Construct the URL url='http://www.zillow.com/webservice/GetDeepSearchResults.htm?' url+='zws-id=%s&address=%s&citystatezip=%s' % (zwskey,escad,city) # Parse resulting XML # Parse resulting XML doc=xml.dom.minidom.parseString(urllib2.urlopen(url).read()) doc=xml.dom.minidom.parseString(urllib2.urlopen(url).read()) code=doc.getElementsByTagName('code')[0].firstChild.data code=doc.getElementsByTagName('code')[0].firstChild.data # Code 0 means success, otherwise there was an error if code!='0': return None # Extract the info about this property try: zipcode=doc.getElementsByTagName('zipcode')[0].firstChild.data use=doc.getElementsByTagName('useCode')[0].firstChild.data year=doc.getElementsByTagName('yearBuilt')[0].firstChild.data bath=doc.getElementsByTagName('bathrooms')[0].firstChild.data bed=doc.getElementsByTagName('bedrooms')[0].firstChild.data rooms=doc.getElementsByTagName('totalRooms')[0].firstChild.data price=doc.getElementsByTagName('amount')[0].firstChild.data except: return None return (zipcode,use,int(year),float(bath),int(bed),int(rooms),price)
  • 19. Zillow from Python def getaddressdata(address,city): escad=address.replace(' ','+') # Construct the URL url='http://www.zillow.com/webservice/GetDeepSearchResults.htm?' url+='zws-id=%s&address=%s&citystatezip=%s' % (zwskey,escad,city) # Parse resulting XML doc=xml.dom.minidom.parseString(urllib2.urlopen(url).read()) code=doc.getElementsByTagName('code')[0].firstChild.data # Code 0 means success, otherwise there was an error if code!='0': return None zipcode=doc.getElementsByTagName('zipcode')[0].firstChild.data # Extract the info about this property try: use=doc.getElementsByTagName('useCode')[0].firstChild.data zipcode=doc.getElementsByTagName('zipcode')[0].firstChild.data year=doc.getElementsByTagName('yearBuilt')[0].firstChild.data use=doc.getElementsByTagName('useCode')[0].firstChild.data bath=doc.getElementsByTagName('bathrooms')[0].firstChild.data year=doc.getElementsByTagName('yearBuilt')[0].firstChild.data bath=doc.getElementsByTagName('bathrooms')[0].firstChild.data bed=doc.getElementsByTagName('bedrooms')[0].firstChild.data bed=doc.getElementsByTagName('bedrooms')[0].firstChild.data rooms=doc.getElementsByTagName('totalRooms')[0].firstChild.data rooms=doc.getElementsByTagName('totalRooms')[0].firstChild.data price=doc.getElementsByTagName('amount')[0].firstChild.data price=doc.getElementsByTagName('amount')[0].firstChild.data except: return None return (zipcode,use,int(year),float(bath),int(bed),int(rooms),price)
  • 20. A home price dataset House Zip Bathrooms Bedrooms Built Type Price A 02138 1.5 2 1847 Single 505296 B 02139 3.5 9 1916 Triplex 776378 C 02140 3.5 4 1894 Duplex 595027 D 02139 2.5 4 1854 Duplex 552213 E 02138 3.5 5 1909 Duplex 947528 F 02138 3.5 4 1930 Single 2107871 etc..
  • 21. What can we learn? A made-up houses price How important is Zip Code? What are the important attributes? Can we do better than averages?
  • 22. Introducing Regression Trees A B Value 10 Circle 20 11 Square 22 22 Square 8 18 Circle 6
  • 23. Introducing Regression Trees A B Value 10 Circle 20 11 Square 22 22 Square 8 18 Circle 6
  • 24. Minimizing deviation Standard deviation is the “spread” of results Try all possible divisions Choose the division that decreases deviation the most Initially A B Value Average = 14 10 Circle 20 Standard Deviation = 8.2 11 Square 22 22 Square 8 18 Circle 6
  • 25. Minimizing deviation Standard deviation is the “spread” of results Try all possible divisions Choose the division that decreases deviation the most B = Circle A B Value Average = 13 10 Circle 20 Standard Deviation = 9.9 11 Square 22 22 Square 8 B = Square 18 Circle 6 Average = 15 Standard Deviation = 9.9
  • 26. Minimizing deviation Standard deviation is the “spread” of results Try all possible divisions Choose the division that decreases deviation the most A > 18 A B Value Average = 8 10 Circle 20 Standard Deviation = 0 11 Square 22 22 Square 8 A <= 20 18 Circle 6 Average = 16 Standard Deviation = 8.7
  • 27. Minimizing deviation Standard deviation is the “spread” of results Try all possible divisions Choose the division that decreases deviation the most A > 11 A B Value Average = 7 10 Circle 20 Standard Deviation = 1.4 11 Square 22 22 Square 8 A <= 11 18 Circle 6 Average = 21 Standard Deviation = 1.4
  • 28. Python Code def variance(rows): if len(rows)==0: return 0 data=[float(row[len(row)-1]) for row in rows] mean=sum(data)/len(data) variance=sum([(d-mean)**2 for d in data])/len(data) return variance def divideset(rows,column,value): # Make a function that tells us if a row is in # the first group (true) or the second group (false) split_function=None if isinstance(value,int) or isinstance(value,float): split_function=lambda row:row[column]>=value else: split_function=lambda row:row[column]==value # Divide the rows into two sets and return them set1=[row for row in rows if split_function(row)] set2=[row for row in rows if not split_function(row)] return (set1,set2)
  • 29. Python Code def variance(rows): def variance(rows): if len(rows)==0: return 0 if len(rows)==0: return for row in rows] data=[float(row[len(row)-1]) 0 data=[float(row[len(row)-1]) for row in rows] mean=sum(data)/len(data) mean=sum(data)/len(data)d in data])/len(data) variance=sum([(d-mean)**2 for return variance variance=sum([(d-mean)**2 for d in data])/len(data) return variance def divideset(rows,column,value): # Make a function that tells us if a row is in # the first group (true) or the second group (false) split_function=None if isinstance(value,int) or isinstance(value,float): split_function=lambda row:row[column]>=value else: split_function=lambda row:row[column]==value # Divide the rows into two sets and return them set1=[row for row in rows if split_function(row)] set2=[row for row in rows if not split_function(row)] return (set1,set2)
  • 30. Python Code def variance(rows): if len(rows)==0: return 0 data=[float(row[len(row)-1]) for row in rows] mean=sum(data)/len(data) variance=sum([(d-mean)**2 for d in data])/len(data) return variance # def divideset(rows,column,value): us if a row is in Make a function that tells # the Make a function (true) or the asecond in # first group that tells us if row is group (false) # the first group (true) or the second group (false) split_function=None split_function=None if isinstance(value,int) or isinstance(value,float): if isinstance(value,int) or isinstance(value,float): split_function=lambda row:row[column]>=value split_function=lambda row:row[column]>=value else: else: split_function=lambda row:row[column]==value split_function=lambda row:row[column]==value # Divide the rows into two sets and return them set1=[row for row in rows if split_function(row)] set2=[row for row in rows if not split_function(row)] return (set1,set2)
  • 31. Python Code def variance(rows): if len(rows)==0: return 0 data=[float(row[len(row)-1]) for row in rows] mean=sum(data)/len(data) variance=sum([(d-mean)**2 for d in data])/len(data) return variance def divideset(rows,column,value): # Make a function that tells us if a row is in # the first group (true) or the second group (false) split_function=None if isinstance(value,int) or isinstance(value,float): split_function=lambda row:row[column]>=value else: split_function=lambda row:row[column]==value # Divide the rows into two sets and returnreturn them # Divide the rows into two sets and them set1=[row for row in rows if split_function(row)] set1=[row for row in rows if not split_function(row)] in rows if split_function(row)] set2=[row for row set2=[row(set1,set2) in rows if not split_function(row)] for row return return (set1,set2)
  • 32. CART Algoritm A B Value 10 Circle 20 11 Square 22 22 Square 8 18 Circle 6
  • 33. CART Algoritm A B Value 10 Circle 20 11 Square 22 22 Square 8 18 Circle 6
  • 34. CART Algoritm 22 Square 8 10 Circle 20 18 Circle 6 11 Square 22
  • 36. Python Code def buildtree(rows,scoref=variance): if len(rows)==0: return decisionnode() current_score=scoref(rows) # Set up some variables to track the best criteria best_gain=0.0 best_criteria=None best_sets=None column_count=len(rows[0])-1 for col in range(0,column_count): # Generate the list of different values in # this column column_values={} for row in rows: column_values[row[col]]=1 # Now try dividing the rows up for each value # in this column for value in column_values.keys(): (set1,set2)=divideset(rows,col,value) # Information gain p=float(len(set1))/len(rows) gain=current_score-p*scoref(set1)-(1-p)*scoref(set2) if gain>best_gain and len(set1)>0 and len(set2)>0: best_gain=gain best_criteria=(col,value) best_sets=(set1,set2) # Create the sub branches if best_gain>0: trueBranch=buildtree(best_sets[0]) falseBranch=buildtree(best_sets[1]) return decisionnode(col=best_criteria[0],value=best_criteria[1],tb=trueBranch,fb=falseBranch) else: return decisionnode(results=uniquecounts(rows))
  • 37. Python Code def buildtree(rows,scoref=variance): def buildtree(rows,scoref=variance): if len(rows)==0: return decisionnode() if len(rows)==0: return decisionnode() current_score=scoref(rows) current_score=scoref(rows) criteria # Set up some variables to track the best #best_gain=0.0some variables to track the best criteria Set up best_criteria=None best_gain=0.0 best_sets=None column_count=len(rows[0])-1 best_criteria=None for col in range(0,column_count): best_sets=None of different values in # Generate the list # this column column_count=len(rows[0])-1 column_values={} for row in rows: column_values[row[col]]=1 # Now try dividing the rows up for each value # in this column for value in column_values.keys(): (set1,set2)=divideset(rows,col,value) # Information gain p=float(len(set1))/len(rows) gain=current_score-p*scoref(set1)-(1-p)*scoref(set2) if gain>best_gain and len(set1)>0 and len(set2)>0: best_gain=gain best_criteria=(col,value) best_sets=(set1,set2) # Create the sub branches if best_gain>0: trueBranch=buildtree(best_sets[0]) falseBranch=buildtree(best_sets[1]) return decisionnode(col=best_criteria[0],value=best_criteria[1],tb=trueBranch,fb=falseBranch) else: return decisionnode(results=uniquecounts(rows))
  • 38. Python Code def buildtree(rows,scoref=variance): if len(rows)==0: return decisionnode() current_score=scoref(rows) # Set up some variables to track the best criteria best_gain=0.0 best_criteria=None best_sets=None column_count=len(rows[0])-1 for col in range(0,column_count): # Generate the list of different values in # this column column_values={} for row in rows: column_values[row[col]]=1 for try dividing the rows up for each value # Now value in column_values.keys(): # in this column (set1,set2)=divideset(rows,col,value) for value in column_values.keys(): # Information gain (set1,set2)=divideset(rows,col,value) # Information gain p=float(len(set1))/len(rows) p=float(len(set1))/len(rows) gain=current_score-p*scoref(set1)-(1-p)*scoref(set2) gain=current_score-p*scoref(set1)-(1-p)*scoref(set2) if gain>best_gain and len(set1)>0 and len(set2)>0: if gain>best_gain and len(set1)>0 and len(set2)>0: best_gain=gain best_criteria=(col,value) best_gain=gain best_sets=(set1,set2) best_criteria=(col,value) # Create the sub branches if best_gain>0: best_sets=(set1,set2) trueBranch=buildtree(best_sets[0]) falseBranch=buildtree(best_sets[1]) return decisionnode(col=best_criteria[0],value=best_criteria[1],tb=trueBranch,fb=falseBranch) else: return decisionnode(results=uniquecounts(rows))
  • 39. Python Code def buildtree(rows,scoref=variance): if len(rows)==0: return decisionnode() current_score=scoref(rows) # Set up some variables to track the best criteria best_gain=0.0 best_criteria=None best_sets=None column_count=len(rows[0])-1 for col in range(0,column_count): # Generate the list of different values in # this column column_values={} for row in rows: column_values[row[col]]=1 # Now try dividing the rows up for each value # in this column for value in column_values.keys(): (set1,set2)=divideset(rows,col,value) # Information gain p=float(len(set1))/len(rows) gain=current_score-p*scoref(set1)-(1-p)*scoref(set2) if best_gain>0: and len(set1)>0 and len(set2)>0: if gain>best_gain best_gain=gain trueBranch=buildtree(best_sets[0]) best_criteria=(col,value) best_sets=(set1,set2) falseBranch=buildtree(best_sets[1]) # Create the sub branches if best_gain>0: return decisionnode(col=best_criteria[0],value=best_criteria[1], trueBranch=buildtree(best_sets[0]) tb=trueBranch,fb=falseBranch) falseBranch=buildtree(best_sets[1]) return decisionnode(col=best_criteria[0],value=best_criteria[1],tb=trueBranch,fb=falseBranch) else: else: return decisionnode(results=uniquecounts(rows)) return decisionnode(results=uniquecounts(rows))
  • 40. Zillow Results Bathrooms > 3 Zip: 02139? After 1903? Zip: 02140? Bedrooms > 4? Duplex? Triplex?
  • 41. Just for Fun… Hot or Not
  • 42. Just for Fun… Hot or Not
  • 43. Supervised and Unsupervised Regression trees are supervised “answers” are in the dataset Tree models predict answers Some methods are unsupervised There are no answers Methods just characterize the data Show interesting patterns
  • 44. Next challenge - Bloggers Millions of blogs online Usually focus on a subject area Can they be characterized automatically? … using only the words in the posts?
• 47. Getting the content Use Mark Pilgrim’s Universal Feed Parser Retrieve the post titles and text Split up the words Count occurrence of each word
• 48. Python Code
import feedparser
import re

# Returns title and dictionary of word counts for an RSS feed
def getwordcounts(url):
  # Parse the feed
  d=feedparser.parse(url)
  wc={}

  # Loop over all the entries
  for e in d.entries:
    if 'summary' in e: summary=e.summary
    else: summary=e.description

    # Extract a list of words
    words=getwords(e.title+' '+summary)
    for word in words:
      wc.setdefault(word,0)
      wc[word]+=1
  return d.feed.title,wc

def getwords(html):
  # Remove all the HTML tags
  txt=re.compile(r'<[^>]+>').sub('',html)

  # Split words by all non-alpha characters
  words=re.compile(r'[^A-Z^a-z]+').split(txt)

  # Convert to lowercase
  return [word.lower() for word in words if word!='']
• 49. Python Code (same code, zoomed in on the entry loop that builds the word counts)
• 50. Python Code (same code, zoomed in on getwords: stripping HTML tags and splitting on non-alpha characters)
  • 51. Building a Word Matrix Build a matrix of word counts Blogs are rows, words are columns Eliminate words that are: Too common Too rare
• 52. Python Code
apcount={}
wordcounts={}
# Read the list of feeds so len(feedlist) can be used below
feedlist=[line for line in file('feedlist.txt')]
for feedurl in feedlist:
  title,wc=getwordcounts(feedurl)
  wordcounts[title]=wc
  for word,count in wc.items():
    apcount.setdefault(word,0)
    if count>1:
      apcount[word]+=1

wordlist=[]
for w,bc in apcount.items():
  frac=float(bc)/len(feedlist)
  if frac>0.1 and frac<0.5: wordlist.append(w)

out=file('blogdata.txt','w')
out.write('Blog')
for word in wordlist: out.write('\t%s' % word)
out.write('\n')
for blog,wc in wordcounts.items():
  out.write(blog)
  for word in wordlist:
    if word in wc: out.write('\t%d' % wc[word])
    else: out.write('\t0')
  out.write('\n')
• 53. Python Code (same code, zoomed in on the feed loop that fills wordcounts and apcount)
• 54. Python Code (same code, zoomed in on the wordlist filter that drops too-common and too-rare words)
• 55. Python Code (same code, zoomed in on writing the blogdata.txt matrix)
• 56. The Word Matrix
                     “china”  “kids”  “music”  “yahoo”
Gothamist                 0       3        3        0
GigaOM                    6       0        1        2
Quick Online Tips         0       2        2       12
• 57. Determining distance
                     “china”  “kids”  “music”  “yahoo”
Gothamist                 0       3        3        0
GigaOM                    6       0        1        2
Quick Online Tips         0       2        2       12

Euclidean “as the crow flies” distance (here between GigaOM and Quick Online Tips):
√((6 − 0)² + (0 − 2)² + (1 − 2)² + (2 − 12)²) = 12 (approx)
• 58. Other Distance Metrics Manhattan Tanimoto Pearson Correlation Chebyshev Spearman
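The clustering code that follows takes distance=pearson as its default, but the function itself never appears on these slides; a minimal sketch of a Pearson-correlation distance (1 minus the correlation, so highly correlated word counts give a small distance) might look like this:

from math import sqrt

def pearson(v1,v2):
  n=float(len(v1))
  # Simple sums and sums of squares
  sum1,sum2=sum(v1),sum(v2)
  sum1Sq=sum([pow(v,2) for v in v1])
  sum2Sq=sum([pow(v,2) for v in v2])
  # Sum of the products
  pSum=sum([v1[i]*v2[i] for i in range(len(v1))])
  # Pearson correlation r = num/den
  num=pSum-(sum1*sum2/n)
  den=sqrt((sum1Sq-pow(sum1,2)/n)*(sum2Sq-pow(sum2,2)/n))
  if den==0: return 0
  # Return 1-r so that similar vectors get a small "distance"
  return 1.0-num/den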
• 59. Hierarchical Clustering Find the two closest items Combine them into a single item Repeat…
• 66. Python Code
class bicluster:
  def __init__(self,vec,left=None,right=None,distance=0.0,id=None):
    self.left=left
    self.right=right
    self.vec=vec
    self.id=id
    self.distance=distance
• 67. Python Code
def hcluster(rows,distance=pearson):
  distances={}
  currentclustid=-1

  # Clusters are initially just the rows
  clust=[bicluster(rows[i],id=i) for i in range(len(rows))]

  while len(clust)>1:
    lowestpair=(0,1)
    closest=distance(clust[0].vec,clust[1].vec)

    # loop through every pair looking for the smallest distance
    for i in range(len(clust)):
      for j in range(i+1,len(clust)):
        # distances is the cache of distance calculations
        if (clust[i].id,clust[j].id) not in distances:
          distances[(clust[i].id,clust[j].id)]=distance(clust[i].vec,clust[j].vec)
        d=distances[(clust[i].id,clust[j].id)]

        if d<closest:
          closest=d
          lowestpair=(i,j)

    # calculate the average of the two clusters
    mergevec=[
      (clust[lowestpair[0]].vec[i]+clust[lowestpair[1]].vec[i])/2.0
      for i in range(len(clust[0].vec))]

    # create the new cluster
    newcluster=bicluster(mergevec,left=clust[lowestpair[0]],
                         right=clust[lowestpair[1]],
                         distance=closest,id=currentclustid)

    # cluster ids that weren’t in the original set are negative
    currentclustid-=1
    del clust[lowestpair[1]]
    del clust[lowestpair[0]]
    clust.append(newcluster)

  return clust[0]
• 68. Python Code (same hcluster function, zoomed in on the initial one-cluster-per-row setup)
• 69. Python Code (same hcluster function, zoomed in on the pair search with the distance cache)
• 70. Python Code (same hcluster function, zoomed in on merging the two closest clusters)
• 74. Rotating the Matrix
Words in a blog -> blogs containing each word
         Gothamist   GigaOM   Quick Online Tips
china            0        6                   0
kids             3        0                   2
music            3        1                   2
Yahoo            0        2                  12
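A small sketch of how that rotation might be done on the word matrix (assuming data is the list of count rows built earlier; rotatematrix is an illustrative name):

def rotatematrix(data):
  newdata=[]
  # Each column of the original matrix becomes a row of the new one
  for i in range(len(data[0])):
    newrow=[data[j][i] for j in range(len(data))]
    newdata.append(newrow)
  return newdata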
  • 76. K-Means Clustering Divides data into distinct clusters User determines how many Algorithm Start with arbitrary centroids Assign points to centroids Move the centroids Repeat
• 82. Python Code
import random

def kcluster(rows,distance=pearson,k=4):
  # Determine the minimum and maximum values for each point
  ranges=[(min([row[i] for row in rows]),max([row[i] for row in rows]))
          for i in range(len(rows[0]))]

  # Create k randomly placed centroids
  clusters=[[random.random()*(ranges[i][1]-ranges[i][0])+ranges[i][0]
             for i in range(len(rows[0]))] for j in range(k)]

  lastmatches=None
  for t in range(100):
    print 'Iteration %d' % t
    bestmatches=[[] for i in range(k)]

    # Find which centroid is the closest for each row
    for j in range(len(rows)):
      row=rows[j]
      bestmatch=0
      for i in range(k):
        d=distance(clusters[i],row)
        if d<distance(clusters[bestmatch],row): bestmatch=i
      bestmatches[bestmatch].append(j)

    # If the results are the same as last time, this is complete
    if bestmatches==lastmatches: break
    lastmatches=bestmatches

    # Move the centroids to the average of their members
    for i in range(k):
      avgs=[0.0]*len(rows[0])
      if len(bestmatches[i])>0:
        for rowid in bestmatches[i]:
          for m in range(len(rows[rowid])):
            avgs[m]+=rows[rowid][m]
        for j in range(len(avgs)):
          avgs[j]/=len(bestmatches[i])
        clusters[i]=avgs

  return bestmatches
• 83. Python Code (same kcluster function, zoomed in on the value ranges and random centroid placement)
• 84. Python Code (same kcluster function, zoomed in on assigning each row to its closest centroid)
• 85. Python Code (same kcluster function, zoomed in on stopping when the assignments no longer change)
• 86. Python Code (same kcluster function, zoomed in on moving each centroid to the average of its members)
• 87. K-Means Results
>> [rownames[r] for r in k[0]]
['The Viral Garden', 'Copyblogger', 'Creating Passionate Users', 'Oilman',
 'ProBlogger Blog Tips', "Seth's Blog"]
>> [rownames[r] for r in k[1]]
['Wonkette', 'Gawker', 'Gothamist', 'Huffington Post']
  • 88. 2D Visualizations Instead of Clusters, a 2D Map Goals Preserve distances as much as possible Draw in two dimensions Dimension Reduction Principal Components Analysis Multidimensional Scaling
• 92.
from math import sqrt
import random

def scaledown(data,distance=pearson,rate=0.01):
  n=len(data)

  # The real distances between every pair of items
  realdist=[[distance(data[i],data[j]) for j in range(n)]
            for i in range(0,n)]
  outersum=0.0

  # Randomly initialize the starting points of the locations in 2D
  loc=[[random.random(),random.random()] for i in range(n)]
  fakedist=[[0.0 for j in range(n)] for i in range(n)]

  lasterror=None
  for m in range(0,1000):
    # Find projected distances
    for i in range(n):
      for j in range(n):
        fakedist[i][j]=sqrt(sum([pow(loc[i][x]-loc[j][x],2)
                                 for x in range(len(loc[i]))]))

    # Move points
    grad=[[0.0,0.0] for i in range(n)]

    totalerror=0
    for k in range(n):
      for j in range(n):
        if j==k: continue
        # The error is percent difference between the distances
        errorterm=(fakedist[j][k]-realdist[j][k])/realdist[j][k]

        # Each point needs to be moved away from or towards the other
        # point in proportion to how much error it has
        grad[k][0]+=((loc[k][0]-loc[j][0])/fakedist[j][k])*errorterm
        grad[k][1]+=((loc[k][1]-loc[j][1])/fakedist[j][k])*errorterm

        # Keep track of the total error
        totalerror+=abs(errorterm)
    print totalerror

    # If the answer got worse by moving the points, we are done
    if lasterror and lasterror<totalerror: break
    lasterror=totalerror

    # Move each of the points by the learning rate times the gradient
    for k in range(n):
      loc[k][0]-=rate*grad[k][0]
      loc[k][1]-=rate*grad[k][1]

  return loc
• 93. (same scaledown function, zoomed in on computing the real pairwise distances)
• 94. (same scaledown function, zoomed in on the random 2D starting locations)
• 95. (same scaledown function, zoomed in on computing the projected distances)
• 96. (same scaledown function, zoomed in on the gradient built from the distance errors)
• 97. (same scaledown function, zoomed in on stopping when the total error stops improving)
• 98. (same scaledown function, zoomed in on moving each point by the learning rate times the gradient)
• 99.–102. (image-only slides; no text content)
  • 103. Numerical Predictions Back to “supervised” learning We have a set of numerical attributes Specs for a laptop Age and rating for wine Ratios for a stock Want to predict another attribute Formula/model is unknown e.g. price
  • 104. Regression Trees? Regression trees find hard boundaries Can’t deal with complex formulae
  • 105. Statistical regression Requires specification of a model Usually linear Doesn’t handle context
  • 106. Alternative - Interpolation Find “similar” items Guess price based on similar items Need to determine: What is similar? How should we aggregate prices?
  • 108. The eBay API XML API Send XML over HTTPS Receive results in XML http://developer.ebay.com/quickstartguide.
• 109. Some Python Code
def getHeaders(apicall,siteID="0",compatabilityLevel="433"):
  headers = {"X-EBAY-API-COMPATIBILITY-LEVEL": compatabilityLevel,
             "X-EBAY-API-DEV-NAME": devKey,
             "X-EBAY-API-APP-NAME": appKey,
             "X-EBAY-API-CERT-NAME": certKey,
             "X-EBAY-API-CALL-NAME": apicall,
             "X-EBAY-API-SITEID": siteID,
             "Content-Type": "text/xml"}
  return headers

def sendRequest(apicall,xmlparameters):
  connection = httplib.HTTPSConnection(serverUrl)
  connection.request("POST", '/ws/api.dll', xmlparameters, getHeaders(apicall))
  response = connection.getresponse()
  if response.status != 200:
    print "Error sending request:" + response.reason
  else:
    data = response.read()
    connection.close()
    return data
• 110. Some Python Code
def getItem(itemID):
  xml = "<?xml version='1.0' encoding='utf-8'?>"+\
        "<GetItemRequest xmlns=\"urn:ebay:apis:eBLBaseComponents\">"+\
        "<RequesterCredentials><eBayAuthToken>" + userToken +\
        "</eBayAuthToken></RequesterCredentials>" +\
        "<ItemID>" + str(itemID) + "</ItemID>"+\
        "<DetailLevel>ItemReturnAttributes</DetailLevel>"+\
        "</GetItemRequest>"
  data=sendRequest('GetItem',xml)
  result={}
  response=parseString(data)
  result['title']=getSingleValue(response,'Title')
  sellingStatusNode = response.getElementsByTagName('SellingStatus')[0]
  result['price']=getSingleValue(sellingStatusNode,'CurrentPrice')
  result['bids']=getSingleValue(sellingStatusNode,'BidCount')
  seller = response.getElementsByTagName('Seller')
  result['feedback'] = getSingleValue(seller[0],'FeedbackScore')
  attributeSet=response.getElementsByTagName('Attribute')
  attributes={}
  for att in attributeSet:
    attID=att.attributes.getNamedItem('attributeID').nodeValue
    attValue=getSingleValue(att,'ValueLiteral')
    attributes[attID]=attValue
  result['attributes']=attributes
  return result
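getItem leans on a small helper, getSingleValue, that isn't shown on the slide; assuming parseString comes from xml.dom.minidom, one plausible version is:

from xml.dom.minidom import parseString

def getSingleValue(node,tag):
  # Text content of the first element with this tag name, or '-1' if missing
  nl=node.getElementsByTagName(tag)
  if len(nl)>0:
    tagNode=nl[0]
    if tagNode.hasChildNodes():
      return tagNode.firstChild.nodeValue
  return '-1'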
• 111. Building an item table
            RAM    CPU    HDD   Screen   DVD   Price
D600        512   1400     40       14     1    $350
Lenovo      160    300      5       13     0     $80
T22         256    900     20       14     1    $200
Pavilion   1024   1600    120       17     1    $800
etc..
• 112. Distance between items
        RAM    CPU    HDD   Screen   DVD   Price
New     512   1400     40       14     1     ???
T22     256    900     20       14     1    $200

Euclidean, just like in clustering:
√((512 − 256)² + (1400 − 900)² + (40 − 20)² + (14 − 14)² + (1 − 1)²)
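The getdistances function a few slides on calls a euclidean helper that the deck doesn't define; a minimal version over two equal-length numeric vectors:

from math import sqrt

def euclidean(v1,v2):
  # Straight-line distance between two equal-length vectors
  return sqrt(sum([(v1[i]-v2[i])**2 for i in range(len(v1))]))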
  • 113. Idea 1 – use the closest item With the item whose price I want to guess: Calculate the distance for every item in my dataset Guess that the price is the same as the closest This is called kNN with k=1
  • 114. Problems with “outliers” The closest item may be anomalous Why? Exceptional deal that won’t occur again Something missing from the dataset Data errors
• 115. Using an average
          RAM    CPU    HDD   Screen   DVD   Price
New       512   1400     40       14     1     ???
No. 1     512   1400     30       13     1    $360
No. 2     512   1400     60       14     1    $400
No. 3    1024   1600    120       15     0    $325
k=3, estimate = $361
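The k=3 average above is plain (unweighted) kNN; a sketch of that estimator, assuming the same data format ({'input':…, 'result':…}) and the getdistances helper shown a couple of slides later (knnestimate is an illustrative name):

def knnestimate(data,vec1,k=3):
  # Sort every item in the dataset by distance to vec1
  dlist=getdistances(data,vec1)
  avg=0.0
  # Plain average of the k nearest prices
  for i in range(k):
    idx=dlist[i][1]
    avg+=data[idx]['result']
  return avg/k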
• 116. Using a weighted average
          RAM    CPU    HDD   Screen   DVD   Price   Weight
New       512   1400     40       14     1     ???
No. 1     512   1400     30       13     1    $360        3
No. 2     512   1400     60       14     1    $400        2
No. 3    1024   1600    120       15     0    $325        1
Estimate = $367
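weightedknn on the next slide defaults to weightf=gaussian, which also isn't defined in the deck; a typical choice gives nearby items a weight near 1 that falls off smoothly with distance (sigma=10.0 is an assumed value):

from math import e

def gaussian(dist,sigma=10.0):
  # Weight decays smoothly with distance but never quite reaches zero
  return e**(-dist**2/(2*sigma**2))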
• 117. Python code
def getdistances(data,vec1):
  distancelist=[]
  for i in range(len(data)):
    vec2=data[i]['input']
    distancelist.append((euclidean(vec1,vec2),i))
  distancelist.sort()
  return distancelist

def weightedknn(data,vec1,k=5,weightf=gaussian):
  # Get distances
  dlist=getdistances(data,vec1)
  avg=0.0
  totalweight=0.0

  # Get weighted average
  for i in range(k):
    dist=dlist[i][0]
    idx=dlist[i][1]
    weight=weightf(dist)
    avg+=weight*data[idx]['result']
    totalweight+=weight
  avg=avg/totalweight
  return avg
• 118. Python code (same code, zoomed in on weightedknn)
  • 119. Too few – k too low
  • 120. Too many – k too high
  • 121. Determining the best k Divide the dataset up Training set Test set Guess the prices for the test set using the training set See how good the guesses are for different values of k Known as “cross-validation”
• 122. Determining the best k
Test set:
  Attribute   Price
         10      20

Training set:
  Attribute   Price
         11      30
          8      10
          6       0

For k = 1, guess = 30, error = 10
For k = 2, guess = 20, error = 0
For k = 3, guess = 13, error = 7
Repeat with different test sets, average the error
• 123. Python code
def dividedata(data,test=0.05):
  trainset=[]
  testset=[]
  for row in data:
    if random()<test:
      testset.append(row)
    else:
      trainset.append(row)
  return trainset,testset

def testalgorithm(algf,trainset,testset):
  error=0.0
  for row in testset:
    guess=algf(trainset,row['input'])
    error+=(row['result']-guess)**2
  return error/len(testset)

def crossvalidate(algf,data,trials=100,test=0.05):
  error=0.0
  for i in range(trials):
    trainset,testset=dividedata(data,test)
    error+=testalgorithm(algf,trainset,testset)
  return error/trials
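A hypothetical usage sketch: wrap estimators so they match the algf(trainset, vec) signature crossvalidate expects, then compare a few values of k (knnestimate is the illustrative function sketched earlier; data is the eBay item table):

# Lower cross-validation error suggests a better choice of k
for k in (1,3,5):
  algf=lambda d,v,k=k: knnestimate(d,v,k)
  print k, crossvalidate(algf,data)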
  • 130. Determining the best scale Try different weights Use the “cross-validation” method Different ways of choosing a scale: Range-scaling Intuitive guessing Optimization
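One way to try different weights is simply to multiply each input column by a scale factor and let cross-validation judge the result; a minimal sketch (rescale and the example scale values are illustrative, not from the slides):

def rescale(data,scale):
  # Multiply every input attribute by its weight; results stay unchanged
  scaleddata=[]
  for row in data:
    scaled=[scale[i]*row['input'][i] for i in range(len(scale))]
    scaleddata.append({'input':scaled,'result':row['result']})
  return scaleddata

# e.g. damp the HDD column and ignore the DVD flag entirely
# sdata=rescale(data,[1.0,1.0,0.1,1.0,0.0])
# crossvalidate(weightedknn,sdata)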
• 131. Methods covered Regression trees Hierarchical clustering k-means clustering Multidimensional scaling Weighted k-nearest neighbors
  • 132. New projects Openads An open-source ad server Users can share impression/click data Matrix of what hits based on Page Text Ad Ad placement Search query Can we improve targeting?
  • 133. New Projects Finance Analysts already drowning in info Stories sometimes broken on blogs Message boards show sentiment Extremely low signal-to-noise ratio
  • 134. New Projects Entertainment How much buzz is a movie generating? What psychographic profiles like this type of movie? Of interest to studios and media investors