In this session, we will explore the principles behind building a highly scalable, efficient, and effective web data mining architecture, based on standard semantic principles of data collection. This type of standard collection will allow any company to turn unstructured web data into structurally sound, valuable content.
Creating Operational Redundancy for Effective Web Data Mining
1. Using Operational Redundancy
Effective Web Data Mining
Jonathan LeBlanc
Head of Developer Evangelism N.A. (PayPal)
Github: http://github.com/jcleblanc
Slides: http://slideshare.net/jcleblanc
Twitter: @jcleblanc
3. Elements of Mining Redundancy
Website Data Mining
User Emotional State Mining
User Interaction Mining
5. Our Subject Material
HTML content is poorly structured
There are some pretty bad web practices on the interwebz
You can’t trust that anything semantically valid will be present
6. How We’ll Capture This Data
Start with base linguistics
Extend with available extras
7. The Basic Pieces
Page Data: Scrapey scrapey
Keywords: Without all the fluff
Weighting: Word diets FTW
8. Capture Raw Page Data
Semantic data on the web is sucktastic
Assume 5 year olds built the sites
Language is the key
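A minimal sketch of this capture step, assuming the page HTML has already been downloaded into a string. The function name page_to_words is hypothetical, and the name $mod_content is reused on the assumption that it is the word list the later keyword slide loops over. Splitting on ASCII letters and digits only is a simplification.

```php
<?php
// Hypothetical helper: turn raw HTML into a flat list of lowercase words.
function page_to_words(string $page_content): array {
    if (trim($page_content) === '') {
        return array();
    }
    $dom = new DOMDocument();
    // Suppress warnings from malformed markup, since real pages are rarely valid.
    @$dom->loadHTML($page_content);

    // Collect every text node that is not inside a script or style element.
    $xpath = new DOMXPath($dom);
    $parts = array();
    foreach ($xpath->query('//text()[not(ancestor::script) and not(ancestor::style)]') as $node) {
        $parts[] = $node->nodeValue;
    }

    // Join with spaces so text from adjacent elements does not fuse into one word.
    $text = strtolower(implode(' ', $parts));

    // Split on anything that is not an ASCII letter, digit, or apostrophe.
    return preg_split("/[^a-z0-9']+/", $text, -1, PREG_SPLIT_NO_EMPTY);
}

$mod_content = page_to_words(
    '<html><head><title>Red Shoes</title><style>p{color:red}</style></head>'
    . '<body><h1>Red shoes</h1><p>Buy red shoes &amp; socks.</p></body></html>'
);
```

The markup is ignored on purpose here: only the language on the page is kept, and the structure is brought back in later when weighting.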
9. Extract Keywords
We now have a big jumble of words. Let’s extract
Why is “and” a top word?
Stop words = sad panda
10. Weight Keywords
All content is not created equal
Meta and headers and semantics oh my!
This is where we leech off the work of others
12. Questions to Keep in Mind
Should I use regex to parse web content?
How do users interact with page content?
What key identifiers can be monitored to detect interest?
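15.
//set up the list of stop words and the array of keyword counts
$common_words = array('a', ..., 'zero');
$searched_words = array();
//count occurrences of each keyword ($mod_content is assumed to hold the words scraped from the page)
foreach ($mod_content as $word) {
    $word = trim($word);
    //skip very short words and stop words
    if (strlen($word) > 2 && !in_array($word, $common_words)) {
        //initialize the counter on first sight to avoid an undefined index notice
        if (!isset($searched_words[$word])) {
            $searched_words[$word] = 0;
        }
        $searched_words[$word]++;
    }
}
//sort keywords by occurrence count, highest first
arsort($searched_words, SORT_NUMERIC);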
16. Scraping Site Meta Data
//load scraped page data as a DOM document, suppressing warnings from malformed HTML
$dom = new DOMDocument();
@$dom->loadHTML($page_content);
//scrape title, if the page has one
$titles = $dom->getElementsByTagName("title");
$title = $titles->length > 0 ? $titles->item(0)->nodeValue : null;
17.
//loop through all found meta tags
$metas = $dom->getElementsByTagName("meta");
for ($i = 0; $i < $metas->length; $i++) {
    $meta = $metas->item($i);
    if ($meta->getAttribute("property")) {
        //Open Graph description
        if ($meta->getAttribute("property") == "og:description") {
            $dataReturn["description"] = $meta->getAttribute("content");
        }
    } else {
        //standard description and keywords meta tags
        if ($meta->getAttribute("name") == "description") {
            $dataReturn["description"] = $meta->getAttribute("content");
        } else if ($meta->getAttribute("name") == "keywords") {
            $dataReturn["keywords"] = $meta->getAttribute("content");
        }
    }
}
18. Weighting Important Data
Tags you should care about: meta (include OG), title, description, h1+, header
Bonus points for adding in content location modifiers
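One way the weighting idea could be sketched: each occurrence of a word counts for more when it appears in a more important part of the page. The function name weight_keywords, the region names, and the weight values are all hypothetical and chosen only for illustration; the slides do not give concrete numbers.

```php
<?php
// Hypothetical weights: how much one occurrence in each page region counts.
$region_weights = array('title' => 5, 'description' => 4, 'h1' => 3, 'body' => 1);

// Hypothetical helper: score words by where on the page they were found.
function weight_keywords(array $region_words, array $region_weights): array {
    $scores = array();
    foreach ($region_words as $region => $words) {
        // Regions without a configured weight count the same as body text.
        $weight = isset($region_weights[$region]) ? $region_weights[$region] : 1;
        foreach ($words as $word) {
            if (!isset($scores[$word])) {
                $scores[$word] = 0;
            }
            $scores[$word] += $weight;
        }
    }
    // Highest weighted score first.
    arsort($scores, SORT_NUMERIC);
    return $scores;
}

$scores = weight_keywords(array(
    'title' => array('red', 'shoes'),
    'h1'    => array('red', 'shoes'),
    'body'  => array('buy', 'red', 'shoes', 'socks'),
), $region_weights);
```

A content location modifier could be folded in the same way, for example by splitting 'body' into separate regions with their own weights.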
20. Expanding to Phrases
2-3 adjacent words, making up a direct relevant callout
Seems easy right? Just like single words
Language gets wonky without stop words
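A rough sketch of phrase extraction under one possible reading of the point above: stop words are kept when they sit inside a phrase ("head of state") but a phrase may not start or end with one. The function name extract_phrases and the tiny stop word list are hypothetical.

```php
<?php
// Hypothetical helper: count phrases of $min to $max adjacent words.
function extract_phrases(array $words, array $stop_words, int $min = 2, int $max = 3): array {
    $phrases = array();
    $total = count($words);
    for ($n = $min; $n <= $max; $n++) {
        for ($i = 0; $i + $n <= $total; $i++) {
            $slice = array_slice($words, $i, $n);
            // Keep stop words inside a phrase, but not at its edges.
            if (in_array($slice[0], $stop_words, true) || in_array($slice[$n - 1], $stop_words, true)) {
                continue;
            }
            $phrase = implode(' ', $slice);
            $phrases[$phrase] = isset($phrases[$phrase]) ? $phrases[$phrase] + 1 : 1;
        }
    }
    // Most frequent phrase first.
    arsort($phrases, SORT_NUMERIC);
    return $phrases;
}

$phrases = extract_phrases(
    array('head', 'of', 'state', 'visits', 'head', 'of', 'state'),
    array('of', 'the', 'and')
);
```

Unlike the single-word pass, the input here is the full word list with stop words still in place, since removing them first would glue unrelated words together.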
21. Adding in Time Interactions
Interaction with a site does not necessarily mean interest in it
Time needs to also include an interaction component
Gift buying seasons see interest variations
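As an illustration only, time on a page could be combined with interaction by counting just the time that falls close to an interaction event (scroll, click, key press). The function name engaged_seconds and the 30 second idle cap are assumptions made for this sketch, not something the slides specify.

```php
<?php
// Hypothetical helper: estimate engaged time from interaction timestamps.
// $event_times holds seconds since page load for each recorded interaction.
function engaged_seconds(array $event_times, int $idle_cap = 30): int {
    sort($event_times);
    $engaged = 0;
    for ($i = 1; $i < count($event_times); $i++) {
        $gap = $event_times[$i] - $event_times[$i - 1];
        // A long gap with no interaction counts only up to the idle cap.
        $engaged += min($gap, $idle_cap);
    }
    return $engaged;
}

// The tab stayed open for over five minutes, but most of that time was idle.
$active = engaged_seconds(array(0, 5, 12, 300, 310));
```

A page left open in a background tab then scores low even though the raw time on page is high.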
24. Using Color Theory
Products with a feel-good message
Happiness, energy, encouragement
Health care (but not food!)
Relatable, calm, friendly, peace, security
Startups / innovative products
Creativity, imagination
Auction sites (but not sales sites!)
Passion, stimulation, excitement, power
28. The Basics of a Design Engine
//create new LESS object
$less = new lessc();
//compile LESS code to CSS (recompiles only when the CSS file is missing or older than the LESS file)
$less->checkedCompile(
    '/path/styles.less',
    'path/styles.css');
//output a link tag pointing at the compiled CSS file
echo "<link rel='stylesheet' href='http://path/styles.css' type='text/css' />";
29. Passing Variables into LESSPHP
//create a new LESS object
$less = new lessc();
//set the variables
$less->setVariables(array(
'color' => 'red',
'base' => '960px'
));
//compile LESS into CSS, then unset the color variable
echo $less->compile(".magic { color: @color;
width: @base - 200; }");
$less->unsetVariable('color');