SlideShare ist ein Scribd-Unternehmen logo
1 von 39
Using Operational Redundancy
Effective Web Data Mining
Jonathan LeBlanc
Head of Developer Evangelism N.A. (PayPal)
Github: http://github.com/jcleblanc
Slides: http://slideshare.net/jcleblanc
Twitter: @jcleblanc
Premise
The interactions of a user can be used to
personalize their experience
Elements of Mining Redundancy
Website
Data
Mining
User
Emotional
State Mining
User
Interaction
Mining
Our Subject Material
HTML content is poorly structured
There are some pretty bad web
practices on the interwebz
You can’t trust that anything
semantically valid will be present
How We’ll Capture This Data
Start with base linguistics
Extend with available extras
The Basic Pieces
Page Data
Scrapey
Scrapey
Keywords
Without all
the fluff
Weighting
Word diets
FTW
Capture Raw Page Data
Semantic data on the web
is sucktastic
Assume 5 year olds built
the sites
Language is the key
Extract Keywords
We now have a big jumble
of words. Let’s extract
Why is “and” a top word?
Stop words = sad panda
Weight Keywords
All content is not created
equal
Meta and headers and
semantics oh my!
This is where we leech
off the work of others
Questions to Keep in Mind
Should I use regex to parse web
content?
How do users interact with page
content?
What key identifiers can be monitored
to detect interest?
Fetching the Data: cURL
$req = curl_init($url);
$options = array(
CURLOPT_URL => $url,
CURLOPT_HEADER => $header,
CURLOPT_RETURNTRANSFER => true,
CURLOPT_FOLLOWLOCATION => true,
CURLOPT_AUTOREFERER => true,
CURLOPT_TIMEOUT => 15,
CURLOPT_MAXREDIRS => 10
);
curl_setopt_array($req, $options);
//list of findable / replaceable string characters
$find = array('/r/', '/n/', '/ss+/'); $replace = array(' ', ' ', ' ');
//perform page content modification
$mod_content = preg_replace('#<script(.*?)>(.*?)</
script>#is', '', $page_content);
$mod_content = preg_replace('#<style(.*?)>(.*?)</
style>#is', '', $mod_content);
$mod_content = strip_tags($mod_content);
$mod_content = strtolower($mod_content);
$mod_content = preg_replace($find, $replace, $mod_content);
$mod_content = trim($mod_content);
$mod_content = explode(' ', $mod_content);
natcasesort($mod_content);
//set up list of stop words and the final found stopped list
$common_words = array('a', ..., 'zero');
$searched_words = array();
//extract list of keywords with number of occurrences
foreach($mod_content as $word) {
$word = trim($word);
if(strlen($word) > 2 && !in_array($word, $common_words)){
$searched_words[$word]++;
}
}
arsort($searched_words, SORT_NUMERIC);
Scraping Site Meta Data
//load scraped page data as a valid DOM document
$dom = new DOMDocument();
@$dom->loadHTML($page_content);
//scrape title
$title = $dom->getElementsByTagName("title");
$title = $title->item(0)->nodeValue;
//loop through all found meta tags
$metas = $dom->getElementsByTagName("meta");
for ($i = 0; $i < $metas->length; $i++){
$meta = $metas->item($i);
if($meta->getAttribute("property")){
if ($meta->getAttribute("property") == "og:description"){
$dataReturn["description"] = $meta->getAttribute("content");
}
} else {
if($meta->getAttribute("name") == "description"){
$dataReturn["description"] = $meta->getAttribute("content");
} else if($meta->getAttribute("name") == "keywords”){
$dataReturn[”keywords"] = $meta->getAttribute("content");
}
}
}
Weighting Important Data
Tags you should care
about: meta (include OG),
title, description, h1+,
header
Bonus points for adding in
content location modifiers
Weighting Important Tags
//our keyword weights
$weights = array("keywords" => "3.0",
"meta" => "2.0",
"header1" => "1.5",
"header2" => "1.2");
//add modifier here
if(strlen($word) > 2 && !in_array($word, $common_words)){
$searched_words[$word]++;
}
Expanding to Phrases
2-3 adjacent words, making
up a direct relevant callout
Seems easy right? Just like
single words
Language gets wonky
without stop words
Adding in Time Interactions
Interaction with a site does
not necessarily mean
interest in it
Time needs to also include
an interaction component
Gift buying seasons see
interest variations
Grouping Using Commonality
Interests
User A
Interests
User B
Interests
Common
Using Color Theory
Products with a feel-good message
Happiness, energy, encouragement
Health care (but not food!)
Relatable, calm, friendly, peace, security
Startups / innovative products
Creativity, imagination
Auction sites (but not sales sites!)
Passion, stimulation, excitement, power
What We’re Talking About
The CSS Service Engine
lesscss.org
sass-lang.com
learnboost.github.com/stylus
http://leafo.net/lessphp/
Design Engine Foundation: LESSPHP
+
The Basics of a Design Engine
//create new LESS object
$less= new lessc();
//compile LESS code to CSS
$less->checkedCompile(
'/path/styles.less',
'path/styles.css');
//create new CSS file and return new file link
echo "<link rel='stylesheet' href='http://path/styles.css'
type='text/css' />";
Passing Variables into LESSPHP
//create a new LESS object
$less = new lessc();
//set the variables
$less->setVariables(array(
'color' => 'red',
'base' => '960px'
));
//compile LESS into PHP and unset variables
echo $less->compile(".magic { color: @color;
width: @base - 200; }");
$less->unsetVariable('color');
Implementing Color Functions
Lighten / Darken Saturate / Desaturate
Adjust HueMix Colors
Managing Irrelevant Content
Remove / hide content
based on user profile
and state
Managing Irrelevant Content
//variables passed into LESS compilation
$less->setVariables(array(
"percent" => "80%",
));
//LESS template
.highlight{
@bg-color: "#464646”;
@font-color: "#eee";
background-color: fade(@bg-color, @percent);
color: fade(@font-color, @percent);
}
Traits of the Bored
Distraction
Repetition
Tiredness
Reasons for Boredom
Lack of interest
Readiness
Acting on Disinterest / Boredom
Highlighting on Agitated Behavior
Highlight relevant
content to reduce
agitated behavior
Acting Upon User Queues
$less->setVariables(array(
"percent" => "100%",
"size-mod" => "2"
));
Variables passed into LESS script
Acting Upon User Queues
.highlight{
@bg-calm: "blue";
@bg-action: "red";
@base-font: "14px";
background-color: mix(@bg-calm,
@bg-action,
@percent );
font-size: @size-mod + @base-font;
}
LESS script logic for color / size variations
Interaction and Emotion Plugin
jQuery Behavior Miner
by Cedric Dugas
https://github.com/posa
bsolute/jquery-
behavior-miner
In the End…
What a person is interested in
What a person is doing
What their emotional state is
http://slideshare.com/jcleblanc
Thank You! Questions?
Jonathan LeBlanc
Head of Developer Evangelism N.A. (PayPal)
Github: http://github.com/jcleblanc
Slides: http://slideshare.net/jcleblanc
Twitter: @jcleblanc

Weitere ähnliche Inhalte

Was ist angesagt?

JSON-LD: JSON for Linked Data
JSON-LD: JSON for Linked DataJSON-LD: JSON for Linked Data
JSON-LD: JSON for Linked DataGregg Kellogg
 
Spiffy Applications With JavaScript
Spiffy Applications With JavaScriptSpiffy Applications With JavaScript
Spiffy Applications With JavaScriptMark Casias
 
Dream House Project Presentation
Dream House Project PresentationDream House Project Presentation
Dream House Project Presentationjongosling
 
Contacto server API in PHP
Contacto server API in PHPContacto server API in PHP
Contacto server API in PHPHem Shrestha
 
Elastify you application: from SQL to NoSQL in less than one hour!
Elastify you application: from SQL to NoSQL in less than one hour!Elastify you application: from SQL to NoSQL in less than one hour!
Elastify you application: from SQL to NoSQL in less than one hour!David Pilato
 
Hi5 Opensocial Code Lab Presentation
Hi5 Opensocial Code Lab PresentationHi5 Opensocial Code Lab Presentation
Hi5 Opensocial Code Lab Presentationplindner
 
Why Python Web Frameworks Are Changing the Web
Why Python Web Frameworks Are Changing the WebWhy Python Web Frameworks Are Changing the Web
Why Python Web Frameworks Are Changing the Webjoelburton
 
nodum.io MongoDB Meetup (Dutch)
nodum.io MongoDB Meetup (Dutch)nodum.io MongoDB Meetup (Dutch)
nodum.io MongoDB Meetup (Dutch)Wietse Wind
 
20180424 #18 we_are_javascripters
20180424 #18 we_are_javascripters20180424 #18 we_are_javascripters
20180424 #18 we_are_javascripters将一 深見
 
Introduction to jQuery
Introduction to jQueryIntroduction to jQuery
Introduction to jQueryGunjan Kumar
 
jQuery Presentation
jQuery PresentationjQuery Presentation
jQuery PresentationRod Johnson
 
Findability Bliss Through Web Standards
Findability Bliss Through Web StandardsFindability Bliss Through Web Standards
Findability Bliss Through Web StandardsAarron Walter
 
Introduction to Web Design, Week 1
Introduction to Web Design, Week 1Introduction to Web Design, Week 1
Introduction to Web Design, Week 1Lou Susi
 
Linked Data in Use: Schema.org, JSON-LD and hypermedia APIs - Front in Bahia...
Linked Data in Use: Schema.org, JSON-LD and hypermedia APIs  - Front in Bahia...Linked Data in Use: Schema.org, JSON-LD and hypermedia APIs  - Front in Bahia...
Linked Data in Use: Schema.org, JSON-LD and hypermedia APIs - Front in Bahia...Ícaro Medeiros
 
So cal0365productivitygroup feb2019
So cal0365productivitygroup feb2019So cal0365productivitygroup feb2019
So cal0365productivitygroup feb2019RonRohlfs1
 

Was ist angesagt? (20)

Google Hack
Google HackGoogle Hack
Google Hack
 
HTML5 Essentials
HTML5 EssentialsHTML5 Essentials
HTML5 Essentials
 
JSON-LD: JSON for Linked Data
JSON-LD: JSON for Linked DataJSON-LD: JSON for Linked Data
JSON-LD: JSON for Linked Data
 
Spiffy Applications With JavaScript
Spiffy Applications With JavaScriptSpiffy Applications With JavaScript
Spiffy Applications With JavaScript
 
Dream House Project Presentation
Dream House Project PresentationDream House Project Presentation
Dream House Project Presentation
 
Contacto server API in PHP
Contacto server API in PHPContacto server API in PHP
Contacto server API in PHP
 
Elastify you application: from SQL to NoSQL in less than one hour!
Elastify you application: from SQL to NoSQL in less than one hour!Elastify you application: from SQL to NoSQL in less than one hour!
Elastify you application: from SQL to NoSQL in less than one hour!
 
Hi5 Opensocial Code Lab Presentation
Hi5 Opensocial Code Lab PresentationHi5 Opensocial Code Lab Presentation
Hi5 Opensocial Code Lab Presentation
 
Why Python Web Frameworks Are Changing the Web
Why Python Web Frameworks Are Changing the WebWhy Python Web Frameworks Are Changing the Web
Why Python Web Frameworks Are Changing the Web
 
nodum.io MongoDB Meetup (Dutch)
nodum.io MongoDB Meetup (Dutch)nodum.io MongoDB Meetup (Dutch)
nodum.io MongoDB Meetup (Dutch)
 
20180424 #18 we_are_javascripters
20180424 #18 we_are_javascripters20180424 #18 we_are_javascripters
20180424 #18 we_are_javascripters
 
Introduction to jQuery
Introduction to jQueryIntroduction to jQuery
Introduction to jQuery
 
jQuery Best Practice
jQuery Best Practice jQuery Best Practice
jQuery Best Practice
 
jQuery Presentation
jQuery PresentationjQuery Presentation
jQuery Presentation
 
Findability Bliss Through Web Standards
Findability Bliss Through Web StandardsFindability Bliss Through Web Standards
Findability Bliss Through Web Standards
 
jQuery
jQueryjQuery
jQuery
 
jQuery
jQueryjQuery
jQuery
 
Introduction to Web Design, Week 1
Introduction to Web Design, Week 1Introduction to Web Design, Week 1
Introduction to Web Design, Week 1
 
Linked Data in Use: Schema.org, JSON-LD and hypermedia APIs - Front in Bahia...
Linked Data in Use: Schema.org, JSON-LD and hypermedia APIs  - Front in Bahia...Linked Data in Use: Schema.org, JSON-LD and hypermedia APIs  - Front in Bahia...
Linked Data in Use: Schema.org, JSON-LD and hypermedia APIs - Front in Bahia...
 
So cal0365productivitygroup feb2019
So cal0365productivitygroup feb2019So cal0365productivitygroup feb2019
So cal0365productivitygroup feb2019
 

Ähnlich wie Creating Operational Redundancy for Effective Web Data Mining

Assetic (Symfony Live Paris)
Assetic (Symfony Live Paris)Assetic (Symfony Live Paris)
Assetic (Symfony Live Paris)Kris Wallsmith
 
Intro to php
Intro to phpIntro to php
Intro to phpSp Singh
 
Mojolicious, real-time web framework
Mojolicious, real-time web frameworkMojolicious, real-time web framework
Mojolicious, real-time web frameworktaggg
 
Introducing Assetic: Asset Management for PHP 5.3
Introducing Assetic: Asset Management for PHP 5.3Introducing Assetic: Asset Management for PHP 5.3
Introducing Assetic: Asset Management for PHP 5.3Kris Wallsmith
 
PHP and Rich Internet Applications
PHP and Rich Internet ApplicationsPHP and Rich Internet Applications
PHP and Rich Internet Applicationselliando dias
 
Introducing Assetic (NYPHP)
Introducing Assetic (NYPHP)Introducing Assetic (NYPHP)
Introducing Assetic (NYPHP)Kris Wallsmith
 
Share point hosted add ins munich
Share point hosted add ins munichShare point hosted add ins munich
Share point hosted add ins munichSonja Madsen
 
FamilySearch Reference Client
FamilySearch Reference ClientFamilySearch Reference Client
FamilySearch Reference ClientDallan Quass
 
How to insert json data into my sql using php
How to insert json data into my sql using phpHow to insert json data into my sql using php
How to insert json data into my sql using phpTrà Minh
 
Building a real life application in node js
Building a real life application in node jsBuilding a real life application in node js
Building a real life application in node jsfakedarren
 
MYSQL DATABASE INTRODUCTION TO JAVASCRIPT.pptx
MYSQL DATABASE INTRODUCTION TO JAVASCRIPT.pptxMYSQL DATABASE INTRODUCTION TO JAVASCRIPT.pptx
MYSQL DATABASE INTRODUCTION TO JAVASCRIPT.pptxArjayBalberan1
 
PHP and Rich Internet Applications
PHP and Rich Internet ApplicationsPHP and Rich Internet Applications
PHP and Rich Internet Applicationselliando dias
 
Pitfalls to Avoid for Cascade Server Newbies by Lisa Hall
Pitfalls to Avoid for Cascade Server Newbies by Lisa HallPitfalls to Avoid for Cascade Server Newbies by Lisa Hall
Pitfalls to Avoid for Cascade Server Newbies by Lisa Hallhannonhill
 
Scaling Complexity in WordPress Enterprise Apps
Scaling Complexity in WordPress Enterprise AppsScaling Complexity in WordPress Enterprise Apps
Scaling Complexity in WordPress Enterprise AppsMike Schinkel
 
php-mysql-tutorial-part-3
php-mysql-tutorial-part-3php-mysql-tutorial-part-3
php-mysql-tutorial-part-3tutorialsruby
 
&lt;b>PHP&lt;/b>/MySQL &lt;b>Tutorial&lt;/b> webmonkey/programming/
&lt;b>PHP&lt;/b>/MySQL &lt;b>Tutorial&lt;/b> webmonkey/programming/&lt;b>PHP&lt;/b>/MySQL &lt;b>Tutorial&lt;/b> webmonkey/programming/
&lt;b>PHP&lt;/b>/MySQL &lt;b>Tutorial&lt;/b> webmonkey/programming/tutorialsruby
 
php-mysql-tutorial-part-3
php-mysql-tutorial-part-3php-mysql-tutorial-part-3
php-mysql-tutorial-part-3tutorialsruby
 

Ähnlich wie Creating Operational Redundancy for Effective Web Data Mining (20)

Assetic (Symfony Live Paris)
Assetic (Symfony Live Paris)Assetic (Symfony Live Paris)
Assetic (Symfony Live Paris)
 
Intro to php
Intro to phpIntro to php
Intro to php
 
Assetic (OSCON)
Assetic (OSCON)Assetic (OSCON)
Assetic (OSCON)
 
Mojolicious, real-time web framework
Mojolicious, real-time web frameworkMojolicious, real-time web framework
Mojolicious, real-time web framework
 
Assetic (Zendcon)
Assetic (Zendcon)Assetic (Zendcon)
Assetic (Zendcon)
 
Introducing Assetic: Asset Management for PHP 5.3
Introducing Assetic: Asset Management for PHP 5.3Introducing Assetic: Asset Management for PHP 5.3
Introducing Assetic: Asset Management for PHP 5.3
 
PHP and Rich Internet Applications
PHP and Rich Internet ApplicationsPHP and Rich Internet Applications
PHP and Rich Internet Applications
 
Introducing Assetic (NYPHP)
Introducing Assetic (NYPHP)Introducing Assetic (NYPHP)
Introducing Assetic (NYPHP)
 
Share point hosted add ins munich
Share point hosted add ins munichShare point hosted add ins munich
Share point hosted add ins munich
 
FamilySearch Reference Client
FamilySearch Reference ClientFamilySearch Reference Client
FamilySearch Reference Client
 
PHP POWERPOINT SLIDES
PHP POWERPOINT SLIDESPHP POWERPOINT SLIDES
PHP POWERPOINT SLIDES
 
How to insert json data into my sql using php
How to insert json data into my sql using phpHow to insert json data into my sql using php
How to insert json data into my sql using php
 
Building a real life application in node js
Building a real life application in node jsBuilding a real life application in node js
Building a real life application in node js
 
MYSQL DATABASE INTRODUCTION TO JAVASCRIPT.pptx
MYSQL DATABASE INTRODUCTION TO JAVASCRIPT.pptxMYSQL DATABASE INTRODUCTION TO JAVASCRIPT.pptx
MYSQL DATABASE INTRODUCTION TO JAVASCRIPT.pptx
 
PHP and Rich Internet Applications
PHP and Rich Internet ApplicationsPHP and Rich Internet Applications
PHP and Rich Internet Applications
 
Pitfalls to Avoid for Cascade Server Newbies by Lisa Hall
Pitfalls to Avoid for Cascade Server Newbies by Lisa HallPitfalls to Avoid for Cascade Server Newbies by Lisa Hall
Pitfalls to Avoid for Cascade Server Newbies by Lisa Hall
 
Scaling Complexity in WordPress Enterprise Apps
Scaling Complexity in WordPress Enterprise AppsScaling Complexity in WordPress Enterprise Apps
Scaling Complexity in WordPress Enterprise Apps
 
php-mysql-tutorial-part-3
php-mysql-tutorial-part-3php-mysql-tutorial-part-3
php-mysql-tutorial-part-3
 
&lt;b>PHP&lt;/b>/MySQL &lt;b>Tutorial&lt;/b> webmonkey/programming/
&lt;b>PHP&lt;/b>/MySQL &lt;b>Tutorial&lt;/b> webmonkey/programming/&lt;b>PHP&lt;/b>/MySQL &lt;b>Tutorial&lt;/b> webmonkey/programming/
&lt;b>PHP&lt;/b>/MySQL &lt;b>Tutorial&lt;/b> webmonkey/programming/
 
php-mysql-tutorial-part-3
php-mysql-tutorial-part-3php-mysql-tutorial-part-3
php-mysql-tutorial-part-3
 

Mehr von Jonathan LeBlanc

JavaScript App Security: Auth and Identity on the Client
JavaScript App Security: Auth and Identity on the ClientJavaScript App Security: Auth and Identity on the Client
JavaScript App Security: Auth and Identity on the ClientJonathan LeBlanc
 
Improving Developer Onboarding Through Intelligent Data Insights
Improving Developer Onboarding Through Intelligent Data InsightsImproving Developer Onboarding Through Intelligent Data Insights
Improving Developer Onboarding Through Intelligent Data InsightsJonathan LeBlanc
 
Better Data with Machine Learning and Serverless
Better Data with Machine Learning and ServerlessBetter Data with Machine Learning and Serverless
Better Data with Machine Learning and ServerlessJonathan LeBlanc
 
Best Practices for Application Development with Box
Best Practices for Application Development with BoxBest Practices for Application Development with Box
Best Practices for Application Development with BoxJonathan LeBlanc
 
Box Platform Developer Workshop
Box Platform Developer WorkshopBox Platform Developer Workshop
Box Platform Developer WorkshopJonathan LeBlanc
 
Modern Cloud Data Security Practices
Modern Cloud Data Security PracticesModern Cloud Data Security Practices
Modern Cloud Data Security PracticesJonathan LeBlanc
 
Understanding Box UI Elements
Understanding Box UI ElementsUnderstanding Box UI Elements
Understanding Box UI ElementsJonathan LeBlanc
 
Understanding Box applications, tokens, and scoping
Understanding Box applications, tokens, and scopingUnderstanding Box applications, tokens, and scoping
Understanding Box applications, tokens, and scopingJonathan LeBlanc
 
The Future of Online Money: Creating Secure Payments Globally
The Future of Online Money: Creating Secure Payments GloballyThe Future of Online Money: Creating Secure Payments Globally
The Future of Online Money: Creating Secure Payments GloballyJonathan LeBlanc
 
Modern API Security with JSON Web Tokens
Modern API Security with JSON Web TokensModern API Security with JSON Web Tokens
Modern API Security with JSON Web TokensJonathan LeBlanc
 
Creating an In-Aisle Purchasing System from Scratch
Creating an In-Aisle Purchasing System from ScratchCreating an In-Aisle Purchasing System from Scratch
Creating an In-Aisle Purchasing System from ScratchJonathan LeBlanc
 
Secure Payments Over Mixed Communication Media
Secure Payments Over Mixed Communication MediaSecure Payments Over Mixed Communication Media
Secure Payments Over Mixed Communication MediaJonathan LeBlanc
 
Protecting the Future of Mobile Payments
Protecting the Future of Mobile PaymentsProtecting the Future of Mobile Payments
Protecting the Future of Mobile PaymentsJonathan LeBlanc
 
Node.js Authentication and Data Security
Node.js Authentication and Data SecurityNode.js Authentication and Data Security
Node.js Authentication and Data SecurityJonathan LeBlanc
 
PHP Identity and Data Security
PHP Identity and Data SecurityPHP Identity and Data Security
PHP Identity and Data SecurityJonathan LeBlanc
 
Secure Payments Over Mixed Communication Media
Secure Payments Over Mixed Communication MediaSecure Payments Over Mixed Communication Media
Secure Payments Over Mixed Communication MediaJonathan LeBlanc
 
Protecting the Future of Mobile Payments
Protecting the Future of Mobile PaymentsProtecting the Future of Mobile Payments
Protecting the Future of Mobile PaymentsJonathan LeBlanc
 
Future of Identity, Data, and Wearable Security
Future of Identity, Data, and Wearable SecurityFuture of Identity, Data, and Wearable Security
Future of Identity, Data, and Wearable SecurityJonathan LeBlanc
 

Mehr von Jonathan LeBlanc (20)

JavaScript App Security: Auth and Identity on the Client
JavaScript App Security: Auth and Identity on the ClientJavaScript App Security: Auth and Identity on the Client
JavaScript App Security: Auth and Identity on the Client
 
Improving Developer Onboarding Through Intelligent Data Insights
Improving Developer Onboarding Through Intelligent Data InsightsImproving Developer Onboarding Through Intelligent Data Insights
Improving Developer Onboarding Through Intelligent Data Insights
 
Better Data with Machine Learning and Serverless
Better Data with Machine Learning and ServerlessBetter Data with Machine Learning and Serverless
Better Data with Machine Learning and Serverless
 
Best Practices for Application Development with Box
Best Practices for Application Development with BoxBest Practices for Application Development with Box
Best Practices for Application Development with Box
 
Box Platform Overview
Box Platform OverviewBox Platform Overview
Box Platform Overview
 
Box Platform Developer Workshop
Box Platform Developer WorkshopBox Platform Developer Workshop
Box Platform Developer Workshop
 
Modern Cloud Data Security Practices
Modern Cloud Data Security PracticesModern Cloud Data Security Practices
Modern Cloud Data Security Practices
 
Box Authentication Types
Box Authentication TypesBox Authentication Types
Box Authentication Types
 
Understanding Box UI Elements
Understanding Box UI ElementsUnderstanding Box UI Elements
Understanding Box UI Elements
 
Understanding Box applications, tokens, and scoping
Understanding Box applications, tokens, and scopingUnderstanding Box applications, tokens, and scoping
Understanding Box applications, tokens, and scoping
 
The Future of Online Money: Creating Secure Payments Globally
The Future of Online Money: Creating Secure Payments GloballyThe Future of Online Money: Creating Secure Payments Globally
The Future of Online Money: Creating Secure Payments Globally
 
Modern API Security with JSON Web Tokens
Modern API Security with JSON Web TokensModern API Security with JSON Web Tokens
Modern API Security with JSON Web Tokens
 
Creating an In-Aisle Purchasing System from Scratch
Creating an In-Aisle Purchasing System from ScratchCreating an In-Aisle Purchasing System from Scratch
Creating an In-Aisle Purchasing System from Scratch
 
Secure Payments Over Mixed Communication Media
Secure Payments Over Mixed Communication MediaSecure Payments Over Mixed Communication Media
Secure Payments Over Mixed Communication Media
 
Protecting the Future of Mobile Payments
Protecting the Future of Mobile PaymentsProtecting the Future of Mobile Payments
Protecting the Future of Mobile Payments
 
Node.js Authentication and Data Security
Node.js Authentication and Data SecurityNode.js Authentication and Data Security
Node.js Authentication and Data Security
 
PHP Identity and Data Security
PHP Identity and Data SecurityPHP Identity and Data Security
PHP Identity and Data Security
 
Secure Payments Over Mixed Communication Media
Secure Payments Over Mixed Communication MediaSecure Payments Over Mixed Communication Media
Secure Payments Over Mixed Communication Media
 
Protecting the Future of Mobile Payments
Protecting the Future of Mobile PaymentsProtecting the Future of Mobile Payments
Protecting the Future of Mobile Payments
 
Future of Identity, Data, and Wearable Security
Future of Identity, Data, and Wearable SecurityFuture of Identity, Data, and Wearable Security
Future of Identity, Data, and Wearable Security
 

Kürzlich hochgeladen

Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 

Kürzlich hochgeladen (20)

Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 

Creating Operational Redundancy for Effective Web Data Mining

  • 1. Using Operational Redundancy Effective Web Data Mining Jonathan LeBlanc Head of Developer Evangelism N.A. (PayPal) Github: http://github.com/jcleblanc Slides: http://slideshare.net/jcleblanc Twitter: @jcleblanc
  • 2. Premise The interactions of a user can be used to personalize their experience
  • 3. Elements of Mining Redundancy Website Data Mining User Emotional State Mining User Interaction Mining
  • 4.
  • 5. Our Subject Material HTML content is poorly structured There are some pretty bad web practices on the interwebz You can’t trust that anything semantically valid will be present
  • 6. How We’ll Capture This Data Start with base linguistics Extend with available extras
  • 7. The Basic Pieces Page Data Scrapey Scrapey Keywords Without all the fluff Weighting Word diets FTW
  • 8. Capture Raw Page Data Semantic data on the web is sucktastic Assume 5 year olds built the sites Language is the key
  • 9. Extract Keywords We now have a big jumble of words. Let’s extract Why is “and” a top word? Stop words = sad panda
  • 10. Weight Keywords All content is not created equal Meta and headers and semantics oh my! This is where we leech off the work of others
  • 11.
  • 12. Questions to Keep in Mind Should I use regex to parse web content? How do users interact with page content? What key identifiers can be monitored to detect interest?
  • 13. Fetching the Data: cURL $req = curl_init($url); $options = array( CURLOPT_URL => $url, CURLOPT_HEADER => $header, CURLOPT_RETURNTRANSFER => true, CURLOPT_FOLLOWLOCATION => true, CURLOPT_AUTOREFERER => true, CURLOPT_TIMEOUT => 15, CURLOPT_MAXREDIRS => 10 ); curl_setopt_array($req, $options);
  • 14. //list of findable / replaceable string characters $find = array('/r/', '/n/', '/ss+/'); $replace = array(' ', ' ', ' '); //perform page content modification $mod_content = preg_replace('#<script(.*?)>(.*?)</ script>#is', '', $page_content); $mod_content = preg_replace('#<style(.*?)>(.*?)</ style>#is', '', $mod_content); $mod_content = strip_tags($mod_content); $mod_content = strtolower($mod_content); $mod_content = preg_replace($find, $replace, $mod_content); $mod_content = trim($mod_content); $mod_content = explode(' ', $mod_content); natcasesort($mod_content);
  • 15. //set up list of stop words and the final found stopped list $common_words = array('a', ..., 'zero'); $searched_words = array(); //extract list of keywords with number of occurrences foreach($mod_content as $word) { $word = trim($word); if(strlen($word) > 2 && !in_array($word, $common_words)){ $searched_words[$word]++; } } arsort($searched_words, SORT_NUMERIC);
  • 16. Scraping Site Meta Data //load scraped page data as a valid DOM document $dom = new DOMDocument(); @$dom->loadHTML($page_content); //scrape title $title = $dom->getElementsByTagName("title"); $title = $title->item(0)->nodeValue;
  • 17. //loop through all found meta tags $metas = $dom->getElementsByTagName("meta"); for ($i = 0; $i < $metas->length; $i++){ $meta = $metas->item($i); if($meta->getAttribute("property")){ if ($meta->getAttribute("property") == "og:description"){ $dataReturn["description"] = $meta->getAttribute("content"); } } else { if($meta->getAttribute("name") == "description"){ $dataReturn["description"] = $meta->getAttribute("content"); } else if($meta->getAttribute("name") == "keywords”){ $dataReturn[”keywords"] = $meta->getAttribute("content"); } } }
  • 18. Weighting Important Data Tags you should care about: meta (include OG), title, description, h1+, header Bonus points for adding in content location modifiers
  • 19. Weighting Important Tags //our keyword weights $weights = array("keywords" => "3.0", "meta" => "2.0", "header1" => "1.5", "header2" => "1.2"); //add modifier here if(strlen($word) > 2 && !in_array($word, $common_words)){ $searched_words[$word]++; }
  • 20. Expanding to Phrases 2-3 adjacent words, making up a direct relevant callout Seems easy right? Just like single words Language gets wonky without stop words
  • 21. Adding in Time Interactions Interaction with a site does not necessarily mean interest in it Time needs to also include an interaction component Gift buying seasons see interest variations
  • 22. Grouping Using Commonality Interests User A Interests User B Interests Common
  • 23.
  • 24. Using Color Theory Products with a feel-good message Happiness, energy, encouragement Health care (but not food!) Relatable, calm, friendly, peace, security Startups / innovative products Creativity, imagination Auction sites (but not sales sites!) Passion, stimulation, excitement, power
  • 26. The CSS Service Engine lesscss.org sass-lang.com learnboost.github.com/stylus
  • 28. The Basics of a Design Engine //create new LESS object $less= new lessc(); //compile LESS code to CSS $less->checkedCompile( '/path/styles.less', 'path/styles.css'); //create new CSS file and return new file link echo "<link rel='stylesheet' href='http://path/styles.css' type='text/css' />";
  • 29. Passing Variables into LESSPHP //create a new LESS object $less = new lessc(); //set the variables $less->setVariables(array( 'color' => 'red', 'base' => '960px' )); //compile LESS into PHP and unset variables echo $less->compile(".magic { color: @color; width: @base - 200; }"); $less->unsetVariable('color');
  • 30. Implementing Color Functions Lighten / Darken Saturate / Desaturate Adjust HueMix Colors
  • 31. Managing Irrelevant Content Remove / hide content based on user profile and state
  • 32. Managing Irrelevant Content //variables passed into LESS compilation $less->setVariables(array( "percent" => "80%", )); //LESS template .highlight{ @bg-color: "#464646”; @font-color: "#eee"; background-color: fade(@bg-color, @percent); color: fade(@font-color, @percent); }
  • 33. Traits of the Bored Distraction Repetition Tiredness Reasons for Boredom Lack of interest Readiness Acting on Disinterest / Boredom
  • 34. Highlighting on Agitated Behavior Highlight relevant content to reduce agitated behavior
  • 35. Acting Upon User Queues $less->setVariables(array( "percent" => "100%", "size-mod" => "2" )); Variables passed into LESS script
  • 36. Acting Upon User Queues .highlight{ @bg-calm: "blue"; @bg-action: "red"; @base-font: "14px"; background-color: mix(@bg-calm, @bg-action, @percent ); font-size: @size-mod + @base-font; } LESS script logic for color / size variations
  • 37. Interaction and Emotion Plugin jQuery Behavior Miner by Cedric Dugas https://github.com/posa bsolute/jquery- behavior-miner
  • 38. In the End… What a person is interested in What a person is doing What their emotional state is
  • 39. http://slideshare.com/jcleblanc Thank You! Questions? Jonathan LeBlanc Head of Developer Evangelism N.A. (PayPal) Github: http://github.com/jcleblanc Slides: http://slideshare.net/jcleblanc Twitter: @jcleblanc

Hinweis der Redaktion

  1. The semantic data movement was an abysmal failure. Strip down the site to its basic components – the language and words used on the page
  2. Open graph protocol
  3. This is why I prefer using cURL: customization of requests, timeouts, allows redirects, etc.
  4. Stripping irrelevant data
  5. Scraping site keywords
  6. You can also play with the fade in / fade out to modify the lightness and highlighting