(Mobile Web Applications track) "Profiling User Activities with Minimal Traffic Traces" - Tiep Mai, Deepak Ajwani and Alessandra Sala
COPYRIGHT © 2011 ALCATEL-LUCENT. ALL RIGHTS RESERVED.
ALCATEL-LUCENT — INTERNAL PROPRIETARY — USE PURSUANT TO COMPANY INSTRUCTION
Profiling User Activities With Minimal Traffic Traces
Tiep Mai, Deepak Ajwani and Alessandra Sala
Bell Laboratories, Ireland
Outline
• Micro-action burst decomposition
• Representative URL selection
End-to-End View of the Telecom Network
[Diagram labels: Mobile user; Web services; Client-side data; Server-side data; Telecom data]
• Telecom data: huge, but with limited features
• Empower telecom data analysis with this data
Providing Personalized Services
• Personalized services require user activity profiling
- Traditional approaches rely on features extracted from rich data sources
- Server-side data: full URLs of visited pages, page categories, transaction data, search queries, click-through rate, etc.
- Client-side data: full URLs (cookies), application data (web browsing), etc.
- Network-side data: full URLs, HTTP packet content, etc.
• Our goal: provide medium-grained user profiling from a limited, privacy-preserving dataset for a large user pool
Mobile Web Traces
User Behavioral Analysis from Timestamped Data
• Mobile traces provide valuable insights into user behavior
- Critical to enable service personalization and enrich the user's online experience
• Complete mobile web traces risk revealing sensitive information
- http://finance.yahoo.com/q?s=BAC : Bank of America Corp. stock price
- https://www.google.ie/#q=postnatal+depression : sensitive health condition
- http://www.amazon.com/Dell-Inspiron-i15R-15-6-inch-Laptop/dp/B009US2BKA : specific purchased product
Removing Sensitive Data from URL Traces
• Telecom operators are subject to restrictive privacy legislation
• Conservative approach to sharing data
- Anonymized, truncated and sampled data
- Traces from 10,000 anonymized users over 30 days, i.e. more than 130 million records
• Focus on the dataset of truncated URLs or IP addresses
• Resulting data:
1. Truncated: www.amazon.com/Dell-Inspiron-i15R-15-6-inch-Laptop/dp/B009US2BKA is reduced to www.amazon.com
2. Noisy: unintentional web traffic such as advertisements, web analytics, etc.
Quality of behavior analysis depends on effectively separating unintentional traffic from user activities in truncated URLs
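The truncation step can be illustrated with a minimal sketch. It assumes that truncation keeps only the host part of each URL and drops path, query and fragment; the slides do not spell out the exact rule, and the helper name `truncate_url` is hypothetical.

```python
from urllib.parse import urlsplit


def truncate_url(url: str) -> str:
    """Keep only the host of a URL (assumed truncation rule, not confirmed by the slides)."""
    # urlsplit only recognises the host when a scheme or a leading "//" is present.
    if "://" not in url:
        url = "//" + url
    return urlsplit(url).hostname or ""


if __name__ == "__main__":
    for u in (
        "http://finance.yahoo.com/q?s=BAC",
        "https://www.google.ie/#q=postnatal+depression",
        "www.amazon.com/Dell-Inspiron-i15R-15-6-inch-Laptop/dp/B009US2BKA",
    ):
        print(truncate_url(u))
```

After truncation the stock symbol, the search query and the product identifier are no longer present, which is why the remaining analysis has to work at the domain level.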
Web Browsing Behaviors Across Time & Users
• Collection of web traces of several URL types
• Aim: filter out traces that do not represent explicit user action
- Identify features to drive detection of unintentional traces
- Validate across different users
• Diversity of web domains:
[Figure, two panels: download size (log scale, 1e−03 to 1e+05) vs. time (secs), legend: Domain. Panel 1 (0 to 75 secs): high diversity in user activities. Panel 2 (0 to 1200 secs, legend: Domain (gaming)): high diversity across users.]
Methodology
• User activities as a collection of micro user actions, i.e. bursts
- Web clicks, chat replies
• Assumption: each burst represents an atomic user activity
- Combination of intended and unintended web traffic
• Methodology
1. Burst decomposition
2. Activity extraction:
- Domain classification: leverage specialized features of domain appearance in the burst
- Online representative URL selection and activity association
Increases prediction accuracy by 20%
Burst Decomposition: Statistical Parametric Distribution Fitting
• Goal: decompose the web trace back into its constituent data bursts
• A threshold on packet inter-arrival time (IAT) is needed to separate traces into bursts
• Study the inter-arrival time distribution
• No single parametric distribution matches most user traces
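Once an IAT threshold is available, splitting a trace into bursts is mechanical. The following is a minimal sketch under the assumption that a new burst starts whenever the gap to the previous record exceeds the threshold; the function name, the toy timestamps and the 2-second threshold are illustrative and not taken from the slides.

```python
from typing import List


def split_into_bursts(timestamps: List[float], iat_threshold: float) -> List[List[float]]:
    """Group record timestamps (seconds) into bursts separated by gaps > iat_threshold."""
    bursts: List[List[float]] = []
    for t in sorted(timestamps):
        # Same burst if the gap to the previous record is within the threshold.
        if bursts and t - bursts[-1][-1] <= iat_threshold:
            bursts[-1].append(t)
        else:
            bursts.append([t])
    return bursts


if __name__ == "__main__":
    trace = [0.0, 0.2, 0.5, 10.0, 10.1, 30.0]  # toy timestamps, not real data
    print(split_into_bursts(trace, iat_threshold=2.0))
```

The difficult part, choosing the threshold itself, is what the slides address next.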
Burst Decomposition Algorithm
• Robust burst decomposition algorithm that is independent of the distribution shape
• Starting from the smallest inter-arrival time value, find the value at which the probability gained by extending the decaying point further is insignificant compared to the probability accumulated up to that point
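One possible reading of this rule can be sketched as follows. This is an interpretation, not the authors' exact algorithm: the candidate threshold grows geometrically from the smallest observed IAT, and the search stops when the extra empirical probability mass gained by the next step is at most a fraction `eps` of the mass already accumulated. The parameters `growth` and `eps` and the function name are assumptions introduced for illustration.

```python
from bisect import bisect_right
from typing import Sequence


def select_iat_threshold(iats: Sequence[float], growth: float = 1.5, eps: float = 0.01) -> float:
    """Distribution-free threshold search on the empirical CDF of inter-arrival times."""
    data = sorted(x for x in iats if x > 0)
    if not data:
        raise ValueError("need at least one positive inter-arrival time")
    n = len(data)
    t = data[0]  # start from the smallest observed value
    while t < data[-1]:
        accumulated = bisect_right(data, t) / n                      # P(IAT <= t)
        extended = bisect_right(data, t * growth) / n - accumulated  # mass gained by one more step
        if extended <= eps * accumulated:
            return t
        t *= growth
    return data[-1]


if __name__ == "__main__":
    # Toy data: many short gaps inside bursts, a few long gaps between bursts.
    short = [0.05 + k * 0.95 / 199 for k in range(200)]
    long_gaps = [60.0 + k * 60.0 for k in range(10)]
    print(select_iat_threshold(short + long_gaps))
```

On the toy data the returned value falls in the empty region between the short and the long gaps, so the result does not depend on fitting any particular parametric family.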
Domain Classification: Initial Insight
• Goal: automatically identify URLs representing user activities
• Measurements are aggregated over all users for each domain
- Record-level measurements
- Burst-level measurements
Domain Classification: Methodology
• Logistic regression
• Validation error, AIC and BIC
• Two discriminating features
- o_{b,j=1} − u_{b,j=1} (≈ 22.87): probability that a domain comes first in bursts with more than one unique domain
- u_{b,j=2} (≈ −9.51): probability that a domain appears in bursts with two unique domains
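How a two-feature logistic model turns these measurements into a score can be sketched as below. The sketch assumes that 22.87 and −9.51 are the fitted regression coefficients of the two features and that the positive class means "domain represents a user activity"; the slides do not report an intercept, so the value 0.0 used here is purely hypothetical, as are the function and argument names.

```python
import math

COEF_FIRST_IN_BURST = 22.87   # assumed coefficient of o_{b,j=1} - u_{b,j=1}
COEF_TWO_DOMAIN_BURST = -9.51  # assumed coefficient of u_{b,j=2}
INTERCEPT = 0.0                # hypothetical: not given in the slides


def user_activity_probability(first_feature: float, two_domain_feature: float,
                              intercept: float = INTERCEPT) -> float:
    """Logistic score for a domain, given its two aggregated burst-level features."""
    z = (intercept
         + COEF_FIRST_IN_BURST * first_feature
         + COEF_TWO_DOMAIN_BURST * two_domain_feature)
    return 1.0 / (1.0 + math.exp(-z))


if __name__ == "__main__":
    # Illustrative feature values only, not measurements from the dataset.
    print(user_activity_probability(0.2, 0.1))
    print(user_activity_probability(-0.1, 0.3))
```

Under these assumptions a domain that tends to open multi-domain bursts receives a higher score, while a domain that mostly appears in two-domain bursts is pushed towards the unintentional class.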
Trade-offs of Domain Classification Results
• Trade-off between accuracy, sensitivity, precision and specificity
- Maximizing accuracy
- Maximizing sensitivity and specificity
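The two objectives can lead to different decision thresholds on the classifier score. The sketch below computes the four metrics from a confusion matrix and picks a threshold for a given objective; using the sum of sensitivity and specificity as the second objective is an assumption, and the scores and labels are toy values rather than results from the study.

```python
from typing import Callable, Dict, Sequence, Tuple


def confusion(scores: Sequence[float], labels: Sequence[int], threshold: float) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn) when predicting positive for score >= threshold."""
    tp = fp = tn = fn = 0
    for s, y in zip(scores, labels):
        if s >= threshold:
            if y == 1:
                tp += 1
            else:
                fp += 1
        elif y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
    }


def best_threshold(scores: Sequence[float], labels: Sequence[int],
                   objective: Callable[[Dict[str, float]], float]) -> float:
    """Pick the candidate threshold (one per distinct score) that maximizes the objective."""
    return max(sorted(set(scores)),
               key=lambda t: objective(metrics(*confusion(scores, labels, t))))


if __name__ == "__main__":
    scores = [0.1, 0.2, 0.3, 0.35, 0.4, 0.45, 0.6, 0.9]  # toy classifier scores
    labels = [0, 0, 0, 0, 1, 0, 0, 1]                     # toy ground truth
    print(best_threshold(scores, labels, lambda m: m["accuracy"]))
    print(best_threshold(scores, labels, lambda m: m["sensitivity"] + m["specificity"]))
```

With imbalanced classes, maximizing accuracy tends to favor the majority class, whereas maximizing sensitivity plus specificity weights both classes equally.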
Future Work
• Mapping domains to activities (reading, shopping, browsing) and identifying user activities online
• Activity query and recommendation
• Correlating truncated URL data with user location data
- Spatio-temporal study of user activities
Conclusions and Remarks
• Telecom data: huge but limited; strict privacy regulations
• URL trace data:
- Privacy preservation through truncation
- Noisy data
- Burst property of micro user actions
• Goal: perform activity extraction and behavior analysis for a large user pool with limited and noisy data
• Method:
- Burst decomposition and feature extraction
- Representative URL identification and activity extraction
Medium-grained behavior analysis is feasible with limited, noisy, privacy-preserving URL data
Thank you
• Questions?