SlideShare ist ein Scribd-Unternehmen logo
1 von 40
Jiang Zhu and Joy Y. Zhang
Carnegie Mellon University




August 2nd, 2011

                             1
• Monitor and track user mobility behavior in WLAN environment
 using RSS trace
• Convert mobility traces and other context information to Behavior
 Text representations
• Build n-gram language model with behavior text and use it for
 anomaly detection to discover loss or theft events




                                                                      2
3
60%                                                                                             Miami
                                                                                                       New York
       50%
                                                                                                       Los Angeles
       40%                                                                                             Phoenix

        30%                                                                                            Sacramento
                                                                                                       Chicago
        20%
                                                                                                       Dallas
        10%                                                                                            Houston
          0%                                                                                           Philadelphia
                                                                                                       Boston
                                Mobile Device Loss or theft                                            San
                                                                                                       Francisco
                                   frequent visited

Strategy One Survey conducted among a U.S. sample of 3017 adults age 18 years older in September 21-
   28, 2010, with an oversample in the top 20 cities (based on population).

                                                                                                                      4
Business and personal               • CAPEX loss
applications running together       • Data loss
Corporate messaging, email on       • Recovery effort
personal devices
                                    •Loss of business
Intranet wireless access on
personal devices                    ―The 329 organizations polled had
Personal finance and banking on     collectively lost more than 86,000
                                    devices … with average cost of lost
corporate devices
                                    data at $49,246 per device, worth
Mobile payments and credentials     $2.1 billion or $6.4 million per
                                    organization.

                                  "The Billion Dollar Lost-Laptop Study," conducted by Intel
                                     Corporation and the Ponemon Institute, analyzed the
                                     scope and circumstances of missing laptop PCs.



                                                                                               5
Detection
To discover the                        Mitigation
loss and theft early
enough to initiate                  Revoke access to
other steps                                  sensitive
                                    data, applications
                                           or services
   Notification
    Notify
    owners, administra    Recovery
    tors or authority
                          Rescue device
                         Recover/restore
                                    data




                                                         6
• Mobilityas Behavior
   • Mobility modeling is a well studied research area
   • Can be measured and tracked: Wi-Fi, GPS, Cellular, etc
   • Other contextual information can be combined: Bluetooth, accelerometer, etc

• Other motivating applications
   • Healthcare: Inpatient telemetry.
   • Education: Young children monitoring
   • Law reinforcement: Inmates monitoring and control




                                                                                   7
8
• Past and current location trigger future locations

                                              Hallway A                              Office


            Break
            Room


                                             Hallway B                             Bathroom



• User mobility as short sequence of locations
                                                                                                  [1] [2]
• ―Language as action‖: Language vs. streams of sensor data
    • Composing elements: sensor data vs. words in corpus
    • Sequence structure: local dependency vs. ―grammar‖
 [1] Aipperspach, et al, ―Modeling Human Behavior from Simple sensors in the Home‖, PerCom 2006
 [2] Buthpitya, et al, ―n—gram Geo-Trace Modeling‖, Pervasive 2011


                                                                                                            9
• User location at time t depends only on the last n-1 locations

• Sequence of locations can be predicted by n consecutive location
 in the past


• Maximum Likelihood Estimation from training data by counting:



• MLE assign zero probability to unseen n-grams
   Incorporate smoothing function (Katz)
    Discount probability for observed grams
    Reserve probability for unseen grams




                                                                     10
• Long distance dependency of words in sentences
   • tri-grams for ―I hit the tennis ball‖: ―I hit the‖, ―hit the tennis‖ ―the tennis ball‖
   • ―I hit ball‖ not captured

• Future pseudo location depends on locations far in the past.
 Intermediate behavior has little relevance or influence
   • Noise in the data collected: ―ping-pong‖ effect in WLAN
   association, interference, sampling errors, etc
   • Model size




                                                                                              11
Preprocessing   Anomaly
                          Detection


 RSS                      N-gram
 Trace                    Model

Sensing




            Anomaly Y/N

                                      12
• Collect RSS of the devices on multiple WAPs with timestamps

• Aggregate and serialize into time series of RSS vectors




* Lin, et al ―WASP: An enhanced indoor location algorithm for a congested wi-fi environment‖

                                                                                               13
• Dimensionality in RSS vector – too fine for modeling

• Proximity in location results in similar RSS vector

• K-means clustering algorithm with distance function similar to
   WASP[1] and each cluster assigned a pseudo location label




[1] Lin, et al ―WASP: An enhanced indoor location algorithm for a congested wi-fi environment‖

                                                                                                 14
• Repeating location labels dominate n-gram statistics

• Extracting ―duration‖ by counting repeating labels

• Only append ―duration‖ label if Mutual Information of locationand
 duration is high
   • Dependency - ―Conference Room‖ + ―1 hours‖ infer ―Meeting‖
   • Personal - ―Professor’s Office‖ + ―10 minutes‖ infer ―Student’s quick chat‖

• Segment behavior text sequences based on time-of-day

  • Behavior follows routine and agenda

  • Varying among users

  • Cut the boundary based on activity level



                                                                                   15
Extract   Preprocessing        Anomaly
          Pseudo                          Detection
          Location

                          Behavior Text
 RSS                                      N-gram
                           Generation
 Trace                                    Model
                             Fusion

           Extract
Sensing
           Other
          Features




                       Anomaly Y/N

                                                      16
• Feed sequence of the past locations in a sliding window of size N
 to n-gram model for testing
• For a testing sequence of pseudo locations



• Estimate the average log probability this sequence is generated
 from the n-gram or skipped n-gram model




• If this likelihood drops below a threshold, flag an anomaly alert


                                                                      17
0. 8

                                0. 7
Aver age Log Pr obabi l i t y




                                0. 6

                                0. 5

                                0. 4
                                       C             D           A
                                0. 3

                                0. 2
                                       Log Probility                   B
                                       Low Threshold
                                       High Threshold
                                0. 1

                                  0
                                              Sl i di ng W ndow Posi t i on
                                                          i



                                                                              18
Extract   Preprocessing        Anomaly
          Pseudo                          Detection
          Location

                          Behavior Text
 RSS                                      N-gram
                           Generation
 Trace                                    Model
                             Fusion

           Extract
Sensing
           Other
          Features

                              Threshold      >



                       Anomaly Y/N

                                                      19
20
Dataset
                                     • RSS vector clustering
Users              40
                                     • Run small subset trace with
                   Cisco SJC 14 1F
Location
                   Alpha networks
                                      different K and evaluate
                                      clustering performance by
RSS
                   13 sec             average distance to centroids
sampling rate
Period             5 days            • K = 3X #WAPs has the best
                                      trade-offs
Number of WAPs 87
                                     • Yield ~260 pseudo locations
                   Cisco Aironet
Device
                   1500 + MSE
Dataset Size       3.2 mil points


                                                                      21
• Testing samples
   Positive sample: simulated anomaly by splicing traces from two different users
   Negative sample: trace from ―owner‖




                                                                                    22
• Train n-gram models with 8 hour data

• Continuous 5-gram model and Skipped 3-gram with skipping
 factor k=2 result in similar accuracy ~ 60%
   • Model complexity: k-order reduction
   • Skip factor K is data dependent: particular scenarios in our data set: office
   with hallways and corridors
   • Further investigation needed to find the optimal K.

• Replacing repeating labels with duration feature improve the
 model
   Before collapsing, 5-gram statistics are dominated by several sequences with
   long repeating locations. Top 200 grams are repeating labels
   After collapsing, 5-gram statistics are well distributed

• Time-of-day has only marginal improvement, <1%

                                                                                     23
1
                                   0.9
                                   0.8
              True Positive Rate




                                   0.7
                                   0.6
                                   0.5
                                   0.4
                                   0.3
                                   0.2                                                     Data Size (12 Hrs)
                                   0.1                                                     Data Size (8 Hrs)

                                    0
                                         0   0.1   0.2   0.3    0.4    0.5   0.6     0.7      0.8      0.9      1
                                                               False Positive Rate
Source information is set at 12 points.

                                                                                                                    24
1

                     0.9

                     0.8

                     0.7

                     0.6
          Accuracy




                     0.5

                     0.4

                     0.3
                                                                         Data size (4hr)
                     0.2
                                                                         Data size (8hr)
                     0.1                                                 Data size (12hr)
                      0
                           0   1     2    3   4      5      6    7   8        9       10
                                                  n-gram order
Source information is set at 12 points.

                                                                                            25
26
Extract   Preprocessing        Anomaly
                      Pseudo                          Detection
                      Location

                                      Behavior Text
    RSS                                               N-gram
                                       Generation
    Trace                                             Model
                                         Fusion

                       Extract
   Sensing
                       Other
                      Features

                                          Threshold      >



                                   Anomaly Y/N
• Experiments to discover loss or theft event through anomaly
 detection with 70~80% accuracy with only 8 hours of training data
                                                                  27
Thank you.

              And special thanks to our sponsors
                       CyLab Mobility Research Center
             Cisco Systems Inc.                 Army
                                      Research Office
29
30
31
32
• Extract Mobility model from real trace in WLAN environment
                                                                                                         [1]



    • Extract mobility tracks, duration from WLAN association records
    • Analyze mobility characteristics: pause time, speed, direction, destination
    region and their distributions
    • Build empirical model to generate synthetic trace

• Steady state and transient behavior can be modeled with Semi-
                            [2]
  Markov model
    Transition probability matrix and sojourn time distribution

• Language model to model behavior from sensors in home
                                                                                                   [3]



    Show support on similarity between language and behavior
    Smoothed n-gram model to make single-step prediction on binary sensor
    readings from smarthome
 [1] Kim et al, ―Extract a Mobility model from Real User Traces, INFORCOM 2006
 [2] Lee and Hou, ―Modeling Steady-State and Transient Behaviors of User Mobility‖, MobiHoc 2006
 [3] Aipperspach, et al, ―Modeling Human Behavior from Simple sensors in the Home‖, PerCom 2006

                                                                                                               33
• Overhead and lack    • Model complexity    • It is straightforward
  of granularity in     and computational      to convert binary
  inferring user        overhead not           sensor data to
  location and pause    suitable for real      behavior text for
  time from WLAN        time application       LM-based
  association           [Lee’06]               analysis.[Aipp’06]
  records[Kim’06]
                       • Simple and cost-    • Heterogeneous
• Fine-grain, higher    effective model to     multi-valued
  dimension trace       capture mobility       sensory data is
  data to model         reducing ping-pong     hard to convert to a
  mobility behavior,    effects                single-dimension
  such as RSS                                  behavior text
  beacons trace



                                                                    34
35
• Calculate coordinates for each RSS vector using ―Indoor location‖
   algorithm[1] and generate hot region plot




[1] Lin, et al ―WASP: An enhanced indoor location algorithm for a congested wi-fi environment‖

                                                                                                 36
• Select 10 users with the least cross entropy


                                                 37
• Help Cisco to adopt this model to Mobility Service Engine

• Heterogeneous sensor data fusion
   Network traffic patterns from wireless controllers
   Applications, Memory and battery status
   GPS, accelerometers, gyroscope, temperature, etc

• Advanced Model
   Leverage the internal factorized relationships among various sensors
    • Factor Language Model

• More Applications
   Prediction: resource allocation, energy saving, personalized services
   Anomaly detection: adaptive authentication, patient telemetry




                                                                           38
39
• Confirm similarity between language and behavior

• Multi-dimension to single dimension and n-gram: low complexity
 but good results
• Potential problems:
   •Dimensionality reduction to 1-D to use language approach in modeling may
   cause loss of the relationship among multi-dimensional data

        Sensor 1

        Sensor 2

                                                    State

   •Skipped n-gram approach is dependent on the data and may only have
   marginal improvement or even worse results.

                                                                               40

Weitere ähnliche Inhalte

Ähnlich wie Icccn2011 jiang-0802

Context is King: AR, AI, Salience, and the Constant Next Scenario
Context is King: AR, AI, Salience, and the Constant Next ScenarioContext is King: AR, AI, Salience, and the Constant Next Scenario
Context is King: AR, AI, Salience, and the Constant Next ScenarioClark Dodsworth
 
OSS Presentation Keynote by Hal Stern
OSS Presentation Keynote by Hal SternOSS Presentation Keynote by Hal Stern
OSS Presentation Keynote by Hal SternOpenStorageSummit
 
Mobile oxford open source junction 5 july 2011
Mobile oxford   open source junction 5 july 2011Mobile oxford   open source junction 5 july 2011
Mobile oxford open source junction 5 july 2011Tim Fernando
 
A Context Aware Mobile Social Web
A Context Aware Mobile Social WebA Context Aware Mobile Social Web
A Context Aware Mobile Social Webwasvel
 
MoMIE research overview
MoMIE research overviewMoMIE research overview
MoMIE research overviewTimo Smura
 
Uncovering Remote Peering Interconnections at IXPs
Uncovering Remote Peering Interconnections at IXPsUncovering Remote Peering Interconnections at IXPs
Uncovering Remote Peering Interconnections at IXPsAPNIC
 
From Context-awareness to Human Behavior Patterns
From Context-awareness to Human Behavior PatternsFrom Context-awareness to Human Behavior Patterns
From Context-awareness to Human Behavior PatternsVille Antila
 
An Architecture for Privacy-Sensitive Ubiquitous Computing at Mobisys 2004
An Architecture for Privacy-Sensitive Ubiquitous Computing at Mobisys 2004An Architecture for Privacy-Sensitive Ubiquitous Computing at Mobisys 2004
An Architecture for Privacy-Sensitive Ubiquitous Computing at Mobisys 2004Jason Hong
 
Cps innovation lab kolkata iiest
Cps innovation lab kolkata iiestCps innovation lab kolkata iiest
Cps innovation lab kolkata iiestArpan Pal
 
MindTrek2011 - ContextCapture: Context-based Awareness Cues in Status Updates
MindTrek2011 - ContextCapture: Context-based Awareness Cues in Status UpdatesMindTrek2011 - ContextCapture: Context-based Awareness Cues in Status Updates
MindTrek2011 - ContextCapture: Context-based Awareness Cues in Status UpdatesVille Antila
 
Mobile Oxford - Open Source Junction 29 March 2011
Mobile Oxford - Open Source Junction 29 March 2011Mobile Oxford - Open Source Junction 29 March 2011
Mobile Oxford - Open Source Junction 29 March 2011Tim Fernando
 
Jonathan Lenaghan, VP of Science and Technology, PlaceIQ at MLconf ATL 2016
Jonathan Lenaghan, VP of Science and Technology, PlaceIQ at MLconf ATL 2016Jonathan Lenaghan, VP of Science and Technology, PlaceIQ at MLconf ATL 2016
Jonathan Lenaghan, VP of Science and Technology, PlaceIQ at MLconf ATL 2016MLconf
 
Network Driven Behaviour Modelling for Designing User Centred IoT Services
 Network Driven Behaviour Modelling for Designing User Centred IoT Services Network Driven Behaviour Modelling for Designing User Centred IoT Services
Network Driven Behaviour Modelling for Designing User Centred IoT ServicesFahim Kawsar
 
CarolinaCon Presentation on Streaming Analytics
CarolinaCon Presentation on Streaming AnalyticsCarolinaCon Presentation on Streaming Analytics
CarolinaCon Presentation on Streaming AnalyticsJohn Eberhardt
 
MobiSys Group Presentation
MobiSys Group PresentationMobiSys Group Presentation
MobiSys Group PresentationNeal Lathia
 
Sense networks
Sense networksSense networks
Sense networksBen Allen
 
Review 1 부분5
Review 1 부분5Review 1 부분5
Review 1 부분5희범 구
 

Ähnlich wie Icccn2011 jiang-0802 (20)

Context is King: AR, AI, Salience, and the Constant Next Scenario
Context is King: AR, AI, Salience, and the Constant Next ScenarioContext is King: AR, AI, Salience, and the Constant Next Scenario
Context is King: AR, AI, Salience, and the Constant Next Scenario
 
OSS Presentation Keynote by Hal Stern
OSS Presentation Keynote by Hal SternOSS Presentation Keynote by Hal Stern
OSS Presentation Keynote by Hal Stern
 
Radhika Thesis
Radhika ThesisRadhika Thesis
Radhika Thesis
 
Mobile oxford open source junction 5 july 2011
Mobile oxford   open source junction 5 july 2011Mobile oxford   open source junction 5 july 2011
Mobile oxford open source junction 5 july 2011
 
A Context Aware Mobile Social Web
A Context Aware Mobile Social WebA Context Aware Mobile Social Web
A Context Aware Mobile Social Web
 
MoMIE research overview
MoMIE research overviewMoMIE research overview
MoMIE research overview
 
Uncovering Remote Peering Interconnections at IXPs
Uncovering Remote Peering Interconnections at IXPsUncovering Remote Peering Interconnections at IXPs
Uncovering Remote Peering Interconnections at IXPs
 
From Context-awareness to Human Behavior Patterns
From Context-awareness to Human Behavior PatternsFrom Context-awareness to Human Behavior Patterns
From Context-awareness to Human Behavior Patterns
 
An Architecture for Privacy-Sensitive Ubiquitous Computing at Mobisys 2004
An Architecture for Privacy-Sensitive Ubiquitous Computing at Mobisys 2004An Architecture for Privacy-Sensitive Ubiquitous Computing at Mobisys 2004
An Architecture for Privacy-Sensitive Ubiquitous Computing at Mobisys 2004
 
Cps innovation lab kolkata iiest
Cps innovation lab kolkata iiestCps innovation lab kolkata iiest
Cps innovation lab kolkata iiest
 
OpenSense
OpenSenseOpenSense
OpenSense
 
MindTrek2011 - ContextCapture: Context-based Awareness Cues in Status Updates
MindTrek2011 - ContextCapture: Context-based Awareness Cues in Status UpdatesMindTrek2011 - ContextCapture: Context-based Awareness Cues in Status Updates
MindTrek2011 - ContextCapture: Context-based Awareness Cues in Status Updates
 
Mobile Oxford - Open Source Junction 29 March 2011
Mobile Oxford - Open Source Junction 29 March 2011Mobile Oxford - Open Source Junction 29 March 2011
Mobile Oxford - Open Source Junction 29 March 2011
 
Jonathan Lenaghan, VP of Science and Technology, PlaceIQ at MLconf ATL 2016
Jonathan Lenaghan, VP of Science and Technology, PlaceIQ at MLconf ATL 2016Jonathan Lenaghan, VP of Science and Technology, PlaceIQ at MLconf ATL 2016
Jonathan Lenaghan, VP of Science and Technology, PlaceIQ at MLconf ATL 2016
 
Network Driven Behaviour Modelling for Designing User Centred IoT Services
 Network Driven Behaviour Modelling for Designing User Centred IoT Services Network Driven Behaviour Modelling for Designing User Centred IoT Services
Network Driven Behaviour Modelling for Designing User Centred IoT Services
 
CarolinaCon Presentation on Streaming Analytics
CarolinaCon Presentation on Streaming AnalyticsCarolinaCon Presentation on Streaming Analytics
CarolinaCon Presentation on Streaming Analytics
 
MobiSys Group Presentation
MobiSys Group PresentationMobiSys Group Presentation
MobiSys Group Presentation
 
Pervasive Computing
Pervasive ComputingPervasive Computing
Pervasive Computing
 
Sense networks
Sense networksSense networks
Sense networks
 
Review 1 부분5
Review 1 부분5Review 1 부분5
Review 1 부분5
 

Kürzlich hochgeladen

Machine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfMachine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfAijun Zhang
 
Linked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesLinked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesDavid Newbury
 
Nanopower In Semiconductor Industry.pdf
Nanopower  In Semiconductor Industry.pdfNanopower  In Semiconductor Industry.pdf
Nanopower In Semiconductor Industry.pdfPedro Manuel
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsSeth Reyes
 
Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1DianaGray10
 
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online CollaborationCOMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online Collaborationbruanjhuli
 
Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Adtran
 
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostKubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostMatt Ray
 
AI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarAI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarPrecisely
 
Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024D Cloud Solutions
 
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesAI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesMd Hossain Ali
 
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...DianaGray10
 
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAAnypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAshyamraj55
 
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IES VE
 
Introduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxIntroduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxMatsuo Lab
 
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1DianaGray10
 
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDEADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDELiveplex
 
Designing A Time bound resource download URL
Designing A Time bound resource download URLDesigning A Time bound resource download URL
Designing A Time bound resource download URLRuncy Oommen
 

Kürzlich hochgeladen (20)

Machine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfMachine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdf
 
Linked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesLinked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond Ontologies
 
Nanopower In Semiconductor Industry.pdf
Nanopower  In Semiconductor Industry.pdfNanopower  In Semiconductor Industry.pdf
Nanopower In Semiconductor Industry.pdf
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and Hazards
 
Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1
 
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online CollaborationCOMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
 
Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™
 
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostKubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
 
AI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarAI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity Webinar
 
20230104 - machine vision
20230104 - machine vision20230104 - machine vision
20230104 - machine vision
 
Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024
 
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesAI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
 
201610817 - edge part1
201610817 - edge part1201610817 - edge part1
201610817 - edge part1
 
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
 
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAAnypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
 
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
 
Introduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxIntroduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptx
 
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
 
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDEADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
 
Designing A Time bound resource download URL
Designing A Time bound resource download URLDesigning A Time bound resource download URL
Designing A Time bound resource download URL
 

Icccn2011 jiang-0802

  • 1. Jiang Zhu and Joy Y. Zhang Carnegie Mellon University August 2nd, 2011 1
  • 2. • Monitor and track user mobility behavior in WLAN environment using RSS trace • Convert mobility traces and other context information to Behavior Text representations • Build n-gram language model with behavior text and use it for anomaly detection to discover loss or theft events 2
  • 3. 3
  • 4. 60% Miami New York 50% Los Angeles 40% Phoenix 30% Sacramento Chicago 20% Dallas 10% Houston 0% Philadelphia Boston Mobile Device Loss or theft San Francisco frequent visited Strategy One Survey conducted among a U.S. sample of 3017 adults age 18 years older in September 21- 28, 2010, with an oversample in the top 20 cities (based on population). 4
  • 5. Business and personal • CAPEX loss applications running together • Data loss Corporate messaging, email on • Recovery effort personal devices •Loss of business Intranet wireless access on personal devices ―The 329 organizations polled had Personal finance and banking on collectively lost more than 86,000 devices … with average cost of lost corporate devices data at $49,246 per device, worth Mobile payments and credentials $2.1 billion or $6.4 million per organization. "The Billion Dollar Lost-Laptop Study," conducted by Intel Corporation and the Ponemon Institute, analyzed the scope and circumstances of missing laptop PCs. 5
  • 6. Detection To discover the Mitigation loss and theft early enough to initiate Revoke access to other steps sensitive data, applications or services Notification Notify owners, administra Recovery tors or authority Rescue device Recover/restore data 6
  • 7. • Mobilityas Behavior • Mobility modeling is a well studied research area • Can be measured and tracked: Wi-Fi, GPS, Cellular, etc • Other contextual information can be combined: Bluetooth, accelerometer, etc • Other motivating applications • Healthcare: Inpatient telemetry. • Education: Young children monitoring • Law reinforcement: Inmates monitoring and control 7
  • 8. 8
  • 9. • Past and current location trigger future locations Hallway A Office Break Room Hallway B Bathroom • User mobility as short sequence of locations [1] [2] • ―Language as action‖: Language vs. streams of sensor data • Composing elements: sensor data vs. words in corpus • Sequence structure: local dependency vs. ―grammar‖ [1] Aipperspach, et al, ―Modeling Human Behavior from Simple sensors in the Home‖, PerCom 2006 [2] Buthpitya, et al, ―n—gram Geo-Trace Modeling‖, Pervasive 2011 9
  • 10. • User location at time t depends only on the last n-1 locations • Sequence of locations can be predicted by n consecutive location in the past • Maximum Likelihood Estimation from training data by counting: • MLE assign zero probability to unseen n-grams Incorporate smoothing function (Katz) Discount probability for observed grams Reserve probability for unseen grams 10
  • 11. • Long distance dependency of words in sentences • tri-grams for ―I hit the tennis ball‖: ―I hit the‖, ―hit the tennis‖ ―the tennis ball‖ • ―I hit ball‖ not captured • Future pseudo location depends on locations far in the past. Intermediate behavior has little relevance or influence • Noise in the data collected: ―ping-pong‖ effect in WLAN association, interference, sampling errors, etc • Model size 11
  • 12. Preprocessing Anomaly Detection RSS N-gram Trace Model Sensing Anomaly Y/N 12
  • 13. • Collect RSS of the devices on multiple WAPs with timestamps • Aggregate and serialize into time series of RSS vectors * Lin, et al ―WASP: An enhanced indoor location algorithm for a congested wi-fi environment‖ 13
  • 14. • Dimensionality in RSS vector – too fine for modeling • Proximity in location results in similar RSS vector • K-means clustering algorithm with distance function similar to WASP[1] and each cluster assigned a pseudo location label [1] Lin, et al ―WASP: An enhanced indoor location algorithm for a congested wi-fi environment‖ 14
  • 15. • Repeating location labels dominate n-gram statistics • Extracting ―duration‖ by counting repeating labels • Only append ―duration‖ label if Mutual Information of locationand duration is high • Dependency - ―Conference Room‖ + ―1 hours‖ infer ―Meeting‖ • Personal - ―Professor’s Office‖ + ―10 minutes‖ infer ―Student’s quick chat‖ • Segment behavior text sequences based on time-of-day • Behavior follows routine and agenda • Varying among users • Cut the boundary based on activity level 15
  • 16. Extract Preprocessing Anomaly Pseudo Detection Location Behavior Text RSS N-gram Generation Trace Model Fusion Extract Sensing Other Features Anomaly Y/N 16
  • 17. • Feed sequence of the past locations in a sliding window of size N to n-gram model for testing • For a testing sequence of pseudo locations • Estimate the average log probability this sequence is generated from the n-gram or skipped n-gram model • If this likelihood drops below a threshold, flag an anomaly alert 17
  • 18. 0. 8 0. 7 Aver age Log Pr obabi l i t y 0. 6 0. 5 0. 4 C D A 0. 3 0. 2 Log Probility B Low Threshold High Threshold 0. 1 0 Sl i di ng W ndow Posi t i on i 18
  • 19. Extract Preprocessing Anomaly Pseudo Detection Location Behavior Text RSS N-gram Generation Trace Model Fusion Extract Sensing Other Features Threshold > Anomaly Y/N 19
  • 20. 20
  • 21. Dataset • RSS vector clustering Users 40 • Run small subset trace with Cisco SJC 14 1F Location Alpha networks different K and evaluate clustering performance by RSS 13 sec average distance to centroids sampling rate Period 5 days • K = 3X #WAPs has the best trade-offs Number of WAPs 87 • Yield ~260 pseudo locations Cisco Aironet Device 1500 + MSE Dataset Size 3.2 mil points 21
  • 22. • Testing samples Positive sample: simulated anomaly by splicing traces from two different users Negative sample: trace from ―owner‖ 22
  • 23. • Train n-gram models with 8 hour data • Continuous 5-gram model and Skipped 3-gram with skipping factor k=2 result in similar accuracy ~ 60% • Model complexity: k-order reduction • Skip factor K is data dependent: particular scenarios in our data set: office with hallways and corridors • Further investigation needed to find the optimal K. • Replacing repeating labels with duration feature improve the model Before collapsing, 5-gram statistics are dominated by several sequences with long repeating locations. Top 200 grams are repeating labels After collapsing, 5-gram statistics are well distributed • Time-of-day has only marginal improvement, <1% 23
  • 24. 1 0.9 0.8 True Positive Rate 0.7 0.6 0.5 0.4 0.3 0.2 Data Size (12 Hrs) 0.1 Data Size (8 Hrs) 0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 False Positive Rate Source information is set at 12 points. 24
  • 25. 1 0.9 0.8 0.7 0.6 Accuracy 0.5 0.4 0.3 Data size (4hr) 0.2 Data size (8hr) 0.1 Data size (12hr) 0 0 1 2 3 4 5 6 7 8 9 10 n-gram order Source information is set at 12 points. 25
  • 26. 26
  • 27. Extract Preprocessing Anomaly Pseudo Detection Location Behavior Text RSS N-gram Generation Trace Model Fusion Extract Sensing Other Features Threshold > Anomaly Y/N • Experiments to discover loss or theft event through anomaly detection with 70~80% accuracy with only 8 hours of training data 27
  • 28. Thank you. And special thanks to our sponsors CyLab Mobility Research Center Cisco Systems Inc. Army Research Office
  • 29. 29
  • 30. 30
  • 31. 31
  • 32. 32
  • 33. • Extract Mobility model from real trace in WLAN environment [1] • Extract mobility tracks, duration from WLAN association records • Analyze mobility characteristics: pause time, speed, direction, destination region and their distributions • Build empirical model to generate synthetic trace • Steady state and transient behavior can be modeled with Semi- [2] Markov model Transition probability matrix and sojourn time distribution • Language model to model behavior from sensors in home [3] Show support on similarity between language and behavior Smoothed n-gram model to make single-step prediction on binary sensor readings from smarthome [1] Kim et al, ―Extract a Mobility model from Real User Traces, INFORCOM 2006 [2] Lee and Hou, ―Modeling Steady-State and Transient Behaviors of User Mobility‖, MobiHoc 2006 [3] Aipperspach, et al, ―Modeling Human Behavior from Simple sensors in the Home‖, PerCom 2006 33
  • 34. • Overhead and lack • Model complexity • It is straightforward of granularity in and computational to convert binary inferring user overhead not sensor data to location and pause suitable for real behavior text for time from WLAN time application LM-based association [Lee’06] analysis.[Aipp’06] records[Kim’06] • Simple and cost- • Heterogeneous • Fine-grain, higher effective model to multi-valued dimension trace capture mobility sensory data is data to model reducing ping-pong hard to convert to a mobility behavior, effects single-dimension such as RSS behavior text beacons trace 34
  • 35. 35
  • 36. • Calculate coordinates for each RSS vector using ―Indoor location‖ algorithm[1] and generate hot region plot [1] Lin, et al ―WASP: An enhanced indoor location algorithm for a congested wi-fi environment‖ 36
  • 37. • Select 10 users with the least cross entropy 37
  • 38. • Help Cisco to adopt this model to Mobility Service Engine • Heterogeneous sensor data fusion Network traffic patterns from wireless controllers Applications, Memory and battery status GPS, accelerometers, gyroscope, temperature, etc • Advanced Model Leverage the internal factorized relationships among various sensors • Factor Language Model • More Applications Prediction: resource allocation, energy saving, personalized services Anomaly detection: adaptive authentication, patient telemetry 38
  • 39. 39
  • 40. • Confirm similarity between language and behavior • Multi-dimension to single dimension and n-gram: low complexity but good results • Potential problems: •Dimensionality reduction to 1-D to use language approach in modeling may cause loss of the relationship among multi-dimensional data Sensor 1 Sensor 2 State •Skipped n-gram approach is dependent on the data and may only have marginal improvement or even worse results. 40

Hinweis der Redaktion

  1. Good afternoon everybody.. Thank you for coming. Today.. I’ll be presenting my work on a language approach for detecting anomalies in user mobility behaviors by modeling their WiFi traces
  2. As a quick overview of our work, in order to do anomaly detection, we monitor and track user mobility behavior through the RSS trace from the Wifi environmentAnd then we convert these trace and other context information to behavior text representation. After that we build a n-gram language model and use it to discover anomaly such as device loss or theft
  3. So why we want to study the anomaly detection in such an environment… let me talk about our motivation.So, as we all know, Mobile applications and devices are becoming ubiquitous. On one side, mobile devices make our lives convenient. And people love it. But on the other side, the broad adoption of mobile applications such as email, messaging.. online banking and personal finance expose our identities and privacy to greater risk. The devices are portable and can be used almost everywhere we go, therefore they are also easy to lose or be stolen.
  4. Last year, a survey showed that on average 36% people who participated in the survey have been experienced device loss or theft in the past. Among the regions surveyed, miami and new york have as high as 50% loss rate. Also, it shows that a big portion of the loss happened at the places that we often visit, such as university campus, office buildings, .
  5. Losing a mobile device nowadays is not the same as 10, 20 years ago. With the proliferation of mobile devices in corporate environment, the boundary of personal devices and business devices are so blurry. People are using the same devices to gain intranet wireless access, to check corporate emails, to work on business documents, even to access trade secrets. If the devices are lost, there would be greater risks in term of data loss.Another survey shows that the data loss cost is about 50 thousand dollars per device or 6.4 million dollars per organization in the past.
  6. Given the high device loss rate and high cost associated with these losses, accountable schemes are needed to promptly and accurately discover and detect these undesirable events. Such detections will facilitate subsequent notification, mitigation and recovery process to control or even avoid the damages. And in our research work, we are focusing on the detection part of the whole action chains, namely “Anomaly Detection” First we collect user behavior and build an accountable behavior model. We can monitoruser behavior constantly and compare it with the learned model. if it deviates from the learned model, we can flag an alert.
  7. Behavior is a broad concept. Here, we want to leverage mobility as behavior as the example just shown. The reason is the following , first, mobility modeling has been studied thoroughly in the past and there are a lot of methodology that we can borrow from and lessions we can learn from. Secondly, mobility can be easily measured in the current computing environment, WIFi, GPS, cellular and can be combined with other context information such as bluetooth and other sensors. …Although our focus is on the detection of mobile device loss and theft, there are a lot of other motivating applications of mobility anomaly detection. One interesting example among other listed here is inpatient monitoring or telemetry. Imagine if we can detect anomaly in inpatient’s mobility in a hospital, medical help can be called up to handle the situation promptly.
  8. What we would like our system to do is to ..Sense the WiFI signals of the mobile devices …. Do some preprocessing on the data …. And then feed it into an anomaly detection model which then outputs whether there is an anomaly or not let’s take a look at the assumptions on which our approach is based on.
  9. Our system is based on the assumption that a user will have a unique set of locations which act as triggers for their future locations. For example, an employee exiting the break room may have two destinations: hallway A and hallway BIf we know he is taking hallway A, &lt;&lt; hit enter&gt;&gt;we know that he will be in his office soon. Other wise, &lt;&lt;hit enter&gt;&gt; he may go to bath room insteadPrevious work showed that the user mobility model can be estimated by short sequence of locations … and showed a correlation between human behavior and natural language. … research also showed that language model can be used to effectively detect anomaly in Geo-tracing
  10. So building along this line, we use a continousn-gram model to learn the sequence of locations from user’s wifi traces.N-gram model works under the assumptions that the next location in the sequence .. depends on just the last n-1 locations… Once the n-gram model is trained, we can use it to calculate the probability of all possible next locations given the past n-1 locations…. and see which one is the most likely location.To train the model, we use maximum likelihood estimation on the training sequences to estimate these conditional probability … just by counting. As show in this equation, MLE probability of being in location at time i conditioned on the past n-1 history locations is… just the count of all n sequences in the data divided by the count of all these n-1 sequences. There is one small problem with this approach. Let’s say our model come across a location that has not been seen in the training. It just assumes a zero probability. This may push the system to trigger anomaly alert. Luckily, N-gram model is very robust in handling unseen labels if we use smoothing. Smoothing algorithms such as Katz… are to take some probability mass from the seen locations and reserve them for those unseen locations.
  11. In natural language, words in a sentence may have long-distance dependencies. For example, the sentence “I hit the tennis ball” … has 3 tri-grams.. “I hit the” … “hit the tennis” .. And.. “the tennis ball” It is clear that an equally important tri-gram “I hit ball” is not normally captured by the continuous n-gram… because the separators ‘the” “tennis” is in the middle. If we could skip the separators … and we can form this important tri-gram. I hit ball Similarity, in our continuous n-gram model I just described, user’s next locations is dependent only on his n-1 previous locations. However, in many cases this may not be true.Use the same example, if a user is leaving the break room and entering hallway that leads to his office, we can predict he will be in his office soon. The intermediate locations along the hallway and before entering the office are not that important. Those locations can be skipped in the modeling. As shown in the diagram here, ABC is the break room, ACD is the entrance of the hallway and EDB is the office. Anything in the middle can be skipped and still give the same results. By skipping detracting grams, now… the effective n-gram order becomes (n-d). Therefore, we can reduce the size of the model in terms of computation and storage because the n-gram model has better performance for a lower value of n.
  12. Now we have talked about our language based model on the right hand side. But we can’t feed the wifi traces to n-gram model directly, Because, Firstly n-gram models can’t handle numeric data like signal strength. It can only take discrete sets of symbols. The Second issue is that … even though we represent the RSS trace as vectors, the amount of data required to create a model with reasonable accuracy would be immerse. Because it is not likely there will be repeating signal strength with the exact the same readings. Therefore, we need to take a look at our data and find a way to convert the sensed data into text representation.
  13. The Wifi trace we collect in our system is different from the Dartmouth data set. The management, control and data frames from a device will be heard by multiple APs. In our particular setup, these APs will record the Received signal strength or RSS of those frame along with the Identity of the device and timing information.These traces will be aggregated to a central location .. where we can serialize these traces based on the time stamp and classify them using the device IDs. So.. for a particular device, we can build a time series of RSS vector, each element in the vector is the RSS from a particular AP. These series of RSS vector along with other context information serves as the input to the preprocessing module…. Where we will convert these to a text representation before feed them into our n-gram model.
  14. From the signal propagation model, if two vectors are very similar, we know that the location where this vectors are measured should be within a reasonable proximity. Based on this assumption, we want to partition the RSS vector space into many “pseudo locations” and assign each “pseudo location” a unique label. By pseudo, we mean we don’t need to know the exact location of the reading, we just need to distinguish between two different locationsWell, this can be easily done by clustering algorithm… for example K-means clustering. In the k-mean clustering runs, we use a distance function similar to redpin and WASP in addition to the standard cosine function to reduce the noise caused by interference.Once the clustering is done, we assign labels to all the members belong to the same cluster….
  15. We also incorporated other features.Due to the way how the data is collected and aggregated, there could be a lot of repeating labels in the sequences if a user stay at one location for a long time. To extract one more “duration” feature, we count the repeating labels and remove the repeating sequence and add a new label … with both location and duration information. One minor improvement we did is to only append the duration label if the mutual information between the location and duration is high. Intuitively, we want to capture the correlations between the location and the duration. For example, conference room + 1 hour will imply a meeting. While office + 10 min will imply a quick visit. …Time-of-day features is also quantized into 4 labels and appended to the main pseudo location label. Quantization process is not based on a fixed boundary because we know that user’s mobility also follow certain regularities due to job roles and responsibility. Sometimes it follows a personalized agenda. We choose the boundary ..for time of day.. based on user’s activity level. &lt;&lt; next slide&gt;&gt;Mutual information I(X, Y) = int_yint_xp(x,y) log[ (p(x,y)/(p1(x)p2(y)] I(X,Y) = 0 -&gt; independentI(X,Y) &gt;=0
  16. Now we have the Sensing, Preprocessing and Modeling parts in place, let’s take a look how this system is used to do anomaly detection
  17. We feed the RSS trace to the preprocessing module and then feed it to the n-gram model.. And the n-gram model continuously produces the likelihood estimate for the last N behavior text,… specifically, we will calculate the average log probability of this N behavior text using this equation If this likelihood drops below a certain threshold, the system will trigger an anomaly alert.
  18. This graph shows the anomaly detection process and demonstrate different threshold may cause either detection delay (B) or cause false positives (point C &amp; D) when point A is the actual anomaly point. The way to find the right threshold is to use receiver-operating-characteristic curve or ROC curve. We will look at this in more details later in the talk.
  19. So, this complete the whole system architecture. We have the sensing part that produce RSS traces, we have preprocessing part that convert the traces and other context information to behavior text and we have the modeling training and inference part that is used to do anomaly detection with a design parameter “threshold”
  20. Now, let’s discuss the experiments we did.Before looking at the experiments and results, let me describe the data set we used.
  21. So… we collected the RSS traces from 87 WAPs in an office building over 5 days. The time precision of the RSS sample is at 13 sec level. These traces contain complete data of 40 users and … in total we have about 3.2 mil data points. To determine the number of clusters in the k-means clustering, we took a small subset traces and run the algorithm with different Ks. We evaluated the results by looking at the average distance to centroids and number of iterations. If we choose k as number of Aps, it will be similar to using association records. If K is too large, the clustering algorithm will take long to finish and the resulting n-gram model will have large vocabulary size. We found if we pick K as 3 times of the number of Aps, it will provides reasonable clustering performance and quality compared to 4 times or 5 times. This resulted in about 260 pseudo location labels. Backup data points:Pseudo location from RSS (other schem not very ….) 1500 data points (RSS) per user at average RSS from 3-7 WAPs.assume user up half of the time -&gt; 80k data points per user for 5 days3.2 mil data points collected for 40 users. 20 mils rss readingsFor each of these 40 users, 16K RSS vector total
  22. To validate our system, we need to have some testing data. However, from the trace we collected, there are no recorded anomaly fortunately. We created simulated device stolen events by splicing two users’ trace segments at their intersection points…. where similar label or labels sequences are shared. We combined this simulated traces with normal traces to create a testing data set.
  23. Before we run experiments to explore the design parameter space such as threshold, n-gram order n and training size, we want to gain some insights on whether the model works and whether the ideas in preprocessing ,, we described.. have some impacts. First, we want to how skipped n-gram affect our model. Using 8 hours of data, we train a continuous 5-gram model and skip-2 5-ngram model. Both model can capture similar length of mobility behavior and with similar detection accuracy. But the skip n-gram model has k-order reduction in the model size. This particular scenario works is probably due to the environment where the data is collected. The office floor has hallways and corridors and people have to follow those to walk around. We also found that … removing the repeating labels and adding the duration features help in the model. The 5-gram model was dominated by these repeating labels. Actually top 200 grams are repeating or partially repeating grams. After we enable the duration feature, the 5-grams statistics are better distributed. Lastly, we found the time-of-day feature doesn’t provide much gain as it brings about less than 1% improvement. This is probably due to the length of the training data. 8 hours training may not be able to capture the daily routine that well, so… time-of-day feature doesn’t have significant effect on the results.
  24. Now we gained some insights on our approach. It is time to explore some of the design parameters we mentioned in the beginning. The first set of experiments is to find the best anomaly detection threshold. Actually there is no best threshold, the threshold is depending on the applications we are running. What’s the requirements on the detection accuracy? Can we allow much false positive? Do we have enough training data? To provide a guideline in answering these questions, we plot Receiver Operating Characteristic curve (or ROC curve) Essentially, ROC curve is about the trade-offs between the true-positive rate and false-positive rate in our anomaly detection. We perform the experiments with different training data sizes. We plot the ROC curve by varying the threshold and record the TPR and FPRWith the ROC curve, we can decide the threshold for a particular application depending on The amount of data the model should see before the model can detect anomaly The required TPR Or the acceptable FPRFor example, we want to use 8 hour training size and want to have less than 0.1 false positive rate, then we just need to locate this point and obtain the threshold by which this data point is generated. (0.4) We need to use threshold &lt; 0.4 in order to fulfill the FPR requirement. Another example: let’s say we want to have the same FPR requirement but want to have TPR &gt; 0.8, then we have to use more than 8 hours training size to archive this goal.
  25. We plot this graphs with different training size and n-gram orders. From the graph, we can see several things. A higher order model captures more context and in turn increase accuracy. But…. , accuracy saturates beyond 5, which means in user’s behavior is more likely to be dependent on its last 5 pseudo locations. This resonates with the past work we mentioned in the beginning. It also tells us that increase the model complexity beyond this point will NOT bring about significant improvement.Second, it shows that if the training size is as small as 4 hours, it may not capture users’ mobility behavior thoroughly enough to make an accurate detection. Also, the closeness between 8 hr and 12 hour curves also suggests that our system will provide relative good results if we have observed users’ behavior for 8 hours. One interesting point to make here is the 12 hour and 8 hour curve cross over at the lower n-gram orders. While this could be due to errors in handling the data, our explanation is leaning towards that the bigger training data set will exposure more common locations that are not captured in the shorter training size. With these common locations, people are sharing a lot of shorter sequences, leading to more simulated anomaly are not detected and … bring down the accuracy.
  26. So now lets see what we conclude from this work and the future work we plan to do
  27. In conclusion, we have build a system that we monitor and track user mobility behavior through the RSS trace from the WLAN environmentWe convert these trace and other context information to behavior text representation. And we build a n-gram language model and use it to discover anomaly such as device loss or theft.
  28. Finally, I would like to thank our sponsors from Cylab, Cisco and Army ResearchAnd Thank you all very much for your attention.
  29. Thinking of a simple example, where the red traces in this office floor represent the usual mobility of a user. In this case, this user is finishing a meeting in a conference room and is going back to his cubicle. &lt;&lt; hit enter &gt;&gt;Now, if we look at the another path user is taking, instead of going this way, he is going towards the other direction. &lt;&lt;hit enter&gt;&gt;Then deviating further and further like thisIn such a case, we would want to flag this as an anomaly. It could be a case that a visitor who attend the meeting and took the device the employee forgot in the conference room and went away. the device may still has the access to company internal network and other data source, by receiving this alert, the infrastructure would revoke his authentication credentials temporarily until the user can authentication himself again. &lt;&lt;hit enter&gt;&gt;Now, if in stead of going further away, he is going back to his cubile, just by taking an alternate path. In this case, we probably do not want to flag this as a anomaly
  30. As I just mentioned … mobility modeling is a well studied research area. Before we go into talking about our model, let me talk about some related work.
  31. Mobility Model have been heavily used in networking research esp in Ad hoc networks.Popular models such as random way points dereived from mathematical simplications. Work by a group people in Dartmouth college is among the first attempt to construct a Wifi mobility model from real-world traces. The trace data is basically the association records collected from the wifi environment. Because… the association record may not reflect user’s actual location, they developed methods and heuritics to extract mobility tracks and pause time. They draw distributions for pause time, speed, direction of travel and destination region…. and use this to build an empirical model to generate sythentic traces. There are other works to model mobility using markov models. However, research showed that in real trace, pause-time doesn’t follow exponential distribution, therefore Markov model may not be realistic if the pause duration follows other distributions. Another group in UIUC used the same data set and adopted a semi-markov model to study the steady-state and transient behavior. They constructed transition probability matrix and sojourn time distributions … and built a time location prediction algorithm to handle load balancing in Wfi networks. Another work using Georgia Tech’s smart home data set… captured our attention. In that work, the authors use simple smoothed n-gram model to make single-step prediction on binary sensor readings. It further showed the support on similarity between language and human behavior. It actually inspired us to look at the solutions of mobility modeling using a language approach.
  32. All this existing work motivate us to think more on how to build a simple and effective mobility model to capture human behaviors. First, WiFi association records is one level indirection from the user mobility. We would like to have more direct sensor readings to reflect user mobility tracks. Secondly, semi-markov model or even DBN models are too complex for real time application. For the anomaly detection application that we are interested in, we need to come up with a simpler approach in order to have real time performance. Lastly, language and n-gram approach seems very promising on the simplicity side, however, converting mobility traces. Mostly multi-valued data streams, to a single demension text representation is very challenging. It is even more challenging if we want add other context information to it. With these findings and thoughts in mind, let me start to describe our approach. &lt;&lt;hit enter&gt;&gt;
  33. Since we are reusing other user’s trace for testing, there is a problem that could lead to unfair evaluation. If the users that we used to splice the traces … have very different mobility regions, it should be very easy to detect the simulated anomaly… because their uni-gram statistics are so different. We would like to evaluate the systems using the testing data sets that are generated from users who share mobility behaviors. First, we want to see if user’s mobility areas are separable. We run “indoor location” algorithm and calculate the (x,y) coordinates. This gave us a chance to visualize the mobility patterns and coverage area.As shown in this particular graph, orange and green users are completely separated… and the red and blue have some overlap, but still partitioned. We need to remove user pairs like this in our simulated anomaly generation process.
  34. Of course, we can NOT run the locationing algorithm for all our traces. We want to filter out those users at the pseudo location label level. …Cross entropy provides a way to measure the correlations of two distributions, and it is a good fit for our problem.We calculate the cross entropy of pseudo location labels for all the 40 users …. And we chose the 10 users with least cross entropy. This is to ensure these users mobility paths strongly overlap and it will provide fair evaluation with the simulated anomaly.
  35. For future work… As part of the sponsored research, we will help cisco integrate this model to their MSE as a value-added mobility application.This model will work with existing CCX solution to help in enterprise device security as well as leveraging its prediction capability to improve VoIP roaming performance. We are also looking into obtaining more heterogeneous sensor data from the current system such as traffic pattern, device capability and other external sensors such GPS and temperature to build a more robust sensor fusion framework As mentioned in the previous slides, to solve the problem with the factor relationship among different sensors, we plan to adopt factor language model. Last but not the least, we are looking for opportunities to apply this work to more appealing applications in healthcare and in security.
  36. One big message from this work is that we confirm the similarity between language and behavior again. N-gram model is simple and versatile enough for various applications. We demonstrated that we can combine multi-dimension data into a single dimension and convert to behavior text. We also demonstrated some of our ideas in preprocessing, modeling and testing led to reasonable improvements. Through experiments, we explored the parameters space and gained valuable insights. We also discovered some potential problems with these ideas. Especially with the dimensionality reduction. If the sensors have internal relationship and different factors towards the behavior modeling, reducing them blindly to 1-D may actually lose that information. Also, the skipped n-gram model is dependent on the data and needs further investigation.