Scaling Document Clustering in the Cloud. Robert Gillen, Computer Science Research. Cloud Futures 2011.
Overview: Introduction to Piranha; Existing Limitations; Current Solution Tracks; Early Results & Future Work.
Challenge – What to do with mounds of data? What is in there? Are there any threats? What am I missing? How do I connect the “dots”? How do I find the relevant information I need?
Can’t See the Forest for the Trees: Traditionally, search methods are used to find information at high volume levels. But those methods won’t get you here easily.
Piranha: Ability to search AND analyze. Organize documents based on content. Identify similar & dissimilar documents. Identify duplicate and near-duplicate data. Incorporate new data as it becomes available. 2007 R&D 100 Award winner. Awards are based on each achievement's technical significance, uniqueness, and usefulness compared to competing projects and technologies.
Keyword Methods: Vector Space Model. Document 1: “The Army needs sensor technology to help find improvised explosive devices.” Document 2: “ORNL has developed sensor technology for homeland defense.” Document 3: “Mitre has won a contract to develop homeland defense sensors for explosive devices.” Term list: Army, Sensor, Technology, Help, Find, Improvise, Explosive, Device, ORNL, develop, homeland, Defense, Mitre, won, contract. Weight terms: Term Frequency – Inverse Document Frequency (TF-IDF). Diagram label: an index into the document list.
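A minimal sketch of the TF-IDF weighting named on this slide, applied to its three example sentences. The stop-word list, the whitespace tokenizer, and the `log(N / df)` form of the inverse document frequency are assumptions made for illustration; the slide does not say how Piranha tokenizes, stems, or normalizes (its term list shows stemmed forms such as "Improvise" and "Device", which this sketch does not reproduce).

```python
import math
from collections import Counter

# The three example sentences from the slide.
DOCS = [
    "The Army needs sensor technology to help find improvised explosive devices",
    "ORNL has developed sensor technology for homeland defense",
    "Mitre has won a contract to develop homeland defense sensors for explosive devices",
]

# Hypothetical stop-word list; the slide does not say which words are dropped.
STOP_WORDS = {"the", "to", "a", "for", "has", "needs"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words (no stemming)."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def tfidf_vectors(docs):
    """Return (term_list, vectors) using raw term count times log(N / df)."""
    tokenized = [tokenize(d) for d in docs]
    terms = sorted({t for toks in tokenized for t in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    doc_freq = {t: sum(1 for toks in tokenized if t in toks) for t in terms}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([counts[t] * math.log(n_docs / doc_freq[t]) for t in terms])
    return terms, vectors


terms, vectors = tfidf_vectors(DOCS)
```

Each document becomes one vector whose positions line up with the shared term list, so a term that occurs in every document would get weight zero and a term unique to one document gets the highest weight.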
Textual Clustering: Vector Space Model; Cluster Analysis; Similarity Matrix (D1, D2, D3; documents to documents); Most similar documents; Euclidean distance; TFIDF. Time complexity: O(n^2 log n).
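The document-to-document matrix on this slide can be sketched as follows, assuming each document is already a numeric vector. The toy vectors and the helper names `distance_matrix` and `most_similar_pair` are illustrative only and are not taken from Piranha.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def distance_matrix(vectors):
    """Symmetric matrix of pairwise distances; the diagonal stays 0.0."""
    n = len(vectors)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = euclidean(vectors[i], vectors[j])
            matrix[i][j] = matrix[j][i] = d
    return matrix


def most_similar_pair(matrix):
    """Indices (i, j) of the two distinct documents with the smallest distance."""
    n = len(matrix)
    pairs = [(matrix[i][j], i, j) for i in range(n) for j in range(i + 1, n)]
    _, i, j = min(pairs)
    return i, j


# Toy vectors standing in for D1, D2, D3 (made-up values).
VECTORS = [[1.0, 0.0, 2.0], [1.0, 1.0, 2.0], [4.0, 0.0, 0.0]]
matrix = distance_matrix(VECTORS)
closest = most_similar_pair(matrix)
```

The matrix holds n(n-1)/2 distinct pairs, which is why memory and time grow quickly as the document set grows.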
Example: Sign of the Crescent [1]. 41 short intelligence reports about a multi-pronged terrorist attack. Example: Report Date: 1 April, 2003. FBI: Abdul Ramazi is the owner of the Select Gourmet Foods shop in Springfield Mall, Springfield, VA. [Phone number 703-659-2317]. First Union National Bank lists Select Gourmet Foods as holding account number 1070173749003. Six checks totaling $35,000 have been deposited in this account in the past four months and are recorded as having been drawn on accounts at the Pyramid Bank of Cairo, Egypt and the Central Bank of Dubai, United Arab Emirates. Both of these banks have just been listed as possible conduits in money laundering schemes. [1] Intelligence Analysis Case Study by F. J. Hughes, Joint Military Intelligence College.
Piranha Cluster View: Report Date: 1 April, 2003. FBI: Abdul Ramazi is the owner of the Select Gourmet Foods shop in Springfield Mall, Springfield, VA. [Phone number 703-659-2317]. First Union National Bank lists Select Gourmet Foods as holding account number 1070173749003. Six checks totaling $35,000 have been deposited in this account in the past four months and are recorded as having been drawn on accounts at the Pyramid Bank of Cairo, Egypt and the Central Bank of Dubai, United Arab Emirates. Both of these banks have just been listed as possible conduits in money laundering schemes.
Existing Issues: Memory bound. Prior distribution approaches were troublesome. Extant need to process larger document sets.
Current Solution Tracks: Traditional HPC (Jaguar): ORNL has unique capabilities in this space. Cloud: New approaches may broaden the reach of the tool; less-specialized hardware requirements; more-accessible programming/extensibility model; ability to utilize core features of cloud platforms to provide key functionality.
Design Tenets: Utilize cloud primitives wherever possible. Build “environmentally aware” algorithms, i.e., algorithms that are aware of the environment in which they are running. Dynamically fit the platform to the problem. Design for use in disparate environments.
Cloud Scaling Approach (diagram; labels: R, C1, C2, C3, C4). Patent Pending.
Cloud Scaling Approach (diagram; labels: R, C1, C2, C4, C3a, C3b, QC1C2). Patent Pending.
Pending Issues: How frequently to check for memory pressure. Work unit size (how many documents at a time). Moving from a single machine to a distributed model introduces I/O delay (by definition): ~60K docs, increase of 2:30; bad case, 50 min/million docs.
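The two tuning knobs listed above (how often to check for memory pressure, and how many documents go into a work unit) can be illustrated with a small sketch. Everything here is hypothetical: the function `split_into_work_units`, the injected `memory_used` probe, and the rule of closing a work unit early under pressure are assumptions for illustration, not the behavior of the patent-pending approach.

```python
from typing import Callable, Iterable, List


def split_into_work_units(
    documents: Iterable[str],
    work_unit_size: int,
    check_every: int,
    memory_used: Callable[[], int],
    memory_limit: int,
) -> List[List[str]]:
    """Group documents into work units of at most work_unit_size.

    Every check_every documents the memory probe is consulted; if it reports
    more than memory_limit, the current unit is closed early.
    """
    units: List[List[str]] = []
    current: List[str] = []
    for count, doc in enumerate(documents, start=1):
        current.append(doc)
        # The probe is only called on check boundaries (short-circuit "and").
        under_pressure = count % check_every == 0 and memory_used() > memory_limit
        if len(current) >= work_unit_size or under_pressure:
            units.append(current)
            current = []
    if current:
        units.append(current)
    return units


docs = [f"doc-{i}" for i in range(10)]
# No pressure: units are capped only by work_unit_size.
relaxed = split_into_work_units(docs, 4, 3, lambda: 0, 1)
# Constant pressure: units are closed at every check boundary.
pressured = split_into_work_units(docs, 4, 3, lambda: 100, 1)
```

A smaller `check_every` reacts to pressure sooner but calls the probe more often, which mirrors the trade-off the slide leaves open.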
Vector Creation/Serialization (local)
Vector Creation/Serialization (cloud)
Real-Time Environment Monitoring. Patent Pending.
Real-Time Environment Monitoring
Fault Tolerance (diagram; labels: C1, C2, C3, C4, C1C2, C1C3, C3C4). Patent Pending.
Fault Tolerance (diagram; labels: C1, C2, C4, C1C2, C1C3, C3C4; C3 no longer shown). Patent Pending.
Fault Tolerance (diagram; labels: C1, C2, C4, C5, C1C2, C1C3, C3C4; C5 shown where C3 was). Patent Pending.
Fault Tolerance: Queues provide isolation for fault tolerance. Two-phase queues are key to success. Regular serialization of node state is key, yet how often remains in question. Not possible without the programmable infrastructure provided by the cloud. Patent Pending.
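One way to read "two-phase queues" is the get-then-delete pattern with a visibility timeout that cloud queue services such as Azure Queue storage and Amazon SQS provide: a message taken by a worker is hidden rather than removed, and it reappears if the worker never confirms completion. That reading is an assumption; the slides do not define the term. The in-memory `TwoPhaseQueue` class below is a hypothetical stand-in, not a vendor API.

```python
import time
from typing import Dict, Optional, Tuple


class TwoPhaseQueue:
    """In-memory stand-in for a cloud queue with get-then-delete semantics."""

    def __init__(self, visibility_timeout: float) -> None:
        self._timeout = visibility_timeout
        self._messages: Dict[int, str] = {}
        self._invisible_until: Dict[int, float] = {}
        self._next_id = 0

    def put(self, body: str) -> int:
        msg_id = self._next_id
        self._next_id += 1
        self._messages[msg_id] = body
        self._invisible_until[msg_id] = 0.0  # visible immediately
        return msg_id

    def get(self, now: Optional[float] = None) -> Optional[Tuple[int, str]]:
        """Phase 1: hand out a visible message and hide it for the timeout."""
        now = time.monotonic() if now is None else now
        for msg_id, body in self._messages.items():
            if self._invisible_until[msg_id] <= now:
                self._invisible_until[msg_id] = now + self._timeout
                return msg_id, body
        return None

    def delete(self, msg_id: int) -> None:
        """Phase 2: remove the message once the work is safely done."""
        self._messages.pop(msg_id, None)
        self._invisible_until.pop(msg_id, None)


queue = TwoPhaseQueue(visibility_timeout=30.0)
queue.put("work unit 1")
first = queue.get(now=0.0)    # worker A takes the message, then crashes
hidden = queue.get(now=10.0)  # still hidden from other workers
retry = queue.get(now=31.0)   # timeout elapsed, so the message reappears
queue.delete(retry[0])        # worker B finishes and confirms
empty = queue.get(now=100.0)
```

Because the message is only removed in the second phase, a failed node loses no work; another node picks the unit up after the timeout.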
Running in Different Environments: Same core algorithm (C++ code) runs in Azure, Amazon, and on Jaguar (recompiled). “Scaffolding” code is cloud/Jaguar-specific. Patterns (Repository, etc.) are used to abstract differences between the various vendor storage repositories. “Scaling” is easier in Azure. Raw control/access is easier in Amazon.
Early Results & Future Work: File packing? Scale vs. stability vs. speed. Tuning the work unit size. Patent Pending.
Questions? Rob Gillen, gillenre@ornl.gov, @argodev
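The Repository pattern mentioned above can be sketched as a small storage interface with interchangeable back ends. The slide says the core algorithm is C++; Python is used here only to keep the illustration short. The interface name `VectorRepository`, its two methods, and both implementations are hypothetical, and a real deployment would add vendor-specific classes for Azure or Amazon storage behind the same interface.

```python
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Dict


class VectorRepository(ABC):
    """Storage interface the core code depends on (hypothetical)."""

    @abstractmethod
    def save(self, key: str, data: bytes) -> None:
        ...

    @abstractmethod
    def load(self, key: str) -> bytes:
        ...


class InMemoryRepository(VectorRepository):
    """Keeps serialized vectors in a dictionary; handy for tests."""

    def __init__(self) -> None:
        self._items: Dict[str, bytes] = {}

    def save(self, key: str, data: bytes) -> None:
        self._items[key] = data

    def load(self, key: str) -> bytes:
        return self._items[key]


class LocalFileRepository(VectorRepository):
    """Stores each key as a file under a root directory."""

    def __init__(self, root: Path) -> None:
        self._root = root

    def save(self, key: str, data: bytes) -> None:
        (self._root / key).write_bytes(data)

    def load(self, key: str) -> bytes:
        return (self._root / key).read_bytes()


def roundtrip(repo: VectorRepository) -> bytes:
    """Caller code is identical regardless of the back end."""
    repo.save("doc-1", b"\x00\x01")
    return repo.load("doc-1")


memory_result = roundtrip(InMemoryRepository())
with tempfile.TemporaryDirectory() as tmp:
    file_result = roundtrip(LocalFileRepository(Path(tmp)))
```

Only the class that is constructed changes between environments, which matches the split the slide draws between the shared core and the environment-specific scaffolding.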
