IP Clustering as Exploratory Data Analysis

•

1 gefällt mir•466 views

Katerena Kuksenok

Class project presentation; MATH331 Optimization, Fall '09

Bildung Technologie

Clustering Using
Integer Programming
Katie Kuksenok
Michael Brooks

What are we doing?
• Investigating IP formulations of the clustering
problem

• Developing a web application for clustering

• Applying clustering to several datasets

• Visualizing resulting clusters

The Clustering Problem
• Input:
▫ N entities
▫ Each entity has a value for each of M features

• Goal:
▫ Partition the entities “optimally”

• Issues with clustering:
▫ Definition of “optimal”
▫ Different feature types don’t work well together

The Clustering Problem
Exposes high-level properties of data

Important for exploratory data analysis in data
mining

Finding the optimal clustering is NP-Complete

Grotschel and Wakabayashi IP
• Recall the Grotschel-Wakabayashi formulation
(clustering wild cats):

• Variables: xij = 1 if i and j are in the same cluster

• Data: wij = disagreements(i,j) – agreements(i,j)

• Minimize: SUM { wij xij : over all i, j }

Formulation: Constraints
• Constraints for reflexivity, symmetry, transitivity
reduce to the following constraints:

Exploratory Clustering Application
• Currently in development

• User provides data to website
• Web server processes data
▫ Produces OPL code
▫ Runs the model in LPSolve (another MIP solver)
• Results are returned to the user, with visuals

Preliminary Exploration
• With working parts of application:
• can cluster and visualize datasets

• verified Grotschel/Wakabayashi clustering results
• used clustering to explore senate voting records

Data Sets (Cats)
• Cluster 30 varieties of wild cats (from paper)
• Cats have 14 features
• Reproduced results from paper

• Model has 12180 constraints and 435 variables
• Runs in less than 1 second
• No branching

Data Sets (Senate Voting Records)
• 100 senators
• Senate records for 8 passed bills from this year

• Model has 485100 constraints and 4950 variables
• Runs in 1 minute
• No branching

Data Sets (Senate Voting By State)
• Grouped senate voting by state
• If senators for a state disagreed, ‘Maybe’

• Model has 58800 constraints and 1225 variables
• Runs in 1.5 seconds
• No branching

What now?
• The Application
▫ We’d like to put it all together
• Branch and Bound
▫ No real-life datasets so far have required branching
▫ We have constructed datasets that did
▫ Can we determine under what conditions branching
will be required?

This is a guepard (cheetah in French?):

…any other questions?

Approaches to Clustering

• What makes a good clustering
algorithm?
▫ Runtime and space efficiency (when N is
large)
Many
algorithms
▫ Ability to quickly incorporate new data
▫ Ability to deal with many-dimensional
IP data (when M is large)
▫ Optimality (closeness to)

Weitere ähnliche Inhalte

Andere mochten auch

Helping Computers Find Meaning they Lost in TranslationKaterena Kuksenok

Games on Social Networks: Balancing FriendshipsKaterena Kuksenok

Storyboard(printscreen)Sam Arblaster

Uncertainty in Chronic Illness and Patients' Online ExperienceKaterena Kuksenok

Michel Foucaultkhalfyard

Brand management Full notesversatileBschool

Mishel's Uncertainty in Illness TheorySujata Mohapatra

Patricia Benner (Novice to Expert Theory)Mary Ann Adiong

Andere mochten auch (8)

Helping Computers Find Meaning they Lost in Translation

Games on Social Networks: Balancing Friendships

Storyboard(printscreen)

Uncertainty in Chronic Illness and Patients' Online Experience

Michel Foucault

Brand management Full notes

Mishel's Uncertainty in Illness Theory

Patricia Benner (Novice to Expert Theory)

Ähnlich wie IP Clustering as Exploratory Data Analysis

Entity embeddings for categorical dataPaul Skeie

BitBootCamp Evening ClassesMenish Gupta

Machine Learning With ML.NETDev Raj Gautam

InfoEducatie - Face Recognition ArchitectureBogdan Bocse

Scaling Face Recognition with Big Data - Key Notes at DevTalks Bucharest 2017VisageCloud

Lessons from lhcdrsm79

Master guide to become a data scientist zekeLabs Technologies

Master guide to become a Data Scientist -by zekeLabszekeLabs Technologies

[RSS2023] Local Object Crop Collision Network for Efficient SimulationDongwonSon1

Research network infrastructure engineersJisc

Computer Vision for BeginnersSanghamitra Deb

Machine Learning Essentials Demystified part1 | Big Data DemystifiedOmid Vahdaty

Zestimate Lambda ArchitectureSteven Hoelscher

SMART International Symposium for Next Generation Infrastructure: The roles o...SMART Infrastructure Facility

How we integrate Machine Learning Algorithms into our IT Platform at OutfitteryOUTFITTERY

Webinar: SQL for Machine Data?Crate.io

H2O World - Intro to Data Science with Erin LedellSri Ambati

A data analyst view of Bigdata Venkata Reddy Konasani

Counting Unique Users in Real-Time: Here's a Challenge for You!DataWorks Summit

Big DataMahesh Bmn

Ähnlich wie IP Clustering as Exploratory Data Analysis (20)

Entity embeddings for categorical data

BitBootCamp Evening Classes

Machine Learning With ML.NET

InfoEducatie - Face Recognition Architecture

Scaling Face Recognition with Big Data - Key Notes at DevTalks Bucharest 2017

Lessons from lhc

Master guide to become a data scientist

Master guide to become a Data Scientist -by zekeLabs

[RSS2023] Local Object Crop Collision Network for Efficient Simulation

Research network infrastructure engineers

Computer Vision for Beginners

Machine Learning Essentials Demystified part1 | Big Data Demystified

Zestimate Lambda Architecture

SMART International Symposium for Next Generation Infrastructure: The roles o...

How we integrate Machine Learning Algorithms into our IT Platform at Outfittery

Webinar: SQL for Machine Data?

H2O World - Intro to Data Science with Erin Ledell

A data analyst view of Bigdata

Counting Unique Users in Real-Time: Here's a Challenge for You!

Big Data

Kürzlich hochgeladen

psychiatric nursing HISTORY COLLECTION .docxPoojaSen20

2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptxMaritesTamaniVerdade

TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...Nguyen Thanh Tu Collection

Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhikauryashika82

Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...Shubhangi Sonawane

Measures of Dispersion and Variability: Range, QD, AD and SDThiyagu K

Sociology 101 Demonstration of Learning Exhibitjbellavia9

The basics of sentences session 3pptx.pptxheathfieldcps1

Energy Resources. ( B. Pharmacy, 1st Year, Sem-II) Natural ResourcesShubhangi Sonawane

Python Notes for mca i year students osmania university.docxRamakrishna Reddy Bijjam

Advanced Views - Calendar View in Odoo 17Celine George

Class 11th Physics NEET formula sheet pdfAyushMahapatra5

Web & Social Media Analytics Previous Year Question Paper.pdfJayanti Pande

Unit-IV- Pharma. Marketing Channels.pptxVishalSingh1417

Seal of Good Local Governance (SGLG) 2024Final.pptxnegromaestrong

microwave assisted reaction. General introductionMaksud Ahmed

INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptxRAM LAL ANAND COLLEGE, DELHI UNIVERSITY.

This PowerPoint helps students to consider the concept of infinity.christianmathematics

Beyond the EU: DORA and NIS 2 Directive's Global ImpactPECB

Mixin Classes in Odoo 17 How to Extend Models Using Mixin ClassesCeline George

Kürzlich hochgeladen (20)

psychiatric nursing HISTORY COLLECTION .docx

2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx

TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...

Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi

Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...

Measures of Dispersion and Variability: Range, QD, AD and SD

Sociology 101 Demonstration of Learning Exhibit

The basics of sentences session 3pptx.pptx

Energy Resources. ( B. Pharmacy, 1st Year, Sem-II) Natural Resources

Python Notes for mca i year students osmania university.docx

Advanced Views - Calendar View in Odoo 17

Class 11th Physics NEET formula sheet pdf

Web & Social Media Analytics Previous Year Question Paper.pdf

Unit-IV- Pharma. Marketing Channels.pptx

Seal of Good Local Governance (SGLG) 2024Final.pptx

microwave assisted reaction. General introduction

INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx

This PowerPoint helps students to consider the concept of infinity.

Beyond the EU: DORA and NIS 2 Directive's Global Impact

Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes

IP Clustering as Exploratory Data Analysis

1. Clustering Using Integer Programming Katie Kuksenok Michael Brooks

2. What are we doing? • Investigating IP formulations of the clustering problem • Developing a web application for clustering • Applying clustering to several datasets • Visualizing resulting clusters

3. The Clustering Problem • Input: ▫ N entities ▫ Each entity has a value for each of M features • Goal: ▫ Partition the entities “optimally” • Issues with clustering: ▫ Definition of “optimal” ▫ Different feature types don’t work well together

4. The Clustering Problem Exposes high-level properties of data Important for exploratory data analysis in data mining Finding the optimal clustering is NP-Complete

5. Grotschel and Wakabayashi IP • Recall the Grotschel-Wakabayashi formulation (clustering wild cats): • Variables: xij = 1 if i and j are in the same cluster • Data: wij = disagreements(i,j) – agreements(i,j) • Minimize: SUM { wij xij : over all i, j }

6. Formulation: Constraints • Constraints for reflexivity, symmetry, transitivity reduce to the following constraints:

7. Exploratory Clustering Application • Currently in development • User provides data to website • Web server processes data ▫ Produces OPL code ▫ Runs the model in LPSolve (another MIP solver) • Results are returned to the user, with visuals

8. Preliminary Exploration • With working parts of application: • can cluster and visualize datasets • verified Grotschel/Wakabayashi clustering results • used clustering to explore senate voting records

9. Data Sets (Cats) • Cluster 30 varieties of wild cats (from paper) • Cats have 14 features • Reproduced results from paper • Model has 12180 constraints and 435 variables • Runs in less than 1 second • No branching

10.

11. Data Sets (Senate Voting Records) • 100 senators • Senate records for 8 passed bills from this year • Model has 485100 constraints and 4950 variables • Runs in 1 minute • No branching

12.

13. Data Sets (Senate Voting By State) • Grouped senate voting by state • If senators for a state disagreed, ‘Maybe’ • Model has 58800 constraints and 1225 variables • Runs in 1.5 seconds • No branching

14.

15. What now? • The Application ▫ We’d like to put it all together • Branch and Bound ▫ No real-life datasets so far have required branching ▫ We have constructed datasets that did ▫ Can we determine under what conditions branching will be required?

16. This is a guepard (cheetah in French?): …any other questions?

17.

18. Approaches to Clustering • What makes a good clustering algorithm? ▫ Runtime and space efficiency (when N is large) Many algorithms ▫ Ability to quickly incorporate new data ▫ Ability to deal with many-dimensional IP data (when M is large) ▫ Optimality (closeness to)

IP Clustering as Exploratory Data Analysis

Empfohlen

Empfohlen

Weitere ähnliche Inhalte

Andere mochten auch

Andere mochten auch (8)

Ähnlich wie IP Clustering as Exploratory Data Analysis

Ähnlich wie IP Clustering as Exploratory Data Analysis (20)

Kürzlich hochgeladen

Kürzlich hochgeladen (20)

IP Clustering as Exploratory Data Analysis