Clustering - Grouping Data Points

How do I group my data? How do I find out which people, sensor values, or coordinates belong together? This talk covers four simple algorithms that answer these questions. Given as part of Jugend Hackt http://jugendhackt.de/ .

1
Clustering
Grouping Data Points
Programmer's version
Nicco Kunzmann nicco kunzmann
@gmail.com
Jugend Hackt 2014

4
● Data mining
– Unsupervised learning
● Clustering
● Statistics
● Information retrieval (film: "Brazil")

5
Data
Name   Age   Vegetarian  Siblings
Benni  12.4  yes         1
Horst  14.2  no          0
Irmel  16.0  no          5
Light intensity
1, 2, 12, 3, 21, 21, 2, 31, 66, 21, 3, 12, 1, 3, 1, 3, 21, 3, 21, 11, 23
4 features

8
Distance
[Figure with the labels 5, 2, 3, 2, ?, 1, 0]
What makes sense?

11
Distance
Manhattan
A  yes yes yes yes X   yes yes yes yes yes
B  X   yes yes yes X   yes X   yes X   yes
C  X   X   X   X   X   X   X   X   X   X
Imagine a 10-dimensional picture at this point.
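
The Manhattan distance between the three rows above can be sketched as follows. Encoding "yes" as 1 and "X" as 0 is an assumption made for illustration; the slide does not specify an encoding.

```python
def manhattan(a, b):
    """Sum of the absolute coordinate differences of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b))


# Assumed encoding (not given on the slide): "yes" -> 1, "X" -> 0.
A = [1, 1, 1, 1, 0, 1, 1, 1, 1, 1]
B = [0, 1, 1, 1, 0, 1, 0, 1, 0, 1]
C = [0] * 10

dist_ab = manhattan(A, B)  # 3: A and B differ in three answers
dist_ac = manhattan(A, C)  # 9
dist_bc = manhattan(B, C)  # 6
```

Under this encoding the distance simply counts the answers in which two rows differ, so B lies between A and C.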

14
Distance
There are also:
- Pearson correlation for linear dependence
- Jaccard similarity for sets (letters)
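
A minimal sketch of both measures in plain Python. The example words and number series are made up for illustration, and treating two empty sets as identical is a convention chosen here, not something the slide states.

```python
import math


def jaccard_similarity(a, b):
    """|A ∩ B| / |A ∪ B|; two empty sets are treated as identical (1.0)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant sequences")
    return cov / math.sqrt(var_x * var_y)


# Letter sets {h, a, c, k} and {h, a, c, k, t}: 4 shared letters out of 5.
letter_similarity = jaccard_similarity("hack", "hackt")
# A perfectly linear relationship (y = 2x) gives a correlation of 1.
linear = pearson_correlation([1, 2, 3, 4], [2, 4, 6, 8])
```

Both are similarities rather than distances; one common way to turn them into a distance is `1 - similarity`.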

15
Algorithms
● Single Link
● Complete Link
● K-Means
● Mean Shift
● Connected Components
● Gaussian Mixture Model
● DBSCAN

16
Single Link & Complete Link
➢ Put every point into its own new cluster
➢ Until only a few clusters remain, do:
  ➢ Find the two clusters with the minimal dist(c1, c2)
  ➢ Create a new cluster from c1 + c2
Single Link:
dist(c1, c2) = min({dist(x1, x2) | x1 ∈ c1, x2 ∈ c2})
Complete Link:
dist(c1, c2) = max({dist(x1, x2) | x1 ∈ c1, x2 ∈ c2})
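
The steps above can be sketched in Python as shown below. This is a straightforward, unoptimized version; the Euclidean point distance, the `target_clusters` stopping rule (standing in for "only a few clusters remain"), and the sample points are assumptions for illustration.

```python
import math


def euclidean(p, q):
    """Straight-line distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def cluster_distance(c1, c2, linkage):
    """Single link uses the closest pair of points, complete link the farthest."""
    pair_distances = [euclidean(x1, x2) for x1 in c1 for x2 in c2]
    return min(pair_distances) if linkage == "single" else max(pair_distances)


def agglomerative(points, target_clusters, linkage="single"):
    """Merge the two closest clusters until target_clusters remain."""
    if linkage not in ("single", "complete"):
        raise ValueError("linkage must be 'single' or 'complete'")
    if target_clusters < 1:
        raise ValueError("target_clusters must be at least 1")
    clusters = [[p] for p in points]  # every point starts in its own cluster
    while len(clusters) > target_clusters:
        best = None  # (distance, index of c1, index of c2)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_distance(clusters[i], clusters[j], linkage)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters


# Made-up sample data: two close pairs and one far-away point.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (12.0, 0.0)]
two_clusters = agglomerative(points, target_clusters=2, linkage="single")
```

On this data the two pairs are merged first, then joined with each other, which leaves the point (12, 0) in a cluster of its own.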

20
Complete Link & Single Link
Problem: I want 2 clusters

26
K-Means
➢ Place a number of centers at random
➢ Until nothing changes, do:
  ➢ Create an empty cluster for each center
  ➢ Put each point into the cluster of its nearest center
  ➢ Compute the new centers from the clusters
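
A minimal sketch of this loop in plain Python. Several details are assumptions rather than part of the slide: the initial centers are drawn from the data points, an empty cluster keeps its old center, and `max_iterations` guards against endless loops. The sample points are made up.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance; enough for finding the nearest center."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def k_means(points, k, seed=0, max_iterations=100):
    rng = random.Random(seed)
    # Assumption: pick k distinct data points as the random starting centers.
    centers = rng.sample(points, k)
    clusters = []
    for _ in range(max_iterations):
        # One empty cluster per center, then assign each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centers[i]))
            clusters[nearest].append(p)
        # Assumption: an empty cluster simply keeps its previous center.
        new_centers = [
            mean_point(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
        if new_centers == centers:  # nothing changed, so we are done
            break
        centers = new_centers
    return centers, clusters


# Made-up sample data: two well-separated groups of three points.
points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
]
centers, clusters = k_means(points, k=2)
```

With groups this far apart the result does not depend on the starting centers. On harder data, different seeds can give different clusterings, so it is common to run the algorithm several times.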

28
Mean-Shift
[Figure: chart with Rows 1 to 4 and Columns 1 to 3, value axis from 0 to 12]

30
Mean-Shift
➢ Scatter points (centers) at random
➢ While something changes, do:
  ➢ For each center p, do:
    ➢ p := mean of all data near p
Weighted mean for normally distributed data
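
A minimal one-dimensional sketch of these steps. It uses the plain (unweighted) mean of all data within a fixed `radius`; the weighted variant mentioned above is not implemented. Starting the centers on randomly chosen data points, the `radius`, the `tolerance`, and the sample data are all assumptions for illustration.

```python
import random


def mean_shift(data, radius, n_centers=10, seed=0, tolerance=1e-6,
               max_iterations=100):
    """Move every center to the mean of the data within `radius` until stable."""
    rng = random.Random(seed)
    # Assumption: start on random data points so each center has a neighbour.
    centers = [rng.choice(data) for _ in range(n_centers)]
    for _ in range(max_iterations):
        moved = False
        new_centers = []
        for p in centers:
            near = [x for x in data if abs(x - p) <= radius]
            # Keep the center where it is if no data lies within the radius.
            shifted = sum(near) / len(near) if near else p
            if abs(shifted - p) > tolerance:
                moved = True
            new_centers.append(shifted)
        centers = new_centers
        if not moved:  # nothing changed in this pass
            break
    return centers


# Made-up sample data with two dense regions around 2 and 21.
data = [1.0, 2.0, 3.0, 20.0, 21.0, 22.0]
final_centers = mean_shift(data, radius=3.0)
```

Every center ends up on the mean of one dense region (here 2.0 or 21.0). Centers that end up in the same place describe the same cluster, and unlike K-Means the number of clusters does not have to be chosen in advance.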

32
Algorithms
● Single Link
● Complete Link
● K-Means
● Mean Shift
● Connected Components (for images)
● Gaussian Mixture Model (a better K-Means)
● DBSCAN

33
Feature adjustment
Example: light sensor values:
– White: 1-6
– Grey: 7-100
– Black: 101-10000
Feature := log(light sensor value)
Adjust the data, because algorithms make dumb assumptions.
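
A short sketch of why the logarithm helps, using the three ranges above. The natural logarithm is an assumption; the slide only says `log`.

```python
import math


def log_feature(sensor_value):
    """Feature := log(light sensor value); only defined for positive readings."""
    if sensor_value <= 0:
        raise ValueError("sensor value must be positive")
    return math.log(sensor_value)


# Ranges taken from the slide.
ranges = {"white": (1, 6), "grey": (7, 100), "black": (101, 10000)}

# Width of each range on the raw scale and on the log scale.
raw_widths = {name: high - low for name, (low, high) in ranges.items()}
log_widths = {
    name: log_feature(high) - log_feature(low)
    for name, (low, high) in ranges.items()
}
```

On the raw scale the ranges are 5, 93, and 9899 units wide, so a distance-based algorithm would be dominated by differences within "black". After the logarithm the three widths differ by less than a factor of three.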

34
Implementing
● Implementation := algorithm + feature selection + feature adjustment + distance function + handling of empty clusters

35
Sources
● Data Mining lecture 2013/14 at HPI
– I. H. Witten, E. Frank, M. A. Hall: Data Mining - Practical Machine Learning Tools and Techniques (Chapters 1 – 6)
– C. Bishop: Pattern Recognition and Machine Learning (Chapters 1 – 4, 8, 9)
– T. M. Mitchell: Machine Learning (Chapters 3 – 6, 8, 10)
– P. Flach: Machine Learning – The Art and Science of Algorithms that make Sense of Data (Chapters 1 – 3, 5 – 11)
– D. J. C. MacKay: Information Theory, Inference and Learning Algorithms (Chapters 1 – 6)

Slides with a title only

6. Distance: Who belongs together?
7. Distance
9. Distance: Euclidean distance
10. Distance: Manhattan
12. Distance: Maximum
13. Distance: Cosine
17. Single Link & Complete Link
18. Single Link
19. Complete Link
21-25. K-Means
27. K-Means: Problems
29. Mean-Shift for maxima & minima
31. Mean-Shift: Problems

Editor's notes

A different view of clusters
Compute the distances!