머신러닝의 개념과 실습

머신러닝의 개념과 실습
광운대학교
2016. 12. 06 (화)
서울대학교 바이오지능연구실
김 병 희 연구원
bhkim@bi.snu.ac.kr

Contents
 인공지능과 머신러닝 개요
 강의 구성과 의도
 Part 1: 머신러닝 기초 개념 다지기
 Part 2: 머신러닝 실습: Weka로 분류 문제 풀기
 Part 3: 신경망 체험하기: Neural Network Playground
 부록
ⓒ 2016 SNU Biointelligence Laboratory, http://bi.snu.ac.kr/ 2

인공지능과 머신러닝 개요

Intro
Slide by Jiqiong Qiu at DevFest 2016
http://www.slideshare.net/SfeirGroup/first-step-deep-learning-by-jiqiong-qiu-devfest-2016
인공지능
머신러닝
(기계학습)
딥러닝

Image source: https://blogs.nvidia.com/blog/2016/07/29/whats-difference-artificial-intelligence-machine-learning-deep-learning-ai/

인공지능(AI)
 “사람처럼 생각하고 사람처럼 행동하는 기계”(컴퓨터, SW)를 개발
 사람이 기계보다 잘 하는 일을 기계가 할 수 있도록 하는 연구
 지능을 필요로 하는 일을 기계가 할 수 있도록 하는 연구
 1950: Turing’s Paper, 1956: “Artificial Intelligence (AI)”
6(KIPS 2016 Conference Keynote by Byoung-Tak Zhang)

인공지능 구현의 전략: 학습
 AI 연구의 목표는 ‘똑똑한‘ 컴퓨터/기계를 만드는 것
 컴퓨터의 동작이 지적이고(intelligent), 상황에 따라 적절히
변하며(adaptive), 안정적(robust)이길 원함
 사람이 일일이 프로그램을 작성하는 것은 불가능
 해법? 컴퓨터에게 우리가 원하는 동작의 예를 보여주고,
스스로 프로그램을 작성하도록 하자!
 이것이 AI 구현 전략 중 ‘학습(learning)‘ 기법
cat car
(Sam Roweis, MLSS’05 Lecture Note)

어떻게 똑똑한 컴퓨터 시스템을 만들 것인가?
세상(world)을 인식하고 이해할 필요가 있다
기본적인 음성 및 시각 능력
언어 이해
행동 예측
컴퓨터 비전
Computer Vision
자연언어 처리/이해
Natural Language
Processing / Understanding
음성 인식
Speech Recognition
+ 지능적인 행동을 해야 한다 지능형 에이전트
Intelligent Agent
지능형 에이전트
Intelligent Agent

강의 구성과 의도

어떻게 똑똑한 컴퓨터 시스템을 만들 것인가?
머신러닝 입장에서 문제의 난이도
세상(world)을 인식하고 이해할 필요가 있다
+ 지능적인 행동을 해야 한다
(패턴) 인식
(recognition)
(패턴) 이해
(recognition)
(패턴) 생성
(generation)
≪

기존 머신러닝 대비 딥러닝의 기여 요인
(패턴) 인식
(recognition)
(패턴) 이해
(recognition)
(패턴) 생성
(generation)
≪
패턴 인식 성능을 비약적으로 향상
패턴 이해와 생성을 실용 가능한 수준으로 견인
다양한 문제를 해결하는 공통의 도구를 제공

데이터 분석(data analysis)의 과정
 문제 정의(define the question)
 데이터 준비(dataset)
▪ 이상적인 데이터셋 정의(define the ideal data set)
▪ 수집할 데이터 결정(determine what data you can access)
▪ 데이터 수집(obtain the data)
▪ 데이터 정리(clean the data)
 탐색적 데이터 분석(exploratory data analysis)
▪ 군집화/데이터 가시화(Clustering / Data visualization)
 (통계적) 예측/모델링((statistical) prediction/modeling)
▪ 분류/예측(Classification / Prediction)
 결과 해석(interpret results)
▪ 다양한 평가 척도 및 방법(evaluation), 학습 결과 모델 선정(model selection)
 모든 과정 및 결과에 대한 이의 제기 및 점검(challenge results)
 결과 정리 및 보고서 작성(synthesize/write up results)
 결과 재현 가능한 프로그램 작성(create reproducible code)
(c) 2008-2015, B.-H. Kim 12
J. Leek, Data Analysis – Structure of a Data Analysis, Lecture at Coursera, 2013
인식 이해 생성 추출 발견
Data mining
Data Science
Analytics
…
머신러닝과
딥러닝이
핵심 도구로
활용되는 단계

머신러닝 & 딥러닝을 이용한 패턴 인식
Problem &
Dataset
Tool
Process &
Algorithm
(Programming)

머신러닝 & 딥러닝을 이용한 패턴 인식 가이드
1. 학습 과정
1. 머신 러닝의 기본 개념을 배우고 익힌 상태에서
2. 머신 러닝 모델 중 신경망 모델에 대한 이해를 높이고
3. 이를 기반으로 딥러닝 활용을 위한 새로운 개념을 추가로 학습
2. 다루기 쉬운 도구로 체험하기 (오늘의 강의)
1. Weka로 머신러닝 체험 (NO 프로그래밍!)
2. NN Playground로 신경망 체험 (NO 프로그래밍!)
3. 관련 강좌/튜토리얼 수강하기
1. ‘모두를 위한 머신러닝/딥러닝 강의’: http://hunkim.github.io/ml/
2. 한국인지과학산업협회(http://nacsi.kr/) 인지기술 튜토리얼

이제 시작합니다.

머신러닝과 기초 개념 다지기
Part 1

Machine Learning

머신러닝 (Machine Learning, ML)
Q. If the season is dry and the pavement is slippery, did it rain?
A. Unlikely, it is more likely that the sprinkler was ON
기계가 자동으로
대규모 데이터에서
중요한 패턴과 규칙을 학습하고
의사결정, 예측 등을 수행하는 기술
음성 인식
영상 인식
추론, 질의응답
사진 태그 생성
사진 설명 성생
얼굴 기반 감정 인식

머신러닝의 세 가지 기본 학습 모드
 Supervised Learning (감독학습, 지도학습, 교사학습)
▪ 레이블(미리 정해놓은 정답) 달린 예제로 학습하기
▪ 예제가 매우 많은 경우 효과적인 학습이 가능하다
▪ 예) 분류(classification): 레이블이 이산적인(discrete) 경우.
▪ 예) 회귀(regression): 레이블이 연속적 값을 가지는 경우
 Unsupervised Learning (무감독학습, 비지도학습, 비교사학습)
▪ 데이터에 내재된 패턴, 특성, 구조를 학습을 통해 발견. 레이블은 고려하지
않음
▪ 학습 데이터는 개체에 대한 입력 속성만으로 구성됨 D={(x)}
▪ 예: 차원 축소(dimension reduction), 군집화(clustering)
 Reinforcement Learning (강화학습)
▪ 시스템의 동작이 적절성(right/wrong)에 대한 피드백이 있는 학습
▪ 소프트웨어 에이전트가, 환경(environment) 내에서 보상(rewards)이 최대화
되는 일련의 행동(action)을 수행하도록 학습하는 기법
▪ 환경의 상태, 에이전트의 행동, 상태 전이 규칙 및 보상, 관측 범위를 고려한
학습
▪ Action selection, planning, policy learning

Img source: http://nkonst.com/machine-learning-explained-simple-words/

연결주의(connectionism)
 인공지능, 인지과학 등에서 사용하는 신경 정보 처리(neural
information processing)에 대한 대표적 모델 중 하나
▪ 단순한 요소로 구성된 망(network)에서 지적 능력의 발현과 설명을
시도
 인공신경망(artificial neural networks)
▪ 신경세포(뉴런)의 망으로 구성된 뇌를 모사하여 계산 모델을 구성
입력 출력
정보의 흐름(feedforward)

딥러닝(Deep Learning)
(출처: Y. Bengio, Deep Learning Summer School 2015)
흰 부분: 사람이 개입, 회색 부분: 컴퓨터가 자동으로 수행
GoogLeNet
DeepFace
Convolutional NN
Stacked RBM
Neural
Machine Translator
딥 러닝 기술은 사진 속 물체 인식/판별, 음성 인식, 자연언어 처리와 이해,
자동 번역 등 인공지능의 난제에서 획기적인 성능 향상을 보이고 있다.
• (좁은 의미)인공신경망의 다층 구조가 심화된 알고리즘
• 여러 단계의 정보 표현과 추상화를 학습하는
최신 머신러닝 알고리즘의 총칭
Deep Q Network

머신러닝 맛보기:
WEKA로 분류 문제 풀기
Part 2

[ Problem &
Dataset ]
붓꽃 종류 판별
[ Tool ]
Weka
[ Process &
Algorithm ]
결정트리
인공신경망
(No
Programming)

Outline
❖Part 2-1: Weka 소개
❖Part 2-2: Weka로 분류 문제 풀기
❖Weka 프로그램을 이용한 데모/실습 방식으로
진행하되, 개별 복습과 실습이 가능하도록
슬라이드를 준비하였습니다.
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr
25

Outline
❖Weka Explorer를 이용한 전처리 및 기초 분석
 Filter, Visualize
 Dataset: diabetes
❖Weka Explorer를 이용한 Classification
 Dataset: Iris, diabetes
 Classifier: 결정트리(ID3, J48, SimpleCart), Random Forest
❖Weka Experimenter를 이용한 일괄 분석
 Dataset: diabetes
 T-test를 이용한 모델 성능 비교
26

실습 1 구성
❖ 분류 문제 바로 풀어보기
 문제: 붓꽃 종류 판별, 당뇨병 진단
 목표: 예측적
 핵심 과정: 통계적 예측/모델링, 결과 해석
 도구: Weka의 Explorer, Experimenter
❖ 분류 문제를 더 잘 이해하고 풀기 위한 사전 작업
 목표: 탐색적
 핵심 과정: 탐색적 데이터 분석
 주요 작업
 데이터 전처리
 (요인별 분류 기여도 평가 및 선별)
 데이터 가시화
 도구: Weka의 Explorer
27

Weka 소개
❖ 대표적인 기계학습 알고리즘 모음, 데이터 마이닝 도구
❖ Weka의 주요 기능
 데이터 전처리(data pre-processing), 특성값 선별(feature selection)
 군집화(clustering), 가시화(visualization)
 분류(classification), 회귀분석(regression), 시계열 예측(forecast)
 연관 규칙 학습(association rules)
❖ S/W 특성
 무료 및 소스 공개 소프트웨어(free & open source GNU General Public License)
 주요 analytics S/W의 모체가 됨: RapidMiner, MOA
 Java로 구현. 다양한 플랫폼에서 실행 가능
 Python, C, R, Matlab 등 주요 도구와 연동 가능
❖ 다운로드
 Google에서 Weka로 검색, 첫 번째 검색 결과
 http://www.cs.waikato.ac.nz/ml/weka/
(c)2008-2016, SNU Biointelligence Lab. 28Weka (bird): http://www.arkive.org/weka/gallirallus-australis/video-au00.html

Top 20 Most Popular Tools for
Big Data, Data Mining, and Data Science
(c)2008-2016, SNU Biointelligence Lab. 29Source: http://www.kdnuggets.com/2015/06/data-mining-data-science-tools-associations.html
Red: Free/Open Source tools
Green: Commercial tools
Fuchsia: Hadoop/Big Data tools

Weka를 구성하는 인터페이스
(c)2008-2016, SNU Biointelligence Lab.
30
Explorer: 다양한 분석 작업을 한 단계씩
분석 수행 및 결과 확인 가능.
일반적으로, 가장 먼저 실행
KnowledgeFlow:
데이터 처리 과정의 주요 모듈을 그래프로
가시화하여 구성하고 실험 수행
Experimenter: 분류 및 회귀 분석을
일괄 처리. 결과 비교 분석.
- 다양한 알고리즘 및 파라미터 설정
- 여러 데이터-알고리즘 조합 동시 분석
- 분석 모델 간 통계적 비교
- 대규모 통계적 실험 수행
Simple CLI: 다른 인터페이스를
컨트롤하는 스크립트 입력창. Weka의
모든 기능을 명령어로 수행 가능
그 외 주요 도구
Workbench: 다른 인터페이스를 통합한
통합 인터페이스 (3.8.0부터 등장)

사례: 붓꽃(iris) 분류(classification)
31
Iris virginicaIris versicolorIris setosa
분류에 사용할 특성값

사례: 붓꽃(iris) 분류(classification)
❖ 특성값 정의(Define features or attributes)
 Sepal length, sepal width, petal length, petal width
 분류 라벨(Class label): 붓꽃(iris)의 세 아종. Setosa, versicolor, 및 virginica
❖ 샘플 수집 및 데이터 구성
 각 붓꽃 아종 별로 50개씩 샘플 수집 (1935년)
 Data table : 150 samples (or instances) * 5 attributes
 R. Fisher 경은 1936년 발표 논문에서 이 데이터에 linear discriminant model 을 적용함
❖ 학습: 분류 알고리즘 선택 및 파라미터 설정
 세 가지 분류 알고리즘으로 실습: 신경망, 결정 트리, SVM
 각 알고리즘 별 파라미터 설정은 기본값을 적용
❖ 학습 결과 평가 및 모델 선정
 다양한 평가 척도 확인
 평가 척도를 기준으로 학습 결과 모델(algorithm + parameter setting) 비교
및 선정
32

실습: 붓꽃(iris) 데이터셋
❖ Just open “iris.arff” in ‘data’ folder
33

Weka 데이터 양식 (.ARFF)
@RELATION iris
@ATTRIBUTE sepallength REAL
@ATTRIBUTE sepalwidth real
@ATTRIBUTE petallength NUMERIC
@ATTRIBUTE petalwidth numeric
@ATTRIBUTE class {Iris-setosa, Iris-versicolor, Iris-virginica}
@DATA
5.1, 3.5, 1.4, 0.2, Iris-setosa
4.9, 3.0, 1.4, 0.2, Iris-setosa
4.7, 3.2, 1.3, 0.2, Iris-setosa
…
7.0, 3.2, 4.7, 1.4, Iris-versicolor
…
데이터
(CSV format)
헤더
34
Note: Excel을 이용하여 CSV 파일 생성 후, 헤더만 추가하면 쉽게 arff 포맷의 파일 생성 가능
Dataset name Attribute name Attribute type

ARFF Example
35
%
% ARFF file for weather data with some numeric features
%
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {true, false}
@attribute play? {yes, no}
@data
sunny, 85, 85, false, no
sunny, 80, 90, true, no
overcast, 83, 86, false, yes
...
Slide from Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2)

Preprocess 탭에서 탐색적 데이터 분석 수행
❖ 데이터 구성(current relation)
❖ 특성값 삭제(remove attributes)
❖ 특성값 별 기초적 통계 분석(selected attribute)
❖ 모든 특성값을 대상으로 class label 분포 가시화(Visualize All)
❖ ‘Filter’를 이용한 preprocessing
(c)2008-2016, SNU Biointelligence Lab. 36

예측 모델 학습 및 평가 과정 예시
?
? ?
?
?
Data matrix
(row: instance)
(col: feature)
Feature
Selection
PCA
SVM
Decision
Tree
Neural
Networks
Accuracy +
Cross-
validation
ROC
Curve
AUC
(Area
Under ROC
Curve)
feature
instance
normalization
Dataset
준비/Cleaning
Feature
Manipulation
Classification
Regression
Evaluation
Fill missing values
standardization
문제 해결을 위한 다양한 프로세스 구성 및 테스트
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 37

Approaching (Almost) Any Machine Learning Problem
FIGURE FROM: A. THAKUR AND A. KROHN-GRIMBERGHE, AUTOCOMPETE: A FRAMEWORK FOR MACHINE LEARNING
COMPETITIONS, AUTOML WORKSHOP, INTERNATIONAL CONFERENCE ON MACHINE LEARNING 2015.
http://blog.kaggle.com/2016/07/21/approaching-almost-any-machine-learning-problem-abhishek-thakur/

실습: 피마 인디안 당뇨병 진단
❖ Description
 Pima Indians have the highest prevalence of diabetes in the world
 We will build classification models that diagnose if the patient shows signs of
diabetes
 http://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes
❖ Configuration of the data set
 768 instances
 8 attributes
 age, number of times pregnant, results of medical tests/analysis
 all numeric (integer or real-valued)
 Class label = 1 (Positive example )
 Interpreted as "tested positive for diabetes"
 268 instances
 Class label = 0 (Negative example)
 500 instances

WEKA를 이용한 데이터 전처리
및 기술적 분석(DESCRIPTIVE ANALYSIS)
Part Ⅰ

Preprocess 탭에서 기술적 분석 수행
❖ 데이터 구성(current relation)
❖ 특성값 삭제(remove attributes)
❖ 특성값 별 기초적 통계 분석(selected attribute)
❖ 모든 특성값을 대상으로 class label 분포 가시화(Visualize All)
❖ ‘Filter’를 이용한 preprocessing: Part V에서 설명

Weka - Explorer – Preprocess –Filter
를 이용한 데이터 전처리
❖ 주요 연습 대상 전처리기
 Fill in missing values
 weka.filters.unsupervised.attribute.ReplaceMissingValues 이용
 Standardization for all the attributes: x를 z 로 변환
 weka.filters.unsupervised.attribute.Standardize 이용
 Data reduction using PCA
 weka.filters.unsupervised.attribute.PrincipalComponents 이용
 파라미터 중 maximumAttributes를 10~50 사이 중 하나의
숫자로 임의로 설정
 Check the effect of PCA using ‘Visualize-Plot Matrix’ 다음 장
(c)2008-2016, SNU Biointelligence
Laboratory, http://bi.snu.ac.kr
42

Weka - Explorer – Visualize 를 이용한
기술적 분석 (descriptive analysis)
❖ Check the effect of PCA using ‘Visualize-Plot Matrix’
 PCA 적용 전/후 두 plot matrix를 대조하여 PCA의 효과를
확인하는 작업
 PCA 적용 전(401차원 데이터) 및 후(주성분의 수만큼 차원 축소된
데이터)에 대한 Plot Matrix 화면을 각각 캡처하고, 비교 및 해석을
수행

WEKA EXPLORER를 이용한
CLASSIFICATION
Part Ⅱ

45
분류 알고리즘 – 결정트리(Decision Trees)
❖J48 (C4.5의 Java 구현 버전)
 학습 결과 모델에서 분류 규칙을 ‘트리’ 형태로 얻을 수 있음
 Weka에서 찾아가기: classifiers-trees-J48

46
비교: 신경망(Neural Networks)
❖MLP (Multilayer Perceptron)
 실용적으로 매우 폭넓게 쓰이는 대표적 분류 알고리즘
 Weka에서 찾아가기: classifiers-functions-MultilayerPerceptron
(c)2008-2016, SNU Biointelligence
Laboratory, http://bi.snu.ac.kr
Figure from Andrew Ng’s Machine Learning Lecture Notes, on Coursera, 2013-1

Weka로 분류 수행하기 – 신경망의 예
47
click • load a file that contains the
training data by clicking
‘Open file’ button
• ‘ARFF’ or ‘CSV’ formats are
readable
• Click ‘Classify’ tab
• Click ‘Choose’ button
• Select ‘weka – function
- MultilayerPerceptron
• Click ‘MultilayerPerceptron’
• Set parameters for MLP
• Set parameters for Test
• Click ‘Start’ for learning

48
분류 알고리즘의 파라미터 설정
❖ 파라미터 설정(Parameter Setting) = 자동차 튜닝(Car Tuning)
 많은 경험 또는 시행착오 필요
 파라미터 설정에 따라 동일한 알고리즘에서도 최악에서 최고의 성능을
모두 보일 수도 있다
❖ 결정트리의 주요 파라미터 (J48, SimpleCart in Weka)
 트리의 크기에 직접적 영향을 주는 파라미터: confidenceFactor, pruning,
minNumObj 등
❖ Random Forest의 주요 파라미터 (RandomForest in Weka)
 numTrees: 학습 및 예측에 참여할 tree의 수를 지정. 대체로 많을 수록
좋으나, overfitting에 주의해야 한다.
❖ 참고: 신경망의 주요 파라미터 (MultilayerPerceptron in Weka)
 구조 관련: hiddenLayers,
 학습 과정 관련: learningRate, momentum, trainingTime (epoch), seed

Test Options and Classifier Output
49(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr
There are
various metrics
for evaluation
Setting the
data set used
for evaluation

Classifier Output
❖Run information
❖Classifier model (full
training set)
❖Evaluation results
 General summary
 Detailed accuracy by
class
 Confusion matrix
The output
depends on
the classifier

인공신경망 (ANN) 연습
❖ ANN의 주요 파라미터 설명(functions-MultilayerPerceptron 기준)
 learningRate -- The amount the weights are updated.
 momentum -- Momentum applied to the weights during updating.
 hiddenLayers –
 This defines the hidden layers of the neural network. This is a list of positive whole
numbers. 1 for each hidden layer. Comma seperated.
 Ex) 3: one hidden layer with 3 hidden nodes
 Ex) 5,3; two hidden layers with 5 and 3 hidden nodes, respectively
 To have no hidden layers put a single 0 here. This will only be used if autobuild is
set. There are also wildcard values 'a' = (attribs + classes) / 2, 'i' = attribs, 'o' =
classes , 't' = attribs + classes.
 trainingTime -- The number of epochs to train through. If the validation set is
non-zero then it can terminate the network early
❖ Experiments
 실습 목표: 주요 파라미터의 효과를 확인하기 위해 극단적인 값을 설정하고 성능
평가 수행

SVM 연습
❖ SVM의 주요 파라미터 설명(functions-SMO 기준)
 c -- The complexity parameter C.
 kernel -- The kernel to use.
 PolyKernel -- The polynomial kernel : K(x, y) = <x, y>^p or K(x, y) = (<x,
y>+1)^p.
 “exponent” represents p in the equations.
 RBFKernel -- K(x, y) = e^-(gamma * <x-y, x-y>^2)
 gamma (γ) controls the width (range of neighborhood) of the kernel
❖ Experiments
 실습 목표: 대표적 커널 연습. 커널별 핵심 파라미터 설정 연습
 PolyKernel: testing several exponents. {1, 2, 5}
 RBF kernel: “grid-search" on C and γ using cross-validation.
 C = {0.1, 1, 10}, γ = {0.1, 1, 10}
❖ Reference
 A practical guide to SVM classification (http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf)

WEKA EXPERIMENTER를
이용한 CLASSIFICATION
Part Ⅲ

Using Experimenter in Weka
❖ Tool for ‘Batch’ experiments
54
click
• Set experiment type/iteration
control
• Set datasets / algorithms
Click ‘New’
• Select ‘Run’ tab and click ‘Start’
• If it has finished successfully, click
‘Analyse’ tab and see the summary

Usages of the Experimenter
❖ Model selection for classification/regression
 Various approaches
 Repeated training/test set split
 Repeated cross-validation (c.f. double cross-validation)
 Averaging
 Comparison between models / algorithms
 Paired t-test
 On various metrics: accuracies / RMSE / etc.
❖ Batch and/or Distributed processing
 Load/save experiment settings
 http://weka.wikispaces.com/Remote+Experiment
 Multi-core support : utilize all the cores on a multi-core machine

Experimenter 실습

Experimenter 실습
❖ 실험 결과 화면 예시(Analyse 탭에서 1~4 순서대로 선택)
(C) 2014-2015, B.-H Kim 58
• Accuracy: percent_correct 선택
• F1-measure: F_measure 선택
• ROC Area: Area Under ROC 선택
1
2
3
4

참고: Package Manager
❖ Explorer에 다양한 기능 추가 가능

신경망 체험하기:
NEURAL NETWORK PLAYGROUND
Part 3

[ Problem &
Dataset ]
평면 패턴 인식
[ Tool ]
Web Demo
[ Process &
Algorithm ]
인공신경망
(No
Programming)

Neural Network 실습
(c)2008-2016, SNU Biointelligence Lab. 62
http://playground.tensorflow.org/
설명, 문제별 풀이: https://cloud.google.com/blog/big-data/2016/07/understanding-neural-networks-with-tensorflow-playground

Neural Network Activation Functions
(A. Graves, 2012)
𝑅𝑒𝐿𝑈 𝑥 = max(𝑥, 0)
Rectified Linear Unit
[장점]
• Hidden unit에 sparsity가 나타난다
• Gradient vanishing 문제가 없다
• 계산이 단순해져 학습이 빠르면서
성능 향상에도 일조

부록
• Data Mining, Data Science & Machine Learning

DATA MINING,
DATA SCIENCE &
MACHINE LEARNING
(c) 2008-2015, B.-H. Kim 66

What is Data?
❖‘data’의 정의
(c) 2008-2015, B.-H. Kim
67
“Data is a set of values of qualitative or quantitative
variables, belonging to a set of items.”
Variables: A measurement or characteristic of an item.
Qualitative: Country of origin, gender, treatment
Quantitative: Height, weight, blood pressure

What Is Data Mining?
❖ Data mining (knowledge discovery from data)
 Extraction of ‘interesting’ patterns or knowledge from huge
amount of data
 ‘Interesting’ means: non-trivial, implicit, previously unknown
and potentially useful
❖ Alternative names
 Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data dredging,
information harvesting, business intelligence, etc.
❖ Watch out: Is everything “data mining”?
 Simple search and query processing
 (Deductive) expert systems
68Slide from Lecture Slide of Ch. 1 by J. Han, et al., for Data Mining: Concepts and Techniques
(c) 2008-2015, B.-H. Kim

Related Research Fields
(c) 2008-2015, B.-H. Kim
Data
Mining
Artificial
Intelligence
(AI)
Machine
Learning
(ML)
Deep
Learning
Data
Science
Information
Retrieval
(IR)
Knowledge
Discovery
from Data
(KDD)
Big Data
Analytics
Business
Intelligence
69

Machine Learning & Data Mining
(c) 2008-2015, B.-H. Kim
70
Slide from GECCO 2009 Tutorial on ‘Large Scale Data Mining using
Genetics-Based Machine Learning’, by Jaume Bacardit and Xavier Llorà

Data Science
❖ 데이터 사이언스
 데이터 기반
의사 결정 및
문제 해결을
목적으로 하는
학제간 연구
분야
(c) 2008-2015, B.-H. Kim
71

Data Science
(c) 2008-2015, B.-H. Kim
72
Figure source: http://nirvacana.com/thoughts/becoming-a-data-scientist/

데이터 분석을 위한 질문의 유형과 목표
❖ 기술적(descriptive)
 Describe a set of data
❖ 탐색적(exploratory)
 Find relationships you didn't know about
 Correlation does not imply causation
❖ 추론적(inferential)
 Use a relatively small sample of data to say something about a bigger population
 Inference is commonly the goal of statistical models
❖ 예측적(predictive)
 To use the data on some objects to predict values for another object
 Accurate prediction depends heavily on measuring the right variables
❖ 인과적(causal)
 To find out what happens to one variable when you make another variable change
❖ 기계론적(mechanistic)
 Understand the exact changes in variables that lead to changes in other variables for
individual objects
(c) 2008-2015, B.-H. Kim
73

분석 목적에 따른 데이터셋 선택
❖ 기술적(descriptive)
 전체 모수 필요(a whole population)
❖ 탐색적(exploratory)
 무작위 추출 후 다양한 변수 측정(a random sample with many variables
measured)
❖ 추론적(inferential)
 모집단을 정확히 선별 후 무작위 추출(the right population, randomly
sampled)
❖ 예측적(predictive)
 동일한 모집단에서 학습 데이터와 테스트 데이터 획득(a training and test
data set from the same population)
❖ 인과적(causal)
 무작위적 기법을 적용한 연구에서 데이터 획득(data from a randomized
study)
❖ 기계론적(mechanistic)
 시스템의 모든 요소를 아우르는 데이터 획득(data about all components of
the system)(c) 2008-2015, B.-H. Kim
74

데이터 분석(data analysis)의 과정
❖ 문제 정의(define the question)
❖ 데이터 준비(dataset)
 이상적인 데이터셋 정의(define the ideal data set)
 수집할 데이터 결정(determine what data you can access)
 데이터 수집(obtain the data)
 데이터 정리(clean the data)
❖ 탐색적 데이터 분석(exploratory data analysis)
 클러스터링/데이터 가시화(Clustering / Data visualization)
❖ 통계적 예측/모델링(statistical prediction/modeling)
 분류/예측(Classification / Prediction)
❖ 결과 해석(interpret results)
 다양한 평가 척도 및 방법(evaluation), 학습 결과 모델 선정(model selection)
❖ 모든 과정 및 결과에 대한 이의 제기 및 점검(challenge results)
❖ 결과 정리 및 보고서 작성(synthesize/write up results)
❖ 결과 재현 가능한 프로그램 작성(create reproducible code)
(c) 2008-2015, B.-H. Kim
75

Raw versus processed data
❖Raw data
 The original source of the data
 Often hard to use for data analyses
 Data analysis includes processing
 Raw data may only need to be processed once
❖Processed data
 Data that is ready for analysis
 Processing can include merging, subsetting, transforming,
etc.
 There may be standards for processing
 All steps should be recorded
(c) 2008-2015, B.-H. Kim
76

Raw versus processed data
❖Raw data ❖Processed data
(c) 2008-2015, B.-H. Kim 77

머신러닝의 개념과 실습

Empfohlen

Empfohlen

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (20)

Andere mochten auch

Andere mochten auch (8)

Ähnlich wie 머신러닝의 개념과 실습

Ähnlich wie 머신러닝의 개념과 실습 (20)

머신러닝의 개념과 실습