Big Data

Big Data Technology Overview
2014.10.31
윤형기 (H K Yoon)
hky@openwith.net

Contents
• Introduction
• Big data technology overview
• Hadoop
• NoSQL
• Analysis
• Closing remarks

Introduction
• A changing world
• The power of data

Big Data
Image source: ZDNet

Background – 3V
• Tidal Wave – 3VC
• Supercomputer
– High-throughput computing
– Two directions:
• Remote, distributed large-scale computing (grid computing)
• Centralized (MPP)
• Scale-Up vs. Scale-Out
• BI (Business Intelligence)
– In particular DW/OLAP/data mining

Hadoop
• The birth of Hadoop?
– Background
• Google!
• Spun off from the Nutch/Lucene project in 2006
– Doug Cutting
– A top-level Apache open-source project
– Characteristics
• Distributed processing framework for large-scale data
– http://hadoop.apache.org – pure software
• Linear scalability (flat linearity) through a simplified programming model
– “function-to-data model vs. data-to-function” (Locality)
– KVP (Key-Value Pair)

Background to the birth of Hadoop
• 1990s – Excite, Alta Vista, Yahoo, …
• 2000 – Google; PageRank, GFS/MapReduce
• 2003~4 – Google papers
• 2005 – Hadoop is born (D. Cutting & Cafarella)
• 2006 – Listed as an Apache project

• Hadoop Kernel
• Hadoop distributions
– Apache releases
• 2.x.x: based on 0.23.x
– Third-party distributions
• Cloudera, Hortonworks, and MapR

• Hadoop distributions?
– Apache Hadoop started at 0.10 and is currently at 0.23
– Current Apache releases
• 2.x.x: based on 0.23.x
• 1.1.x: current stable release (based on 0.22)
• 0.20.x: legacy stable release, still widely used
– Current third-party distributions
• Cloudera
– CDH
• Hortonworks
• MapR
• …

Hadoop
– HDFS & MapReduce –

Requirements
• Commodity hardware
– Frequent failures are the norm
• A large number of large files
– Hundreds of GB or TB
– Large-scale streaming reads – not random access
• “Write-once, read-many-times”
• High throughput matters more than low latency
• “Modest” number of HUGE files
– Just millions; each > 100MB, and multi-GB files are typical
• Large streaming reads
– …

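The HDFS solution
• Files are stored in blocks
– Blocks are much larger than in ordinary file systems (default: 64MB)
• Improved reliability through replication
– Each block is replicated across 3+ DataNodes
• A single master (NameNode) coordinates access and metadata
– Simplified, centralized management
• No data caching
– Caching offers little benefit for streaming reads
• Familiar interface, but customize the API
– Simplify the problem and focus on the distributed solution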
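
As a rough illustration of the block and replication ideas above, the following Python sketch splits a file into 64MB blocks and assigns each block to three DataNodes. The DataNode names and the round-robin placement are assumptions made for this example; real HDFS placement is rack-aware and is decided by the NameNode.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64MB default block size named on the slide
REPLICATION = 3                # each block is stored on 3 DataNodes


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the byte length of each block for a file of file_size bytes."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Assign each block to `replication` distinct DataNodes.

    Assumption: plain round-robin stands in for the real placement policy.
    """
    if replication > len(datanodes):
        raise ValueError("need at least as many DataNodes as replicas")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            datanodes[(block_id + r) % len(datanodes)] for r in range(replication)
        ]
    return placement


# Hypothetical 200MB file on a hypothetical four-node cluster
blocks = split_into_blocks(200 * 1024 * 1024)
layout = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
```

With these inputs the file occupies three full blocks plus one 8MB tail block, and losing any single DataNode still leaves two copies of every block.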

GFS architecture
Image source: Ghemawat et al., “The Google File System”, SOSP, 2003

Ways to access HDFS
• Command-line interface
• Java API
• Web interface
• REST interface (WebHDFS REST API)
• Mounting HDFS

HDFS command-line interface
• Create a directory
$ hadoop fs -mkdir /user/idcuser/data
• Copy a file from the local filesystem to HDFS
$ hadoop fs -copyFromLocal cit-Patents.txt /user/idcuser/data/.
• List the files under the data directory in HDFS
$ hadoop fs -ls data/*
• Show the end of the specified HDFS file
$ hadoop fs -tail /user/idcuser/data/cit-patents-copy.txt
• Append multiple files and move them to HDFS (via stdin/pipes)
$ cat /data/ita13-tutorial/pg*.txt | hadoop fs -put - data/all_gutenberg.txt

• File/directory commands:
– copyFromLocal, copyToLocal, cp, getmerge, ls, lsr (recursive ls),
– moveFromLocal, moveToLocal, mv, rm, rmr (recursive rm), touchz,
– mkdir
• Status/list/show commands:
– stat, tail, cat, test (checks for existence of path, file, zero-length files), du, dus
• Misc commands:
– setrep, chgrp, chmod, chown, expunge (empties trash folder)

HDFS Java API
• Listing files/directories (globbing)
• Open/close inputstream
• Copy bytes (IOUtils)
• Seeking
• Write/append data to files
• Create/rename/delete files
• Create/remove directory
• Reading Data from HDFS
org.apache.hadoop.fs.FileSystem (abstract)
org.apache.hadoop.hdfs.DistributedFileSystem
org.apache.hadoop.fs.LocalFileSystem
org.apache.hadoop.fs.s3.S3FileSystem

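HDFS summary
• Focused on large-scale work across many low-cost machines
– Copes with frequent failures
– Focused on large files (mostly appended and read)
– A filesystem interface aimed at developers
• Scale-out & batch jobs
– Several recent complementary projects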

MapReduce – programming model
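
The programming model can be sketched in a few lines of plain Python using the WordCount example. This is a single-process illustration of the map, shuffle, and reduce steps, not Hadoop's API; the function names and the sample input are assumptions made for this example.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) key-value pair for every word in a line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle & sort: group all values by key, with keys in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Reducer: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs))


counts = word_count(["big data big cluster", "data node"])
# counts == {"big": 2, "cluster": 1, "data": 2, "node": 1}
```

In a real cluster the mappers run in parallel next to the data blocks, and the framework performs the shuffle between nodes.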

Improving the WordCount example
• Problem: a single reducer becomes a bottleneck
– Work can be distributed over multiple nodes (better work balance)
– All the input data has to be sorted before processing
– Question: which data should be sent to which reducer?
• Solution:
– Arbitrarily distributed, based on a hash function (default mode)
– Partitioner class, to determine the corresponding reducer for every output tuple
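
A hash-based partitioner can be sketched as follows. This is a simplified stand-in written in Python, not Hadoop's Partitioner class; `zlib.crc32` is used here only as a convenient stable hash, and the function names are assumptions made for this example.

```python
import zlib


def partition(key, num_reducers):
    """Pick a reducer index for a key from a stable hash of the key."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def partition_pairs(pairs, num_reducers):
    """Route (key, value) pairs into one bucket per reducer."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    return buckets


pairs = [("big", 1), ("data", 1), ("big", 1), ("node", 1)]
buckets = partition_pairs(pairs, num_reducers=3)
```

Because the reducer index depends only on the key, every occurrence of the same word lands on the same reducer, which is what lets each reducer produce a complete count for its keys.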

Unix commands and the Streaming API
• Question: How many cities does each country have?
hadoop jar /mnt/biginsights/opt/ibm/biginsights/pig/test/e2e/pig/lib/hadoop-streaming.jar \
-input input/city.csv \
-output output \
-mapper "cut -f2 -d," \
-reducer "uniq -c" \
-numReduceTasks 5
• Explanation:
cut -f2 -d, # Extract the 2nd column of a CSV
uniq -c # Filter adjacent matching lines from INPUT,
# -c: prefix lines by the number of occurrences
Additional remark: # numReduceTasks=0: no shuffle & sort phase!
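
To see why this pair of Unix commands answers the question, the same pipeline can be imitated locally in Python. The sample rows and the city,country column order are assumptions made for this example; the actual layout of city.csv is not shown in the slides.

```python
from itertools import groupby


def mapper(line):
    """Equivalent of `cut -f2 -d,`: keep the second comma-separated field."""
    text = line.rstrip("\n")
    fields = text.split(",")
    # Like cut, pass the line through unchanged when it has no delimiter
    return fields[1] if len(fields) > 1 else text


def reducer(sorted_keys):
    """Equivalent of `uniq -c`: count runs of adjacent identical lines."""
    return [(len(list(run)), key) for key, run in groupby(sorted_keys)]


# Hypothetical rows in the form city,country
rows = ["Seoul,KR", "Busan,KR", "Tokyo,JP", "Osaka,JP", "Incheon,KR"]
keys = sorted(mapper(row) for row in rows)  # the shuffle sorts mapper output
result = reducer(keys)
# result == [(2, "JP"), (3, "KR")]
```

The counts are only correct because the keys reach the reducer sorted; `uniq -c` on unsorted input would report the same country several times.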

Use the right tool for the right job

Hadoop: strengths, weaknesses, and responses
• Strengths of Hadoop
– commodity h/w
– scale-out
– fault-tolerance
– flexibility by MR
• Weaknesses of Hadoop
– MR!
– Missing: schema, optimizer, index, view, ...
– Lack of compatibility with existing tools
• Solution: Hive
– SQL to MR
– Compiler + execution engine
– Pluggable storage layer (SerDes)
• Open issues: Hive
– ANSI SQL, UDF, ...
– MR latency overhead
– Work still in progress...!

SQL-on-MapReduce
• Direction
– Query and analyze data stored in HDFS quickly using SQL
– Aim for real-time (low-latency) analysis without MR
– Used for large-scale batch and real-time interactive analysis
– ETL, ad-hoc queries, and online integration over HDFS and other data
• New Architecture for SQL on Hadoop
– Data locality
– Real-time query (instead of MR)
– Schema-on-Read
– Tight integration with the SQL ecosystem
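
Schema-on-Read means the stored data carries no schema; column names and types are applied only when a query reads it. A minimal Python sketch of the idea follows. The raw rows, the schema, and the function name are all hypothetical and chosen only for this example.

```python
import csv
import io

# Hypothetical raw file content, stored as plain text with no schema attached
RAW = "1,Seoul,9900000\n2,Busan,3400000\n"

# Hypothetical schema chosen by the reader: (column name, type converter)
SCHEMA = [("id", int), ("city", str), ("population", int)]


def read_with_schema(raw_text, schema):
    """Parse raw CSV text and apply names and types only at read time."""
    rows = []
    for record in csv.reader(io.StringIO(raw_text)):
        rows.append(
            {name: cast(value) for (name, cast), value in zip(schema, record)}
        )
    return rows


rows = read_with_schema(RAW, SCHEMA)
# A query over the typed view: cities with more than five million people
large = [row["city"] for row in rows if row["population"] > 5_000_000]
```

A different reader could apply a different schema to the same bytes, which is the flexibility this approach trades against up-front validation.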

• Examples of SQL-on-Hadoop projects
– Google Dremel
– Apache Drill
– Cloudera Impala
– Citus Data
• Tajo
– Accepted as an Apache Incubator project in March 2013
• Apache License 2.0
– Adopted by Korean companies – SK Telecom and others

NoSQL?
• NoSQL is also a DBMS.
– Traditional RDBMS:
• Table
• More functionality, less performance
– OLAP
• Cube
– NoSQL
• Collections
• Less functionality, more performance
• Focus: scalability, performance, HA
(Figure labels: Structured Data; Structured/Unstructured Data)

Types of NoSQL
• Key-Value Stores
– Origin: DHTs / Amazon’s Dynamo paper
– Examples: Memcached, Coherence, Redis
• Column Store
– Origin: Google's Bigtable paper
– Examples: HBase, Cassandra, Hypertable
• Document Store
– Origin: Lotus Notes
– Examples: CouchDB, MongoDB, Cloudant
• Graph Database
– Origin: Euler & graph theory
– Examples: Neo4j, FlockDB

NoSQL characteristics
• Missing?
– No support for joins
– No support for complex transactions (ACID)
– No support for constraints
• Available?
– Query language
– High performance
– Horizontal scalability
(Figure: NoSQL vs. SQL, compared by performance and functionality)

MongoDB example
• MongoDB packages
– mongodb-org
– mongodb-org-server
– mongodb-org-mongos
– mongodb-org-shell
– mongodb-org-tools
• Installation
– sudo yum install -y mongodb-org
• Starting and stopping
– sudo service mongod start
– sudo service mongod stop

• Sharding
– Storing data across multiple machines.
• Vertical scaling
• Horizontal scaling
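
The idea of spreading data across machines can be sketched with a hash of a shard key. This is a generic Python illustration, not MongoDB's implementation; the `user_id` field, the shard count, and the function names are assumptions made for this example.

```python
import hashlib


def shard_for(key, num_shards):
    """Map a shard key to a shard index using a stable hash (SHA-256 here)."""
    digest = hashlib.sha256(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def distribute(documents, key_field, num_shards):
    """Place each document on the shard chosen by its shard key."""
    shards = {index: [] for index in range(num_shards)}
    for doc in documents:
        shards[shard_for(doc[key_field], num_shards)].append(doc)
    return shards


# Hypothetical documents keyed by a hypothetical user_id field
docs = [{"user_id": i, "name": f"user{i}"} for i in range(100)]
shards = distribute(docs, "user_id", num_shards=4)
```

Adding a shard in this naive scheme changes the modulus and therefore moves most keys, which is one reason production systems use chunk ranges or consistent hashing instead.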

Using NoSQL – when?
• Large volumes of data
• When relationships between elements are not important
• Unstructured data (logs, Twitter, blogs, …)
• Rapid prototyping
• When data changes quickly
• When business logic is implemented in the application rather than in the DBMS

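NoSQL conclusion
• Characteristics
– Relaxes the constraints of traditional RDBs for simplicity, better performance, and flexibility
• Current status
– About 150 products as of January 2014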

Analysis tools – Big Bang
• Functional specialization

Analysis techniques
• Data Mining
• Predictive Analysis
• Data Analysis
• Data Science
• OLAP
• BI
• Analytics
• Text Mining
• SNA (Social Network Analysis)
• Modeling
• Prediction
• Machine Learning
• Statistical/Mathematical Analysis
• KDD (Knowledge Discovery)
• Decision Support System
• Simulation
For convenience, the terms (data) analysis and (data) mining are used interchangeably.

• Taxonomy of basic statistical theory

• Taxonomy of machine learning theory

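R
• Open-source mathematical/statistical analysis tool and programming language
– Originated from the S language, with a large number of packages
• CRAN: http://cran.r-project.org/
• Currently > 5,100 packages
– Excellent performance and visualization capabilities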

Analysis techniques
• General machine learning procedure

• Analysis algorithms
– Exploration
– Modeling

Data, that is, a focus on objective evidence!
• The Fox and the hedgehog in a project life…
– Leader vs. expert
– The problem is the risk inherent in hedgehog-style thinking…
• Systems Thinking: A Foxy Approach
– OODA: A fox dressed like a hedgehog

Closing remarks
• “Big Data is All Data”
• The question of openness
– Open source, Naver vs. Google
• Education
– Education in general, big data education (www.coursera.org/course/mmds)
• Big data 4Vs vs. 4Ps (hurdles)
– Practicality, privacy, power, privilege
• Other
– Complexity theory, system dynamics, data exhaust
