GRUTER on Big Data Platform Build Strategy and Case Studies: A Case Study in Building Hadoop-based NoSQL for Bioinformatics Data

© 2013 Gruter. All rights reserved.
A Case Study in Building Hadoop-based NoSQL
for Bioinformatics Data
2013.08.28
Kim Jin-ho (김진호), Principal Researcher

Introduction
• Bioinformatics
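– The field that studies biological phenomena by applying tools from theoretical physics, computer science, statistics, and mathematics to extract useful knowledge from the large volumes of data obtained from living organisms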
• Bioinformatics as a computer science
– Biotechnology (BT) + Information Technology (IT)

DNA Structure

Human genome
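• A portmanteau of gene and chromosome
• Genome (Korean 게놈, from German Genom; English genome, pronounced 지놈)
• The complete nucleotide sequence of an organism's genetic material
– The roughly 30,000 human genes are recorded in about 3 billion base pairs of DNA. The effort to decode every human gene by determining which gene each stretch of DNA sequence corresponds to is called the "Human Genome Project".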
http://ko.wikipedia.org/wiki/%EA%B2%8C%EB%86%88

Human genome
http://en.wikipedia.org/wiki/Chromosome

Conserved segments in the human and mouse genome
Nature, Human Genome, Figure 46

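Requirements
• Low-cost data storage
• Support for diverse bioinformatics data
• Metadata management that does not depend on file type
• SQL query (JDBC) support
• Fast search, and good performance on large result sets
• Analysis of the stored data
• Guaranteed scalability and reliability
• Cluster management and monitoring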

1000 Genomes Browser
http://www.ncbi.nlm.nih.gov/variation/tools/1000genomes

Challenges
• Difficulty of understanding the domain
– Unfamiliar terminology
• Sequencing and mapping
• Pairwise Alignment
• AATCTATA AATCTATA AATCTATA …
• Numerous algorithms and formulas
– Maxam-Gilbert sequencing
– Needleman and Wunsch Algorithm
– Phred quality score
– R-Tree
• Many data formats
– FASTA, SAM, BAM, SNP, CNV, Inversion, Large InDel, Small InDel
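• Storing and searching very large record sets (read only)
– About 3 billion records, plus experimental data from each processing stage
– Storing one person's data in an RDB requires at least 5 GB
– 1,000,000 users would need 5 PB -> the cost makes the service commercially unviable
– With Hadoop, the service is feasible on a cluster of 500 to 1,000 nodes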
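
The sizing argument above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the per-person size and user count come from the slide, while the replication factor of 3 (the HDFS default) and the use of decimal units are assumptions that the slide does not state.

```python
# Back-of-the-envelope sizing for the figures quoted on this slide.
GB = 10**9
TB = 10**12
PB = 10**15

PER_PERSON_BYTES = 5 * GB   # from the slide: at least 5 GB per person
USERS = 1_000_000           # from the slide
REPLICATION = 3             # assumption: HDFS default replication factor


def total_bytes(users: int, per_person: int = PER_PERSON_BYTES) -> int:
    """Logical (unreplicated) data volume in bytes."""
    return users * per_person


def raw_tb_per_node(users: int, nodes: int, replication: int = REPLICATION) -> float:
    """Raw disk needed on each node, in TB, once replicas are counted."""
    return total_bytes(users) * replication / nodes / TB


if __name__ == "__main__":
    print("logical data (PB):", total_bytes(USERS) / PB)
    for nodes in (500, 1000):
        print(nodes, "nodes -> raw TB per node:", raw_tb_per_node(USERS, nodes))
```

Under these assumptions the per-node figure stays within what a commodity server with several large disks can hold, which is consistent with the 500 to 1,000 node range given above.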

DNA Sequencing Costs
http://www.nature.com/news/2009/091021/full/464670a.html

Open Source References
• Picard - Java-based SAMtools
– Command-line programs and SAM-JDK
– Implements handling for many formats
– Provides an index model
• Binning index
– The binning scheme is essentially a representation of R-tree
– Combined with a linear index
http://picard.sourceforge.net
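
To make the binning scheme concrete, here is a minimal Python sketch of the bin calculation described in the SAM/BAM format specification (the `reg2bin` function there is written in C). It is not taken from the presenter's system. Coordinates are assumed to be 0-based and half-open, and the level offsets 0, 1, 9, 73, 585, and 4681 are the standard ones for a 512 Mbp top-level bin subdivided down to 16 kbp bins.

```python
def reg2bin(beg: int, end: int) -> int:
    """Return the smallest bin that fully contains the interval [beg, end).

    Each level splits its parent bin into 8 children, so the bins form a
    fixed hierarchy of nested intervals (the R-tree-like structure).
    """
    end -= 1  # make the end coordinate inclusive
    if beg >> 14 == end >> 14:      # fits in one 16 kbp bin
        return ((1 << 15) - 1) // 7 + (beg >> 14)   # offset 4681
    if beg >> 17 == end >> 17:      # fits in one 128 kbp bin
        return ((1 << 12) - 1) // 7 + (beg >> 17)   # offset 585
    if beg >> 20 == end >> 20:      # fits in one 1 Mbp bin
        return ((1 << 9) - 1) // 7 + (beg >> 20)    # offset 73
    if beg >> 23 == end >> 23:      # fits in one 8 Mbp bin
        return ((1 << 6) - 1) // 7 + (beg >> 23)    # offset 9
    if beg >> 26 == end >> 26:      # fits in one 64 Mbp bin
        return ((1 << 3) - 1) // 7 + (beg >> 26)    # offset 1
    return 0                        # only the 512 Mbp root bin contains it
```

A record is stored under the single bin returned here, so a short read usually lands in a leaf bin, while a read that straddles a 16 kbp boundary moves up one or more levels.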

Index
• Bin array: 0 1 2 … 585 … 4681 … 37449
– Each bin holds its chunks (Chunk1: file offset, Chunk2: file offset)
• Linear index: 0 1 2 … 32770
– Example: bin-4681 (stores the minimum chunk start of that bin)
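
The way the bin array and the linear index cooperate during a region query can be sketched as follows. This follows the approach described in the SAM/BAM format specification rather than the presenter's code. The in-memory layout (`bin_index` as a dict of chunk lists, `linear_index` as a list with one entry per 16 kbp window) and the use of plain integers instead of BAM virtual file offsets are simplifying assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Chunk = Tuple[int, int]  # (start file offset, end file offset)

# (first bin number, shift) for each level below the root bin 0.
LEVELS = [(1, 26), (9, 23), (73, 20), (585, 17), (4681, 14)]


def reg2bins(beg: int, end: int) -> List[int]:
    """All bins that may hold records overlapping the interval [beg, end)."""
    end -= 1  # make the end coordinate inclusive
    bins = [0]
    for offset, shift in LEVELS:
        bins.extend(range(offset + (beg >> shift), offset + (end >> shift) + 1))
    return bins


def candidate_chunks(
    bin_index: Dict[int, List[Chunk]],
    linear_index: List[int],
    beg: int,
    end: int,
) -> List[Chunk]:
    """Chunks worth reading for a query, pruned with the linear index."""
    window = beg >> 14  # 16 kbp window containing the query start
    min_offset = linear_index[window] if window < len(linear_index) else 0
    chunks = [
        chunk
        for b in reg2bins(beg, end)
        for chunk in bin_index.get(b, [])
        if chunk[1] > min_offset  # skip chunks that end before the minimum
    ]
    return sorted(chunks)
```

The bin lookup alone can return chunks from large, high-level bins that contain nothing near the query. The linear index supplies a lower bound on the file offset for the query's window, so those chunks are dropped before any disk read.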

Architecture
[Architecture diagram; labels recovered from the figure]
• Hadoop DataNode (several): pairs of Data File and Index File
• Data Server #1: hosts Genome Unit #1 (labels: Disk, Memory, Index, Data File)
• ZooKeeper: Server Cluster Membership, Cluster Configuration, Master Election, Meta Information
• Master Server: Genome Allocation, Data Server Failover, Meta Management
• Other components: Application Server, Genome Browser, Client, JDBC, Uploader, Indexer
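Components
• Genome Unit
– The unit of search and management
– Consists of metadata, data, and index files
• Data Server
– Performs searches and returns results
– Runs user programs and MapReduce (MR) jobs
– One Data Server serves N Genome Units
• Master Server
– Manages the system and allocates Genome Units
– Provides monitoring
• Hadoop
– Storage for data files and index files
• ZooKeeper
– Cluster membership and metadata management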
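
The slides do not say how the Master Server decides which Data Server gets which Genome Unit. As one possible reading, the sketch below assigns each unit to the least-loaded servers. The function name, the server and unit identifiers, and the least-loaded policy are all hypothetical and chosen only to illustrate the relationship between the components.

```python
from typing import Dict, List


def allocate(units: List[str], servers: List[str], replicas: int = 1) -> Dict[str, List[str]]:
    """Assign every Genome Unit to `replicas` distinct Data Servers.

    Hypothetical policy: always pick the servers currently serving the
    fewest units, breaking ties by server name.
    """
    if replicas < 1 or replicas > len(servers):
        raise ValueError("replicas must be between 1 and the number of servers")
    load = {server: 0 for server in servers}
    assignment: Dict[str, List[str]] = {}
    for unit in units:
        chosen = sorted(servers, key=lambda s: (load[s], s))[:replicas]
        for server in chosen:
            load[server] += 1
        assignment[unit] = chosen
    return assignment
```

With three units, three servers, and two replicas, each server ends up serving two units, which matches the statement that one Data Server serves N Genome Units.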

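Features
• Scalability
– One Data Server serves N Genome Units
– Servers can be added or removed linearly as the data grows or shrinks
– Data service and system performance are unaffected while servers are being added or removed
– Depending on its size, an index can be loaded into memory or kept on local disk
• Reliability
– Each Genome Unit is replicated across N Data Servers
– Automatic reassignment when a Data Server fails
• Performance
– Index placement is chosen by index size and access frequency to keep index lookups fast
– Separate indexes are provided for count, average, and similar aggregates
• Other
– Web-based management tool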
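
Automatic reassignment after a Data Server failure could look roughly like the sketch below. The presentation does not describe the actual failover logic, so the function, its arguments, and the least-loaded replacement policy are assumptions for illustration only.

```python
from typing import Dict, List


def reassign_after_failure(
    assignment: Dict[str, List[str]],
    live_servers: List[str],
    failed: str,
) -> Dict[str, List[str]]:
    """Replace a failed Data Server in every Genome Unit's replica list.

    Hypothetical policy: each affected unit gets the least-loaded live
    server that does not already hold a replica of that unit.
    """
    # Count how many units each surviving server already serves.
    load = {server: 0 for server in live_servers if server != failed}
    for replicas in assignment.values():
        for server in replicas:
            if server in load:
                load[server] += 1

    result: Dict[str, List[str]] = {}
    for unit, replicas in assignment.items():
        kept = [server for server in replicas if server != failed]
        if len(kept) < len(replicas):  # this unit lost a replica
            candidates = sorted(
                (s for s in load if s not in kept),
                key=lambda s: (load[s], s),
            )
            if candidates:
                kept.append(candidates[0])
                load[candidates[0]] += 1
        result[unit] = kept
    return result
```

Units that did not live on the failed server keep their replica lists unchanged, so clients reading those units are not disturbed by the reassignment.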

GRUTER: YOUR PARTNER
IN THE BIG DATA REVOLUTION
Phone +82-70-8129-2950
Fax +82-70-8129-2952
E-mail contact@gruter.com
Web www.gruter.com
Gruter, Inc.
5F Sehwa Office Building 889-70 Daechi-dong, Gangnam-gu, Seoul, South Korea 135-839
