Kogo 2013-ngs galaxy
1. NGS Analysis using Galaxy
2013 Korea Genome Organization (KOGO) Winter Symposium, Bioinformatics Analysis Training Workshop
김형용, 이규열, 이성찬 _ 2013.02.05 ~ 2013.02.06
R&D Center, Insilicogen, Inc.
2. Index
01 Galaxy introduction
02 Galaxy examples 1,2
03 Galaxy installation
04 Galaxy function details
05 Galaxy examples 3,4
06 Galaxy tools
07 Galaxy on Grid
08 Galaxy on Cloud
3. Agenda
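Part / Time / Lecture content / Notes
Part 1: Introduction and Application
15:00 ~ 15:20 Galaxy introduction (presenter: 김형용)
15:20 ~ 15:50 Galaxy analysis example demo
  1. Finding the human exons with the most SNPs
  2. NGS QC and assembly example
16:00 ~ 16:20 Galaxy installation (presenter: 이성찬)
16:20 ~ 17:10 Galaxy installation and analysis example hands-on
  1. Galaxy installation hands-on
  2. Finding the human exons with the most SNPs hands-on
  3. NGS QC and assembly example hands-on
17:20 ~ 17:50 Galaxy function details (presenter: 김형용)
Part 2: Custom operation
09:00 ~ 09:20 Galaxy analysis example demo (presenter: 김형용)
  1. RNA-seq analysis example
  2. NGS analysis example 2
09:20 ~ 09:50 Galaxy analysis example hands-on
  1. RNA-seq analysis example
  2. NGS analysis example 2
10:00 ~ 10:20 Understanding Galaxy tools (presenter: 김형용)
10:20 ~ 11:00 Galaxy tool writing hands-on
  1. Primer design
11:10 ~ 11:30 Galaxy on Grid (presenter: 이규열)
  1. Understanding grids
  2. Distributed job demo
11:30 ~ 11:50 Galaxy on Cloud (presenter: 김형용)
  1. Understanding the cloud
  2. Galaxy on Amazon EC2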
Copyrightⓒ Insilicogen,Inc. 2011. All rights reserved. 3
5. Sequencer Comparison
Platform, instrument: read length | Gb/day | yield
Illumina HiSeq 2000: 2X100 bp | 55 | 600 Gb
Illumina HiSeq 1000: 2X100 bp | 35 | 300 Gb
Illumina HiScan SQ: 2X100 bp | 17.5 | 150 Gb
Illumina GAIIx: 2X150 bp | 6.5 | 95 Gb
454 GS FLX: 400 bp | 10h | 35 Mb
SOLiD 5500 (microbeads): see SOLiD read length | 10-15 | 90 Gb
SOLiD 5500xl (microbeads): see SOLiD read length | 20-30 | 180 Gb
SOLiD 5500xl (nanobeads): see SOLiD read length | 30-45 | 300 Gb
SOLiD read length: mate pair 60 bp X 60 bp; paired-end 75 bp X 35 bp; fragment 75 bp
Required input (Illumina): 50 ng with Nextera; 100 ng – 1 μg with TruSeq
Accuracy: Illumina 85% (2X50 bp, >Q30), 80% (2X100 bp, >Q30); 454 99% (>Q20); SOLiD 99.99%
Illumina Gb/day figures are from 2X100 bp runs
Illumina read length : 1X35, 2X50, 2X100
GA : 1X35, 2X50, 2X100, 2X150
6. Applications
Application of NGS Technique
Personal Genomics Environmentology
Microbiology Toxicology
Personal Genomics Chemical Biology
Mutation Detection
Structure Variation
Transcriptional Control
Interaction of DNA and Protein
7. Issue of New Genomic Era.
Many researchers, having invested in next generation sequencing instruments, now face a computational bottleneck in their research work-flow.
BGI
8. Most Significant Improvement to Your Next Generation Sequencing Workflow
(Source: The Global Outlook for Next Generation Sequencing: Usage, Platform Drivers & Workflow, October 31, 2011. BioInformatics, LLC)
9. Issue of New Genomic Era.
Bioinformatics
Library construction: DNA shearing; insert into high and/or low copy number vectors
Template purification: PCR amplicons; BACs; cosmids/fosmids
Sequence delineation: Big Dye; ABI 3730; data compilation
Finishing & Assembly: primer walking; transposon insertion methods; proprietary & commercial assembly
Sequence annotation: gene prediction; BLAST search
Secondary annotation: SNP; comparative genomics; expression analysis
Data delivery: FTP; web browser; commercial software
Axes: Cost, Process
10. Application of Next Genomic Data
12. What kind of?
• Biological Features
• Framework (Enterprise/Informatics) Features
• Service
• Price
13. List of NGS Frameworks
14. HugeSeq, a pipeline specialized for genetic variant detection
15. CLC Genomics Server, providing a user-friendly GUI environment
1. CLC Genomics Server
- Enterprise solution with a three-tier system architecture for data analysis, sharing, and management
2. CLC Bioinformatics Database
- Database for centralized storage, sharing, and management of data
3. CLC Assembly Cell
- Ultra-fast assembly solution for NGS data (command-line based)
4. CLC Genomics Workbench
- Solution for a wide range of bioinformatics analyses of NGS data (GUI based)
5. CLC Developer Kit
- Solution for customizing bioinformatics analysis tools and workflows to the user's needs
22. What is Galaxy
Galaxy, a web-based genome analysis platform
http://usegalaxy.org
• An open-source framework for integrating various computational tools and databases into a cohesive workspace
• A web-based service we provide, integrating many popular tools and resources for comparative genomics
• A completely self-contained application for building your own Galaxy style sites
23. Galaxy Usage
• One of the fastest growing open source bioinformatics projects,
a highly successful high throughput data analysis platform for
Life Sciences with over 15,000 users worldwide
• Annual Galaxy Community Conference
24. Galaxy visualization
External Genome Browser
UCSC
Ensembl
GBrowse
Trackster
Track/data viewer in web browser
HTML5 Canvas, jQuery
Renders in browser, not on server
26. Trackster
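29. Galaxy components
Main components of Galaxy
Datasources : specify the input data. Data from a separate local system or from an external website can be registered.
Tool : the smallest unit of analysis. In a local installation you can write and add your own tools.
History : the list of intermediate results produced as input data passes through a combination of tools.
Workflow : a History can produce new results simply by changing the input data and parameters, so it can be registered separately as a process.
Visualization : connects analysis results to visualization tools.
Page : a report-writing feature that brings the above elements together.
Example of an eprimer3 tool that was written separately and registered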
30. What a Galaxy tool is
Input format -> Tool -> Output format
A tool takes input data (in the expected format), processes it, and produces output data (in the expected format)
Combining tools gives a Workflow
31. Galaxy formats
Auto-detect: Automatically recognizes which format the data is in.
Ab1: A binary sequence file in 'ab1' format with a '.ab1' file extension. You must manually select this 'File Format' when uploading the file.
Axt: blastz pairwise alignment format. Each alignment block in an axt file contains three lines: a summary line and 2 sequence lines. Blocks are separated from one another by blank lines. The summary line contains chromosomal position and size information about the alignment. It consists of 9 required fields.
Bam: A binary file compressed in the BGZF format with a '.bam' file extension.
Bed: Tab delimited format (tabular). Does not require header line.
Fasta: A sequence in FASTA format consists of a single-line description, followed by lines of sequence data. The first character of the description line is a greater-than (">") symbol in the first column. All lines should be shorter than 80 characters.
FastqSolexa: Illumina (Solexa) variant of the Fastq format, which stores sequences and quality scores in a single file.
Gff: GFF lines have nine required fields that must be tab-separated.
Gff3: The GFF3 format addresses the most common extensions to GFF, while preserving backward compatibility with previous formats.
Interval (Genomic Intervals): Tab delimited format (tabular).
Lav: Lav is the primary output format for BLASTZ. The first line of a .lav file begins with #:lav.
MAF: TBA and multiz multiple alignment format. The first line of a .maf file begins with ##maf. This word is followed by white-space-separated "variable=value pairs". There should be no white space surrounding the "=".
Scf: A binary sequence file in 'scf' format with a '.scf' file extension. You must manually select this 'File Format' when uploading the file.
Sff: A binary file in 'Standard Flowgram Format' with a '.sff' file extension.
Tabular (tab delimited): Any data in tab delimited format (tabular).
Wig: The wiggle format is line-oriented. Wiggle data is preceded by a track definition line, which adds a number of options for controlling the default display of this track.
Other text type: Any text file.
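The Fasta description above can be illustrated with a minimal Python sketch: each line starting with ">" is treated as a description line, and the lines that follow are joined as sequence data. The function name read_fasta and the two toy records are made up for this illustration; this is not Galaxy's own format detection code.

```python
def read_fasta(lines):
    """Yield (description, sequence) pairs from an iterable of FASTA lines."""
    description, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if description is not None:
                yield description, "".join(chunks)
            description, chunks = line[1:], []
        elif description is not None:
            chunks.append(line)
    if description is not None:
        yield description, "".join(chunks)


# Toy input (hypothetical records, not workshop data).
example = """>seq1 toy record
ACGTAC
GTTA
>seq2
GGCC
""".splitlines()

records = list(read_fasta(example))
print(records)
```

Sequence lines that belong to one record are concatenated, so a record wrapped over several lines is returned as a single string.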
32. Galaxy features, once more
Recent trends in Galaxy usage
Biologist
NGS analysis features built in
Galaxy URLs provided in papers
Use of Amazon Cloud
Transparent analysis
Bioinformatician
Galaxy features, once more
Galaxy is written in Python, but extensions do not have to be written in Python
"Transparent" analysis flows can be built, shared, and extended.
Almost any bioinformatics analysis can be done in Galaxy.
"We would hire someone just for being good at Galaxy" (NCBI)
…
34. Example 1.
Finding Human Exons with the highest number of SNPs
1. Download all Human Exons from NCBI or Ensembl BioMart or UCSC TableBrowser
2. Download all Human SNPs from …
3. Scripting
Join 1, 2 according to position
Group by Exon id
Sort by SNP count
Filter Exon which has more than 10 SNPs
Have to do programming! (Python, Perl, …)
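As a rough idea of what that scripting step involves, here is a minimal Python sketch of the join, group, sort, and filter steps on toy data. The interval tuples, the function names, and the more_than threshold parameter are all assumptions for illustration (the slide's cutoff is more than 10 SNPs; the toy call uses 0 so that something survives the filter). A real script would read the downloaded exon and SNP files and would need a faster overlap search than this nested loop.

```python
from collections import Counter

# Toy intervals: (chrom, start, end, name), half-open coordinates as in BED.
exons = [
    ("chr1", 100, 200, "exonA"),
    ("chr1", 300, 400, "exonB"),
    ("chr2", 50, 150, "exonC"),
]
snps = [
    ("chr1", 120, 121, "rs1"),
    ("chr1", 150, 151, "rs2"),
    ("chr1", 350, 351, "rs3"),
    ("chr2", 500, 501, "rs4"),
]


def count_snps_per_exon(exons, snps):
    """Join exons and SNPs by overlapping position, grouped by exon name."""
    counts = Counter()
    for e_chrom, e_start, e_end, e_name in exons:
        for s_chrom, s_start, s_end, _ in snps:
            if e_chrom == s_chrom and s_start < e_end and s_end > e_start:
                counts[e_name] += 1
    return counts


def top_exons(counts, more_than=10):
    """Sort by SNP count (descending), keep exons with more than `more_than` SNPs."""
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return [(name, n) for name, n in ranked if n > more_than]


counts = count_snps_per_exon(exons, snps)
ranked = top_exons(counts, more_than=0)
print(ranked)
```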
35. On Galaxy
http://usegalaxy.org
36. On Galaxy
Get data > UCSC Main : fetch the exon data
Get data > UCSC Main : fetch the SNP data
Operate on Genomic Intervals > Join : extract the exons whose regions overlap SNPs
Join, Subtract and Group > Group : group by exon name and count SNPs
Filter and Sort > Sort : sort the exons by SNP count
Text Manipulation > Select first : extract the top 5 exons with the most SNPs
Join, Subtract and Group > Compare two Datasets : recover the exon information that was lost
38. Example 2.
Human NGS data QC and assembly
1. NGS Quality Control
2. NGS Single End Mapping
3. SNP Calling
4. Compare with dbSNP
Have to do in Unix and need programming! (Python, Perl, …)
39. On Galaxy
http://usegalaxy.org
40. On Galaxy
Additional programs must be installed for NGS analysis
( http://wiki.galaxyproject.org/Admin/NGS%20Local%20Setup )
Program | Used for | Installation
Fastx-toolkit | NGS QC | Ubuntu apt-get
Gnuplot | NGS QC boxplot | Ubuntu apt-get
Bowtie2 | Reference assembly | copy, then set PATH
SAMTools | SNP calling | Ubuntu apt-get
41. On Galaxy
Get data > Upload File : upload the human Illumina fastq file
NGS: QC and manipulation > FASTQ Groomer : convert to fastqsanger format
NGS: QC and manipulation > Compute quality statistics : view fastq quality statistics
NGS: QC and manipulation > Draw quality score boxplot : draw a boxplot from the fastq quality statistics
NGS: QC and manipulation > FASTQ Trimmer, Quality Trimmer, Masker : trim or mask uninformative regions
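To give a feel for what a quality statistics step computes, the following Python sketch decodes Sanger-style quality strings (Phred score = ASCII code minus 33) and averages them per read position. It assumes strictly four-line FASTQ records, and the function names and the two toy reads are invented for illustration; it is not the code behind the Galaxy tools named above.

```python
def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality string into integer Phred scores."""
    return [ord(ch) - offset for ch in quality_line]


def per_position_mean(fastq_lines):
    """Return the mean quality at each read position across all records."""
    totals, counts = [], []
    for i, line in enumerate(fastq_lines):
        if i % 4 != 3:  # the 4th line of each record holds the qualities
            continue
        for pos, q in enumerate(phred_scores(line.strip())):
            if pos == len(totals):
                totals.append(0)
                counts.append(0)
            totals[pos] += q
            counts[pos] += 1
    return [t / c for t, c in zip(totals, counts)]


# Two toy reads: 'I' decodes to 40, '5' to 20, '!' to 0.
example = [
    "@read1", "ACGT", "+", "IIII",
    "@read2", "ACGT", "+", "II5!",
]
means = per_position_mean(example)
print(means)
```

Falling means toward the end of the reads are the kind of pattern a quality boxplot makes visible, and they motivate the trimming step.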
42. On Galaxy
Get data > Upload File : input the reference sequence for reference assembly
NGS: Mapping > Bowtie2 : assembly using Bowtie2
NGS: SAM Tools > MPileup : extract SNP and indel information from the BAM file
NGS: SAM Tools > Filter pileup : keep the high-scoring SNPs and indels
NGS: SAM Tools > Pileup-to-interval : convert to Genomic interval format
Get data > UCSC Main : fetch the dbSNP data
Operate on Genomic Intervals > Join : extract the SNPs whose regions overlap
44. Install Virtualbox - Ubuntu
1. Copy the Virtualbox and Galaxy folders from the USB drive.
2. Install Virtualbox.
3. Start Virtualbox, then import the Galaxy image.
4. In the settings, change the network to Bridge mode.
5. After booting Ubuntu, delete the network settings file.
rm /etc/udev/rules.d/70-persistent-net.rules
6. Restart Linux (Ubuntu).
sudo shutdown -h now
45. Creating your own Galaxy
46. Running Galaxy in a production environment
By default, Galaxy uses
SQLite database
Built-in HTTP server for all tasks
Local job runner
Single process
Simplest error-proof configuration
Change configuration for service
Disable the developer settings: use_interactive = False, use_debug = False
Get a real database: PostgreSQL
Offload the menial tasks (proxy): nginx, Apache
Let your tools free (cluster): move intensive processing to other hosts (TORQUE, GRID, DRMAA)
Other advanced settings
47. Galaxy on Cluster
Intensive processes to other hosts
TORQUE
GRID
DRMAA
Working with Galaxy on the Cloud
49. Virtualization
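Virtualization
• Refers to the abstraction of computer resources
• Creates virtual versions of physical resources.
• A technology that logically divides one physical hardware resource into several, or
• logically combines several hardware resources into one
• Has recently drawn attention as a way to solve problems such as hardware management and disaster recovery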
50. Virtualization
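Advantages of virtualization
• Cost reduction
One server can be partitioned into several servers
Lower server purchase, electricity, floor space, and server management costs
• Efficient use of resources
Idle server resources can be used to create virtual machines, so resources are used efficiently
• Stable operation
Servers can be backed up as images and moved easily, allowing a quick response to failures
• Continued operation of software
When the server hardware reaches the end of its life cycle, OS vendors stop supporting its device drivers
-> this causes migration problems
Because the existing system runs on a virtual machine, device driver problems do not arise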
51. Used as a basic building block of cloud services
53. Example of Cloud
Source : iSC 2012 Amazon HPC session
54. Running Galaxy Web server
1. Check your computer's IP address.
ifconfig
2. Move to the Galaxy folder.
cd galaxy-dist
3. Start the Galaxy web server.
sh run.sh
4. In a web browser on your host OS (Windows), enter the following in the address bar.
IP Address:8080 (e.g., 172.20.8.162:8080)
66. Example 3.
Human RNA-seq
1. RNA-seq result: adrenal_1,2.fastq, brain_1,2.fastq
2. Reference: iGenome UCSC hg19, chr19 gene annotation (GTF format)
Have to do in Unix and need programming! (Python, Perl, …)
67. On Galaxy
http://usegalaxy.org
68. On Galaxy
Additional programs must be installed for RNA-seq analysis
( http://wiki.galaxyproject.org/Admin/NGS%20Local%20Setup )
Program | Used for | Installation
java | FastQC | Ubuntu apt-get install openjdk-7-jre
FastQC | NGS QC | copy to tool-data/shared/jars/
Tophat | RNA-seq mapping | (see next page)
Cufflinks | RNA-seq assembly | Ubuntu apt-get install cufflinks
69. Tophat install in Ubuntu
# Copy both archives to the work directory and build there
$ cp samtools-0.1.18.tar.bz2 tophat-1.4.1.tar.gz ~/work
$ cd ~/work
# Build samtools first; Tophat needs its libbam.a and headers
$ bzip2 -d samtools-0.1.18.tar.bz2
$ tar xvf samtools-0.1.18.tar
$ cd samtools-0.1.18
$ make
$ cd ..
$ tar zxvf tophat-1.4.1.tar.gz
$ cd tophat-1.4.1
# Package names as given on the slide; installing needs root
$ sudo apt-get install libboost libbam libboost-thread-dev
$ sudo cp ../samtools-0.1.18/libbam.a /usr/local/lib
$ sudo mkdir /usr/local/include/bam
$ sudo cp ../samtools-0.1.18/*.h /usr/local/include/bam
$ ./configure
$ make
$ sudo make install
70. On Galaxy
Get data > Upload File : upload the fastq, chr19.fa, and gtf files
NGS: QC and manipulation > FASTQ Groomer : convert to fastqsanger format
NGS: QC and manipulation > FastQC:Read QC : view fastq quality statistics
NGS: RNA Analysis > Tophat for Illumina : find splice junctions in the RNA-seq fastq data, using chr19.fa as the reference
NGS: RNA Analysis > Cufflinks : transcript assembly, FPKM estimation
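FPKM stands for fragments per kilobase of transcript per million mapped fragments. The Python sketch below shows only that normalization, with made-up numbers; Cufflinks itself estimates the fragment counts with a statistical model that is not reproduced here, and the function name fpkm is chosen just for this illustration.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


# Hypothetical transcript: 500 fragments on a 2,000 bp transcript,
# in a library with 10 million mapped fragments.
value = fpkm(fragments=500, transcript_length_bp=2000,
             total_mapped_fragments=10_000_000)
print(value)
```

Dividing by both transcript length and library size is what makes values comparable between long and short transcripts and between samples such as brain and adrenal.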
71. On Galaxy
NGS: RNA Analysis > Cuffmerge : merge the brain and adrenal data against the reference
NGS: RNA Analysis > Cuffdiff : find significant expression changes
75. Creating your own Galaxy
76. Primer design tool
77. Primer3
Primer3
• Primer design program
• http://primer3.sourceforge.net/releases.php
• Download from http://sourceforge.net/projects/primer3/files/primer3/1.1.4/primer3-1.1.4.tar.gz
• make & copy to PATH
eprimer3
• Wrapper for Primer3, used in the EMBOSS package
• Easy command line interface
• http://emboss.sourceforge.net/apps/release/6.4/emboss/apps/eprimer3.html
• apt-get install emboss
78. eprimer3
$ eprimer3 -sequence INPUT_FASTA_FILE -outfile PRIMER_DESIGN_RESULT -osize OSIZE -gcclamp GCCLAMP …
# EPRIMER3 RESULTS FOR GL020027.1
# Start Len Tm GC% Sequence
1 PRODUCT SIZE: 199
FORWARD PRIMER 571071 20 60.06 45.00 CTTGCCAATAGCGAATGGAT
REVERSE PRIMER 571250 20 59.99 55.00 GACGGCGTAGATCTTCAAGC
2 PRODUCT SIZE: 199
FORWARD PRIMER 55074 20 60.05 55.00 TAACACCACTGCTCCTGCTG
REVERSE PRIMER 55253 20 59.97 50.00 CATTGCATGGTCAGAACCAC
3 PRODUCT SIZE: 200
FORWARD PRIMER 71990 20 60.03 45.00 GGGGTTGATTTTCATTGTGG
REVERSE PRIMER 72170 20 59.88 45.00 GTTTGCACCAACCTGGTTTT
4 PRODUCT SIZE: 200
FORWARD PRIMER 427182 20 59.83 50.00 CTGATGTGCTCTGTGGGAAA
REVERSE PRIMER 427362 20 60.01 55.00 CCGTGTATGTAGCCCGAGTT
5 PRODUCT SIZE: 197
FORWARD PRIMER 427185 20 59.97 50.00 ATGTGCTCTGTGGGAAAACC
REVERSE PRIMER 427362 20 60.01 55.00 CCGTGTATGTAGCCCGAGTT
We want to change this result format so that it can be used as input to other Galaxy tools.
So we build our own primer design Galaxy tool.
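One way to turn such a report into something other tools can read is to flatten it into tab-separated rows. The Python sketch below does this for the layout shown on this slide. It is not the workshop's eprimer3.py (which is not reproduced in this transcript); the function name and the column order are assumptions made for illustration.

```python
def eprimer3_to_tabular(report_text):
    """Flatten an eprimer3 report into tab-separated rows, one per primer.

    Assumed columns: pair id, direction, start, length, Tm, GC%, sequence.
    """
    rows = []
    pair_id = None
    for line in report_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1:3] == ["PRODUCT", "SIZE:"]:
            pair_id = fields[0]  # e.g. "1 PRODUCT SIZE: 199"
        elif (pair_id is not None and len(fields) == 7
              and fields[0] in ("FORWARD", "REVERSE")
              and fields[1] == "PRIMER"):
            direction = fields[0]
            start, length, tm, gc, seq = fields[2:]
            rows.append("\t".join([pair_id, direction, start, length, tm, gc, seq]))
    return rows


# First primer pair from the slide, used as sample input.
report = """# EPRIMER3 RESULTS FOR GL020027.1
1 PRODUCT SIZE: 199
FORWARD PRIMER 571071 20 60.06 45.00 CTTGCCAATAGCGAATGGAT
REVERSE PRIMER 571250 20 59.99 55.00 GACGGCGTAGATCTTCAAGC
"""
rows = eprimer3_to_tabular(report)
print("\n".join(rows))
```

Tab-separated output fits Galaxy's tabular format, so generic tools such as sort and filter can work on the result.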
79. eprimer3.xml
80. eprimer3.py
82. EMBOSS eprimer3 tool added
83. Hands-on
Install Primer3 : compile with make, then add primer3_core to PATH
Install EMBOSS : sudo apt-get install emboss
Install Biopython : sudo apt-get install python-biopython
Copy eprimer3.py, eprimer3.xml to galaxy-dist/tools/mytools/ : create the mytools directory yourself
Edit tool_conf.xml : register mytools/eprimer3.xml
85. Grid vs Cluster
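In common: a computation over large data is split into many small computations, which are distributed across several small computers
Differences (Grid):
Connects machines of different types over a WAN
Connects diverse platforms to each other
No limit on the number of connected machines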
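The shared idea, splitting one large job into small pieces and running them on separate workers, can be sketched in Python on a single machine. Here worker threads stand in for compute nodes, and the GC-counting task, the function names, and the toy sequences are all assumptions for illustration; a real grid or cluster would hand the chunks to a scheduler rather than to a thread pool.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, n_chunks):
    """Split a list into at most n_chunks contiguous pieces."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def gc_count(sequences):
    """The small computation each worker runs on its own chunk."""
    return sum(seq.count("G") + seq.count("C") for seq in sequences)


sequences = ["ACGT", "GGCC", "ATAT", "CGCG", "TTTT", "GCGC"]
chunks = split_into_chunks(sequences, n_chunks=3)

# Each worker thread plays the role of one compute node.
with ThreadPoolExecutor(max_workers=3) as pool:
    partial = list(pool.map(gc_count, chunks))

total = sum(partial)  # combine the partial results
print(partial, total)
```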
86. Grid
87. Globus Toolkit
A representative computational grid middleware
Open source toolkit for building computing grids
developed and provided by Globus Alliance
Standards implementation
• Open Grid Service Architecture (OGSA)
• Open Grid Service Infrastructure (OGSI)
• Web Services Resource Framework (WSRF)
• Job Submission Description Language (JSDL)
• Distributed Resource Management Application API (DRMAA)
• SOAP
• WSDL
• Grid Security Infrastructure
88. DRMAA
High-level Open Grid Forum API specification for submission and control of jobs to a Distributed Resource Management (DRM, job scheduler) system, such as a Cluster or Grid computing infrastructure
89. PBS (Portable Batch System)
Computer software that performs job scheduling in Unix cluster environment
A component of the Globus Toolkit
Originally developed by NASA
Following versions
• OpenPBS
• TORQUE – a fork of OpenPBS
• PBS Professional (PBS pro) - commercial
90. TORQUE
Distributed resource manager providing control over batch jobs and distributed compute nodes
It stands for Terascale Open Source Resource and QUEue Manager
It holds configuration information for the slave nodes (number of CPUs, number of cores, RAM size, temporary storage, and so on) and allocates cluster resources when the scheduler requests them
Diagram: Master, Slave 1, Slave 2, Slave 3, NFS
> qsub a.sh
The a.sh command is handed to a slave as decided by the scheduler
93. Cloud computing
Delivery of computing and storage capacity as a service to a heterogeneous community of end-recipients.
95. VPS (Virtual Private Server)
A term used by Internet hosting services to refer to a virtual machine in a cloud
96. Amazon EC2 (Amazon Elastic Compute Cloud)
Virtualization + Grid (Cluster) computing in a Cloud
101. Galaxy on Cloud
Using Amazon EC2 + S3
Select AMIs in Community AMIs
107. Galaxy on Insilicogen
Galaxy localization on cluster
Tool development
Workflow development