SlideShare ist ein Scribd-Unternehmen logo
1 von 46
Downloaden Sie, um offline zu lesen
Managing data

Joachim Jacob
8 and 15 November 2013
Bioinformatics data
Historically, bioinformatics has always used text files
to store data.
PDB file excerpt

Genbank record

HMM profile
NGS data
The NGS machines spit a lot of data, stored in plain
text files. These files are multiple gigabytes in size.
Tips for managing NGS data
1. When you move the data, do it in its smallest form.
→ Compress the data.
2. When you unpack the data, leave it where it is.
→ Symbolic links point to the data in different
folders.

3. Provide enough storage for your data.
→ choose your file system type wisely
Compression: tools in Linux

And some more exist...

http://www.linuxlinks.com/article/20110220091109939/CompressionTools.html
Tips
Widely used compression tools:
● GNU zip (gzip)
● Block Sorting compression (bzip2)
Typically, compression tools work on one file.
How to compress directories and their contents?
Tar without compression
Tar (Tape Archive) is a tool for bundling a set of files
or directories into a single archive. The resulting
file is called a tar ball.
Syntax to create a tarball:
$ tar -cf archive.tar file1 file2
Syntax to extract:
$ tar -xvf /path/to/archive.tar
Compression: a typical case
Archiving and compression mostly occur together.
The most used formats are tar.gz or tar.bz. These
files are the result of two processes.

Archiving (tar)
Compressing
(gzip or bzip2)
Compression: on your desktop
Compression: on your desktop
Compression: on the command line
Tar is the tool for creating .tar archives, but it can
compress in one go, with the z or j option.
Creating a compressed tar archive:
$ tar cvfz mytararchive.tar.gz
$ tar cvfj mytararchive.tar.bz
create

Compression technique

Decompressing a compressed tar archive
$ tar xvfz mytararchive.tar.gz
$ tar xvfj mytararchive.tar.bz
extract

files

verbose

docs/
docs/
De-/compression
To compress one or more files:
$ gzip [options] file
$ bzip2 [options] file
To decompress one or more files:
$ gunzip [options] file(s)
$ bunzip2 [options] file(s)
Tips
Many compression tools on the command line allow
to read compressed files (instead of first unpacking
then reading).

$ zcat file(s)
$ bzcat file(s)
Compression is always a balance between time and
compression ratio. Gzip is faster, bzip2 compresses
harder.
If compression is important to you: benchmark!
Exercise
→ a little compression exercise.
Symlinks
Pay attention. Something very convenient!
A symbolic link (or symlink) is a file which points to the
location of the linked-to file. You can do anything with the
symlink that you can do on the original file. As you move
the original file from its location, the symlink is 'dead'.
Downloads/

~

Annotation/
Rice/
Projects/
Butterfly/

Sequences/ alignment.sam
Symlinks
To create a symlink, move to the folder in where the symlink
must be created, and execute ln.
~/Projects $ cd Butterfly
~/Butterfly $ ln -s ../Rice/Sequences/alignment.sam
Link_to_alignment.sam

Downloads/

~

Annotation/
Rice/
Projects/
Butterfly/

Sequences/ alignment.sam
Symlinks
The symlink is created. You can check with ls.
To delete a symlink, use unlink.
~/Projects $ cd Butterfly
~/Butterfly $ ln -s ../Rice/Sequences/alignment.sam
Link_to_alignment.sam
~/Butterfly $ ls -lh Link_to_alignment.sam
lrwxrwxrwx 1 joachim joachim 44 Oct 22 14:47
Link_to_alignment.sam -> ../Sequences/alignment.sam

Downloads/

~

Annotation/
Rice/
Projects/

Sequences/ alignment.sam

Butterfly/Link_to_alignment.sam
Exercise
→ a little symlink exercise
Disks and storage
If you dive into bioinformatics, you will have to
manage disks and storage.
Two types of disks
- solid state disks
Low capacity, high speed, random writes
- spinning hard disks
High capacity, 'normal' speed,
sequential writes.
http://en.wikipedia.org/wiki/Solid-state_drive
http://en.wikipedia.org/wiki/Hard_disk
A disk is a device
Via the terminal, show the disks using
$ sudo fdisk -l
[sudo] password for joachim:
Disk /dev/sda: 13.4 GB, 13408141312
bytes
...
Disk /dev/sdb: 3997 MB, 3997171712 bytes
...
A disk is divided into partitions
A disk can be divided in parts, called partitions.
An internal disk which runs an operating system is
usually divided in partitions, one for each functions.
An external disk is usually not divided in partitions.
Check out the disk utility tool
The system disk

Name of the disk
The system disk

Name currently highlighted partition
The system disk

Place in the directory structure
where the partition can be accessed
An example of an USB disk
-

Place in the directory structure
where the partition can be accessed
An example of an USB disk
The USB disk is 'mounted' automatically on the
directory tree under /media.
An example of an USB disk
-

This is the type of file system
on the partition.
The partition is said to be formatted
in FAT32 (in this case).
File system formats
By default, many USB flash disks are formatted in
FAT32.
Other types are NTFS, ext4, ZFS.
FAT32 – max 4GB files
NTFS – maximum portability (also for use under windows)
Ext4 – default file system in Linux,

http://en.wikipedia.org/wiki/File_system#File_systems_and_operating_systems
An example of an USB disk
First unmount the device.
Next, choose format the device.

1

2
Format disks with disk utility
Choose the type of file
system you want to be on
that device.
Format disks with disk utility
Format disks with disk utility
You don't want to know all the commands that work
behind the gnome-disk-utility for you.
But if you do:
- mount
- umount
- fdisk
- mkfs
You can read the man pages and search for guides on
the internet if you want to get to know these (out of
scope for this course).
Checking storage space
By default 'disk usage analyzer'.
Checking storage space
Bonus: K4DirStat. Not installed by default.
Checking storage space
Bonus: K4DirStat. Not installed by default.
K4Dirstat is a KDE package
Rehearsal: what is KDE?
Bonus: what happens when you install this
package on our system?
Space left on disks with df
To check the storage that is used on the
different disks.
~/ $ df -h

Filesystem
/dev/sda1
udev
tmpfs
none
none
/dev/sdb1
~/ $ df -h .

Size
12G
490M
200M
5.0M
498M
3.8G

Used Avail Use% Mounted on
5.3G 5.7G 49% /
4.0K 490M
1% /dev
920K 199M
1% /run
0 5.0M
0% /run/lock
76K 498M
1% /run/shm
20M 3.7G
1% /media/test
The size of directories
To check the size of files or directories.
~/ $ du -sh *
520K bin
281M Compression_exercise
4.0K Desktop
4.0K Documents
5.0M Downloads
4.0K Music
4.0K Pictures
4.0K Public
373M Rice Example
4.0K Templates
4.0K test
17M
test.img
114M ugene-1.12.2
4.0K Videos
Wildcards on the command line
Wildcards are used to describe the names of files/dirs.
[]
On that position, the character may be one of the
characters between [ ],
e.g. saniti[sz]ation matches: sanitisation and sanitization
?
On that position, any character is allowed.
e.g. saniti?ation matches: sanitisation, sanitiration, ...
*
On that position, any length of string is allowed
e.g. s* matches: san, sdd, sanitisation, sam.alignment,...
Wildcards on the command line
Many tools that require an argument to point to
files or directories accept these wildcards.
~/ $ du -sh Do*
Wildcards on the command line
Many tools that require an argument to point to
files or directories accept these wildcards.
~/ $ du -sh Do*
4.0K Documents
20G
Downloads
Wildcards on the command line
Many tools that require an argument to point to
files or directories accept these wildcards.
~/ $ ls *.fastq
Wildcards on the command line
Many tools that require an argument to point to
files or directories accept these wildcards.
~/ $ ls *.fastq
ERR148552_1.fastq
testout.fastq
ERR148552_1_prinseq_good_zzwI.fastq

ERR148552_2.fastq
test.fastq
Keywords
Compression
Archive
Symbolic link
mounting
File system format
partition
Recursively
df
du
unlink
Write in your own words what the terms mean
Break

Weitere Àhnliche Inhalte

Was ist angesagt?

Linux presentation
Linux presentationLinux presentation
Linux presentation
Nikhil Jain
 

Was ist angesagt? (20)

BITS: Introduction to Linux - Text manipulation tools for bioinformatics
BITS: Introduction to Linux - Text manipulation tools for bioinformaticsBITS: Introduction to Linux - Text manipulation tools for bioinformatics
BITS: Introduction to Linux - Text manipulation tools for bioinformatics
 
Linux Introduction (Commands)
Linux Introduction (Commands)Linux Introduction (Commands)
Linux Introduction (Commands)
 
Part 1 of 'Introduction to Linux for bioinformatics': Introduction
Part 1 of 'Introduction to Linux for bioinformatics': IntroductionPart 1 of 'Introduction to Linux for bioinformatics': Introduction
Part 1 of 'Introduction to Linux for bioinformatics': Introduction
 
A Quick Introduction to Linux
A Quick Introduction to LinuxA Quick Introduction to Linux
A Quick Introduction to Linux
 
Terminal Commands (Linux - ubuntu) (part-1)
Terminal Commands  (Linux - ubuntu) (part-1)Terminal Commands  (Linux - ubuntu) (part-1)
Terminal Commands (Linux - ubuntu) (part-1)
 
Basic linux commands
Basic linux commandsBasic linux commands
Basic linux commands
 
Linux
Linux Linux
Linux
 
Linux presentation
Linux presentationLinux presentation
Linux presentation
 
Linux class 8 tar
Linux class 8   tar  Linux class 8   tar
Linux class 8 tar
 
Linux Getting Started
Linux Getting StartedLinux Getting Started
Linux Getting Started
 
Linux Command Suumary
Linux Command SuumaryLinux Command Suumary
Linux Command Suumary
 
Linux command for beginners
Linux command for beginnersLinux command for beginners
Linux command for beginners
 
Linux Commands
Linux CommandsLinux Commands
Linux Commands
 
Basic 50 linus command
Basic 50 linus commandBasic 50 linus command
Basic 50 linus command
 
Linux basics part 1
Linux basics part 1Linux basics part 1
Linux basics part 1
 
Linux Shell Basics
Linux Shell BasicsLinux Shell Basics
Linux Shell Basics
 
50 most frequently used unix linux commands (with examples)
50 most frequently used unix   linux commands (with examples)50 most frequently used unix   linux commands (with examples)
50 most frequently used unix linux commands (with examples)
 
Linux commands
Linux commands Linux commands
Linux commands
 
Linux Administration
Linux AdministrationLinux Administration
Linux Administration
 
Linux Directory Structure
Linux Directory StructureLinux Directory Structure
Linux Directory Structure
 

Andere mochten auch

Lokala banksystem utan vinstkrav - för tillvÀxt och hÄllbar utveckling
Lokala banksystem utan vinstkrav - för tillvÀxt och hÄllbar utvecklingLokala banksystem utan vinstkrav - för tillvÀxt och hÄllbar utveckling
Lokala banksystem utan vinstkrav - för tillvÀxt och hÄllbar utveckling
Jonas Lagander
 
Besök kimstad rapport förstudie
Besök kimstad   rapport förstudieBesök kimstad   rapport förstudie
Besök kimstad rapport förstudie
Jonas Lagander
 
Vnti11 basics course
Vnti11 basics courseVnti11 basics course
Vnti11 basics course
BITS
 

Andere mochten auch (20)

RNA-seq: Mapping and quality control - part 3
RNA-seq: Mapping and quality control - part 3RNA-seq: Mapping and quality control - part 3
RNA-seq: Mapping and quality control - part 3
 
BITS - Comparative genomics on the genome level
BITS - Comparative genomics on the genome levelBITS - Comparative genomics on the genome level
BITS - Comparative genomics on the genome level
 
RNA-seq: analysis of raw data and preprocessing - part 2
RNA-seq: analysis of raw data and preprocessing - part 2RNA-seq: analysis of raw data and preprocessing - part 2
RNA-seq: analysis of raw data and preprocessing - part 2
 
BITS - Introduction to comparative genomics
BITS - Introduction to comparative genomicsBITS - Introduction to comparative genomics
BITS - Introduction to comparative genomics
 
RNA-seq for DE analysis: the biology behind observed changes - part 6
RNA-seq for DE analysis: the biology behind observed changes - part 6RNA-seq for DE analysis: the biology behind observed changes - part 6
RNA-seq for DE analysis: the biology behind observed changes - part 6
 
RNA-seq for DE analysis: extracting counts and QC - part 4
RNA-seq for DE analysis: extracting counts and QC - part 4RNA-seq for DE analysis: extracting counts and QC - part 4
RNA-seq for DE analysis: extracting counts and QC - part 4
 
BITS - Protein inference from mass spectrometry data
BITS - Protein inference from mass spectrometry dataBITS - Protein inference from mass spectrometry data
BITS - Protein inference from mass spectrometry data
 
BITS - Genevestigator to easily access transcriptomics data
BITS - Genevestigator to easily access transcriptomics dataBITS - Genevestigator to easily access transcriptomics data
BITS - Genevestigator to easily access transcriptomics data
 
BITS - Comparative genomics: the Contra tool
BITS - Comparative genomics: the Contra toolBITS - Comparative genomics: the Contra tool
BITS - Comparative genomics: the Contra tool
 
RNA-seq quality control and pre-processing
RNA-seq quality control and pre-processingRNA-seq quality control and pre-processing
RNA-seq quality control and pre-processing
 
RNA-seq for DE analysis: detecting differential expression - part 5
RNA-seq for DE analysis: detecting differential expression - part 5RNA-seq for DE analysis: detecting differential expression - part 5
RNA-seq for DE analysis: detecting differential expression - part 5
 
RNA-seq: general concept, goal and experimental design - part 1
RNA-seq: general concept, goal and experimental design - part 1RNA-seq: general concept, goal and experimental design - part 1
RNA-seq: general concept, goal and experimental design - part 1
 
Lokala banksystem utan vinstkrav - för tillvÀxt och hÄllbar utveckling
Lokala banksystem utan vinstkrav - för tillvÀxt och hÄllbar utvecklingLokala banksystem utan vinstkrav - för tillvÀxt och hÄllbar utveckling
Lokala banksystem utan vinstkrav - för tillvÀxt och hÄllbar utveckling
 
Genevestigator
GenevestigatorGenevestigator
Genevestigator
 
BITS: Introduction to Linux - Software installation the graphical and the co...
BITS: Introduction to Linux -  Software installation the graphical and the co...BITS: Introduction to Linux -  Software installation the graphical and the co...
BITS: Introduction to Linux - Software installation the graphical and the co...
 
Projekt sociala ekonomin i motala - slutrapport 2015
Projekt sociala ekonomin i motala - slutrapport 2015Projekt sociala ekonomin i motala - slutrapport 2015
Projekt sociala ekonomin i motala - slutrapport 2015
 
Besök kimstad rapport förstudie
Besök kimstad   rapport förstudieBesök kimstad   rapport förstudie
Besök kimstad rapport förstudie
 
BITS training - UCSC Genome Browser - Part 2
BITS training - UCSC Genome Browser - Part 2BITS training - UCSC Genome Browser - Part 2
BITS training - UCSC Genome Browser - Part 2
 
Vnti11 basics course
Vnti11 basics courseVnti11 basics course
Vnti11 basics course
 
BITS: UCSC genome browser - Part 1
BITS: UCSC genome browser - Part 1BITS: UCSC genome browser - Part 1
BITS: UCSC genome browser - Part 1
 

Ähnlich wie Managing your data - Introduction to Linux for bioinformatics

beginner.en.print
beginner.en.printbeginner.en.print
beginner.en.print
aniruddh Tyagi
 
beginner.en.print
beginner.en.printbeginner.en.print
beginner.en.print
aniruddh Tyagi
 
beginner.en.print
beginner.en.printbeginner.en.print
beginner.en.print
Aniruddh Tyagi
 
Linux Presentation
Linux PresentationLinux Presentation
Linux Presentation
Muhammad Qazi
 
File system discovery
File system discovery File system discovery
File system discovery
MOHAMED Elshawaf
 
Management file and directory in linux
Management file and directory in linuxManagement file and directory in linux
Management file and directory in linux
Zkre Saleh
 
101 4.1 create partitions and filesystems
101 4.1 create partitions and filesystems101 4.1 create partitions and filesystems
101 4.1 create partitions and filesystems
AcĂĄcio Oliveira
 
Compression
CompressionCompression
Compression
aswathyu
 
Ch12 system administration
Ch12 system administration Ch12 system administration
Ch12 system administration
Raja Waseem Akhtar
 

Ähnlich wie Managing your data - Introduction to Linux for bioinformatics (20)

C) ICT Application
C) ICT ApplicationC) ICT Application
C) ICT Application
 
Unix Administration 4
Unix Administration 4Unix Administration 4
Unix Administration 4
 
beginner.en.print
beginner.en.printbeginner.en.print
beginner.en.print
 
beginner.en.print
beginner.en.printbeginner.en.print
beginner.en.print
 
beginner.en.print
beginner.en.printbeginner.en.print
beginner.en.print
 
Linux Presentation
Linux PresentationLinux Presentation
Linux Presentation
 
Module 3 Using Linux Softwares.
Module 3 Using Linux Softwares.Module 3 Using Linux Softwares.
Module 3 Using Linux Softwares.
 
Root file system for embedded systems
Root file system for embedded systemsRoot file system for embedded systems
Root file system for embedded systems
 
File system discovery
File system discovery File system discovery
File system discovery
 
Management file and directory in linux
Management file and directory in linuxManagement file and directory in linux
Management file and directory in linux
 
Command Line Tools
Command Line ToolsCommand Line Tools
Command Line Tools
 
Unix Administration
Unix AdministrationUnix Administration
Unix Administration
 
File Management (1).pptx
File Management (1).pptxFile Management (1).pptx
File Management (1).pptx
 
Edubooktraining
EdubooktrainingEdubooktraining
Edubooktraining
 
101 4.1 create partitions and filesystems
101 4.1 create partitions and filesystems101 4.1 create partitions and filesystems
101 4.1 create partitions and filesystems
 
Disk and File System Management in Linux
Disk and File System Management in LinuxDisk and File System Management in Linux
Disk and File System Management in Linux
 
Linux
LinuxLinux
Linux
 
Compression
CompressionCompression
Compression
 
Ch12 system administration
Ch12 system administration Ch12 system administration
Ch12 system administration
 
Data hiding and finding on Linux
Data hiding and finding on LinuxData hiding and finding on Linux
Data hiding and finding on Linux
 

Mehr von BITS

Bits protein structure
Bits protein structureBits protein structure
Bits protein structure
BITS
 

Mehr von BITS (10)

BITS - Comparative genomics: gene family analysis
BITS - Comparative genomics: gene family analysisBITS - Comparative genomics: gene family analysis
BITS - Comparative genomics: gene family analysis
 
BITS - Overview of sequence databases for mass spectrometry data analysis
BITS - Overview of sequence databases for mass spectrometry data analysisBITS - Overview of sequence databases for mass spectrometry data analysis
BITS - Overview of sequence databases for mass spectrometry data analysis
 
BITS - Search engines for mass spec data
BITS - Search engines for mass spec dataBITS - Search engines for mass spec data
BITS - Search engines for mass spec data
 
BITS - Introduction to proteomics
BITS - Introduction to proteomicsBITS - Introduction to proteomics
BITS - Introduction to proteomics
 
BITS - Introduction to Mass Spec data generation
BITS - Introduction to Mass Spec data generationBITS - Introduction to Mass Spec data generation
BITS - Introduction to Mass Spec data generation
 
Marcs (bio)perl course
Marcs (bio)perl courseMarcs (bio)perl course
Marcs (bio)perl course
 
Basics statistics
Basics statistics Basics statistics
Basics statistics
 
Cytoscape: Integrating biological networks
Cytoscape: Integrating biological networksCytoscape: Integrating biological networks
Cytoscape: Integrating biological networks
 
Cytoscape: Gene coexppression and PPI networks
Cytoscape: Gene coexppression and PPI networksCytoscape: Gene coexppression and PPI networks
Cytoscape: Gene coexppression and PPI networks
 
Bits protein structure
Bits protein structureBits protein structure
Bits protein structure
 

KĂŒrzlich hochgeladen

Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 

KĂŒrzlich hochgeladen (20)

Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 

Managing your data - Introduction to Linux for bioinformatics

  • 1. Managing data Joachim Jacob 8 and 15 November 2013
  • 2. Bioinformatics data Historically, bioinformatics has always used text files to store data. PDB file excerpt Genbank record HMM profile
  • 3. NGS data The NGS machines spit a lot of data, stored in plain text files. These files are multiple gigabytes in size.
  • 4. Tips for managing NGS data 1. When you move the data, do it in its smallest form. → Compress the data. 2. When you unpack the data, leave it where it is. → Symbolic links point to the data in different folders. 3. Provide enough storage for your data. → choose your file system type wisely
  • 5. Compression: tools in Linux And some more exist... http://www.linuxlinks.com/article/20110220091109939/CompressionTools.html
  • 6. Tips Widely used compression tools: ● GNU zip (gzip) ● Block Sorting compression (bzip2) Typically, compression tools work on one file. How to compress directories and their contents?
  • 7. Tar without compression Tar (Tape Archive) is a tool for bundling a set of files or directories into a single archive. The resulting file is called a tar ball. Syntax to create a tarball: $ tar -cf archive.tar file1 file2 Syntax to extract: $ tar -xvf /path/to/archive.tar
  • 8. Compression: a typical case Archiving and compression mostly occur together. The most used formats are tar.gz or tar.bz. These files are the result of two processes. Archiving (tar) Compressing (gzip or bzip2)
  • 11. Compression: on the command line Tar is the tool for creating .tar archives, but it can compress in one go, with the z or j option. Creating a compressed tar archive: $ tar cvfz mytararchive.tar.gz $ tar cvfj mytararchive.tar.bz create Compression technique Decompressing a compressed tar archive $ tar xvfz mytararchive.tar.gz $ tar xvfj mytararchive.tar.bz extract files verbose docs/ docs/
  • 12. De-/compression To compress one or more files: $ gzip [options] file $ bzip2 [options] file To decompress one or more files: $ gunzip [options] file(s) $ bunzip2 [options] file(s)
  • 13. Tips Many compression tools on the command line allow to read compressed files (instead of first unpacking then reading). $ zcat file(s) $ bzcat file(s) Compression is always a balance between time and compression ratio. Gzip is faster, bzip2 compresses harder. If compression is important to you: benchmark!
  • 14. Exercise → a little compression exercise.
  • 15. Symlinks Pay attention. Something very convenient! A symbolic link (or symlink) is a file which points to the location of the linked-to file. You can do anything with the symlink that you can do on the original file. As you move the original file from its location, the symlink is 'dead'. Downloads/ ~ Annotation/ Rice/ Projects/ Butterfly/ Sequences/ alignment.sam
  • 16. Symlinks To create a symlink, move to the folder in where the symlink must be created, and execute ln. ~/Projects $ cd Butterfly ~/Butterfly $ ln -s ../Rice/Sequences/alignment.sam Link_to_alignment.sam Downloads/ ~ Annotation/ Rice/ Projects/ Butterfly/ Sequences/ alignment.sam
  • 17. Symlinks The symlink is created. You can check with ls. To delete a symlink, use unlink. ~/Projects $ cd Butterfly ~/Butterfly $ ln -s ../Rice/Sequences/alignment.sam Link_to_alignment.sam ~/Butterfly $ ls -lh Link_to_alignment.sam lrwxrwxrwx 1 joachim joachim 44 Oct 22 14:47 Link_to_alignment.sam -> ../Sequences/alignment.sam Downloads/ ~ Annotation/ Rice/ Projects/ Sequences/ alignment.sam Butterfly/Link_to_alignment.sam
  • 19. Disks and storage If you dive into bioinformatics, you will have to manage disks and storage. Two types of disks - solid state disks Low capacity, high speed, random writes - spinning hard disks High capacity, 'normal' speed, sequential writes. http://en.wikipedia.org/wiki/Solid-state_drive http://en.wikipedia.org/wiki/Hard_disk
  • 20. A disk is a device Via the terminal, show the disks using $ sudo fdisk -l [sudo] password for joachim: Disk /dev/sda: 13.4 GB, 13408141312 bytes ... Disk /dev/sdb: 3997 MB, 3997171712 bytes ...
  • 21. A disk is divided into partitions A disk can be divided in parts, called partitions. An internal disk which runs an operating system is usually divided in partitions, one for each functions. An external disk is usually not divided in partitions.
  • 22. Check out the disk utility tool
  • 23. The system disk Name of the disk
  • 24. The system disk Name currently highlighted partition
  • 25. The system disk Place in the directory structure where the partition can be accessed
  • 26. An example of an USB disk - Place in the directory structure where the partition can be accessed
  • 27. An example of an USB disk The USB disk is 'mounted' automatically on the directory tree under /media.
  • 28. An example of an USB disk - This is the type of file system on the partition. The partition is said to be formatted in FAT32 (in this case).
  • 29. File system formats By default, many USB flash disks are formatted in FAT32. Other types are NTFS, ext4, ZFS. FAT32 – max 4GB files NTFS – maximum portability (also for use under windows) Ext4 – default file system in Linux, http://en.wikipedia.org/wiki/File_system#File_systems_and_operating_systems
  • 30. An example of an USB disk First unmount the device. Next, choose format the device. 1 2
  • 31. Format disks with disk utility Choose the type of file system you want to be on that device.
  • 32. Format disks with disk utility
  • 33. Format disks with disk utility You don't want to know all the commands that work behind the gnome-disk-utility for you. But if you do: - mount - umount - fdisk - mkfs You can read the man pages and search for guides on the internet if you want to get to know these (out of scope for this course).
  • 34. Checking storage space By default 'disk usage analyzer'.
  • 35. Checking storage space Bonus: K4DirStat. Not installed by default.
  • 36. Checking storage space Bonus: K4DirStat. Not installed by default.
  • 37. K4Dirstat is a KDE package Rehearsal: what is KDE? Bonus: what happens when you install this package on our system?
  • 38. Space left on disks with df To check the storage that is used on the different disks. ~/ $ df -h Filesystem /dev/sda1 udev tmpfs none none /dev/sdb1 ~/ $ df -h . Size 12G 490M 200M 5.0M 498M 3.8G Used Avail Use% Mounted on 5.3G 5.7G 49% / 4.0K 490M 1% /dev 920K 199M 1% /run 0 5.0M 0% /run/lock 76K 498M 1% /run/shm 20M 3.7G 1% /media/test
  • 39. The size of directories To check the size of files or directories. ~/ $ du -sh * 520K bin 281M Compression_exercise 4.0K Desktop 4.0K Documents 5.0M Downloads 4.0K Music 4.0K Pictures 4.0K Public 373M Rice Example 4.0K Templates 4.0K test 17M test.img 114M ugene-1.12.2 4.0K Videos
  • 40. Wildcards on the command line Wildcards are used to describe the names of files/dirs. [] On that position, the character may be one of the characters between [ ], e.g. saniti[sz]ation matches: sanitisation and sanitization ? On that position, any character is allowed. e.g. saniti?ation matches: sanitisation, sanitiration, ... * On that position, any length of string is allowed e.g. s* matches: san, sdd, sanitisation, sam.alignment,...
  • 41. Wildcards on the command line Many tools that require an argument to point to files or directories accept these wildcards. ~/ $ du -sh Do*
  • 42. Wildcards on the command line Many tools that require an argument to point to files or directories accept these wildcards. ~/ $ du -sh Do* 4.0K Documents 20G Downloads
  • 43. Wildcards on the command line Many tools that require an argument to point to files or directories accept these wildcards. ~/ $ ls *.fastq
  • 44. Wildcards on the command line Many tools that require an argument to point to files or directories accept these wildcards. ~/ $ ls *.fastq ERR148552_1.fastq testout.fastq ERR148552_1_prinseq_good_zzwI.fastq ERR148552_2.fastq test.fastq
  • 45. Keywords Compression Archive Symbolic link mounting File system format partition Recursively df du unlink Write in your own words what the terms mean
  • 46. Break