1. Parsing a File with Perl
Regexp, substr and one-liners
Paolo Marcatili - Programmazione 09-10
2. Agenda
Today we will see how to
> Extract information from a file
> Use substr and regular expressions (regexp)
We already know how to use:
> Scalar variables $ and arrays @
> If, for, while, open, print, close…
4. Protein Structures
1st task:
> Open a PDB file
> Apply a symmetry transformation
> Extract data from file header
5. Zinc Finger
2nd task:
> Open a fasta file
> Find all occurrences of Zinc Fingers
(homework?)
7. Rationale
Biological data -> human readable files
If you can read it, Perl can read it as well
*BUT*
It can be tricky
8. Parsing flow-chart
Open the file
For each line{
look for “grammar”
and store data
}
Close file
Use data
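The flow chart above can be sketched in Perl roughly as follows. This is only an illustration: the helper name parse_records, the keyword it looks for, and the three-line in-memory "file" are assumptions made so the example runs on its own.

```perl
use strict;
use warnings;

# Hypothetical helper: collect the second whitespace-separated field of
# every line that starts with a given keyword.
sub parse_records {
    my ($fh, $keyword) = @_;
    my @data;
    while (my $line = <$fh>) {                     # for each line
        chomp $line;
        next unless index($line, $keyword) == 0;   # look for the "grammar"
        my @fields = split ' ', $line;
        push @data, $fields[1];                    # store data
    }
    return @data;
}

# An in-memory "file" (made up) keeps the sketch self-contained.
my $text = "ATOM 4 O\nREMARK hello\nATOM 5 N\n";
open(my $fh, '<', \$text) or die "cannot open in-memory file: $!";   # open the file
my @serials = parse_records($fh, 'ATOM');
close $fh;                                                           # close file
print "serials: @serials\n";                                         # use data
```

With a real file, the open call would take a file name instead of the reference to $text.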
14. PDB
ATOM      4  O   ASP L   1      43.716 -12.235  68.502  1.00 70.05           O
ATOM      5  N   ILE L   2      44.679 -10.569  69.673  1.00 48.19           N
…
COLUMNS    DATA TYPE      FIELD      DEFINITION
-------------------------------------------------------------------------------------
 1 -  6    Record name    "ATOM  "
 7 - 11    Integer        serial     Atom serial number.
13 - 16    Atom           name       Atom name.
17         Character      altLoc     Alternate location indicator.
18 - 20    Residue name   resName    Residue name.
22         Character      chainID    Chain identifier.
23 - 26    Integer        resSeq     Residue sequence number.
27         AChar          iCode      Code for insertion of residues.
31 - 38    Real(8.3)      x          Orthogonal coordinates for X in Angstroms.
39 - 46    Real(8.3)      y          Orthogonal coordinates for Y in Angstroms.
47 - 54    Real(8.3)      z          Orthogonal coordinates for Z in Angstroms.
55 - 80    Other fields (not useful for our purposes)
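Because every field sits in fixed columns, substr is enough to pull a record apart. The sketch below is one way to do it; the helper name parse_atom_line and the choice of fields are assumptions, and the sample line is the first ATOM record above, re-spaced according to the column table.

```perl
use strict;
use warnings;

# Hypothetical helper: slice one ATOM record into named fields.
# substr offsets are 0-based, so column N of the table is offset N-1.
sub parse_atom_line {
    my ($line) = @_;
    my %atom = (
        serial  => substr($line,  6, 5),   # columns  7-11
        name    => substr($line, 12, 4),   # columns 13-16
        resName => substr($line, 17, 3),   # columns 18-20
        chainID => substr($line, 21, 1),   # column  22
        resSeq  => substr($line, 22, 4),   # columns 23-26
        x       => substr($line, 30, 8),   # columns 31-38
        y       => substr($line, 38, 8),   # columns 39-46
        z       => substr($line, 46, 8),   # columns 47-54
    );
    s/^\s+|\s+$//g for values %atom;       # trim the padding spaces
    return \%atom;
}

my $line = 'ATOM      4  O   ASP L   1      43.716 -12.235  68.502  1.00 70.05           O';
my $atom = parse_atom_line($line);
print "$atom->{resName} $atom->{chainID}$atom->{resSeq}: $atom->{x} $atom->{y} $atom->{z}\n";
```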
16. Rotation
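#!/usr/bin/perl
use strict;
use warnings;

# Read IG.pdb and write a rotated copy to IG_rotated.pdb.
open(my $in,  '<', 'IG.pdb')         or die "cannot open IG.pdb: $!";
open(my $out, '>', 'IG_rotated.pdb') or die "cannot open IG_rotated.pdb: $!";
while (my $line = <$in>) {
    if (substr($line, 0, 4) eq "ATOM") {
        # x, y, z live in columns 31-38, 39-46, 47-54 (0-based offsets 30, 38, 46)
        my $X = substr($line, 30, 8);
        my $Y = substr($line, 38, 8);
        my $Z = substr($line, 46, 8);
        # (x, y, z) -> (z, x, y): a cyclic permutation of the axes,
        # i.e. a 120-degree rotation about the (1,1,1) diagonal
        print $out substr($line, 0, 30) . $Z . $X . $Y . substr($line, 54);
    }
    else {
        # every other record is copied unchanged
        print $out $line;
    }
}
close $in;
close $out;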
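Swapping text columns works for a permutation of the axes, but a transformation that changes signs or mixes coordinates needs the numeric values. The following sketch shows one possible approach for a 90-degree rotation about the z axis; the helper name rotate_z90 and the choice of rotation are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: rotate one ATOM record by 90 degrees about the z axis,
# (x, y, z) -> (-y, x, z), and write the result back into the fixed columns.
sub rotate_z90 {
    my ($line) = @_;
    return $line unless substr($line, 0, 4) eq 'ATOM';   # leave other records alone
    # "+ 0" turns the space-padded text fields into numbers
    my ($x, $y, $z) = map { substr($line, $_, 8) + 0 } (30, 38, 46);
    # %8.3f reproduces the Real(8.3) layout of columns 31-54;
    # 4-argument substr replaces those 24 characters in place
    substr($line, 30, 24, sprintf('%8.3f%8.3f%8.3f', -$y, $x, $z));
    return $line;
}

my $line    = 'ATOM      4  O   ASP L   1      43.716 -12.235  68.502  1.00 70.05           O';
my $rotated = rotate_z90($line);
print "$rotated\n";
```

Formatting with %8.3f keeps the line length unchanged, so the columns after the coordinates stay where other tools expect them.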
19. Regular Expressions
PDB files have a “fixed” structure.
What if we want to do something like
“check for a valid email address”…
1. There must be some letters or numbers
2. Then there must be a @
3. Then some more letters
4. Then a dot followed by something (e.g. .com)
paolo.marcatili@gmail.com is good
paolo.marcatili@.com is not good
20. Regular Expressions
$line =~ m/^[a-z0-9._]+@[^.]+\.[a-z]{2,}$/
WHAAAT???
This means:
Check if $line starts with some letters, digits, dots or underscores, then @, then
some characters that are not dots, then a dot, then at least two letters
….
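As a quick check, the pattern can be wrapped in a small function and tried on the two addresses from the previous slide. The helper name looks_like_email is an assumption, and the pattern is the simplified teaching version, not a complete e-mail validator.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the simplified pattern from the slide.
# The @ is escaped so Perl never tries to interpolate an array there.
sub looks_like_email {
    my ($line) = @_;
    return $line =~ m/^[a-z0-9._]+\@[^.]+\.[a-z]{2,}$/ ? 1 : 0;
}

for my $candidate ('paolo.marcatili@gmail.com', 'paolo.marcatili@.com') {
    my $verdict = looks_like_email($candidate) ? 'is good' : 'is not good';
    print "$candidate $verdict\n";
}
```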
Ok, let’s start from something simpler :)
22. Regular Expressions
$line =~ m/^ATOM/
Line starts with ATOM
$line =~ m/^ATOM\s+/
Line starts with ATOM, followed by one or more whitespace characters
$line =~ m/^ATOM\s+[-0-9]+/
Line starts with ATOM, then whitespace, then one or more characters that are
digits or -
$line =~ m/^ATOM\s+-?[0-9]+/
Line starts with ATOM, then whitespace, then an optional
minus, then one or more digits
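Putting parentheses around part of a pattern also captures what matched, which is how a regexp turns from a test into a parser. A minimal sketch, assuming a helper called atom_serial and a few made-up sample lines:

```perl
use strict;
use warnings;

# Hypothetical helper: return the atom serial number, or undef if the
# line is not an ATOM record. The parentheses capture the number into $1.
sub atom_serial {
    my ($line) = @_;
    return $line =~ m/^ATOM\s+(-?[0-9]+)/ ? $1 : undef;
}

my @lines = (
    'ATOM      4  O   ASP L   1',
    'REMARK this is not an atom',
    'ATOM     -7  N   ILE L   2',   # made-up negative serial, to exercise the optional minus
);
for my $line (@lines) {
    my $serial = atom_serial($line);
    print defined $serial ? "serial $serial\n" : "no match\n";
}
```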
25. PDB Header
We want to find the %id for the L and H chains
$pidL = $1 if ($line =~ m/REMARK SUMMARY-ID_GLOB_L:([.0-9]+)/);
$pidH = $1 if ($line =~ m/REMARK SUMMARY-ID_GLOB_H:([.0-9]+)/);
ONELINER!!
cat IG.pdb | perl -ne 'print "$1\n"
if ($_ =~ m/^REMARK SUMMARY-ID_GLOB_([LH]:[.0-9]+)/);'
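The two assignments can also be folded into a single pattern that captures the chain letter and the value together. The sketch below runs on an invented header: the numbers 0.85 and 0.72 and the exact REMARK layout are assumptions, and the real IG.pdb may look different.

```perl
use strict;
use warnings;

# Invented header text, for illustration only.
my $header = <<'END';
REMARK SUMMARY-ID_GLOB_L:0.85
REMARK SUMMARY-ID_GLOB_H:0.72
ATOM      4  O   ASP L   1
END

my %pid;    # chain letter => %id value
open(my $fh, '<', \$header) or die "cannot open in-memory header: $!";
while (my $line = <$fh>) {
    # ([LH]) captures the chain, ([.0-9]+) captures the whole number
    if ($line =~ m/^REMARK SUMMARY-ID_GLOB_([LH]):([.0-9]+)/) {
        $pid{$1} = $2;
    }
}
close $fh;
print "L: $pid{L}  H: $pid{H}\n";
```

A hash keyed by chain avoids one variable per chain and extends naturally if more chains appear.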
27. Zinc Finger
Zinc fingers are a large superfamily of protein
domains that can bind to DNA.
A zinc finger consists of two antiparallel β
strands, and an α helix.
The zinc ion is crucial for the stability of this
domain type - in the absence of the metal
ion the domain unfolds as it is too small to
have a hydrophobic core.
The consensus sequence of a single finger is:
C-X{2-4}-C-X{3}-[LIVMFYWC]-X{8}-H-X{3}-H
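One possible translation of this consensus into a Perl regexp is sketched below: X becomes ".", a range such as {2-4} becomes {2,4}, and the residue class stays a character class. The helper name find_zinc_fingers and the test sequence are assumptions; the sequence is made up so that it has two residues between the cysteines, as the strict consensus requires.

```perl
use strict;
use warnings;

# The consensus as a regexp: X{n} becomes .{n}, X{2-4} becomes .{2,4}.
# /i accepts lower-case sequences as well.
my $zf_motif = qr/C.{2,4}C.{3}[LIVMFYWC].{8}H.{3}H/i;

# Hypothetical helper: return every non-overlapping match in a sequence.
sub find_zinc_fingers {
    my ($sequence) = @_;
    my @hits = $sequence =~ /($zf_motif)/g;
    return @hits;
}

# Made-up sequence containing one motif.
my $sequence = 'weofjpihouwefghoicadcvgnfglapglifhtylhyuiui';
my @hits = find_zinc_fingers($sequence);
print "$_\n" for @hits;
```

The /g modifier in list context returns all matches at once; overlapping motifs would need a lookahead, which this sketch does not attempt.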
29. Homework
Find all occurrences of the ZF motif in
zincfinger.fasta
Put them in the file ZF_motif.fasta
e.g.
Weofjpihouwefghoicacvgnfglapglifhtylhyuiui
cacvgnfglapglifhtylh