Parsing a File with Perl

                Regexp, substr and oneliners




Bioinformatics master course, ‘11/’12   Paolo Marcatili
Agenda
 Today we will see how to
 • Extract information from a file
 • Use substr and regexp

 We already know how to use:
 • Scalar variables $ and arrays @
 • If, for, while, open, print, close…

Task Today




Protein Structures
 1st task:
 • Open a PDB file
 • Apply a symmetry transformation
 • Extract data from file header




Zinc Finger
 2nd task:
 • Open a FASTA file
 • Find all occurrences of zinc fingers

     (homework?)




Parsing




Rationale
 Biological data -> human readable files

 If you can read it, Perl can read it as well
 *BUT*
 It can be tricky




Bioinformatics master course, ‘11/’12
 7                                      Paolo Marcatili
Parsing flow-chart
 Open the file
 For each line{
     look for “grammar”
     and store data
 }
 Close file
 Use data
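
The flow chart above can be sketched in Perl roughly as follows. This is a minimal sketch, not the course's reference solution: the input is an in-memory string instead of a real file so that the example runs on its own, the HEADER line is made up, and the "grammar" is simply "the line starts with ATOM".

```perl
use strict;
use warnings;

# Hypothetical input: in a real script this would be a file on disk.
my $text = <<'END';
HEADER    MADE-UP EXAMPLE
ATOM      4  O   ASP L   1      43.716 -12.235  68.502  1.00 70.05           O
ATOM      5  N   ILE L   2      44.679 -10.569  69.673  1.00 48.19           N
END

# Open the "file" (a reference to a string behaves like a file name here).
open(my $fh, '<', \$text) or die "cannot open input: $!";

my @atom_lines;                            # where we store the data
while (my $line = <$fh>) {                 # for each line
    chomp $line;
    if (substr($line, 0, 4) eq 'ATOM') {   # look for the "grammar"
        push @atom_lines, $line;           # and store data
    }
}
close $fh;                                 # close file

# Use data
print scalar(@atom_lines), " ATOM records found\n";
```

To read a real file, replace `\$text` with the file name.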




Substr




Substr
 substr($data, start, length)
 returns a substring from the expression supplied as first
    argument.




Substr
 substr($data, start, length)
   $data    your string
   start    where to start, counting from 0
   length   how many characters to take; you can omit this
            (you will then extract up to the end of the string)




Substr
 substr($data, start, length)
 Examples:

 my $data = "il mattino ha l'oro in bocca";
 print substr($data, 0) . "\n";     # prints the whole string
 print substr($data, 3, 5) . "\n";  # prints matti
 print substr($data, 23) . "\n";    # prints bocca
 print substr($data, -5) . "\n";    # prints bocca




PDB rotation




PDB
   ATOM      4  O   ASP L   1      43.716 -12.235  68.502  1.00 70.05           O
   ATOM      5  N   ILE L   2      44.679 -10.569  69.673  1.00 48.19           N
   …

   COLUMNS   DATA TYPE      FIELD     DEFINITION
   ---------------------------------------------------------------------------
    1 -  6   Record name    "ATOM  "
    7 - 11   Integer        serial    Atom serial number.
   13 - 16   Atom           name      Atom name.
   17        Character      altLoc    Alternate location indicator.
   18 - 20   Residue name   resName   Residue name.
   22        Character      chainID   Chain identifier.
   23 - 26   Integer        resSeq    Residue sequence number.
   27        AChar          iCode     Code for insertion of residues.
   31 - 38   Real(8.3)      x         Orthogonal coordinates for X in Angstroms
   39 - 46   Real(8.3)      y         Orthogonal coordinates for Y in Angstroms
   47 - 54   Real(8.3)      z         Orthogonal coordinates for Z in Angstroms
   55 - 80   Other fields (not useful for our purposes)
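
As an illustration of how the column table maps onto substr, here is a small sketch. The helper `field` is hypothetical (it is not part of the slides); it converts the 1-based, inclusive column numbers of the table into the 0-based start and length that substr expects, and trims the padding spaces.

```perl
use strict;
use warnings;

# One ATOM record laid out in the fixed columns of the table above.
my $line = 'ATOM      4  O   ASP L   1      43.716 -12.235  68.502  1.00 70.05           O';

# Hypothetical helper: columns are 1-based and inclusive, substr is 0-based.
sub field {
    my ($record, $first_col, $last_col) = @_;
    my $value = substr($record, $first_col - 1, $last_col - $first_col + 1);
    $value =~ s/^\s+|\s+$//g;     # trim the padding spaces
    return $value;
}

my %atom = (
    serial  => field($line,  7, 11),
    name    => field($line, 13, 16),
    resName => field($line, 18, 20),
    chainID => field($line, 22, 22),
    resSeq  => field($line, 23, 26),
    x       => field($line, 31, 38),
    y       => field($line, 39, 46),
    z       => field($line, 47, 54),
);

print "$atom{name} of $atom{resName}$atom{resSeq} (chain $atom{chainID}): ",
      "$atom{x} $atom{y} $atom{z}\n";
```

The only arithmetic to remember is start = first column - 1 and length = last column - first column + 1.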


Symmetry
   Cyclic permutation of the coordinates:
   new X = old Z
   new Y = old X
   new Z = old Y

   [Figure: coordinate axes X and Y]

Rotation
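   #!/usr/bin/perl
   use strict;
   use warnings;

   # Cyclically permute the coordinates of every ATOM record:
   # new X = old Z, new Y = old X, new Z = old Y.
   open(my $in,  '<', 'IG.pdb')         or die "cannot open IG.pdb: $!";
   open(my $out, '>', 'IG_rotated.pdb') or die "cannot open IG_rotated.pdb: $!";
   while (my $line = <$in>) {
       if (substr($line, 0, 4) eq 'ATOM') {
           my $X = substr($line, 30, 8);   # columns 31-38
           my $Y = substr($line, 38, 8);   # columns 39-46
           my $Z = substr($line, 46, 8);   # columns 47-54
           print $out substr($line, 0, 30) . $Z . $X . $Y . substr($line, 54);
       }
       else {
           print $out $line;               # copy every other record unchanged
       }
   }
   close $in;
   close $out or die "cannot close IG_rotated.pdb: $!";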




RegExp




Regular Expressions
 PDB files have a fixed structure.

 What if we want to do something less rigid, like
 “check for a valid email address”?
 1. There must be some letters or numbers
 2. Then there must be an @
 3. Then more letters
 4. Then a dot followed by something (e.g. .com)

 paolo.marcatili@gmail.com is good
 paolo.marcatili@.com is not good


Regular Expressions
 $line =~ m/^[a-z0-9._]+@[^.]+\.[a-z]{2,}$/

 WHAAAT???

 This means:
 Check if $line starts with some letters, digits, dots or underscores, then has an @, then
 some non-dot characters, then a dot, then at least two letters up to the end of the line

 …
 OK, let's start from something simpler :)
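
Before moving on, the email check can be tried on the two addresses used as examples. This is only a sketch: the pattern `^[a-z0-9._]+\@[^.]+\.[a-z]{2,}$` used here is an assumed cleaned-up spelling of the one on the slide, the helper `is_valid_email` is hypothetical, and real email validation is considerably more involved.

```perl
use strict;
use warnings;

# Letters, digits, dots or underscores, then @, then non-dot characters,
# then a literal dot (\.), then at least two letters up to the end.
# \@ matches a literal @; escaping it guards against array interpolation.
my $email_re = qr/^[a-z0-9._]+\@[^.]+\.[a-z]{2,}$/;

# Hypothetical helper: returns 1 if the address matches, 0 otherwise.
sub is_valid_email {
    my ($address) = @_;
    return $address =~ $email_re ? 1 : 0;
}

# Single quotes keep Perl from treating @gmail as an array.
for my $address ('paolo.marcatili@gmail.com', 'paolo.marcatili@.com') {
    my $verdict = is_valid_email($address) ? 'good' : 'not good';
    print "$address is $verdict\n";
}
```

The second address fails because at least one non-dot character is required between the @ and the dot.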




Regular Expressions
 $line =~ m/^ATOM/
 The line starts with ATOM

 $line =~ m/^ATOM\s+/
 The line starts with ATOM, then there are one or more whitespace characters

 $line =~ m/^ATOM\s+[-0-9]+/
 The line starts with ATOM, then there is whitespace, then one or more characters
       that are digits or -
 $line =~ m/^ATOM\s+-?[0-9]+/
 The line starts with ATOM, then there is whitespace, then there can be a
       minus, then one or more digits
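
To see how each step narrows the match, the four patterns can be run against a few lines. This is a sketch: the patterns are written with `\s+` for whitespace, the second and third test lines are made up, and the helper `matching_patterns` is hypothetical.

```perl
use strict;
use warnings;

my @lines = (
    'ATOM      4  O   ASP L   1      43.716 -12.235  68.502  1.00 70.05           O',
    'ATOMIC',          # made-up line: starts with ATOM but is not a record
    'REMARK test',     # made-up line: does not start with ATOM
);

# The four patterns, from the loosest to the strictest.
my @patterns = (
    [ 'starts with ATOM',                      qr/^ATOM/ ],
    [ 'then some whitespace',                  qr/^ATOM\s+/ ],
    [ 'then digits or minus signs',            qr/^ATOM\s+[-0-9]+/ ],
    [ 'then an optional minus and digits',     qr/^ATOM\s+-?[0-9]+/ ],
);

# Hypothetical helper: how many of the patterns does a line satisfy?
sub matching_patterns {
    my ($line) = @_;
    return scalar grep { $line =~ $_->[1] } @patterns;
}

for my $line (@lines) {
    for my $p (@patterns) {
        my ($label, $re) = @$p;
        printf "%-8.8s %-36s %s\n", $line, $label,
               ($line =~ $re ? 'match' : 'no match');
    }
}
```

Note that `^ATOM` alone also accepts the line ATOMIC; requiring whitespace after it is what rules that line out.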




PDB Header
    We want to find the %id for the L and H chains


    $pidL = $1 if ($line =~ m/REMARK SUMMARY-ID_GLOB_L:([.0-9]+)/);
    $pidH = $1 if ($line =~ m/REMARK SUMMARY-ID_GLOB_H:([.0-9]+)/);

    ONELINER!!


    cat IG.pdb | perl -ne 'print "$1\n" if ($_ =~ m/^REMARK SUMMARY-ID_GLOB_([LH]:[.0-9]+)/);'
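
The same idea can be written as a short script that uses two capture groups and stores the values in a hash. This is a sketch under assumptions: the tag names come from the slide, but the header lines and their numeric values below are made up for illustration, and the pattern expects one or more digits or dots after the colon.

```perl
use strict;
use warnings;

# Hypothetical header lines: the numbers are invented.
my @header = (
    'REMARK SUMMARY-ID_GLOB_L:96.5',
    'REMARK SUMMARY-ID_GLOB_H:88',
    'REMARK something else',
);

my %pid;    # chain letter => %id value
for my $line (@header) {
    # $1 is the chain (L or H), $2 the number that follows the colon.
    if ($line =~ m/^REMARK SUMMARY-ID_GLOB_([LH]):([.0-9]+)/) {
        $pid{$1} = $2;
    }
}

for my $chain (sort keys %pid) {
    print "chain $chain: $pid{$chain}\n";
}
```

The parentheses decide what ends up in `$1` and `$2`; the `+` after the character class is what lets the capture hold the whole number rather than a single character.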




Zinc Finger




Zinc Finger
   Zinc fingers are a large superfamily of protein
      domains that can bind to DNA.

   A zinc finger consists of two antiparallel β
      strands and an α helix.
   The zinc ion is crucial for the stability of this
      domain type: in the absence of the metal
      ion the domain unfolds, as it is too small to
      have a hydrophobic core.
   The consensus sequence of a single finger is:

   C-X{2-4}-C-X{3}-[LIVMFYWC]-X{8}-H-X{3}-H
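
One possible way to turn the consensus into a Perl regular expression is sketched below. It assumes that X stands for any residue (written `.` in the pattern) and that matching should ignore case; the helper `find_zinc_fingers` is hypothetical, and the test string is the example sequence used in the homework.

```perl
use strict;
use warnings;

# The consensus translated piece by piece: "X, two to four times" becomes
# .{2,4}, "X three times" becomes .{3}, and the bracketed set of residues
# stays a character class. The /i flag makes the match case-insensitive.
my $zf_re = qr/C.{2,4}C.{3}[LIVMFYWC].{8}H.{3}H/i;

# Hypothetical helper returning every non-overlapping match in a sequence.
sub find_zinc_fingers {
    my ($sequence) = @_;
    my @hits;
    while ($sequence =~ /($zf_re)/g) {
        push @hits, $1;
    }
    return @hits;
}

my $sequence = 'weofjpihouwefghoicalcvgnfglapglifhtylhyuiui';
print "$_\n" for find_zinc_fingers($sequence);
```

Reading a FASTA file and writing the hits to an output file is left out on purpose; only the pattern itself is shown here.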


Homework
   Find all occurrences of the ZF motif in zincfinger.fasta

   Write them to the file ZF_motif.fasta

   e.g. the sequence
   weofjpihouwefghoicalcvgnfglapglifhtylhyuiui

   contains the motif
   calcvgnfglapglifhtylh



