[Bioperl-l] Parsing a FASTA file (Was: Bioperl-l Digest, Vol 74, Issue 25)

Mark A. Jensen maj at fortinbras.us
Wed Jul 1 01:41:16 UTC 2009


Hi Paola, 

You want to try Bio::SearchIO, I think. It's not quite clear what you 
want to do, but here's an example of what you can do: 

Get all high-scoring pairs (HSPs, the mini-alignments) involving
the database sequence called "2ojg:A":

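 use strict;
 use warnings;
 use Bio::SearchIO;
 
 # 'yourfile.fasta' is the FASTA search report (the program's output),
 # not a file of FASTA-format sequences
 my $io = Bio::SearchIO->new(-format=>'fasta', -file=>'yourfile.fasta');
 my $result = $io->next_result   # first query in the report
     or die "No results found in report\n";
 my @desired_hsps;

 # keep the HSPs whose database (subject) sequence id matches 2ojg:A
 while ( my $hit = $result->next_hit ) {
   push @desired_hsps, grep { $_->subject->seq_id =~ /2ojg:A/ } $hit->hsps;
 }
 
 # now all your desired hsps are in the array @desired_hsps;
 # you can get Bio::SimpleAlign objects from them all, for example:
 my @aligns = map { $_->get_aln } @desired_hsps;
 #...and lots of other things...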
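
If the goal is every alignment for the PDB entries in a list, a hash
lookup on the hit name may be simpler than a regexp. What follows is only
a minimal sketch, not tested against your file: hsps_for_ids is a made-up
helper name (it is not a BioPerl method), and it assumes the parser
reports hit names in the form "2ojg:A", as they appear after ">>" in the
report.

```perl
use strict;
use warnings;

# hsps_for_ids is a hypothetical helper, not part of BioPerl.
# It collects the HSPs of every hit whose name is in the wanted list.
# $result is anything with next_hit(); each hit needs name() and hsps().
sub hsps_for_ids {
    my ($result, @ids) = @_;
    my %want = map { $_ => 1 } @ids;
    my @hsps;
    while ( my $hit = $result->next_hit ) {
        push @hsps, $hit->hsps if $want{ $hit->name };
    }
    return @hsps;
}

# Run as: perl pick_hits.pl yourfile.fasta 2np8:A 2ojg:A
# (the script name is only an example)
if (@ARGV) {
    require Bio::SearchIO;    # loaded only when a report file is given
    my ($file, @ids) = @ARGV;
    my $io = Bio::SearchIO->new(-format => 'fasta', -file => $file);
    while ( my $result = $io->next_result ) {
        my @hsps = hsps_for_ids($result, @ids);
        printf "%s: %d HSP(s) kept\n", $result->query_name, scalar @hsps;
    }
}
```

An exact hash lookup also avoids accidental partial matches, which an
unanchored pattern such as /2ojg:A/ would allow.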

Look at http://www.bioperl.org/wiki/HOWTO:SearchIO#Using_SearchIO
and http://www.bioperl.org/wiki/HOWTO:SearchIO#Using_the_methods 
for a nice introduction to the Bio::SearchIO system by its authors. They 
use a blast output as an example, but everything applies to fasta output 
as well.
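
As one illustration of those methods, the aligned strings themselves can
be read from each HSP with query_string, homology_string and hit_string.
The sketch below prints them in blocks, roughly the way the report lays
them out. It is an untested outline: format_hsp is a made-up helper name,
not a BioPerl method, and the default block width of 60 is an arbitrary
choice.

```perl
use strict;
use warnings;

# format_hsp is a hypothetical helper, not part of BioPerl.
# It lays out one HSP as blocks of three lines: query, match line, hit.
sub format_hsp {
    my ($hsp, $width) = @_;
    $width ||= 60;
    my @rows = ($hsp->query_string, $hsp->homology_string, $hsp->hit_string);
    my $len  = length $rows[0];
    my @blocks;
    for (my $i = 0; $i < $len; $i += $width) {
        # guard against a row that is shorter than the query row
        push @blocks,
            join "\n", map { $i < length($_) ? substr($_, $i, $width) : '' } @rows;
    }
    return join("\n\n", @blocks) . "\n";
}

# Run as: perl show_hsps.pl yourfile.fasta
# (the script name is only an example)
if (@ARGV) {
    require Bio::SearchIO;    # loaded only when a report file is given
    my $io = Bio::SearchIO->new(-format => 'fasta', -file => $ARGV[0]);
    while ( my $result = $io->next_result ) {
        while ( my $hit = $result->next_hit ) {
            while ( my $hsp = $hit->next_hsp ) {
                printf ">>%s  E() = %s\n", $hit->name, $hsp->evalue;
                print format_hsp($hsp), "\n";
            }
        }
    }
}
```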

You didn't waste your time writing regexps, by the way. For a Perl
student, that kind of work is like money in the bank.

cheers, 
Mark
      

----- Original Message ----- 
From: "Paola Bisignano" <paola.bisignano at gmail.com>
To: <bioperl-l at lists.open-bio.org>
Sent: Tuesday, June 30, 2009 5:12 AM
Subject: Re: [Bioperl-l] Bioperl-l Digest, Vol 74, Issue 25


> Hi,
> I need a little help parsing a file. I searched the BioPerl modules,
> but there are a lot of them and I don't know where to start. I find
> modules for all kinds of databases and web sites, but not for my
> favorite, PDBsum, so I parsed a lot of things on my own, even though I
> am new to Perl. Now I'm asking for help, because I need to parse a
> FASTA file that results from aligned sequences. I need to extract the
> aligned sequences only for the PDB entries in my list.
> 
> 
> My FASTA file looks like this:
> 
> Query: /ebi/research/thornton/tmp/sas307986/seq.fasta
>  1>>>Sequence 3e7e:A - 333 aa
> Library: /ebi/research/thornton/www/databases/html/pdbsum/data/pdblib
> 17840403 residues in 79353 sequences
> 
>       opt      E()
> < 20   286     0:===
>  22     1     0:=          one = represents 135 library sequences
>  24     1     0:=
>  26     0     2:*
>  28    21    18:*
>  30    36   109:*
>  32   237   421:== *
>  34   956  1140:========*
>  36  1924  2342:===============  *
>  38  3591  3871:=========================== *
>  40  4904  5400:=====================================  *
>  42  6750  6600:================================================*=
>  44  7145  7281:=====================================================*
>  46  8047  7416:======================================================*=====
> .........
> 
>>>2np8:A                                                  (159 aa)
> initn: 125 init1:  72 opt: 136  Z-score: 168.6  bits: 38.5 E(): 0.011
> Smith-Waterman score: 136; 26.0% identity (57.1% similar) in 154 aa
> overlap (59-204:13-153)
> 
>               10        20        30        40        50        60
> Sequen NFIVGNPWDDKLIFKLLSGLSKPVSSYPNTFEWQCKLPAIKPKTEFQLGSKLVYVHHLLG
>                                                                 ::
> 2np8:A                                               QWALEDFEIGRPLG
>                                                             10
> 
>               70          80        90         100        110
> Sequen EGAFAQVYEATQNKQKFVL--KVQKPANPWEFYIGTQLMER--LKPSMQH-MFMKFYSAH
>       .: :..:: : ....::.:  ::   :.  .  .  :: ..  ..  ..:  ....:.
> 2np8:A KGKFGNVYLAREKQSKFILALKVLFKAQLEKAGVEHQLRREVEIQSHLRHPNILRLYG--
>           20        30        40        50        60        70
> 
>         120         130       140       150       160       170
> Sequen LFQNGS--VLVGELYSYGTLLNAINLYKNTPEKVMPQGLVISFAMRMLYMIEQVHDCEII
>        :....   :. :    ::.   ..  ..  :.      . ..  ..   .   :. ..:
> 2np8:A YFHDATRVYLILEYAPLGTVYRELQKLSKFDEQR-----TATYITELANALSYCHSKRVI
>             80        90       100            110       120
> 
>           180       190        200       210       220       230
> Sequen HGDIKPDNFILGNGFLEQSAG-LALIDLGQSIDMKLFPKGTIFTAKCETSGFQCVEMLSN
>       : ::::.:..::      ::: : . :.: :.
> 2np8:A HRDIKPENLLLG------SAGELKIADFGWSVHAPSSR
>       130             140       150
> 
>            240       250       260       270       280       290
> Sequen KPWNYQIDYFGVAATVYCMLFGTYMKVKNEGGECKPEGLFRRLPHLDMWNEFFHVMLNIP
> 
>            300       310       320       330
> Sequen DCHHLPSLDLLRQKLKKVFQQHYTNKIRALRNRLIVLLLEC
> 
>>>2ojg:A                                                  (337 aa)
> initn:  85 init1:  53 opt: 140  Z-score: 168.1  bits: 39.5 E(): 0.012
> Smith-Waterman score: 140; 20.3% identity (56.2% similar) in 217 aa
> overlap (46-252:1-204)
> 
>               10        20        30        40        50        60
> Sequen NFIVGNPWDDKLIFKLLSGLSKPVSSYPNTFEWQCKLPAIKPKTEFQLGSKLVYVHHLLG
>                                                    :..: . . . .. :
> 2ojg:A                                              FDVGPRYTNLSYI-G
>                                                            10
> 
>               70        80        90        100       110
> Sequen EGAFAQVYEATQNKQKFVLKVQKPANPWEFYIGTQ-LMERLKPSMQHMFMKFYSAHLFQN
>       :::...:  : .: .:  . ..:  .:.:     :  ....:     ....:   ...
> 2ojg:A EGAYGMVCSAYDNVNKVRVAIKK-ISPFEHQTYCQRTLREIK-----ILLRFRHENIIGI
>           20        30         40        50             60
> 
>     120              130       140       150       160       170
> Sequen GSVL-------VGELYSYGTLLNAINLYKNTPEKVMPQGLVISFAMRMLYMIEQVHDCEI
>       ....       . ..:    :... .:::    . . .  .  : ...:  .. .:. ..
> 2ojg:A NDIIRAPTIEQMKDVYIVQDLMET-DLYKLLKTQHLSNDHICYFLYQILRGLKYIHSANV
>       70        80        90        100       110       120
> 
>            180       190       200        210       220        230
> Sequen IHGDIKPDNFILGNGFLEQSAGLALIDLGQS-IDMKLFPKGTIFTAKCETSGFQCVE-ML
>       .: :.::.:..:..     .  : . :.: . .      .  ..:    :  ..  : ::
> 2ojg:A LHRDLKPSNLLLNT-----TCDLKICDFGLARVADPDHDHTGFLTEYVATRWYRAPEIML
>       130       140            150       160       170       180
> 
>              240       250       260       270       280       290
> Sequen SNKPWNYQIDYFGVAATVYCMLFGTYMKVKNEGGECKPEGLFRRLPHLDMWNEFFHVMLN
>       ..: .. .:: ..:.  .  ::
> 2ojg:A NSKGYTKSIDIWSVGCILAEMLSNRPIFPGKHYLDQLNHILGILGSPSQEDLNCIINLKA
>            190       200       210       220       230       240
> 
>              300       310       320       330
> Sequen IPDCHHLPSLDLLRQKLKKVFQQHYTNKIRALRNRLIVLLLEC
> 
> 2ojg:A RNYLLSLPHKNKVPWNRLFPNADSKALDLLDKMLTFNPHKRIEVEQALAHPYLEQYYDPS
>            250       260       270       280       290       300
> 
> 2ojg:A DEPIAEAPFKFELDDLPKEKLKELIFEETARFQPG
>            310       320       330
> 
>>>2oji:A                                                  (344 aa)
> initn:  85 init1:  53 opt: 140  Z-score: 168.0  bits: 39.5 E(): 0.012
> Smith-Waterman score: 140; 20.3% identity (56.2% similar) in 217 aa
> overlap (46-252:5-208)
> 
>               10        20        30        40        50        60
> Sequen NFIVGNPWDDKLIFKLLSGLSKPVSSYPNTFEWQCKLPAIKPKTEFQLGSKLVYVHHLLG
>                                                    :..: . . . .. :
> 2oji:A                                          RGQVFDVGPRYTNLSYI-G
>                                                        10
> 
>               70        80        90        100       110
> Sequen EGAFAQVYEATQNKQKFVLKVQKPANPWEFYIGTQ-LMERLKPSMQHMFMKFYSAHLFQN
>       :::...:  : .: .:  . ..:  .:.:     :  ....:     ....:   ...
> 2oji:A EGAYGMVCSAYDNVNKVRVAIKK-ISPFEHQTYCQRTLREIK-----ILLRFRHENIIGI
>       20        30        40         50             60        70
> 
>     120              130       140       150       160       170
> Sequen GSVL-------VGELYSYGTLLNAINLYKNTPEKVMPQGLVISFAMRMLYMIEQVHDCEI
>       ....       . ..:    :... .:::    . . .  .  : ...:  .. .:. ..
> 2oji:A NDIIRAPTIEQMKDVYIVQDLMET-DLYKLLKTQHLSNDHICYFLYQILRGLKYIHSANV
>             80        90        100       110       120       130
> 
>            180       190       200        210       220        230
> Sequen IHGDIKPDNFILGNGFLEQSAGLALIDLGQS-IDMKLFPKGTIFTAKCETSGFQCVE-ML
>       .: :.::.:..:..     .  : . :.: . .      .  ..:    :  ..  : ::
> 2oji:A LHRDLKPSNLLLNT-----TCDLKICDFGLARVADPDHDHTGFLTEYVATRWYRAPEIML
>             140            150       160       170       180
> 
>              240       250       260       270       280       290
> Sequen SNKPWNYQIDYFGVAATVYCMLFGTYMKVKNEGGECKPEGLFRRLPHLDMWNEFFHVMLN
>       ..: .. .:: ..:.  .  ::
> 2oji:A NSKGYTKSIDIWSVGCILAEMLSNRPIFPGKHYLDQLNHILGILGSPSQEDLNCIINLKA
>        190       200       210       220       230       240
> 
>              300       310       320       330
> Sequen IPDCHHLPSLDLLRQKLKKVFQQHYTNKIRALRNRLIVLLLEC
> 
> 2oji:A RNYLLSLPHKNKVPWNRLFPNADSKALDLLDKMLTFNPHKRIEVEQALAHPYLEQYYDPS
>        250       260       270       280       290       300
> 
> 2oji:A DEPIAEAPFKFDMELDDLPKEKLKELIFEETARFQPGY
>        310       320       330       340
> 
> .......
> I show only part of the file. What if I want, for example, only those
> two alignments? Are there modules to parse this? I've tried to parse
> it with regexes, but without results :-(
> If anyone has suggestions for modules or anything else, I'll be very
> happy to learn.
> Thanks,
> Paola


