[Bioperl-l] Bioperl-l Digest, Vol 74, Issue 25

Paola Bisignano paola.bisignano at gmail.com
Tue Jun 30 09:12:49 UTC 2009


Hi,
I need a little help parsing a file. I have searched the BioPerl
modules, but there are a lot of them and I don't know where to
start. I find modules for all the databases and for different web
sites, but not for my favorite, PDBsum. So far I have parsed a lot of
things on my own, even though I am new to Perl, but now I need help:
I have to parse a FASTA output file produced by aligning sequences,
and I need to extract the aligned sequences only for the PDB entries
in my list.


My FASTA file looks like this:

Query: /ebi/research/thornton/tmp/sas307986/seq.fasta
  1>>>Sequence 3e7e:A - 333 aa
Library: /ebi/research/thornton/www/databases/html/pdbsum/data/pdblib
17840403 residues in 79353 sequences

       opt      E()
< 20   286     0:===
  22     1     0:=          one = represents 135 library sequences
  24     1     0:=
  26     0     2:*
  28    21    18:*
  30    36   109:*
  32   237   421:== *
  34   956  1140:========*
  36  1924  2342:===============  *
  38  3591  3871:=========================== *
  40  4904  5400:=====================================  *
  42  6750  6600:================================================*=
  44  7145  7281:=====================================================*
  46  8047  7416:======================================================*=====
.........

>>2np8:A                                                  (159 aa)
 initn: 125 init1:  72 opt: 136  Z-score: 168.6  bits: 38.5 E(): 0.011
Smith-Waterman score: 136; 26.0% identity (57.1% similar) in 154 aa
overlap (59-204:13-153)

               10        20        30        40        50        60
Sequen NFIVGNPWDDKLIFKLLSGLSKPVSSYPNTFEWQCKLPAIKPKTEFQLGSKLVYVHHLLG
                                                                 ::
2np8:A                                               QWALEDFEIGRPLG
                                                             10

               70          80        90         100        110
Sequen EGAFAQVYEATQNKQKFVL--KVQKPANPWEFYIGTQLMER--LKPSMQH-MFMKFYSAH
       .: :..:: : ....::.:  ::   :.  .  .  :: ..  ..  ..:  ....:.
2np8:A KGKFGNVYLAREKQSKFILALKVLFKAQLEKAGVEHQLRREVEIQSHLRHPNILRLYG--
           20        30        40        50        60        70

         120         130       140       150       160       170
Sequen LFQNGS--VLVGELYSYGTLLNAINLYKNTPEKVMPQGLVISFAMRMLYMIEQVHDCEII
        :....   :. :    ::.   ..  ..  :.      . ..  ..   .   :. ..:
2np8:A YFHDATRVYLILEYAPLGTVYRELQKLSKFDEQR-----TATYITELANALSYCHSKRVI
             80        90       100            110       120

           180       190        200       210       220       230
Sequen HGDIKPDNFILGNGFLEQSAG-LALIDLGQSIDMKLFPKGTIFTAKCETSGFQCVEMLSN
       : ::::.:..::      ::: : . :.: :.
2np8:A HRDIKPENLLLG------SAGELKIADFGWSVHAPSSR
       130             140       150

            240       250       260       270       280       290
Sequen KPWNYQIDYFGVAATVYCMLFGTYMKVKNEGGECKPEGLFRRLPHLDMWNEFFHVMLNIP

            300       310       320       330
Sequen DCHHLPSLDLLRQKLKKVFQQHYTNKIRALRNRLIVLLLEC

>>2ojg:A                                                  (337 aa)
 initn:  85 init1:  53 opt: 140  Z-score: 168.1  bits: 39.5 E(): 0.012
Smith-Waterman score: 140; 20.3% identity (56.2% similar) in 217 aa
overlap (46-252:1-204)

               10        20        30        40        50        60
Sequen NFIVGNPWDDKLIFKLLSGLSKPVSSYPNTFEWQCKLPAIKPKTEFQLGSKLVYVHHLLG
                                                    :..: . . . .. :
2ojg:A                                              FDVGPRYTNLSYI-G
                                                            10

               70        80        90        100       110
Sequen EGAFAQVYEATQNKQKFVLKVQKPANPWEFYIGTQ-LMERLKPSMQHMFMKFYSAHLFQN
       :::...:  : .: .:  . ..:  .:.:     :  ....:     ....:   ...
2ojg:A EGAYGMVCSAYDNVNKVRVAIKK-ISPFEHQTYCQRTLREIK-----ILLRFRHENIIGI
           20        30         40        50             60

     120              130       140       150       160       170
Sequen GSVL-------VGELYSYGTLLNAINLYKNTPEKVMPQGLVISFAMRMLYMIEQVHDCEI
       ....       . ..:    :... .:::    . . .  .  : ...:  .. .:. ..
2ojg:A NDIIRAPTIEQMKDVYIVQDLMET-DLYKLLKTQHLSNDHICYFLYQILRGLKYIHSANV
       70        80        90        100       110       120

            180       190       200        210       220        230
Sequen IHGDIKPDNFILGNGFLEQSAGLALIDLGQS-IDMKLFPKGTIFTAKCETSGFQCVE-ML
       .: :.::.:..:..     .  : . :.: . .      .  ..:    :  ..  : ::
2ojg:A LHRDLKPSNLLLNT-----TCDLKICDFGLARVADPDHDHTGFLTEYVATRWYRAPEIML
       130       140            150       160       170       180

              240       250       260       270       280       290
Sequen SNKPWNYQIDYFGVAATVYCMLFGTYMKVKNEGGECKPEGLFRRLPHLDMWNEFFHVMLN
       ..: .. .:: ..:.  .  ::
2ojg:A NSKGYTKSIDIWSVGCILAEMLSNRPIFPGKHYLDQLNHILGILGSPSQEDLNCIINLKA
            190       200       210       220       230       240

              300       310       320       330
Sequen IPDCHHLPSLDLLRQKLKKVFQQHYTNKIRALRNRLIVLLLEC

2ojg:A RNYLLSLPHKNKVPWNRLFPNADSKALDLLDKMLTFNPHKRIEVEQALAHPYLEQYYDPS
            250       260       270       280       290       300

2ojg:A DEPIAEAPFKFELDDLPKEKLKELIFEETARFQPG
            310       320       330

>>2oji:A                                                  (344 aa)
 initn:  85 init1:  53 opt: 140  Z-score: 168.0  bits: 39.5 E(): 0.012
Smith-Waterman score: 140; 20.3% identity (56.2% similar) in 217 aa
overlap (46-252:5-208)

               10        20        30        40        50        60
Sequen NFIVGNPWDDKLIFKLLSGLSKPVSSYPNTFEWQCKLPAIKPKTEFQLGSKLVYVHHLLG
                                                    :..: . . . .. :
2oji:A                                          RGQVFDVGPRYTNLSYI-G
                                                        10

               70        80        90        100       110
Sequen EGAFAQVYEATQNKQKFVLKVQKPANPWEFYIGTQ-LMERLKPSMQHMFMKFYSAHLFQN
       :::...:  : .: .:  . ..:  .:.:     :  ....:     ....:   ...
2oji:A EGAYGMVCSAYDNVNKVRVAIKK-ISPFEHQTYCQRTLREIK-----ILLRFRHENIIGI
       20        30        40         50             60        70

     120              130       140       150       160       170
Sequen GSVL-------VGELYSYGTLLNAINLYKNTPEKVMPQGLVISFAMRMLYMIEQVHDCEI
       ....       . ..:    :... .:::    . . .  .  : ...:  .. .:. ..
2oji:A NDIIRAPTIEQMKDVYIVQDLMET-DLYKLLKTQHLSNDHICYFLYQILRGLKYIHSANV
             80        90        100       110       120       130

            180       190       200        210       220        230
Sequen IHGDIKPDNFILGNGFLEQSAGLALIDLGQS-IDMKLFPKGTIFTAKCETSGFQCVE-ML
       .: :.::.:..:..     .  : . :.: . .      .  ..:    :  ..  : ::
2oji:A LHRDLKPSNLLLNT-----TCDLKICDFGLARVADPDHDHTGFLTEYVATRWYRAPEIML
             140            150       160       170       180

              240       250       260       270       280       290
Sequen SNKPWNYQIDYFGVAATVYCMLFGTYMKVKNEGGECKPEGLFRRLPHLDMWNEFFHVMLN
       ..: .. .:: ..:.  .  ::
2oji:A NSKGYTKSIDIWSVGCILAEMLSNRPIFPGKHYLDQLNHILGILGSPSQEDLNCIINLKA
        190       200       210       220       230       240

              300       310       320       330
Sequen IPDCHHLPSLDLLRQKLKKVFQQHYTNKIRALRNRLIVLLLEC

2oji:A RNYLLSLPHKNKVPWNRLFPNADSKALDLLDKMLTFNPHKRIEVEQALAHPYLEQYYDPS
        250       260       270       280       290       300

2oji:A DEPIAEAPFKFDMELDDLPKEKLKELIFEETARFQPGY
        310       320       330       340

.......
This is only part of the file. What if I want, for example, only two
of these alignments? Are there modules to parse this? I have tried to
parse it with regexes, but without results :-(
If anyone has suggestions for modules or anything else, I'll be very
happy to learn.
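
To make the goal concrete, here is a minimal sketch of the extraction described above. It is written in Python purely for illustration and is not a BioPerl module. The function name parse_alignments and the constant LABEL_WIDTH are hypothetical, and the sketch assumes the layout seen in the excerpt: each hit record starts with a ">>id" line, query rows are labelled "Sequen", hit rows are labelled with the first six characters of the hit id, and residues start in column 8. It keeps only the columns where both rows carry a character, so it also returns the flanking context that FASTA prints around the scored overlap.

```python
import re

# Assumption: a 6-character row label plus one space, as in the excerpt.
LABEL_WIDTH = 7


def parse_alignments(text, wanted):
    """Return {hit_id: (query_aln, hit_aln)} for the hit ids listed in `wanted`."""
    wanted = set(wanted)
    results = {}
    current = None    # id of the wanted hit record being read, or None
    query_row = None  # most recent "Sequen" row of the current block
    for line in text.splitlines():
        # A hit record starts with ">>id"; ">>>" (the query header) is excluded.
        header = re.match(r">>(?!>)(\S+)", line)
        if header:
            hit_id = header.group(1)
            current = hit_id if hit_id in wanted else None
            if current:
                results[current] = ["", ""]
            query_row = None
            continue
        if current is None:
            continue
        if not line.strip():
            # An empty line ends the block; an unpaired query row is dropped.
            query_row = None
        elif line.startswith("Sequen "):
            query_row = line[LABEL_WIDTH:]
        elif line.startswith(current[:6] + " ") and query_row is not None:
            hit_row = line[LABEL_WIDTH:]
            width = max(len(query_row), len(hit_row))
            # Keep only the columns where both rows have a residue or a gap.
            for q, h in zip(query_row.ljust(width), hit_row.ljust(width)):
                if q != " " and h != " ":
                    results[current][0] += q
                    results[current][1] += h
            query_row = None
    return {hit_id: tuple(rows) for hit_id, rows in results.items()}
```

Called on the full text of the output file with a list such as ["2np8:A", "2ojg:A"], this would return one pair of gapped strings (query, hit) per requested entry. Rows of a hit that have no query row directly above them, such as the trailing 2ojg:A rows in the excerpt, are skipped.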
thanks
Paola


