[Bioperl-l] Bio/SeqIO/swiss.pm parsing error

Tue Nov 14 04:44:09 UTC 2006

On Nov 13, 2006, at 5:50 PM, James D. White wrote:

> "Erik" <er at xs4all.nl> wrote:
>
>> Hi all,
>>
>> I noticed the parsing is borked with the newest Swiss-Prot files:
>>  UniProt Knowledgebase Release 9 consists of:
>>  UniProtKB/Swiss-Prot Release 51.0 of 31-Oct-2006
>>  UniProtKB/TrEMBL Release 34.0 of 31-Oct-2006
>>
>>
>> I edited my local copy of Bio/SeqIO/swiss.pm to parse the ID lines
>> in swissprot/trembl according to the new specification (see
>> http://expasy.org/sprot/relnotes/sp_news.html).
>>
>> Basically, the change is as follows:
>>  ID   EntryName DataClass; MoleculeType; SequenceLength.
>> is changed to:
>>  ID   EntryName DataClass; SequenceLength.
>>
>>
>>
>> The only change I made was to the regex that matches the ID line, in
>> method next_seq (Bio/SeqIO/swiss.pm):
>>
>> ===============
>>
>>  unless(  m/
>>               ^
>>                  ID              \s+     #
>>                  (\S+)           \s+     #  $1  entryname
>>                  ([^\s;]+);      \s+     #  $2  DataClass
>>                  [0-9]+[ ]AA     \.      #      Sequencelength (capture?)
>>                $
>>            /ox )
>>  {
>>    $self->throw("swissprot stream with no ID. Not swissprot in my book");
>>  }
>>
>> ===============
>>
>>
>
> How about something like the following to recognize both old and new
> formats?
>
> ===============
>
>   unless(  m/
>                ^
>                   ID              \s+           #
>                   (\S+)           \s+           #  $1  entryname
>                   ( (?: [^\s;]+;  \s+ )? )      #  $2  DataClass (including ";\s+")
>                   [0-9]+[ ]AA     \.            #  Sequencelength (capture?)
>                 $
>             /ox )
>   {
>     $self->throw("swissprot stream with no ID. Not swissprot in my book");
>   }
>   # Because $2 now contains a trailing ";\s+" in the new format, it needs to be fixed
>   $DataClass = $2 || 'default DataClass';       # provide default for old file format
>   $DataClass =~ s/;\s+$//;                      # remove trailing ";\s+"
>
> ===============
>
> The code following the unless block should be modified to use the
> appropriate variable names.  This is provided only to show what
> post-match modification is needed.
>
>>
>> I tested this (=entry parsable and SeqIO created) against several
>> hundred Swissprot and Trembl entries.
>>
>> Of course, files in the older format are now broken - it may be better
>> to keep patterns for both the old and new formats and try each in turn
>> (newest first).
>>
>> hth,
>>
>> Erik
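
Erik's "try both, newest first" idea could be sketched roughly as below.
This is only an illustration, not what is in Bio/SeqIO/swiss.pm: the
helper name parse_id_line is hypothetical, and the two sample ID lines
are made up in the style of the old and new layouts.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Bio::SeqIO::swiss): returns
# (entry name, data class) for an ID line, or an empty list on failure.
sub parse_id_line {
    my ($line) = @_;

    # New layout:  ID   EntryName DataClass; SequenceLength.
    if ( $line =~ m/^ID\s+(\S+)\s+([^\s;]+);\s+[0-9]+[ ]AA\.$/ ) {
        return ( $1, $2 );
    }

    # Old layout:  ID   EntryName DataClass; MoleculeType; SequenceLength.
    if ( $line =~ m/^ID\s+(\S+)\s+([^\s;]+);\s+[^\s;]+;\s+[0-9]+[ ]AA\.$/ ) {
        return ( $1, $2 );
    }

    return;
}

# Illustrative ID lines (assumed, not taken from a real release file).
for my $line (
    'ID   CYC_BOVIN               Reviewed;         104 AA.',
    'ID   CYC_BOVIN      STANDARD;      PRT;   104 AA.',
) {
    my ( $name, $class ) = parse_id_line($line);
    print defined $name ? "$name\t$class\n" : "no match: $line\n";
}
```

Trying the new pattern first keeps the common case cheap, at the cost of
maintaining two patterns instead of one.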

This has been fixed in CVS to match both old and new formats, and it
passes all tests so far.  You can try it out if you want.  The regex
captures the data class up to (but not including) the ';', so there is
no need to strip anything from the captured value afterwards.

...
    unless(  m{^
                 ID              \s+     #
                 (\S+)           \s+     #  $1  entryname
                 ([^\s;]+);      \s+     #  $2  DataClass
                 (?:PRT;)?       \s+     #  Molecule Type (optional)
                 [0-9]+[ ]AA     \.      #  Sequencelength (capture?)
                 $
                 }ox ) {
...

The molecule type was always PRT and was a carryover from EMBL format
divisions.
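
For anyone who wants to check the pattern outside of BioPerl, here is a
small standalone sketch.  The wrapper name id_line_fields and the sample
ID lines are assumptions made for illustration; only the pattern itself
is the one quoted above (the "$1"/"$2" notes in the comments are reworded
so nothing gets interpolated under "use warnings", and the /o flag is
dropped since it makes no difference here).

```perl
use strict;
use warnings;

# Hypothetical wrapper around the ID-line pattern quoted above.
# In list context it returns (entry name, data class), or an empty
# list when the line does not match.
sub id_line_fields {
    my ($line) = @_;
    return $line =~ m{^
                 ID              \s+     #
                 (\S+)           \s+     #  capture 1: entry name
                 ([^\s;]+);      \s+     #  capture 2: data class
                 (?:PRT;)?       \s+     #  molecule type (optional)
                 [0-9]+[ ]AA     \.      #  sequence length
                 $
                 }x;
}

# Illustrative ID lines (assumed, not taken from a real release file).
for my $line (
    'ID   CYC_BOVIN      STANDARD;      PRT;   104 AA.',    # old layout
    'ID   CYC_BOVIN               Reviewed;         104 AA.',  # new layout
    'ID   CYC_BOVIN Reviewed; 104 AA.',                     # single spaces
) {
    my @fields = id_line_fields($line);
    printf "%-9s %s\n", ( @fields ? 'match' : 'no match' ), $line;
}
```

The first two lines match and the third does not: with the molecule type
absent, the two \s+ on either side of the optional group each need at
least one whitespace character, so the pattern relies on the column
padding between the data class and the length.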

Chris

Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign