[Bioperl-l] Bio/SeqIO/swiss.pm parsing error
James D. White
jdw at ou.edu
Mon Nov 13 23:50:15 UTC 2006
"Erik" <er at xs4all.nl> wrote:
>Hi all,
>
>I noticed the parsing is borked with the newest Swiss-Prot files:
> UniProt Knowledgebase Release 9 consists of:
> UniProtKB/Swiss-Prot Release 51.0 of 31-Oct-2006
> UniProtKB/TrEMBL Release 34.0 of 31-Oct-2006
>
>
>I edited my local copy of Bio/SeqIO/swiss.pm to parse the ID lines
>in swissprot/trembl according to the new specification (see
>http://expasy.org/sprot/relnotes/sp_news.html).
>
>Basically, the change is as follows:
> ID EntryName DataClass; MoleculeType; SequenceLength.
>is changed to:
> ID EntryName DataClass; SequenceLength.
>
>
>
>The change I made was only in the regex capturing the entry name:
>method next_seq (Bio/SeqIO/swiss.pm) :
>
>===============
>
> unless( m/
> ^
> ID \s+ #
> (\S+) \s+ # $1 entryname
> ([^\s;]+); \s+ # $2 DataClass
> [0-9]+[ ]AA \. # Sequencelength (capture?)
> $
> /ox )
> {
> $self->throw("swissprot stream with no ID. Not swissprot in my book");
> }
>
>===============
>
>
How about something like the following to recognize both old and new formats?
===============
unless( m/
        ^
        ID \s+                      #
        (\S+) \s+                   # $1 entryname
        ( (?: [^\s;]+; \s+ )? )     # $2 DataClass (including ";\s+")
        [0-9]+[ ]AA \.              # Sequencelength (capture?)
        $
        /ox )
{
    $self->throw("swissprot stream with no ID. Not swissprot in my book");
}
# Because $2 now contains a trailing ";\s+" in the new format, it needs to be fixed
my $DataClass = $2 || 'default DataClass'; # provide default for old file format
$DataClass =~ s/;\s+$//; # remove trailing ";\s+"
===============
The code trailing the unless block should be modified to use the appropriate
variable names. This is provided only to show what post-match modification is
needed.
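
Going by the format change quoted above, the field dropped in the new
release is MoleculeType, so another option is to keep DataClass mandatory and
make only MoleculeType optional. The following is a minimal standalone sketch
of that idea, not a patch against Bio/SeqIO/swiss.pm: the sub name
parse_id_line and the hash keys are made up for illustration, and the two
sample lines are only written to follow the old and new layouts.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Bio::SeqIO::swiss: parse one ID line and
# return a hash reference, or undef if the line matches neither layout.
sub parse_id_line {
    my ($line) = @_;
    return undef unless $line =~ m/
        ^
        ID \s+
        (\S+) \s+                   # entry name
        ([^\s;]+) ; \s+             # data class
        (?: ([^\s;]+) ; \s+ )?      # molecule type, old format only
        ([0-9]+) [ ] AA \.          # sequence length
        \s* \z
    /x;
    return {
        entry_name    => $1,
        data_class    => $2,
        molecule_type => $3,        # undef when the field is absent
        seq_length    => $4,
    };
}

# Illustrative lines in the old and the new layout.
for my $line (
    'ID   CYC_BOVIN      STANDARD;      PRT;   104 AA.',
    'ID   CYC_BOVIN               Reviewed;         104 AA.',
) {
    my $id = parse_id_line($line) or die "not an ID line: $line";
    my $mol = defined $id->{molecule_type} ? $id->{molecule_type} : 'n/a';
    print "$id->{entry_name} $id->{data_class} $mol $id->{seq_length}\n";
}
```

With this shape no post-match cleanup of the DataClass capture is needed,
because the ";" and the following whitespace sit outside the capturing group.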
>
>I tested this (=entry parsable and SeqIO created) against several
>hundred Swissprot and Trembl entries.
>
>Of course, files with the older format are now broken - it may be better
>to leave old and new format, and try both (newest first).
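
The "try both, newest first" idea could look roughly like the sketch below.
It is only an illustration under assumptions: @ID_PATTERNS and match_id_line
are hypothetical names, not existing BioPerl code, and the two patterns are
written from the ID line layouts quoted earlier in this message.

```perl
use strict;
use warnings;

# Hypothetical table of ID-line patterns, newest layout first.
my @ID_PATTERNS = (
    [ new => qr/^ID\s+(\S+)\s+([^\s;]+);\s+([0-9]+)[ ]AA\.\s*\z/ ],
    [ old => qr/^ID\s+(\S+)\s+([^\s;]+);\s+[^\s;]+;\s+([0-9]+)[ ]AA\.\s*\z/ ],
);

# Try each pattern in order and return (format, entry name, data class,
# sequence length) for the first one that matches, or an empty list.
sub match_id_line {
    my ($line) = @_;
    for my $entry (@ID_PATTERNS) {
        my ($format, $re) = @$entry;
        if (my ($name, $class, $length) = $line =~ $re) {
            return ($format, $name, $class, $length);
        }
    }
    return;
}

my @hit = match_id_line('ID   CYC_BOVIN      STANDARD;      PRT;   104 AA.');
print @hit ? "matched $hit[0] layout\n" : "no ID line\n";
```

Keeping the patterns in an ordered list means a future layout change only
needs a new entry at the top, while older files keep parsing.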
>
>hth,
>
>Erik