[Bioperl-l] SWISS-PROT writing
Hilmar Lapp
lapp@gnf.org
Tue, 02 Jan 2001 15:45:28 -0800
Kris Boulez wrote:
>
>
> - at line 356 there is
> $mol = $seq->molecule;
> I think this should be $seq->moltype; as ->molecule only looks for
> {'molecule'} which is not set by ->new. Bio::Seq->new only sets
> {'moltype'}.
> We should change the 'protein' of ->moltype to 'PRT' to conform to the
> standard.
moltype() is internal to BioPerl. Whenever a databank defines an attribute
that is synonymous with moltype(), molecule() should be used for that.
So I think the code is correct.
Bio::Seq->new() indeed only sets moltype(), because at this point there is
no databank specificity. molecule() should be set by the parser. If you
want to instantiate a swissprot seq from memory and have it written in
swissprot format, the way we want to go is to have dedicated classes under
Bio::Seq::*. If there is a need for a swissprot-dedicated class, that one
would probably also set molecule() at instantiation time.
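
To illustrate the split between moltype() and molecule(), here is a minimal
sketch of what such a dedicated class might do. The package name
My::Seq::Swiss and the %MOLECULE_FOR mapping are hypothetical and not part of
BioPerl; only the 'protein' to 'PRT' pairing comes from the discussion above.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a swissprot-dedicated class; not BioPerl code.
package My::Seq::Swiss;

# Assumed mapping from the internal moltype to the databank's own token.
my %MOLECULE_FOR = (protein => 'PRT');

sub new {
    my ($class, %args) = @_;
    my $self = bless { moltype => $args{moltype} }, $class;
    # A databank-specific class can set molecule() at instantiation time,
    # so the writer does not depend on a parser having set it.
    $self->{molecule} = $MOLECULE_FOR{ $args{moltype} // '' };
    return $self;
}

sub moltype  { $_[0]{moltype} }    # internal, databank-neutral value
sub molecule { $_[0]{molecule} }   # value as the databank spells it

package main;

my $seq = My::Seq::Swiss->new(moltype => 'protein');
printf "moltype=%s molecule=%s\n", $seq->moltype, $seq->molecule;
```

With something along these lines, $seq->molecule in the writer would already
be defined for sequences created in memory.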
>
> B.T.W. do we want to allow SWISS-PROT to try to write out DNA/RNA
> sequences ?
In my opinion there's no need for that, but others may think differently.
>
> - around line 369 the whole else block should be changed. We should make
> sure we have a division ($div) in the ID part. The previous version of
> the code which is now commented out made a better attempt at this. Looking
> at next_seq() we see why we're not able to read this (the entry name must
> contain an underscore; see section 3.1.1 of the SWISS-PROT manual).
>
> $line =~ /^ID\s+([^\s_]+)_([^\s_]+)\s+([^\s;]+);\s+([^\s;]+);/
>     || $self->throw("swissprot stream with no ID. Not swissprot in my book");
> $name = $1."_".$2;
> $seq->primary_id($1);
> $seq->division($2);
>
If this is the code you're referring to (sorry, I don't have it at hand right
now), it does ensure that there is a division part. I'm probably missing
something.
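
For reference, a small standalone check of the quoted pattern. The helper
parse_id_line() is hypothetical and written only for this sketch, and the
second sample line (without an underscore) is made up to show the failure
case.

```perl
use strict;
use warnings;

# Hypothetical helper: applies the quoted ID-line pattern and returns a
# hash ref on success, or undef if the line does not match.
sub parse_id_line {
    my ($line) = @_;
    my ($id, $div, $class, $mol) =
        $line =~ /^ID\s+([^\s_]+)_([^\s_]+)\s+([^\s;]+);\s+([^\s;]+);/
        or return;
    return { name => "${id}_${div}", primary_id => $id, division => $div };
}

# An entry name with an underscore matches, so a division is always captured.
my $ok  = parse_id_line("ID   CRAM_CRAAB     STANDARD;      PRT;    46 AA.");
# Without the underscore the pattern fails and nothing is returned.
my $bad = parse_id_line("ID   CRAMBIN        STANDARD;      PRT;    46 AA.");

print "division: $ok->{division}\n" if $ok;
print "no underscore, no match\n"   unless $bad;
```

So a line only gets past the pattern if it has both parts of the entry name,
which is why I read it as guaranteeing a division.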
> How standard compliant do we want to be with this? If we want to be very
> strict we should e.g. make sure the 'entry name' (first item on the ID
> line) is not more than 10 characters.
>
> P.S. (very) minor issue: the division we choose 'UNK' for sequences
> which don't have a division set is not in the standard (speclist.txt),
> it only contains UNKP
>
Sure, can (should) be changed.
> Should I try to adapt swiss.pm to the thoughts I (tried to) put out or
> are there major objections ?
>
See above. I'm not sure what we already have in the Bio::Seq::* hierarchy.
If there's no Swiss.pm yet and GenBank/GenPept doesn't fit well, you could
give Bio::Seq::Swiss.pm a start and adapt the parser to instantiate objects
of that class.
Apart from this, Lorenz may wish to comment. He's been our Swissprot
cruncher for a while, but I haven't heard from him for some time. Lorenz,
still out there?
Happy new year to all.
Hilmar
--
-------------------------------------------------------------
Hilmar Lapp email: lapp@gnf.org
GNF, San Diego, Ca. 92121 phone: +1-858-812-1757
-------------------------------------------------------------