[Bioperl-l] SWISS-PROT writing
Kris Boulez
krbou@pgsgent.be
Wed, 3 Jan 2001 08:29:43 +0100
Quoting Hilmar Lapp (lapp@gnf.org):
> Kris Boulez wrote:
> >
> >
> > - at line 356 there is
> > $mol = $seq->molecule;
> > I think this should be $seq->moltype; as ->molecule only looks for
> > {'molecule'} which is not set by ->new. Bio::Seq->new only sets
> > {'moltype'}.
> > We should change the 'protein' of ->moltype to 'PRT' to conform to the
> > standard.
>
> moltype() is internal to BioPerl. Whenever there is an attribute synonymous
> to moltype() but defined by a databank, molecule() should be used for that.
> So the code is correct I think.
>
Then documentation for Bio::Seq->molecule() should be extended a bit. It
now reads
molecule
Title : molecule
Usage : $obj->molecule($newval)
Function:
Returns : type of molecule (DNA, mRNA)
Args : newvalue (optional)
> Bio::Seq->new() indeed only sets moltype(), because at this point there is
> no databank specificity. molecule() should be set by the parser. If you
> want to instantiate a swissprot seq from memory and have it written in
> swissprot format, the way we want to go is have dedicated classes under
> Bio::Seq::*. If there is need for a swissprot-dedicated class, that one
> probably would also set molecule() at instantiation time.
>
> >
> > B.T.W. do we want to allow SWISS-PROT to try to write out DNA/RNA
> > sequences ?
>
> In my opinion there's no need for that, but others may think differently.
>
> >
> > - around line 369 the whole else block should be changed. We should make
> > sure we have a division ($div) in the ID part. The previous version of
> > the code which is now commented out did a better try at this. Looking at
> > next_seq() we why we're not able to read this (entry name must contain
> > an underscore section 3.1.1 of the SWISS-PROT manual).
> >
> > $line =~ /^ID\s+([^\s_]+)_([^\s_]+)\s+([^\s;]+);\s+([^\s;]+);/
> > || $self->throw("swissprot stream with no ID. Not swissprot in my
> > book");
> > $name = $1."_".$2;
> > $seq->primary_id($1);
> > $seq->division($2);
> >
>
> If this is the code you're referring to (sorry, don't have at hand right
> now), it does ensure that there is a division part. I'm probably missing
> something.
>
Sorry I wasn't clear on this one obviously. The code I pasted is from
next_seq(). What I was referring to is the code in write_seq(). In there
we do not enforce that there is a division part (I think we should at
least check if $seq->display_id() returns an underscore in a reasonable
position). The code reads
} else {
#$temp_line = sprintf ("%10s STANDARD; %3s; %d AA.",
# $seq->primary_id()."_".$div,$mol,$len);
# Reconstructing the ID relies heavily upon the input source
# having
# been in a format that is parsed as this routine expects it --
# that is,
# by this module itself. This is bad, I think, and immediately
# breaks
# if e.g. the Bio::DB::GenPept module is used as input.
# Hence, switch to display_id(); _every_ sequence is supposed to
# have
# this. HL 2000/09/03
$temp_line = sprintf ("%10s STANDARD; %3s; %d AA.",
$seq->display_id(), $mol, $len);
}
> > How standard compliant do we want to be with this. If we want to be very
> > strict we should e.g. make sure the 'entry name' (first item on the ID
> > line) is not more then 10 characters.
> >
> > P.S. (very) minor issue: the division we choose 'UNK' for sequences
> > which don't have a division set is not in the standard (speclist.txt),
> > it only contains UNKP
> >
>
> Sure, can (should) be changed.
>
> > Should I try to adopt swiss.pm to the thoughts I (tried to) put out or
> > are there major objections ?
> >
>
> See above. I'm not sure what we already have in the Bio::Seq::* hierarchy.
> If there's no Swiss.pm yet and GenBank/GenPept doesn't fit well, you could
> give Bio::Seq::Swiss.pm a start and adopt the parser to instantiate objects
> of that class.
>
The only thing we have now is Bio::Seq::LargeSeq en LargePrimarySeq. Do
you plan on having a Bio::Seq::* class for every (complex) sequence type ?
Kris,