[Bioperl-l] BLAST to FeaturePair
Ewan Birney
birney@ebi.ac.uk
Mon, 31 Jul 2000 16:18:50 +0100 (GMT)
On Mon, 31 Jul 2000, L.Pollak wrote:
> Hi Hilmar, Hi Ewan!
>
> > > 2) Bio::Tools::BPlite::HSP has several implementation short-comings from my
> > > perspective: a) $feature->seqname() does not return a seq id but the full
> > > BLAST description line, incl. the '>' (could be fixed easily),
> >
> > This could be done client side: Keep BPLite just "representing BLAST"
> > without too much magic. But stripping '>' seems sane.
>
> i can do that. what should be stored in $feature->seqname ?
> (but as there is no full sequence object, i can't store ids and accs,
> right?)
You can store whatever id BLAST has for this sequence in seqname. In my
book that is whatever >(\S+) matches on the description line.
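Something along these lines would do, as a rough sketch only. The helper
name seqname_from_description and the sample description line are made up
for illustration and are not part of BPlite:

```perl
use strict;
use warnings;

# Hypothetical helper (not part of BPlite): return the first
# whitespace-delimited token after an optional leading '>'.
sub seqname_from_description {
    my ($desc) = @_;
    return $desc =~ /^>?\s*(\S+)/ ? $1 : undef;
}

# Made-up description line, for illustration only.
print seqname_from_description('>seq1 example description'), "\n";
```

The leading '>' is optional in this pattern so the same helper still works
if the parser has already stripped it.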
>
> > > b) the
> > > lengths of the sequences are not stored (would require additional parsing
> > > code),
>
> ok, i can parse the length, but how should i store it ?
> (the feature does not contain a method for that, but i could
> attach a sequence object that has correct id, acc and sequence
> length but no real sequence in it...)
>
Ugh. I don't like this at all. I think this will just confuse people
(perhaps people disagree). Ideas?
> > > c) properties of the alignment are stored as 'new' tags, instead of
> > > through the tag system. This prevents them from easy de/serialization
> > > through the gff_string()/_from_gff_string() methods. (BTW does the string
> > > returned by $bplite_hsp->homologySeq() make sense to anyone?)
> >
> > Talking to Lorenz - I'm not sure about this.
>
> what properties do you mean? score, bits, P value, matching, positives
> and such things? if so then hilmar is right, they are not stored through
> the tag system, but it should be no problem for me to add this!
> (would both query and sbjct feature have the same tag values in this
> case?)
>
We can possibly add this into the tag system, but I *really* want to
discourage using the tag system heavily. If we *do* use the tag system,
then we must also have additional methods in the object, such that
$hsp->p_value
becomes $hsp->each_tag_value('p_value');
We are *not* going down the AUTOLOAD route here either (or at least people
are going to have to drag me backwards to allow this).
I am ok on having $specificobject->specific_attribute chaining to
$genericobject->each_tag_value('specific_attribute') and
$genericobject->add_tag_value('specific_attribute',$value) --- though
believe me this is because someone forced me into this (see below). This
is a good balance between data-oriented views of the data and real
objects in my view.
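To make the chaining concrete, here is a minimal sketch. The packages
My::GenericFeature and My::HSP are hypothetical stand-ins, not the real
Bioperl classes; only the method names each_tag_value and add_tag_value
are taken from the discussion above, and the single-value behaviour of
p_value is my assumption:

```perl
use strict;
use warnings;

# Hypothetical stand-in for a generic feature with a tag system.
package My::GenericFeature;

sub new {
    my ($class) = @_;
    return bless { tags => {} }, $class;
}

sub add_tag_value {
    my ($self, $tag, $value) = @_;
    push @{ $self->{tags}{$tag} }, $value;
    return;
}

sub each_tag_value {
    my ($self, $tag) = @_;
    return @{ $self->{tags}{$tag} || [] };
}

# Hypothetical HSP class: an explicit p_value method, no AUTOLOAD.
package My::HSP;
our @ISA = ('My::GenericFeature');

sub p_value {
    my ($self, $value) = @_;
    if (defined $value) {
        $self->{tags}{p_value} = [];    # assumption: keep a single value
        $self->add_tag_value('p_value', $value);
    }
    my ($p) = $self->each_tag_value('p_value');
    return $p;
}

package main;

my $hsp = My::HSP->new;
$hsp->p_value(1e-30);
print $hsp->p_value, "\n";
```

The point is that the specific method is written out by hand, so the
object's interface stays documented and checkable, while the value still
lives in the tag store for anything that wants the generic view.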
The problem here is that we have validly different objects inheriting
from a common interface. If we want to serialize objects, then we should
use a standard serialisation system, like Freeze-Thaw. If this particular
object has a sensible data representation, in, say XML, then we should
store things in XML (for example Bio::Variation stuff). If we want the
object to be represented in a particular way in EMBL/GenBank, then write
a specific to_FTHelper method (yes, we need a way to get these back out sensibly
as well).
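As a sketch of what I mean by a standard serialisation system: the
example below uses Storable's freeze and thaw rather than the FreezeThaw
module, simply because Storable ships with Perl; the idea is the same.
The class name My::SimpleHSP and its fields are made up for illustration:

```perl
use strict;
use warnings;
use Storable qw(freeze thaw);

# Hypothetical object; any blessed hash round-trips the same way.
my $hsp = bless { p_value => 1e-30, bits => 120 }, 'My::SimpleHSP';

my $frozen = freeze($hsp);    # opaque serialized string
my $copy   = thaw($frozen);   # new, independent object of the same class

print ref($copy), " ", $copy->{bits}, "\n";
```

Nothing here needs the object to expose its state through tags; the
serializer handles whatever is inside the object.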
We cannot tackle the whole "I want a completely undefined object to be
accurately defined as a hash-of-arrays on particular tags" for every
object. It is just impossible to ensure that we can get everything through
the tag system.
Phew - Rant over.
> why does the homologySeq make no sense?
> (i just adopted it from the original BPlite...)
>
> and could someone please explain to me what's the purpose of those
> gff_string methods?
>
GFF format is a useful programmatic interface to features, in particular
computationally defined features (eg, exons, repeats). GFF format has had
a whole bunch of silly (in my view) extensions, including a generic
tag-value system added to it, which people then abuse, trying to be able
to pass the entire knowledge of a particular feature across in GFF format.
http://www.sanger.ac.uk/Software/GFF
I used to vehemently hate GFF, and enough people twisted my arm to make
Bioperl sequence features GFF compatible, hence this whole tag system which
returns arrays.
Now I think GFF has a valuable role to play, in particular for large
datasets - see Ian Holmes' work - and for computational infrastructure for
a bunch of things. I think it is very useful.
However, it sucks for transferring all the information about one object,
and should not be used for this.
>
> thank you,
> Lorenz
>
-----------------------------------------------------------------
Ewan Birney. Mobile: +44 (0)7970 151230, Work: +44 1223 494420
<birney@ebi.ac.uk>.
-----------------------------------------------------------------