[Bioperl-l] Use of Bio namespace
Keith James
kdj@sanger.ac.uk
11 Oct 2000 11:13:26 +0100
>>>>> "Ewan" == Ewan Birney <birney@ebi.ac.uk> writes:
>> Stuff I wanted was:
>>
>> Non-fussy but fairly complete EMBL parsing
>>
Ewan> I think we are closer to that, but there are still issues:
Ewan> (a) having this CDS_span with sub features. Perhaps bad
Ewan> (b) completely not handling fuzziness.
Ewan> (I *hate* fuzziness. But I guess some people use it.).
It's a pain, but we need to parse stuff back from EMBL where other
people have used this system.
Ewan> Knowing what you do keith, I would expect that in addition
Ewan> your parsers can possibly make more "assumptions" about how
Ewan> to interpret the genbank file as you know precisely how you
Ewan> would want to use it. (? am I wrong)
As we don't have to deal with splicing in a lot of our genomes, we
don't suffer so much from the introns/exons/splice-sites as top-level
features problem. This EMBL parser deals in generic features because
the interpretation depends on what particular convention the
originators of the sequences you want to parse were using that day (a
case of the beauty of standards being that there are so many to choose
from). The assumptions are made at the script level at the moment,
e.g. when building gene objects.
>> Terse, but intuitive manipulation of feature qualifiers in
>> scripts
>>
Ewan> I don't like this part of Bioperl 100% either. What is your
Ewan> suggestion?
I've been using various equivalents to the bioperl 'tags' methods, but
with the ability to regex tag names. I also AUTOLOAD a method for each
tag/qualifier, which returns the first value or all values depending on
scalar/list context (keeping a list of valid AUTOLOADable methods).
These can get called a lot, so I need to fix it so that the methods get
created and installed in the module's namespace the first time round;
subsequent calls will then find a 'real' one and not go looking.
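To make the install-on-first-call idea concrete, here is a minimal
sketch. It is written in Python rather than Perl, so __getattr__ stands
in for AUTOLOAD and an all_values flag stands in for scalar/list
context. The Feature class, VALID_TAGS and tags_matching are
hypothetical names for illustration only, not bioperl API.

```python
import re


class Feature:
    """Hypothetical feature holding qualifiers as tag -> list of values."""

    # Tags allowed to become accessor methods (the 'valid AUTOLOADable' list).
    VALID_TAGS = {"gene", "product", "note"}

    def __init__(self, qualifiers):
        self.qualifiers = qualifiers

    def tags_matching(self, pattern):
        """Return the tag names that match a regular expression."""
        return [tag for tag in self.qualifiers if re.search(pattern, tag)]

    def __getattr__(self, name):
        # Only reached when normal lookup fails, much like Perl's AUTOLOAD.
        if name not in type(self).VALID_TAGS:
            raise AttributeError(name)

        def accessor(self, all_values=False):
            # First value by default, every value when all_values is set.
            values = self.qualifiers.get(name, [])
            if all_values:
                return list(values)
            return values[0] if values else None

        accessor.__name__ = name
        # Install on the class so later calls find a 'real' method
        # and never come back through __getattr__.
        setattr(type(self), name, accessor)
        return getattr(self, name)
```

With this in place, the first call to feature.gene() goes through the
fallback and creates the method; every later call on any Feature finds
it by ordinary lookup.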
>> Features with & without sequence
Ewan> That is handled ok in theory, just small changes required.
>> Clone, trim, reverse-complement sequences with all the features
>> attached
Ewan> We could do this. I have held off on it because it can lead
Ewan> to serious complications for complex feature compliant
Ewan> objects (cloneability of features). I can now see a "way
Ewan> through".
I have this working as alpha. Fuzzies are a pain here though! I'm not
sure that it all works because I haven't got the tests together
yet. Much gnashing of teeth here.
>> Fuzzy ranges (parsed from EMBL, supported in other operations)
Ewan> This could be a bug-bear. What is your object model here?
I stored a fuzzy_start and fuzzy_end for each sequence range, in
addition to start and end. These are the 'outermost' start/end
coordinates, with the fixed ones inside. (Sometimes there are no fixed
coordinates, though).
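A rough sketch of that range model, in Python for illustration. The
FuzzyRange class, its validation rules and the reverse_complement
method are assumptions of the sketch, as is the use of 1-based
inclusive coordinates; None marks a missing fixed coordinate.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FuzzyRange:
    """Fixed start/end nested inside the outermost fuzzy start/end."""

    start: Optional[int]   # fixed start, or None if there is none
    end: Optional[int]     # fixed end, or None if there is none
    fuzzy_start: int       # outermost possible start
    fuzzy_end: int         # outermost possible end

    def __post_init__(self):
        if self.fuzzy_start > self.fuzzy_end:
            raise ValueError("fuzzy_start must not exceed fuzzy_end")
        for fixed in (self.start, self.end):
            if fixed is not None and not (
                self.fuzzy_start <= fixed <= self.fuzzy_end
            ):
                raise ValueError("fixed coordinates must lie inside fuzzy ones")
        if self.start is not None and self.end is not None:
            if self.start > self.end:
                raise ValueError("start must not exceed end")

    def is_fuzzy(self):
        """True when the fixed and outermost coordinates differ."""
        return self.start != self.fuzzy_start or self.end != self.fuzzy_end

    def reverse_complement(self, seq_length):
        """Map the range onto the opposite strand of a sequence."""

        def flip(pos):
            return None if pos is None else seq_length - pos + 1

        # Start and end swap roles on the opposite strand.
        return FuzzyRange(
            start=flip(self.end),
            end=flip(self.start),
            fuzzy_start=flip(self.fuzzy_end),
            fuzzy_end=flip(self.fuzzy_start),
        )
```

Keeping the fuzzy pair alongside the fixed pair means operations such
as reverse-complementing can transform all four coordinates together,
including the case with no fixed coordinates at all.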
>> Low memory Blast parsing
Ewan> We have this now (BPLite)
>> Fasta search output parsing
Ewan> We'd love to get this in...
I re-wrote this recently along the lines of the Blast parser, in case
anyone had seriously large results files. I think it could be one of
the easier things to integrate. But isn't there already one in
bioperl (Bio::Search::Result::Fasta, Bio::Search::Hit::Fasta etc.)?
I'm sure I stole ideas from it! However, mine uses the -m 10 output
from Fasta, which is a bit more stable between versions of the
program, but not very human-readable.
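As an illustration of why the -m 10 layout suits low-memory parsing,
here is a small Python sketch that streams hits one at a time. It
assumes the general shape of that output ('>>>' for the query header
and end marker, '>>' for each hit, '; key: value' lines for fields).
The SAMPLE text is hand-written and heavily simplified, not real
program output, and iter_hits is a hypothetical name.

```python
SAMPLE = """\
>>>query1, 120 aa vs demo library
; pg_name: fasta
>>hitA example protein A
; fa_initn: 52
; fa_opt: 69
; fa_expect: 0.62
>query1 ..
; sq_len: 120
>hitA ..
; sq_len: 300
>>hitB example protein B
; fa_initn: 40
; fa_opt: 45
; fa_expect: 3.1
>>><<<
"""


def iter_hits(lines):
    """Yield one dict per hit, holding only the current hit in memory."""
    hit = None
    collecting = False  # True while reading a hit's own score fields
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">>>"):
            # Query header or end marker: closes any open hit.
            if hit is not None:
                yield hit
            hit, collecting = None, False
        elif line.startswith(">>"):
            # A new hit header closes the previous hit.
            if hit is not None:
                yield hit
            parts = line[2:].split(None, 1)
            hit = {"id": parts[0] if parts else "", "fields": {}}
            collecting = True
        elif line.startswith(">"):
            # Alignment sub-sections follow; this sketch skips them.
            collecting = False
        elif line.startswith("; ") and collecting:
            key, _, value = line[2:].partition(":")
            hit["fields"][key.strip()] = value.strip()
    if hit is not None:
        yield hit


for record in iter_hits(SAMPLE.splitlines()):
    print(record["id"], record["fields"].get("fa_expect"))
```

Because the function is a generator, it can be fed an open file handle
directly and never needs the whole results file in memory.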
Anyway, I have some (slightly out-of-date) stuff on my page at
http://www.sanger.ac.uk/Users/kdj/software.html
and a class diagram in a UML-stylee kicking around (Gnome Dia format)
which isn't up there yet.
One thing that I can put in pretty soon is a Tk Blast/Fasta search
viewer widget because that relies on a pretty simple adapter to accept
Hit/HSP objects and a callback to activate on clicking.
cheers,
--
-= Keith James - kdj@sanger.ac.uk - http://www.sanger.ac.uk/Users/kdj =-
The Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambs CB10 1SA