[Bioperl-l] SeqFeature design

Jason Stajich jason@cgt.mc.duke.edu
Fri, 4 Oct 2002 11:34:44 -0400 (EDT)


On Fri, 4 Oct 2002, Seth Purcell wrote:

> Jason,
>
> As soon as I read your email I knew what the problem was - my script was
> inadvertently calling Dumper on each feature individually, rather than
> on an array of features, and thus was not printing the '_gsf_seq' =>
> $VAR1->{'_gsf_seq'} that I expected to see, but was silently
> dereferencing the same hashref each time. Thanks very much for your
> help, and I'm glad my original understanding of the design was correct.
>

Whew!
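
For anyone else who trips over the same thing, here is a minimal sketch of
the effect.  It uses plain hashes as stand-ins for the sequence and feature
objects (they are not real Bio::Seq or Bio::SeqFeature objects, and the data
is made up); only the '_gsf_seq' key name is taken from your dump.  Dumping
each feature separately expands the shared sequence every time, while dumping
the whole array expands it once and prints back-references after that.

```perl
use strict;
use warnings;
use Data::Dumper;
use Scalar::Util qw(refaddr);

$Data::Dumper::Sortkeys = 1;    # stable key order in the dumps

# Hypothetical stand-in for the parent sequence object.
my $seq = { id => 'demo', seq => 'ACGT' x 4 };

# Three stand-in features, each holding a reference to the same $seq.
my @features = map { +{ start => $_, _gsf_seq => $seq } } (1, 5, 9);

# Compare memory addresses: one distinct address means one shared object.
my %addresses = map { (refaddr($_->{_gsf_seq}) => 1) } @features;
printf "%d features, %d distinct sequence object(s)\n",
    scalar(@features), scalar(keys %addresses);

# Dump each feature on its own, then dump the array in one call.
my $one_at_a_time = join '', map { Dumper($_) } @features;
my $all_at_once   = Dumper(\@features);

# Count how many times the sequence hash is written out in full.
my $n_individual = () = $one_at_a_time =~ /'seq' =>/g;
my $n_array      = () = $all_at_once   =~ /'seq' =>/g;
print "expanded $n_individual time(s) when dumped individually\n";
print "expanded $n_array time(s) when dumped as one array\n";
```

The refaddr check is the quicker test of the two: if every feature reports
the same address, there is only one sequence object in memory, whatever the
per-feature dumps look like.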

> On an unrelated note, is there any way to easily parse ASN.1 format with
> BioPerl, or is it assumed that all the NCBI resources that are in ASN.1
> are also in genbank format?


Yeah, we never really wanted to touch that beast (ASN.1 parsing in
Bioperl) with a 10ft pole.  There is, in fact, an ASN.1 module on CPAN,
but I've never tried it out, and I don't think it quite works here.

However, since NCBI has already done the hard work, I would (assuming UNIX
here) use something like this (untested):
open(FH, "asn2ff -fb -i filename.asn1 |") or die "Cannot run asn2ff: $!";

or something similar.  So, like you said, assume all ASN.1 can be converted
to GenBank format.  We would *love* to have someone write a proper XML parser
for the NCBI seq XML and slot it into a new event-based parsing framework for
sequence data, but it's not on my personal radar screen, so I can't really get
to it at this point.
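
Going back to the asn2ff pipe, here is a slightly fuller sketch of the same
idea, again untested against asn2ff itself.  The helper names open_converter
and count_genbank_records are made up for illustration, and the asn2ff
arguments shown in the comment are simply the ones from my one-liner, so check
them against your installed version.  The list form of open avoids passing the
filename through the shell.

```perl
use strict;
use warnings;

# Open a read pipe from an external converter command.
# The list form of open runs the command directly, without a shell.
sub open_converter {
    my (@cmd) = @_;
    open(my $fh, '-|', @cmd) or die "Cannot run $cmd[0]: $!";
    return $fh;
}

# Count GenBank flat-file records on a handle by counting LOCUS lines.
sub count_genbank_records {
    my ($fh) = @_;
    my $count = 0;
    while (my $line = <$fh>) {
        $count++ if $line =~ /^LOCUS\s/;
    }
    return $count;
}

# Intended use (assumes asn2ff is on the PATH and accepts these arguments):
#   my $fh = open_converter('asn2ff', '-fb', '-i', 'filename.asn1');
#   print count_genbank_records($fh), " record(s)\n";
#   close($fh) or warn "asn2ff exited with status $?\n";
```

Instead of counting records, the same handle can be given to
Bio::SeqIO->new(-fh => $fh, -format => 'genbank') and read with next_seq, so
the converted output never has to be written to disk.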

Good luck, and let us know if there is anything else we can help with, and/or
if you've got ideas about improvements or new areas.

-jason

>
> Thanks again,
> Seth Purcell
> Scientific Programmer
> Whitehead/MIT Center for Genome Research
> Cambridge, MA
>
> Jason Stajich wrote:
>
> > On Fri, 4 Oct 2002, Seth Purcell wrote:
> >
> >
> >>Hi -
> >>
> >>I am using SeqIO::genbank to parse in annotated sequences, and it
> >>appears that each SeqFeature object the parser creates contains its own
> >>copy of the entire sequence as a PrimarySeq. Obviously, this can't work
> >>
> >
> > No, it has a reference to the original sequence object; it does not create
> > a separate instance for each feature.
> >
> >
> >>for any non-trivial annotated sequence - I've been testing with a 40kb
> >>seq and the memory requirements for the features are almost 100 times
> >>the sequence length. I read in the Seq documentation that circular
> >>
> >
> > Obviously this depends on how many features are annotating this 40kb
> > sequence.  We've been working on streamlining the system some, but there
> > are a number of container objects which get instantiated as well for each
> > sequence and feature set.  Have you checked the memory requirements on a
> > 100kb sequence, and after the Bio::SeqIO parser has been destroyed?
> >
> >
> >>references are avoided, which is quite understandable in Perl, but I
> >>thought it said that each feature had a reference to its sequence, not a
> >>copy of its PrimarySeq:
> >>
> >> > By having this split we avoid a lot of nasty circular references
> >> > (sequence features can hold a reference to a sequence without the
> >> > sequence holding a reference to the sequence feature).
> >>
> >>
> > I'm unclear where you think that the feature is creating a new copy of the
> > Bio::PrimarySeq object.  If you print out the memory location of each
> > feature's seq object, isn't it the same location?
> >
> >
> >>I have had little luck so far in finding out whether this is how
> >>SeqFeature objects are supposed to be constructed, or if this is rogue
> >>behavior on the part of the parser. Could someone please tell me what's
> >>going on?
> >>
> >
> > Features are created and then added to the Bio::Seq object which updates
> > the feature's reference to the sequence.
> >
> >
> >>Thank you very much,
> >>Seth Purcell
> >>Scientific Programmer
> >>Whitehead/MIT Center for Genome Research
> >>Cambridge, MA

-- 
Jason Stajich
Duke University
jason at cgt.mc.duke.edu