[Biojava-l] Re: genbank contig stuff
Schreiber, Mark
mark.schreiber at agresearch.co.nz
Tue Jul 8 12:16:55 EDT 2003
There seem to be lots of ways to think about contigs. One nice way is a Markov chain (although that is more of a consensus). An alternative is to treat the contig as a collection of sequences plus some associated information about where those sequences sit in the contig and what the consensus should look like. We do this with an XML description of the contig.
I feel that all the parts needed are in BioJava, and it would be good to have a fairly abstract Contig object that holds the required information. When the needed SequenceDB is available, a view could be made onto the consensus (a Sequence object), onto the Alignment, or even onto a Markov chain. When quality info is available, a Sequence over the Phred alphabet could be produced. In this way a Contig object is not a Sequence, an Alignment, or a Markov chain, but the information in it could be used to produce all three.
Anyone want to code that up :)
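As a possible starting point, here is a minimal sketch of that idea in plain Java. Everything in it is an assumption made for illustration: Placement, Contig, place() and consensus() are hypothetical names, not BioJava classes, and a plain Map from accession to sequence string stands in for a SequenceDB. The contig itself stores only placements; the consensus view is computed on demand once the member sequences can be looked up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical type: where one member sequence sits inside the contig.
final class Placement {
    final String accession; // id used to look the member up in the database
    final int offset;       // 0-based column where the member starts

    Placement(String accession, int offset) {
        this.accession = accession;
        this.offset = offset;
    }
}

// Hypothetical type: holds layout information only, no residues.
final class Contig {
    private static final String BASES = "acgt";
    private final int length;
    private final List<Placement> placements = new ArrayList<Placement>();

    Contig(int length) {
        this.length = length;
    }

    void place(String accession, int offset) {
        placements.add(new Placement(accession, offset));
    }

    // Consensus view: the most frequent base in each column, 'n' where no
    // member covers the column. Ties go to the first base in a, c, g, t order.
    String consensus(Map<String, String> db) {
        int[][] counts = new int[length][BASES.length()];
        for (Placement p : placements) {
            String seq = db.get(p.accession);
            if (seq == null) {
                throw new IllegalStateException("missing sequence: " + p.accession);
            }
            for (int i = 0; i < seq.length(); i++) {
                int col = p.offset + i;
                int base = BASES.indexOf(Character.toLowerCase(seq.charAt(i)));
                if (col >= 0 && col < length && base >= 0) {
                    counts[col][base]++;
                }
            }
        }
        StringBuilder out = new StringBuilder(length);
        for (int col = 0; col < length; col++) {
            int best = -1;
            int bestCount = 0;
            for (int b = 0; b < BASES.length(); b++) {
                if (counts[col][b] > bestCount) {
                    best = b;
                    bestCount = counts[col][b];
                }
            }
            out.append(best < 0 ? 'n' : BASES.charAt(best));
        }
        return out.toString();
    }
}
```

An alignment view or a quality view could hang off the same placements in the same way; a simple majority vote is used here only to keep the sketch short.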
- Mark
-----Original Message-----
From: Greg Cox [mailto:greg.cox at lionbioscience.com]
Sent: Tue 8/07/2003 3:19 a.m.
To: Matthew Pocock
Cc: biojava-l
Subject: RE: [Biojava-l] Re: genbank contig stuff
We looked at this a while back, and I suspect this isn't a problem BioJava can solve.
If we treat it as a sequence, one option is to try to assemble it. If BioJava assembles the sequence, it has to know where to get the composing sequences. This implies some sort of database backing to parse the contig sequences, which seems a bit excessive. If all you want is the features, we could create a dummy sequence of ambiguous nucleotides of the proper length, and attach the features to that. At that point though, I think it makes more sense to create a feature holder instead of pretending it's a real sequence. Which segues into...
The other option is to treat a contig as a new kind of beast, not a sequence. I don't know what this beast would look like; it has to be a feature holder, probably annotatable, and then what? Aesthetically I'm not sure this makes sense either; after all, a contig sequence is still a sequence.
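For concreteness, the dummy-sequence option from the first alternative could look roughly like the sketch below. It is plain Java with hypothetical names (Feature, PlaceholderEntry, addFeature); none of these are BioJava types. The entry is filled with 'n' to the declared length, and features are only checked against the range 1..length.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical type: a feature with 1-based, inclusive coordinates.
final class Feature {
    final String type;
    final int start;
    final int end;

    Feature(String type, int start, int end) {
        this.type = type;
        this.start = start;
        this.end = end;
    }
}

// Hypothetical type: a stand-in entry made entirely of ambiguous bases.
final class PlaceholderEntry {
    private final char[] residues;
    private final List<Feature> features = new ArrayList<Feature>();

    PlaceholderEntry(int length) {
        residues = new char[length];
        Arrays.fill(residues, 'n'); // 'n' = any nucleotide
    }

    int length() {
        return residues.length;
    }

    String seqString() {
        return new String(residues);
    }

    // Reject features that fall outside 1..length.
    void addFeature(String type, int start, int end) {
        if (start < 1 || end > residues.length || start > end) {
            throw new IllegalArgumentException(
                type + " " + start + ".." + end + " is outside 1.." + residues.length);
        }
        features.add(new Feature(type, start, end));
    }

    int featureCount() {
        return features.size();
    }
}
```

Dropping the residues array and keeping only the length and the feature list would turn the same class into the plain feature holder described above.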
The ray of light is that most (all?) contigs are also available in an expanded form. That's been enough for us to avoid grappling with this bull so far.
Greg
-----Original Message-----
From: biojava-l-bounces at biojava.org
[mailto:biojava-l-bounces at biojava.org]On Behalf Of Matthew Pocock
Sent: Thursday, June 26, 2003 2:58 PM
To: Matthew Pocock
Cc: biojava-l
Subject: [Biojava-l] Re: genbank contig stuff
Sorry - I fired that off without thinking much.
I just downloaded the GenBank file NT_010783 from the NCBI. Our parsers
spewed lots of errors about features not being within the range 1..0,
and after a little poking around in the code, I found that a zero
length sequence was being generated. In desperation, I looked at the
physical GenBank file. Instead of sequences, it contains a CONTIG
section with a single big join() describing how to build it from other
entries.
Has anybody modified our genbank parser to process entries like this? To
be honest, I'm not quite sure where to start.
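One place to start might be to parse the join() into a list of parts and sum their lengths, so that the entry at least gets a non-zero length for the features to sit in. The sketch below is plain Java and does not touch the existing BioJava parser. ContigPart and ContigJoinParser are hypothetical names, and the sketch assumes the elements are accession:start..end spans, optionally wrapped in complement(), plus gap(N) or gap(unkN) with an explicit length. Anything else is rejected rather than guessed at.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical type: one element of the join().
final class ContigPart {
    final String accession;   // null for a gap
    final int start;          // 1-based, inclusive
    final int end;            // 1-based, inclusive
    final boolean complement; // true if wrapped in complement(...)

    ContigPart(String accession, int start, int end, boolean complement) {
        this.accession = accession;
        this.start = start;
        this.end = end;
        this.complement = complement;
    }

    int length() {
        return end - start + 1;
    }
}

// Hypothetical helper: turns the text of a CONTIG section into parts.
final class ContigJoinParser {
    private static final Pattern SPAN =
        Pattern.compile("([A-Za-z0-9_.]+):(\\d+)\\.\\.(\\d+)");
    private static final Pattern GAP =
        Pattern.compile("gap\\((?:unk)?(\\d+)\\)");

    static List<ContigPart> parse(String text) {
        String s = text.replaceAll("\\s+", ""); // the join may wrap over lines
        if (!s.startsWith("join(") || !s.endsWith(")")) {
            throw new IllegalArgumentException("not a join(): " + text);
        }
        String body = s.substring("join(".length(), s.length() - 1);
        List<ContigPart> parts = new ArrayList<ContigPart>();
        for (String item : body.split(",")) {
            boolean comp = false;
            if (item.startsWith("complement(") && item.endsWith(")")) {
                comp = true;
                item = item.substring("complement(".length(), item.length() - 1);
            }
            Matcher gap = GAP.matcher(item);
            Matcher span = SPAN.matcher(item);
            if (gap.matches()) {
                parts.add(new ContigPart(null, 1, Integer.parseInt(gap.group(1)), false));
            } else if (span.matches()) {
                parts.add(new ContigPart(span.group(1),
                    Integer.parseInt(span.group(2)),
                    Integer.parseInt(span.group(3)), comp));
            } else {
                throw new IllegalArgumentException("unrecognised element: " + item);
            }
        }
        return parts;
    }

    // Length of the assembled contig: the sum of all part lengths.
    static int totalLength(List<ContigPart> parts) {
        int total = 0;
        for (ContigPart p : parts) {
            total += p.length();
        }
        return total;
    }
}
```

For a made-up input such as join(AAA00001.1:1..100,gap(50),complement(BBB00002.2:11..60)), parse() gives three parts and totalLength() gives 200. Fetching the referenced entries to build the real sequence is a separate problem.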
Matthew
_______________________________________________
Biojava-l mailing list - Biojava-l at biojava.org
http://biojava.org/mailman/listinfo/biojava-l