[Biojava-l] EST assemblies

Thu, 15 Aug 2002 09:26:42 +1200

Hi -

With all of the recent discussion about alignments, I have been thinking
about ways to represent EST assemblies in biojava. In reality they are
gapped alignments with potentially tens of thousands of sequences. They
also tend to carry some cruft around with them, such as a consensus
sequence (the alignment consensus) and the contig sequence, which is
basically the quality-clipped, ungapped consensus. There needs to be a
mapping between the contig sequence coordinates and the underlying
alignment coordinates. They also have interesting things like SNPs, which
really only exist as columns in the alignment that exceed some threshold
conditions.
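
To make the coordinate mapping concrete, here is a rough sketch of how
contig positions and alignment columns could be tied together. The class
ContigCoordinateMap and its methods are hypothetical names, not existing
biojava API. It assumes 0-based coordinates and ignores quality clipping,
which would add an offset on top of this.

```java
import java.util.Arrays;

/**
 * Hypothetical helper (not part of biojava): maps between columns of a
 * gapped consensus and positions in the ungapped contig derived from it.
 * Assumes 0-based coordinates and no quality clipping.
 */
final class ContigCoordinateMap {
    private final int[] columnToContig; // -1 where the consensus has a gap
    private final int[] contigToColumn; // one entry per ungapped base

    ContigCoordinateMap(String gappedConsensus, char gapChar) {
        int n = gappedConsensus.length();
        columnToContig = new int[n];
        int[] tmp = new int[n];
        int pos = 0;
        for (int col = 0; col < n; col++) {
            if (gappedConsensus.charAt(col) == gapChar) {
                columnToContig[col] = -1;
            } else {
                columnToContig[col] = pos;
                tmp[pos] = col;
                pos++;
            }
        }
        contigToColumn = Arrays.copyOf(tmp, pos);
    }

    /** Contig position for an alignment column, or -1 for a gap column. */
    int toContig(int column) { return columnToContig[column]; }

    /** Alignment column for a contig position. */
    int toColumn(int contigPos) { return contigToColumn[contigPos]; }

    /** Length of the ungapped contig. */
    int contigLength() { return contigToColumn.length; }

    public static void main(String[] args) {
        ContigCoordinateMap map = new ContigCoordinateMap("AC-GT--A", '-');
        System.out.println(map.toColumn(2)); // column holding the third contig base
        System.out.println(map.toContig(2)); // column 2 is a gap in the consensus
    }
}
```

Precomputing both arrays costs two ints per column but makes lookups in
either direction constant time, which seems a fair trade for a consensus
that is built once and queried often.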

Some issues to think about:

1. How best to hold potentially thousands of sequences in an alignment.
One solution might be to store only each sequence's differences from the
consensus and infer the rest from the consensus.

2. How to represent the quality data: should the contig/consensus
sequence be represented as PhredSequences, Sequences or maybe even
Markov Chains?

3. How to make a SNP-like feature.
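
As a starting point for points 1 and 3, here is a minimal sketch of the
"store only the differences" idea, with a simple column test for SNP
candidates layered on top. Everything in it (DiffAssembly, Read, addRead,
symbolAt, candidateSnpColumns, minDepth, minFraction) is a made-up name
for illustration and not biojava API. Reads are plain Strings that are
already aligned to the gapped consensus, coordinates are 0-based, and the
SNP test only counts disagreements with the consensus. Quality values are
not used at all.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch (not biojava API): reads stored as diffs from a gapped consensus. */
final class DiffAssembly {
    /** One aligned read: its column span plus only the columns where it differs. */
    static final class Read {
        final int start; // first column covered (0-based, inclusive)
        final int end;   // last column covered (0-based, inclusive)
        final Map<Integer, Character> diffs = new HashMap<>();
        Read(int start, int end) { this.start = start; this.end = end; }
    }

    private final String consensus;
    private final List<Read> reads = new ArrayList<>();

    DiffAssembly(String consensus) { this.consensus = consensus; }

    /** Adds a fully spelled-out aligned read, keeping only its mismatches. */
    void addRead(int start, String aligned) {
        Read r = new Read(start, start + aligned.length() - 1);
        for (int i = 0; i < aligned.length(); i++) {
            char c = aligned.charAt(i);
            if (c != consensus.charAt(start + i)) r.diffs.put(start + i, c);
        }
        reads.add(r);
    }

    /** Reconstructs the symbol a read shows at a column it covers. */
    char symbolAt(int readIndex, int column) {
        Read r = reads.get(readIndex);
        if (column < r.start || column > r.end) {
            throw new IndexOutOfBoundsException("read does not cover column " + column);
        }
        Character d = r.diffs.get(column);
        return d != null ? d : consensus.charAt(column);
    }

    /**
     * Columns covered by at least minDepth reads where the fraction of
     * covering reads that disagree with the consensus is at least minFraction.
     */
    List<Integer> candidateSnpColumns(int minDepth, double minFraction) {
        int[] depth = new int[consensus.length()];
        int[] mismatches = new int[consensus.length()];
        for (Read r : reads) {
            for (int col = r.start; col <= r.end; col++) depth[col]++;
            for (int col : r.diffs.keySet()) mismatches[col]++;
        }
        List<Integer> cols = new ArrayList<>();
        for (int col = 0; col < depth.length; col++) {
            if (depth[col] >= minDepth && mismatches[col] > 0
                    && mismatches[col] >= minFraction * depth[col]) {
                cols.add(col);
            }
        }
        return cols;
    }

    public static void main(String[] args) {
        DiffAssembly asm = new DiffAssembly("ACGTACGT");
        asm.addRead(0, "ACGTACGT");
        asm.addRead(0, "ACTTACGT");
        asm.addRead(2, "TTAC");
        asm.addRead(1, "CGTA");
        System.out.println(asm.candidateSnpColumns(3, 0.5));
    }
}
```

Memory then scales with the number of mismatches rather than with reads
times columns, and the SNP scan never has to expand a read. A real
SNP-like feature would probably want to look at the minor allele and the
Phred scores rather than a bare mismatch count.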

- Mark

Mark Schreiber
Bioinformatics
AgResearch Invermay
PO Box 50034
Mosgiel
New Zealand

PH:   +64 3 489 9175
FAX:  +64 3 489 3739
