[Biojava-l] New Ensembl Release

Wed May 7 12:03:21 EDT 2003

Once upon a time, Warth,Rainer,LAUSANNE,NRC-BS wrote:
> Dear all,
>   with the expected update of the human genome, I am wondering how to update
> my data. A first approach would be to compare the Ensembl peptide/gene FASTA
> sequence file with the new release. The comparison should probably look at
> the sequence as well as the FASTA header. Does anybody have a program using
> BioJava which would do this? Does anybody know if Ensembl will provide it?

What do you want to learn from this comparison?

As part of the Ensembl release process, the sequences of
predicted genes are compared with previous releases.  Only
if the sequences are very similar are the ENS* IDs
reused.  I can't remember the precise criterion for
"very similar", but there's some information about the
id-mapping process at:

    http://www.ensembl.org/java/README-id-mapping.txt

If you want more details, it might be worth contacting the
ensembl-dev mailing list.

If you want more detailed analysis, I don't know of any
off-the-shelf tools which would help you.  Depending on
precisely what you want to find, extracting sequences from
the database with biojava-ensembl then comparing them either
by running an external tool (blast?) or using the biojava
dynamic programming and alignment toolkit might be a
sensible strategy.
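
For the simple case in the original question (comparing two
FASTA dumps by header and sequence), something along these
lines might be enough as a starting point.  This is only a
rough sketch in plain Java: it does not use BioJava or
biojava-ensembl, the class and method names are made up for
illustration, and it assumes that the first whitespace-
delimited token of each FASTA header is the ENS* ID, which
you should check against the files you actually have.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, plain Java only: no BioJava or biojava-ensembl calls.
final class FastaReleaseDiff {

    /** One FASTA record: the header line (without '>') and the joined sequence. */
    static final class FastaEntry {
        final String header;
        final String sequence;

        FastaEntry(String header, String sequence) {
            this.header = header;
            this.sequence = sequence;
        }
    }

    /** Parses FASTA lines into a sorted map keyed by the first token of each header (assumed to be the ID). */
    static Map<String, FastaEntry> parse(List<String> lines) {
        Map<String, FastaEntry> entries = new TreeMap<>();
        String header = null;
        StringBuilder seq = new StringBuilder();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.startsWith(">")) {
                if (header != null) {
                    store(entries, header, seq);
                }
                header = trimmed.substring(1).trim();
                seq.setLength(0);
            } else if (header != null) {
                seq.append(trimmed); // sequence may span several lines
            }
        }
        if (header != null) {
            store(entries, header, seq);
        }
        return entries;
    }

    private static void store(Map<String, FastaEntry> entries, String header, StringBuilder seq) {
        String id = header.split("\\s+", 2)[0];
        entries.put(id, new FastaEntry(header, seq.toString()));
    }

    /** Reports IDs that were removed, added, or whose sequence or header differs between releases. */
    static List<String> diff(Map<String, FastaEntry> oldRelease, Map<String, FastaEntry> newRelease) {
        List<String> report = new ArrayList<>();
        for (Map.Entry<String, FastaEntry> e : oldRelease.entrySet()) {
            FastaEntry updated = newRelease.get(e.getKey());
            if (updated == null) {
                report.add("REMOVED " + e.getKey());
            } else if (!updated.sequence.equalsIgnoreCase(e.getValue().sequence)) {
                report.add("SEQUENCE_CHANGED " + e.getKey());
            } else if (!updated.header.equals(e.getValue().header)) {
                report.add("HEADER_CHANGED " + e.getKey());
            }
        }
        for (String id : newRelease.keySet()) {
            if (!oldRelease.containsKey(id)) {
                report.add("ADDED " + id);
            }
        }
        return report;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java FastaReleaseDiff <old.fa> <new.fa>");
            return;
        }
        Map<String, FastaEntry> oldRelease = parse(Files.readAllLines(Paths.get(args[0])));
        Map<String, FastaEntry> newRelease = parse(Files.readAllLines(Paths.get(args[1])));
        for (String line : diff(oldRelease, newRelease)) {
            System.out.println(line);
        }
    }
}
```

This only detects exact differences.  An ID reported as
SEQUENCE_CHANGED may still be "very similar" in the sense
used by the id-mapping process, so for those cases an
alignment-based comparison would still be needed.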

      Thomas.