[Bioperl-l] Bio::Index:EMBL on embl flatfiles

Brian Osborne brian_osborne at cognia.com
Mon Apr 4 14:34:28 EDT 2005


Jonathan,

I see. You're not interested in fetching by id or key, you're interested in
fetching by coordinate. Modifying Bio/Index/EMBL won't help then. One idea
would be that you create the fasta files used for the BLAST from the EMBL
files, and make the fasta headers match the desired format. On the other
hand I may not understand precisely your intent.

Brian O.

-----Original Message-----
From: Jonathan Miller [mailto:millerj at bcm.tmc.edu]
Sent: Monday, April 04, 2005 2:04 PM
To: Brian Osborne
Subject: RE: [Bioperl-l] Bio::Index:EMBL on embl flatfiles



Dear Brian,

Thank you for your reply. Regarding an "identifier in common,"
this seems to be somewhat tricky; I give a specific
example below. Because the fasta file uses relative, not
absolute, coordinates, I will probably
wait a few days for you to write the interface, if you would
be so kind. On the other hand, you might well believe that
EMBL should have formatted their files somewhat differently,
or you might suggest I go about it an entirely different way.

many thanks,

jm

More specifically, the goal is to BLAST a
(local) fasta file; find the sequence location,
and look up its annotation in a (local) EMBL flatfile.

So, for example, for honeybee fasta file from EMBL:
Apis_mellifera.AMEL1.1.mar.dna.contig.fa,
the fasta header of the
contig where the sequence is found might be:

>Contig18.1.1312 dna:contig scaffold:AMEL1.1:Group1.1:1:1312:1
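
For illustration, a header of this shape can be split into its parts as sketched below. This is a minimal sketch in plain Python, not BioPerl; the field names (coord_system, assembly, region, start, end, strand) are assumptions read off the colon-separated layout of the example header, not names taken from any EMBL or Ensembl documentation.

```python
from typing import NamedTuple


class ContigHeader(NamedTuple):
    contig: str        # first word of the header, e.g. Contig18.1.1312
    coord_system: str  # assumed meaning of field 1, e.g. scaffold
    assembly: str      # assumed meaning of field 2, e.g. AMEL1.1
    region: str        # assumed meaning of field 3, e.g. Group1.1
    start: int
    end: int
    strand: int


def parse_contig_header(line: str) -> ContigHeader:
    """Split a fasta header like the one above into named fields."""
    fields = line.lstrip(">").split()
    contig = fields[0]
    # The last whitespace-separated word carries the colon-separated location.
    coord_system, assembly, region, start, end, strand = fields[-1].split(":")
    return ContigHeader(contig, coord_system, assembly, region,
                        int(start), int(end), int(strand))


header = parse_contig_header(
    ">Contig18.1.1312 dna:contig scaffold:AMEL1.1:Group1.1:1:1312:1"
)
print(header.contig, header.region, header.start, header.end)
```

The region name together with the assembly is the part most likely to be shared with the flat-file entries, since the start and end values differ between the two files.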

Now I want to look up the annotation in the EMBL
flat files (for example, Apis_mellifera.0.dat) that
I have indexed using Bio::Index::EMBL.

However, the accession numbers in the EMBL flat files
have the form:

scaffold:AMEL1.1:Group1.10:1:348491:1

and apparently

scaffold:AMEL1.1:Group1.1:1:1312:1

-never- appears in an EMBL flat file, although the entry:

SV   scaffold:AMEL1.1:Group1.1:1:422138:1

does appear,
as does the entry within this ID:

FT   misc_feature    1..1312
FT                   /note="contig Contig18.1.1312 1..1312(1)"
FT   misc_feature    1860..4967
FT                   /note="contig Contig17.1.3108 1..3108(1)"
...etc...

however, as you can see, the coordinates in this last entry are -local-
and not absolute with respect to the scaffold entry.
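
One possible way to bridge the two coordinate systems is to build a lookup table from the misc_feature notes themselves. The sketch below is plain Python rather than BioPerl, and it rests on several assumptions: that the SV line precedes the FT lines within an entry, that each contig note directly follows its misc_feature line, and that the contig lies on the forward strand (the "(1)" in the note). The names Placement, index_contigs and to_entry_coordinate are hypothetical.

```python
import re
from typing import Dict, Iterable, NamedTuple, Optional, Tuple


class Placement(NamedTuple):
    entry: str         # identifier taken from the SV line of the entry
    entry_start: int   # where the contig starts within the entry
    entry_end: int
    contig_start: int  # contig-local coordinates given in the note
    contig_end: int


FEATURE_RE = re.compile(r"^FT\s+misc_feature\s+(\d+)\.\.(\d+)")
NOTE_RE = re.compile(r'^FT\s+/note="contig (\S+) (\d+)\.\.(\d+)\((-?\d+)\)"')


def index_contigs(lines: Iterable[str]) -> Dict[str, Placement]:
    """Map each contig name to the entry and span that contains it."""
    placements: Dict[str, Placement] = {}
    entry: Optional[str] = None
    span: Optional[Tuple[int, int]] = None
    for line in lines:
        if line.startswith("SV "):
            entry, span = line.split()[1], None
        elif line.startswith("//"):  # end of an EMBL entry
            entry, span = None, None
        elif FEATURE_RE.match(line):
            m = FEATURE_RE.match(line)
            span = (int(m.group(1)), int(m.group(2)))
        else:
            n = NOTE_RE.match(line)
            if n and entry and span:
                placements[n.group(1)] = Placement(
                    entry, span[0], span[1], int(n.group(2)), int(n.group(3)))
                span = None
    return placements


def to_entry_coordinate(p: Placement, contig_pos: int) -> int:
    """Convert a contig-local position to an entry position (forward strand only)."""
    return p.entry_start + (contig_pos - p.contig_start)


sample = [
    "SV   scaffold:AMEL1.1:Group1.1:1:422138:1",
    "FT   misc_feature    1..1312",
    'FT                   /note="contig Contig18.1.1312 1..1312(1)"',
    "FT   misc_feature    1860..4967",
    'FT                   /note="contig Contig17.1.3108 1..3108(1)"',
]
placements = index_contigs(sample)
print(to_entry_coordinate(placements["Contig17.1.3108"], 100))  # 1959
```

With such a table, the contig name from a BLAST hit gives the identifier of the enclosing entry, and the hit position can be shifted into that entry's coordinates before looking up overlapping features.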

I don't know if I should be indexing differently,
searching on a different key, or what.

With NCBI fasta and GenBank flat files, this procedure
was straightforward (i.e. no thought was required)
to implement successfully. Presumably there is an
analogous interface for the EMBL format?



On Mon, 4 Apr 2005, Brian Osborne wrote:

> Jonathan,
>
> Is there some identifier in common between the fasta entries and the EMBL
> entries? If so what you want to be able to do is to create your EMBL indices
> based on this key, but the current Bio::Index::EMBL doesn't do this. If you
> want to wait a couple of days I can modify EMBL.pm so it can create this
> sort of custom index, or you can try to modify EMBL.pm yourself. If you look
> at its sister, Genbank.pm, you'll see that the modifications are not
> difficult.
>
> Brian O.
>
> -----Original Message-----
> From: bioperl-l-bounces at portal.open-bio.org
> [mailto:bioperl-l-bounces at portal.open-bio.org] On Behalf Of Jonathan Miller
> Sent: Monday, April 04, 2005 12:27 AM
> To: bioperl-l at bioperl.org
> Subject: [Bioperl-l] Bio::Index:EMBL on embl flatfiles
>
>
>
> I have the following task to perform in large
> quantities. I can do it successfully with
> files from ncbi, in fasta and genbank format.
>
> For various reasons, I would prefer to do it
> with embl format annotation files, rather than
> genbank.
>
> I first used formatdb to create a blast index
> for Apis_mellifera.AMEL1.1.mar.dna.contig.fa  .
>
> I blast my sequence against this file,
> and obtain the expected hit, and then I want to find
> annotation for this hit, in embl format
> flatfiles (Apis_mellifera.0.dat, etc.) with bioperl.
>
> To do this, I have to first make an index with
> bioperl, using Bio::Index::EMBL .
>
> Then I need to use "fetch" within bioperl.
> The problem is, that "fetch" within bioperl
> doesn't seem to know how to use the fasta
> headers to find the sequence in the embl flatfile.
>
> There is probably a simple solution to this
> that everyone working with bioperl and embl
> flatfiles knows, but I don't know what it
> is.
>
>




