[Bioperl-l] Missing Sequences

Brian Osborne brian_osborne@cognia.com
Thu, 30 May 2002 12:59:00 -0400


Mick,

I should also mention that the new set of CGIs at Genbank, the eutils, will
accept either RefSeq ids or the familiar accession numbers, or gi's.
However, Bioperl's sequence retrieval software doesn't yet use eutils.
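
For anyone who wants to try eutils directly, here is a rough sketch of building an efetch request for a mixed list of ids. It is not Bioperl code. The base URL and the db/rettype/retmode values follow NCBI's current E-utilities documentation as I understand it and may differ from the interface available when this was written; build_efetch_url is a made-up helper name.

```python
from urllib.parse import urlencode

# Assumption: current NCBI E-utilities efetch endpoint.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(ids, db="nuccore", rettype="gb", retmode="text"):
    """Return an efetch URL requesting the given ids as flat-file text.

    ids may mix RefSeq ids and ordinary accession numbers; they are
    sent as one comma-separated 'id' parameter.
    """
    params = {
        "db": db,
        "id": ",".join(ids),
        "rettype": rettype,
        "retmode": retmode,
    }
    return EFETCH_URL + "?" + urlencode(params)


if __name__ == "__main__":
    # Only prints the URL; no network request is made here.
    print(build_efetch_url(["NM_022139", "AL117415"]))
```

The sketch only constructs the URL, so it can be checked without touching the network; actually fetching it is left to whatever HTTP client you prefer.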

Brian O.

-----Original Message-----
From: mick@lightlink.com [mailto:mick@lightlink.com]On Behalf Of Mick Watson
Sent: Thursday, May 30, 2002 11:55 AM
To: Brian Osborne
Cc: bioperl-l@bioperl.org
Subject: Re: [Bioperl-l] Missing Sequences

Thanks for your help! :-)

I guess my bad assumption is that when I look at a UniGene record and
see:

    /gb=NM_etc

I take gb to stand for GenBank and NM_etc to be a GenBank accession
number, when in fact it could be a RefSeq accession number.

But aren't RefSeq entries in some way derived from GenBank/EMBL entries?  So
why not have the GenBank accession in the /gb= tag and have a new tag, /rs=
for the refseq accession....?

Or maybe I am just confused....
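
Since the /gb= tag can carry either kind of accession, one workaround is to sort them by prefix before choosing a database. The sketch below is only an illustration in Python, not Bioperl. It assumes that RefSeq accessions can be recognised by two capital letters followed by an underscore (NM_, NT_ and so on), and that the tag value contains only letters, digits, underscores and dots. The function names and the sample header line are invented.

```python
import re

# Assumption: RefSeq accessions start with two capitals and an underscore.
REFSEQ_RE = re.compile(r"^[A-Z]{2}_")
# Assumption: the /gb= value is made of letters, digits, '_' and '.'.
GB_TAG_RE = re.compile(r"/gb=([A-Za-z0-9_.]+)")


def extract_accessions(text):
    """Return every accession that follows a /gb= tag in text."""
    return GB_TAG_RE.findall(text)


def split_by_source(accessions):
    """Split accessions into (refseq, other), preserving input order."""
    refseq = [a for a in accessions if REFSEQ_RE.match(a)]
    other = [a for a in accessions if not REFSEQ_RE.match(a)]
    return refseq, other


if __name__ == "__main__":
    # The accession list from the original message in this thread.
    accs = ("AL117415 AJ291674 AJ291673 AJ291675 NM_022139 AF253318 "
            "NM_025220 AB055891 BI826766 BG547620").split()
    refseq, other = split_by_source(accs)
    print("RefSeq:", refseq)
    print("Other: ", other)
```

The RefSeq group could then go to Bio::DB::RefSeq and the rest to the EMBL or GenBank module, as the FAQ entry quoted below suggests.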

It is also rather unfortunate that the fetch software at both the EBI and
NCBI will croak when just one of a whole list of accessions is not present
in the database.
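
One way around the all-or-nothing behaviour is to request the accessions one at a time and note which ones fail. A minimal Python sketch of that pattern follows. It assumes you have some per-accession fetch routine that raises an exception when an entry is missing; fetch_one and the toy fake_db here are stand-ins for illustration, not a real EBI or NCBI client.

```python
def fetch_each(accessions, fetch_one):
    """Fetch accessions individually so one missing entry cannot abort
    the whole batch.

    fetch_one is any callable that takes an accession, returns a record,
    and raises an exception when the accession cannot be retrieved.
    Returns (found, missing): a dict of records and a list of failures.
    """
    found, missing = {}, []
    for acc in accessions:
        try:
            found[acc] = fetch_one(acc)
        except Exception:
            missing.append(acc)
    return found, missing


if __name__ == "__main__":
    # Toy stand-in for a remote database: a dict lookup raises KeyError
    # for absent entries, which plays the role of the fetch failure.
    fake_db = {"AL117415": "seq1", "AJ291674": "seq2"}
    found, missing = fetch_each(
        ["AL117415", "NM_022139", "AJ291674"],
        lambda acc: fake_db[acc],
    )
    print("found:", sorted(found))
    print("missing:", missing)
```

The cost is one request per accession rather than one per batch, so this trades speed for knowing exactly which ids were not found.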

Thanks again
Mick

Brian Osborne wrote:

> Mick,
>
> Those NM_* ids correspond to RefSeq entries. From the FAQ:
>
>   Q2.3: How can I get NT_ or NM_ accessions from NCBI (Reference
>         Sequences)?
>
>         Use Bio::DB::RefSeq not Bio::DB::GenBank when you are retrieving
>         the NM_ accessions. This is still an area of active development
>         because the data providers have not provided the best interface
>         for us to query.  EBI has provided a mirror with their dbfetch system
>         which is accessible through the Bio::DB::RefSeq object however,
>         there are cases where NT_ accessions will not be retrievable.
>
> Brian O.
>
> -----Original Message-----
> From: bioperl-l-admin@bioperl.org [mailto:bioperl-l-admin@bioperl.org]On
> Behalf Of Mick Watson
> Sent: Thursday, May 30, 2002 11:30 AM
> To: bioperl-l@bioperl.org
> Subject: [Bioperl-l] Missing Sequences
>
> This is an old-ish problem when using Bioperl to fetch multiple
> sequences from GenBank/EMBL.
>
> I am using EMBL.pm (Bioperl 1.0) to fetch multiple sequences that have
> been identified from a blast search against Unigene.  Parsing the
> Accession from unigene entries is simple as I just look for the
>
>     /gb=.....
>
> token and I have the accessions.  Simple.
>
> The problem is, I guess, that these are GenBank accessions so I get the
> following list:
>
> AL117415 AJ291674 AJ291673 AJ291675 NM_022139 AF253318 NM_025220
> AB055891 BI826766 BG547620
>
> When I use EMBL.pm to fetch these, it croaks with the error that
> NM_022139 and NM_025220 do not exist, and when I try to fetch them from
> the EBI, it's right, they don't.  However, when I go to the NCBI, they
> DO exist in GenBank (or at least the NCBI's nucleotide fetch tool says
> that they do).
>
> So my question is why is it that there are sequences in GenBank that
> aren't in EMBL?  I'm guessing the NM_ prefix has some sort of
> relevance....
>
> Also, this looks as if this will force me to use GenBank.pm to fetch the
> sequences and not EMBL.pm, and I don't want to do this for various
> reasons....
>
> Thanks
> Mick
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l@bioperl.org
> http://bioperl.org/mailman/listinfo/bioperl-l