Seqret and GIs

Wed Mar 27 09:19:41 UTC 2002

Richard Cote wrote:
> Is there a way to retrieve entries from genbank flatfiles and blast
> formatted databases based on the GI using seqret?
> 
> If I use seqret blastnr:NP_005047.1 (a typical AC entry), it will return
> a fasta file without a problem. If I use seqret blastnr:4826968 (the GI
> corresponding to the same AC as above), it complains that it cannot find
> the entry...
> 
> The reason why I need to access records through their GI and not their
> AC is that the standalone www blast server only returns a GI in the html
> output and not an AC.
> 
> Can anyone help?

Well ... You can write a script to query the database by GI (or ID or ACC)
using some other NCBI utility, and use that as "methodentry". That will
work with the present EMBOSS release. But also ...
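
As a rough illustration of the "translate the GI first" idea, here is a
minimal Python sketch. It does not use any NCBI utility; instead it assumes
you have the FASTA source of the database and that its header lines follow
the NCBI "gi|<number>|<db tag>|<accession.version>|" layout. The function
name and the sample header text are made up for the example.

```python
import re

# Assumed NCBI-style identifier: gi|<GI>|<db tag>|<accession.version>|
# Identifiers without a version suffix are ignored by this sketch.
DEFLINE_RE = re.compile(r"gi\|(\d+)\|[a-z]+\|([A-Za-z0-9_]+\.\d+)\|")


def gi_to_accession(fasta_lines, gi):
    """Return the accession.version paired with `gi`, or None if absent."""
    wanted = str(gi)
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence data, not a header line
        # A merged header may carry several gi|...| identifiers.
        for found_gi, accession in DEFLINE_RE.findall(line):
            if found_gi == wanted:
                return accession
    return None


# Hypothetical two-line FASTA entry; the description text is invented.
sample = [
    ">gi|4826968|ref|NP_005047.1| placeholder description",
    "MSEQ",
]
accession = gi_to_accession(sample, 4826968)
if accession is not None:
    # This USA form is the one that already works with seqret.
    print("blastnr:" + accession)
```

The accession that comes back can then be passed to seqret in the usual
way, which avoids needing GI support in the index at all.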

Coming soon (in EMBOSS 2.4, although some work is needed before we have the
index fields for dbiblast-indexed databases) is the ability to query by
additional fields, such as "SV" for the sequence version (for example
AA123456.1).

It is easy to extend this to include GI (the USA would be expanded to
"BLASTNR-GI:4826968"), but this would be limited to databases that include
a GI number, for example GenBank but not EMBL. (Aside: Why don't NCBI use
the sequence version so entries can be tracked by accession number as
well??? AA123456.1 is so much more useful than 1681491).

For EMBOSS it is not a problem if only a few databases include a field,
because the database definition includes the list of fields that can be
queried (you add "fields: sv" to the database definition to query by
SeqVersion) so GI can be limited to NCBI format blast databases.
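
To make the "only query fields the database declares" idea concrete, here
is a small Python sketch. It is not EMBOSS code: the table of databases and
fields, and the function name, are hypothetical stand-ins for what the
database definitions would provide. The USA form it builds follows the
"BLASTNR-GI:4826968" pattern mentioned above.

```python
# Hypothetical stand-in for the per-database "fields" lists.
DB_FIELDS = {
    "blastnr": {"id", "acc", "sv", "gi"},  # NCBI-format blast database
    "embl": {"id", "acc", "sv"},           # no GI numbers in EMBL
}


def field_usa(db, field, value):
    """Build a DB-FIELD:value query, refusing fields the db lacks."""
    db = db.lower()
    field = field.lower()
    if db not in DB_FIELDS:
        raise KeyError("unknown database: " + db)
    if field not in DB_FIELDS[db]:
        raise ValueError(db + " does not declare field " + field)
    return "{}-{}:{}".format(db.upper(), field.upper(), value)


print(field_usa("blastnr", "gi", 4826968))
```

Asking for a GI in the "embl" entry of this table raises ValueError, which
mirrors the point that a field is only queryable where it is declared.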

The fields added so far are (in addition to ID and ACC already supported) :

SV (EMBL/GenBank sequence version)
DES (words in description)
KEY (complete keywords)
ORG (taxonomy levels)

These work (because they are part of the query language) through SRS, and
for querying simple file input. We are looking at how best to build indices
for them with dbiflat, dbifasta, dbiblast and dbigcg. The index file format
and the source code to query the indices will be essentially the same as
the existing code for accession numbers.

Are there other fields that would be useful?

regards,

Peter Rice

-- 
------------------------------------------------
Peter Rice, LION Bioscience Ltd, Cambridge, UK
peter.rice at uk.lionbioscience.com +44 1223 224723