[Bioperl-l] blast results accession numbers

Mon, 4 Nov 2002 12:58:18 -0600

Hi,

I noticed that when parsing text BLAST results, the accession number is not always parsed correctly; the locus is returned instead.  I am going to fix the parser so that it returns the accession number, following the identifier syntax documented in
ftp://ftp.ncbi.nih.gov/blast/db/README.

For some of the identifier types, I am not sure which field to use (the identifier syntax table from the README is copied at the bottom of this message).  My current plan for two of them:

PDB - take entry
GNL - take identifier.  
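
To make the rule concrete, here is a rough sketch of the field selection I have in mind.  It is written in Python purely for illustration and is not the actual BioPerl parser code; the names ACCESSION_FIELD and extract_accession are hypothetical.  The choices for pir, prf and lcl are my assumptions, and pat and bbs are deliberately left out because I have not decided what to do with them.

```python
# Hypothetical lookup: for each database tag, the offset (from the tag) of the
# '|'-separated field to report as the accession.
ACCESSION_FIELD = {
    "gb": 1, "emb": 1, "dbj": 1, "sp": 1, "ref": 1,  # tag|accession|locus
    "pdb": 1,  # pdb|entry|chain         -> take entry
    "gnl": 2,  # gnl|database|identifier -> take identifier
    "pir": 2,  # pir||entry              -> assumption: take entry
    "prf": 2,  # prf||name               -> assumption: take name
    "lcl": 1,  # lcl|identifier          -> assumption: take identifier
}


def extract_accession(seqid):
    """Return the accession-like field of a BLAST sequence identifier.

    Scans the fields left to right for the first known database tag, so a
    leading prefix such as gi|number| is skipped. Falls back to the
    unchanged identifier when no known tag is found.
    """
    fields = seqid.split("|")
    for i, tag in enumerate(fields):
        offset = ACCESSION_FIELD.get(tag)
        if offset is None:
            continue
        j = i + offset
        if j < len(fields) and fields[j]:
            return fields[j]
    return seqid


if __name__ == "__main__":
    for example in ("gb|AAA12345|HUMXYZ", "pdb|1ABC|A", "gnl|mydb|seq42"):
        print(example, "->", extract_accession(example))
```

Identifiers with no recognised tag come back unchanged, so nothing is lost for the types I have not settled on yet.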

The current output also drops the version, and the XML output does not keep it either.  I will not make the text parser keep it unless someone chimes in that they want it; by default I am matching what I can find in the XML output.
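
For clarity, "dropping the version" here means something like the sketch below.  Again this is Python for illustration only, not the BioPerl code; strip_version is a hypothetical name, and it assumes the version is a trailing ".<digits>" suffix on the accession.

```python
import re


def strip_version(accession):
    """Remove a trailing '.<digits>' version suffix, if one is present."""
    return re.sub(r"\.\d+$", "", accession)


if __name__ == "__main__":
    print(strip_version("AAA12345.1"))
    print(strip_version("AAA12345"))
```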

If anyone has strong feelings, let me know; otherwise I will commit this change.

FYI - copied from the link above:

Appendix 1: Sequence Identifier Syntax

The syntax of sequence header lines used by the NCBI BLAST server depends on
the database from which each sequence was obtained.  The table below lists
the identifiers for the databases from which the sequences were derived.

  Database Name                     Identifier Syntax
  ============================      ========================
  GenBank                           gb|accession|locus
  EMBL Data Library                 emb|accession|locus
  DDBJ, DNA Database of Japan       dbj|accession|locus
  NBRF PIR                          pir||entry
  Protein Research Foundation       prf||name
  SWISS-PROT                        sp|accession|entry name
  Brookhaven Protein Data Bank      pdb|entry|chain
  Patents                           pat|country|number 
  GenInfo Backbone Id               bbs|number 
  General database identifier       gnl|database|identifier
  NCBI Reference Sequence           ref|accession|locus
  Local Sequence identifier         lcl|identifier

-Mat