[Biopython-dev] [Bug 1946] Parsing GenBank Files - unknown line type PROJECT

bugzilla-daemon at portal.open-bio.org bugzilla-daemon at portal.open-bio.org
Sun Feb 5 07:00:15 EST 2006


http://bugzilla.open-bio.org/show_bug.cgi?id=1946


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED
          Component|Martel/Mindy                |Main Distribution
         OS/Version|Mac OS                      |All
            Summary|Parsing GenBank Files -     |Parsing GenBank Files -
                   |ParserPositionException:    |unknown line type PROJECT




------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk  2006-02-05 07:00 -------
The non-Martel GenBank parser in CVS is also unaware of the PROJECT line in
GenBank files.

I would expect it to fail with an assertion error:

Unknown line type, PROJECT found:
PROJECT     GenomeProject:14204

This looks like an easy fix; however, we need to decide how to store the
project information.  Maybe a simple string for now, e.g. "GenomeProject:14204".
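
A minimal sketch of the "simple string" option, for illustration only and not
the actual parser code in CVS. The function name parse_project_line is
hypothetical, and the sketch assumes the keyword occupies the first 12 columns
with the data starting at column 13, as in other GenBank header lines:

```python
def parse_project_line(line):
    """Return the PROJECT value as a plain string, e.g. 'GenomeProject:14204'.

    Hypothetical helper for illustration; not part of Biopython.
    Assumes the keyword sits in the first 12 columns of the line.
    """
    keyword = line[:12].strip()
    if keyword != "PROJECT":
        raise ValueError("Not a PROJECT line: %r" % line)
    # Store everything after the keyword field as an uninterpreted string.
    return line[12:].strip()
```

Keeping the value as an uninterpreted string means the parser would not need
to change if the contents of the PROJECT line are revised later.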

Also maybe unknown line types in the header should trigger warnings rather than
errors that stop the parsing...
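
A rough sketch of the "warn rather than fail" idea, using Python's standard
warnings module. The names scan_header and KNOWN_HEADER_KEYWORDS are
hypothetical, the keyword set is a deliberately incomplete subset, and none of
this reflects how the existing parser is actually structured:

```python
import warnings

# Illustrative subset only; a real parser would list every header keyword.
KNOWN_HEADER_KEYWORDS = {"LOCUS", "DEFINITION", "ACCESSION", "VERSION"}


def scan_header(lines):
    """Collect known header lines; warn about and skip unknown line types."""
    header = {}
    current = None  # keyword that continuation lines should be appended to
    for line in lines:
        keyword = line[:12].strip()
        value = line[12:].strip()
        if not keyword:
            # Continuation line: belongs to the previous known keyword, if any.
            if current is not None:
                header[current] += " " + value
            continue
        if keyword in KNOWN_HEADER_KEYWORDS:
            header[keyword] = value
            current = keyword
        else:
            # Warn and carry on instead of raising and stopping the parse.
            warnings.warn("Unknown line type, %s found:\n%s" % (keyword, line))
            current = None
    return header
```

With this approach a record containing a PROJECT line would still parse, and
the caller would see a warning naming the unrecognised line type.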

---------------------------------------

Quoting from 
ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt

---------------------------------------

1.4.1 New Linetype for Genome Project Identifier

  DDBJ, EMBL, and GenBank are working to create a collaborative system that
will assign a unique numeric identifier to genome projects. The purpose of
this new identifier is to provide a link among sequence records that pertain
to a specific genome sequencing project.

  At GenBank, this new identifier will be presented in the flatfile format
via a new linetype : PROJECT . Here is a mocked-up example demonstrating
the new linetype's use:

LOCUS       CH476840             1669278 bp    DNA     linear   CON 05-OCT-2005
DEFINITION  Magnaporthe grisea 70-15 supercont5.200 genomic scaffold, whole
            genome shotgun sequence.
ACCESSION   CH476840 AACU02000000
VERSION     CH476840.1  GI:77022292
PROJECT     GENOME_PROJECT:12345

The integer 12345 represents the value of a possible genome project
identifier.

There is a possibility that the contents of the PROJECT line might change 
somewhat from this example by the time the new identifier is implemented.
We will keep you posted of any such changes via these release notes and the
GenBank listserv.

  These Genome Project identifiers will be searchable within NCBI's
Entrez: Genome-Project database:

  http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=genomeprj

  The earliest date on which this new linetype will appear in the GenBank
flatfile format is February 15 2006.
---------------------------------------

Looks like they are ahead of schedule in releasing this new line type.





