[Biopython-dev] [Bug 2591] GenBank files misparsed for long organism names

bugzilla-daemon at portal.open-bio.org
Wed Dec 17 23:44:58 UTC 2008


http://bugzilla.open-bio.org/show_bug.cgi?id=2591





------- Comment #5 from joelb at lanl.gov  2008-12-17 18:44 EST -------
I received the following response to my follow-up.  It now appears that the bug
is in Biopython, since GenBank has changed its definition.  It seems likely
that all Bio* flatfile parsers will be affected.

>I just received the wording that will appear in Section 3.4.2 of gbrel.txt 
>for this month's release:
>
>   ORGANISM     - Formal scientific name of the organism (first line)
>   and taxonomic classification levels (second and subsequent lines).
>   Mandatory subkeyword in all annotated entries/two or more records.
>
>   In the event that the organism name exceeds 68 characters (80 - 13 + 1)
>   in length, it will be line-wrapped and continue on a second line,
>   prior to the taxonomic classification. Unfortunately, very long 
>   organism names were not anticipated when the fixed-length GenBank
>   flatfile format was defined in the 1980s. The possibility of linewraps
>   makes the job of flatfile parsers more difficult: essentially, one
>   cannot be sure that the second line is truly a classification/lineage
>   unless it consists of multiple tokens, delimited by semi-colons.
>   The long-term solution to this problem is to introduce an additional
>   subkeyword, probably 'LINEAGE'. This might occur sometime in 2009
>   or 2010.
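
The heuristic described in the quoted text could be sketched roughly as
follows. This is only an illustration, not Biopython's actual parser code:
the function name split_organism_block is hypothetical, and it assumes the
caller has already collected the ORGANISM value lines (the first line plus
its continuation lines) as plain strings. Any continuation line without a
semicolon is treated as part of the wrapped organism name; the first line
containing a semicolon starts the lineage.

```python
def split_organism_block(lines):
    """Split ORGANISM value lines into (organism_name, lineage_list).

    Hypothetical helper: `lines` holds the first ORGANISM line and all
    of its continuation lines, with or without leading indentation.
    """
    values = [line.strip() for line in lines if line.strip()]
    if not values:
        return "", []

    # The first line always belongs to the organism name.
    name_parts = [values[0]]
    idx = 1

    # Continuation lines without a semicolon are assumed to be the
    # wrapped remainder of a long organism name.
    while idx < len(values) and ";" not in values[idx]:
        name_parts.append(values[idx])
        idx += 1

    # Everything from the first semicolon-bearing line onward is lineage.
    lineage_text = " ".join(values[idx:]).rstrip(".")
    lineage = [tok.strip() for tok in lineage_text.split(";") if tok.strip()]
    return " ".join(name_parts), lineage
```

Note the limitation the quoted text itself points out: a lineage consisting
of a single token with no semicolon cannot be distinguished from a wrapped
name, so this sketch would fold such a line into the organism name.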


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


