[Biojava-l] Re: Biojava-l digest, Vol 1 #334 - 2 msgs

Mon, 11 Jun 2001 17:57:30 +0100

On Mon, Jun 11, 2001 at 06:32:08PM +0200, Sarath wrote:
>   Hello everyone
>     I have been using the biojava package for around a month and today
> surprisingly i have met with a strange circumstance of simple program not
> able to compute the gc content from a file in the gene bank format.I would
> be very glad if some body can tell me the bug in the program to find the
> gc content from the file (AE000783.gbk)  from the url
>   ftp://ncbi.nlm.nih.gov/genbank/genomes/Bacteria/Borrelia_burgdorferi/  
>     I am pasting the source code here i hope this is not an inconvenience
> to some people.The code is exactly the same as that given in the tutorials
> .Probably there is a bug in biojava ...who knows   or should i blame the
> genbank people???? 

Bug?  No, never... ;)

I've taken a quick look at this.  The problem is with
the VERSION line of the file.  Most Genbank entries look
like:

  VERSION     AE000784.1  GI:2690041

The BioJava parser explicitly expects to see two tokens
following the VERSION keyword.

The file you are trying to read has a line like:

  VERSION     AE000783

The only format documentation I can find for Genbank is at:

  ftp://ncbi.nlm.nih.gov/genbank/docs/

This concentrates on the feature tables, and doesn't seem
to give a normative description of the headers.  However,
the example given includes the two-token VERSION string,
and this seems to be found in the vast majority of entries.

That said, we probably ought to take the `strict in what you
produce, tolerant in what you accept' approach.  I'll leave
the final decision to people who use the Genbank parser more
regularly, though (Greg, are you listening?)
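
For what it's worth, a tolerant reading of the VERSION line might
look something like the sketch below.  This is NOT the actual BioJava
parser code: the class VersionLine and its parse() method are
hypothetical names made up for illustration, and the only assumption
about the format is what the two example lines above show (an
accession token, optionally followed by a GI token).

```java
import java.util.StringTokenizer;

// Hypothetical helper, not part of BioJava: holds the tokens of one VERSION line.
final class VersionLine {
    final String accession; // e.g. "AE000784.1", or a bare "AE000783"
    final String gi;        // e.g. "GI:2690041", or null if the line omits it

    VersionLine(String accession, String gi) {
        this.accession = accession;
        this.gi = gi;
    }

    // Accepts both the usual two-token form and the one-token form.
    static VersionLine parse(String line) {
        StringTokenizer st = new StringTokenizer(line);
        if (!st.hasMoreTokens() || !st.nextToken().equals("VERSION")) {
            throw new IllegalArgumentException("Not a VERSION line: " + line);
        }
        if (!st.hasMoreTokens()) {
            throw new IllegalArgumentException("VERSION line has no accession: " + line);
        }
        String accession = st.nextToken();
        // Tolerant step: the second token is optional rather than required.
        String gi = st.hasMoreTokens() ? st.nextToken() : null;
        return new VersionLine(accession, gi);
    }

    public static void main(String[] args) {
        String[] lines = {
            "VERSION     AE000784.1  GI:2690041",
            "VERSION     AE000783"
        };
        for (String line : lines) {
            VersionLine v = VersionLine.parse(line);
            System.out.println(v.accession + " " + v.gi);
        }
    }
}
```

The point is only that the second token is treated as optional, so
the entry with the short VERSION line would still load; whatever
writes Genbank output could carry on emitting both tokens when it
has them.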

Is there a normative specification for the Genbank header
lines anywhere?  If so, maybe it is worth complaining about
that entry...

   Thomas.