[Biopython-dev] GenBank parser -- first go

Mon Dec 11 01:09:09 EST 2000

Brad,

> Here's a justification for this.  It's already common practice with
> GenBank files to have subitems indented under the major item.  For
> example,
>
> SOURCE      thale cress.
>   ORGANISM  Arabidopsis thaliana

I've come across a few caveats with indenting. Apart from the feature
table, there used to be only one level of subitem. The new PUBMED tag
breaks this paradigm:

REFERENCE   1  (bases 1 to 675)
  AUTHORS   Sant,V.J., Sainani,M.N., Sami-Subbu,R., Ranjekar,P.K. and
            Gupta,V.S.
  TITLE     Ty1-copia retrotransposon-like elements in chickpea genome: their
            identification, distribution and use for diversity analysis
  JOURNAL   Gene 257 (1), 157-166 (2000)
   PUBMED   11054578

PUBMED is indented three spaces instead of two...

Brad, this means your indent_space definition will break (or pick
up unnecessary stuff).
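
One way around a fixed indent width is to measure the leading spaces
and compare them against the column where the data starts. This is only
a rough sketch, not what the parser does: the function name
classify_header_line and the data_column=12 default are my own
assumptions, taken from the REFERENCE example above, where the data
always begins after 12 characters.

```python
def classify_header_line(line, data_column=12):
    """Classify a GenBank header line by how far it is indented.

    Hypothetical helper: data_column=12 is an assumption based on the
    example entry, where the data starts after 12 characters.
    Returns a (kind, keyword, data) tuple.
    """
    stripped = line.lstrip(" ")
    indent = len(line) - len(stripped)

    # Lines indented all the way to the data column carry no keyword.
    if indent >= data_column:
        return ("continuation", None, stripped.rstrip())

    # Anything else starts with a keyword; any indent short of the data
    # column (two spaces, three spaces, ...) marks a subitem.
    keyword, _, data = stripped.partition(" ")
    kind = "keyword" if indent == 0 else "subkeyword"
    return (kind, keyword, data.strip())
```

With this, AUTHORS (two spaces) and PUBMED (three spaces) both come out
as subitems of REFERENCE, and the exact width stops mattering.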

Also, it's not safe to assume that the initial indent is two spaces.
In some of the larger entries, like LMFLCHR12, which is about 2000000 bp
long, the seven-digit positions in the ORIGIN section leave a
one-character indent instead of the normal two-character minimum.

ORIGIN
       1  TCAGTTTGTG CGGGGTGTGC ATATGCATGT GCATGCATAC ATGCACATAC ACATATATAC
...
 2287441  GCGTCACGTG GCGACGTCGA GGCCCGCAGC TTCTATTTTT TTT
//

I don't think this will break anything in the parser, but it is
something to keep in mind if you become more strict...
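
If the ORIGIN lines ever do get stricter handling, splitting on
whitespace avoids depending on the indent at all. A minimal sketch
(parse_origin_line is a made-up name, and it assumes each line is a
position followed by blocks of bases, as in the example above):

```python
def parse_origin_line(line):
    """Split one ORIGIN line into (position, sequence).

    Hypothetical helper: it ignores the width of the leading indent, so
    a seven-digit position with a one-space indent is handled the same
    way as a one-digit position with a wider indent.
    """
    parts = line.split()
    position = int(parts[0])        # first token is the base position
    sequence = "".join(parts[1:])   # remaining tokens are blocks of bases
    return position, sequence
```

Called on the last line of the LMFLCHR12 example, this gives position
2287441 and the bases joined into one string.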

Cheers,
Edwin.
-------------------------------------------------------------------------------
Edwin Steele
QA Manager, eBioinformatics.             http://www.ebioinformatics.com
email: edwin.steele at eBioinformatics.com  Bay 16/104, Australian Technology Park
ph: +61 (2) 9209-4765                    Eveleigh 1430, NSW, Australia.