[BioRuby] Genbank file parsing question

Tue Sep 18 13:22:13 UTC 2012

Hi,

On Mon, 17 Sep 2012 14:46:21 -0400
Josh Earl <joshearl1 at hotmail.com> wrote:

> Hey Nick,
> Wow, that was incredibly helpful, thanks.  One of the reasons I was confused about with the Bio::FlatFile.new method is the 
> http://thebird.nl/bioruby/Tutorial.rd.html is a bit confusing in that regard (always uses the .new method instead of .open for Bio::FlatFile).  Is it the correct usage on the tutorial, or was I just interpreting that incorrectly?

The usage in the tutorial is right. As you can see, it only
teaches Bio::FlatFile.new, but this does not mean there are no
other methods. Indeed, I think many useful methods, classes,
modules, and usages of them are not yet described in the tutorial.
Thanks for giving us an idea to improve the tutorial.

> Is there a reason to go specifically in hard-coding the locations instead of just splitting it into an array? 

Because the positions are officially defined by NCBI.
See section 3.4.4 in the NCBI GenBank Release Note.
ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt
(current version: Release 191.0)

It says:
>> Positions  Contents
>> ---------  --------
>> 01-05      'LOCUS'
>> 06-12      spaces
>> 13-28      Locus name
>> 29-29      space
>> 30-40      Length of sequence, right-justified
>> 41-41      space
>> 42-43      bp
>> 44-44      space
>> 45-47      spaces, ss- (single-stranded), ds- (double-stranded), or
>>            ms- (mixed-stranded)
>> 48-53      NA, DNA, RNA, tRNA (transfer RNA), rRNA (ribosomal RNA), 
>>            mRNA (messenger RNA), uRNA (small nuclear RNA).
>>            Left justified.
>> 54-55      space
>> 56-63      'linear' followed by two spaces, or 'circular'
>> 64-64      space
>> 65-67      The division code (see Section 3.3)
>> 68-68      space
>> 69-79      Date, in the form dd-MMM-yyyy (e.g., 15-MAR-1991)
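
As a rough illustration of parsing by fixed positions, here is a minimal
sketch in plain Ruby. It is not the code BioRuby actually uses; the names
LOCUS_FIELDS, build_locus_line and parse_locus_line are hypothetical, and
the sample values are only for illustration. It assumes the line follows
the column definition quoted above.

```ruby
# 1-based, inclusive column positions taken from the table above.
LOCUS_FIELDS = {
  entry_id: [13, 28],
  length:   [30, 40],
  strand:   [45, 47],
  molecule: [48, 53],
  topology: [56, 63],
  division: [65, 67],
  date:     [69, 79]
}.freeze

# Hypothetical helper: builds a 79-column LOCUS line for testing.
def build_locus_line(name, length, strand, molecule, topology, division, date)
  "LOCUS".ljust(12) +
    format("%-16s %11d bp %-3s%-6s  %-8s %-3s %-11s",
           name, length, strand, molecule, topology, division, date)
end

# Hypothetical parser: slices each field out by its column positions.
def parse_locus_line(line)
  raise ArgumentError, "not a LOCUS line" unless line.start_with?("LOCUS")

  fields = LOCUS_FIELDS.transform_values do |(first, last)|
    line[(first - 1)..(last - 1)].to_s.strip
  end
  fields[:length] = fields[:length].to_i
  fields
end

SAMPLE_LOCUS = build_locus_line("SCU49845", 5028, "", "DNA",
                                "linear", "PLN", "21-JUN-1999")
SAMPLE_SS_RNA = build_locus_line("EXAMPLE1", 100, "ss-", "RNA",
                                 "linear", "VRL", "01-JAN-2000")

p parse_locus_line(SAMPLE_LOCUS)
p parse_locus_line(SAMPLE_SS_RNA)
```

With whitespace splitting, the second sample would give "ss-RNA" as a
single token, because the strand and molecule type columns are adjacent.
Slicing by position keeps them as separate fields.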

> I suppose there are issues with both, since if there are missing values, the locations in the array would be incorrect, and if the locus name is too long that might move the other fields out of position.  

A locus name longer than 16 characters is not officially allowed
in the GenBank format.

It is not easy to support parsing of non-standard GenBank files
that break the above definition, partly because doing so could
conflict with future versions of the NCBI GenBank format.
Only NCBI has the right to change the format definition.
In addition, non-standard means that the format definition is
ambiguous and not fixed. This also makes such data difficult
to parse.

> And thanks for clarifying how to get access to the organism.  It does seem a bit odd to me that the first genbank record would have an "organism" method that wasn't set, even if there is an organism name listed.  For instance 
> gb.first
> refers to a single genbank record, right?  So, what is 
> gb.first.organism
> referring to, if not the organism of that record?  I guess I have similar questions about the .classification and .taxonomy methods (both of which return empty values on the first genbank record).

Each GenBank entry provided by NCBI has a SOURCE field and an
ORGANISM subkeyword. See sections 3.4.2 and 3.4.10 in the GenBank
Release Note. (ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt)
According to section 3.4.2, SOURCE is a mandatory keyword.
The Bio::GenBank#organism, #source, #common_name, #taxonomy and
#classification methods get their contents from the SOURCE and
ORGANISM fields, not from the "source" feature in the feature table.
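
To show the difference between the two places, here is a minimal sketch
in plain Ruby. It is not BioRuby's implementation; source_header is a
hypothetical helper and both sample records are made up for illustration.
It reads only the SOURCE keyword and its ORGANISM subkeyword in the
header and ignores the "source" feature in the feature table.

```ruby
# Made-up sample with a SOURCE/ORGANISM header and a "source" feature.
WITH_HEADER = <<EOS
SOURCE      Saccharomyces cerevisiae (baker's yeast)
  ORGANISM  Saccharomyces cerevisiae
            Eukaryota; Fungi; Dikarya; Ascomycota; Saccharomycotina;
            Saccharomycetes; Saccharomycetales; Saccharomycetaceae;
            Saccharomyces.
FEATURES             Location/Qualifiers
     source          1..5028
                     /organism="Saccharomyces cerevisiae"
EOS

# Made-up non-standard sample: organism appears only in the feature table.
FEATURE_ONLY = <<EOS
FEATURES             Location/Qualifiers
     source          1..5028
                     /organism="Saccharomyces cerevisiae"
EOS

# Hypothetical helper: extracts the SOURCE block from the header only.
def source_header(record)
  result = { source: "", organism: "", taxonomy: "" }
  lines = record.lines.map(&:chomp)
  start = lines.index { |l| l.start_with?("SOURCE") }
  return result unless start

  # The block runs until the next line that starts in column 1.
  block = [lines[start]] + lines.drop(start + 1).take_while { |l| l.start_with?(" ") }
  org = block.index { |l| l =~ /\A\s+ORGANISM/ }

  # Field content starts at column 13 (index 12).
  head = block[0...(org || block.size)]
  result[:source] = head.map { |l| l[12..-1].to_s.strip }.join(" ")
  if org
    result[:organism] = block[org][12..-1].to_s.strip
    result[:taxonomy] = block[(org + 1)..-1].map(&:strip).join(" ")
  end
  result
end

p source_header(WITH_HEADER)
p source_header(FEATURE_ONLY)
```

For the second sample all three values come back empty, even though an
organism name is present in the feature table. Empty results like this
are what one would expect from a file that lacks the SOURCE and ORGANISM
lines in its header.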

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org