[BioRuby] Genbank file parsing question

Tue Sep 18 15:19:27 UTC 2012

Thanks!  This was all great information, especially
ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt, which I had never tracked down before.  This will help a lot with any issues I run into with non-standard GenBank formats.
My confusion with the tutorial is this:

Bio::FlatFile.new(Bio::GenBank, ARGF)

works when it is part of a script and you pass the script a path/filename, but

Bio::FlatFile.new(Bio::GenBank, "path/filename")

doesn't work.

I looked at the code for Bio::FlatFile.new:

  new(dbclass, stream)
  Same as ::open, except that ‘stream’ should be a opened stream object (IO, File, …, who have the ‘gets’ method).

So why is ARGF (which would just be a string passed to the script) working, if it should be a stream?  Shouldn't I have to open the file?  For instance, this works:

Bio::FlatFile.new(Bio::GenBank, File.open("path/filename"))

Is there some ruby magic going on?
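
One way to probe this without BioRuby: in plain Ruby, ARGF is not a String but a built-in stream-like object that reads from the files named in ARGV (or from standard input when none are given), so it already has a public 'gets' method.  The sketch below only checks for that method.  stream_like? is a hypothetical helper name, not part of BioRuby, and whether Bio::FlatFile.new tests for 'gets' in exactly this way is an assumption on my part.

```ruby
require "stringio"

# Hypothetical helper (not part of BioRuby): true when the object exposes
# a public `gets`, which is what the FlatFile.new docs ask of a stream.
def stream_like?(obj)
  obj.respond_to?(:gets)
end

# ARGF is a stream over the files named on the command line.
puts stream_like?(ARGF)                              # true
# An in-memory stream also qualifies.
puts stream_like?(StringIO.new("LOCUS ...\n"))       # true
# So does an opened File (File::NULL is used only so the sketch runs anywhere).
puts File.open(File::NULL) { |f| stream_like?(f) }   # true
# A bare String is only a file name; it has no public `gets`.
puts stream_like?("path/filename")                   # false
```

If that reading is right, passing a path string needs either File.open first or the ::open variant mentioned in the documentation, while ARGF can be handed over as it is.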
~josh

Center for Genomic Sciences
(412)-359-8341

> Date: Tue, 18 Sep 2012 22:22:13 +0900
> From: ngoto at gen-info.osaka-u.ac.jp
> To: bioruby at lists.open-bio.org
> Subject: Re: [BioRuby] Genbank file parsing question
> 
> Hi,
> 
> On Mon, 17 Sep 2012 14:46:21 -0400
> Josh Earl <joshearl1 at hotmail.com> wrote:
> 
> > Hey Nick,
> > Wow, that was incredibly helpful, thanks.  One reason I was confused about the Bio::FlatFile.new method is that the tutorial at
> > http://thebird.nl/bioruby/Tutorial.rd.html is a bit confusing in that regard (it always uses the .new method instead of .open for Bio::FlatFile).  Is the usage in the tutorial correct, or was I just interpreting it incorrectly?
> 
> The usage in the tutorial is right. As you can see, it only
> teaches Bio::FlatFile.new, but this does not mean there are no
> other methods. Indeed, I think many useful methods, classes,
> modules, and usages of them are not yet described in the tutorial.
> Thanks for giving us an idea to improve the tutorial.
> 
> > Is there a specific reason to hard-code the locations instead of just splitting the line into an array?
> 
> Because the positions are officially defined by NCBI.
> See section 3.4.4 in the NCBI GenBank Release Note.
> ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt
> (current version: Release 191.0)
> 
> It says:
> >> Positions  Contents
> >> ---------  --------
> >> 01-05      'LOCUS'
> >> 06-12      spaces
> >> 13-28      Locus name
> >> 29-29      space
> >> 30-40      Length of sequence, right-justified
> >> 41-41      space
> >> 42-43      bp
> >> 44-44      space
> >> 45-47      spaces, ss- (single-stranded), ds- (double-stranded), or
> >>            ms- (mixed-stranded)
> >> 48-53      NA, DNA, RNA, tRNA (transfer RNA), rRNA (ribosomal RNA), 
> >>            mRNA (messenger RNA), uRNA (small nuclear RNA).
> >>            Left justified.
> >> 54-55      space
> >> 56-63      'linear' followed by two spaces, or 'circular'
> >> 64-64      space
> >> 65-67      The division code (see Section 3.3)
> >> 68-68      space
> >> 69-79      Date, in the form dd-MMM-yyyy (e.g., 15-MAR-1991)
> 
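
To make the column table above concrete, here is a minimal plain-Ruby sketch of position-based parsing.  It is not BioRuby's actual implementation: locus_columns, parse_locus_line and sample_locus_line are hypothetical names, and the sample values are made up.  The sample line is built with format so that each field lands in the column range listed in the table.

```ruby
# 1-based, inclusive column ranges copied from the table above.
def locus_columns
  {
    locus_name: [13, 28],
    length:     [30, 40],
    strand:     [45, 47],
    molecule:   [48, 53],
    topology:   [56, 63],
    division:   [65, 67],
    date:       [69, 79]
  }
end

# Hypothetical helper: slice each field by position and strip the padding.
def parse_locus_line(line)
  fields = locus_columns.transform_values do |(first, last)|
    line[(first - 1)..(last - 1)].to_s.strip
  end
  fields[:length] = fields[:length].to_i
  fields
end

# Build a sample line whose fields sit in the documented columns.
def sample_locus_line(name, length, strand, molecule, topology, division, date)
  format("%-12s%-16s %11d bp %-3s%-6s  %-8s %-3s %-11s",
         "LOCUS", name, length, strand, molecule, topology, division, date)
end

# Made-up values, with the strand field left blank.
line = sample_locus_line("EXAMPLE01", 5028, "", "DNA", "linear", "PLN", "21-JUN-1999")
p parse_locus_line(line)
p line.split
```

With the strand field blank, column slicing still returns an empty string for it, whereas line.split simply has one token fewer; with a strand such as "ss-" present, splitting would instead give a fused token like "ss-RNA".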
> 
> > I suppose there are issues with both, since if there are missing values, the locations in the array would be incorrect, and if the locus name is too long that might move the other fields out of position.  
> 
> A locus name longer than 16 characters is not officially allowed
> in the GenBank format.
> 
> It is not so easy to allow parsing of non-standard GenBank format
> that breaks the above definition, partly to avoid potential
> conflicts with future versions of the NCBI GenBank format.
> Only NCBI has the right to change the format definition.
> In addition, non-standard means that the format definition is
> ambiguous and not fixed. This also makes such data difficult
> to parse.
> 
> > And thanks for clarifying how to get access to the organism.  It does seem a bit odd to me that the first genbank record would have an "organism" method that wasn't set, even if there is an organism name listed.  For instance 
> > gb.first
> > refers to a single genbank record, right?  So, what is 
> > gb.first.organism
> > referring to, if not the organism of that record?  I guess I have similar questions about the .classification and .taxonomy methods (both of which return empty values on the first genbank record).
> 
> Each GenBank entry provided by NCBI has a SOURCE field and an ORGANISM
> subkeyword. See sections 3.4.2 and 3.4.10 in the GenBank Release
> Note. (ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt)
> According to section 3.4.2, SOURCE is a mandatory keyword.
> The Bio::GenBank#organism, source, common_name, taxonomy and
> classification methods get their contents from the SOURCE and
> ORGANISM, not from the "source" feature in the feature table.
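
To illustrate the distinction between the header and the feature table, here is a plain-Ruby sketch.  It does not reproduce BioRuby's parser: sample_record, header_organism and feature_organism are hypothetical helpers, and the abbreviated record is made up, with the two places deliberately disagreeing so that the origin of each value is visible.

```ruby
# A made-up, abbreviated record (not real GenBank data).
def sample_record
  <<~GENBANK
    SOURCE      Example organism (header)
      ORGANISM  Example organism
                Eukaryota; Examplia.
    FEATURES             Location/Qualifiers
         source          1..100
                         /organism="Different name in feature table"
  GENBANK
end

# Hypothetical helper: read the ORGANISM subkeyword from the header only,
# i.e. from the lines that come before the FEATURES keyword.
def header_organism(text)
  header = text.lines.take_while { |l| !l.start_with?("FEATURES") }
  line = header.find { |l| l =~ /\A\s+ORGANISM\s+/ }
  line && line.sub(/\A\s+ORGANISM\s+/, "").strip
end

# Hypothetical helper: read the /organism qualifier from the feature table.
def feature_organism(text)
  text[/\/organism="([^"]*)"/, 1]
end

puts header_organism(sample_record)   # Example organism
puts feature_organism(sample_record)  # Different name in feature table
```

Under this reading, a file that carries the name only in the /organism qualifier and omits the ORGANISM subkeyword would give nil from header_organism here, which matches the empty values described above.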
> 
> 
> Naohisa Goto
> ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org
> _______________________________________________
> BioRuby Project - http://www.bioruby.org/
> BioRuby mailing list
> BioRuby at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioruby