[BioRuby] FlatFile GFF

Naohisa GOTO ngoto at gen-info.osaka-u.ac.jp
Thu Apr 1 13:41:27 UTC 2010


On Thu, 1 Apr 2010 11:33:27 +1100
Ben Woodcroft <donttrustben at gmail.com> wrote:

> Hi,
> I have a conceptual question for the list. When I open a gff2 file using
> Bio::FlatFile, the next_entry method gives me all of the lines at once (in
> the form of a Bio::GFF::GFF2 object).
> f = Bio::FlatFile.open(Bio::GFF::GFF2,"some.gff2") => Bio::FlatFile
> g = f.next_entry => Bio::GFF::GFF2 object
> g.records => array of GFF2 records
> To me, this seems a little counter-intuitive. I expected to get info for a
> single line of the GFF file from FlatFile#next_entry

The design of the Bio::GFF classes was decided by their original
authors. I don't know much about their reasoning, but I suppose that
because GFF can have header lines, sequences in Fasta format, and
relations spanning two or more lines, they may have thought it
easier to gather all the information in a file into a single object.

Because Bio::FlatFile supports many file formats, format-specific
details are sometimes omitted or "normalized".

> The other problem is that the whole file must be parsed at the beginning,
> and this can cause memory problems when using large GFF files (e.g. the
> current WormBase gff2 is 2.6GB).

To overcome the problem, the Bio::GFF classes may need to be reorganized.
Bio::FlatFile is only a controller with an input buffer, and format-specific
things should be implemented in the format parser and splitter classes.

Currently, as a workaround, use Bio::GFF::GFF2::Record directly without
using Bio::FlatFile.
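
For illustration, the workaround could look roughly like the sketch
below. The helper name each_gff2_record and its optional parser argument
are hypothetical, not part of BioRuby. It assumes one record per line and
that blank and "#" lines can be skipped, which loses header information.

```ruby
# Yield one parsed record per data line, without loading the whole file.
# NOTE: each_gff2_record is a hypothetical helper, not a BioRuby method.
def each_gff2_record(path, parser = nil)
  return enum_for(:each_gff2_record, path, parser) unless block_given?

  # Load BioRuby only when the default parser is actually needed.
  parser ||= begin
    require 'bio'
    ->(line) { Bio::GFF::GFF2::Record.new(line) }
  end

  File.foreach(path) do |line|
    line = line.chomp
    # Assumption: blank lines and "#" comment/header lines are skipped.
    next if line.empty? || line.start_with?('#')
    yield parser.call(line)
  end
end

# Usage (needs the bio gem):
#   each_gff2_record("some.gff2") { |rec| puts rec.feature }
```

Only one line is held in memory at a time, so the size of the file
should not matter much.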

> To get around the problem I can use File.foreach('some.gff2') and then parse
> each line using Bio::GFF::GFF2. I'm not sure what the situation is with
> other file formats.
> So, my question is, could we introduce a foreach method into FlatFile that
> iterates (without parsing all at once so it is light on memory) over the
> GFF/etc entries in the file? Ideally we could change next_entry, but that
> wouldn't be backwards compatible I don't think.

I'm against this, because it is basically not a Bio::FlatFile issue
but a Bio::GFF design problem, and modifying only Bio::FlatFile
does not solve it.

In addition, the method name would be confusing, because we already have
Bio::FlatFile.foreach and Bio::FlatFile#each.
http://bioruby.org/rdoc/classes/Bio/FlatFile.html#M002156 (foreach)
http://bioruby.org/rdoc/classes/Bio/FlatFile.html#M002168 (each)

I'm thinking of implementing another GFF parser frontend class that
can be specified as a file format.

   ff = Bio::FlatFile.open(Bio::GFF::AltParser, "xxx.gff")

Alternatively, optional parameters could be introduced to Bio::FlatFile
that change the parameters passed to the parser and splitter
classes for the format.

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org
