[BioRuby] FlatFile GFF

Ben Woodcroft donttrustben at gmail.com
Thu Apr 1 00:33:27 UTC 2010


Hi,

I have a conceptual question for the list. When I open a gff2 file using
Bio::FlatFile, the next_entry method gives me all of the lines at once (in
the form of a Bio::GFF::GFF2 object).

f = Bio::FlatFile.open(Bio::GFF::GFF2, "some.gff2")  # => Bio::FlatFile
g = f.next_entry                                     # => Bio::GFF::GFF2 object
g.records                                            # => array of GFF2 records

To me, this seems a little counter-intuitive. I expected FlatFile#next_entry
to return the information for a single line of the GFF file.

The other problem is that the whole file must be parsed up front, which can
cause memory problems with large GFF files (e.g. the current WormBase GFF2
file is 2.6 GB).

To get around the problem I can use File.foreach('some.gff2') and then parse
each line using Bio::GFF::GFF2. I'm not sure what the situation is with
other file formats.
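
A rough sketch of that workaround, so it is clear what I mean. To keep it
self-contained it does not load BioRuby: GFF2Line and each_gff2_line are
hypothetical names made up for this example, and the tab-splitting is only
a stand-in for handing each line to the BioRuby GFF2 record parser.

```ruby
# Hypothetical stand-in for one parsed GFF2 line (not a BioRuby class).
GFF2Line = Struct.new(:seqname, :source, :feature, :start, :stop,
                      :score, :strand, :frame, :attributes)

# Stream a GFF2 file line by line, yielding one record per feature line.
# Only the current line is held in memory at any time.
def each_gff2_line(path)
  return enum_for(:each_gff2_line, path) unless block_given?

  File.foreach(path) do |line|
    line = line.chomp
    next if line.empty? || line.start_with?('#') # skip blanks and comments

    fields = line.split("\t", 9)
    next if fields.size < 8                      # not a feature line

    fields[3] = Integer(fields[3], 10)           # start coordinate
    fields[4] = Integer(fields[4], 10)           # end coordinate
    yield GFF2Line.new(*fields)                  # attributes is nil if absent
  end
end
```

Used as each_gff2_line('some.gff2') { |rec| ... }, this reads one line at
a time instead of building the whole array of records first.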

So, my question is: could we introduce a foreach method into FlatFile that
iterates over the GFF/etc. entries in the file one at a time, without
parsing them all at once, so that it is light on memory? Ideally we could
change next_entry, but I don't think that would be backwards compatible.
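
To make the proposal concrete, here is a minimal sketch of the shape such a
method might take. Nothing here exists in Bio::FlatFile: LineWise and its
foreach are hypothetical, and the parser argument is an assumption standing
in for whatever per-line parser a given format would supply.

```ruby
# Hypothetical helper, not part of BioRuby.
module LineWise
  # Yield one parsed entry per data line of the file at +path+.
  # +parser+ is any object responding to #call that turns a line into an entry.
  # Without a block, return an Enumerator so callers can stop early.
  def self.foreach(path, parser)
    return enum_for(:foreach, path, parser) unless block_given?

    File.foreach(path) do |line|
      next if line.strip.empty? || line.start_with?('#') # skip blanks, comments
      yield parser.call(line.chomp)
    end
  end
end
```

Because entries are produced on demand, something like
LineWise.foreach('some.gff2', parser).first(10) would stop reading after the
tenth entry, and next_entry could stay as it is for existing code.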

Thanks,
ben

-- 
FYI: My email addresses at unimelb, uq and gmail all redirect to the same
place.


