[BioRuby] GFF3

Wed Aug 18 06:12:11 UTC 2010

On Wed, Aug 18, 2010 at 10:09:06AM +0900, Tomoaki NISHIYAMA wrote:
>> Thanks for the nice example. It shows how you can filter GFF without
>> storing everything in memory. Naturally that does not work for
>> extracting all transcripts as GFF does not guarantee ordered data.
>
> I think the code is not dependent on the order of the GFF file.

Sorry, I was not talking about your script. I merely stated your
example shows *how* it is possible to filter data. My sentence was
ambiguous.

> I've never seen an unordered GFF file, but there could be different
> orders.
> 1. The lines are just sorted according to the location.
> 2. genes are ordered and the parts of the gene comes together.
> For example the arabidopsis GFF file looks like this and you can see
> that the feature itself is not ordered that protein 3760 comes earlier
> than exon 3631.

Thanks for that. In that case I can store the seek position of every
gene/location and use disk access instead. Given the way GFF is
normally organized, that would hardly incur a penalty.

I do the same with my BigBio FASTA reader.
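
To make the seek-position idea concrete, here is a minimal sketch in
plain Ruby. It is not BioRuby or BigBio code; the method names
build_gff_seek_index and fetch_gff_lines are hypothetical, and the
attribute parsing is deliberately simplified (a line is filed under its
first Parent, or under its own ID when it has no Parent). Only byte
offsets are held in memory; the records stay on disk.

```ruby
require 'stringio'

# Scan the GFF once and remember where each line starts.
# Returns a Hash of key => [byte offsets], where key is the line's
# Parent attribute, or its ID when no Parent is present (assumption).
def build_gff_seek_index(io)
  index = Hash.new { |h, k| h[k] = [] }
  loop do
    pos = io.pos                 # offset of the line about to be read
    line = io.gets
    break if line.nil?
    next if line.start_with?('#') || line.strip.empty?
    attributes = line.chomp.split("\t")[8]
    next if attributes.nil?
    key = attributes[/(?:\A|;)Parent=([^;,]+)/, 1] ||
          attributes[/(?:\A|;)ID=([^;]+)/, 1]
    index[key] << pos if key
  end
  index
end

# Re-read only the lines filed under one key, using seek.
def fetch_gff_lines(io, index, key)
  index.fetch(key, []).map do |pos|
    io.seek(pos)
    io.gets.chomp
  end
end

# Usage with in-memory data; a File opened with File.open works the
# same way. The two exons of mRNA1 are deliberately not adjacent.
gff = StringIO.new(<<~GFF)
  ##gff-version 3
  chr1\tsrc\tgene\t1\t900\t.\t+\t.\tID=gene1
  chr1\tsrc\tmRNA\t1\t900\t.\t+\t.\tID=mRNA1;Parent=gene1
  chr1\tsrc\texon\t1\t100\t.\t+\t.\tID=exon1;Parent=mRNA1
  chr1\tsrc\tgene\t2000\t3000\t.\t+\t.\tID=gene2
  chr1\tsrc\texon\t500\t900\t.\t+\t.\tID=exon2;Parent=mRNA1
GFF

index = build_gff_seek_index(gff)
puts fetch_gff_lines(gff, index, 'mRNA1')
```

When the lines of a gene sit close together, as they usually do, the
seeks land in neighbouring disk blocks, which is why the penalty should
be small.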

I want to get away from loading everything in memory. We cannot
assume that memory growth keeps up with data volume. Loading everything
is fine as an 'optimization', but we should not take it for granted.

> I think GFF is an exchange format rather than to work directly with
> part of it.  The data can be relatively easily stored into a RDB and
> extracted from it.  Index on RDB will allow a fast identification of
> all feature in a  specified region or a gene. That subset is good to
> work with.

I avoid RDB (assuming you mean RDBMS, and not the Rwanda Development
Board) until BioRuby comes with an RDBMS that can be used in a
transparent fashion. You cannot assume every user has an RDBMS readily
available.

Pj.