[BioRuby] GFF3

Pjotr Prins pjotr.public14 at thebird.nl
Thu Aug 12 14:30:12 UTC 2010


I intend to use GFF3 and document its use.

In my gff3 github branch (see http://github.com/pjotrp/bioruby/tree/gff3) I
have just added a first example for fetching sequence data from GFF3. First I
took an example from Lincoln Stein (in his BioPerl repository) and stuck that
in ./test/data/gff/test.gff3. This data contains empty lines - so I modified
the GFF3 parser to ignore those.

Before I continue, I also wonder about the wisdom of including a
Bio::FastaFormat record *inside* a Bio::Sequence record. This duplicates the
@definition with @entry_id. Not only that, the sequence contains white space,
which does not match GFF's positioning data:

#<Bio::Sequence:0xb7c2b354 @entry_id="test01",
@source_data=#<Bio::FastaFormat:0xb7c31574 @entry_overrun=nil,
@data="\nACGAAGATTTGTATGACTGATTTATCCTGGACAGGCATTGGTCAGATGTCTCCTTCCGTATCGTCGTTTA\nGTTGCAAATCCGAGTGTTCGGGGGTATTGCTATTTGCCACCTAGAAGCGCAACATGCCCAGCTTCACACA\nCCATAGCGAACACGCCGCCCCGGTGGCGACTATCGGTCGAAGTTAAGACAATTCATGGGCGAAACGAGAT\nAATGGGTACTGCACCCCTCGTCCTGTAGAGACGTCACAGCCAACGTGCCTTCTTATCTTGATACATTAGT\nGCCCAAGAATGCGATCCCAGAAGTCTTGGTTCTAAAGTCGTCGGAAAGATTTGAGGAACTGCCATACAGC\nCCGTGGGTGAAACTGTCGACATCCATTGTGCGAATAGGCCTGCTAGTGAC\n\n",
@definition="test01">>
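
As a rough illustration of the positioning problem, here is a minimal plain-Ruby sketch. It does not use the BioRuby API; the shortened sequence and the subseq helper are made up for the example. It assumes GFF3's 1-based, inclusive coordinates and shows that the same start/end pair gives different results on the raw data and on the whitespace-stripped data:

```ruby
# Shortened, made-up stand-in for the @data string shown above,
# including the embedded newlines.
raw = "\nACGAAGATTT\nGTATGACTGA\n\n"

# Strip all whitespace so string offsets line up with sequence positions.
clean = raw.gsub(/\s+/, '')

# Hypothetical helper: GFF3 coordinates are 1-based and inclusive,
# Ruby string indices are 0-based.
def subseq(seq, start, stop)
  seq[(start - 1)..(stop - 1)]
end

puts subseq(clean, 9, 12)        # prints TTGT
puts subseq(raw, 9, 12).inspect  # prints "TTT\n" (shifted, includes a newline)
```
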

To print FASTA I currently do:

  gff3.sequences.each do | item |
    print item.to_fasta(item.entry_id, 70)
  end

(to_fasta is being deprecated)

To get a FASTA sequence I would like to do the sane thing:

  gff3.sequences.each do | item |
    rec = Bio::FastaFormat.new('> '+item.definition.strip+"\n"+item.data)
    print rec
  end

where item.data is just the clean sequence.
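
Once the sequence is held as a clean string, producing FASTA output needs very little code. The following is only a sketch in plain Ruby, independent of BioRuby; format_fasta and its width parameter are hypothetical names chosen for the example:

```ruby
# Hypothetical helper: build FASTA text from a definition line and a
# whitespace-free sequence string, wrapping the sequence at `width` columns.
def format_fasta(definition, seq, width = 70)
  lines = seq.scan(/.{1,#{width}}/)
  ">#{definition.strip}\n" + lines.join("\n") + "\n"
end

# Short made-up sequence, wrapped at 10 columns to keep the output small.
print format_fasta("test01", "ACGAAGATTTGTATGACTGA", 10)
```

With the arguments above this prints the header line >test01 followed by two sequence lines of ten bases each.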

The current implementation is rather unintuitive. I realise GFF3 contains
FASTA, but there is no reason to store it like that. How about removing the
contained Bio::FastaFormat and just using a sequence string? And removing the
white space by default?

That would also do away with the FASTA formatting (the to_fasta) in GFF3.

I can make the changes, if you agree.

Pj.



