[BioRuby] GFF3

Michael Paulini mh6 at sanger.ac.uk
Thu Aug 12 14:42:23 UTC 2010


  Pjotr,

are you coming over to the GMOD meeting in Cambridge? If we need or want to
make changes to the GFF3 specification, we could discuss them there, as some
people from Lincoln's group will also be there.

... and yes, the inlined FASTA at the end is not a perfect solution.
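
For readers unfamiliar with the layout: in GFF3 the sequence data follows the
feature lines, after a "##FASTA" directive. A minimal sketch of separating the
two parts in plain Ruby, without BioRuby; the helper name split_gff3 and the
sample record are made up for illustration:

```ruby
# Hypothetical helper (not part of BioRuby): split GFF3 text into the
# feature section and the inline FASTA section at the ##FASTA directive.
def split_gff3(text)
  features, fasta = text.split(/^##FASTA[ \t]*\r?\n/, 2)
  # When there is no ##FASTA directive, the FASTA part is an empty string.
  [features.to_s, fasta.to_s]
end

# Made-up sample data, only for illustration.
SAMPLE_GFF3 = <<~GFF
  ##gff-version 3
  test01\t.\tgene\t3\t8\t.\t+\t.\tID=gene01
  ##FASTA
  >test01
  ACGAAGATTT
  GTATGACTGA
GFF

features, fasta = split_gff3(SAMPLE_GFF3)
puts features   # header and feature lines
puts fasta      # FASTA records, line breaks still included
```

Note that the FASTA part still carries its line breaks, which is the point
Pjotr raises below.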

Michael

On 12/08/10 15:30, Pjotr Prins wrote:
> I intend to use GFF3 and document its use.
>
> In my gff3 github branch (see http://github.com/pjotrp/bioruby/tree/gff3) I
> have just added a first example for fetching sequence data from GFF3. First I
> took an example from Lincoln Stein (in his BioPerl repository) and stuck that
> in ./test/data/gff/test.gff3. This data contains empty lines - so I modified
> the GFF3 parser to ignore those.
>
> Before I continue, I also wonder about the wisdom of including a
> Bio::FastaFormat record *inside* a Bio::Sequence record. This duplicates the
> @definition with @entry_id. Not only that, the sequence contains white space,
> which does not match GFF's positioning data:
>
> #<Bio::Sequence:0xb7c2b354 @entry_id="test01",
> @source_data=#<Bio::FastaFormat:0xb7c31574 @entry_overrun=nil,
> @data="\nACGAAGATTTGTATGACTGATTTATCCTGGACAGGCATTGGTCAGATGTCTCCTTCCGTATCGTCGTTTA\nGTTGCAAATCCGAGTGTTCGGGGGTATTGCTATTTGCCACCTAGAAGCGCAACATGCCCAGCTTCACACA\nCCATAGCGAACACGCCGCCCCGGTGGCGACTATCGGTCGAAGTTAAGACAATTCATGGGCGAAACGAGAT\nAATGGGTACTGCACCCCTCGTCCTGTAGAGACGTCACAGCCAACGTGCCTTCTTATCTTGATACATTAGT\nGCCCAAGAATGCGATCCCAGAAGTCTTGGTTCTAAAGTCGTCGGAAAGATTTGAGGAACTGCCATACAGC\nCCGTGGGTGAAACTGTCGACATCCATTGTGCGAATAGGCCTGCTAGTGAC\n\n",
> @definition="test01">>
>
> To print FASTA I currently do:
>
>    gff3.sequences.each do | item |
>      print item.to_fasta(item.entry_id, 70)
>    end
>
> (to_fasta is being deprecated)
>
> To get a FASTA sequence I would rather do something like:
>
>    gff3.sequences.each do | item |
>      rec = Bio::FastaFormat.new('>  '+item.definition.strip+"\n"+item.data)
>      print rec
>    end
>
> where item.data is just the clean sequence.
>
> The current implementation is rather unintuitive. I realise GFF3 contains
> FASTA, but there is no reason to store it like that. How about removing the
> contained Bio::FastaFormat and just use a sequence string? And remove the white
> space by default?
>
> That would also do away with FASTA formatting in GFF3, i.e. the to_fasta
> method.
>
> I can make the changes, if you agree.
>
> Pj.
>
> _______________________________________________
> BioRuby Project - http://www.bioruby.org/
> BioRuby mailing list
> BioRuby at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioruby
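
To illustrate the whitespace point in the quoted message: GFF3 coordinates are
1-based and inclusive, so they only line up with the sequence once the line
breaks are removed. A rough sketch in plain Ruby, using the first two lines of
the test01 data shown above; the helper names clean_sequence, subsequence and
to_fasta_text are hypothetical and not BioRuby API:

```ruby
# First two lines of the test01 record, stored the way @data holds them:
# the line breaks are part of the string.
RAW_DATA = "\nACGAAGATTTGTATGACTGATTTATCCTGGACAGGCATTGGTCAGATGTCTCCTTCCGTATCGTCGTTTA" \
           "\nGTTGCAAATCCGAGTGTTCGGGGGTATTGCTATTTGCCACCTAGAAGCGCAACATGCCCAGCTTCACACA\n\n"

# Hypothetical helper: remove all whitespace so that string positions
# correspond to sequence positions.
def clean_sequence(raw)
  raw.gsub(/\s+/, "")
end

# Hypothetical helper: GFF3 start/end are 1-based and inclusive,
# Ruby string indices are 0-based.
def subsequence(seq, start, stop)
  seq[(start - 1)..(stop - 1)]
end

# Hypothetical helper: format a clean sequence as FASTA text,
# wrapped at `width` residues per line.
def to_fasta_text(id, seq, width = 70)
  ">#{id}\n" + seq.scan(/.{1,#{width}}/).join("\n") + "\n"
end

seq = clean_sequence(RAW_DATA)
puts subsequence(seq, 69, 72)              # spans the original line break
puts subsequence(RAW_DATA, 69, 72).inspect # raw data: shifted, contains "\n"
print to_fasta_text("test01", seq)
```

With the sequence kept as a clean string, FASTA line wrapping becomes purely an
output concern, which is what the proposal above amounts to.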





