[BioRuby] Proposal: Bio::FastaFormat#each_entry

Naohisa GOTO ngoto at gen-info.osaka-u.ac.jp
Fri Jan 29 10:25:29 UTC 2010


Hi,

On Fri, 29 Jan 2010 15:46:15 +0900
"MISHIMA, Hiroyuki" <missy at be.to> wrote:

> Hi all,
> 
> How about implementing the following methods?
> 
> 	Bio::FastaFormat#each_entry
> 	Bio::FastaNumericFormat#each_entry
> 
> The following is sample code to generate a FASTQ string from a FASTA 
> string and a FASTA.QUAL string. This sample may need ruby 1.8.7 or later.
>
> I am afraid that simpler or easier ways already exist in BioRuby...

I think mixing a single-entry parser with a multiple-entry iterator
would cause confusion, and is not a good approach.

For most parser classes in BioRuby, the expected data source is a
String containing a single entry. For an IO that may contain
multiple entries, Bio::FlatFile is the front-end: it detects the
data type, splits the input into entries, and calls the assigned
parser class for each one.
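
The division of labor described above can be sketched in plain Ruby.
This is only an illustration of the idea, not BioRuby's implementation:
TinyFastaEntry and TinyFlatFile are hypothetical names made up for this
sketch, and the splitting rule (a new entry starts at each line beginning
with ">") is a simplifying assumption.

```ruby
require 'stringio'

# Hypothetical single-entry parser: it receives a String holding exactly
# one entry and knows nothing about files or iteration.
class TinyFastaEntry
  attr_reader :definition, :seq

  def initialize(text)
    header, *body = text.lines.map(&:chomp)
    @definition = header.sub(/\A>/, '')
    @seq = body.join
  end
end

# Hypothetical front-end: it splits an IO into single-entry Strings
# and hands each one to the given parser class.
class TinyFlatFile
  include Enumerable

  def initialize(io, parser_class)
    @io = io
    @parser_class = parser_class
  end

  def each
    buffer = nil
    @io.each_line do |line|
      if line.start_with?('>')
        # A new header closes the previous entry, if there was one.
        yield @parser_class.new(buffer) if buffer
        buffer = String.new
      end
      buffer << line if buffer
    end
    # Emit the last entry, which has no following header to close it.
    yield @parser_class.new(buffer) if buffer
  end
end

data = ">id1\nACGT\nTTAA\n>id2\nGGCC\n"
TinyFlatFile.new(StringIO.new(data), TinyFastaEntry).each do |entry|
  puts "#{entry.definition}: #{entry.seq}"
end
```

Keeping iteration in the front-end means the parser class stays a plain
"one String in, one entry out" object, which is the separation argued
for above.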

For a String containing multiple entries, wrapping it in StringIO and
passing it to Bio::FlatFile is the easiest way, although indirect.
Recently, many efficient memory-mapped data transfer methods
have become available, e.g. memcached, IPC shared memory, and the mmap(2)
system call. I'm now thinking about how to treat such data efficiently.

Below is an example using StringIO and Bio::FlatFile.
#------------------------------------------------
  require 'stringio'
  require 'bio'

  # When copying and pasting this script, remove the "> " at the
  # head of each line.
> fasta = <<EOS
> >FXQB1I00000001
> TATGGAATCTGTAGAATCAGTGGTAGGTGCAGCAGATGGAGGAAGG
> >FXQB1I00000002
> CTGGAGAATTCTGGATCCTCGACTTATGACTTGGTGGTTCTGGTAACTGTGAGCTTAGGATAGTCAG
> EOS
> 
> qual = <<EOS
> >FXQB1I00000001
> 30 30 29 42 25 24 5 30 30 30 30 30 28 30 26 9 30 30 30 30 30 42 25 30 30 
> 42 25 29 22 30 29 26 30 30 30 29 30 42 25 30 32 17 40 23 39 24
> >FXQB1I00000002
> 30 30 33 19 28 30 26 9 32 12 30 30 33 20 30 30 32 15 27 27 30 28 28 34 
> 22 27 22 28 28 29 26 9 33 19 22 43 25 33 19 28 27 32 15 30 32 12 28 30 
> 27 30 30 26 27 30 40 23 30 40 23 30 29 29 30 30 30 29 30
> EOS
  
  ff_fasta = Bio::FlatFile.open(StringIO.new(fasta))
  ff_qual = Bio::FlatFile.open(StringIO.new(qual))

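  # Read the FASTA and QUAL entries in step; both inputs are assumed
  # to list the same entries in the same order.
  while entry_fasta = ff_fasta.next_entry
    seq = entry_fasta.to_biosequence
    seq.quality_score_type = :phred
    seq.quality_scores = ff_qual.next_entry.data
    puts seq.output(:fastq, :title => entry_fasta.definition)
  end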
#------------------------------------------------
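
For readers unfamiliar with what the Phred scores become in the output:
in the Sanger variant of FASTQ, each Phred score q is written as the
single ASCII character with code q + 33. The sketch below shows only
that encoding step in plain Ruby. It does not use BioRuby, and the
helper names (phred_to_fastq_quality, fastq_record) are hypothetical,
chosen for this illustration.

```ruby
# Offset used by Sanger-style FASTQ to map Phred scores to ASCII.
PHRED_OFFSET = 33

# Convert an Array of integer Phred scores into a FASTQ quality string.
def phred_to_fastq_quality(scores)
  scores.map { |q| (q + PHRED_OFFSET).chr }.join
end

# Build one four-line FASTQ record from a title, a sequence String,
# and an Array of Phred scores (one score per base).
def fastq_record(title, seq, scores)
  unless seq.length == scores.length
    raise ArgumentError, 'sequence and scores differ in length'
  end
  "@#{title}\n#{seq}\n+\n#{phred_to_fastq_quality(scores)}\n"
end

puts fastq_record('example', 'TATG', [30, 30, 29, 42])
```

The length check reflects the requirement that a FASTA entry and its
QUAL entry describe the same number of bases.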

> enum_fasta = Bio::FastaFormat.new(fasta).each_entry
> enum_qual = Bio::FastaNumericFormat.new(qual).each_entry
> 
> loop do
>    fastq = Bio::Sequence.adapter(enum_fasta.next,
>                                  Bio::Sequence::Adapter::Fastq)
>    fastq.quality_score_type = :phred
>    fastq.quality_scores = enum_qual.next.data
>    puts fastq.output(:fastq)
> end

Bio::Sequence.adapter is for internal use within the BioRuby library,
and normally should not be called from user scripts. In addition,
using Adapter::Fastq for Bio::FastaFormat data is a mismatch.
In this case, use Bio::FastaFormat#to_biosequence instead.

> 
> -- 
> MISHIMA, Hiroyuki, DDS, Ph.D.
> COE Research Fellow
> Department of Human Genetics
> Nagasaki University Graduate School of Biomedical Sciences

Thanks,

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org



