[Bioperl-l] Bio::SeqIO can't guess the format of data from a pipe

Florent Angly florent.angly at gmail.com
Sat Aug 27 11:12:05 UTC 2011


On the topic of guessing file formats: last I checked, it was difficult 
to reuse the format guessed by Bio::SeqIO.

For example, if I want to take sequences in any format (FASTA, FASTQ, 
...), filter some of them out, and put the rest in a new file in the same 
format, I need to do something along these lines:

     # Open the file and let BioPerl guess its format
     my $in = Bio::SeqIO->new( -file => $input_seqfile );

     # Have BioPerl guess the format (again) so we can use the same
     # format for the output file
     my $format = $in->_guess_format( $input_seqfile );

     # Open the output file (same format as the input file)
     my $out = Bio::SeqIO->new( -file   => ">".$output_seqfile,
                                -format => $format );

     # Now do the work...

The limitation of the code above is that it is more complex than it 
should be and forces BioPerl to check the file format twice. My proposal 
would be to store the format of a file somewhere in the Bio::SeqIO 
object and create a new get/set method in Bio::SeqIO called format() to 
store or access its value. The idea would be that the example code above 
could be rewritten as:

     # Open the file and let BioPerl guess its format
     my $in = Bio::SeqIO->new( -file => $input_seqfile );

     # Retrieve the format guessed by BioPerl
     my $format = $in->format( );

     # Open the output file using the same format as the input file
     my $out = Bio::SeqIO->new( -file   => ">".$output_seqfile,
                                -format => $format );

     # Now do the work...

I think this is more elegant since it is more readable, requires less 
computation (the file format is guessed only once), and is more consistent 
with other Bio::SeqIO methods such as alphabet(), which guesses the 
alphabet but also provides a get/set method to access it.
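
To make the proposal a bit more concrete, here is a minimal sketch of the 
get/set pattern, written as a toy class rather than as a patch to 
Bio::SeqIO. The class name FormatCachingReader, its -string argument, the 
guess_count field and the one-character guesser are all made up for 
illustration; the real guessing logic in BioPerl is more involved. The only 
point is that the guess happens once in the constructor and format() 
returns the stored value afterwards.

```perl
use strict;
use warnings;

# Hypothetical stand-in for Bio::SeqIO (an assumption, not BioPerl code).
package FormatCachingReader;

sub new {
    my ($class, %args) = @_;
    my $self = bless { guess_count => 0 }, $class;

    # An explicit -format wins; otherwise guess once and remember the result.
    my $format = defined $args{'-format'}
        ? $args{'-format'}
        : $self->_guess_format( $args{'-string'} );
    $self->format($format);

    return $self;
}

# Combined getter/setter: stores the value if one is given, always returns it.
sub format {
    my ($self, $value) = @_;
    $self->{format} = lc $value if defined $value;
    return $self->{format};
}

# Toy guesser that looks only at the first character of the data.
sub _guess_format {
    my ($self, $data) = @_;
    $self->{guess_count}++;
    return 'fasta' if $data =~ /^>/;
    return 'fastq' if $data =~ /^@/;
    return 'unknown';
}

package main;

# The input format is guessed once, then reused for the output object.
my $in  = FormatCachingReader->new( -string  => ">seq1\nACGT\n" );
my $out = FormatCachingReader->new( -format => $in->format );
print $out->format, "\n";    # prints "fasta"
```

With this layout the output object never calls the guesser at all, which 
is the saving I have in mind for Bio::SeqIO.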

Florent



On 26/08/11 07:04, Chris Fields wrote:
> On Aug 25, 2011, at 1:52 PM, J.J. Emerson wrote:
>
>> Hi Chris,
>>
>> You asked:
>>
>> My question (not a criticism, just trying to understand the problem): why are you going through all the trouble of using GuessSeqFormat as a permanent solution anyway?  If you have a stream returning a possibly unknown data type, I would argue that the fundamental bug is not GuessSeqFormat but something else, more specifically not knowing the behavior of the data source and the returned format to begin with.  Is something preventing that?
>>
>> In my particular case, I'm trying not to impose a particular usage scenario onto the script I'm writing in the hopes it will be useful (and general) to others in my lab in the future*. In my proximate case, I will certainly be able to provide SeqIO with a format argument. But insofar as GuessSeqFormat is considered desirable (and reasonable people could indeed disagree whether it is desirable) I think its applicability shouldn't hinge on whether it is guessing on a pipe or a file.
>>
>> My point is, GuessSeqFormat is fine as a temporary stop-gap, but it is not a permanent solution to your problems (it is guessing, after all).  Note the code has had very little development over the years, and the related SeqIO code hasn't aged particularly well.
>>
>> I see. I wasn't aware that GuessSeqFormat was so relatively neglected. Given the rather challenging nature of the more elegant fix you suggested (using the buffering of Root::IO), perhaps I should consider dropping my issue or filing it as a feature request rather than a bug?
> That's fine.  I don't want to dissuade you from taking this on, either.
>
>> Cheers,
>>
>> J.J.
>>
>> PS
>>
>> * The way I plan on using my script is roughly as follows:
>>
>> prog1 [some arguments] \
>> | myscript.pl --informat fasta \
>> | prog2 \
>> | prog3 > pipeline.output
>>
>> However, I'd like for the "--informat" switch to be optional, mainly to increase usability for other users. For any well considered format, the information is right there in the data to know what the format is, and as such, providing the format a second time is somewhat redundant. In principle, being able to do the following would be very useful:
>>
>> prog1 [some arguments] \
>> | myscript.pl \
>> | prog2 > pipeline.output
>>
>> The modularity of pipelining is very valuable and this is what caused me to anticipate a usage scenario that involved both GuessSeqFormat and reading from a pipe.
> Not disagreeing with you at all, flexible code is best.
>
> chris