[Bioperl-l] Bio::SeqIO can't guess the format of data from a pipe

Chris Fields cjfields at illinois.edu
Thu Aug 25 16:58:51 UTC 2011


On Aug 24, 2011, at 8:53 PM, J.J. Emerson wrote:

> Hello All,
> 
> I have experienced some behavior in SeqIO that doesn't seem to be what I
> would expect. Basically, for a certain script, if I try to pass something
> like "-fh => \*STDIN" to Bio::SeqIO->new(), it will fail if both of the
> following two conditions are met simultaneously:
> 
>   1. STDIN is coming from a pipe;
>   2. SeqIO is trying to guess the format.
> 
> If STDIN is coming from redirection instead of a pipe or if the format is
> specified manually (i.e. BioPerl doesn't have to guess), the error doesn't
> seem to occur.
> 
> This issue has been reported previously:
> 
> http://lists.open-bio.org/pipermail/bioperl-l/2010-July/033681.html
> https://redmine.open-bio.org/issues/3122

Yes, according to that ticket the issue was addressed.

> This issue is ultimately one of using seek() on a pipe, which is forbidden
> (see below). To be clear, there are kludgy ways around this that allow
> BioPerl to take input from a pipe AND guess the format. My naive and
> inefficient kludge was to test for reading from STDIN and for the absence of
> a format. If both of these conditions are met, then I slurp STDIN into a
> variable and then open a filehandle on that variable, and pass it to SeqIO,
> which can guess the format if the fh isn't opened on a pipe. SeqIO then
> successfully guesses the format and does the SeqIO thing, at the expense of
> having the program pass over the data at least twice. And if the input file
> is huge, it could potentially consume all the memory. A better way to
> address the problem would be to process the input one line at a time, but
> this seems to require more extensive changes.
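
For reference, the slurp workaround described above might look roughly like the following. This is only a sketch using core Perl: `slurp_to_memory_fh` is a hypothetical helper name, and an in-process pipe stands in for piped STDIN. As noted, it holds the whole input in memory.

```perl
use strict;
use warnings;

# Read an entire input handle into memory and return a new, seekable
# handle opened on that in-memory copy.
# slurp_to_memory_fh is a hypothetical helper, not part of BioPerl.
sub slurp_to_memory_fh {
    my ($in) = @_;
    my $data = do { local $/; <$in> };   # slurp everything at once
    $data = '' unless defined $data;     # empty input yields undef
    open(my $mem, '<', \$data) or die "cannot open in-memory handle: $!";
    return $mem;
}

# An in-process pipe stands in for `cat w.fa | script`.
pipe(my $reader, my $writer) or die "pipe failed: $!";
print {$writer} ">seq1\nACGT\n";
close $writer;

my $mem = slurp_to_memory_fh($reader);
my $first = <$mem>;                         # peek, as a format guesser would
seek($mem, 0, 0) or die "seek failed: $!";  # rewinding works on the copy
my @lines = <$mem>;
print "first line: $first";
print "lines after rewind: ", scalar(@lines), "\n";
```

The returned handle could then be given to Bio::SeqIO->new() via -fh in place of \*STDIN.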

Have you tried tempfiles?  It isn't a great solution, but it is very commonly used for large sequence data, and a tempfile is seekable.  This behavior could also be wrapped in GuessSeqFormat, I suppose (but see below).

> The reason I'm reposting this is because I think that the inability to guess
> the sequence format from data originating from a pipe is an important
> limitation for a fundamental part of BioPerl. When designing scripts to be
> used in pipelines, the inability to guess formats for piped data limits
> BioPerl's pipelineability substantially. Even though previous reports of
> this have been made and a bug opened and closed, I was wondering if anyone
> thought this was worthwhile fixing so as to make SeqIO (and probably AlignIO
> as well?) more flexible?
> 
> Does anyone think this should be refiled as a bug?
> 
> Cheers,
> 
> J.J.

The fundamental problem with pipes (as you indicated) is that the data stream is not seekable.  We do have a built-in buffer in Bio::Root::IO that somewhat handles this, but Bio::Tools::GuessSeqFormat is (IIRC) designed to use the filehandle directly, bypassing the BioPerl IO layer completely.  
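
To illustrate the seekability point with core Perl only (no BioPerl involved), here is a small sketch. `is_seekable` is a hypothetical helper, the in-process pipe stands in for `cat w.fa | script`, and the temporary file stands in for `script < w.fa`.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Return 1 if the handle can be repositioned, 0 otherwise.
# is_seekable is a hypothetical helper name.
sub is_seekable {
    my ($fh) = @_;
    # Seeking zero bytes from the current position changes nothing on a
    # regular file but fails on a pipe, socket or FIFO.
    return seek($fh, 0, 1) ? 1 : 0;
}

# A pipe created inside the process behaves like piped STDIN.
pipe(my $reader, my $writer) or die "pipe failed: $!";
print {$writer} ">seq1\nACGT\n";
close $writer;

# An anonymous temporary file behaves like redirected STDIN.
my $tmp = tempfile();
print {$tmp} ">seq1\nACGT\n";

printf "pipe seekable: %s\n", is_seekable($reader) ? "yes" : "no";
printf "file seekable: %s\n", is_seekable($tmp)    ? "yes" : "no";
```

A check like this (or the `-p` file test) is one way a script could decide up front whether guessing directly on the handle is safe.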

One solution is to redesign GuessSeqFormat to use Bio::Root::IO, have GuessSeqFormat push all data back to the buffer, then let SeqIO parse.  That will require some fundamental changes for both Bio::Root::IO and Bio::SeqIO (note that one cannot pass a Bio::Root::IO instance to another Bio::Root::IO-based class for parsing at this time).

The other option is (as hinted above) having GuessSeqFormat dump the data to a tempfile, seek back after guessing, and retain the filehandle for Bio::SeqIO.  Not the best solutions, but either should work.
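
A rough sketch of the tempfile idea, done on the calling side rather than inside GuessSeqFormat. It uses only core modules; `spool_to_tempfile` is a hypothetical helper, and the in-process pipe stands in for piped STDIN. This has not been tested against BioPerl itself.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Copy everything from a (possibly unseekable) handle into an anonymous
# temporary file and return that handle, rewound to the start.
# spool_to_tempfile is a hypothetical helper, not a BioPerl function.
sub spool_to_tempfile {
    my ($in) = @_;
    my $tmp = tempfile();    # scalar context: returns only the handle
    binmode $tmp;
    # Copy in fixed-size chunks so memory use stays bounded.
    while (read($in, my $buf, 65536)) {
        print {$tmp} $buf or die "write to tempfile failed: $!";
    }
    seek($tmp, 0, 0) or die "cannot rewind tempfile: $!";
    return $tmp;
}

# An in-process pipe stands in for `cat w.fa | script`.
pipe(my $reader, my $writer) or die "pipe failed: $!";
print {$writer} ">seq1\nACGT\n";
close $writer;

my $fh = spool_to_tempfile($reader);
my $first = <$fh>;                         # peek, as a format guesser would
seek($fh, 0, 0) or die "seek failed: $!";  # the reset that fails on a pipe
my @lines = <$fh>;
print "first line: $first";
print "lines after rewind: ", scalar(@lines), "\n";
```

In a real script the spooled handle would presumably be passed as -fh to Bio::SeqIO->new() instead of \*STDIN. Unlike slurping, the data lives on disk, at the cost of one extra pass over the input.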

My question (not a criticism, just trying to understand the problem): why are you going through all the trouble of using GuessSeqFormat as a permanent solution anyway?  If you have a stream returning a possibly unknown data type, I would argue that the fundamental bug is not GuessSeqFormat but something else, more specifically not knowing the behavior of the data source and the returned format to begin with.  Is something preventing that?  

My point is, GuessSeqFormat is fine as a temporary stop-gap, but it is not a permanent solution to your problems (it is guessing, after all).  Note the code has had very little development over the years, and the related SeqIO code hasn't aged particularly well.

> PS
> 
> Below are snippets of code and/or errors related to reproducing the failure
> to guess unspecified formats. I'll see how Mailman treats my attachments and
> post the code as a reply if they don't work.
> 
> The bioperl_fhtest.pl attachment is the script that reproduces the error.
> The w.fa is a fasta file containing some sequence.
> 
> Here are the command lines to generate the behavior I observe (w.fa is a
> file containing some fasta sequences, in my case it was the w gene from
> different *Drosophila* species):
> 
> ./bioperl_fhtest.pl fasta < w.fa # Works (redirection, no guessing)
> ./bioperl_fhtest.pl < w.fa # Works (redirection, guessing)
> 
> cat w.fa | ./bioperl_fhtest.pl fasta # Works (pipe, no guessing)
> cat w.fa | ./bioperl_fhtest.pl # DOESN'T work (pipe, guessing)
> 
> Here's the error I get in the last case:
> 
> ------------- EXCEPTION: Bio::Root::Exception -------------
> MSG: Failed resetting the filehandle; IO error occurred
> STACK: Error::throw
> STACK: Bio::Root::Root::throw /usr/local/share/perl/5.10.1/Bio/Root/Root.pm:472
> STACK: Bio::Tools::GuessSeqFormat::guess /usr/local/share/perl/5.10.1/Bio/Tools/GuessSeqFormat.pm:512
> STACK: Bio::SeqIO::new /usr/local/share/perl/5.10.1/Bio/SeqIO.pm:381
> STACK: ./bioperl_fhtest.pl:8
> -----------------------------------------------------------
> 
> From what I gather, the error is triggered by a failure of seek() on a STDIN
> fh on lines 517-518 (text from the version of GuessSeqFormat.pm installed on my
> server):
> 
>    512     if (defined $self->{-file}) {
>    513         # Close the file we opened.
>    514         close($fh);
>    515     } elsif (ref $fh eq 'GLOB') {
>    516         # Try seeking to the start position.
>    517         seek($fh, $start_pos, 0) || $self->throw("Failed resetting the ".
>    518                                         "filehandle; IO error occurred");;
>    519     } elsif (defined $fh && $fh->can('setpos')) {
>    520         # Seek to the start position.
>    521         $fh->setpos($start_pos);
>    522     }
> 
> <bioperl_fhtest.pl><w.fa>

You are always welcome to reopen and update the bug, or file a new one.  

chris




