[Bioperl-l] perl one-liner with Bio::SeqIO

Thu Jul 22 13:27:30 UTC 2010

Would someone like to file this as a bug?  My guess is this may be a combination of using pipes and the way FASTA is parsed (locally resets $/).

http://bugzilla.open-bio.org

chris
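
Roy's suggestion in the quoted reply below, passing the format explicitly so that no guessing takes place, might look like this as a pipe-friendly command. This is a sketch only: it assumes BioPerl is installed, the shell function name `fasta_to_tab` is made up for illustration, and the thread does not show this exact command being run.

```shell
# Hypothetical wrapper around the one-liner from the thread, with the
# format given explicitly so Bio::SeqIO does not have to guess it.
fasta_to_tab() {
  perl -MBio::SeqIO -e '
    my $in = Bio::SeqIO->new(-fh => \*STDIN, -format => "fasta");
    while (my $seq = $in->next_seq) {
      print $seq->id, "\t", $seq->seq, "\n";
    }
  '
}
```

It would be used at the end of a pipeline, for example "cat test.fa | fasta_to_tab" or "command1 | command2 | fasta_to_tab".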

On Jul 22, 2010, at 6:49 AM, Roy Chaudhuri wrote:

> Hi Alper,
> 
> The problem comes about because you don't specify -format=>'fasta' in your Bio::SeqIO object. BioPerl attempts to guess the format if you don't specify it, but seems to be struggling in this case. I can't really think of any good reason for not specifying the format. Just in case anyone wants to investigate further, I noticed that if you try the example with longer fasta sequences, the first line of the sequence is interpreted as the id, with the remainder as the sequence.
> 
> Cheers.
> Roy.
> 
> On 22/07/2010 11:48, Frank Schwach wrote:
>> Hi Alper,
>> 
>> You can actually reproduce it also by providing STDIN from keyboard
>> input like so:
>> $ perl -MBio::SeqIO -e 'my $seq=Bio::SeqIO->new(-fh =>\*STDIN); while
>> ($myseq=$seq->next_seq){ print $myseq->id,"\t",$myseq->seq,"\n"}'
>> >1
>> aaaaaaaaa
>> >2
>> aaaaaaaaa
>> ggggggggg
>> >3
>> 2       ggggggggg
>> ccccccccc
>> 3       ccccccccc
>> 
>> In this case I typed
>> ">1"[ENTER]
>> "aaaaaaaaa"[ENTER]
>> ">2"[ENTER]
>> then the command returned the sequence of the first entry, again
>> without the ID.
>> From the second entry onwards, it is all correct.
>> 
>> I'm not 100% sure but could it be linked to buffering? SeqIO has to read
>> ahead to find a complete entry that spans multiple lines. When you get
>> STDIN from a file, you will get buffering and receive more than one line
>> at once, which will allow the next_seq method to work as expected. If
>> you provide line-by-line input then that method probably can't work
>> correctly.
>> If that is the case then you can't use the command in a pipe at all.
>> 
>> Frank
>> 
>> 
>> 
>> On Thu, 2010-07-22 at 00:09 -0400, Alper Yilmaz wrote:
>>> Hi,
>>> 
>>> I was using Bio::SeqIO in a Perl one-liner and noticed an oddity.
>>> Can someone suggest a correction or workaround?
>>> 
>>> Let test.fa be:
>>> >1
>>> AGTC
>>> >2
>>> CTGA
>>> 
>>> Then the command line below prints the expected output:
>>> $ perl -MBio::SeqIO -e 'my $seq=Bio::SeqIO->new(-fh =>\*STDIN); while
>>> ($myseq=$seq->next_seq){ print $myseq->id,"\t",$myseq->seq,"\n"}'<
>>> test.fa
>>> 
>>> output:
>>> 1	AGTC
>>> 2	CTGA
>>> 
>>> However, if I use the command in a pipe, the output has a problem
>>> with the primary_id of the first sequence.
>>> $ cat test.fa | perl -MBio::SeqIO -e 'my $seq=Bio::SeqIO->new(-fh
>>> =>\*STDIN); while ($myseq=$seq->next_seq){ print
>>> $myseq->id,"\t",$myseq->seq,"\n"}'
>>> 
>>> output:
>>> AGTC	
>>> 2	CTGA
>>> 
>>> What is the workaround to make Bio::SeqIO work correctly in a
>>> one-liner with pipes?
>>> 
>>> thanks,
>>> 
>>> Alper Yilmaz
>>> Post-doctoral Researcher
>>> Plant Biotechnology Center
>>> The Ohio State University
>>> 1060 Carmack Rd
>>> Columbus, OH 43210
>>> (614)688-4954
>>> 
>>> 
>>> PS: Strictly speaking, the example is a useless use of cat. It is only
>>> there for the sake of giving an example; it could equally be
>>> "command1 | command2 | command3 | perl -MBio::SeqIO -e'...'" instead.
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l