[Bioperl-l] Bio::SeqIO issue

Hilmar Lapp hlapp at gmx.net
Thu Aug 6 13:36:08 UTC 2009


Why is specifying fasta format when your input is not in fasta format  
not a user error?

I agree with the not removing newlines in raw format being a bug.

	-hilmar

On Aug 6, 2009, at 1:12 AM, Chris Fields wrote:

> Just to confirm: the following is using bioperl-live on my macbook  
> pro (perl 5.10.0, 64bit).  We need to decide if this is a legit bug  
> or a user issue (if it's the former, we can easily add an exception  
> indicating lack of a header).  Note that 'raw' also fails for the  
> raw example below (doesn't appear to remove newlines).
>
> -c
>
> cjfields4:fasta cjfields$ cat raw_v_fasta.pl
> #!/usr/bin/perl -w
>
> use strict;
> use warnings;
> use IO::String;
> use Bio::SeqIO;
> use Test::More qw(no_plan);
>
> my %seq;
>
> $seq{raw} = <<RAW;
> MWTALPLLCAGAWLLSAGATAELTVNAIEKFHFTSWMKQHQKTYSSREYSHRLQVFANNWRKIQAHNQRN
> HTFKMGLNQFSDMSFAEIKHKYLWSEPQNCSATKSNYLRGTGPYPSSMDWRKKGNVVSPVKNQGACGSCW
> TFSTTGALESAVAIASGKMMTLAEQQLVDCAQNFNNHGCQGGLPSQAFEYILYNKGIMGEDSYPYIGKNG
> QCKFNPEKAVAFVKNVVNITLNDEAAMVEAVALYNPVSFAFEVTEDFMMYKSGVYSSNSCHKTPDKVNHA
> VLAVGYGEQNGLLYWIVKNSWGSNWGNNGYFLIERGKNMCGLAACASYPIPQV
> RAW
>
> $seq{fasta} = <<FASTA;
> >CATH_RAT
> MWTALPLLCAGAWLLSAGATAELTVNAIEKFHFTSWMKQHQKTYSSREYSHRLQVFANNWRKIQAHNQRN
> HTFKMGLNQFSDMSFAEIKHKYLWSEPQNCSATKSNYLRGTGPYPSSMDWRKKGNVVSPVKNQGACGSCW
> TFSTTGALESAVAIASGKMMTLAEQQLVDCAQNFNNHGCQGGLPSQAFEYILYNKGIMGEDSYPYIGKNG
> QCKFNPEKAVAFVKNVVNITLNDEAAMVEAVALYNPVSFAFEVTEDFMMYKSGVYSSNSCHKTPDKVNHA
> VLAVGYGEQNGLLYWIVKNSWGSNWGNNGYFLIERGKNMCGLAACASYPIPQV
> FASTA
>
> my %newdata;
> for my $input (sort keys %seq) {
>    my $fh = IO::String->new($seq{$input});
>    my $seq = Bio::SeqIO->new(-format => 'fasta',
>                             -fh => $fh)->next_seq;
>    $newdata{$input} = $seq->seq;
> }
> is($newdata{raw}, $newdata{fasta}, 'format');
>
> cjfields4:fasta cjfields$ perl raw_v_fasta.pl
> not ok 1 - format
> #   Failed test 'format'
> #   at raw_v_fasta.pl line 36.
> #          got:  
> 'HTFKMGLNQFSDMSFAEIKHKYLWSEPQNCSATKSNYLRGTGPYPSSMDWRKKGNVVSPVKNQGACGSCWTFSTTGALESAVAIASGKMMTLAEQQLVDCAQNFNNHGCQGGLPSQAFEYILYNKGIMGEDSYPYIGKNGQCKFNPEKAVAFVKNVVNITLNDEAAMVEAVALYNPVSFAFEVTEDFMMYKSGVYSSNSCHKTPDKVNHAVLAVGYGEQNGLLYWIVKNSWGSNWGNNGYFLIERGKNMCGLAACASYPIPQV'
> #     expected:  
> 'MWTALPLLCAGAWLLSAGATAELTVNAIEKFHFTSWMKQHQKTYSSREYSHRLQVFANNWRKIQAHNQRNHTFKMGLNQFSDMSFAEIKHKYLWSEPQNCSATKSNYLRGTGPYPSSMDWRKKGNVVSPVKNQGACGSCWTFSTTGALESAVAIASGKMMTLAEQQLVDCAQNFNNHGCQGGLPSQAFEYILYNKGIMGEDSYPYIGKNGQCKFNPEKAVAFVKNVVNITLNDEAAMVEAVALYNPVSFAFEVTEDFMMYKSGVYSSNSCHKTPDKVNHAVLAVGYGEQNGLLYWIVKNSWGSNWGNNGYFLIERGKNMCGLAACASYPIPQV'
> 1..1
> # Looks like you failed 1 test of 1.
>
> On Aug 5, 2009, at 6:12 PM, Mark A. Jensen wrote:
>
>> If these items were included in a Bugzilla report, that would be  
>> most convenient (= most likely to get looked carefully)
>> and is the best place for us to keep track of these kinds of  
>> issues-- http://bugzilla.bioperl.org/
>> cheers MAJ
>> ----- Original Message ----- From: "Hilmar Lapp" <hlapp at gmx.net>
>> To: "Chris Fields" <cjfields at illinois.edu>
>> Cc: "BioPerl List" <bioperl-l at lists.open-bio.org>
>> Sent: Wednesday, August 05, 2009 6:53 PM
>> Subject: Re: [Bioperl-l] Bio::SeqIO issue
>>
>>
>>> I don't think that can be the problem. If anything, providing the   
>>> format ought to be better in terms of result than not providing it?
>>> Uwe - I'd like you to go back to Chris' initial questions that  
>>> you  haven't answered yet: "What version of bioperl are you using,  
>>> OS,  etc?  What does your data look like?" I'd add to that, can  
>>> you show us  your full script, or a smaller code snippet that  
>>> reproduces the problem.
>>> I suspect that either something in your script is swallowing the  
>>> line,  or that the line endings in your data file are from a  
>>> different OS  than the one you're running the script on. (Or that  
>>> you are running a  very old version of BioPerl, which is entirely  
>>> possible if you  installed through CPAN.)
>>> -hilmar
>>> On Aug 5, 2009, at 5:37 PM, Chris Fields wrote:
>>>> Uwe,
>>>>
>>>> Please keep replies on the list.
>>>>
>>>> It's very possible that's the issue; IIRC the fasta parser pulls  
>>>> out  the full sequence in chunks (based on local $/ = "\n>") and  
>>>> splits  the header off as the first line in that chunk.  You  
>>>> could probably  try leaving the format out and letting SeqIO  
>>>> guess it, or passing  the file into Bio::Tools::GuessSeqFormat  
>>>> directly, but it's probably  better to go through the files and  
>>>> add a file extension that  corresponds to the format.
>>>>
>>>> chris
>>>>
>>>> On Aug 5, 2009, at 4:23 PM, Hilgert, Uwe wrote:
>>>>
>>>>> Thanks, Chris. The files have no extension, but we indicate  
>>>>> what  format
>>>>> to use, like in the manual:
>>>>>
>>>>> $in  = Bio::SeqIO->new(-file => "file_path", -format => 'Fasta');
>>>>>
>>>>> I wonder now whether this could exactly cause the problem: as we  
>>>>> are
>>>>> telling that input files are in fasta format they are being  
>>>>> treated  as
>>>>> such (=remove first line) - regardless of whether they really  
>>>>> are  fasta?
>>>>>
>>>>> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
>>>>> Uwe Hilgert, Ph.D.
>>>>> Dolan DNA Learning Center
>>>>> Cold Spring Harbor Laboratory
>>>>>
>>>>> C: (516) 857-1693
>>>>> V: (516) 367-5185
>>>>> E: hilgert at cshl.edu
>>>>> F: (516) 367-5182
>>>>> W: http://www.dnalc.org
>>>>>
>>>>> -----Original Message-----
>>>>> From: Chris Fields [mailto:cjfields at illinois.edu]
>>>>> Sent: Wednesday, August 05, 2009 5:04 PM
>>>>> To: Hilgert, Uwe
>>>>> Cc: bioperl-l at lists.open-bio.org
>>>>> Subject: Re: [Bioperl-l] Bio::SeqIO issue
>>>>>
>>>>> On Aug 5, 2009, at 3:27 PM, Hilgert, Uwe wrote:
>>>>>
>>>>>> Is my impression correct that Bio::SeqIO just assumes that  
>>>>>> sequences
>>>>>> are
>>>>>> being submitted in FASTA format?
>>>>>
>>>>> No. See:
>>>>>
>>>>> http://www.bioperl.org/wiki/HOWTO:SeqIO
>>>>>
>>>>> SeqIO tries to guess at the format using the file extension, and  
>>>>> if
>>>>> one isn't present makes use of Bio::Tools::GuessSeqFormat.  It's
>>>>> possible that the extension is causing the problem, or that
>>>>> GuessSeqFormat guessing wrong (it's apt to do that, as it's  
>>>>> forced to
>>>>> guessing).  In any case, it's always advisable to explicitly  
>>>>> indicate
>>>>> the format when possible.
>>>>>
>>>>> Relevant lines:
>>>>>
>>>>> return 'fasta'   if /\.(fasta|fast|fas|seq|fa|fsa|nt|aa|fna|faa) 
>>>>> $/ i;
>>>>> ...
>>>>> return 'raw'     if /\.(txt)$/i;
>>>>>
>>>>>> In our experience, implementing
>>>>>> Bio::SeqIO led to the first line of files being cut off,   
>>>>>> regardless of
>>>>>> whether the files were indeed fasta files or files that only   
>>>>>> contained
>>>>>> sequence.
>>>>>
>>>>> Files that only contain sequence are 'raw'.  Ones in FASTA are   
>>>>> 'fasta'.
>>>>>
>>>>>> Which, in the latter, led to sequence submissions that had the
>>>>>> first line of nucleotides removed. Has anyone tried to write a  
>>>>>> fix  for
>>>>>> this?
>>>>>
>>>>> This sounds like a bug, but we have very little to go on beyond  
>>>>> your
>>>>> description.  What version of bioperl are you using, OS, etc?   
>>>>> What
>>>>> does your data look like?  File extension?
>>>>>
>>>>> chris
>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Uwe
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
>>>>>>
>>>>>> Uwe Hilgert, Ph.D.
>>>>>>
>>>>>> Dolan DNA Learning Center
>>>>>>
>>>>>> Cold Spring Harbor Laboratory
>>>>>>
>>>>>>
>>>>>>
>>>>>> V: (516) 367-5185
>>>>>>
>>>>>> E: hilgert at cshl.edu <mailto:hilgert at cshl.edu>
>>>>>>
>>>>>> F: (516) 367-5182
>>>>>>
>>>>>> W: http://www.dnalc.org
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> Bioperl-l mailing list
>>>>>> Bioperl-l at lists.open-bio.org
>>>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>>>
>>>>
>>>> _______________________________________________
>>>> Bioperl-l mailing list
>>>> Bioperl-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>> -- 
>>> ===========================================================
>>> : Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
>>> ===========================================================
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================






More information about the Bioperl-l mailing list