[Bioperl-l] Bio::SeqIO issue

Thu Aug 6 04:43:45 UTC 2009

The SeqIO::fasta parser sets:

local $/ = "\n>";

then splits the resulting chunks of data (each corresponding to a full  
FASTA-formatted sequence) into two pieces:

my ($top,$sequence) = split(/\n/,$entry,2);

If there is no description line (e.g. the file is all raw sequence
data), these lines would read in the whole file as a single entry and
then split off the first line as the header, so the first line of
sequence is dropped.
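
To illustrate, here is a rough core-Perl sketch of that chunk-and-split
logic. It is not the actual Bio::SeqIO::fasta code: the helper name
first_record and the sample data are made up for the example, and the
'>' stripping and whitespace cleanup are my assumptions about what a
FASTA parser would do around those two lines.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of BioPerl): read the first record from a
# string the way the two lines quoted above would, and return (header, seq).
sub first_record {
    my ($data) = @_;
    open my $fh, '<', \$data or die "cannot open in-memory handle: $!";
    local $/ = "\n>";          # record separator: newline followed by '>'
    my $entry = <$fh>;         # with no "\n>" in the data this is the whole file
    close $fh;
    chomp $entry;              # strips a trailing "\n>" if one was read
    $entry =~ s/^>//;          # assumption: drop the leading '>' of a header
    my ($top, $sequence) = split(/\n/, $entry, 2);
    $sequence = '' unless defined $sequence;
    $sequence =~ s/\s+//g;     # assumption: join the sequence lines
    return ($top, $sequence);
}

# Raw sequence only, no description line (made-up data).
my $raw = "ACGTACGTAC\nGGGTTTAAAC\nTTTTCCCCGG\n";
my ($top, $seq) = first_record($raw);
print "header:   $top\n";      # header:   ACGTACGTAC
print "sequence: $seq\n";      # sequence: GGGTTTAAACTTTTCCCCGG
```

The first line of nucleotides ends up as the "header" and is missing
from the sequence, which matches what Uwe describes.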

chris

On Aug 5, 2009, at 5:53 PM, Hilmar Lapp wrote:

> I don't think that can be the problem. If anything, providing the  
> format ought to be better in terms of result than not providing it?
>
> Uwe - I'd like you to go back to Chris' initial questions that you  
> haven't answered yet: "What version of bioperl are you using, OS,  
> etc?  What does your data look like?" I'd add to that, can you show  
> us your full script, or a smaller code snippet that reproduces the  
> problem.
>
> I suspect that either something in your script is swallowing the  
> line, or that the line endings in your data file are from a  
> different OS than the one you're running the script on. (Or that you  
> are running a very old version of BioPerl, which is entirely  
> possible if you installed through CPAN.)
>
> 	-hilmar
>
> On Aug 5, 2009, at 5:37 PM, Chris Fields wrote:
>
>> Uwe,
>>
>> Please keep replies on the list.
>>
>> It's very possible that's the issue; IIRC the fasta parser pulls  
>> out the full sequence in chunks (based on local $/ = "\n>") and  
>> splits the header off as the first line in that chunk.  You could  
>> probably try leaving the format out and letting SeqIO guess it, or  
>> passing the file into Bio::Tools::GuessSeqFormat directly, but it's  
>> probably better to go through the files and add a file extension  
>> that corresponds to the format.
>>
>> chris
>>
>> On Aug 5, 2009, at 4:23 PM, Hilgert, Uwe wrote:
>>
>>> Thanks, Chris. The files have no extension, but we indicate what
>>> format to use, like in the manual:
>>>
>>> $in  = Bio::SeqIO->new(-file => "file_path", -format => 'Fasta');
>>>
>>> I wonder now whether this is exactly what causes the problem: since we
>>> tell SeqIO that the input files are in fasta format, they are treated
>>> as such (= the first line is removed), regardless of whether they
>>> really are fasta?
>>>
>>> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
>>> Uwe Hilgert, Ph.D.
>>> Dolan DNA Learning Center
>>> Cold Spring Harbor Laboratory
>>>
>>> C: (516) 857-1693
>>> V: (516) 367-5185
>>> E: hilgert at cshl.edu
>>> F: (516) 367-5182
>>> W: http://www.dnalc.org
>>>
>>> -----Original Message-----
>>> From: Chris Fields [mailto:cjfields at illinois.edu]
>>> Sent: Wednesday, August 05, 2009 5:04 PM
>>> To: Hilgert, Uwe
>>> Cc: bioperl-l at lists.open-bio.org
>>> Subject: Re: [Bioperl-l] Bio::SeqIO issue
>>>
>>> On Aug 5, 2009, at 3:27 PM, Hilgert, Uwe wrote:
>>>
>>>> Is my impression correct that Bio::SeqIO just assumes that sequences
>>>> are being submitted in FASTA format?
>>>
>>> No. See:
>>>
>>> http://www.bioperl.org/wiki/HOWTO:SeqIO
>>>
>>> SeqIO tries to guess at the format using the file extension, and if
>>> one isn't present makes use of Bio::Tools::GuessSeqFormat.  It's
>>> possible that the extension is causing the problem, or that
>>> GuessSeqFormat is guessing wrong (it's apt to do that, as it's forced
>>> to guess).  In any case, it's always advisable to explicitly indicate
>>> the format when possible.
>>>
>>> Relevant lines:
>>>
>>>  return 'fasta'   if /\.(fasta|fast|fas|seq|fa|fsa|nt|aa|fna|faa)$/i;
>>> ...
>>>  return 'raw'     if /\.(txt)$/i;
>>>
>>>> In our experience, implementing Bio::SeqIO led to the first line of
>>>> files being cut off, regardless of whether the files were indeed
>>>> fasta files or files that only contained sequence.
>>>
>>> Files that only contain sequence are 'raw'.  Ones in FASTA are 'fasta'.
>>>
>>>> Which, in the latter, led to sequence submissions that had the
>>>> first line of nucleotides removed. Has anyone tried to write a fix
>>>> for this?
>>>
>>> This sounds like a bug, but we have very little to go on beyond your
>>> description.  What version of bioperl are you using, OS, etc?  What
>>> does your data look like?  File extension?
>>>
>>> chris
>>>
>>>> Thanks,
>>>>
>>>> Uwe
>>>>
>>>> _______________________________________________
>>>> Bioperl-l mailing list
>>>> Bioperl-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>
>>
>
> -- 
> ===========================================================
> : Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
> ===========================================================