[Bioperl-l] Bio::SeqIO issue

Hilmar Lapp hlapp at gmx.net
Thu Aug 6 15:18:06 UTC 2009


Uwe - could you send an actual data file (as an attachment) that  
reproduces the problem, or is that not possible?

	-hilmar

On Aug 6, 2009, at 11:01 AM, Hilgert, Uwe wrote:

> I'm not sure what version we have. Cornel may have installed it a while
> ago from CVS:
>
> Module id = Bio::Root::Build
>    CPAN_USERID  CJFIELDS (Christopher Fields <cjfields at bioperl.org>)
>    CPAN_VERSION 1.006000
>    INST_FILE    /usr/lib/perl5/site_perl/5.8.8/Bio/Root/Build.pm
>    INST_VERSION 1.006900
> cpan> m Bio::Root::Version
> Module id = Bio::Root::Version
>    CPAN_USERID  CJFIELDS (Christopher Fields <cjfields at bioperl.org>)
>    CPAN_VERSION 1.006000
>    INST_FILE    /usr/lib/perl5/site_perl/5.8.8/Bio/Root/Version.pm
>    INST_VERSION 1.006900
> cpan> m Bio::SeqIO
> Module id = Bio::SeqIO
>    CPAN_USERID  CJFIELDS (Christopher Fields <cjfields at bioperl.org>)
>    CPAN_VERSION 1.006000
>    INST_FILE    /usr/lib/perl5/site_perl/5.8.8/Bio/SeqIO.pm
>    INST_VERSION undef
>
> Cornel still has the checked-out "bioperl-live" directory and the last
> changes are from March this year.
>
> As for why he used 'Fasta' instead of 'fasta' as the format parameter in
> Bio::SeqIO, it's because that's what it says in the module's manual. He
> has now tried 'fasta' instead and sees no change in behavior. When the
> format parameter is omitted altogether, fasta-formatted sequence continues
> to be treated correctly, with the first line being removed. However, raw
> sequence is treated differently in that the first line is no longer
> removed. Instead, the program returns only the first line, which, in the
> example I am going to forward in my next message, returns 60 amino acids
> out of a raw sequence of 300 aa. Can't win with raw sequence...
>
>
> The files may have been created on different platforms; we didn't notice
> any difference between using files created on Windows or Linux.
>
> Thanks
> Uwe
>
>
>
>
> -----Original Message-----
> From: Hilmar Lapp [mailto:hlapp at gmx.net]
> Sent: Wednesday, August 05, 2009 6:54 PM
> To: Chris Fields
> Cc: Hilgert, Uwe; BioPerl List
> Subject: Re: [Bioperl-l] Bio::SeqIO issue
>
> I don't think that can be the problem. If anything, providing the
> format ought to be better in terms of result than not providing it?
>
> Uwe - I'd like you to go back to Chris' initial questions that you
> haven't answered yet: "What version of bioperl are you using, OS,
> etc?  What does your data look like?" I'd add to that: can you show us
> your full script, or a smaller code snippet that reproduces the problem?
>
> I suspect that either something in your script is swallowing the line,
> or that the line endings in your data file are from a different OS
> than the one you're running the script on. (Or that you are running a
> very old version of BioPerl, which is entirely possible if you
> installed through CPAN.)
>
> 	-hilmar
>
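The line-ending suspicion above can be checked without BioPerl. The following is a minimal sketch in core Perl, not something posted in the thread: the helper names count_line_endings and slurp_raw are hypothetical, and the script only counts how many CRLF (Windows), lone CR (classic Mac) and lone LF (Unix) line endings a file contains.

```perl
use strict;
use warnings;

# Hypothetical helper: count each line-ending style found in a string.
sub count_line_endings {
    my ($data) = @_;
    my %count = ( crlf => 0, cr => 0, lf => 0 );
    $count{crlf}++ while $data =~ /\r\n/g;       # Windows style
    $count{cr}++   while $data =~ /\r(?!\n)/g;   # lone CR (classic Mac)
    $count{lf}++   while $data =~ /(?<!\r)\n/g;  # lone LF (Unix)
    return %count;
}

# Hypothetical helper: read a whole file without any newline translation.
sub slurp_raw {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "cannot open $path: $!";
    local $/;                                    # undef separator: slurp mode
    my $data = <$fh>;
    close $fh;
    return $data;
}

# Usage: perl line_endings.pl your_sequence_file
if (@ARGV) {
    my %count = count_line_endings( slurp_raw( $ARGV[0] ) );
    print "$_: $count{$_}\n" for sort keys %count;
}
```

A file that reports only lone CR endings would look like a single line to a reader that splits on "\n", which is one way a parser could appear to swallow or truncate lines.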
> On Aug 5, 2009, at 5:37 PM, Chris Fields wrote:
>
>> Uwe,
>>
>> Please keep replies on the list.
>>
>> It's very possible that's the issue; IIRC the fasta parser pulls out
>> the full sequence in chunks (based on local $/ = "\n>") and splits
>> the header off as the first line in that chunk.  You could probably
>> try leaving the format out and letting SeqIO guess it, or passing
>> the file into Bio::Tools::GuessSeqFormat directly, but it's probably
>> better to go through the files and add a file extension that
>> corresponds to the format.
>>
>> chris
>>
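Chris's recollection of the fasta parser (records read in chunks with the record separator set to "\n>", and the first line of each chunk split off as the header) can be illustrated in core Perl. This is a sketch of that described behavior only, not the actual Bio::SeqIO code; the helper name read_fasta_like is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper (not BioPerl code): read records the way the thread
# describes, one chunk per "\n>" separator, first line taken as the header.
sub read_fasta_like {
    my ($text) = @_;
    open my $fh, '<', \$text or die "cannot open in-memory handle: $!";
    local $/ = "\n>";                  # record separator, as in the thread
    my @records;
    while ( defined( my $chunk = <$fh> ) ) {
        $chunk =~ s/\n>\z//;           # drop the trailing separator, if any
        $chunk =~ s/\A>//;             # drop the leading '>' of the first record
        my ( $header, $body ) = split /\n/, $chunk, 2;
        $body = '' unless defined $body;
        $body =~ s/\s+//g;             # join the remaining lines into one sequence
        push @records, { header => $header, seq => $body };
    }
    close $fh;
    return @records;
}

my $fasta = ">seq1 demo\nACGT\nTTGA\n";   # real FASTA input
my $raw   = "ACGT\nTTGA\n";               # raw sequence, no header line

for my $input ( $fasta, $raw ) {
    my ($rec) = read_fasta_like($input);
    print "header=[$rec->{header}] seq=[$rec->{seq}]\n";
}
# prints:
# header=[seq1 demo] seq=[ACGTTTGA]
# header=[ACGT] seq=[TTGA]
```

Under this reading, raw input handed to a fasta-style reader loses its first line of residues to the header, which matches the symptom Uwe reports when the format is forced to fasta.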
>> On Aug 5, 2009, at 4:23 PM, Hilgert, Uwe wrote:
>>
>>> Thanks, Chris. The files have no extension, but we indicate what format
>>> to use, like in the manual:
>>>
>>> $in  = Bio::SeqIO->new(-file => "file_path", -format => 'Fasta');
>>>
>>> I wonder now whether exactly this could be causing the problem: as we are
>>> telling it that the input files are in fasta format, they are being
>>> treated as such (= first line removed), regardless of whether they
>>> really are fasta?
>>>
>>> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
>>> Uwe Hilgert, Ph.D.
>>> Dolan DNA Learning Center
>>> Cold Spring Harbor Laboratory
>>>
>>> C: (516) 857-1693
>>> V: (516) 367-5185
>>> E: hilgert at cshl.edu
>>> F: (516) 367-5182
>>> W: http://www.dnalc.org
>>>
>>> -----Original Message-----
>>> From: Chris Fields [mailto:cjfields at illinois.edu]
>>> Sent: Wednesday, August 05, 2009 5:04 PM
>>> To: Hilgert, Uwe
>>> Cc: bioperl-l at lists.open-bio.org
>>> Subject: Re: [Bioperl-l] Bio::SeqIO issue
>>>
>>> On Aug 5, 2009, at 3:27 PM, Hilgert, Uwe wrote:
>>>
>>>> Is my impression correct that Bio::SeqIO just assumes that sequences are
>>>> being submitted in FASTA format?
>>>
>>> No. See:
>>>
>>> http://www.bioperl.org/wiki/HOWTO:SeqIO
>>>
>>> SeqIO tries to guess the format from the file extension, and if
>>> one isn't present makes use of Bio::Tools::GuessSeqFormat.  It's
>>> possible that the extension is causing the problem, or that
>>> GuessSeqFormat is guessing wrong (it's apt to do that, as it's forced
>>> to guess).  In any case, it's always advisable to explicitly indicate
>>> the format when possible.
>>>
>>> Relevant lines:
>>>
>>>  return 'fasta'   if /\.(fasta|fast|fas|seq|fa|fsa|nt|aa|fna|faa)$/i;
>>> ...
>>>  return 'raw'     if /\.(txt)$/i;
>>>
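The extension-based guess described above can be tried in isolation. The sketch below reuses only the two patterns quoted in the thread; the function name guess_format_from_name is hypothetical, and the real guessing code covers many more formats than these two.

```perl
use strict;
use warnings;

# Hypothetical helper: map a file name to a format using only the two
# patterns quoted in the thread. Returns undef when nothing matches.
sub guess_format_from_name {
    my ($name) = @_;
    for ($name) {
        return 'fasta' if /\.(fasta|fast|fas|seq|fa|fsa|nt|aa|fna|faa)$/i;
        return 'raw'   if /\.(txt)$/i;
    }
    return;    # no extension match
}

for my $file (qw(reads.fa protein.TXT sample)) {
    my $format = guess_format_from_name($file);
    printf "%-12s => %s\n", $file,
        defined $format ? $format : '(no guess from name)';
}
```

A file with no extension, such as the ones described in this thread, matches neither pattern, so the format would have to come from content-based guessing or from an explicit -format argument.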
>>>> In our experience, implementing
>>>> Bio::SeqIO led to the first line of files being cut off, regardless of
>>>> whether the files were indeed fasta files or files that only contained
>>>> sequence.
>>>
>>> Files that only contain sequence are 'raw'.  Ones in FASTA are 'fasta'.
>>>
>>>> Which, in the latter case, led to sequence submissions that had the
>>>> first line of nucleotides removed. Has anyone tried to write a fix for
>>>> this?
>>>
>>> This sounds like a bug, but we have very little to go on beyond your
>>> description.  What version of bioperl are you using, OS, etc?  What
>>> does your data look like?  File extension?
>>>
>>> chris
>>>
>>>> Thanks,
>>>>
>>>> Uwe
>>>>
>>>> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
>>>> Uwe Hilgert, Ph.D.
>>>> Dolan DNA Learning Center
>>>> Cold Spring Harbor Laboratory
>>>>
>>>> V: (516) 367-5185
>>>> E: hilgert at cshl.edu
>>>> F: (516) 367-5182
>>>> W: http://www.dnalc.org
>>>>
>>>> _______________________________________________
>>>> Bioperl-l mailing list
>>>> Bioperl-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>
>>
>
> -- 
> ===========================================================
> : Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
> ===========================================================
>
>

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================
