[Bioperl-l] Check sequence format, question

Chris Fields cjfields at uiuc.edu
Thu Nov 2 15:11:01 UTC 2006


Brian,

I think the validation issue is worthwhile, but I can see logistical
nightmares in having every SeqIO parser validate sequences while
parsing; GenBank and EMBL do this to some extent already, but it isn't
foolproof.  Much of SeqIO (e.g. GenBank/EMBL/Swiss parsing) is already
in dire need of an overhaul as is, without adding validation.

I wonder if it would be better if SeqIO has-a validator object  
instead of acting as a validator itself, i.e. SeqIO would focus on  
parsing and writing, the validator would focus on validation.  It  
might be easier from the maintenance aspect.  It's probably  
worthwhile exploring using Bio::Tools::GuessSeqFormat within SeqIO,  
or setting up a new system altogether.  Validation using the sequence  
validator could then be enabled by having a validation option when  
instantiating SeqIO.  We could even enable XML format validation  
using the DTD/Schema, which should be fairly straightforward.
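
To make the has-a idea concrete, here is a rough, language-neutral sketch in Python rather than Perl. Nothing in it is existing BioPerl API: FastaReader, FastaValidator and FormatError are hypothetical names, and the check only covers the loose FASTA convention mentioned in this thread (a first line starting with '>' followed by sequence lines). The reader only parses; validation runs only if a validator object is handed in at construction time.

```python
import io


class FormatError(ValueError):
    """Raised when input does not look like the expected format."""


class FastaValidator:
    """Hypothetical validator; checks only the loose FASTA convention."""

    def validate(self, text):
        # Ignore blank lines, then inspect what is left.
        lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
        if not lines or not lines[0].startswith(">"):
            raise FormatError("first non-blank line must start with '>'")
        if not any(not ln.startswith(">") for ln in lines[1:]):
            raise FormatError("no sequence lines found after the header")


class FastaReader:
    """Hypothetical parser that has-a validator instead of being one."""

    def __init__(self, handle, validator=None):
        self.text = handle.read()
        self.validator = validator  # optional; None means "just parse"

    def records(self):
        # Validation is delegated, and only happens when requested.
        if self.validator is not None:
            self.validator.validate(self.text)
        records = []
        header, chunks = None, []
        for line in self.text.splitlines():
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if header is not None:
                    records.append((header, "".join(chunks)))
                header, chunks = line[1:], []
            elif header is not None:
                chunks.append(line)
            # Lines before the first header are silently skipped here;
            # catching them is the validator's job, not the parser's.
        if header is not None:
            records.append((header, "".join(chunks)))
        return records


if __name__ == "__main__":
    handle = io.StringIO(">seq1 demo\nACGT\nTTGA\n")
    for name, seq in FastaReader(handle, validator=FastaValidator()).records():
        print(name, seq)
```

The point of the split is that the parsing code stays unchanged whether or not validation is switched on, and a stricter or format-specific validator could be swapped in without touching the reader.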

Of course, this all depends on someone writing it up...

Chris

On Nov 2, 2006, at 6:49 AM, Brian Osborne wrote:

> Chris et al.,
>
> As you know the question of whether SeqIO should or should not validate or
> check the given format is still an open one. In fact, some SeqIO modules do
> validate to some extent. See:
>
> http://bugzilla.open-bio.org/show_bug.cgi?id=1508
>
> I can see that you've commented on this enhancement; I'm replying just to
> bring this to the attention of others.
>
> Brian O.
>
>
> On 11/2/06 12:28 AM, "Chris Fields" <cjfields at uiuc.edu> wrote:
>
>> On Nov 1, 2006, at 6:15 PM, Eugene Bolotin wrote:
>>
>>> Dear bioperl mailing list,
>>> I am trying to get a sequence from a file using Bio::SeqIO. Before I do
>>> anything, I want to make sure that the file is in the correct FASTA
>>> sequence format. I want it to spit out an error message if it is in any
>>> other format.
>>> What is the easiest way to do it?
>>> Thanks,
>>> Eugene Bolotin
>>> Sladek Lab.
>>
>> There is no formal FASTA definition that is universally accepted
>> beyond having the first line start with '>' and an optional
>> description, with the sequence in subsequent lines.
>>
>> http://www.bioperl.org/wiki/FASTA_sequence_format
>>
>> Bio::SeqIO isn't currently set up to validate sequence formats
>> directly, but you could try preparsing the data using
>> Bio::Tools::GuessSeqFormat.
>>
>> Chris
>>
>> Christopher Fields
>> Postdoctoral Researcher
>> Lab of Dr. Robert Switzer
>> Dept of Biochemistry
>> University of Illinois Urbana-Champaign

Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign





