[Bioperl-l] Sequence Validation

Matthew Laird lairdm at sfu.ca
Wed Jun 11 11:05:00 EDT 2003


Hello, I hope this is the correct place to ask this...

I've been looking through the BioPerl documentation and the mailing list 
archives and am wondering whether anything already exists for sequence 
validation.

What I mean is this: as far as I can see there are modules for things such 
as reading FASTA files (Bio::SeqIO), but how would one test whether the 
file is valid?  We're attempting to create a web interface where people can 
submit sequences for analysis, but people could submit badly formatted 
files.  Example:
>
BRKISLIGLATMSMLAFNTSAFALGTASSNSGASGKHWSVVGGAALVQPK
NGKNAAQNTVKFGGDVAPTLSVTYYINDNVGFELWGITKKLSYTAKTDAS

Bio::SeqIO doesn't throw any error on this; instead it treats the 
line starting with "NGKN" as the beginning of the sequence.  Yes, this 
input violates the FASTA format, but with a web interface you can't be 
sure people will submit a perfectly formatted file.

Can anyone point me in the direction of a module which will validate the 
file as it's read, checking both that the format is correct and that only 
allowed sequence letters are included?  Or is this something which needs 
to be written?  Ideally this should work for multiple formats as well.
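
To make the request concrete, here is a minimal sketch of the kind of check 
described above. It is written in plain Python, not BioPerl, and is not an 
existing module: the names validate_fasta and PROTEIN_LETTERS are 
hypothetical, the protein alphabet is an assumption, and only FASTA is 
handled.

```python
# Assumed alphabet: the 20 standard amino acids plus the ambiguity codes
# B, Z, X and '*' for a stop. Adjust to whatever the analysis accepts.
PROTEIN_LETTERS = set("ACDEFGHIKLMNPQRSTVWY" "BZX*")


def validate_fasta(text, allowed=PROTEIN_LETTERS):
    """Return a list of (line_number, message) problems; empty means valid."""
    problems = []
    header_line = None   # line number of the record currently open
    residues_seen = 0    # sequence letters counted for that record

    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue  # blank lines are tolerated in this sketch
        if line.startswith(">"):
            # Close the previous record before opening a new one.
            if header_line is not None and residues_seen == 0:
                problems.append((header_line, "record has no sequence"))
            if not line[1:].strip():
                problems.append((number, "header has no identifier"))
            header_line, residues_seen = number, 0
        elif header_line is None:
            problems.append((number, "sequence data before any '>' header"))
        else:
            bad = sorted(set(line.upper()) - allowed)
            if bad:
                problems.append((number, "illegal characters: " + "".join(bad)))
            residues_seen += len(line)

    if header_line is None:
        if not problems:
            problems.append((0, "no records found"))
    elif residues_seen == 0:
        problems.append((header_line, "record has no sequence"))
    return problems
```

Run on the example above, this sketch reports the empty header on line 1 
and accepts both sequence lines, since B is an ambiguity code in the 
assumed alphabet. Returning a list of problems rather than raising on the 
first one is a design choice that suits a web form, where all errors can 
be shown to the submitter at once.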

If such a module doesn't exist, I suppose I'll begin working on one and 
submit the results to the collective, since this seems like such a useful 
tool.

Thanks.

-- 
Matthew Laird
SysAdmin/Web Developer, Brinkman Laboratory, MBB Dept.
Simon Fraser University




