[Bioperl-l] Sequence Validation
Jason Stajich
jason at cgt.duhs.duke.edu
Wed Jun 11 14:27:43 EDT 2003
Which version of BioPerl are you using? The 1.2 branch and the main-trunk
code (soon to be the 1.3 branch) both parse that sequence just fine for me,
although it could be that linefeeds are causing problems, I guess.
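use Bio::SeqIO;
# Read FASTA records from the __DATA__ section at the end of this script
my $in = Bio::SeqIO->new(-fh => \*DATA, -format => 'fasta');
# Fetch the first record and print its identifier and residues
my $seq = $in->next_seq;
print $seq->display_id, "\n";
print $seq->seq(), "\n";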
__DATA__
>
BRKISLIGLATMSMLAFNTSAFALGTASSNSGASGKHWSVVGGAALVQPK
NGKNAAQNTVKFGGDVAPTLSVTYYINDNVGFELWGITKKLSYTAKTDAS
As for validating, SeqIO will throw an error if something is unparseable.
What we have suggested to people in the past is to wrap the parsing in an
eval block to catch these errors.
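A minimal sketch of that eval pattern follows. The sub name
first_seq_or_error and the error strings are made up for illustration, and
reading from an in-memory filehandle is an assumption about how the
submitted text arrives. Bio::SeqIO is loaded inside the eval so that a
missing install is trapped as well. Note that next_seq returns undef
without dying when there is no record, so that case is checked separately.

```perl
use strict;
use warnings;

# Hypothetical helper: parse the first FASTA record from a string.
# Returns ($seq, undef) on success or (undef, $message) on failure.
sub first_seq_or_error {
    my ($text) = @_;
    my $seq;
    my $ok = eval {
        require Bio::SeqIO;    # loaded here so a load failure is trapped too
        open my $fh, '<', \$text
            or die "cannot open in-memory handle: $!\n";
        my $in = Bio::SeqIO->new(-fh => $fh, -format => 'fasta');
        $seq = $in->next_seq;  # may die if the parser throws
        1;                     # signals that the block ran to completion
    };
    return (undef, $@ || 'unknown error') unless $ok;
    return (undef, 'no sequence record found') unless defined $seq;
    return ($seq, undef);
}

my ($seq, $err) = first_seq_or_error(">seq1\nMKISLIGLATMSML\n");
print defined $seq ? $seq->display_id . "\n" : "rejected: $err\n";
```

Returning the error message instead of rethrowing lets a web front end show
the user why a submission was rejected.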
If you still want a validator, I would suggest a small, lightweight method
that, given a string, attempts to guess the format and/or validate it,
rather than relying on SeqIO for this just yet.
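As a rough sketch of what such a lightweight method might look like for
FASTA only (no format guessing), the following checks a string without
going through SeqIO. The sub name fasta_problems, the messages, and the
rules are all assumptions for illustration: every record needs a '>'
header with an identifier, at least one residue, and only characters from
an allowed set, which defaults here to letters plus '*' and '-'.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a list of problems found in FASTA-like text.
# An empty list means the text passed these (assumed) checks.
sub fasta_problems {
    my ($text, $allowed) = @_;
    $allowed = 'A-Za-z*-' unless defined $allowed;   # assumed default set
    my @problems;
    my $header_line;     # line number of the current record's header, if any
    my $residues = 0;    # residues seen so far in the current record
    my $line_no  = 0;

    for my $line (split /\r\n|\r|\n/, $text) {       # tolerate any linefeed style
        $line_no++;
        next if $line !~ /\S/;                       # skip blank lines
        if ($line =~ /^>/) {
            push @problems, "line $header_line: record has no sequence"
                if defined $header_line && !$residues;
            push @problems, "line $line_no: header has no identifier"
                unless $line =~ /^>\S/;
            ($header_line, $residues) = ($line_no, 0);
        }
        elsif (!defined $header_line) {
            push @problems, "line $line_no: sequence data before any '>' header";
        }
        else {
            (my $seq = $line) =~ s/\s+//g;           # ignore embedded whitespace
            push @problems, "line $line_no: disallowed character '$1'"
                if $seq =~ /([^$allowed])/;
            $residues += length $seq;
        }
    }
    push @problems, "line $header_line: record has no sequence"
        if defined $header_line && !$residues;
    push @problems, "no FASTA records found" unless defined $header_line;
    return @problems;
}

# The submission from the original question: a bare '>' with no identifier.
my $submitted = ">\nBRKISLIGLATMSMLAFNTSAFALGTASSNSGASGKHWSVVGGAALVQPK\n"
              . "NGKNAAQNTVKFGGDVAPTLSVTYYINDNVGFELWGITKKLSYTAKTDAS\n";
print "$_\n" for fasta_problems($submitted);
```

A stricter alphabet can be passed as the second argument, for example
'ACGTNacgtn' for nucleotide-only submissions.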
Eventually we could think of supporting a validator slot in SeqIO to use
this type of method, I guess, although it would be an additional
performance hit.
-jason
On Wed, 11 Jun 2003, Matthew Laird wrote:
> Hello, I hope this is the correct place to ask this...
>
> I've been looking through the BioPerl documentation and the mailing list
> archives and am wondering if there is anything built to do sequence
> validation.
>
> What I mean is this, there are functions as I see to do things such as
> read in FASTA files (Bio::SeqIO) but how would one test if the file is
> valid? We're attempting to create a web interface where people can submit
> sequences for analysis, however people could submit faulty formatted
> files. Example:
> >
> BRKISLIGLATMSMLAFNTSAFALGTASSNSGASGKHWSVVGGAALVQPK
> NGKNAAQNTVKFGGDVAPTLSVTYYINDNVGFELWGITKKLSYTAKTDAS
>
> Bio::SeqIO doesn't throw any error on this, what it does do is begin at the
> line starting with "NGKN" as the beginning of the sequence. Yes this
> sequence violates the FASTA format, but in web interfaces you can't be
> sure people will submit a perfectly formatted file.
>
> Can anyone point me in the direction of a module which will validate the
> file as it's read for both format and that only allowed sequence letters
> are included? Or is this something which needs to be written? Ideally
> this should work for multiple formats as well.
>
> If such a module doesn't exist I suppose I'll begin working on one and
> submit the results to the collective since this seems like such a useful
> tool.
>
> Thanks.
>
>
--
Jason Stajich
Duke University
jason at cgt.mc.duke.edu