[Bioperl-l] Sequence Validation
Jason Stajich
jason at cgt.duhs.duke.edu
Wed Jun 11 16:07:04 EDT 2003
And to remind us how to change the type of sequence object that is created
by Bio::SeqIO:

my $sf = Bio::Seq::SeqFactory->new(-type => 'MySeq');
my $seqio = new Bio::SeqIO(...
                           -seqfactory => $sf);
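
For the 'J' and '5' case Matthew describes in the quoted text below, the "do whatever validation" step of a custom validate_seq() could be a strict character-class test. This is only a minimal sketch in plain Perl, not BioPerl's own check. The helper name is_strict_protein is hypothetical, and the allowed alphabet (the 20 standard amino acids plus B, Z, X and '*') is an assumption you may want to change.

```perl
use strict;
use warnings;

# Assumed alphabet: the 20 standard amino acids plus the ambiguity codes
# B, Z and X, and '*' for a stop. Matching is case-insensitive.
my $PROTEIN_RE = qr/\A[ACDEFGHIKLMNPQRSTVWYBZX*]+\z/i;

# Hypothetical helper: returns 1 if every character of $seq is in the
# assumed alphabet, 0 otherwise (also 0 for an empty or undefined string).
sub is_strict_protein {
    my ($seq) = @_;
    return 0 unless defined $seq && length $seq;
    return $seq =~ $PROTEIN_RE ? 1 : 0;
}

# The start of the sequence from this thread passes; a copy with a 'J'
# and a '5' inserted does not.
print is_strict_protein('BRKISLIGLATMSMLAFNTSAF') ? "ok\n" : "bad\n";
print is_strict_protein('BRKISJIG5ATM')           ? "ok\n" : "bad\n";
```

Inside a MySeq::validate_seq() like the one quoted below, the return value of such a helper would stand in for $isvalid.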
On Wed, 11 Jun 2003, Hilmar Lapp wrote:
> Hm. I thought you cannot have numbers in the string. At least that's
> what my copy of the code says.
>
> The current way of doing this would be to write your own module:
>
> package MySeq;
> use strict;
> use Bio::PrimarySeq;
> our @ISA = qw(Bio::PrimarySeq);
> sub validate_seq {
>     my ($self, $seq) = @_;
>     my $isvalid = 1;   # do whatever validation and set this accordingly
>     return $isvalid ? 1 : 0;
> }
> 1;
> __END__
>
> Then when you open a stream:
>
> my $seqio = Bio::SeqIO->new(-format => 'fasta', -fh => \*STDIN);
> $seqio->sequence_factory(Bio::Seq::SeqFactory->new(-type => "MySeq"));
> while(my $seq = $seqio->next_seq) {
> # now ref($seq) eq "MySeq"
> ...
> }
>
> Hth, -hilmar
>
> > -----Original Message-----
> > From: Matthew Laird [mailto:lairdm at sfu.ca]
> > Sent: Wednesday, June 11, 2003 10:55 AM
> > To: Jason Stajich
> > Cc: bioperl-l at portal.open-bio.org
> > Subject: Re: [Bioperl-l] Sequence Validation
> >
> >
> > Ahh, thank you. Using 1.2.1 works just fine, it seems we had 1.0.1
> > installed.
> >
> > The next issue in validation I've noticed (in my attempts to break
> > things) is the alphabet function in Bio::Seq. I tried putting a 'J'
> > and the number '5' into a sequence and it was still reported as a
> > protein sequence. Is this not the correct method to ensure a sequence
> > uses only the allowed characters? validate_seq() seems too general
> > for the task. Or again, would writing a quick little homebrew function
> > be the easiest?
> >
> > Thanks again.
> >
> > On Wed, 11 Jun 2003, Jason Stajich wrote:
> >
> > > Which version of bioperl are you using? The 1.2 branch and the
> > > main-trunk code (soon to be the 1.3 branch) parse that sequence just
> > > fine for me, although it could be that linefeeds are causing
> > > problems, I guess.
> > >
> > > use Bio::SeqIO;
> > > my $in = new Bio::SeqIO(-fh => \*DATA);
> > > my $seq = $in->next_seq;
> > > print $seq->display_id, "\n";
> > > print $seq->seq(), "\n";
> > > __DATA__
> > > >
> > > BRKISLIGLATMSMLAFNTSAFALGTASSNSGASGKHWSVVGGAALVQPK
> > > NGKNAAQNTVKFGGDVAPTLSVTYYINDNVGFELWGITKKLSYTAKTDAS
> > >
> > >
> > > As for validating, SeqIO will throw an error if something is
> > > unparseable; what we have suggested to people in the past is to use
> > > an eval block for these.
> > >
> > > If you still want a validator I would suggest a small lightweight
> > > method which given a string will attempt to guess the format and/or
> > > validate it rather than relying on SeqIO for this just yet.
> > >
> > > Eventually we could think of supporting a validator slot in SeqIO to
> > > use this type of method, I guess, although it would be an additional
> > > performance hit.
> > >
> > > -jason
> > >
> > > On Wed, 11 Jun 2003, Matthew Laird wrote:
> > >
> > > > Hello, I hope this is the correct place to ask this...
> > > >
> > > > I've been looking through the BioPerl documentation and the mailing
> > > > list archives and am wondering if there is anything built to do
> > > > sequence validation.
> > > >
> > > > What I mean is this: there are functions, as I see, to do things such
> > > > as read in FASTA files (Bio::SeqIO), but how would one test if the
> > > > file is valid? We're attempting to create a web interface where
> > > > people can submit sequences for analysis; however, people could
> > > > submit badly formatted files. Example:
> > > > >
> > > > BRKISLIGLATMSMLAFNTSAFALGTASSNSGASGKHWSVVGGAALVQPK
> > > > NGKNAAQNTVKFGGDVAPTLSVTYYINDNVGFELWGITKKLSYTAKTDAS
> > > >
> > > > Bio::SeqIO doesn't throw any error on this; what it does do is begin
> > > > at the line starting with "NGKN" as the beginning of the sequence.
> > > > Yes, this sequence violates the FASTA format, but in web interfaces
> > > > you can't be sure people will submit a perfectly formatted file.
> > > >
> > > > Can anyone point me in the direction of a module which will validate
> > > > the file as it's read, checking both the format and that only allowed
> > > > sequence letters are included? Or is this something which needs to be
> > > > written? Ideally this should work for multiple formats as well.
> > > >
> > > > If such a module doesn't exist I suppose I'll begin working on one
> > > > and submit the results to the collective, since this seems like such
> > > > a useful tool.
> > > >
> > > > Thanks.
> > > >
> > > >
> > >
> > > --
> > > Jason Stajich
> > > Duke University
> > > jason at cgt.mc.duke.edu
> > >
> >
> > --
> > Matthew Laird
> > SysAdmin/Web Developer, Brinkman Laboratory, MBB Dept.
> > Simon Fraser University
> >
> >
> > _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l at portal.open-bio.org
> > http://portal.open-bio.org/mailman/listinfo/bioperl-l
> >
>
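
Along the lines of the "small lightweight method" suggested in the quoted reply above, here is a rough sketch of a pre-check that looks at the raw FASTA text before it is handed to SeqIO. It is plain Perl and not part of BioPerl. The name fasta_problems is hypothetical. The rules it enforces (every record needs a non-empty ID after '>', and residues are letters, '*' or '-') are assumptions, not a formal definition of the format.

```perl
use strict;
use warnings;

# Hypothetical helper: scans FASTA text and returns a list of problem
# descriptions. An empty list means no problems were found.
sub fasta_problems {
    my ($text) = @_;
    my @problems;
    my $lineno      = 0;
    my $header_line = 0;   # line number of the current record's header
    my $residues    = 0;   # residues seen so far in the current record

    for my $line (split /\r?\n/, $text) {
        $lineno++;
        $line =~ s/\s+\z//;                # ignore trailing whitespace
        next unless length $line;          # skip blank lines
        if ($line =~ /\A>(.*)/) {
            my $id = $1;
            push @problems, "line $header_line: record has no sequence"
                if $header_line && !$residues;
            push @problems, "line $lineno: header has no ID"
                unless $id =~ /\S/;
            ($header_line, $residues) = ($lineno, 0);
        }
        elsif (!$header_line) {
            push @problems, "line $lineno: sequence data before any header";
        }
        else {
            push @problems, "line $lineno: unexpected character '$1'"
                if $line =~ /([^A-Za-z*-])/;
            $residues += length $line;
        }
    }
    push @problems, "line $header_line: record has no sequence"
        if $header_line && !$residues;
    push @problems, "no FASTA records found" unless $header_line;
    return @problems;
}

# The example from this thread: a '>' line with nothing after it.
my $text = ">\nBRKISLIGLATMSMLAFNTSAFALGTASSNSGASGKHWSVVGGAALVQPK\n"
         . "NGKNAAQNTVKFGGDVAPTLSVTYYINDNVGFELWGITKKLSYTAKTDAS\n";
print "$_\n" for fasta_problems($text);   # prints "line 1: header has no ID"
```

A web front end could refuse the submission when the list is non-empty, and otherwise pass the text on to Bio::SeqIO inside an eval block as usual. The character test here is deliberately loose; a stricter per-alphabet check can be layered on top.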
--
Jason Stajich
Duke University
jason at cgt.mc.duke.edu