[Bioperl-l] Sequence Validation

Hilmar Lapp hlapp at gnf.org
Wed Jun 11 12:59:03 EDT 2003


Hm. I thought you cannot have numbers in the string. At least that's
what my copy of the code says.

The current way of doing this would be to write your own module:

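package MySeq;
use strict;
use warnings;
use base qw(Bio::PrimarySeq);	# loads Bio::PrimarySeq and sets @ISA

# Overrides the validate_seq() inherited from Bio::PrimarySeq.
sub validate_seq {
	my ($self,$seq) = @_;
	my $isvalid = 1;
	# do whatever validation, clearing $isvalid on failure
	return $isvalid ? 1 : 0;
}
1;
__END__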

Then when you open a stream:

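	use Bio::SeqIO;
	use Bio::Seq::SeqFactory;

	my $seqio = Bio::SeqIO->new(-format => 'fasta', -fh => \*STDIN);
	# have the stream build MySeq objects instead of the default class
	$seqio->sequence_factory(
		Bio::Seq::SeqFactory->new(-type => "MySeq"));
	while(my $seq = $seqio->next_seq) {
		# now ref($seq) eq "MySeq"
		...
	}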
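
For what it's worth, here is a minimal sketch of what the validation step
could look like for a strict protein check. It is plain Perl with no BioPerl
dependency. The helper names (invalid_protein_chars, is_valid_protein) are
made up for illustration and are not part of BioPerl, and the allowed
character set is an assumption you would want to adjust:

```perl
use strict;
use warnings;

# ASSUMPTION: allowed residues are the 20 standard amino acids plus the
# ambiguity codes B, Z, X and '*' for a stop; edit the class to suit.
my $INVALID = qr/[^ACDEFGHIKLMNPQRSTVWYBZX*]/i;

# Hypothetical helper, not part of BioPerl: returns one [position, char]
# pair (1-based position) for each disallowed character in $seq.
sub invalid_protein_chars {
    my ($seq) = @_;
    my @bad;
    while ($seq =~ /($INVALID)/g) {
        # pos() is the offset just past the one-character match,
        # which equals the 1-based position of that character.
        push @bad, [pos($seq), $1];
    }
    return @bad;
}

# Hypothetical helper: true only for a non-empty string with no
# disallowed characters.
sub is_valid_protein {
    my ($seq) = @_;
    return 0 unless defined $seq && length $seq;
    my @bad = invalid_protein_chars($seq);
    return @bad ? 0 : 1;
}

# Report every offending character so a web form can show it to the user.
for my $input ('NGKNAAQNTVKFGG', 'NGKJ5AAQ') {
    my @bad = invalid_protein_chars($input);
    print "$input: ",
        (@bad ? join(', ', map { "'$_->[1]' at $_->[0]" } @bad) : 'ok'),
        "\n";
}
```

Inside MySeq::validate_seq you could then simply return
is_valid_protein($seq). Returning the list of offending characters as well
as a yes/no answer is a design choice aimed at the web-form case, where
telling the user what is wrong is more useful than a bare rejection.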

Hth, -hilmar

> -----Original Message-----
> From: Matthew Laird [mailto:lairdm at sfu.ca] 
> Sent: Wednesday, June 11, 2003 10:55 AM
> To: Jason Stajich
> Cc: bioperl-l at portal.open-bio.org
> Subject: Re: [Bioperl-l] Sequence Validation
> 
> 
> Ahh, thank you.  Using 1.2.1 works just fine, it seems we had 1.0.1 
> installed.
> 
> The next issue in validation I've noticed (in my attempts to break
> things) is the alphabet function in Bio::Seq.  I tried putting a 'J'
> and the number '5' into a sequence and it was still reported as a
> protein sequence.  Is this not the correct method to ensure a sequence
> uses only the allowed characters?  validate_seq() seems too general for
> the task.  Or again, would writing a quick little homebrew function be
> the easiest?
> 
> Thanks again.
> 
> On Wed, 11 Jun 2003, Jason Stajich wrote:
> 
> > Which version of bioperl are you using? 1.2 branch and the main-trunk
> > code (soon to be 1.3 branch) parse that sequence just fine for me,
> > although it could be that linefeeds are causing problems I guess.
> > 
> > use Bio::SeqIO;
> > my $in = Bio::SeqIO->new(-fh => \*DATA);
> > my $seq = $in->next_seq;
> > print $seq->display_id, "\n";
> > print $seq->seq(), "\n";
> > __DATA__
> > >
> > BRKISLIGLATMSMLAFNTSAFALGTASSNSGASGKHWSVVGGAALVQPK
> > NGKNAAQNTVKFGGDVAPTLSVTYYINDNVGFELWGITKKLSYTAKTDAS
> > 
> > 
> > As for validating, SeqIO will throw an error if something is
> > unparseable; what we have suggested to people in the past is to use
> > an eval block for these.
> > 
> > If you still want a validator I would suggest a small lightweight 
> > method which given a string will attempt to guess the format and/or 
> > validate it rather than relying on SeqIO for this just yet.
> > 
> > Eventually we could think of supporting a validator slot in SeqIO to
> > use this type of method I guess, although it would be an additional
> > performance hit.
> > 
> > -jason
> > 
> > On Wed, 11 Jun 2003, Matthew Laird wrote:
> > 
> > > Hello, I hope this is the correct place to ask this...
> > >
> > > I've been looking through the BioPerl documentation and the mailing
> > > list archives and am wondering if there is anything built to do
> > > sequence validation.
> > >
> > > What I mean is this: there are functions, as I see, to do things such
> > > as read in FASTA files (Bio::SeqIO), but how would one test if the
> > > file is valid?  We're attempting to create a web interface where
> > > people can submit sequences for analysis, however people could
> > > submit badly formatted files.  Example:
> > > >
> > > BRKISLIGLATMSMLAFNTSAFALGTASSNSGASGKHWSVVGGAALVQPK
> > > NGKNAAQNTVKFGGDVAPTLSVTYYINDNVGFELWGITKKLSYTAKTDAS
> > >
> > > Bio::SeqIO doesn't throw any error on this; what it does do is begin
> > > at the line starting with "NGKN" as the beginning of the sequence.
> > > Yes, this sequence violates the FASTA format, but in web interfaces
> > > you can't be sure people will submit a perfectly formatted file.
> > >
> > > Can anyone point me in the direction of a module which will validate
> > > the file as it's read, both for format and that only allowed sequence
> > > letters are included?  Or is this something which needs to be
> > > written?  Ideally this should work for multiple formats as well.
> > >
> > > If such a module doesn't exist I suppose I'll begin working on one
> > > and submit the results to the collective since this seems like such
> > > a useful tool.
> > >
> > > Thanks.
> > >
> > >
> > 
> > --
> > Jason Stajich
> > Duke University
> > jason at cgt.mc.duke.edu
> > 
> 
> -- 
> Matthew Laird
> SysAdmin/Web Developer, Brinkman Laboratory, MBB Dept.
> Simon Fraser University
> 
> 
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at portal.open-bio.org 
> http://portal.open-bio.org/mailman/listinfo/bioperl-l
> 


