[Bioperl-l] validating a sequence

Jason Stajich jason@cgt.mc.duke.edu
Tue, 26 Feb 2002 15:57:18 -0500 (EST)


On Mon, 25 Feb 2002, Andreas Matern wrote:

> Forgive me if this answer occurs somewhere else, but. . .
>
> I need to validate FASTA sequences. The web interface (another
> developer, can't touch his code) allows users to cut and paste, and many
> of them cut and paste sequences with numbers in them
>
> (i.e.
> >mysequence
> 1ACACGATCGACTGACATCGTCAGTACGTCGATACGATCGACTGACTAGCTC
> 51AACTCGTCGTCGTCGTCGCTGCTCGTCGCTGCTCGTCTGCTCGTCGTC
>
> etc.)
>
> The FASTA file is turned into a Bio::Index::Fasta by a cron job
> And then I (normally) run
>
> @ids = $inx->get_all_primary_ids();
> foreach $id (@ids) {
> 	my $seq = $inx->fetch($id);
> 	....do stuff with seq....
> 	....connect to database...
> 	....etc....
> }
>
> This of course dies when the $seq is screwy:
>
> MSG: Attempting to set the sequence to [1ACA....] which does not look
> healthy
>
> I see the  $seq->validate_seq, but I'm not sure how to use it in my
> context
>
You can protect these in an eval { } block, but I'm not sure when you want
to do the check: do you want to kick bad entries out of the db before they
are indexed, or just handle them semi-nicely?  As for checking before
indexing, the only way I can think of off the top of my head is to
pre-process the file with Bio::SeqIO, protect each parse with eval { }, and
skip any record that throws, like this (I'm still not sure what your
workflow is, so I'm not sure this fits your scheme).  Note: up till now we
haven't done much to handle badly formatted data files gracefully.

use strict;
use warnings;
use Bio::SeqIO;

# you're going to build a new "CLEAN" db
my $in  = Bio::SeqIO->new(-file => 'webdump.fa',     -format => 'fasta');
my $out = Bio::SeqIO->new(-file => '>newwebdump.fa', -format => 'fasta');

while (1) {
	# eval each parse on its own so one bad record can't end the loop;
	# this assumes the parser has already read past the bad record
	my $seq = eval { $in->next_seq };
	if ($@) {
		print STDERR "skipping a sequence with error\n$@";
		next;
	}
	last unless defined $seq;   # no error and no sequence: end of file
	$out->write_seq($seq);
}
$out->close();
# index webdump again

Note that I'm not 100% sure our throws end up getting caught by the eval,
so we may need to catch other signals as well. Let me know if this doesn't
work.

> Any suggestions, especially for stripping out non-IUPAC characters from
> a FASTA string, would be greatly appreciated...
>
> -Andreas
>
>
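
On stripping non-IUPAC characters: if the numbers are the only problem, it
may be simpler to clean the raw text before BioPerl ever sees it. Below is a
minimal sketch in plain Perl, assuming nucleotide sequences and one FASTA
line at a time. clean_fasta_line is a hypothetical helper, not a BioPerl
function, and the character class is my reading of the IUPAC nucleotide
codes plus '-' for gaps; protein data would need a different alphabet.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of BioPerl): clean one line of a FASTA file.
# Header lines (starting with '>') are returned unchanged apart from the
# line ending; on sequence lines every character that is not an IUPAC
# nucleotide code or a gap is removed. The result has no trailing newline.
sub clean_fasta_line {
    my ($line) = @_;
    $line =~ s/\r?\n\z//;                    # drop the line ending
    return $line if $line =~ /^>/;           # leave headers alone
    $line =~ s/[^ACGTUNRYSWKMBDHV-]//gi;     # strips digits, spaces, etc.
    return $line;
}
```

Something like  print clean_fasta_line($_), "\n" while <STDIN>;  would then
filter webdump.fa into a cleaned copy before the cron job indexes it. Note
that this silently discards characters, so a line that was all junk comes
out empty rather than being reported.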

-- 
Jason Stajich
Duke University
jason@cgt.mc.duke.edu