[Bioperl-l] validating a sequence
Jason Stajich
jason@cgt.mc.duke.edu
Tue, 26 Feb 2002 15:57:18 -0500 (EST)
On Mon, 25 Feb 2002, Andreas Matern wrote:
> Forgive me if this answer occurs somewhere else, but. . .
>
> I need to validate FASTA sequences. The web interface (another
> developer, can't touch his code) allows users to cut and paste, and many
> of them cut and paste sequences with numbers in them
>
> (i.e.
> >mysequence
> 1ACACGATCGACTGACATCGTCAGTACGTCGATACGATCGACTGACTAGCTC
> 51AACTCGTCGTCGTCGTCGCTGCTCGTCGCTGCTCGTCTGCTCGTCGTC
>
> etc.)
>
> The FASTA file is turned into a Bio::Index::Fasta by a cron job
> And then I (normally) run
>
> @ids = $inx->get_all_primary_ids();
> foreach $id (@ids) {
> my $seq = $inx->fetch($id);
> ....do stuff with seq....
> ....connect to database...
> ....etc....
> }
>
> This of course dies when the $seq is screwey (
>
> MSG: Attempting to set the sequence to [1ACA....] which does not look
> healthy
>
> I see the $seq->validate_seq, but I'm not sure how to use it in my
> context
>
You can protect these in an eval { } block, but I'm not sure at what point
you want to evaluate: do you want to kick things out of the db before they
are indexed, or just handle bad entries semi-nicely? As for checking things
before they are indexed, the only way I can think of off the top of my head
is to pre-process the file with Bio::SeqIO, protect the parse with eval {},
and carry on with the loop after a bad record, like this (I'm still not
sure what your workflow is, so I'm not sure this works in your scheme).
Note: up till now we haven't done much to handle badly formatted data
files gracefully.
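use strict;
use Bio::SeqIO;

# you're going to build a new "CLEAN" db
my $in  = Bio::SeqIO->new(-file => 'webdump.fa',     -format => 'fasta');
my $out = Bio::SeqIO->new(-file => '>newwebdump.fa', -format => 'fasta');
while( 1 ) {
    # protect each parse separately so one bad record does not end the loop
    my $seq = eval { $in->next_seq };
    if( $@ ) {
        print STDERR "skipping a sequence with error \n$@";
        next;
    }
    last unless defined $seq;    # no more records
    $out->write_seq($seq);
}
$out->close();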
# index webdump again
Now - I'm not 100% sure that our throws end up getting caught in the eval
so we may need to catch other signals - let me know if this doesn't work.
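
If you would rather just handle bad entries semi-nicely, the same eval
protection can go around each fetch in your own loop. Below is a rough,
self-contained sketch of that pattern. Note that fetch_seq and %records are
made-up stand-ins for $inx->fetch($id) and the index (they are not BioPerl
calls), and the sketch assumes the exception is raised with a plain die.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the index: id => raw sequence string.
my %records = (
    good => 'ACACGATCGACTGAC',
    bad  => '1ACACGATCG',
);

# Hypothetical stand-in for $inx->fetch($id): returns the sequence string,
# or dies with a message shaped like the "does not look healthy" error.
sub fetch_seq {
    my ($id) = @_;
    my $str = $records{$id};
    die "MSG: Attempting to set the sequence to [$str] which does not look healthy\n"
        if $str =~ /[^A-Za-z]/;
    return $str;
}

my (@ok, @skipped);
foreach my $id (sort keys %records) {
    # eval traps the die, so one bad record does not kill the whole run
    my $seq = eval { fetch_seq($id) };
    if ($@) {
        print STDERR "skipping $id: $@";
        push @skipped, $id;
        next;
    }
    # ....do stuff with seq....
    push @ok, $id;
}
print "ok: @ok\nskipped: @skipped\n";
```

With the real index you would swap fetch_seq($id) for the fetch call on
$inx and keep the rest of the loop body as it is.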
> Any suggestions, especially for stripping out non-IUPAC characters from
> a FASTA string, would be greatly appreciated...
>
> -Andreas
>
>
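For stripping the numbers out before the file is indexed, one option is to
clean the raw text with plain Perl before any parser sees it. This is only
a sketch: clean_fasta is a hypothetical helper (not part of BioPerl), and it
assumes nucleotide data, so the kept alphabet is the IUPAC nucleotide codes
in either case. Protein sequences would need a different character set.

```perl
use strict;
use warnings;

# Hypothetical helper: clean FASTA text passed in as one string.
# Header lines (starting with '>') are kept untouched. On every other line,
# each character that is not an IUPAC nucleotide code is deleted, which
# removes digits, spaces and stray carriage returns.
sub clean_fasta {
    my ($text) = @_;
    my @out;
    for my $line (split /\n/, $text) {
        if ($line !~ /^>/) {
            # /c complements the list, /d deletes the matched characters
            $line =~ tr/ACGTURYKMSWBDHVNacgturykmswbdhvn//cd;
            next if $line eq '';    # drop lines with nothing left
        }
        push @out, $line;
    }
    return join("\n", @out) . "\n";
}

my $raw = ">mysequence\n1ACACGATCG\n51AACTCGTCG\n";
print clean_fasta($raw);
```

Run over the whole dump before the cron job indexes it, this keeps every
record instead of skipping the bad ones, at the cost of silently accepting
whatever letters remain.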
--
Jason Stajich
Duke University
jason@cgt.mc.duke.edu