[Bioperl-l] Validate Fasta

Wed Mar 3 06:26:21 EST 2004

The only difference is (I think) that Michael's code will use my
Bio::Tools::GuessSeqFormat module to guess that the format is
FASTA (using BioPerl 1.4).

Currently, the guesser will say "fasta" if the first line of the
file matches /^>\w/ (and no other format matches for the first
line), or if any other line matches /^[A-IK-NP-Z]+$/i (and no
other format matches the same line).
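
To make the two rules above concrete, here is a minimal sketch in
plain Perl.  It is not the module's code: looks_like_fasta is a
made-up name, and the sketch leaves out the "no other format
matches" condition, so it is more permissive than the real guesser.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Bio::Tools::GuessSeqFormat.
# It applies only the two FASTA patterns quoted above and ignores
# the "no other format matches" condition.
sub looks_like_fasta {
    my @lines = @_;
    return 0 unless @lines;

    # Rule 1: the first line looks like a FASTA header.
    return 1 if $lines[0] =~ /^>\w/;

    # Rule 2: any later line consists only of sequence letters.
    for my $line (@lines[1 .. $#lines]) {
        return 1 if $line =~ /^[A-IK-NP-Z]+$/i;
    }
    return 0;
}

print looks_like_fasta(">seq1 test", "ACGT") ? "fasta\n" : "unknown\n";
print looks_like_fasta("#!/usr/bin/perl", "my \$x = 1;") ? "fasta\n" : "unknown\n";
```

Note that a single line made up of plain letters is enough to
satisfy the second rule, which is why this is only a guess and not
a validation.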

I think that sequence file format validation ought to be an
optional part of Bio::SeqIO or a separate module.  I haven't
looked into Bio::SeqIO to see what goes on in there though...
My module just provides a rough guess.
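
For the strict check that Michael asked about, something along the
following lines might do as a starting point.  It is a sketch under
my own assumptions and not a BioPerl API: validate_fasta_dna is a
made-up name, the alphabet is limited to A, C, G, T and N (IUPAC
ambiguity codes would need adding), descriptions after the ID may
contain spaces, and blank lines or whitespace in the sequence are
treated as errors.

```perl
use strict;
use warnings;

# Hypothetical strict checker; not a BioPerl API.
# Assumed rules: header lines are ">" plus a non-space ID; sequence
# lines hold only A, C, G, T or N; no blank lines; no empty records.
sub validate_fasta_dna {
    my ($text) = @_;
    my @lines = split /\n/, $text, -1;
    pop @lines if @lines && $lines[-1] eq '';   # allow one trailing newline
    die "empty input\n" unless @lines;

    my $residues = -1;   # -1 means no header seen yet
    my $n = 0;
    for my $line (@lines) {
        $n++;
        if ($line =~ /^>/) {
            die "line $n: header has no ID\n" unless $line =~ /^>\S/;
            die "line $n: previous record has no sequence\n" if $residues == 0;
            $residues = 0;
        }
        else {
            die "line $n: sequence before first header\n" if $residues < 0;
            die "line $n: unexpected characters\n" unless $line =~ /^[ACGTN]+\z/i;
            $residues += length $line;
        }
    }
    die "last record has no sequence\n" if $residues == 0;
    return 1;
}

# Usage: the first string passes, the second has a space in the sequence.
for my $text (">seq1\nACGT\nNNAC\n", ">seq1\nAC GT\n") {
    my $ok = eval { validate_fasta_dna($text) };
    print $ok ? "valid\n" : "invalid: $@";
}
```

Dying on the first problem keeps the check simple; collecting all
errors before reporting would be an easy change if that is
preferred.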

Andreas

On Wed, Mar 03, 2004 at 10:45:16AM +0000, john herbert wrote:
> Hello Michael.
> I'm not a BioPerl programmer extraordinaire (so anyone please correct
> me if I am wrong), but I think the -format flag should help here.
> 
> Try 
> 
> my $in = Bio::SeqIO->new(-file => "rubbish.fasta", -format => 'Fasta');
> my $out = Bio::SeqIO->new(-file => ">rubbish2.fasta", -format => 'Fasta');
> 
> I am pretty sure if you put this change in your code and run it on your
> very nice Perl fasta sequence, it will complain. 
> 
> Kind regards,
> 
> John.
> 
> 
> >>> "michael watson (IAH-C)" <michael.watson at bbsrc.ac.uk> 03/03/2004
> 10:16:04 >>>
> Hi
> 
> I have searched the archives and only come up with one answer, and it
> didn't work - I want to validate a FASTA sequence (DNA).  What I mean is
> that if I am given a perfect FASTA sequence, then that's OK, but if there
> are ANY whitespace characters, or any other characters that really
> shouldn't be there, I want it to throw an error.  The script below was
> suggested by Jason in 2002:
> 
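> use strict;
> use warnings;
> use Bio::SeqIO;
> 
> my $in = Bio::SeqIO->new(-file => "rubbish.fasta");
> my $out = Bio::SeqIO->new(-file => ">rubbish2.fasta");
> 
> # Wrap each read/write in its own eval so that an error is reported
> # and the loop carries on, rather than using goto to jump back into
> # a loop inside the eval block.
> while (1) {
> 	my $seq;
> 	my $ok = eval {
> 		$seq = $in->next_seq;
> 		$out->write_seq($seq) if defined $seq;
> 		1;
> 	};
> 	if (!$ok) {
> 		print "There's an Error!\n";
> 		next;
> 	}
> 	last unless defined $seq;	# no more sequences
> }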
> 
> I actually fired this at one of my scripts, a Perl script that clearly
> wasn't a FASTA sequence - it has #'s, \ts, \ns and all sorts of non-DNA
> sequence characters.  Here is the result:
> 
> >#!/usr/bin/perl
> my$backups={'mysql'="/mick/mysql/",'apache'="/res/upity/apac
> he",'mwatson'="/res/upity/mwatson",'www'="/www/Docs",'ensemb
> l'="/too/fools/ensembl",'cgi'="/www/cgi-bin/"};my$location="
> /mick/backups";my$date=`date`;my@date=split(/\s+/,$date);my$
> date=join("_",@date[0..2],$date[$#date]);print"$date\n";#whi
> le(my($name,$dir)=each%{$backups}){foreach$name(qw(apachemys
> qlmwatsonwwwensemblcgi)){$dir=$backups-{$name};print"tarzipp
> ing$dir\n";system("/bin/tar-c$dir$location/$name.$date.tar")
> ;system("/bin/gzip$location/$name.$date.tar");}
> 
> This is undoubtedly a wonderfully FASTA formatted perl script, but...
> 
> Anyone?  Any ideas?
> 
> Thanks in advance for the help!
> 
> Mick
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at portal.open-bio.org 
> http://portal.open-bio.org/mailman/listinfo/bioperl-l

-- 
|-][-|      Andreas Kähäri                                |[==]|
|[--]|      EMBL, European Bioinformatics Institute       |=][=|
|-][-|      Wellcome Trust Genome Campus                  |[==]|
|[--]|      Hinxton, Cambridgeshire, CB10 1SD             |=][=|
|-][-|      United Kingdom                                |[==]|