[Bioperl-l] Validate Fasta
Andreas Kahari
ak at ebi.ac.uk
Wed Mar 3 06:26:21 EST 2004
The only difference is (I think) that Michael's code will use my
Bio::Tools::GuessSeqFormat module to guess that the format is
FASTA (using BioPerl 1.4).
Currently, the guesser will say "fasta" if the first line of the
file matches /^>\w/ (and no other format matches for the first
line), or if any other line matches /^[A-IK-NP-Z]+$/i (and no
other format matches the same line).
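
To make the two rules above concrete, here is a rough sketch in plain Perl.
It is not the code from Bio::Tools::GuessSeqFormat: the helper name
looks_like_fasta is hypothetical, and the "no other format matches" part of
each rule is left out, so it only shows the two FASTA patterns themselves.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT part of Bio::Tools::GuessSeqFormat.
# It applies only the two FASTA patterns quoted above and skips the
# "no other format matches" checks that the real guesser performs.
sub looks_like_fasta {
    my @lines = @_;
    return 0 unless @lines;

    # Rule 1: the first line looks like a FASTA header.
    return 1 if $lines[0] =~ /^>\w/;

    # Rule 2: any later line consists solely of letters from the
    # class A-I, K-N, P-Z (case-insensitive).
    for my $line (@lines[1 .. $#lines]) {
        return 1 if $line =~ /^[A-IK-NP-Z]+$/i;
    }
    return 0;
}

print looks_like_fasta(">seq1 test", "ACGT")
    ? "fasta\n" : "unknown\n";
print looks_like_fasta("#!/usr/bin/perl", "my \$x = 1;")
    ? "fasta\n" : "unknown\n";
```

Note that neither pattern looks at the whole file, which is why the result
is a guess about the format and says nothing about whether the file is valid.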
I think that sequence file format validation ought to be an
optional part of Bio::SeqIO or a separate module. I haven't
looked into Bio::SeqIO to see what goes on in there though...
My module just provides a rough guess.
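
For comparison, a strict check of the kind Michael asks for could be
sketched roughly as below. This is not an existing BioPerl module; the
function name validate_dna_fasta and the rules it enforces (headers start
with ">" followed by a non-space character, sequence lines contain only
A, C, G, T or N, no blank lines and no whitespace inside the sequence) are
assumptions chosen for illustration.

```perl
use strict;
use warnings;

# Hypothetical strict checker, NOT a BioPerl API.
# Returns a list of error messages; an empty list means the text passed.
# Assumed rules: header lines start with '>' plus a non-space character,
# sequence lines hold only A, C, G, T or N (either case), and blank
# lines or embedded whitespace count as errors.
sub validate_dna_fasta {
    my ($text) = @_;
    my @errors;
    my ($line_no, $seen_header, $body_lines) = (0, 0, 0);

    $text =~ s/\n\z//;    # allow a single trailing newline
    for my $line (split /\n/, $text, -1) {
        $line_no++;
        if ($line =~ /^>/) {
            push @errors, "line $line_no: empty header"
                unless $line =~ /^>\S/;
            push @errors, "line $line_no: previous header has no sequence"
                if $seen_header && !$body_lines;
            ($seen_header, $body_lines) = (1, 0);
            next;
        }
        $body_lines++;
        if (!$seen_header) {
            push @errors, "line $line_no: data before first header";
        }
        elsif ($line !~ /\A[ACGTN]+\z/i) {
            push @errors, "line $line_no: invalid sequence line";
        }
    }
    push @errors, "no FASTA header found" unless $seen_header;
    push @errors, "last header has no sequence"
        if $seen_header && !$body_lines;
    return @errors;
}

my @problems = validate_dna_fasta(">seq1\nACGT\nAC GT\n");
print @problems ? join("\n", @problems) . "\n" : "OK\n";
```

The allowed alphabet is the part most likely to need changing, for example
to admit the full set of IUPAC ambiguity codes.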
Andreas
On Wed, Mar 03, 2004 at 10:45:16AM +0000, john herbert wrote:
> Hello Michael.
> I'm not a BioPerl programmer extraordinaire (so anyone, correct me if I
> am wrong), but I think the -format flag should help here.
>
> Try
>
> my $in = Bio::SeqIO->new(-file => "rubbish.fasta", -format => 'Fasta');
> my $out = Bio::SeqIO->new(-file => ">rubbish2.fasta", -format => 'Fasta');
>
> I am pretty sure if you put this change in your code and run it on your
> very nice Perl fasta sequence, it will complain.
>
> Kind regards,
>
> John.
>
>
> >>> "michael watson (IAH-C)" <michael.watson at bbsrc.ac.uk> 03/03/2004
> 10:16:04 >>>
> Hi
>
> I have searched the archives and only come up with one answer, and it
> didn't work - I want to validate a FASTA sequence (DNA). What I mean is
> that if I am given a perfect FASTA sequence, then that's OK, but if there
> are ANY whitespace characters, or any other characters that really
> shouldn't be there, I want it to throw an error. The script below was
> suggested by Jason in 2002:
>
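> use Bio::SeqIO;
>
> my $in = Bio::SeqIO->new(-file => "rubbish.fasta");
> my $out = Bio::SeqIO->new(-file => ">rubbish2.fasta");
>
> # Wrap each read in its own eval so that an error is reported and the
> # loop carries on (a goto cannot jump back into the eval block).
> while (1) {
>     my $seq = eval { $in->next_seq };
>     if ($@) {
>         print "There's an Error!\n";
>         next;
>     }
>     last unless defined $seq;
>     $out->write_seq($seq);
> }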
>
> I actually fired this at one of my scripts, a Perl script that clearly
> wasn't a FASTA sequence - it has #'s, \ts, \ns and all sorts of non-DNA
> sequence characters. Here is the result:
>
> >#!/usr/bin/perl
> my$backups={'mysql'="/mick/mysql/",'apache'="/res/upity/apac
> he",'mwatson'="/res/upity/mwatson",'www'="/www/Docs",'ensemb
> l'="/too/fools/ensembl",'cgi'="/www/cgi-bin/"};my$location="
> /mick/backups";my$date=`date`;my@date=split(/\s+/,$date);my$
> date=join("_",@date[0..2],$date[$#date]);print"$date\n";#whi
> le(my($name,$dir)=each%{$backups}){foreach$name(qw(apachemys
> qlmwatsonwwwensemblcgi)){$dir=$backups-{$name};print"tarzipp
> ing$dir\n";system("/bin/tar-c$dir$location/$name.$date.tar")
> ;system("/bin/gzip$location/$name.$date.tar");}
>
> This is undoubtedly a wonderfully FASTA formatted perl script, but...
>
> Anyone? Any ideas?
>
> Thanks in advance for the help!
>
> Mick
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at portal.open-bio.org
> http://portal.open-bio.org/mailman/listinfo/bioperl-l
--
|-][-| Andreas Kähäri |[==]|
|[--]| EMBL, European Bioinformatics Institute |=][=|
|-][-| Wellcome Trust Genome Campus |[==]|
|[--]| Hinxton, Cambridgeshire, CB10 1SD |=][=|
|-][-| United Kingdom |[==]|