[Bioperl-l] Perl Regular Expressions sought that match typical bioinformatics applications input data

Fri Mar 26 07:01:49 EST 2004

On Fri, Mar 26, 2004 at 12:36:33PM -0800, M.Shahid wrote:
> Dear All,
> For a small software project, I'm looking for Perl Regular
> Expressions that match file formats and related input used
> for bioinformatics applications, e.g. FASTA, Clustal, 
> PROSITE patterns, PAM similarity matrices, etc. Note 
> that I do not want to parse such data, just check whether 
> the format is correct. Is there any compilation out there ? 
> If not, what is the best starting point to collect such Regular 
> Expressions ?

I'm wondering whether regular expressions really would be the
best thing to use.  It is common knowledge that it is *very*
hard to validate formats such as HTML using regular expressions
alone (one needs a proper validating parser).

Detecting is another matter.  There is a module called
Bio::Tools::GuessSeqFormat in BioPerl, which I originally wrote.
It doesn't claim to validate the various input formats or even
that it will deliver a correct answer.  It will only apply
regular expressions to the header lines of data files and try to
determine the format of the file (or string).
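
As a rough illustration of that kind of header-line detection, here
is a minimal Perl sketch.  It is not the code in
Bio::Tools::GuessSeqFormat: the pattern table and the guess_format
name are made up for this example, and the patterns are deliberately
simplistic.

```perl
use strict;
use warnings;

# Hypothetical table of header-line patterns.  These are illustrative
# only and are NOT the patterns used by Bio::Tools::GuessSeqFormat.
my @patterns = (
    [ fasta   => qr/^>\S/ ],
    [ clustal => qr/^CLUSTAL/ ],
    [ genbank => qr/^LOCUS\s+\S+/ ],
    [ embl    => qr/^ID\s+\S+/ ],
);

# Return a guessed format name for a string of data, or undef.
# Only the first non-blank line is inspected, so this detects a
# format; it does not validate the rest of the data.
sub guess_format {
    my ($text) = @_;
    for my $line (split /\n/, $text) {
        next if $line =~ /^\s*$/;        # skip leading blank lines
        for my $entry (@patterns) {
            my ($name, $re) = @$entry;
            return $name if $line =~ $re;
        }
        return;                          # first real line matched nothing
    }
    return;                              # no non-blank lines at all
}

print guess_format(">seq1 test\nACGT\n"), "\n";   # prints "fasta"
```

Note that a file whose first line happens to start with ">" but
whose body is garbage would still be reported as FASTA here.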

I don't know what the best approach to format validation would
be, but it is my belief that regular expressions cannot be used
alone.
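
One way to read that, sketched for FASTA only: per-line regular
expressions combined with a little state that no single pattern
would carry easily.  This is only an illustration under simplifying
assumptions (headers start with ">", sequence lines contain only
letters, "*" and "-", and every record needs at least one residue);
the looks_like_fasta name is hypothetical and this is not an
established validator.

```perl
use strict;
use warnings;

# Return 1 if $text looks like well-formed FASTA under the simplified
# assumptions stated above, 0 otherwise.
sub looks_like_fasta {
    my ($text) = @_;
    my $records  = 0;    # number of header lines seen so far
    my $residues = 0;    # sequence characters seen in the current record
    for my $line (split /\n/, $text) {
        $line =~ s/\r$//;                    # tolerate DOS line endings
        next if $line =~ /^\s*$/;            # ignore blank lines
        if ($line =~ /^>\S/) {               # header line
            return 0 if $records && !$residues;   # previous record was empty
            $records++;
            $residues = 0;
        }
        elsif ($line =~ /^[A-Za-z*-]+$/) {   # sequence line
            return 0 unless $records;        # sequence before any header
            $residues += length $line;
        }
        else {
            return 0;                        # neither header nor sequence
        }
    }
    return ($records && $residues) ? 1 : 0;
}
```

The regular expressions classify each line; the counters enforce
the ordering rules between lines, which is the part a lone pattern
handles poorly.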

Regards,
Andreas

-- 
| {} | Andreas Kähäri      EMBL, European Bioinformatics Institute
|{}{}|                     Wellcome Trust Genome Campus
| {} | DAS Project Leader  Hinxton, Cambridgeshire, CB10 1SD
|{}{}| Ensembl Developer   United Kingdom