[Bioperl-l] Perl Regular Expressions sought that match typical
bioinformatics applications input data
Andreas Kahari
ak at ebi.ac.uk
Fri Mar 26 07:01:49 EST 2004
On Fri, Mar 26, 2004 at 12:36:33PM -0800, M.Shahid wrote:
> Dear All,
> For a small software project, I'm looking for Perl Regular
> Expressions that match file formats and related input used
> for bioinformatics applications, e.g. FASTA, Clustal,
> PROSITE patterns, PAM similarity matrices, etc. Note
> that I do not want to parse such data, just check whether
> the format is correct. Is there any compilation out there ?
> If not, what is the best starting point to collect such Regular
> Expressions ?
I'm wondering whether regular expressions really would be the
best thing to use. It is common knowledge that it is *very*
hard to validate formats such as HTML using regular
expressions (one needs a proper validating parser).
Detecting is another matter. There is a module called
Bio::Tools::GuessSeqFormat in BioPerl that I originally wrote.
It doesn't claim to validate the various input formats or even
that it will deliver a correct answer. It will only apply
regular expressions to the header lines of data files and try to
determine the format of the file (or string).
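
To illustrate the difference between detecting and validating, here
is a minimal sketch of header-based detection. It is written in
Python for brevity and is not the code of Bio::Tools::GuessSeqFormat;
the function name guess_format, the HEADER_PATTERNS table and the
patterns themselves are assumptions made up for this example.

```python
import re

# Illustrative header patterns only. These are assumptions for the sketch,
# not the expressions used by Bio::Tools::GuessSeqFormat.
HEADER_PATTERNS = [
    ("fasta", re.compile(r"^>\S")),
    ("clustalw", re.compile(r"^CLUSTAL\b")),
    ("genbank", re.compile(r"^LOCUS\s+\S+")),
    ("embl", re.compile(r"^ID\s{3}\S+")),
]


def guess_format(text):
    """Return a format name guessed from the first non-blank line, or None."""
    for line in text.splitlines():
        if not line.strip():
            continue  # skip leading blank lines
        for name, pattern in HEADER_PATTERNS:
            if pattern.search(line):
                return name
        return None  # first non-blank line matched no known header
    return None  # empty input
```

Note that input consisting of a FASTA header line followed by garbage
would still be reported as "fasta" by this sketch: only the header is
inspected, so nothing is validated.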
I don't know what the best approach to format validation would
be, but it is my belief that regular expressions cannot be used
alone.
Regards,
Andreas
--
| {} | Andreas Kähäri EMBL, European Bioinformatics Institute
|{}{}| Wellcome Trust Genome Campus
| {} | DAS Project Leader Hinxton, Cambridgeshire, CB10 1SD
|{}{}| Ensembl Developer United Kingdom