[Biojava-dev] File Validator

Mark Schreiber markjschreiber at gmail.com
Sun Dec 7 01:03:28 UTC 2008


I agree that a file validator would be excellent, although it would
sometimes be hard to write. The problem is mainly with the flat file
formats. When we wrote the biojavax parsers we tried to make them
conform to the format descriptions given by NCBI etc. The problem is
that the files themselves don't always conform to those descriptions.

I think a possible problem with NCBI is that all their flat files are
produced from ASN.1 (kind of like XML). Like XML, ASN.1 can be
validated quite easily. The flat files are produced by a transformation
of the ASN.1, so they aren't always going to match the description.
Finally other people produce 'Genbank' and 'EMBL' files that are
really just a similar format but not the real thing.

One of the most troublesome formats is FASTA. Not because it is
difficult but because people try to code all manner of metadata into
the header without any convention existing.
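
As a rough illustration of what a FASTA check can realistically cover,
here is a minimal sketch. It validates only the structure (headers and
sequence lines) and reports problems by line number. It makes no attempt
to interpret the header text, since there is no convention to check it
against. FastaCheck and validate are hypothetical names, not part of
BioJava, and the set of allowed sequence characters is an assumption.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not a BioJava class: structural FASTA check only.
final class FastaCheck {

    // Returns one message per problem found; an empty list means no problems.
    static List<String> validate(Reader in) {
        List<String> problems = new ArrayList<>();
        int lineNo = 0;
        int headerLine = 0;          // line of the current record's header, 0 if none yet
        boolean hasResidues = false; // has the current record had any sequence line?
        try (BufferedReader reader = new BufferedReader(in)) {
            String line;
            while ((line = reader.readLine()) != null) {
                lineNo++;
                String trimmed = line.trim();
                if (trimmed.isEmpty()) {
                    continue; // blank lines are tolerated
                }
                if (trimmed.startsWith(">")) {
                    if (headerLine != 0 && !hasResidues) {
                        problems.add("line " + headerLine + ": header with no sequence");
                    }
                    if (trimmed.length() == 1) {
                        problems.add("line " + lineNo + ": empty header");
                    }
                    headerLine = lineNo;
                    hasResidues = false;
                } else if (headerLine == 0) {
                    problems.add("line " + lineNo + ": sequence data before first header");
                } else {
                    // Assumption: letters plus '*' and '-' are acceptable symbols.
                    if (!trimmed.matches("[A-Za-z*-]+")) {
                        problems.add("line " + lineNo + ": unexpected character in sequence");
                    }
                    hasResidues = true;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (headerLine != 0 && !hasResidues) {
            problems.add("line " + headerLine + ": header with no sequence");
        }
        return problems;
    }

    public static void main(String[] args) {
        String sample = ">seq1 any free-text metadata\nACGT\n>seq2\n>seq3\nAC1GT\n";
        for (String problem : validate(new StringReader(sample))) {
            System.out.println(problem);
        }
    }
}
```

For the sample in main this flags the header on line 3 (no sequence)
and the sequence on line 5 (it contains a digit).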

Overall I would say: whenever possible, parse XML. This should be the
safest bet, although it is not always possible.
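
One reason XML is the safer bet is that a standard parser will tell you
where a file is broken. The sketch below uses only the JAXP/SAX classes
that ship with the JDK to check well-formedness and report the line and
column of the first fatal error. XmlWellFormedCheck and check are
hypothetical names, not part of BioJava, and this checks well-formedness
only, not validity against a schema or DTD.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not a BioJava class: well-formedness check only.
final class XmlWellFormedCheck {

    // Returns null if the input is well-formed, otherwise a message
    // describing the first fatal error the parser hit.
    static String check(InputSource source) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            // DefaultHandler ignores all content and rethrows fatal errors.
            factory.newSAXParser().parse(source, new DefaultHandler());
            return null;
        } catch (SAXParseException e) {
            return "line " + e.getLineNumber() + ", column " + e.getColumnNumber()
                    + ": " + e.getMessage();
        } catch (SAXException | ParserConfigurationException | IOException e) {
            return "could not parse: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        // The second <seq> element is never closed.
        String sample = "<seqs>\n  <seq id=\"a\">ACGT</seq>\n  <seq id=\"b\">ACGT\n</seqs>";
        String problem = check(new InputSource(new StringReader(sample)));
        System.out.println(problem == null ? "well-formed" : problem);
    }
}
```

The wording of the message depends on the parser implementation. A
document with a DOCTYPE may also cause the parser to try to load an
external DTD, so treat this as a starting point rather than a finished
tool.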

- Mark

On Sun, Dec 7, 2008 at 12:34 AM, Mark Fortner <phidias51 at gmail.com> wrote:
> I've noticed that a lot of the email on the mailing list from users tends to
> revolve around the inability to parse a file of a given file type.  In most
> of the cases it turns out that the file either does not conform to the
> standard, or the data in the file apparently violates XML rules of
> well-formedness.
>
> It occurred to me that we might put a page in the Cookbook that describes
> basic troubleshooting techniques.  Richard's past emails definitely contain a
> lot of useful information and could be used as a basis for the page.
>
> I also wondered if there were any plans in BioJava3 to include some sort of
> file validator (either as an integral part of the parsing framework or as a
> separate utility that could be run against any problematic file)?  In most
> cases, the user simply wants to know what part of the file is broken so that
> they can fix the file and carry on (or notify the data provider of the
> problem and have them address the issue).
>
> Regards,
>
> Mark Fortner