[Biojava-dev] File Validator

Richard Holland holland at eaglegenomics.com
Sun Dec 7 19:03:10 UTC 2008


I like the idea of a validator. It should probably just be the standard
parser run with some kind of 'report errors but carry on parsing
anyway' flag set. After all, the standard parser conforms to the
published format, so it should be able to spot most errors.

Such a flag does not exist yet, but it would be nice to incorporate
one in a future version.
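
To make the 'report errors but carry on' idea concrete, here is a minimal
sketch using FASTA as the simplest case. Everything in it is an assumption
for illustration: the class name LenientFastaCheck, the validate method and
the rule that sequence lines contain only letters, '*' and '-' are
hypothetical and are not part of BioJava or any published specification.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical lenient checker; illustration only, not a BioJava class. */
class LenientFastaCheck {

    /**
     * Scans FASTA-like lines and collects every problem found,
     * instead of throwing on the first one.
     */
    static List<String> validate(List<String> lines) {
        List<String> problems = new ArrayList<>();
        boolean seenHeader = false;        // has any '>' line appeared yet?
        boolean recordHasResidues = false; // does the current record have sequence lines?
        int lineNo = 0;
        for (String line : lines) {
            lineNo++;
            String text = line.trim();
            if (text.isEmpty()) {
                continue; // blank lines are tolerated in this sketch
            }
            if (text.startsWith(">")) {
                if (seenHeader && !recordHasResidues) {
                    problems.add("line " + lineNo + ": previous record has no sequence");
                }
                if (text.length() == 1) {
                    problems.add("line " + lineNo + ": empty header");
                }
                seenHeader = true;
                recordHasResidues = false;
            } else if (!seenHeader) {
                problems.add("line " + lineNo + ": sequence data before any '>' header");
            } else {
                // Assumed alphabet: letters plus '*' and '-'.
                if (!text.matches("[A-Za-z*-]+")) {
                    problems.add("line " + lineNo + ": unexpected character in sequence");
                }
                recordHasResidues = true;
            }
        }
        if (seenHeader && !recordHasResidues) {
            problems.add("end of input: last record has no sequence");
        }
        return problems;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(">seq1 example", "ACGT", "AC1T", ">seq2");
        for (String problem : validate(input)) {
            System.out.println(problem);
        }
    }
}
```

For the sample input in main, this reports two problems: the digit on
line 3, and the fact that seq2 has no sequence. The point is only that the
scan continues past the first problem, so the user gets the full list in
one pass.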

cheers,
Richard

Mark Schreiber wrote:
> I would agree that a file validator would be excellent although
> sometimes hard to write. The problem is mainly with the flat file
> formats. When we wrote the biojavax parsers we tried to make them
> conform to the descriptions given by NCBI etc. The problem is that
> the files themselves don't always conform to those descriptions.
> 
> I think a possible problem with NCBI is that all their flat files are
> produced from ASN.1 (kind of like XML). Like XML, ASN.1 can be
> validated quite easily. The flat files are produced by a transformation
> of the ASN.1, so they aren't always going to match the description.
> Finally, other people produce 'Genbank' and 'EMBL' files that are
> really just a similar format and not the real thing.
> 
> One of the most troublesome formats is FASTA. Not because it is
> difficult but because people try to code all manner of metadata into
> the header without any convention existing.
> 
> Overall I would say that, whenever possible, parsing XML should be the
> safest bet, although it is not always an option.
> 
> - Mark
> 
> On Sun, Dec 7, 2008 at 12:34 AM, Mark Fortner <phidias51 at gmail.com> wrote:
>> I've noticed that a lot of the email on the mailing list from users tends to
>> revolve around the inability to parse a file of a given file type.  In most
>> of the cases it turns out that the file either does not conform to the
>> standard, or the data in the file apparently violates XML rules of
>> well-formedness.
>>
>> It occurred to me that we might put a page in the Cookbook that describes
>> basic troubleshooting techniques.  Richard's past emails definitely contain a
>> lot of useful information and could be used as a basis for the page.
>>
>> I also wondered if there were any plans in BioJava3 to include some sort of
>> file validator (either as an integral part of the parsing framework or as a
>> separate utility that could be run against any problematic file)?  In most
>> cases, the user simply wants to know what part of the file is broken so that
>> they can fix the file and carry on (or notify the data provider of the
>> problem and have them address the issue).
>>
>> Regards,
>>
>> Mark Fortner
>> _______________________________________________
>> biojava-dev mailing list
>> biojava-dev at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biojava-dev
>>

-- 
Richard Holland, BSc MBCS
Finance Director, Eagle Genomics Ltd
M: +44 7500 438846 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/


