[Biojava-dev] File Validator

Mark Schreiber markjschreiber at gmail.com
Mon Dec 8 03:57:40 UTC 2008


A way to skip bad files in a stream would be great for general purpose
use as well. Currently it's a pain to parse lots of files only to fail
three quarters of the way through.
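
As a rough illustration of what "skip bad files" could look like, here is
a minimal Java sketch. It does not use any BioJava API: FileParser, Result,
parseAll and firstHeader are hypothetical names made up for this example,
and the toy parser only checks that the first line looks like a FASTA
header. The idea is simply to catch the failure per file, record it, and
carry on with the rest of the batch.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SkipBadFiles {

    /** Hypothetical stand-in for a real format parser; not a BioJava type. */
    @FunctionalInterface
    public interface FileParser<T> {
        T parse(Path file) throws Exception;
    }

    /** Successfully parsed values plus the failures collected on the way. */
    public static final class Result<T> {
        public final List<T> parsed = new ArrayList<>();
        public final Map<Path, String> failures = new LinkedHashMap<>();
    }

    /** Parses every file; a failure is recorded and the loop continues. */
    public static <T> Result<T> parseAll(List<Path> files, FileParser<T> parser) {
        Result<T> result = new Result<>();
        for (Path file : files) {
            try {
                result.parsed.add(parser.parse(file));
            } catch (Exception e) {
                // Record the problem instead of aborting the whole batch.
                result.failures.put(file, String.valueOf(e.getMessage()));
            }
        }
        return result;
    }

    /** Toy parser (assumption): accepts a file only if line 1 starts with '>'. */
    public static String firstHeader(Path file) throws IOException {
        List<String> lines = Files.readAllLines(file);
        if (lines.isEmpty() || !lines.get(0).startsWith(">")) {
            throw new IOException("line 1: expected a FASTA header starting with '>'");
        }
        return lines.get(0).substring(1);
    }

    public static void main(String[] args) {
        List<Path> files = new ArrayList<>();
        for (String arg : args) {
            files.add(Paths.get(arg));
        }
        Result<String> result = parseAll(files, SkipBadFiles::firstHeader);
        result.parsed.forEach(h -> System.out.println("OK   " + h));
        result.failures.forEach((f, msg) -> System.err.println("SKIP " + f + ": " + msg));
    }
}
```

Run it with a list of file names as arguments. The failures map doubles as
a crude validation report, since each entry says which file was skipped
and why.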

- Mark

On Mon, Dec 8, 2008 at 3:03 AM, Richard Holland
<holland at eaglegenomics.com> wrote:
> I like the idea of a validator. It should probably just be the standard
> parser run with some kind of a 'report errors but carry on parsing
> anyway' flag set (which currently doesn't exist). After all the standard
> parser is conforming to the published format so should be able to spot
> most errors.
>
> Such a flag does not yet exist, but yes it would be nice to incorporate
> it in future versions.
>
> cheers,
> Richard
>
> Mark Schreiber wrote:
>> I would agree that a file validator would be excellent although
>> sometimes hard to write. The problem is mainly with the flat file
>> formats. When we wrote the biojavax parsers we tried to make them
>> conform to the descriptions given by NCBI etc. The problem is that
>> the files themselves don't always conform to those descriptions.
>>
>> I think a possible problem with NCBI is that all their flat files are
>> produced from ASN.1 (kind of like XML). Like XML, ASN.1 can be
>> validated quite easily. The flatfiles are produced by a transformation
>> of the ASN.1 so they aren't always going to match the description.
>> Finally other people produce 'Genbank' and 'EMBL' files that are
>> really just a similar format but not the real thing.
>>
>> One of the most troublesome formats is FASTA. Not because it is
>> difficult but because people try to code all manner of metadata into
>> the header without any convention existing.
>>
>> Overall I would say that, whenever possible, parsing XML should be the
>> safest bet, although XML is not always available.
>>
>> - Mark
>>
>> On Sun, Dec 7, 2008 at 12:34 AM, Mark Fortner <phidias51 at gmail.com> wrote:
>>> I've noticed that a lot of the email on the mailing list from users tends to
>>> revolve around the inability to parse a file of a given file type.  In most
>>> of the cases it turns out that the file either does not conform to the
>>> standard, or the data in the file apparently violates XML rules of
>>> well-formedness.
>>>
>>> It occurred to me that we might put a page in the Cookbook that describes
>>> basic troubleshooting techniques.  Richard's past emails definitely contain a
>>> lot of useful information and could be used as a basis for the page.
>>>
>>> I also wondered if there were any plans in BioJava3 to include some sort of
>>> file validator (either as an integral part of the parsing framework or as a
>>> separate utility that could be run against any problematic file)?  In most
>>> cases, the user simply wants to know what part of the file is broken so that
>>> they can fix the file and carry on (or notify the data provider of the
>>> problem and have them address the issue).
>>>
>>> Regards,
>>>
>>> Mark Fortner
>>> _______________________________________________
>>> biojava-dev mailing list
>>> biojava-dev at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/biojava-dev
>>>
>>
>
> --
> Richard Holland, BSc MBCS
> Finance Director, Eagle Genomics Ltd
> M: +44 7500 438846 | E: holland at eaglegenomics.com
> http://www.eaglegenomics.com/
>


