[Biojava-l] FASTA parsing bug ?

Mark Schreiber markjschreiber at gmail.com
Thu Apr 30 03:01:20 UTC 2009


A minimal XML equivalent to Fasta would look like this:

<BioEntry id="foo" description="bar">
    <Sequence>ACGTGCACGCTGCACGT</Sequence>
</BioEntry>

I think a biologist could handle that, and it is much easier to parse
than FASTA because it is well-formed. You don't even need an XML
parser: you could convert this to FASTA in a text editor with a few
find-and-replace expressions, so it may be manageable even for someone
who can't program at all. Of course you could make it a lot more
sophisticated, but then you are approximating GenbankXML or something
similar.
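
To make the find-and-replace idea concrete, here is a rough sketch in
Python of converting the minimal record above to FASTA with a regular
expression and no XML parser. It rests on several assumptions: the
BioEntry layout is only the illustration from this message and not an
existing standard, the attributes are double-quoted and appear in the
order id then description, and the names ENTRY, bioentry_to_fasta and
width are made up for the example.

```python
import re

# Matches one minimal BioEntry record of the shape shown above.
# Assumption: attributes are double-quoted and ordered id, description.
ENTRY = re.compile(
    r'<BioEntry\s+id="([^"]*)"\s+description="([^"]*)"\s*>\s*'
    r'<Sequence>\s*([A-Za-z*\-\s]*?)\s*</Sequence>\s*</BioEntry>'
)


def bioentry_to_fasta(xml_text, width=60):
    """Convert every minimal BioEntry record in xml_text to FASTA text."""
    records = []
    for entry_id, description, sequence in ENTRY.findall(xml_text):
        # Drop any whitespace inside the sequence, then wrap to fixed width.
        sequence = re.sub(r"\s+", "", sequence)
        lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
        records.append(">%s %s\n%s\n" % (entry_id, description, "\n".join(lines)))
    return "".join(records)


example = '''<BioEntry  id="foo" description="bar">
    <Sequence>ACGTGCACGCTGCACGT</Sequence>
</BioEntry>'''

print(bioentry_to_fasta(example), end="")
```

This is deliberately naive: it does not unescape XML entities such as
&amp; in the description, and it would miss records whose attributes are
reordered or single-quoted. A real XML parser handles those cases, at
the cost of no longer being a plain search and replace.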

Remember that BioJava's FASTA parsers were written by experienced
programmers and have had over 10 years of testing and bug fixes, yet
people still manage to break them. This indicates to me that the FASTA
format is bad and should be voted off the island.

- Mark

On Thu, Apr 30, 2009 at 6:28 AM, simon rayner <simon.rayner.cn at gmail.com> wrote:
> Don't forget that a lot of the people doing bioinformatics are biologists
> with no formal training.  They want to get the job done in the easiest
> possible way and aren't really concerned about the details.  If you want
> people to switch to XML, for example, the whole concept needs to be made more
> accessible.  I'm still struggling to get my students to adopt XML.
>
> It seems that more basic tutorials would be useful, but in a less formal
> style that would be easier for newcomers to follow.  Are there any feelings
> about trying to develop this side of the BioJava project?  I thought about
> trying to add some stuff, but my Java programming is embarrassingly poor and
> I thought I would be laughed off the website.
>
> Simon
>
> On Wed, Apr 29, 2009 at 10:33 PM, Mark Schreiber <markjschreiber at gmail.com>
> wrote:
>>
>> I can understand a bench scientist wanting FASTA, but a computational
>> biologist? They should be ashamed! With some of the friendly XPath
>> implementations in common scripting languages there really is no excuse.
>> It's easier to parse XML than FASTA in Groovy, Perl, Python and Ruby.
>> Probably Java and C as well.
>>
>> The state of bioinformatics data formats is cringeworthy. Let's try to
>> enter the 21st century!
>>
>> OK I'm ranting again. Maybe I'll go join twitter.
>>
>> - Mark
>>
>> On 29 Apr 2009, 10:04 PM, "Josh Goodman" <jogoodma at indiana.edu> wrote:
>>
>>
>> Hi Mark,
>>
>> I couldn't agree with you more, which is why we also provide this data in
>> GFF and Chado XML formats, Chado PostgreSQL dumps, and a public read-only
>> Chado database.  However, no matter how much we try to encourage use of
>> the other formats, users still flock to the good old FASTA files.  There
>> are a variety of reasons, but the most common case involves bench
>> scientists and/or programmers who run at the sight of anything more
>> complex than a FASTA file.
>>
>> I've toyed with the idea of reducing the data we cram into the headers to
>> gently try to encourage use of the other more sensible formats.  However,
>> at the end of the day we (FlyBase) serve at the behest of our user
>> community and this is what they want to see.
>>
>> Cheers,
>> Josh
>>
>> On Wed, 29 Apr 2009, Mark Schreiber wrote:
>> > People who know me will know I am not a big fan of F...
>> _______________________________________________
>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>
>
>
> --
> Simon Rayner
>
> State Key Laboratory of Virology
> Wuhan Institute of Virology
> Chinese Academy of Sciences
> Wuhan, Hubei 430071
> P.R.China
>
> +86 (27) 87199895 (office)
> +86 15972923715 (cell)
>
>



