[Biojava-dev] [Biojava-l] FASTA Header Parser

Scooter Willis HWillis at scripps.edu
Wed Jan 11 15:24:48 UTC 2012


Hannes

It looks like length= is something I specifically look for and truncate,
since it is redundant information and not part of the unique ID for a
header. Are you using "HG7JTKN01BFWC8 rank=0000030 x=474.5 y=10.0" as
your unique ID?

Is this a custom header or something output from a sequencing
instrument/software?

Does it make sense to formally support this header, where the unique ID
would be HG7JTKN01BFWC8 and rank, x and y would be added as metadata to
the sequence?
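
To make the idea concrete, here is a minimal sketch in plain Java of how such
a header could be split into a unique ID plus key=value metadata. It does not
use any BioJava classes, and ParsedFastaHeader and its parse method are
hypothetical names chosen for illustration, not the code in
GenericFastaHeaderParser. It assumes the ID is the first whitespace-delimited
token and that every following token of the form key=value is metadata.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of BioJava.
final class ParsedFastaHeader {
    final String id;                    // first whitespace-delimited token
    final Map<String, String> metadata; // key=value tokens, in header order

    private ParsedFastaHeader(String id, Map<String, String> metadata) {
        this.id = id;
        this.metadata = metadata;
    }

    static ParsedFastaHeader parse(String header) {
        // Accept the header with or without the leading '>' of a FASTA record.
        String line = header.startsWith(">") ? header.substring(1) : header;
        String[] tokens = line.trim().split("\\s+");

        Map<String, String> metadata = new LinkedHashMap<>();
        for (int i = 1; i < tokens.length; i++) {
            int eq = tokens[i].indexOf('=');
            // Tokens without a key before '=' are ignored in this sketch.
            if (eq > 0) {
                metadata.put(tokens[i].substring(0, eq), tokens[i].substring(eq + 1));
            }
        }
        return new ParsedFastaHeader(tokens[0], metadata);
    }

    public static void main(String[] args) {
        ParsedFastaHeader h =
                parse(">HG7JTKN01BFWC8 rank=0000030 x=474.5 y=10.0 length=57");
        System.out.println(h.id);       // HG7JTKN01BFWC8
        System.out.println(h.metadata); // {rank=0000030, x=474.5, y=10.0, length=57}
    }
}
```

With this split the short ID can serve as the map key, while rank, x, y and
length stay available as metadata and the original header string is left
untouched.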

I checked in the change to GenericFastaHeaderParser.

Scooter

On 1/11/12 9:54 AM, "Hannes Brandstätter-Müller" <biojava at hannes.oib.com>
wrote:

>a simple nucleotide fasta file: (randomized the sequence for privacy)
>
>>HG7JTKN01BFWC8 rank=0000030 x=474.5 y=10.0 length=57
>ACGTGACTGTCGTGCTGCTACTAGCTGATCA
>
>produces the key "HG7JTKN01BFWC8 rank=0000030 x=474.5 y=10.0"
>The originalHeader is correct.
>
>I have to read another file too (QUAL file) that is similar, but the
>fasta reader can not handle it, so I wrote my own parser... this one
>gives me the full and correct header. I think it's reasonable to
>expect the fasta header parser to behave similarly. I would prefer not
>to change the full header string, because you can never know what
>special headers you might encounter.
>
>Hannes
>
>On Wed, Jan 11, 2012 at 15:49, Scooter Willis <HWillis at scripps.edu> wrote:
>> Can you send me a sample FASTA file and what you are finding vs.
>> expecting?
>>
>> ----- Reply message -----
>> From: "Hannes Brandstätter-Müller" <biojava at hannes.oib.com>
>> To: "Scooter Willis" <HWillis at scripps.edu>
>> Subject: [Biojava-l] FASTA Header Parser
>> Date: Wed, Jan 11, 2012 9:43 am
>>
>>
>>
>> nope, the header is in the hashmap in total, except for everything
>> after length= -- there are whitespaces before that.
>>
>>
>> either make it work like you say or even better, leave the header as-is.
>>
>> I need to quickly find the sequence, I don't want to iterate over all
>> my 35k sequences and look up the original headers.
>>
>> Hannes
>>
>> On Wed, Jan 11, 2012 at 15:38, Scooter Willis <HWillis at scripps.edu> wrote:
>>> It should parse until the first space as the unique id. Lots of extra
>>> info gets added in to the header. You should find a getOriginalHeader
>>> method that will preserve the contents of the header. I use this when
>>> writing the sequences back to disk to restore the original header.
>>>
>>> You can also do your own custom header parser, which we use to support
>>> the known different fasta headers. If you have extra information in the
>>> header you can formally associate that with the sequence at the time of
>>> the parse. We can also add support for your header if it is standard
>>> output from a device.
>>>
>>> Thanks
>>>
>>> Scooter
>>>
>>>
>>> ----- Reply message -----
>>> From: "Hannes Brandstätter-Müller" <biojava at hannes.oib.com>
>>> To: "biojava-l" <biojava-l at lists.open-bio.org>
>>> Subject: [Biojava-l] FASTA Header Parser
>>> Date: Wed, Jan 11, 2012 9:30 am
>>>
>>>
>>>
>>> Hi there -
>>>
>>> I just came across a puzzling "feature" of the GenericFastaHeaderParser.
>>> It seems to throw away everything in the header after (and including)
>>> "length="
>>> (see GenericFastaHeaderParser.java lines 71-76)
>>>
>>> ... Why?
>>>
>>> Also, is there a Fasta Header Parser I can use that does not mess
>>> about with the header?
>>>
>>> I really would like to have that as key (still working on my
>>> FASTA/QUAL parsing) and not having that (only in the originalHeader,
>>> not in the Hashmap key) really breaks stuff.
>>>
>>> Hannes
>>> _______________________________________________
>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/biojava-l




