NCBI fasta format [was: Re: [Bioperl-l] loading data into bioperl-db]

Tue Jun 10 15:06:40 EDT 2003

Peter Wilkinson wrote:
> Yes, that ncbi doc should be what it's based on. And yes, the lines are 
> separated by an 'esc' sequence. I am not sure what we should do about 
> that list .... I cannot see any immediate use for keeping the list in 
> the sequence. Perhaps as a first implementation we will just drop the 
> list and keep the first annotation.

Possibly emit one set of sequence parsing events for each esc-separated 
ID line, so that if there are 3 of these, you'd get the 3 sequences out 
again? This is sort of like decompressing the "compressed" fasta.
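
A rough sketch of the decompressing idea, in Python rather than as a real
bioperl parser. It assumes the separator is Ctrl-A (0x01), which is what
NCBI uses in the nr deflines; the function name and the example IDs are
made up for illustration.

```python
from typing import Iterator, Tuple

# Assumption: the separator is Ctrl-A (0x01), as in NCBI nr deflines.
SEPARATOR = "\x01"


def expand_record(header: str, sequence: str) -> Iterator[Tuple[str, str, str]]:
    """Yield one (id, description, sequence) tuple per definition on the header line."""
    for defline in header.lstrip(">").split(SEPARATOR):
        defline = defline.strip()
        if not defline:
            continue  # skip empty pieces, e.g. from a trailing separator
        # The ID is everything up to the first space; the rest is the description.
        seq_id, _, description = defline.partition(" ")
        yield seq_id, description, sequence


# Made-up header with three definitions sharing one sequence.
header = (
    ">gi|1|gb|AAA1.1| protein one"
    "\x01gi|2|ref|NP_2.1| protein two"
    "\x01gi|3|sp|P3| protein three"
)
records = list(expand_record(header, "MKV"))
```

Each tuple could then drive its own set of parsing events, so one
"compressed" entry comes back out as three ordinary sequences.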

> 
> Can anyone think of a pressing use for the list of definitions?

Only data completeness. Someone is going to want to look up an entry by 
an ID other than the first one listed, and will be confused/angry/noisy 
when it doesn't get retrieved. If you just plugged this stuff directly 
into the obda flat-file indexer, it should really sort of work out OK, 
don't you think? But then that would be extra work.
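
To show what I mean by looking up an entry by any of its IDs, here is a
minimal sketch in Python. This is not the obda indexer, just a plain
dictionary from ID to file offset; the function name is hypothetical and
the Ctrl-A (0x01) separator is an assumption based on the NCBI nr format.

```python
import io
from typing import Dict, TextIO

# Assumption: definitions on a header line are separated by Ctrl-A (0x01).
SEPARATOR = "\x01"


def index_all_ids(handle: TextIO) -> Dict[str, int]:
    """Map every ID on every header line to the offset where its record starts."""
    index: Dict[str, int] = {}
    while True:
        offset = handle.tell()  # position of the line about to be read
        line = handle.readline()
        if not line:
            break
        if line.startswith(">"):
            for defline in line[1:].rstrip("\n").split(SEPARATOR):
                parts = defline.split(None, 1)
                if parts:
                    # Keep the first offset seen if an ID repeats.
                    index.setdefault(parts[0], offset)
    return index


# Made-up two-record file; the first record carries two IDs.
data = io.StringIO(">id1 first\x01id2 second\nMKV\n>id3 third\nGGA\n")
idx = index_all_ids(data)
```

With this, a lookup on id2 lands on the same record as id1, which is the
behaviour people will expect.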

Matthew