[emboss-dev] Regression in GenBank/GenPept parsing?

Tue Jul 21 10:21:34 UTC 2009

On Tue, Jul 21, 2009 at 11:01 AM, Peter Rice <pmr at ebi.ac.uk> wrote:
>
> Peter C. wrote:
>> I guess "refseqp" means refseq protein? Another name for GenPept?
>
> Not quite ... because genpept has yet another variation of GenBank format.
>
> refseqp is the protein part of refseq.
>
>> Is "refseqp" a public EMBOSS format name, or something internal? I've
>> never noticed it in the documentation, e.g.
>> http://emboss.sourceforge.net/docs/themes/SequenceFormats.html#in
>
> We're in the process of updating that. Somewhere in among writing the
> books and creating the wiki the old website got left behind.
>
> My next task (once I've made sure your bugs are fixed) is to regenerate
> all the tables of formats.

Great. That may save you from having to answer my next
question: could you expand on what EMBOSS considers to be the
differences between the "genbank", "genpept" and "refseqp"
file formats? Of course, I may come up with further questions ;)

>> Biopython treats "genbank" format as meaning either a GenBank file
>> (with nucleotides) or a GenPept file (with amino acids). We detect this
>> based on the LOCUS line containing "bp" or "aa".
>
> So do we ... but we need two versions of the 'aa' LOCUS lines. We try to
> pick up the rest of the details for reuse in output.

Why do you need two versions of the 'aa' LOCUS line? Is this
the "genpept" format versus "refseqp" issue alluded to earlier?
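
For anyone following along on the list, the "bp" versus "aa" check
described above amounts to something like the sketch below. This is
only an illustration, not the actual Biopython or EMBOSS parser code:
the function name locus_residue_type is made up, and the protein
LOCUS line (EXAMPLE_P) is an invented example rather than a real
record.

```python
def locus_residue_type(locus_line):
    """Guess "nucleotide" or "protein" from the unit on a LOCUS line.

    Hypothetical helper for illustration only; real parsers handle
    more LOCUS line variants than this does.
    """
    tokens = locus_line.split()
    if not tokens or tokens[0] != "LOCUS":
        raise ValueError("Not a LOCUS line: %r" % locus_line)
    # The unit follows the sequence length, e.g. "5028 bp" or "154 aa",
    # so look at each token together with the one after it.
    for count, unit in zip(tokens[1:], tokens[2:]):
        if count.isdigit():
            if unit == "bp":
                return "nucleotide"
            if unit == "aa":
                return "protein"
    raise ValueError("No 'bp' or 'aa' unit found: %r" % locus_line)


# A nucleotide LOCUS line, then an invented protein one.
dna = "LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999"
pep = "LOCUS       EXAMPLE_P     154 aa            linear   PRI 21-JUL-2009"
print(locus_residue_type(dna))
print(locus_residue_type(pep))
```

Note that this only tells nucleotide from protein; it says nothing
about which 'aa' LOCUS line variant (genpept or refseqp) is in use,
which is the part I am asking about.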

>> [Do you want to forward this back to the mailing list?]
>
> Will do.
>
> Peter

I've CC'd this reply to the list.

Peter