[Biopython-dev] Creating a NCBIFastaIterator

Fri Oct 7 12:49:30 UTC 2011

On Fri, Oct 7, 2011 at 12:18 PM, Keith Hughitt <keith.hughitt at gmail.com> wrote:
> Okay, I took a stab at it. The code is in the master branch of my
> fork: https://github.com/khughitt/biopython/blob/75be77cf28d376329577adf5ec41a8880b7faf5c/Bio/SeqIO/FastaIO.py#L73

You are only handling gi|<gi_num>|ref|<accession>|<description>
whereas the NCBI have a *lot* of other variations to consider:

http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/formatdb_fastacmd.html

This is quite an open ended bit of work...
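
To give a feel for why, here is a rough, table-driven sketch of
splitting an NCBI style title line into its tagged identifiers. The
function name and the table are hypothetical (not existing Biopython
code), and the field counts per tag are my reading of the NCBI
identifier conventions, so treat them as assumptions to check against
the page above:

```python
# Assumed number of "|"-separated fields that follow each database tag.
FIELD_COUNTS = {
    "gi": 1, "lcl": 1, "bbs": 1,
    "gb": 2, "emb": 2, "dbj": 2, "ref": 2, "sp": 2,
    "pir": 2, "prf": 2, "pdb": 2, "pat": 2, "gnl": 2,
}


def parse_ncbi_title(title):
    """Split an NCBI-style FASTA title into (identifiers, description).

    identifiers is a list of (tag, fields) tuples in the order found.
    Hypothetical helper for illustration only.
    """
    # The identifier block is everything up to the first space.
    id_part, _, description = title.partition(" ")
    tokens = id_part.split("|")
    identifiers = []
    i = 0
    while i < len(tokens):
        tag = tokens[i]
        if tag not in FIELD_COUNTS:
            # Unknown tag (or trailing empty token): stop parsing here.
            break
        n = FIELD_COUNTS[tag]
        identifiers.append((tag, tuple(tokens[i + 1:i + 1 + n])))
        i += 1 + n
    return identifiers, description.strip()


# Illustrative title line in the gi|...|ref|...| style discussed above.
ids, desc = parse_ncbi_title(
    "gi|45478711|ref|NC_005816.1| Yersinia pestis plasmid pPCP1"
)
```

Even this ignores concatenated entries and the tags whose fields may
be empty, which is where most of the remaining work would be.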

> I wasn't sure what the best choices are for id/name so for now I stored the
> gid in id (and also in the annotations), and the accession as name. Any
> suggestions?

I suggest collecting a selection of matched NCBI FASTA and
GenBank/GenPept files, looking at how Biopython handles the
GenBank/GenPept version (format name "genbank", alias "gb",
in Bio.SeqIO), and trying to make parsing the FASTA version as
"fasta-ncbi" give the same id, name and description.

For example, these two files in our unit tests (originally from
the NCBI FTP site) are a matched pair:

Tests/GenBank/NC_005816.gb
Tests/GenBank/NC_005816.fna

> I also haven't written any test code yet. Should I parameterize
> TitleFunctions.simple_check and multi_check, or is there
> another approach you would advise?
> Keith

I would probably write some completely new tests. For example,
use the existing test files mentioned above, and verify that both
the "genbank" and the "fasta-ncbi" parsers give the same
results (ignoring things not in the FASTA file of course).
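
Something along these lines might do as the core of such a test. It is
only a sketch: the helper name is hypothetical, and the two records
below are made-up stand-ins so the example runs without any files. In
a real test they would come from Bio.SeqIO, with "fasta-ncbi" being
the proposed (not yet existing) format name:

```python
from types import SimpleNamespace


def shared_field_mismatches(reference, candidate,
                            fields=("id", "name", "description", "seq")):
    """Return {field: (expected, actual)} for fields that differ.

    Only the listed fields are compared, so annotation that exists in
    the GenBank record but not in the FASTA file is ignored.
    Hypothetical helper for illustration only.
    """
    mismatches = {}
    for field in fields:
        expected = getattr(reference, field, None)
        actual = getattr(candidate, field, None)
        # Compare as strings so a Seq-like object and a str can match.
        if str(expected) != str(actual):
            mismatches[field] = (expected, actual)
    return mismatches


# Made-up stand-ins for the records the two parsers would return.
gb_record = SimpleNamespace(id="ACC_1.1", name="ACC_1",
                            description="example plasmid", seq="ACGT")
fasta_record = SimpleNamespace(id="12345", name="ACC_1.1",
                               description="example plasmid", seq="ACGT")

diff = shared_field_mismatches(gb_record, fasta_record)
```

Here diff reports id and name as differing, which is the kind of
disagreement the new tests should catch.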

Peter