[Biopython-dev] Loading SeqRecords into BioSQL with NCBI taxon ID

Peter biopython at maubp.freeserve.co.uk
Tue May 12 16:05:15 UTC 2009

Over on Bug 2826, David wrote:

> Thank you. I'm new to BioPython.
> The goal was to take some whole-genome sequence (which isn't in Genbank) and
> attach a taxon to it, in order that it be written to a BioSQL database.

You've talked about trying to parse WGS GenBank files on Bug 2825 but
presumably if this new data isn't in GenBank, it is in another format.

What format is your whole-genome sequence in?  FASTA or something simple?

> Other records in the BioSQL database derive from NCBI and so have taxon_ids,
> so the additional WGS being in a similar format would make things simpler.

I see. Basically you need to import a SeqRecord into BioSQL with an
NCBI taxon ID.  You don't need to write out a GenBank file to do this.

First create the SeqRecord, e.g.

from Bio import SeqIO
record = SeqIO.read(handle, format, alphabet)

There are now two options - because the BioSQL loader will look for
the NCBI taxon ID in two places:

(Option 1) Record the NCBI taxon ID in the SeqRecord's annotation
dictionary under the "ncbi_taxid" key.  This should work (untested):

record.annotations["ncbi_taxid"] = 12345 #or single element list, [12345]

(Option 2) Mimic a SeqRecord from parsing a GenBank file with a source
feature containing the taxon ID. This should work (untested):

#Create the SeqRecord:
record = SeqIO.read(handle, format, alphabet)
#Create the source features:
from Bio.SeqFeature import SeqFeature, FeatureLocation
f = SeqFeature(FeatureLocation(0, len(record)), strand=+1, type="source")
f.qualifiers["db_xref"] = ["taxon:12345"]
record.features = [f] #or insert at start

If you don't really have a sequence, this second approach doesn't make
so much sense.

[Arguably there could be a third option via the SeqRecord's dbxrefs list]
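
To make the "two places" concrete, here is a rough sketch of the kind of
lookup involved. This is not the BioSQL loader's actual code: the function
name find_ncbi_taxid is made up for illustration, the order in which the two
places are checked is an assumption, and the records are simple stand-in
objects (so it runs without Biopython) that only copy the attribute names
used in the two options above.

```python
from types import SimpleNamespace


def find_ncbi_taxid(record):
    """Return the NCBI taxon ID found on a SeqRecord-like object, or None.

    Hypothetical helper for illustration only, not part of Biopython.
    """
    # Place 1: the annotations dictionary, holding either a single value
    # or a one-element list (checking this first is an assumption).
    taxid = record.annotations.get("ncbi_taxid")
    if isinstance(taxid, (list, tuple)):
        taxid = taxid[0] if taxid else None
    if taxid is not None:
        return int(taxid)
    # Place 2: a "source" feature with a db_xref qualifier like "taxon:12345".
    for feature in record.features:
        if feature.type != "source":
            continue
        for xref in feature.qualifiers.get("db_xref", []):
            if xref.startswith("taxon:"):
                return int(xref.split(":", 1)[1])
    return None


# Stand-ins using the same attribute names as SeqRecord and SeqFeature.
option1 = SimpleNamespace(annotations={"ncbi_taxid": [12345]}, features=[])
source = SimpleNamespace(type="source", qualifiers={"db_xref": ["taxon:12345"]})
option2 = SimpleNamespace(annotations={}, features=[source])
no_taxon = SimpleNamespace(annotations={}, features=[])
```

Both option1 and option2 give the same taxon ID here, which is the point:
either way of recording it should be enough for the loader to find it.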

Then in either case, load the modified SeqRecord into the database.
You may want to pre-load the NCBI taxonomy, see

Alternatively, using Biopython 1.49+ you can have this fetched from
Entrez on demand with the fetch_NCBI_taxonomy=True option.  The BioSQL
wiki page needs updating on this topic.
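
For the loading step, something along these lines should work (untested
sketch). The function name load_with_taxonomy and all of its connection
arguments are placeholders to be replaced with your own settings, and the
namespace is assumed not to exist in the database yet. The import sits inside
the function only so that the sketch can be defined without a BioSQL setup to
hand.

```python
def load_with_taxonomy(record, driver, user, passwd, host, db_name, namespace):
    """Load one SeqRecord into BioSQL, fetching taxonomy from Entrez on demand.

    All arguments other than `record` are placeholders for your own settings.
    """
    # Imported here so that defining this function does not need a database.
    from BioSQL import BioSeqDatabase

    server = BioSeqDatabase.open_database(
        driver=driver, user=user, passwd=passwd, host=host, db=db_name
    )
    # Assumes the namespace (sub-database) does not already exist.
    db = server.new_database(namespace)
    # load() takes an iterable of SeqRecord objects and returns how many
    # records were loaded.
    count = db.load([record], fetch_NCBI_taxonomy=True)
    server.commit()
    return count
```

You would call this with the record prepared by either option above, e.g.
load_with_taxonomy(record, "MySQLdb", "user", "password", "localhost",
"bioseqdb", "wgs"), where every value shown is just an example.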


