[BioPython] [BioSQL-l] BioSQL : GenBank db_xref names in dbxref table
Chris Fields
cjfields at uiuc.edu
Fri Nov 23 00:42:12 UTC 2007
I think SeqIO checks the name for parsing reasons only, in cases
where the format changes based on the source (such as GenPept
DBSOURCE data). I don't think we go beyond that in Bioperl, probably
b/c modifying or expanding names for data persistence would lead to
volatile coding issues (i.e. consistency between parsers, constant
updating to cover new crossrefs, etc).
I would definitely suggest retaining the original DB name as it appears
in the dbxref for consistency/sanity; if expanded names are needed,
return them through a separate method wherever a mapping is defined.
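
To make that suggestion concrete, here is a minimal Python sketch. It is not
existing Biopython or BioPerl code: the names DbXref, DISPLAY_NAMES and
parse_db_xref are hypothetical, and the three expanded names are just a subset
of the mapping Peter quotes below. The stored value is the db name exactly as
it appears in the record; the expanded name is only produced on request and
falls back to the original when no mapping is defined.

```python
from dataclasses import dataclass

# Hypothetical display-name table (assumption: a small subset of the
# mapping quoted in this thread). Used for presentation only.
DISPLAY_NAMES = {
    "ECOCYC": "EcoCyc",
    "GEO": "Gene Expression Omnibus",
    "DDBJ": "DNA Databank of Japan",
}


@dataclass(frozen=True)
class DbXref:
    """A db_xref with the database name kept exactly as in the record."""

    dbname: str
    accession: str

    def display_name(self) -> str:
        # Expanded name if one is defined, otherwise the original name.
        return DISPLAY_NAMES.get(self.dbname, self.dbname)


def parse_db_xref(value: str) -> DbXref:
    """Split a qualifier value such as 'ECOCYC:EG10213' at the first colon."""
    dbname, sep, accession = value.partition(":")
    if not sep:
        raise ValueError("db_xref value has no ':' separator: %r" % value)
    return DbXref(dbname, accession)


if __name__ == "__main__":
    for raw in ("ECOCYC:EG10213", "GI:16128366", "dbSNP:12345"):
        xref = parse_db_xref(raw)
        print(xref.dbname, xref.accession, xref.display_name())
```

With this layout an abbreviation the table has never seen (for instance
"dbSNP" or "MIM") is stored and returned unchanged, so parsers do not need
updating every time a new crossref source appears.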
chris
On Nov 22, 2007, at 10:37 AM, Mauricio Herrera Cuadra wrote:
> Hi Peter,
>
> In BioPerl, there's no such mapping for db_xref's that I'm aware of.
> Each parser handles db_xref records on its own. Take a look at the
> Bio::SeqIO::genbank code, inside the next_seq() method for example:
>
> http://code.open-bio.org/cgi-bin/viewcvs/viewcvs.cgi/bioperl-live/
> Bio/SeqIO/genbank.pm?rev=HEAD&content-type=text/vnd.viewcvs-markup
>
> Regards,
> Mauricio.
>
> Peter wrote:
>> Dear all,
>>
>> I'm one of the Biopython developers. I've recently got going with
>> BioSQL and have been getting to grips with the Biopython BioSQL
>> interface. I'm aware that we need to try and be consistent with
>> BioPerl and BioJava, so I'd like to pose my first question related to
>> that.
>>
>> When loading GenBank records, many features have db_xref qualifiers,
>> e.g. from a random CDS feature in E. coli K12:
>>
>> /db_xref="ASAP:1309"
>> /db_xref="GI:16128366"
>> /db_xref="ECOCYC:EG10213"
>> /db_xref="GeneID:945313"
>>
>> Biopython attempts to translate the strings "ASAP", "GI", "ECOCYC",
>> "GeneID" before recording these entries in the seqfeature_dbxref
>> and dbxref tables. For example, "GI" becomes "GeneIndex".
>> Biopython's current mapping is as follows:
>>
>> # Dictionary of database types, keyed by GenBank db_xref abbreviation
>> db_dict = {'GeneID': 'Entrez',
>>            'GI': 'GeneIndex',
>>            'COG': 'COG',
>>            'CDD': 'CDD',
>>            'DDBJ': 'DNA Databank of Japan',
>>            'Entrez': 'Entrez',
>>            'GeneIndex': 'GeneIndex',
>>            'PUBMED': 'PubMed',
>>            'taxon': 'Taxon',
>>            'ATCC': 'ATCC',
>>            'ISFinder': 'ISFinder',
>>            'GOA': 'Gene Ontology Annotation',
>>            'ASAP': 'ASAP',
>>            'PSEUDO': 'PSEUDO',
>>            'InterPro': 'InterPro',
>>            'GEO': 'Gene Expression Omnibus',
>>            'EMBL': 'EMBL',
>>            'UniProtKB/Swiss-Prot': 'UniProtKB/Swiss-Prot',
>>            'ECOCYC': 'EcoCyc',
>>            'UniProtKB/TrEMBL': 'UniProtKB/TrEMBL'
>>            }
>>
>> In my testing, I've found several GenBank db_xref abbreviations for
>> which we don't have a mapping defined, such as "LocusID", "dbSNP",
>> "MGD", "MIM", or from an EMBL file, "REMTREMBL".
>>
>> I'd like to know if BioPerl and/or BioJava and/or BioRuby define a
>> similar mapping in their BioSQL code (or GenBank parser), so that
>> Biopython can follow your example.
>>
>> Thank you,
>>
>> Peter
>>
>> P.S. See also Biopython bug 2405
>> http://bugzilla.open-bio.org/show_bug.cgi?id=2405
>> _______________________________________________
>> BioSQL-l mailing list
>> BioSQL-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biosql-l
>>
>
> --
> MAURICIO HERRERA CUADRA
> arareko at campus.iztacala.unam.mx
> Laboratorio de Genética
> Unidad de Morfofisiología y Función
> Facultad de Estudios Superiores Iztacala, UNAM
>
Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign