[Biopython-dev] [Bug 3175] Caret in genbank files leads to GenBank Parser crash in Biopython 1.54

bugzilla-daemon at portal.open-bio.org bugzilla-daemon at portal.open-bio.org
Thu Feb 10 14:05:33 UTC 2011


http://bugzilla.open-bio.org/show_bug.cgi?id=3175





------- Comment #8 from biopython-bugzilla at maubp.freeserve.co.uk  2011-02-10 09:05 EST -------
(In reply to comment #7)
> NT_022184.15 is the record containing IGKV2-40 (and the associated caret) in
> my file. What I said about Nucleotide still applies, though. 
> 

Yes, you're right. My mistake: NT_015926.15 was the last good record.

Had you noticed this was the last gene in this record? It runs right up to
the end of the sequence and beyond (missing the rightmost end, i.e. the 5'
start of the gene, since it is on the reverse strand). From the FTP site:

LOCUS       NT_022184           68452323 bp    DNA     linear   CON 28-OCT-2010
DEFINITION  Homo sapiens chromosome 2 genomic contig, GRCh37.p2 reference
            primary assembly.
...
     gene            complement(68451760..>68452323)
                     /gene="IGKV2-40"
                     /gene_synonym="IGKV240; O11; O11a"
                     /note="Derived by automated computational analysis using
                     gene prediction method: Curated Genomic."
                     /db_xref="GeneID:28916"
                     /db_xref="HGNC:5789"
                     /db_xref="IMGT/GENE-DB:IGKV2-40"
     V_segment       complement(68451760..68452073^68452074)
                     /gene="IGKV2-40"
                     /gene_synonym="IGKV240; O11; O11a"
                     /standard_name="IGKV2-40"
                     /note="Derived by automated computational analysis using
                     gene prediction method: Curated Genomic."
                     /db_xref="GeneID:28916"
     CDS             complement(<68451760..68452072^68452073)
                     /gene="IGKV2-40"
                     /gene_synonym="IGKV240; O11; O11a"
                     /exception="rearrangement required for product"
                     /note="Derived by automated computational analysis using
                     gene prediction method: Curated Genomic."
                     /codon_start=1
                     /db_xref="GeneID:28916"
                     /db_xref="IMGT/LIGM:IGKV2-40"
                     /db_xref="HGNC:5789"
                     /db_xref="IMGT/GENE-DB:IGKV2-40"

If we look at the record via Entrez,
http://www.ncbi.nlm.nih.gov/nuccore/NT_022184.15?report=gbwithparts
the same features read:

     gene            complement(68451760..>68452323)
                     /gene="IGKV2-40"
                     /gene_synonym="IGKV240; O11; O11a"
                     /note="Derived by automated computational analysis using
                     gene prediction method: Curated Genomic."
                     /db_xref="GeneID:28916"
                     /db_xref="HGNC:5789"
                     /db_xref="IMGT/GENE-DB:IGKV2-40"
     V_segment       complement(68451760..68452074)
                     /gene="IGKV2-40"
                     /gene_synonym="IGKV240; O11; O11a"
                     /standard_name="IGKV2-40"
                     /note="Derived by automated computational analysis using
                     gene prediction method: Curated Genomic."
                     /db_xref="GeneID:28916"
     CDS             complement(<68451760..68452073)
                     /gene="IGKV2-40"
                     /gene_synonym="IGKV240; O11; O11a"
                     /exception="rearrangement required for product"
                     /note="Derived by automated computational analysis using
                     gene prediction method: Curated Genomic."
                     /codon_start=1
                     /db_xref="IMGT/LIGM:IGKV2-40"
                     /db_xref="GeneID:28916"
                     /db_xref="HGNC:5789"
                     /db_xref="IMGT/GENE-DB:IGKV2-40"

So the Entrez version appears to have been updated to avoid the odd caret
location, but I think they made a mistake: surely the CDS should be
complement(68451760..>68452073), not complement(<68451760..68452073)
as stated?

Have you contacted the NCBI about this? If not, I will.

I believe that the caret location in the FTP GenBank file is invalid and
Biopython is right to reject it (but I would like to confirm this with the
NCBI). For now the simplest solution is for you to manually edit that feature.
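
As a rough sketch of that manual edit (this is not something Biopython
provides; the function name and regular expression are hypothetical and only
for illustration), something like the following would rewrite the two caret
locations from the FTP file into the form Entrez shows, assuming the
Entrez form is the one you want:

```python
import re

# Hypothetical pattern for the "start..end^next" form seen in the FTP record,
# e.g. 68451760..68452073^68452074. It is not part of Biopython.
CARET_RANGE = re.compile(r"(\d+)\.\.(\d+)\^(\d+)")


def strip_caret(location):
    """Rewrite 'A..B^C' as 'A..C', mirroring the Entrez version of the record.

    Any '<' or '>' markers and the complement(...) wrapper are left untouched.
    """
    return CARET_RANGE.sub(r"\1..\3", location)


# The two problem locations from the FTP copy of NT_022184:
print(strip_caret("complement(68451760..68452073^68452074)"))
print(strip_caret("complement(<68451760..68452072^68452073)"))
```

This only touches the location string itself, so you would still need to
apply it to the relevant feature lines in your local copy of the file before
parsing.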

Thanks,

Peter




