[Biojava-l] Parsing Genbank/EMBL/XML Sequences from binary NCBI ASN.1 daily update files

Fri Jun 2 02:24:58 UTC 2006

Hi Seth -

The BioJavaX parsers are still quite new and have not been heavily tested, 
so your experiences can help us quite a lot. The parsers were initially 
designed to be quite strict and to follow the GenBank etc. specifications. 
However, there are often minor variations from those specs which cause 
things to break.
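
As a rough illustration of the kind of variation that can trip a strict 
parser, here is a minimal sketch that checks an EMBL flat file for two 
structural basics: every line starts with a two-letter line code (or is 
indented sequence data, or "//"), and an SQ line comes before the sequence 
data. It uses only the JDK and is not part of BioJava; the names 
EmblLineCheck and check are made up for this example. The EMBL listing 
quoted further down shows no SQ line before the sequence; whether that is 
what the reader objects to is an assumption, not something established here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper, not part of BioJava: basic structural checks on EMBL entries. */
final class EmblLineCheck {

    /** Returns one human-readable message per problem found in the given lines. */
    static List<String> check(List<String> lines) {
        List<String> problems = new ArrayList<>();
        boolean sawSq = false;       // has the current entry had an SQ line yet?
        boolean warnedNoSq = false;  // report a missing SQ line only once per entry
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i);
            int lineNo = i + 1;
            if (line.startsWith("//")) {            // end of entry: reset state
                sawSq = false;
                warnedNoSq = false;
            } else if (line.startsWith("SQ")) {
                sawSq = true;
            } else if (line.startsWith(" ")) {      // indented line: sequence data
                if (!sawSq && !warnedNoSq) {
                    problems.add("line " + lineNo + ": sequence data without a preceding SQ line");
                    warnedNoSq = true;
                }
            } else if (!line.matches("[A-Z]{2}(\\s.*)?")) {
                problems.add("line " + lineNo + ": no two-letter line code: " + line);
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        // A few lines in the shape of the EMBL listing from the original post.
        List<String> entry = Arrays.asList(
            "ID   DQ472184  standard; DNA; INV; 546 BP.",
            "XX",
            "DE   Trypanosoma cruzi strain CL Brener actin-related protein 3 (ARC21)",
            "gene,",
            "     atgcacagca ggtggaatgg  20",
            "//");
        check(entry).forEach(System.out::println);
    }
}
```

Running a check like this over a file before parsing makes it easier to 
say which line deviates when reporting a problem to the list.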

To help us find the bugs, can you make sure you are using the very latest 
version of BioJava from CVS? For example, I was under the impression that 
the author = null problem had been solved. In each case an example file 
and the full stack trace are very useful as well. In some cases you have 
provided these, so we have a starting point.
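
When only the outer message ("Could not read sequence") is at hand, the 
nested cause is usually the informative part. The following is a minimal 
sketch, using only the JDK, that prints every link of an exception's cause 
chain on one line each. The class name CauseChain is made up for this 
example, and the exceptions in main are stand-ins rather than real parser 
output.

```java
/** Hypothetical helper: summarises an exception and all of its nested causes. */
final class CauseChain {

    /** Returns one line per throwable in the chain, outermost first. */
    static String describe(Throwable t) {
        StringBuilder sb = new StringBuilder();
        boolean first = true;
        for (Throwable cur = t; cur != null; cur = cur.getCause()) {
            if (!first) {
                sb.append("caused by: ");
            }
            sb.append(cur.getClass().getName())
              .append(": ")
              .append(cur.getMessage())
              .append('\n');
            first = false;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-ins for the nested exceptions of a failed parse.
        Exception inner = new IllegalArgumentException("Authors string cannot be null");
        Exception outer = new Exception("Could not read sequence", inner);
        System.out.print(describe(outer));
    }
}
```

This does not replace the full stack trace, which still carries the line 
numbers needed to locate the failing code.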

Also, if you have ideas on ways to fix the problems, your suggestions would 
be greatly appreciated. We only have a very small team of active 
developers, many of whom are unfortunately very busy just now.

Hopefully we can get to this soon.

- Mark

"Seth Johnson" <johnson.biotech at gmail.com>
Sent by: biojava-l-bounces at lists.open-bio.org
06/02/2006 06:03 AM

        To:     biojava-l at lists.open-bio.org
        cc:     (bcc: Mark Schreiber/GP/Novartis)
        Subject:        [Biojava-l] Parsing Genbank/EMBL/XML Sequences from binary NCBI ASN.1 daily update files

Hi All,

I'm a newbie to the whole BioJava(X) API and was hoping to get some
clarification on several issues that I'm having.
I am developing a parser that would take as input "NCBI Incremental
ASN.1 Sequence Updates to Genbank" files (
ftp://ftp.ncbi.nih.gov/ncbi-asn1/daily-nc ) , gunzip them, and use the
ASN2GB converter (
ftp://ftp.ncbi.nih.gov/asn1-converters/by_program/asn2gb ) to convert
resulting sequences to a format parsable by BioJava(X) (
http://www.penguin-soft.com/penguin/man/1/asn2gb.html ). This is where
my problems start.
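
For reference, the gunzip step of this pipeline can be sketched with the 
JDK alone, as below. The class name Gunzip and the file names are 
placeholders invented for this example. The ASN2GB conversion step is left 
out because its command line is not given in this post.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.GZIPInputStream;

/** Hypothetical helper: decompresses a .gz file so a converter can read it. */
final class Gunzip {

    /** Writes the decompressed contents of gzFile to target, overwriting it. */
    static void gunzip(Path gzFile, Path target) throws IOException {
        try (InputStream in = new GZIPInputStream(Files.newInputStream(gzFile));
             OutputStream out = Files.newOutputStream(target)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java Gunzip <input.gz> <output>; both paths are supplied by the caller.
        gunzip(Paths.get(args[0]), Paths.get(args[1]));
    }
}
```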

ISSUE #1:
I've tried to parse all of the formats that ASN2GB outputs ( GenBank
(default) , EMBL, nucleotide GBSet (XML), nucleotide INSDSet (XML),
tiny seq (XML) ) using either BioJava or BioJavaX API.  Only GenBank
format is recognized by the
"RichSequence.IOTools.readGenbankDNA(inBuf,gbNspace)" function with
some exceptions that I'll describe in issue #2.  This is the code that
I'm using to parse, for example, the EMBL output:

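import java.io.BufferedReader;
import java.io.FileReader;
import org.biojava.bio.BioException;
import org.biojavax.Namespace;
import org.biojavax.RichObjectFactory;
import org.biojavax.SimpleNamespace;
import org.biojavax.bio.seq.RichSequence;
import org.biojavax.bio.seq.RichSequenceIterator;

BufferedReader inBuf = new BufferedReader(new FileReader("embl_output.emb"));
Namespace gbNspace = (Namespace) RichObjectFactory.getObject(
        SimpleNamespace.class, new Object[]{"gbSpace"});
RichSequenceIterator gbSeqs = RichSequence.IOTools.readEMBLDNA(inBuf, gbNspace);
while (gbSeqs.hasNext()) {
    try {
        RichSequence rs = gbSeqs.nextRichSequence();
        // Further processing of the RichSequence object from here
    } catch (BioException be) {
        be.printStackTrace();
    }
}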

The multi-sequence EMBL file looks like this:
---------------------------------------------------------------------------------
ID   DQ472184  standard; DNA; INV; 546 BP.
XX
AC   DQ472184;
XX
SV   DQ472184.1
DT   15-MAY-2006
XX
DE   Trypanosoma cruzi strain CL Brener actin-related protein 3 (ARC21) gene,
DE   complete cds.
XX
KW   .
XX
OS   Trypanosoma cruzi strain CL Brener
OC   Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma;
OC   Schizotrypanum.
XX
RN   [1]
RP   1-546
RA   De Melo L.D.B.;
RT   "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins";
RL   Unpublished.
XX
RN   [2]
RP   1-546
RA   De Melo L.D.B.;
RT   ;
RL   Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases.
RL   Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio
RL   de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ
RL   21949-900, Brazil
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..546
FT                   /organism="Trypanosoma cruzi strain CL Brener"
FT                   /mol_type="genomic DNA"
FT                   /strain="CL Brener"
FT                   /db_xref="taxon:353153"
FT   gene            <1..>546
FT                   /gene="ARC21"
FT                   /note="TcARC21"
FT   mRNA            <1..>546
FT                   /gene="ARC21"
FT                   /product="actin-related protein 3"
FT   CDS             1..546
FT                   /gene="ARC21"
FT                   /note="actin-binding protein; ARPC3 21 kDa; putative
FT                   member of Arp2/3 complex"
FT                   /codon_start=1
FT                   /product="actin-related protein 3"
FT                   /protein_id="ABF13401.1"
FT                   /db_xref="GI:93360014"
FT /translation="MHSRWNGYEESSLLGCGVYPLRRTSRLTPPGPAPRMDEMIEEG
FT EEEPQDIVDEAFYFFKPHMFFRNFPIKGAGDRVILYLTMYLHECLKKIVQLKREEAH
FT SVLLNYATMPFASPGEKDFPFNAFFPAGNEEEQEKWREYAKQLRLEANARLIEKVFL
FT                   FPEKDGTGNKFWMAFAKRPFLASS"
     atgcacagca ggtggaatgg gtatgaagaa agtagtcttt tgggctgcgg tgtttatccg  60
     cttcgccgca cgtcacggct cactccaccc ggccctgcac cgcggatgga tgaaatgatt  120
     gaggagggcg aagaggagcc acaagacatt gttgacgagg cattttactt ttttaagccc  180
     cacatgtttt ttcgtaattt tcccattaag ggtgctggtg atcgtgtcat tctgtacttg  240
     acgatgtacc ttcatgagtg tttgaagaaa attgtccagt tgaagcgtga agaggcccat  300
     tctgtgcttc ttaactacgc tacgatgccg tttgcatcac caggggaaaa ggactttccg  360
     tttaacgcgt ttttccctgc tgggaatgag gaggaacaag aaaaatggcg agagtatgca  420
     aaacagcttc gattggaggc caacgcacgt ctcattgaga aggtttttct ttttccagag  480
     aaggacggca ccggaaacaa gttctggatg gcgtttgcga agaggccttt cttggcttct  540
     agttag   546
//
ID   DQ472185  standard; DNA; INV; 543 BP.
XX
AC   DQ472185;
XX
SV   DQ472185.1
DT   15-MAY-2006
XX
DE   Trypanosoma cruzi strain CL Brener actin-related protein 4 (ARC20) gene,
DE   complete cds.
XX
KW   .
XX
OS   Trypanosoma cruzi strain CL Brener
OC   Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma;
OC   Schizotrypanum.
XX
RN   [1]
RP   1-543
RA   De Melo L.D.B.;
RT   "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins";
RL   Unpublished.
XX
RN   [2]
RP   1-543
RA   De Melo L.D.B.;
RT   ;
RL   Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases.
RL   Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio
RL   de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ
RL   21949-900, Brazil
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..543
FT                   /organism="Trypanosoma cruzi strain CL Brener"
FT                   /mol_type="genomic DNA"
FT                   /strain="CL Brener"
FT                   /db_xref="taxon:353153"
FT   gene            <1..>543
FT                   /gene="ARC20"
FT                   /note="TcARC20"
FT   mRNA            <1..>543
FT                   /gene="ARC20"
FT                   /product="actin-related protein 4"
FT   CDS             1..543
FT                   /gene="ARC20"
FT                   /note="actin-binding protein; ARPC4 20 kDa; putative
FT                   member of Arp2/3 complex"
FT                   /codon_start=1
FT                   /product="actin-related protein 4"
FT                   /protein_id="ABF13402.1"
FT                   /db_xref="GI:93360016"
FT /translation="MATAYLPYYDCIKCTLHAALCIGNYPSCTVERHNKPEVEVADH
FT LENNGEIKVQDFLLNPIRIVRSEQESCLIEPSINSTRISVSFLKSDAIAEIIARKYV
FT GFLAQRAKQFHILRKKPIPGYDISFLISHEEVETMHRNRIIQFIITFLMDIDADIAA
FT                   MKLNVNQRARRAAMEFFLALNFT"
     atggcaaccg cctatttgcc ttactacgac tgcatcaagt gcacgttgca cgcggctttg  60
     tgcatcggga attatccttc atgtaccgtg gagcgtcata ataaaccaga agttgaggtt  120
     gcagaccatc tggagaataa tggtgaaata aaagtacaag atttccttct taaccccata  180
     cgcattgtgc gttcagaaca ggaaagttgt cttattgaac ctagtataaa cagcacacgc  240
     atatctgtat cgtttctcaa gagcgacgct attgcagaga ttattgcccg aaagtacgtt  300
     ggatttttag ctcagcgagc caaacagttt cacatcttga gaaaaaagcc tattccggga  360
     tatgatataa gttttttgat ttctcacgag gaagtagaaa caatgcatag gaataggatt  420
     attcaattta taattacttt cttgatggat attgatgctg acattgctgc aatgaagttg  480
     aatgtgaatc aacgtgcacg tcgagcagcg atggaattct ttcttgcatt gaatttcaca  540
     tga   543
//
-----------------------------------------------------------------------
I get an exception message "Could Not Read Sequence".  The same thing
happens if I use the readINSDSetDNA reader instead of the readEMBLDNA one
with the following INSDSet file (beginning of the file):

<?xml version="1.0"?>
<!DOCTYPE INSDSeq PUBLIC "-//NCBI//INSD INSDSeq/EN" "INSD_INSDSeq.dtd">
<INSDSeq>
  <INSDSeq_locus>DQ022078</INSDSeq_locus>
  <INSDSeq_length>16729</INSDSeq_length>
  <INSDSeq_moltype>DNA</INSDSeq_moltype>
  <INSDSeq_topology>linear</INSDSeq_topology>
  <INSDSeq_division>ENV</INSDSeq_division>
  <INSDSeq_update-date>15-MAY-2006</INSDSeq_update-date>
  <INSDSeq_create-date>15-MAY-2006</INSDSeq_create-date>
  <INSDSeq_definition>Uncultured bacterium WWRS-2005 putative
aminoglycoside phosphotransferase (a3.001), putative oxidoreductase
(a3.002), putative oxidoreductase (a3.003), putative beta-lactamase
class C (estA3), putative permease (a3.005), putative transmembrane
signal peptide (a3.006), thiol-disulfide isomerase (a3.007), histone
acetyltransferase HPA2 (a3.008), putative enzyme (a3.009), putative
asparaginase (a3.010), hypothetical protein (a3.011), hypothetical
protein (a3.012), putative membrane protease subunit (a3.013),
putative haloalkane dehalogenase (a3.014), putative transcriptional
regulator (a3.015), putative peptidyl-dipeptidase Dcp (a3.016), and
hypothetical protein (a3.017) genes, complete cds</INSDSeq_definition>
  <INSDSeq_primary-accession>DQ022078</INSDSeq_primary-accession>
  <INSDSeq_other-seqids>
    <INSDSeqid>gb|DQ022078.1|</INSDSeqid>
    <INSDSeqid>gi|71842722</INSDSeqid>
  </INSDSeq_other-seqids>
  <INSDSeq_keywords>
    <INSDKeyword>ENV</INSDKeyword>
  </INSDSeq_keywords>
  <INSDSeq_references>
    <INSDReference>
      <INSDReference_reference>?</INSDReference_reference>
      <INSDReference_position>1..16729</INSDReference_position>
      <INSDReference_authors>
        <INSDAuthor>Schmeisser,C.</INSDAuthor>
        <INSDAuthor>Elend,C.</INSDAuthor>
        <INSDAuthor>Streit,W.R.</INSDAuthor>
      </INSDReference_authors>
      <INSDReference_title>Isolation and biochemical characterization
of two novel metagenome derived esterases</INSDReference_title>
      <INSDReference_journal>Appl. Environ. Microbiol. 0:0-0
(2006)</INSDReference_journal>
    </INSDReference>
    <INSDReference>
      <INSDReference_reference>?</INSDReference_reference>
      <INSDReference_position>1..16729</INSDReference_position>
      <INSDReference_authors>
        <INSDAuthor>Schmeisser,C.</INSDAuthor>
        <INSDAuthor>Elend,C.</INSDAuthor>
        <INSDAuthor>Streit,W.R.</INSDAuthor>
      </INSDReference_authors>
      <INSDReference_journal>Submitted (29-APR-2005) to the
EMBL/GenBank/DDBJ databases. Molekulare Enzymtechnologie, University
Duisburg-Essen, Lotharstrasse 1, Duisburg D-47057,
Germany</INSDReference_journal>
    </INSDReference>
  </INSDSeq_references>

So my question is whether ASN2GB produces output that's incompatible
with the BioJava parsers, whether there is a problem with the sequences
themselves, or whether the problem lies with the majority of the parsers.
Could it be that I'm using the API wrongly for the above formats?  The
GenBank parser works as advertised, with some exceptions described
below:

ISSUE #2:
When I try to parse GenBank files using the following code:

BufferedReader inBuf = new BufferedReader(new FileReader("genbank_output.gb"));
Namespace gbNspace = (Namespace) RichObjectFactory.getObject(
        SimpleNamespace.class, new Object[]{"gbSpace"});
RichSequenceIterator gbSeqs = RichSequence.IOTools.readGenbankDNA(inBuf, gbNspace);
while (gbSeqs.hasNext()) {
    try {
        RichSequence rs = gbSeqs.nextRichSequence();
        // Further processing of the RichSequence object from here
    } catch (BioException be) {
        be.printStackTrace();
    }
}

Genbank file in question:

LOCUS       BC074905                 838 bp    mRNA    linear   PRI 15-APR-2006
DEFINITION  Homo sapiens kallikrein 14, mRNA (cDNA clone MGC:104038
            IMAGE:30915482), complete cds.
ACCESSION   BC074905
VERSION     BC074905.2  GI:50959825
KEYWORDS    MGC.
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
            Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 838)
  AUTHORS   Strausberg,R.L., Feingold,E.A., Grouse,L.H., Derge,J.G.,
            Klausner,R.D., Collins,F.S., Wagner,L., Shenmen,C.M., Schuler,G.D.,
            Altschul,S.F., Zeeberg,B., Buetow,K.H., Schaefer,C.F., Bhat,N.K.,
            Hopkins,R.F., Jordan,H., Moore,T., Max,S.I., Wang,J., Hsieh,F.,
            Diatchenko,L., Marusina,K., Farmer,A.A., Rubin,G.M., Hong,L.,
            Stapleton,M., Soares,M.B., Bonaldo,M.F., Casavant,T.L.,
            Scheetz,T.E., Brownstein,M.J., Usdin,T.B., Toshiyuki,S.,
            Carninci,P., Prange,C., Raha,S.S., Loquellano,N.A., Peters,G.J.,
            Abramson,R.D., Mullahy,S.J., Bosak,S.A., McEwan,P.J.,
            McKernan,K.J., Malek,J.A., Gunaratne,P.H., Richards,S.,
            Worley,K.C., Hale,S., Garcia,A.M., Gay,L.J., Hulyk,S.W.,
            Villalon,D.K., Muzny,D.M., Sodergren,E.J., Lu,X., Gibbs,R.A.,
            Fahey,J., Helton,E., Ketteman,M., Madan,A., Rodrigues,S.,
            Sanchez,A., Whiting,M., Madan,A., Young,A.C., Shevchenko,Y.,
            Bouffard,G.G., Blakesley,R.W., Touchman,J.W., Green,E.D.,
            Dickson,M.C., Rodriguez,A.C., Grimwood,J., Schmutz,J., Myers,R.M.,
            Butterfield,Y.S., Krzywinski,M.I., Skalska,U., Smailus,D.E.,
            Schnerch,A., Schein,J.E., Jones,S.J. and Marra,M.A.
  CONSRTM   Mammalian Gene Collection Program Team
  TITLE     Generation and initial analysis of more than 15,000 full-length
            human and mouse cDNA sequences
  JOURNAL   Proc. Natl. Acad. Sci. U.S.A. 99 (26), 16899-16903 (2002)
   PUBMED   12477932
REFERENCE   2  (bases 1 to 838)
  CONSRTM   NIH MGC Project
  TITLE     Direct Submission
  JOURNAL   Submitted (25-JUN-2004) National Institutes of Health, Mammalian
            Gene Collection (MGC), Bethesda, MD 20892-2590, USA
  REMARK    NIH-MGC Project URL: http://mgc.nci.nih.gov
COMMENT     On Aug 4, 2004 this sequence version replaced gi:49901832.
            Contact: MGC help desk
            Email: cgapbs-r at mail.nih.gov
            Tissue Procurement: Genome Sequence Centre, British Columbia Cancer
            Center
            cDNA Library Preparation: British Columbia Cancer Research Center
            cDNA Library Arrayed by: The I.M.A.G.E. Consortium (LLNL)
            DNA Sequencing by: Genome Sequence Centre,
            BC Cancer Agency, Vancouver, BC, Canada
            info at bcgsc.bc.ca
            Martin Hirst, Thomas Zeng, Ryan Morin, Michelle Moksa, Johnson
            Pang, Diana Mah, Jing Wang, Kieth Fichter, Eric Chuah, Allen
            Delaney, Rob Kirkpatrick, Agnes Baross, Sarah Barber, Mabel
            Brown-John, Steve S. Chand, William Chow, Ryan Babakaiff, Dave
            Wong, Corey Matsuo, Jaclyn Beland, Susan Gibson, Luis delRio, Ruth
            Featherstone, Malachi Griffith, Obi Griffith, Ran Guin, Nancy Liao,
            Kim MacDonald,  Mike R. Mayo, Josh Moran, Diana Palmquist, JR
            Santos, Duane Smailus, Jeff Stott, Miranda Tsai, George Yang,
            Jacquie Schein, Asim Siddiqui,Steven Jones, Rob Holt, Marco Marra.

            Clone distribution: MGC clone distribution information can be found
            through the I.M.A.G.E. Consortium/LLNL at: http://image.llnl.gov
            Series: IRBU Plate: 4 Row: C Column: 3.

            Differences found between this sequence and the human reference
            genome (build 36) are described in misc_difference features below.
FEATURES             Location/Qualifiers
     source          1..838
                     /organism="Homo sapiens"
                     /mol_type="mRNA"
                     /db_xref="taxon:9606"
                     /clone="MGC:104038 IMAGE:30915482"
                     /tissue_type="Lung, PCR rescued clones"
                     /clone_lib="NIH_MGC_273"
                     /lab_host="DH10B"
                     /note="Vector: pCR4 Topo TA with reversed insert"
     gene            1..838
                     /gene="KLK14"
                     /note="synonym: KLK-L6"
                     /db_xref="GeneID:43847"
                     /db_xref="HGNC:6362"
                     /db_xref="IMGT/GENE-DB:6362"
                     /db_xref="MIM:606135"
     CDS             49..804
                     /gene="KLK14"
                     /codon_start=1
                     /product="KLK14 protein"
                     /protein_id="AAH74905.1"
                     /db_xref="GI:50959826"
                     /db_xref="GeneID:43847"
                     /db_xref="HGNC:6362"
                     /db_xref="IMGT/GENE-DB:6362"
                     /db_xref="MIM:606135"
 /translation="MFLLLTALQVLAIAMTRSQEDENKIIGGYTCTRSSQPWQAALLA
 GPRRRFLCGGALLSGQWVITAAHCGRPILQVALGKHNLRRWEATQQVLRVVRQVTHPN
 YNSRTHDNDLMLLQLQQPARIGRAVRPIEVTQACASPGTSCRVSGWGTISSPIARYPA
 SLQCVNINISPDEVCQKAYPRTITPGMVCAGVPQGGKDSCQGDSGGPLVCRGQLQGLV
                     SWGMERCALPGYPGVYTNLCKYRSWIEETMRDK"
     misc_difference 98
                     /gene="KLK14"
                     /note="'G' in cDNA is 'A' in the human genome; amino acid
                     difference: 'R' in cDNA, 'Q' in the human genome."
     misc_difference 133
                     /gene="KLK14"
                     /note="'T' in cDNA is 'C' in the human genome; amino acid
                     difference: 'Y' in cDNA, 'H' in the human genome."
ORIGIN
        1 atgtccctga gggtcttggg ctctgggacc tggccctcag cccctaaaat gttcctcctg
       61 ctgacagcac ttcaagtcct ggctatagcc atgacacgga gccaagagga tgagaacaag
      121 ataattggtg gctatacgtg cacccggagc tcccagccgt ggcaggcggc cctgctggcg
      181 ggtcccaggc gccgcttcct ctgcggaggc gccctgcttt caggccagtg ggtcatcact
      241 gctgctcact gcggccgccc gatccttcag gttgccctgg gcaagcacaa cctgaggagg
      301 tgggaggcca cccagcaggt gctgcgcgtg gttcgtcagg tgacgcaccc caactacaac
      361 tcccggaccc acgacaacga cctcatgctg ctgcagctac agcagcccgc acggatcggg
      421 agggcagtca ggcccattga ggtcacccag gcctgtgcca gccccgggac ctcctgccga
      481 gtgtcaggct ggggaactat atccagcccc atcgccaggt accccgcctc tctgcaatgc
      541 gtgaacatca acatctcccc ggatgaggtg tgccagaagg cctatcctag aaccatcacg
      601 cctggcatgg tctgtgcagg agttccccag ggcgggaagg actcttgtca gggtgactct
      661 gggggacccc tggtgtgcag aggacagctc cagggcctcg tgtcttgggg aatggagcgc
      721 tgcgccctgc ctggctaccc cggtgtctac accaacctgt gcaagtacag aagctggatt
      781 gaggaaacga tgcgggacaa atgatggtct tcacggtggg atggacctcg tcagctgc
//

I get the following exception:

java.lang.IllegalArgumentException: Authors string cannot be null
org.biojava.bio.BioException: Could not read sequence
        at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112)
        at exonhit.parsers.GenBankParser.getSequences(GenBankParser.java:107)
        at exonhit.parsers.GenBankParser.runGBparser(GenBankParser.java:258)
        at exonhit.parsers.GenBankParser.main(GenBankParser.java:341)
Caused by: java.lang.IllegalArgumentException: Authors string cannot be null
        at org.biojavax.DocRefAuthor$Tools.parseAuthorString(DocRefAuthor.java:76)
        at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:356)
        at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109)

-----------------------------------------------------------------------

I'm trying to see what could be the problem with this particular
sequence.  Looks to me like the AUTHORS portion is not getting parsed
correctly.  Any ideas would be greatly appreciated!
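
One thing that stands out in the record above: REFERENCE 2 has a CONSRTM
line but no AUTHORS line, which would be consistent with a null authors
string reaching parseAuthorString. That reading is a guess from the
listing and the stack trace, not something checked against the BioJavaX
source. A minimal sketch of a pre-check that lists such REFERENCE blocks
follows; it uses only the JDK, and the names ReferenceCheck and
referencesWithoutAuthors are made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical pre-check, not part of BioJava: finds REFERENCE blocks with no AUTHORS line. */
final class ReferenceCheck {

    /** Returns the REFERENCE header lines of blocks that lack an AUTHORS line. */
    static List<String> referencesWithoutAuthors(List<String> lines) {
        List<String> missing = new ArrayList<>();
        String current = null;       // header of the REFERENCE block being scanned
        boolean hasAuthors = false;
        for (String line : lines) {
            boolean topLevel = !line.isEmpty() && !Character.isWhitespace(line.charAt(0));
            if (topLevel) {
                // Any line starting in column one closes the previous REFERENCE block.
                if (current != null && !hasAuthors) {
                    missing.add(current);
                }
                current = line.startsWith("REFERENCE") ? line.trim() : null;
                hasAuthors = false;
            } else if (current != null && line.trim().startsWith("AUTHORS")) {
                hasAuthors = true;
            }
        }
        if (current != null && !hasAuthors) {
            missing.add(current);    // input ended inside a REFERENCE block
        }
        return missing;
    }

    public static void main(String[] args) {
        // Abbreviated lines in the shape of the GenBank record from this post.
        List<String> record = Arrays.asList(
            "REFERENCE   1  (bases 1 to 838)",
            "  AUTHORS   Strausberg,R.L., Feingold,E.A.",
            "  TITLE     Generation and initial analysis",
            "REFERENCE   2  (bases 1 to 838)",
            "  CONSRTM   NIH MGC Project",
            "  TITLE     Direct Submission",
            "COMMENT     On Aug 4, 2004 this sequence version replaced gi:49901832.",
            "//");
        System.out.println(referencesWithoutAuthors(record));
    }
}
```

If records flagged this way are the same ones that fail to parse, that
would narrow the problem down to references without authors.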

-- 
Best Regards,

Seth Johnson
Senior Bioinformatics Associate

Ph: (202) 470-0900
Fx: (775) 251-0358