[Biojava-l] Parsing Genbank/EMBL/XML Sequences from binary NCBI ASN.1 daily update files

Seth Johnson johnson.biotech at gmail.com
Fri Jun 2 18:46:26 UTC 2006


Hi Mark,

Thank you for your suggestions.  I followed them and seem to have
found a bug that causes an exception in the readINSDseqDNA parser.

http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nucleotide&val=94481355

The problem with the above sequence in INSDseq format is caused by the
presence of <INSDQualifier_name> tags without corresponding
<INSDQualifier_value> tags:

          <INSDQualifier>
            <INSDQualifier_name>environmental_sample</INSDQualifier_name>
          </INSDQualifier>

I have not checked whether other parsers handle it correctly when
it is converted from the original NCBI ASN.1 format.

Could the code be adjusted so that, if there is no <INSDQualifier_value>
tag, it assumes the value to be 'null'?
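
As a rough illustration of the behaviour I have in mind, here is a
minimal sketch that uses only the JDK's DOM parser. It is not BioJava's
INSDseq code, and the names QualifierSketch and readQualifiers are made
up for this example. A qualifier with no <INSDQualifier_value> child is
simply recorded with a null value instead of raising an error:

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class QualifierSketch {

    // Hypothetical helper (not part of BioJava): maps each qualifier name to
    // its value, or to null when <INSDQualifier_value> is absent.
    public static Map<String, String> readQualifiers(String xml) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));

        Map<String, String> result = new LinkedHashMap<>();
        NodeList qualifiers = doc.getElementsByTagName("INSDQualifier");
        for (int i = 0; i < qualifiers.getLength(); i++) {
            Element qualifier = (Element) qualifiers.item(i);
            NodeList names = qualifier.getElementsByTagName("INSDQualifier_name");
            if (names.getLength() == 0) {
                continue; // a qualifier without a name cannot be stored
            }
            NodeList values = qualifier.getElementsByTagName("INSDQualifier_value");
            // Missing value element: fall back to null rather than failing.
            String value = values.getLength() > 0
                    ? values.item(0).getTextContent()
                    : null;
            result.put(names.item(0).getTextContent(), value);
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Test fragment: the wrapper element name is arbitrary.
        String xml =
                "<quals>"
              + "<INSDQualifier>"
              + "<INSDQualifier_name>environmental_sample</INSDQualifier_name>"
              + "</INSDQualifier>"
              + "<INSDQualifier>"
              + "<INSDQualifier_name>mol_type</INSDQualifier_name>"
              + "<INSDQualifier_value>genomic DNA</INSDQualifier_value>"
              + "</INSDQualifier>"
              + "</quals>";
        System.out.println(readQualifiers(xml));
    }
}
```

A map keeps the sketch short; qualifiers that can repeat within one
feature (db_xref, for instance) would need a list of name/value pairs
instead.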

Regards,

Seth

On 6/1/06, mark.schreiber at novartis.com <mark.schreiber at novartis.com> wrote:
> Hi Seth -
>
> The BioJavaX parsers are still quite new and have not been heavily tested,
> so your experiences can help us quite a lot. The parsers were initially
> designed to be quite strict and follow the GenBank etc. specifications.
> However, there are often minor variations to those specs which cause
> things to break.
>
> To help us find the bugs, can you make sure you are using the very latest
> version of biojava from CVS? For example, I was under the impression that
> the author = null problem had been solved. In each case an example file
> and the full stack trace are very useful as well. In some cases you have
> provided these, so we have a starting point.
>
> Also, if you have ideas on ways to fix the problems, your suggestions would
> be greatly appreciated. We only have a very small team of active
> developers, many of whom are unfortunately very busy just now.
>
> Hopefully we can get to this soon.
>
> - Mark
>
>
>
>
>
> "Seth Johnson" <johnson.biotech at gmail.com>
> Sent by: biojava-l-bounces at lists.open-bio.org
> 06/02/2006 06:03 AM
>
>
>         To:     biojava-l at lists.open-bio.org
>         cc:     (bcc: Mark Schreiber/GP/Novartis)
>         Subject:        [Biojava-l] Parsing Genbank/EMBL/XML Sequences from binary NCBI ASN.1 daily update files
>
>
> Hi All,
>
> I'm a newbie to the whole BioJava(X) API and was hoping to get some
> clarification on several issues that I'm having.
> I am developing a parser that would take as input "NCBI Incremental
> ASN.1 Sequence Updates to Genbank" files (
> ftp://ftp.ncbi.nih.gov/ncbi-asn1/daily-nc ) , gunzip them, and use the
> ASN2GB converter (
> ftp://ftp.ncbi.nih.gov/asn1-converters/by_program/asn2gb ) to convert
> resulting sequences to a format parsable by BioJava(X) (
> http://www.penguin-soft.com/penguin/man/1/asn2gb.html ). This is where
> my problems start.
>
> ISSUE 1:
> I've tried to parse all of the formats that ASN2GB outputs ( GenBank
> (default) , EMBL, nucleotide GBSet (XML), nucleotide INSDSet (XML),
> tiny seq (XML) ) using either BioJava or BioJavaX API.  Only GenBank
> format is recognized by the
> "RichSequence.IOTools.readGenbankDNA(inBuf,gbNspace)" function with
> some exceptions that I'll describe in issue #2.  This is the code that
> I'm using to parse, for example, the EMBL output:
>
> BufferedReader inBuf = new BufferedReader(new FileReader("embl_output.emb"));
> Namespace gbNspace = (Namespace) RichObjectFactory.getObject(
>         SimpleNamespace.class, new Object[]{"gbSpace"});
> RichSequenceIterator gbSeqs = RichSequence.IOTools.readEMBLDNA(inBuf, gbNspace);
> while (gbSeqs.hasNext()) {
>     try {
>         RichSequence rs = gbSeqs.nextRichSequence();
>         // Further processing of the RichSequence object goes here
>     } catch (BioException be) {
>         be.printStackTrace();
>     }
> }
>
> The multi-sequence EMBL file looks like this:
> ---------------------------------------------------------------------------------
> ID   DQ472184  standard; DNA; INV; 546 BP.
> XX
> AC   DQ472184;
> XX
> SV   DQ472184.1
> DT   15-MAY-2006
> XX
> DE   Trypanosoma cruzi strain CL Brener actin-related protein 3 (ARC21) gene,
> DE   complete cds.
> XX
> KW   .
> XX
> OS   Trypanosoma cruzi strain CL Brener
> OC   Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma;
> OC   Schizotrypanum.
> XX
> RN   [1]
> RP   1-546
> RA   De Melo L.D.B.;
> RT   "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins";
> RL   Unpublished.
> XX
> RN   [2]
> RP   1-546
> RA   De Melo L.D.B.;
> RT   ;
> RL   Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases.
> RL   Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio
> RL   de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ
> RL   21949-900, Brazil
> XX
> FH   Key             Location/Qualifiers
> FH
> FT   source          1..546
> FT                   /organism="Trypanosoma cruzi strain CL Brener"
> FT                   /mol_type="genomic DNA"
> FT                   /strain="CL Brener"
> FT                   /db_xref="taxon:353153"
> FT   gene            <1..>546
> FT                   /gene="ARC21"
> FT                   /note="TcARC21"
> FT   mRNA            <1..>546
> FT                   /gene="ARC21"
> FT                   /product="actin-related protein 3"
> FT   CDS             1..546
> FT                   /gene="ARC21"
> FT                   /note="actin-binding protein; ARPC3 21 kDa; putative
> FT                   member of Arp2/3 complex"
> FT                   /codon_start=1
> FT                   /product="actin-related protein 3"
> FT                   /protein_id="ABF13401.1"
> FT                   /db_xref="GI:93360014"
> FT                   /translation="MHSRWNGYEESSLLGCGVYPLRRTSRLTPPGPAPRMDEMIEEG
> FT                   EEEPQDIVDEAFYFFKPHMFFRNFPIKGAGDRVILYLTMYLHECLKKIVQLKREEAH
> FT                   SVLLNYATMPFASPGEKDFPFNAFFPAGNEEEQEKWREYAKQLRLEANARLIEKVFL
> FT                   FPEKDGTGNKFWMAFAKRPFLASS"
>      atgcacagca ggtggaatgg gtatgaagaa agtagtcttt tgggctgcgg tgtttatccg  60
>      cttcgccgca cgtcacggct cactccaccc ggccctgcac cgcggatgga tgaaatgatt 120
>      gaggagggcg aagaggagcc acaagacatt gttgacgagg cattttactt ttttaagccc 180
>      cacatgtttt ttcgtaattt tcccattaag ggtgctggtg atcgtgtcat tctgtacttg 240
>      acgatgtacc ttcatgagtg tttgaagaaa attgtccagt tgaagcgtga agaggcccat 300
>      tctgtgcttc ttaactacgc tacgatgccg tttgcatcac caggggaaaa ggactttccg 360
>      tttaacgcgt ttttccctgc tgggaatgag gaggaacaag aaaaatggcg agagtatgca 420
>      aaacagcttc gattggaggc caacgcacgt ctcattgaga aggtttttct ttttccagag 480
>      aaggacggca ccggaaacaa gttctggatg gcgtttgcga agaggccttt cttggcttct 540
>      agttag   546
> //
> ID   DQ472185  standard; DNA; INV; 543 BP.
> XX
> AC   DQ472185;
> XX
> SV   DQ472185.1
> DT   15-MAY-2006
> XX
> DE   Trypanosoma cruzi strain CL Brener actin-related protein 4 (ARC20) gene,
> DE   complete cds.
> XX
> KW   .
> XX
> OS   Trypanosoma cruzi strain CL Brener
> OC   Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma;
> OC   Schizotrypanum.
> XX
> RN   [1]
> RP   1-543
> RA   De Melo L.D.B.;
> RT   "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins";
> RL   Unpublished.
> XX
> RN   [2]
> RP   1-543
> RA   De Melo L.D.B.;
> RT   ;
> RL   Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases.
> RL   Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio
> RL   de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ
> RL   21949-900, Brazil
> XX
> FH   Key             Location/Qualifiers
> FH
> FT   source          1..543
> FT                   /organism="Trypanosoma cruzi strain CL Brener"
> FT                   /mol_type="genomic DNA"
> FT                   /strain="CL Brener"
> FT                   /db_xref="taxon:353153"
> FT   gene            <1..>543
> FT                   /gene="ARC20"
> FT                   /note="TcARC20"
> FT   mRNA            <1..>543
> FT                   /gene="ARC20"
> FT                   /product="actin-related protein 4"
> FT   CDS             1..543
> FT                   /gene="ARC20"
> FT                   /note="actin-binding protein; ARPC4 20 kDa; putative
> FT                   member of Arp2/3 complex"
> FT                   /codon_start=1
> FT                   /product="actin-related protein 4"
> FT                   /protein_id="ABF13402.1"
> FT                   /db_xref="GI:93360016"
> FT                   /translation="MATAYLPYYDCIKCTLHAALCIGNYPSCTVERHNKPEVEVADH
> FT                   LENNGEIKVQDFLLNPIRIVRSEQESCLIEPSINSTRISVSFLKSDAIAEIIARKYV
> FT                   GFLAQRAKQFHILRKKPIPGYDISFLISHEEVETMHRNRIIQFIITFLMDIDADIAA
> FT                   MKLNVNQRARRAAMEFFLALNFT"
>      atggcaaccg cctatttgcc ttactacgac tgcatcaagt gcacgttgca cgcggctttg  60
>      tgcatcggga attatccttc atgtaccgtg gagcgtcata ataaaccaga agttgaggtt 120
>      gcagaccatc tggagaataa tggtgaaata aaagtacaag atttccttct taaccccata 180
>      cgcattgtgc gttcagaaca ggaaagttgt cttattgaac ctagtataaa cagcacacgc 240
>      atatctgtat cgtttctcaa gagcgacgct attgcagaga ttattgcccg aaagtacgtt 300
>      ggatttttag ctcagcgagc caaacagttt cacatcttga gaaaaaagcc tattccggga 360
>      tatgatataa gttttttgat ttctcacgag gaagtagaaa caatgcatag gaataggatt 420
>      attcaattta taattacttt cttgatggat attgatgctg acattgctgc aatgaagttg 480
>      aatgtgaatc aacgtgcacg tcgagcagcg atggaattct ttcttgcatt gaatttcaca 540
>      tga   543
> //
> -----------------------------------------------------------------------
> I get an exception message "Could Not Read Sequence".  Same thing
> happens if I use the readINSDSetDNA reader instead of readEMBLDNA one
> with the following INSDset file (beginning of the file):
>
> <?xml version="1.0"?>
> <!DOCTYPE INSDSeq PUBLIC "-//NCBI//INSD INSDSeq/EN" "INSD_INSDSeq.dtd">
> <INSDSeq>
>   <INSDSeq_locus>DQ022078</INSDSeq_locus>
>   <INSDSeq_length>16729</INSDSeq_length>
>   <INSDSeq_moltype>DNA</INSDSeq_moltype>
>   <INSDSeq_topology>linear</INSDSeq_topology>
>   <INSDSeq_division>ENV</INSDSeq_division>
>   <INSDSeq_update-date>15-MAY-2006</INSDSeq_update-date>
>   <INSDSeq_create-date>15-MAY-2006</INSDSeq_create-date>
>   <INSDSeq_definition>Uncultured bacterium WWRS-2005 putative
> aminoglycoside phosphotransferase (a3.001), putative oxidoreductase
> (a3.002), putative oxidoreductase (a3.003), putative beta-lactamase
> class C (estA3), putative permease (a3.005), putative transmembrane
> signal peptide (a3.006), thiol-disulfide isomerase (a3.007), histone
> acetyltransferase HPA2 (a3.008), putative enzyme (a3.009), putative
> asparaginase (a3.010), hypothetical protein (a3.011), hypothetical
> protein (a3.012), putative membrane protease subunit (a3.013),
> putative haloalkane dehalogenase (a3.014), putative transcriptional
> regulator (a3.015), putative peptidyl-dipeptidase Dcp (a3.016), and
> hypothetical protein (a3.017) genes, complete cds</INSDSeq_definition>
>   <INSDSeq_primary-accession>DQ022078</INSDSeq_primary-accession>
>   <INSDSeq_other-seqids>
>     <INSDSeqid>gb|DQ022078.1|</INSDSeqid>
>     <INSDSeqid>gi|71842722</INSDSeqid>
>   </INSDSeq_other-seqids>
>   <INSDSeq_keywords>
>     <INSDKeyword>ENV</INSDKeyword>
>   </INSDSeq_keywords>
>   <INSDSeq_references>
>     <INSDReference>
>       <INSDReference_reference>?</INSDReference_reference>
>       <INSDReference_position>1..16729</INSDReference_position>
>       <INSDReference_authors>
>         <INSDAuthor>Schmeisser,C.</INSDAuthor>
>         <INSDAuthor>Elend,C.</INSDAuthor>
>         <INSDAuthor>Streit,W.R.</INSDAuthor>
>       </INSDReference_authors>
>       <INSDReference_title>Isolation and biochemical characterization
> of two novel metagenome derived esterases</INSDReference_title>
>       <INSDReference_journal>Appl. Environ. Microbiol. 0:0-0
> (2006)</INSDReference_journal>
>     </INSDReference>
>     <INSDReference>
>       <INSDReference_reference>?</INSDReference_reference>
>       <INSDReference_position>1..16729</INSDReference_position>
>       <INSDReference_authors>
>         <INSDAuthor>Schmeisser,C.</INSDAuthor>
>         <INSDAuthor>Elend,C.</INSDAuthor>
>         <INSDAuthor>Streit,W.R.</INSDAuthor>
>       </INSDReference_authors>
>       <INSDReference_journal>Submitted (29-APR-2005) to the
> EMBL/GenBank/DDBJ databases. Molekulare Enzymtechnologie, University
> Duisburg-Essen, Lotharstrasse 1, Duisburg D-47057,
> Germany</INSDReference_journal>
>     </INSDReference>
>   </INSDSeq_references>
>
> So my question is whether ASN2GB produces output that is
> incompatible with the BioJava parsers, whether there is a problem with
> the sequences themselves, or whether the problem lies with the majority
> of the parsers. Could it be that I'm using the API incorrectly for the
> above formats? The GenBank parser works as advertised, with some
> exceptions described below:
>
> ISSUE #2:
> When I try to parse GenBank files using the following code:
>
> BufferedReader inBuf = new BufferedReader(new FileReader("genbank_output.gb"));
> Namespace gbNspace = (Namespace) RichObjectFactory.getObject(
>         SimpleNamespace.class, new Object[]{"gbSpace"});
> RichSequenceIterator gbSeqs = RichSequence.IOTools.readGenbankDNA(inBuf, gbNspace);
> while (gbSeqs.hasNext()) {
>     try {
>         RichSequence rs = gbSeqs.nextRichSequence();
>         // Further processing of the RichSequence object goes here
>     } catch (BioException be) {
>         be.printStackTrace();
>     }
> }
>
> Genbank file in question:
>
> LOCUS       BC074905                 838 bp    mRNA    linear   PRI 15-APR-2006
> DEFINITION  Homo sapiens kallikrein 14, mRNA (cDNA clone MGC:104038
>             IMAGE:30915482), complete cds.
> ACCESSION   BC074905
> VERSION     BC074905.2  GI:50959825
> KEYWORDS    MGC.
> SOURCE      Homo sapiens (human)
>   ORGANISM  Homo sapiens
>             Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
>             Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
>             Catarrhini; Hominidae; Homo.
> REFERENCE   1  (bases 1 to 838)
>   AUTHORS   Strausberg,R.L., Feingold,E.A., Grouse,L.H., Derge,J.G.,
>             Klausner,R.D., Collins,F.S., Wagner,L., Shenmen,C.M., Schuler,G.D.,
>             Altschul,S.F., Zeeberg,B., Buetow,K.H., Schaefer,C.F., Bhat,N.K.,
>             Hopkins,R.F., Jordan,H., Moore,T., Max,S.I., Wang,J., Hsieh,F.,
>             Diatchenko,L., Marusina,K., Farmer,A.A., Rubin,G.M., Hong,L.,
>             Stapleton,M., Soares,M.B., Bonaldo,M.F., Casavant,T.L.,
>             Scheetz,T.E., Brownstein,M.J., Usdin,T.B., Toshiyuki,S.,
>             Carninci,P., Prange,C., Raha,S.S., Loquellano,N.A., Peters,G.J.,
>             Abramson,R.D., Mullahy,S.J., Bosak,S.A., McEwan,P.J.,
>             McKernan,K.J., Malek,J.A., Gunaratne,P.H., Richards,S.,
>             Worley,K.C., Hale,S., Garcia,A.M., Gay,L.J., Hulyk,S.W.,
>             Villalon,D.K., Muzny,D.M., Sodergren,E.J., Lu,X., Gibbs,R.A.,
>             Fahey,J., Helton,E., Ketteman,M., Madan,A., Rodrigues,S.,
>             Sanchez,A., Whiting,M., Madan,A., Young,A.C., Shevchenko,Y.,
>             Bouffard,G.G., Blakesley,R.W., Touchman,J.W., Green,E.D.,
>             Dickson,M.C., Rodriguez,A.C., Grimwood,J., Schmutz,J., Myers,R.M.,
>             Butterfield,Y.S., Krzywinski,M.I., Skalska,U., Smailus,D.E.,
>             Schnerch,A., Schein,J.E., Jones,S.J. and Marra,M.A.
>   CONSRTM   Mammalian Gene Collection Program Team
>   TITLE     Generation and initial analysis of more than 15,000 full-length
>             human and mouse cDNA sequences
>   JOURNAL   Proc. Natl. Acad. Sci. U.S.A. 99 (26), 16899-16903 (2002)
>    PUBMED   12477932
> REFERENCE   2  (bases 1 to 838)
>   CONSRTM   NIH MGC Project
>   TITLE     Direct Submission
>   JOURNAL   Submitted (25-JUN-2004) National Institutes of Health, Mammalian
>             Gene Collection (MGC), Bethesda, MD 20892-2590, USA
>   REMARK    NIH-MGC Project URL: http://mgc.nci.nih.gov
> COMMENT     On Aug 4, 2004 this sequence version replaced gi:49901832.
>             Contact: MGC help desk
>             Email: cgapbs-r at mail.nih.gov
>             Tissue Procurement: Genome Sequence Centre, British Columbia Cancer
>             Center
>             cDNA Library Preparation: British Columbia Cancer Research Center
>             cDNA Library Arrayed by: The I.M.A.G.E. Consortium (LLNL)
>             DNA Sequencing by: Genome Sequence Centre,
>             BC Cancer Agency, Vancouver, BC, Canada
>             info at bcgsc.bc.ca
>             Martin Hirst, Thomas Zeng, Ryan Morin, Michelle Moksa, Johnson
>             Pang, Diana Mah, Jing Wang, Kieth Fichter, Eric Chuah, Allen
>             Delaney, Rob Kirkpatrick, Agnes Baross, Sarah Barber, Mabel
>             Brown-John, Steve S. Chand, William Chow, Ryan Babakaiff, Dave
>             Wong, Corey Matsuo, Jaclyn Beland, Susan Gibson, Luis delRio, Ruth
>             Featherstone, Malachi Griffith, Obi Griffith, Ran Guin, Nancy Liao,
>             Kim MacDonald,  Mike R. Mayo, Josh Moran, Diana Palmquist, JR
>             Santos, Duane Smailus, Jeff Stott, Miranda Tsai, George Yang,
>             Jacquie Schein, Asim Siddiqui,Steven Jones, Rob Holt, Marco Marra.
>
>             Clone distribution: MGC clone distribution information can be found
>             through the I.M.A.G.E. Consortium/LLNL at: http://image.llnl.gov
>             Series: IRBU Plate: 4 Row: C Column: 3.
>
>             Differences found between this sequence and the human reference
>             genome (build 36) are described in misc_difference features below.
> FEATURES             Location/Qualifiers
>      source          1..838
>                      /organism="Homo sapiens"
>                      /mol_type="mRNA"
>                      /db_xref="taxon:9606"
>                      /clone="MGC:104038 IMAGE:30915482"
>                      /tissue_type="Lung, PCR rescued clones"
>                      /clone_lib="NIH_MGC_273"
>                      /lab_host="DH10B"
>                      /note="Vector: pCR4 Topo TA with reversed insert"
>      gene            1..838
>                      /gene="KLK14"
>                      /note="synonym: KLK-L6"
>                      /db_xref="GeneID:43847"
>                      /db_xref="HGNC:6362"
>                      /db_xref="IMGT/GENE-DB:6362"
>                      /db_xref="MIM:606135"
>      CDS             49..804
>                      /gene="KLK14"
>                      /codon_start=1
>                      /product="KLK14 protein"
>                      /protein_id="AAH74905.1"
>                      /db_xref="GI:50959826"
>                      /db_xref="GeneID:43847"
>                      /db_xref="HGNC:6362"
>                      /db_xref="IMGT/GENE-DB:6362"
>                      /db_xref="MIM:606135"
>                      /translation="MFLLLTALQVLAIAMTRSQEDENKIIGGYTCTRSSQPWQAALLA
>                      GPRRRFLCGGALLSGQWVITAAHCGRPILQVALGKHNLRRWEATQQVLRVVRQVTHPN
>                      YNSRTHDNDLMLLQLQQPARIGRAVRPIEVTQACASPGTSCRVSGWGTISSPIARYPA
>                      SLQCVNINISPDEVCQKAYPRTITPGMVCAGVPQGGKDSCQGDSGGPLVCRGQLQGLV
>                      SWGMERCALPGYPGVYTNLCKYRSWIEETMRDK"
>      misc_difference 98
>                      /gene="KLK14"
>                      /note="'G' in cDNA is 'A' in the human genome; amino acid
>                      difference: 'R' in cDNA, 'Q' in the human genome."
>      misc_difference 133
>                      /gene="KLK14"
>                      /note="'T' in cDNA is 'C' in the human genome; amino acid
>                      difference: 'Y' in cDNA, 'H' in the human genome."
> ORIGIN
>         1 atgtccctga gggtcttggg ctctgggacc tggccctcag cccctaaaat gttcctcctg
>        61 ctgacagcac ttcaagtcct ggctatagcc atgacacgga gccaagagga tgagaacaag
>       121 ataattggtg gctatacgtg cacccggagc tcccagccgt ggcaggcggc cctgctggcg
>       181 ggtcccaggc gccgcttcct ctgcggaggc gccctgcttt caggccagtg ggtcatcact
>       241 gctgctcact gcggccgccc gatccttcag gttgccctgg gcaagcacaa cctgaggagg
>       301 tgggaggcca cccagcaggt gctgcgcgtg gttcgtcagg tgacgcaccc caactacaac
>       361 tcccggaccc acgacaacga cctcatgctg ctgcagctac agcagcccgc acggatcggg
>       421 agggcagtca ggcccattga ggtcacccag gcctgtgcca gccccgggac ctcctgccga
>       481 gtgtcaggct ggggaactat atccagcccc atcgccaggt accccgcctc tctgcaatgc
>       541 gtgaacatca acatctcccc ggatgaggtg tgccagaagg cctatcctag aaccatcacg
>       601 cctggcatgg tctgtgcagg agttccccag ggcgggaagg actcttgtca gggtgactct
>       661 gggggacccc tggtgtgcag aggacagctc cagggcctcg tgtcttgggg aatggagcgc
>       721 tgcgccctgc ctggctaccc cggtgtctac accaacctgt gcaagtacag aagctggatt
>       781 gaggaaacga tgcgggacaa atgatggtct tcacggtggg atggacctcg tcagctgc
> //
>
> I get the following exception:
>
> java.lang.IllegalArgumentException: Authors string cannot be null
> org.biojava.bio.BioException: Could not read sequence
>         at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112)
>         at exonhit.parsers.GenBankParser.getSequences(GenBankParser.java:107)
>         at exonhit.parsers.GenBankParser.runGBparser(GenBankParser.java:258)
>         at exonhit.parsers.GenBankParser.main(GenBankParser.java:341)
> Caused by: java.lang.IllegalArgumentException: Authors string cannot be null
>         at org.biojavax.DocRefAuthor$Tools.parseAuthorString(DocRefAuthor.java:76)
>         at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:356)
>         at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109)
>
> -----------------------------------------------------------------------
>
> I'm trying to see what could be the problem with this particular
> sequence.  Looks to me like the AUTHORS portion is not getting parsed
> correctly.  Any ideas would be greatly appreciated!
>
> --
> Best Regards,
>
>
> Seth Johnson
> Senior Bioinformatics Associate
>
> Ph: (202) 470-0900
> Fx: (775) 251-0358


-- 
Best Regards,


Seth Johnson
Senior Bioinformatics Associate

Ph: (202) 470-0900
Fx: (775) 251-0358
