[Biojava-l] Parsing Genbank/EMBL/XML Sequences from binary NCBI ASN.1 daily update files

Richard Holland richard.holland at ebi.ac.uk
Mon Jun 5 15:16:37 UTC 2006


Doh!

I am in desperate need of coffee methinks... that's the second error in
EMBLFormat directly related to me being stupid when I cut-and-pasted the
stuff for the new 87+ ID line format...

Should be fixed now in CVS (as of about 30 seconds ago).
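
For anyone hitting the same two traces, here is a minimal sketch of how
the two ID line layouts could be told apart. It is not the actual
EMBLFormat code: the class name EmblIdLine and both regexes are made up
for illustration, and the 87+ layout shown in the comment is assumed
from the EMBL documentation. The idea is to check matches() before
calling group(). Matcher.group() after a failed match throws
IllegalStateException ("No match found"), and Integer.parseInt(null)
throws NumberFormatException, which is consistent with the two stack
traces quoted below.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of BioJava.
final class EmblIdLine {
    // Assumed release 87+ layout:
    //   ID   X56734; SV 1; linear; mRNA; STD; PLN; 1859 BP.
    private static final Pattern NEW_STYLE = Pattern.compile(
            "^ID\\s+(\\S+);\\s+SV\\s+(\\d+);\\s+\\S+;\\s+[^;]+;"
            + "\\s+\\S+;\\s+\\S+;\\s+(\\d+)\\s+BP\\.\\s*$");

    // Pre-87 layout, as in the file from this thread:
    //   ID   DQ472184  standard; DNA; INV; 546 BP.
    private static final Pattern OLD_STYLE = Pattern.compile(
            "^ID\\s+(\\S+)\\s+\\S+;\\s+[^;]+;\\s+\\S+;\\s+(\\d+)\\s+BP\\.\\s*$");

    final String accession;
    final Integer version; // null when the ID line carries no version (pre-87)
    final int length;

    private EmblIdLine(String accession, Integer version, int length) {
        this.accession = accession;
        this.version = version;
        this.length = length;
    }

    static EmblIdLine parse(String line) {
        // Try the newer layout first; only read groups after a successful match.
        Matcher m = NEW_STYLE.matcher(line);
        if (m.matches()) {
            return new EmblIdLine(m.group(1),
                    Integer.valueOf(m.group(2)),
                    Integer.parseInt(m.group(3)));
        }
        m = OLD_STYLE.matcher(line);
        if (m.matches()) {
            // No version on the old ID line; it has to come from the SV line instead.
            return new EmblIdLine(m.group(1), null, Integer.parseInt(m.group(2)));
        }
        throw new IllegalArgumentException("Unrecognised EMBL ID line: " + line);
    }

    public static void main(String[] args) {
        EmblIdLine id = parse("ID   DQ472184  standard; DNA; INV; 546 BP.");
        System.out.println(id.accession + " version=" + id.version + " length=" + id.length);
    }
}
```

Keeping the version as a nullable Integer makes the "no version on this
line" case explicit instead of passing a null string to parseInt.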

cheers,
Richard

On Mon, 2006-06-05 at 11:05 -0400, Seth Johnson wrote:
> Hi Richard,
> 
> I got another exception on the EMBL format:
> =============================
> org.biojava.bio.BioException: Could not read sequence
>         at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112)
>         at exonhit.parsers.GenBankParser.main(GenBankParser.java:347)
> Caused by: java.lang.IllegalStateException: No match found
>         at java.util.regex.Matcher.group(Matcher.java:461)
>         at org.biojavax.bio.seq.io.EMBLFormat.readRichSequence(EMBLFormat.java:311)
>         at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109)
>         ... 1 more
> Java Result: -1
> =============================
> I used the same file from the test directory (AY069118.em).
> 
> 
> Seth
> 
> On 6/5/06, Richard Holland <richard.holland at ebi.ac.uk> wrote:
> > This one should be fixed in CVS now. Typo on my behalf - I put in code
> > to make it work with both 87+ and pre-87 versions of EMBL, then got the
> > regexes the wrong way round!!
> >
> ...
> >
> > cheers,
> > Richard
> >
> >
> > On Fri, 2006-06-02 at 13:04 -0400, Seth Johnson wrote:
> > > Hi Richard,
> > >
> > > I made sure I have the latest source code from CVS compiled
> > > (EMBLFormat.java & GenbankFormat.java are from 05/24/06).  I'm happy
> > > to report that the GenBank issue is solved!!!!
> > > As for EMBL parsing, I apologize for not providing the stack dump
> > > for ISSUE #1.  Here's the dump of the exception:
> > > --------------------------------------------------------
> > > org.biojava.bio.BioException: Could not read sequence
> > >         at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112)
> > >         at exonhit.parsers.GenBankParser.main(GenBankParser.java:359)
> > > Caused by: java.lang.NumberFormatException: null
> > >         at java.lang.Integer.parseInt(Integer.java:415)
> > >         at java.lang.Integer.parseInt(Integer.java:497)
> > >         at org.biojavax.bio.seq.io.EMBLFormat.readRichSequence(EMBLFormat.java:299)
> > >         at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109)
> > >         ... 1 more
> > > Java Result: -1
> > > -------------------------------------------------------
> > > Here, again, is the code that I'm using to parse:
> > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > >         BufferedReader gbBR = null;
> > >         try {
> > >             gbBR = new BufferedReader(
> > >                     new FileReader("C:\\Download\\ASN2BSML\\seth_06_02.emb"));
> > >         } catch (FileNotFoundException fnfe) {
> > >             fnfe.printStackTrace();
> > >             System.exit(-1);
> > >         }
> > >         Namespace gbNspace = (Namespace) RichObjectFactory.getObject(
> > >                 SimpleNamespace.class, new Object[]{"gbSpace"});
> > >         RichSequenceIterator gbSeqs = RichSequence.IOTools.readEMBLDNA(gbBR, gbNspace);
> > >         while (gbSeqs.hasNext()) {
> > >             try {
> > >                 RichSequence rs = gbSeqs.nextRichSequence();
> > >                 NCBITaxon myTaxon = rs.getTaxon();
> > >             } catch (BioException be) {
> > >                 be.printStackTrace();
> > >                 System.exit(-1);
> > >             }
> > >         }
> > > ~~~~~~~~~~~~~~~~~~~~~~~~~
> > > And here's the EMBL file that I'm trying to parse:
> > > +++++++++++++++++++++++++
> > > ID   DQ472184  standard; DNA; INV; 546 BP.
> > > XX
> > > AC   DQ472184;
> > > XX
> > > SV   DQ472184.1
> > > DT   15-MAY-2006
> > > XX
> > > DE   Trypanosoma cruzi strain CL Brener actin-related protein 3 (ARC21) gene,
> > > DE   complete cds.
> > > XX
> > > KW   .
> > > XX
> > > OS   Trypanosoma cruzi strain CL Brener
> > > OC   Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma;
> > > OC   Schizotrypanum.
> > > XX
> > > RN   [1]
> > > RP   1-546
> > > RA   De Melo L.D.B.;
> > > RT   "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins";
> > > RL   Unpublished.
> > > XX
> > > RN   [2]
> > > RP   1-546
> > > RA   De Melo L.D.B.;
> > > RT   ;
> > > RL   Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases.
> > > RL   Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio
> > > RL   de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ
> > > RL   21949-900, Brazil
> > > XX
> > > FH   Key             Location/Qualifiers
> > > FH
> > > FT   source          1..546
> > > FT                   /organism="Trypanosoma cruzi strain CL Brener"
> > > FT                   /mol_type="genomic DNA"
> > > FT                   /strain="CL Brener"
> > > FT                   /db_xref="taxon:353153"
> > > FT   gene            <1..>546
> > > FT                   /gene="ARC21"
> > > FT                   /note="TcARC21"
> > > FT   mRNA            <1..>546
> > > FT                   /gene="ARC21"
> > > FT                   /product="actin-related protein 3"
> > > FT   CDS             1..546
> > > FT                   /gene="ARC21"
> > > FT                   /note="actin-binding protein; ARPC3 21 kDa; putative
> > > FT                   member of Arp2/3 complex"
> > > FT                   /codon_start=1
> > > FT                   /product="actin-related protein 3"
> > > FT                   /protein_id="ABF13401.1"
> > > FT                   /db_xref="GI:93360014"
> > > FT                   /translation="MHSRWNGYEESSLLGCGVYPLRRTSRLTPPGPAPRMDEMIEEG
> > > FT                   EEEPQDIVDEAFYFFKPHMFFRNFPIKGAGDRVILYLTMYLHECLKKIVQLKREEAH
> > > FT                   SVLLNYATMPFASPGEKDFPFNAFFPAGNEEEQEKWREYAKQLRLEANARLIEKVFL
> > > FT                   FPEKDGTGNKFWMAFAKRPFLASS"
> > >      atgcacagca ggtggaatgg gtatgaagaa agtagtcttt tgggctgcgg tgtttatccg        60
> > >      cttcgccgca cgtcacggct cactccaccc ggccctgcac cgcggatgga tgaaatgatt       120
> > >      gaggagggcg aagaggagcc acaagacatt gttgacgagg cattttactt ttttaagccc       180
> > >      cacatgtttt ttcgtaattt tcccattaag ggtgctggtg atcgtgtcat tctgtacttg       240
> > >      acgatgtacc ttcatgagtg tttgaagaaa attgtccagt tgaagcgtga agaggcccat       300
> > >      tctgtgcttc ttaactacgc tacgatgccg tttgcatcac caggggaaaa ggactttccg       360
> > >      tttaacgcgt ttttccctgc tgggaatgag gaggaacaag aaaaatggcg agagtatgca       420
> > >      aaacagcttc gattggaggc caacgcacgt ctcattgaga aggtttttct ttttccagag       480
> > >      aaggacggca ccggaaacaa gttctggatg gcgtttgcga agaggccttt cttggcttct       540
> > >      agttag                                                                  546
> > > //
> > > ID   DQ472185  standard; DNA; INV; 543 BP.
> > > XX
> > > AC   DQ472185;
> > > XX
> > > SV   DQ472185.1
> > > DT   15-MAY-2006
> > > XX
> > > DE   Trypanosoma cruzi strain CL Brener actin-related protein 4 (ARC20) gene,
> > > DE   complete cds.
> > > XX
> > > KW   .
> > > XX
> > > OS   Trypanosoma cruzi strain CL Brener
> > > OC   Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma;
> > > OC   Schizotrypanum.
> > > XX
> > > RN   [1]
> > > RP   1-543
> > > RA   De Melo L.D.B.;
> > > RT   "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins";
> > > RL   Unpublished.
> > > XX
> > > RN   [2]
> > > RP   1-543
> > > RA   De Melo L.D.B.;
> > > RT   ;
> > > RL   Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases.
> > > RL   Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio
> > > RL   de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ
> > > RL   21949-900, Brazil
> > > XX
> > > FH   Key             Location/Qualifiers
> > > FH
> > > FT   source          1..543
> > > FT                   /organism="Trypanosoma cruzi strain CL Brener"
> > > FT                   /mol_type="genomic DNA"
> > > FT                   /strain="CL Brener"
> > > FT                   /db_xref="taxon:353153"
> > > FT   gene            <1..>543
> > > FT                   /gene="ARC20"
> > > FT                   /note="TcARC20"
> > > FT   mRNA            <1..>543
> > > FT                   /gene="ARC20"
> > > FT                   /product="actin-related protein 4"
> > > FT   CDS             1..543
> > > FT                   /gene="ARC20"
> > > FT                   /note="actin-binding protein; ARPC4 20 kDa; putative
> > > FT                   member of Arp2/3 complex"
> > > FT                   /codon_start=1
> > > FT                   /product="actin-related protein 4"
> > > FT                   /protein_id="ABF13402.1"
> > > FT                   /db_xref="GI:93360016"
> > > FT                   /translation="MATAYLPYYDCIKCTLHAALCIGNYPSCTVERHNKPEVEVADH
> > > FT                   LENNGEIKVQDFLLNPIRIVRSEQESCLIEPSINSTRISVSFLKSDAIAEIIARKYV
> > > FT                   GFLAQRAKQFHILRKKPIPGYDISFLISHEEVETMHRNRIIQFIITFLMDIDADIAA
> > > FT                   MKLNVNQRARRAAMEFFLALNFT"
> > >      atggcaaccg cctatttgcc ttactacgac tgcatcaagt gcacgttgca cgcggctttg        60
> > >      tgcatcggga attatccttc atgtaccgtg gagcgtcata ataaaccaga agttgaggtt       120
> > >      gcagaccatc tggagaataa tggtgaaata aaagtacaag atttccttct taaccccata       180
> > >      cgcattgtgc gttcagaaca ggaaagttgt cttattgaac ctagtataaa cagcacacgc       240
> > >      atatctgtat cgtttctcaa gagcgacgct attgcagaga ttattgcccg aaagtacgtt       300
> > >      ggatttttag ctcagcgagc caaacagttt cacatcttga gaaaaaagcc tattccggga       360
> > >      tatgatataa gttttttgat ttctcacgag gaagtagaaa caatgcatag gaataggatt       420
> > >      attcaattta taattacttt cttgatggat attgatgctg acattgctgc aatgaagttg       480
> > >      aatgtgaatc aacgtgcacg tcgagcagcg atggaattct ttcttgcatt gaatttcaca       540
> > >      tga                                                                     543
> > > //
> > > +++++++++++++++++++++++++++++++++
> > >
> > > It looks to me like there's some kind of problem with parsing the
> > > sequence version number. I even tried the sequence from the test
> > > directory (AY069118.em), with the same outcome.
> > >
> > > Regards,
> > >
> > > Seth
> > >
> > > On 6/2/06, Richard Holland <richard.holland at ebi.ac.uk> wrote:
> > > > Hi Seth.
> > > >
> > > > Your second point, about the authors string not being read correctly in
> > > > Genbank format, has been fixed (or should have been if I got the code
> > > > right!). Could you check the latest version of biojava-live out of CVS
> > > > and give it another go? Basically the parser did not recognise the
> > > > CONSRTM tag, as it is not mentioned in the sample record provided by
> > > > NCBI, which is what I based the parser on.
> > > >
> > > > I've set it up now so that it reads the CONSRTM tag, but the value is
> > > > merged with the authors tag with (consortium) appended. There will still
> > > > be problems if the consortium value has commas in it - not sure how to
> > > > fix this yet.
> > > >
> > > > Your first point is harder to solve because you did not provide a
> > > > complete stack trace for the exceptions you are getting. The complete
> > > > stack trace would enable me to identify exactly where things are going
> > > > wrong and give me a better chance of fixing them. Could you send the
> > > > stack trace, and I'll see what I can do.
> > > >
> > > > cheers,
> > > > Richard
> > > >
> > > >
> > > > On Thu, 2006-06-01 at 18:03 -0400, Seth Johnson wrote:
> > > > > Hi All,
> > > > >
> > > > > I'm a newbie to the whole BioJava(X) API and was hoping to get some
> > > > > clarification on several issues that I'm having.
> > > > > I am developing a parser that would take as input "NCBI Incremental
> > > > > ASN.1 Sequence Updates to Genbank" files (
> > > > > ftp://ftp.ncbi.nih.gov/ncbi-asn1/daily-nc ) , gunzip them, and use the
> > > > > ASN2GB converter (
> > > > > ftp://ftp.ncbi.nih.gov/asn1-converters/by_program/asn2gb ) to convert
> > > > > resulting sequences to a format parsable by BioJava(X) (
> > > > > http://www.penguin-soft.com/penguin/man/1/asn2gb.html ). This is where
> > > > > my problems start.
> > > > >
> > > > > ISSUE 1:
> > > > > I've tried to parse all of the formats that ASN2GB outputs ( GenBank
> > > > > (default) , EMBL, nucleotide GBSet (XML), nucleotide INSDSet (XML),
> > > > > tiny seq (XML) ) using either BioJava or BioJavaX API.  Only GenBank
> > > > > format is recognized by the
> > > > > "RichSequence.IOTools.readGenbankDNA(inBuf,gbNspace)" function with
> > > > > some exceptions that I'll describe in issue #2.  This is the code that
> > > > > I'm using to parse, for example, the EMBL output:
> > > > >
> > > > > BufferedReader inBuf = new BufferedReader(new FileReader("embl_output.emb"));
> > > > > Namespace gbNspace = (Namespace) RichObjectFactory.getObject(
> > > > >         SimpleNamespace.class, new Object[]{"gbSpace"});
> > > > > RichSequenceIterator gbSeqs = RichSequence.IOTools.readEMBLDNA(inBuf, gbNspace);
> > > > > while (gbSeqs.hasNext()) {
> > > > >     try {
> > > > >         RichSequence rs = gbSeqs.nextRichSequence();
> > > > >         // Further processing of the RichSequence object from here
> > > > >     } catch (BioException be) {
> > > > >         be.printStackTrace();
> > > > >     }
> > > > > }
> > > > >
> > > > > The multi-sequence EMBL file looks like this:
> > > > > ---------------------------------------------------------------------------------
> > > > > ID   DQ472184  standard; DNA; INV; 546 BP.
> > > > > XX
> > > > > AC   DQ472184;
> > > > > XX
> > > > > SV   DQ472184.1
> > > > > DT   15-MAY-2006
> > > > > XX
> > > > > DE   Trypanosoma cruzi strain CL Brener actin-related protein 3 (ARC21) gene,
> > > > > DE   complete cds.
> > > > > XX
> > > > > KW   .
> > > > > XX
> > > > > OS   Trypanosoma cruzi strain CL Brener
> > > > > OC   Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma;
> > > > > OC   Schizotrypanum.
> > > > > XX
> > > > > RN   [1]
> > > > > RP   1-546
> > > > > RA   De Melo L.D.B.;
> > > > > RT   "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins";
> > > > > RL   Unpublished.
> > > > > XX
> > > > > RN   [2]
> > > > > RP   1-546
> > > > > RA   De Melo L.D.B.;
> > > > > RT   ;
> > > > > RL   Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases.
> > > > > RL   Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio
> > > > > RL   de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ
> > > > > RL   21949-900, Brazil
> > > > > XX
> > > > > FH   Key             Location/Qualifiers
> > > > > FH
> > > > > FT   source          1..546
> > > > > FT                   /organism="Trypanosoma cruzi strain CL Brener"
> > > > > FT                   /mol_type="genomic DNA"
> > > > > FT                   /strain="CL Brener"
> > > > > FT                   /db_xref="taxon:353153"
> > > > > FT   gene            <1..>546
> > > > > FT                   /gene="ARC21"
> > > > > FT                   /note="TcARC21"
> > > > > FT   mRNA            <1..>546
> > > > > FT                   /gene="ARC21"
> > > > > FT                   /product="actin-related protein 3"
> > > > > FT   CDS             1..546
> > > > > FT                   /gene="ARC21"
> > > > > FT                   /note="actin-binding protein; ARPC3 21 kDa; putative
> > > > > FT                   member of Arp2/3 complex"
> > > > > FT                   /codon_start=1
> > > > > FT                   /product="actin-related protein 3"
> > > > > FT                   /protein_id="ABF13401.1"
> > > > > FT                   /db_xref="GI:93360014"
> > > > > FT                   /translation="MHSRWNGYEESSLLGCGVYPLRRTSRLTPPGPAPRMDEMIEEG
> > > > > FT                   EEEPQDIVDEAFYFFKPHMFFRNFPIKGAGDRVILYLTMYLHECLKKIVQLKREEAH
> > > > > FT                   SVLLNYATMPFASPGEKDFPFNAFFPAGNEEEQEKWREYAKQLRLEANARLIEKVFL
> > > > > FT                   FPEKDGTGNKFWMAFAKRPFLASS"
> > > > >      atgcacagca ggtggaatgg gtatgaagaa agtagtcttt tgggctgcgg tgtttatccg        60
> > > > >      cttcgccgca cgtcacggct cactccaccc ggccctgcac cgcggatgga tgaaatgatt       120
> > > > >      gaggagggcg aagaggagcc acaagacatt gttgacgagg cattttactt ttttaagccc       180
> > > > >      cacatgtttt ttcgtaattt tcccattaag ggtgctggtg atcgtgtcat tctgtacttg       240
> > > > >      acgatgtacc ttcatgagtg tttgaagaaa attgtccagt tgaagcgtga agaggcccat       300
> > > > >      tctgtgcttc ttaactacgc tacgatgccg tttgcatcac caggggaaaa ggactttccg       360
> > > > >      tttaacgcgt ttttccctgc tgggaatgag gaggaacaag aaaaatggcg agagtatgca       420
> > > > >      aaacagcttc gattggaggc caacgcacgt ctcattgaga aggtttttct ttttccagag       480
> > > > >      aaggacggca ccggaaacaa gttctggatg gcgtttgcga agaggccttt cttggcttct       540
> > > > >      agttag                                                                  546
> > > > > //
> > > > > ID   DQ472185  standard; DNA; INV; 543 BP.
> > > > > XX
> > > > > AC   DQ472185;
> > > > > XX
> > > > > SV   DQ472185.1
> > > > > DT   15-MAY-2006
> > > > > XX
> > > > > DE   Trypanosoma cruzi strain CL Brener actin-related protein 4 (ARC20) gene,
> > > > > DE   complete cds.
> > > > > XX
> > > > > KW   .
> > > > > XX
> > > > > OS   Trypanosoma cruzi strain CL Brener
> > > > > OC   Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma;
> > > > > OC   Schizotrypanum.
> > > > > XX
> > > > > RN   [1]
> > > > > RP   1-543
> > > > > RA   De Melo L.D.B.;
> > > > > RT   "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins";
> > > > > RL   Unpublished.
> > > > > XX
> > > > > RN   [2]
> > > > > RP   1-543
> > > > > RA   De Melo L.D.B.;
> > > > > RT   ;
> > > > > RL   Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases.
> > > > > RL   Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio
> > > > > RL   de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ
> > > > > RL   21949-900, Brazil
> > > > > XX
> > > > > FH   Key             Location/Qualifiers
> > > > > FH
> > > > > FT   source          1..543
> > > > > FT                   /organism="Trypanosoma cruzi strain CL Brener"
> > > > > FT                   /mol_type="genomic DNA"
> > > > > FT                   /strain="CL Brener"
> > > > > FT                   /db_xref="taxon:353153"
> > > > > FT   gene            <1..>543
> > > > > FT                   /gene="ARC20"
> > > > > FT                   /note="TcARC20"
> > > > > FT   mRNA            <1..>543
> > > > > FT                   /gene="ARC20"
> > > > > FT                   /product="actin-related protein 4"
> > > > > FT   CDS             1..543
> > > > > FT                   /gene="ARC20"
> > > > > FT                   /note="actin-binding protein; ARPC4 20 kDa; putative
> > > > > FT                   member of Arp2/3 complex"
> > > > > FT                   /codon_start=1
> > > > > FT                   /product="actin-related protein 4"
> > > > > FT                   /protein_id="ABF13402.1"
> > > > > FT                   /db_xref="GI:93360016"
> > > > > FT                   /translation="MATAYLPYYDCIKCTLHAALCIGNYPSCTVERHNKPEVEVADH
> > > > > FT                   LENNGEIKVQDFLLNPIRIVRSEQESCLIEPSINSTRISVSFLKSDAIAEIIARKYV
> > > > > FT                   GFLAQRAKQFHILRKKPIPGYDISFLISHEEVETMHRNRIIQFIITFLMDIDADIAA
> > > > > FT                   MKLNVNQRARRAAMEFFLALNFT"
> > > > >      atggcaaccg cctatttgcc ttactacgac tgcatcaagt gcacgttgca cgcggctttg        60
> > > > >      tgcatcggga attatccttc atgtaccgtg gagcgtcata ataaaccaga agttgaggtt       120
> > > > >      gcagaccatc tggagaataa tggtgaaata aaagtacaag atttccttct taaccccata       180
> > > > >      cgcattgtgc gttcagaaca ggaaagttgt cttattgaac ctagtataaa cagcacacgc       240
> > > > >      atatctgtat cgtttctcaa gagcgacgct attgcagaga ttattgcccg aaagtacgtt       300
> > > > >      ggatttttag ctcagcgagc caaacagttt cacatcttga gaaaaaagcc tattccggga       360
> > > > >      tatgatataa gttttttgat ttctcacgag gaagtagaaa caatgcatag gaataggatt       420
> > > > >      attcaattta taattacttt cttgatggat attgatgctg acattgctgc aatgaagttg       480
> > > > >      aatgtgaatc aacgtgcacg tcgagcagcg atggaattct ttcttgcatt gaatttcaca       540
> > > > >      tga                                                                     543
> > > > > //
> > > > > -----------------------------------------------------------------------
> > > > > I get an exception message "Could Not Read Sequence".  The same thing
> > > > > happens if I use the readINSDSetDNA reader instead of the readEMBLDNA
> > > > > one with the following INSDSet file (beginning of the file):
> > > > >
> > > > > <?xml version="1.0"?>
> > > > > <!DOCTYPE INSDSeq PUBLIC "-//NCBI//INSD INSDSeq/EN" "INSD_INSDSeq.dtd">
> > > > > <INSDSeq>
> > > > >   <INSDSeq_locus>DQ022078</INSDSeq_locus>
> > > > >   <INSDSeq_length>16729</INSDSeq_length>
> > > > >   <INSDSeq_moltype>DNA</INSDSeq_moltype>
> > > > >   <INSDSeq_topology>linear</INSDSeq_topology>
> > > > >   <INSDSeq_division>ENV</INSDSeq_division>
> > > > >   <INSDSeq_update-date>15-MAY-2006</INSDSeq_update-date>
> > > > >   <INSDSeq_create-date>15-MAY-2006</INSDSeq_create-date>
> > > > >   <INSDSeq_definition>Uncultured bacterium WWRS-2005 putative
> > > > > aminoglycoside phosphotransferase (a3.001), putative oxidoreductase
> > > > > (a3.002), putative oxidoreductase (a3.003), putative beta-lactamase
> > > > > class C (estA3), putative permease (a3.005), putative transmembrane
> > > > > signal peptide (a3.006), thiol-disulfide isomerase (a3.007), histone
> > > > > acetyltransferase HPA2 (a3.008), putative enzyme (a3.009), putative
> > > > > asparaginase (a3.010), hypothetical protein (a3.011), hypothetical
> > > > > protein (a3.012), putative membrane protease subunit (a3.013),
> > > > > putative haloalkane dehalogenase (a3.014), putative transcriptional
> > > > > regulator (a3.015), putative peptidyl-dipeptidase Dcp (a3.016), and
> > > > > hypothetical protein (a3.017) genes, complete cds</INSDSeq_definition>
> > > > >   <INSDSeq_primary-accession>DQ022078</INSDSeq_primary-accession>
> > > > >   <INSDSeq_other-seqids>
> > > > >     <INSDSeqid>gb|DQ022078.1|</INSDSeqid>
> > > > >     <INSDSeqid>gi|71842722</INSDSeqid>
> > > > >   </INSDSeq_other-seqids>
> > > > >   <INSDSeq_keywords>
> > > > >     <INSDKeyword>ENV</INSDKeyword>
> > > > >   </INSDSeq_keywords>
> > > > >   <INSDSeq_references>
> > > > >     <INSDReference>
> > > > >       <INSDReference_reference>?</INSDReference_reference>
> > > > >       <INSDReference_position>1..16729</INSDReference_position>
> > > > >       <INSDReference_authors>
> > > > >         <INSDAuthor>Schmeisser,C.</INSDAuthor>
> > > > >         <INSDAuthor>Elend,C.</INSDAuthor>
> > > > >         <INSDAuthor>Streit,W.R.</INSDAuthor>
> > > > >       </INSDReference_authors>
> > > > >       <INSDReference_title>Isolation and biochemical characterization
> > > > > of two novel metagenome derived esterases</INSDReference_title>
> > > > >       <INSDReference_journal>Appl. Environ. Microbiol. 0:0-0
> > > > > (2006)</INSDReference_journal>
> > > > >     </INSDReference>
> > > > >     <INSDReference>
> > > > >       <INSDReference_reference>?</INSDReference_reference>
> > > > >       <INSDReference_position>1..16729</INSDReference_position>
> > > > >       <INSDReference_authors>
> > > > >         <INSDAuthor>Schmeisser,C.</INSDAuthor>
> > > > >         <INSDAuthor>Elend,C.</INSDAuthor>
> > > > >         <INSDAuthor>Streit,W.R.</INSDAuthor>
> > > > >       </INSDReference_authors>
> > > > >       <INSDReference_journal>Submitted (29-APR-2005) to the
> > > > > EMBL/GenBank/DDBJ databases. Molekulare Enzymtechnologie, University
> > > > > Duisburg-Essen, Lotharstrasse 1, Duisburg D-47057,
> > > > > Germany</INSDReference_journal>
> > > > >     </INSDReference>
> > > > >   </INSDSeq_references>
> > > > >
> > > > > So my question is whether ASN2GB produces output that is
> > > > > incompatible with the BioJava parsers, whether there is a problem
> > > > > with the sequences themselves, or whether most of the parsers have
> > > > > problems. Could it be that I'm using the API incorrectly for the
> > > > > above formats? The GenBank parser works as advertised, with some
> > > > > exceptions described below:
> > > > >
> > > > > ISSUE #2:
> > > > > When I try to parse GenBank files using the following code:
> > > > >
> > > > > BufferedReader inBuf = new BufferedReader(new FileReader("genbank_output.gb"));
> > > > > Namespace gbNspace = (Namespace) RichObjectFactory.getObject(
> > > > >         SimpleNamespace.class, new Object[]{"gbSpace"});
> > > > > RichSequenceIterator gbSeqs = RichSequence.IOTools.readGenbankDNA(inBuf, gbNspace);
> > > > > while (gbSeqs.hasNext()) {
> > > > >     try {
> > > > >         RichSequence rs = gbSeqs.nextRichSequence();
> > > > >         // Further processing of the RichSequence object from here
> > > > >     } catch (BioException be) {
> > > > >         be.printStackTrace();
> > > > >     }
> > > > > }
> > > > >
> > > > > Genbank file in question:
> > > > >
> > > > > LOCUS       BC074905                 838 bp    mRNA    linear   PRI 15-APR-2006
> > > > > DEFINITION  Homo sapiens kallikrein 14, mRNA (cDNA clone MGC:104038
> > > > >             IMAGE:30915482), complete cds.
> > > > > ACCESSION   BC074905
> > > > > VERSION     BC074905.2  GI:50959825
> > > > > KEYWORDS    MGC.
> > > > > SOURCE      Homo sapiens (human)
> > > > >   ORGANISM  Homo sapiens
> > > > >             Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
> > > > >             Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
> > > > >             Catarrhini; Hominidae; Homo.
> > > > > REFERENCE   1  (bases 1 to 838)
> > > > >   AUTHORS   Strausberg,R.L., Feingold,E.A., Grouse,L.H., Derge,J.G.,
> > > > >             Klausner,R.D., Collins,F.S., Wagner,L., Shenmen,C.M., Schuler,G.D.,
> > > > >             Altschul,S.F., Zeeberg,B., Buetow,K.H., Schaefer,C.F., Bhat,N.K.,
> > > > >             Hopkins,R.F., Jordan,H., Moore,T., Max,S.I., Wang,J., Hsieh,F.,
> > > > >             Diatchenko,L., Marusina,K., Farmer,A.A., Rubin,G.M., Hong,L.,
> > > > >             Stapleton,M., Soares,M.B., Bonaldo,M.F., Casavant,T.L.,
> > > > >             Scheetz,T.E., Brownstein,M.J., Usdin,T.B., Toshiyuki,S.,
> > > > >             Carninci,P., Prange,C., Raha,S.S., Loquellano,N.A., Peters,G.J.,
> > > > >             Abramson,R.D., Mullahy,S.J., Bosak,S.A., McEwan,P.J.,
> > > > >             McKernan,K.J., Malek,J.A., Gunaratne,P.H., Richards,S.,
> > > > >             Worley,K.C., Hale,S., Garcia,A.M., Gay,L.J., Hulyk,S.W.,
> > > > >             Villalon,D.K., Muzny,D.M., Sodergren,E.J., Lu,X., Gibbs,R.A.,
> > > > >             Fahey,J., Helton,E., Ketteman,M., Madan,A., Rodrigues,S.,
> > > > >             Sanchez,A., Whiting,M., Madan,A., Young,A.C., Shevchenko,Y.,
> > > > >             Bouffard,G.G., Blakesley,R.W., Touchman,J.W., Green,E.D.,
> > > > >             Dickson,M.C., Rodriguez,A.C., Grimwood,J., Schmutz,J., Myers,R.M.,
> > > > >             Butterfield,Y.S., Krzywinski,M.I., Skalska,U., Smailus,D.E.,
> > > > >             Schnerch,A., Schein,J.E., Jones,S.J. and Marra,M.A.
> > > > >   CONSRTM   Mammalian Gene Collection Program Team
> > > > >   TITLE     Generation and initial analysis of more than 15,000 full-length
> > > > >             human and mouse cDNA sequences
> > > > >   JOURNAL   Proc. Natl. Acad. Sci. U.S.A. 99 (26), 16899-16903 (2002)
> > > > >    PUBMED   12477932
> > > > > REFERENCE   2  (bases 1 to 838)
> > > > >   CONSRTM   NIH MGC Project
> > > > >   TITLE     Direct Submission
> > > > >   JOURNAL   Submitted (25-JUN-2004) National Institutes of Health, Mammalian
> > > > >             Gene Collection (MGC), Bethesda, MD 20892-2590, USA
> > > > >   REMARK    NIH-MGC Project URL: http://mgc.nci.nih.gov
> > > > > COMMENT     On Aug 4, 2004 this sequence version replaced gi:49901832.
> > > > >             Contact: MGC help desk
> > > > >             Email: cgapbs-r at mail.nih.gov
> > > > >             Tissue Procurement: Genome Sequence Centre, British Columbia Cancer
> > > > >             Center
> > > > >             cDNA Library Preparation: British Columbia Cancer Research Center
> > > > >             cDNA Library Arrayed by: The I.M.A.G.E. Consortium (LLNL)
> > > > >             DNA Sequencing by: Genome Sequence Centre,
> > > > >             BC Cancer Agency, Vancouver, BC, Canada
> > > > >             info at bcgsc.bc.ca
> > > > >             Martin Hirst, Thomas Zeng, Ryan Morin, Michelle Moksa, Johnson
> > > > >             Pang, Diana Mah, Jing Wang, Kieth Fichter, Eric Chuah, Allen
> > > > >             Delaney, Rob Kirkpatrick, Agnes Baross, Sarah Barber, Mabel
> > > > >             Brown-John, Steve S. Chand, William Chow, Ryan Babakaiff, Dave
> > > > >             Wong, Corey Matsuo, Jaclyn Beland, Susan Gibson, Luis delRio, Ruth
> > > > >             Featherstone, Malachi Griffith, Obi Griffith, Ran Guin, Nancy Liao,
> > > > >             Kim MacDonald,  Mike R. Mayo, Josh Moran, Diana Palmquist, JR
> > > > >             Santos, Duane Smailus, Jeff Stott, Miranda Tsai, George Yang,
> > > > >             Jacquie Schein, Asim Siddiqui,Steven Jones, Rob Holt, Marco Marra.
> > > > >
> > > > >             Clone distribution: MGC clone distribution information can be found
> > > > >             through the I.M.A.G.E. Consortium/LLNL at: http://image.llnl.gov
> > > > >             Series: IRBU Plate: 4 Row: C Column: 3.
> > > > >
> > > > >             Differences found between this sequence and the human reference
> > > > >             genome (build 36) are described in misc_difference features below.
> > > > > FEATURES             Location/Qualifiers
> > > > >      source          1..838
> > > > >                      /organism="Homo sapiens"
> > > > >                      /mol_type="mRNA"
> > > > >                      /db_xref="taxon:9606"
> > > > >                      /clone="MGC:104038 IMAGE:30915482"
> > > > >                      /tissue_type="Lung, PCR rescued clones"
> > > > >                      /clone_lib="NIH_MGC_273"
> > > > >                      /lab_host="DH10B"
> > > > >                      /note="Vector: pCR4 Topo TA with reversed insert"
> > > > >      gene            1..838
> > > > >                      /gene="KLK14"
> > > > >                      /note="synonym: KLK-L6"
> > > > >                      /db_xref="GeneID:43847"
> > > > >                      /db_xref="HGNC:6362"
> > > > >                      /db_xref="IMGT/GENE-DB:6362"
> > > > >                      /db_xref="MIM:606135"
> > > > >      CDS             49..804
> > > > >                      /gene="KLK14"
> > > > >                      /codon_start=1
> > > > >                      /product="KLK14 protein"
> > > > >                      /protein_id="AAH74905.1"
> > > > >                      /db_xref="GI:50959826"
> > > > >                      /db_xref="GeneID:43847"
> > > > >                      /db_xref="HGNC:6362"
> > > > >                      /db_xref="IMGT/GENE-DB:6362"
> > > > >                      /db_xref="MIM:606135"
> > > > >                      /translation="MFLLLTALQVLAIAMTRSQEDENKIIGGYTCTRSSQPWQAALLA
> > > > >                      GPRRRFLCGGALLSGQWVITAAHCGRPILQVALGKHNLRRWEATQQVLRVVRQVTHPN
> > > > >                      YNSRTHDNDLMLLQLQQPARIGRAVRPIEVTQACASPGTSCRVSGWGTISSPIARYPA
> > > > >                      SLQCVNINISPDEVCQKAYPRTITPGMVCAGVPQGGKDSCQGDSGGPLVCRGQLQGLV
> > > > >                      SWGMERCALPGYPGVYTNLCKYRSWIEETMRDK"
> > > > >      misc_difference 98
> > > > >                      /gene="KLK14"
> > > > >                      /note="'G' in cDNA is 'A' in the human genome; amino acid
> > > > >                      difference: 'R' in cDNA, 'Q' in the human genome."
> > > > >      misc_difference 133
> > > > >                      /gene="KLK14"
> > > > >                      /note="'T' in cDNA is 'C' in the human genome; amino acid
> > > > >                      difference: 'Y' in cDNA, 'H' in the human genome."
> > > > > ORIGIN
> > > > >         1 atgtccctga gggtcttggg ctctgggacc tggccctcag cccctaaaat gttcctcctg
> > > > >        61 ctgacagcac ttcaagtcct ggctatagcc atgacacgga gccaagagga tgagaacaag
> > > > >       121 ataattggtg gctatacgtg cacccggagc tcccagccgt ggcaggcggc cctgctggcg
> > > > >       181 ggtcccaggc gccgcttcct ctgcggaggc gccctgcttt caggccagtg ggtcatcact
> > > > >       241 gctgctcact gcggccgccc gatccttcag gttgccctgg gcaagcacaa cctgaggagg
> > > > >       301 tgggaggcca cccagcaggt gctgcgcgtg gttcgtcagg tgacgcaccc caactacaac
> > > > >       361 tcccggaccc acgacaacga cctcatgctg ctgcagctac agcagcccgc acggatcggg
> > > > >       421 agggcagtca ggcccattga ggtcacccag gcctgtgcca gccccgggac ctcctgccga
> > > > >       481 gtgtcaggct ggggaactat atccagcccc atcgccaggt accccgcctc tctgcaatgc
> > > > >       541 gtgaacatca acatctcccc ggatgaggtg tgccagaagg cctatcctag aaccatcacg
> > > > >       601 cctggcatgg tctgtgcagg agttccccag ggcgggaagg actcttgtca gggtgactct
> > > > >       661 gggggacccc tggtgtgcag aggacagctc cagggcctcg tgtcttgggg aatggagcgc
> > > > >       721 tgcgccctgc ctggctaccc cggtgtctac accaacctgt gcaagtacag aagctggatt
> > > > >       781 gaggaaacga tgcgggacaa atgatggtct tcacggtggg atggacctcg tcagctgc
> > > > > //
> > > > >
> > > > > I get the following exception:
> > > > >
> > > > > java.lang.IllegalArgumentException: Authors string cannot be null
> > > > > org.biojava.bio.BioException: Could not read sequence
> > > > >         at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112)
> > > > >         at exonhit.parsers.GenBankParser.getSequences(GenBankParser.java:107)
> > > > >         at exonhit.parsers.GenBankParser.runGBparser(GenBankParser.java:258)
> > > > >         at exonhit.parsers.GenBankParser.main(GenBankParser.java:341)
> > > > > Caused by: java.lang.IllegalArgumentException: Authors string cannot be null
> > > > >         at org.biojavax.DocRefAuthor$Tools.parseAuthorString(DocRefAuthor.java:76)
> > > > >         at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:356)
> > > > >         at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109)
> > > > >
> > > > > -----------------------------------------------------------------------
> > > > >
> > > > > I'm trying to see what could be the problem with this particular
> > > > > sequence.  Looks to me like the AUTHORS portion is not getting parsed
> > > > > correctly.  Any ideas would be greatly appreciated!
> > > > >
> > > > --
> > > > Richard Holland (BioMart Team)
> > > > EMBL-EBI
> > > > Wellcome Trust Genome Campus
> > > > Hinxton
> > > > Cambridge CB10 1SD
> > > > UNITED KINGDOM
> > > > Tel: +44-(0)1223-494416
> > > >
> > > >
> > >
> > >
> > --
> > Richard Holland (BioMart Team)
> > EMBL-EBI
> > Wellcome Trust Genome Campus
> > Hinxton
> > Cambridge CB10 1SD
> > UNITED KINGDOM
> > Tel: +44-(0)1223-494416
> >
> >
> 
> 
-- 
Richard Holland (BioMart Team)
EMBL-EBI
Wellcome Trust Genome Campus
Hinxton
Cambridge CB10 1SD
UNITED KINGDOM
Tel: +44-(0)1223-494416



