[Biojava-l] Parsing Genbank/EMBL/XML Sequences from binary NCBI ASN.1 daily update files
Richard Holland
richard.holland at ebi.ac.uk
Mon Jun 5 15:16:37 UTC 2006
Doh!
I am in desperate need of coffee methinks... that's the second error in
EMBLFormat directly related to me being stupid when I cut-and-pasted the
stuff for the new 87+ ID line format...
Should be fixed now in CVS (as of about 30 seconds ago).
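
For anyone hitting the same "No match found": java.util.regex.Matcher.group throws that IllegalStateException when it is called before a successful match. Below is a minimal sketch of reading both ID line layouts, trying the 87+ pattern first and falling back to the pre-87 one. The class name EmblIdLine and both regexes are assumptions made for illustration, not the code in EMBLFormat.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not BioJava's EMBLFormat): reads an EMBL ID line in either layout. */
final class EmblIdLine {
    // Release 87+ layout, e.g. "ID   X56734; SV 1; linear; mRNA; STD; PLN; 1859 BP."
    private static final Pattern NEW_STYLE = Pattern.compile(
        "^ID\\s+(\\S+);\\s+SV\\s+(\\d+);\\s+(\\w+);\\s+([^;]+);\\s+(\\S+);\\s+(\\S+);\\s+(\\d+)\\s+BP\\.$");
    // Pre-87 layout, e.g. "ID   DQ472184   standard; DNA; INV; 546 BP."
    private static final Pattern OLD_STYLE = Pattern.compile(
        "^ID\\s+(\\S+)\\s+(\\S+);\\s+(?:circular\\s+)?([^;]+);\\s+(\\S+);\\s+(\\d+)\\s+BP\\.$");

    final String accession;
    final Integer version; // null when the ID line carries no version (pre-87)
    final int length;

    private EmblIdLine(String accession, Integer version, int length) {
        this.accession = accession;
        this.version = version;
        this.length = length;
    }

    static EmblIdLine parse(String line) {
        Matcher m = NEW_STYLE.matcher(line);
        if (m.matches()) { // only touch group() after a successful match
            return new EmblIdLine(m.group(1), Integer.valueOf(m.group(2)),
                                  Integer.parseInt(m.group(7)));
        }
        m = OLD_STYLE.matcher(line);
        if (m.matches()) {
            return new EmblIdLine(m.group(1), null, Integer.parseInt(m.group(5)));
        }
        throw new IllegalArgumentException("Unrecognised EMBL ID line: " + line);
    }
}
```

Checking matches() on each pattern in turn means that swapping the two regexes by accident produces a clear "unrecognised line" error instead of a failure inside group().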
cheers,
Richard
On Mon, 2006-06-05 at 11:05 -0400, Seth Johnson wrote:
> Hi Richard,
>
> I got another exception on EMBL format:
> =============================
> org.biojava.bio.BioException: Could not read sequence
> at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112)
> at exonhit.parsers.GenBankParser.main(GenBankParser.java:347)
> Caused by: java.lang.IllegalStateException: No match found
> at java.util.regex.Matcher.group(Matcher.java:461)
> at org.biojavax.bio.seq.io.EMBLFormat.readRichSequence(EMBLFormat.java:311)
> at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109)
> ... 1 more
> Java Result: -1
> =============================
> I used the same file from the test directory (AY069118.em).
>
>
> Seth
>
> On 6/5/06, Richard Holland <richard.holland at ebi.ac.uk> wrote:
> > This one should be fixed in CVS now. Typo on my part - I put in code
> > to make it work with both 87+ and pre-87 versions of EMBL, then got the
> > regexes the wrong way round!!
> >
> ...
> >
> > cheers,
> > Richard
> >
> >
> > On Fri, 2006-06-02 at 13:04 -0400, Seth Johnson wrote:
> > > Hi Richard,
> > >
> > > I made sure I have the latest source code from CVS compiled
> > > (EMBLFormat.java & GenbankFormat.java are from 05/24/06). I'm happy
> > > to report that the GenBank issue is solved!!!!
> > > As for EMBL parsing, I apologize for not providing the stack dump
> > > for ISSUE #1. Here's the dump of the exception:
> > > --------------------------------------------------------
> > > org.biojava.bio.BioException: Could not read sequence
> > > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112)
> > > at exonhit.parsers.GenBankParser.main(GenBankParser.java:359)
> > > Caused by: java.lang.NumberFormatException: null
> > > at java.lang.Integer.parseInt(Integer.java:415)
> > > at java.lang.Integer.parseInt(Integer.java:497)
> > > at org.biojavax.bio.seq.io.EMBLFormat.readRichSequence(EMBLFormat.java:299)
> > > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109)
> > > ... 1 more
> > > Java Result: -1
> > > -------------------------------------------------------
> > > Here, again, is the code that I'm using to parse:
> > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > > BufferedReader gbBR = null;
> > > try {
> > >     gbBR = new BufferedReader(new FileReader("C:\\Download\\ASN2BSML\\seth_06_02.emb"));
> > > } catch (FileNotFoundException fnfe) {
> > >     fnfe.printStackTrace();
> > >     System.exit(-1);
> > > }
> > > Namespace gbNspace = (Namespace) RichObjectFactory.getObject(
> > >         SimpleNamespace.class, new Object[]{"gbSpace"});
> > > RichSequenceIterator gbSeqs = RichSequence.IOTools.readEMBLDNA(gbBR, gbNspace);
> > > while (gbSeqs.hasNext()) {
> > >     try {
> > >         RichSequence rs = gbSeqs.nextRichSequence();
> > >         NCBITaxon myTaxon = rs.getTaxon();
> > >     } catch (BioException be) {
> > >         be.printStackTrace();
> > >         System.exit(-1);
> > >     }
> > > }
> > > ~~~~~~~~~~~~~~~~~~~~~~~~~
> > > And here's the EMBL file that I'm trying to parse:
> > > +++++++++++++++++++++++++
> > > ID DQ472184 standard; DNA; INV; 546 BP.
> > > XX
> > > AC DQ472184;
> > > XX
> > > SV DQ472184.1
> > > DT 15-MAY-2006
> > > XX
> > > DE Trypanosoma cruzi strain CL Brener actin-related protein 3 (ARC21) gene,
> > > DE complete cds.
> > > XX
> > > KW .
> > > XX
> > > OS Trypanosoma cruzi strain CL Brener
> > > OC Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma;
> > > OC Schizotrypanum.
> > > XX
> > > RN [1]
> > > RP 1-546
> > > RA De Melo L.D.B.;
> > > RT "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins";
> > > RL Unpublished.
> > > XX
> > > RN [2]
> > > RP 1-546
> > > RA De Melo L.D.B.;
> > > RT ;
> > > RL Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases.
> > > RL Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio
> > > RL de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ
> > > RL 21949-900, Brazil
> > > XX
> > > FH Key Location/Qualifiers
> > > FH
> > > FT source 1..546
> > > FT /organism="Trypanosoma cruzi strain CL Brener"
> > > FT /mol_type="genomic DNA"
> > > FT /strain="CL Brener"
> > > FT /db_xref="taxon:353153"
> > > FT gene <1..>546
> > > FT /gene="ARC21"
> > > FT /note="TcARC21"
> > > FT mRNA <1..>546
> > > FT /gene="ARC21"
> > > FT /product="actin-related protein 3"
> > > FT CDS 1..546
> > > FT /gene="ARC21"
> > > FT /note="actin-binding protein; ARPC3 21 kDa; putative
> > > FT member of Arp2/3 complex"
> > > FT /codon_start=1
> > > FT /product="actin-related protein 3"
> > > FT /protein_id="ABF13401.1"
> > > FT /db_xref="GI:93360014"
> > > FT /translation="MHSRWNGYEESSLLGCGVYPLRRTSRLTPPGPAPRMDEMIEEG
> > > FT EEEPQDIVDEAFYFFKPHMFFRNFPIKGAGDRVILYLTMYLHECLKKIVQLKREEAH
> > > FT SVLLNYATMPFASPGEKDFPFNAFFPAGNEEEQEKWREYAKQLRLEANARLIEKVFL
> > > FT FPEKDGTGNKFWMAFAKRPFLASS"
> > > atgcacagca ggtggaatgg gtatgaagaa agtagtcttt tgggctgcgg tgtttatccg 60
> > > cttcgccgca cgtcacggct cactccaccc ggccctgcac cgcggatgga tgaaatgatt 120
> > > gaggagggcg aagaggagcc acaagacatt gttgacgagg cattttactt ttttaagccc 180
> > > cacatgtttt ttcgtaattt tcccattaag ggtgctggtg atcgtgtcat tctgtacttg 240
> > > acgatgtacc ttcatgagtg tttgaagaaa attgtccagt tgaagcgtga agaggcccat 300
> > > tctgtgcttc ttaactacgc tacgatgccg tttgcatcac caggggaaaa ggactttccg 360
> > > tttaacgcgt ttttccctgc tgggaatgag gaggaacaag aaaaatggcg agagtatgca 420
> > > aaacagcttc gattggaggc caacgcacgt ctcattgaga aggtttttct ttttccagag 480
> > > aaggacggca ccggaaacaa gttctggatg gcgtttgcga agaggccttt cttggcttct 540
> > > agttag 546
> > > //
> > > ID DQ472185 standard; DNA; INV; 543 BP.
> > > XX
> > > AC DQ472185;
> > > XX
> > > SV DQ472185.1
> > > DT 15-MAY-2006
> > > XX
> > > DE Trypanosoma cruzi strain CL Brener actin-related protein 4 (ARC20) gene,
> > > DE complete cds.
> > > XX
> > > KW .
> > > XX
> > > OS Trypanosoma cruzi strain CL Brener
> > > OC Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma;
> > > OC Schizotrypanum.
> > > XX
> > > RN [1]
> > > RP 1-543
> > > RA De Melo L.D.B.;
> > > RT "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins";
> > > RL Unpublished.
> > > XX
> > > RN [2]
> > > RP 1-543
> > > RA De Melo L.D.B.;
> > > RT ;
> > > RL Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases.
> > > RL Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio
> > > RL de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ
> > > RL 21949-900, Brazil
> > > XX
> > > FH Key Location/Qualifiers
> > > FH
> > > FT source 1..543
> > > FT /organism="Trypanosoma cruzi strain CL Brener"
> > > FT /mol_type="genomic DNA"
> > > FT /strain="CL Brener"
> > > FT /db_xref="taxon:353153"
> > > FT gene <1..>543
> > > FT /gene="ARC20"
> > > FT /note="TcARC20"
> > > FT mRNA <1..>543
> > > FT /gene="ARC20"
> > > FT /product="actin-related protein 4"
> > > FT CDS 1..543
> > > FT /gene="ARC20"
> > > FT /note="actin-binding protein; ARPC4 20 kDa; putative
> > > FT member of Arp2/3 complex"
> > > FT /codon_start=1
> > > FT /product="actin-related protein 4"
> > > FT /protein_id="ABF13402.1"
> > > FT /db_xref="GI:93360016"
> > > FT /translation="MATAYLPYYDCIKCTLHAALCIGNYPSCTVERHNKPEVEVADH
> > > FT LENNGEIKVQDFLLNPIRIVRSEQESCLIEPSINSTRISVSFLKSDAIAEIIARKYV
> > > FT GFLAQRAKQFHILRKKPIPGYDISFLISHEEVETMHRNRIIQFIITFLMDIDADIAA
> > > FT MKLNVNQRARRAAMEFFLALNFT"
> > > atggcaaccg cctatttgcc ttactacgac tgcatcaagt gcacgttgca cgcggctttg 60
> > > tgcatcggga attatccttc atgtaccgtg gagcgtcata ataaaccaga agttgaggtt 120
> > > gcagaccatc tggagaataa tggtgaaata aaagtacaag atttccttct taaccccata 180
> > > cgcattgtgc gttcagaaca ggaaagttgt cttattgaac ctagtataaa cagcacacgc 240
> > > atatctgtat cgtttctcaa gagcgacgct attgcagaga ttattgcccg aaagtacgtt 300
> > > ggatttttag ctcagcgagc caaacagttt cacatcttga gaaaaaagcc tattccggga 360
> > > tatgatataa gttttttgat ttctcacgag gaagtagaaa caatgcatag gaataggatt 420
> > > attcaattta taattacttt cttgatggat attgatgctg acattgctgc aatgaagttg 480
> > > aatgtgaatc aacgtgcacg tcgagcagcg atggaattct ttcttgcatt gaatttcaca 540
> > > tga 543
> > > //
> > > +++++++++++++++++++++++++++++++++
> > >
> > > It looks to me like there's some kind of problem with parsing the
> > > sequence version number. I even tried the sequence from the test
> > > directory (AY069118.em) with the same outcome.
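
The "NumberFormatException: null" in the trace above is consistent with Integer.parseInt being handed a null string, for instance a version field that was never filled in. Here is a minimal sketch, assuming the version is taken from the pre-87 SV line (e.g. "SV DQ472184.1"), that returns an empty result rather than passing null on. EmblVersion and its regex are hypothetical names for illustration, not part of BioJava.

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: pulls the sequence version out of a pre-87 EMBL SV line. */
final class EmblVersion {
    // Matches "SV   <accession>.<version>", e.g. "SV   DQ472184.1"
    private static final Pattern SV_LINE = Pattern.compile("^SV\\s+(\\S+)\\.(\\d+)\\s*$");

    static OptionalInt fromSvLine(String line) {
        if (line == null) {
            return OptionalInt.empty(); // nothing to parse, so never call parseInt
        }
        Matcher m = SV_LINE.matcher(line);
        if (!m.matches()) {
            return OptionalInt.empty(); // no ".version" suffix on this line
        }
        return OptionalInt.of(Integer.parseInt(m.group(2)));
    }
}
```

A caller can then decide what a missing version means (default it, or reject the record) instead of failing deep inside parseInt.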
> > >
> > > Regards,
> > >
> > > Seth
> > >
> > > On 6/2/06, Richard Holland <richard.holland at ebi.ac.uk> wrote:
> > > > Hi Seth.
> > > >
> > > > Your second point, about the authors string not being read correctly in
> > > > Genbank format, has been fixed (or should have been if I got the code
> > > > right!). Could you check the latest version of biojava-live out of CVS
> > > > and give it another go? Basically the parser did not recognise the
> > > > CONSRTM tag, as it is not mentioned in the sample record provided by
> > > > NCBI, which is what I based the parser on.
> > > >
> > > > I've set it up now so that it reads the CONSRTM tag, but the value is
> > > > merged with the authors tag with (consortium) appended. There will still
> > > > be problems if the consortium value has commas in it - not sure how to
> > > > fix this yet.
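
The merge described above, and why commas in the consortium value cause trouble, can be sketched as follows. This is an illustration under stated assumptions only: AuthorMerge, merge and naiveSplit are hypothetical names, and the comma-plus-whitespace split is a stand-in, not the logic in DocRefAuthor.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration of folding a CONSRTM value into the AUTHORS string. */
final class AuthorMerge {
    /** Appends the consortium, tagged "(consortium)", to the authors string if present. */
    static String merge(String authors, String consortium) {
        List<String> parts = new ArrayList<>();
        if (authors != null && !authors.isEmpty()) {
            parts.add(authors);
        }
        if (consortium != null && !consortium.isEmpty()) {
            parts.add(consortium + " (consortium)");
        }
        return String.join(", ", parts);
    }

    /** Stand-in splitter: breaks the merged string wherever a comma is followed by whitespace. */
    static List<String> naiveSplit(String merged) {
        return Arrays.asList(merged.split(",\\s+"));
    }
}
```

With no AUTHORS line at all, as in REFERENCE 2 of the quoted record, merge(null, "NIH MGC Project") still yields a non-null string. A made-up consortium such as "Genome Centre, BC Cancer Agency", however, is cut in two by naiveSplit, which is the comma problem mentioned above.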
> > > >
> > > > Your first point is harder to solve because you did not provide a
> > > > complete stack trace for the exceptions you are getting. The complete
> > > > stack trace would enable me to identify exactly where things are going
> > > > wrong and give me a better chance of fixing them. Could you send the
> > > > stack trace, and I'll see what I can do.
> > > >
> > > > cheers,
> > > > Richard
> > > >
> > > >
> > > > On Thu, 2006-06-01 at 18:03 -0400, Seth Johnson wrote:
> > > > > Hi All,
> > > > >
> > > > > I'm a newbie to the whole BioJava(X) API and was hoping to get some
> > > > > clarification on several issues that I'm having.
> > > > > I am developing a parser that would take as input "NCBI Incremental
> > > > > ASN.1 Sequence Updates to Genbank" files (
> > > > > ftp://ftp.ncbi.nih.gov/ncbi-asn1/daily-nc ) , gunzip them, and use the
> > > > > ASN2GB converter (
> > > > > ftp://ftp.ncbi.nih.gov/asn1-converters/by_program/asn2gb ) to convert
> > > > > resulting sequences to a format parsable by BioJava(X) (
> > > > > http://www.penguin-soft.com/penguin/man/1/asn2gb.html ). This is where
> > > > > my problems start.
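
The gunzip step of the pipeline described above needs nothing beyond the JDK. A minimal sketch, assuming the daily update file has already been downloaded locally; the class and method names are made up for illustration, and the ASN2GB conversion itself is left out because its command-line options are not given in this thread.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;

/** Hypothetical helper: decompresses a gzip stream or file before conversion. */
final class Gunzip {
    /** Copies the decompressed contents of gz to out; closes gz, leaves out open. */
    static void gunzip(InputStream gz, OutputStream out) throws IOException {
        try (GZIPInputStream in = new GZIPInputStream(gz)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
    }

    /** File-to-file convenience wrapper around the stream version. */
    static void gunzip(Path src, Path dst) throws IOException {
        try (InputStream in = Files.newInputStream(src);
             OutputStream out = Files.newOutputStream(dst)) {
            gunzip(in, out);
        }
    }
}
```

The decompressed binary ASN.1 file would then be handed to ASN2GB to produce the GenBank or EMBL flat file that the BioJavaX readers consume.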
> > > > >
> > > > > ISSUE 1:
> > > > > I've tried to parse all of the formats that ASN2GB outputs ( GenBank
> > > > > (default) , EMBL, nucleotide GBSet (XML), nucleotide INSDSet (XML),
> > > > > tiny seq (XML) ) using either BioJava or BioJavaX API. Only GenBank
> > > > > format is recognized by the
> > > > > "RichSequence.IOTools.readGenbankDNA(inBuf,gbNspace)" function with
> > > > > some exceptions that I'll describe in issue #2. This is the code that
> > > > > I'm using to parse, for example, the EMBL output:
> > > > >
> > > > > BufferedReader inBuf = new BufferedReader(new FileReader("embl_output.emb"));
> > > > > Namespace gbNspace = (Namespace) RichObjectFactory.getObject(
> > > > >         SimpleNamespace.class, new Object[]{"gbSpace"});
> > > > > RichSequenceIterator gbSeqs = RichSequence.IOTools.readEMBLDNA(inBuf, gbNspace);
> > > > > while (gbSeqs.hasNext()) {
> > > > >     try {
> > > > >         RichSequence rs = gbSeqs.nextRichSequence();
> > > > >         // Further processing of the RichSequence object from here
> > > > >     } catch (BioException be) {
> > > > >         be.printStackTrace();
> > > > >     }
> > > > > }
> > > > >
> > > > > The multi-sequence EMBL file looks like this:
> > > > > ---------------------------------------------------------------------------------
> > > > > ID DQ472184 standard; DNA; INV; 546 BP.
> > > > > XX
> > > > > AC DQ472184;
> > > > > XX
> > > > > SV DQ472184.1
> > > > > DT 15-MAY-2006
> > > > > XX
> > > > > DE Trypanosoma cruzi strain CL Brener actin-related protein 3 (ARC21) gene,
> > > > > DE complete cds.
> > > > > XX
> > > > > KW .
> > > > > XX
> > > > > OS Trypanosoma cruzi strain CL Brener
> > > > > OC Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma;
> > > > > OC Schizotrypanum.
> > > > > XX
> > > > > RN [1]
> > > > > RP 1-546
> > > > > RA De Melo L.D.B.;
> > > > > RT "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins";
> > > > > RL Unpublished.
> > > > > XX
> > > > > RN [2]
> > > > > RP 1-546
> > > > > RA De Melo L.D.B.;
> > > > > RT ;
> > > > > RL Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases.
> > > > > RL Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio
> > > > > RL de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ
> > > > > RL 21949-900, Brazil
> > > > > XX
> > > > > FH Key Location/Qualifiers
> > > > > FH
> > > > > FT source 1..546
> > > > > FT /organism="Trypanosoma cruzi strain CL Brener"
> > > > > FT /mol_type="genomic DNA"
> > > > > FT /strain="CL Brener"
> > > > > FT /db_xref="taxon:353153"
> > > > > FT gene <1..>546
> > > > > FT /gene="ARC21"
> > > > > FT /note="TcARC21"
> > > > > FT mRNA <1..>546
> > > > > FT /gene="ARC21"
> > > > > FT /product="actin-related protein 3"
> > > > > FT CDS 1..546
> > > > > FT /gene="ARC21"
> > > > > FT /note="actin-binding protein; ARPC3 21 kDa; putative
> > > > > FT member of Arp2/3 complex"
> > > > > FT /codon_start=1
> > > > > FT /product="actin-related protein 3"
> > > > > FT /protein_id="ABF13401.1"
> > > > > FT /db_xref="GI:93360014"
> > > > > FT /translation="MHSRWNGYEESSLLGCGVYPLRRTSRLTPPGPAPRMDEMIEEG
> > > > > FT EEEPQDIVDEAFYFFKPHMFFRNFPIKGAGDRVILYLTMYLHECLKKIVQLKREEAH
> > > > > FT SVLLNYATMPFASPGEKDFPFNAFFPAGNEEEQEKWREYAKQLRLEANARLIEKVFL
> > > > > FT FPEKDGTGNKFWMAFAKRPFLASS"
> > > > > atgcacagca ggtggaatgg gtatgaagaa agtagtcttt tgggctgcgg tgtttatccg 60
> > > > > cttcgccgca cgtcacggct cactccaccc ggccctgcac cgcggatgga tgaaatgatt 120
> > > > > gaggagggcg aagaggagcc acaagacatt gttgacgagg cattttactt ttttaagccc 180
> > > > > cacatgtttt ttcgtaattt tcccattaag ggtgctggtg atcgtgtcat tctgtacttg 240
> > > > > acgatgtacc ttcatgagtg tttgaagaaa attgtccagt tgaagcgtga agaggcccat 300
> > > > > tctgtgcttc ttaactacgc tacgatgccg tttgcatcac caggggaaaa ggactttccg 360
> > > > > tttaacgcgt ttttccctgc tgggaatgag gaggaacaag aaaaatggcg agagtatgca 420
> > > > > aaacagcttc gattggaggc caacgcacgt ctcattgaga aggtttttct ttttccagag 480
> > > > > aaggacggca ccggaaacaa gttctggatg gcgtttgcga agaggccttt cttggcttct 540
> > > > > agttag 546
> > > > > //
> > > > > ID DQ472185 standard; DNA; INV; 543 BP.
> > > > > XX
> > > > > AC DQ472185;
> > > > > XX
> > > > > SV DQ472185.1
> > > > > DT 15-MAY-2006
> > > > > XX
> > > > > DE Trypanosoma cruzi strain CL Brener actin-related protein 4 (ARC20) gene,
> > > > > DE complete cds.
> > > > > XX
> > > > > KW .
> > > > > XX
> > > > > OS Trypanosoma cruzi strain CL Brener
> > > > > OC Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma;
> > > > > OC Schizotrypanum.
> > > > > XX
> > > > > RN [1]
> > > > > RP 1-543
> > > > > RA De Melo L.D.B.;
> > > > > RT "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins";
> > > > > RL Unpublished.
> > > > > XX
> > > > > RN [2]
> > > > > RP 1-543
> > > > > RA De Melo L.D.B.;
> > > > > RT ;
> > > > > RL Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases.
> > > > > RL Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio
> > > > > RL de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ
> > > > > RL 21949-900, Brazil
> > > > > XX
> > > > > FH Key Location/Qualifiers
> > > > > FH
> > > > > FT source 1..543
> > > > > FT /organism="Trypanosoma cruzi strain CL Brener"
> > > > > FT /mol_type="genomic DNA"
> > > > > FT /strain="CL Brener"
> > > > > FT /db_xref="taxon:353153"
> > > > > FT gene <1..>543
> > > > > FT /gene="ARC20"
> > > > > FT /note="TcARC20"
> > > > > FT mRNA <1..>543
> > > > > FT /gene="ARC20"
> > > > > FT /product="actin-related protein 4"
> > > > > FT CDS 1..543
> > > > > FT /gene="ARC20"
> > > > > FT /note="actin-binding protein; ARPC4 20 kDa; putative
> > > > > FT member of Arp2/3 complex"
> > > > > FT /codon_start=1
> > > > > FT /product="actin-related protein 4"
> > > > > FT /protein_id="ABF13402.1"
> > > > > FT /db_xref="GI:93360016"
> > > > > FT /translation="MATAYLPYYDCIKCTLHAALCIGNYPSCTVERHNKPEVEVADH
> > > > > FT LENNGEIKVQDFLLNPIRIVRSEQESCLIEPSINSTRISVSFLKSDAIAEIIARKYV
> > > > > FT GFLAQRAKQFHILRKKPIPGYDISFLISHEEVETMHRNRIIQFIITFLMDIDADIAA
> > > > > FT MKLNVNQRARRAAMEFFLALNFT"
> > > > > atggcaaccg cctatttgcc ttactacgac tgcatcaagt gcacgttgca cgcggctttg 60
> > > > > tgcatcggga attatccttc atgtaccgtg gagcgtcata ataaaccaga agttgaggtt 120
> > > > > gcagaccatc tggagaataa tggtgaaata aaagtacaag atttccttct taaccccata 180
> > > > > cgcattgtgc gttcagaaca ggaaagttgt cttattgaac ctagtataaa cagcacacgc 240
> > > > > atatctgtat cgtttctcaa gagcgacgct attgcagaga ttattgcccg aaagtacgtt 300
> > > > > ggatttttag ctcagcgagc caaacagttt cacatcttga gaaaaaagcc tattccggga 360
> > > > > tatgatataa gttttttgat ttctcacgag gaagtagaaa caatgcatag gaataggatt 420
> > > > > attcaattta taattacttt cttgatggat attgatgctg acattgctgc aatgaagttg 480
> > > > > aatgtgaatc aacgtgcacg tcgagcagcg atggaattct ttcttgcatt gaatttcaca 540
> > > > > tga 543
> > > > > //
> > > > > -----------------------------------------------------------------------
> > > > > I get an exception message "Could Not Read Sequence". Same thing
> > > > > happens if I use the readINSDSetDNA reader instead of readEMBLDNA one
> > > > > with the following INSDset file (beginning of the file):
> > > > >
> > > > > <?xml version="1.0"?>
> > > > > <!DOCTYPE INSDSeq PUBLIC "-//NCBI//INSD INSDSeq/EN" "INSD_INSDSeq.dtd">
> > > > > <INSDSeq>
> > > > > <INSDSeq_locus>DQ022078</INSDSeq_locus>
> > > > > <INSDSeq_length>16729</INSDSeq_length>
> > > > > <INSDSeq_moltype>DNA</INSDSeq_moltype>
> > > > > <INSDSeq_topology>linear</INSDSeq_topology>
> > > > > <INSDSeq_division>ENV</INSDSeq_division>
> > > > > <INSDSeq_update-date>15-MAY-2006</INSDSeq_update-date>
> > > > > <INSDSeq_create-date>15-MAY-2006</INSDSeq_create-date>
> > > > > <INSDSeq_definition>Uncultured bacterium WWRS-2005 putative
> > > > > aminoglycoside phosphotransferase (a3.001), putative oxidoreductase
> > > > > (a3.002), putative oxidoreductase (a3.003), putative beta-lactamase
> > > > > class C (estA3), putative permease (a3.005), putative transmembrane
> > > > > signal peptide (a3.006), thiol-disulfide isomerase (a3.007), histone
> > > > > acetyltransferase HPA2 (a3.008), putative enzyme (a3.009), putative
> > > > > asparaginase (a3.010), hypothetical protein (a3.011), hypothetical
> > > > > protein (a3.012), putative membrane protease subunit (a3.013),
> > > > > putative haloalkane dehalogenase (a3.014), putative transcriptional
> > > > > regulator (a3.015), putative peptidyl-dipeptidase Dcp (a3.016), and
> > > > > hypothetical protein (a3.017) genes, complete cds</INSDSeq_definition>
> > > > > <INSDSeq_primary-accession>DQ022078</INSDSeq_primary-accession>
> > > > > <INSDSeq_other-seqids>
> > > > > <INSDSeqid>gb|DQ022078.1|</INSDSeqid>
> > > > > <INSDSeqid>gi|71842722</INSDSeqid>
> > > > > </INSDSeq_other-seqids>
> > > > > <INSDSeq_keywords>
> > > > > <INSDKeyword>ENV</INSDKeyword>
> > > > > </INSDSeq_keywords>
> > > > > <INSDSeq_references>
> > > > > <INSDReference>
> > > > > <INSDReference_reference>?</INSDReference_reference>
> > > > > <INSDReference_position>1..16729</INSDReference_position>
> > > > > <INSDReference_authors>
> > > > > <INSDAuthor>Schmeisser,C.</INSDAuthor>
> > > > > <INSDAuthor>Elend,C.</INSDAuthor>
> > > > > <INSDAuthor>Streit,W.R.</INSDAuthor>
> > > > > </INSDReference_authors>
> > > > > <INSDReference_title>Isolation and biochemical characterization
> > > > > of two novel metagenome derived esterases</INSDReference_title>
> > > > > <INSDReference_journal>Appl. Environ. Microbiol. 0:0-0
> > > > > (2006)</INSDReference_journal>
> > > > > </INSDReference>
> > > > > <INSDReference>
> > > > > <INSDReference_reference>?</INSDReference_reference>
> > > > > <INSDReference_position>1..16729</INSDReference_position>
> > > > > <INSDReference_authors>
> > > > > <INSDAuthor>Schmeisser,C.</INSDAuthor>
> > > > > <INSDAuthor>Elend,C.</INSDAuthor>
> > > > > <INSDAuthor>Streit,W.R.</INSDAuthor>
> > > > > </INSDReference_authors>
> > > > > <INSDReference_journal>Submitted (29-APR-2005) to the
> > > > > EMBL/GenBank/DDBJ databases. Molekulare Enzymtechnologie, University
> > > > > Duisburg-Essen, Lotharstrasse 1, Duisburg D-47057,
> > > > > Germany</INSDReference_journal>
> > > > > </INSDReference>
> > > > > </INSDSeq_references>
> > > > >
> > > > > So my question is whether ASN2GB produces output that's
> > > > > incompatible with the BioJava parsers, whether there is a problem
> > > > > with the sequences themselves, or whether the problem lies with the
> > > > > majority of the parsers. Could it be that I'm using the API wrongly
> > > > > for the above formats? The GenBank parser works as advertised, with
> > > > > the exceptions described below:
> > > > >
> > > > > ISSUE #2:
> > > > > When I try to parse GenBank files using the following code:
> > > > >
> > > > > BufferedReader inBuf = new BufferedReader(new FileReader("genbank_output.gb"));
> > > > > Namespace gbNspace = (Namespace) RichObjectFactory.getObject(
> > > > >         SimpleNamespace.class, new Object[]{"gbSpace"});
> > > > > RichSequenceIterator gbSeqs = RichSequence.IOTools.readGenbankDNA(inBuf, gbNspace);
> > > > > while (gbSeqs.hasNext()) {
> > > > >     try {
> > > > >         RichSequence rs = gbSeqs.nextRichSequence();
> > > > >         // Further processing of the RichSequence object from here
> > > > >     } catch (BioException be) {
> > > > >         be.printStackTrace();
> > > > >     }
> > > > > }
> > > > >
> > > > > Genbank file in question:
> > > > >
> > > > > LOCUS BC074905 838 bp mRNA linear PRI 15-APR-2006
> > > > > DEFINITION Homo sapiens kallikrein 14, mRNA (cDNA clone MGC:104038
> > > > > IMAGE:30915482), complete cds.
> > > > > ACCESSION BC074905
> > > > > VERSION BC074905.2 GI:50959825
> > > > > KEYWORDS MGC.
> > > > > SOURCE Homo sapiens (human)
> > > > > ORGANISM Homo sapiens
> > > > > Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
> > > > > Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
> > > > > Catarrhini; Hominidae; Homo.
> > > > > REFERENCE 1 (bases 1 to 838)
> > > > > AUTHORS Strausberg,R.L., Feingold,E.A., Grouse,L.H., Derge,J.G.,
> > > > > Klausner,R.D., Collins,F.S., Wagner,L., Shenmen,C.M., Schuler,G.D.,
> > > > > Altschul,S.F., Zeeberg,B., Buetow,K.H., Schaefer,C.F., Bhat,N.K.,
> > > > > Hopkins,R.F., Jordan,H., Moore,T., Max,S.I., Wang,J., Hsieh,F.,
> > > > > Diatchenko,L., Marusina,K., Farmer,A.A., Rubin,G.M., Hong,L.,
> > > > > Stapleton,M., Soares,M.B., Bonaldo,M.F., Casavant,T.L.,
> > > > > Scheetz,T.E., Brownstein,M.J., Usdin,T.B., Toshiyuki,S.,
> > > > > Carninci,P., Prange,C., Raha,S.S., Loquellano,N.A., Peters,G.J.,
> > > > > Abramson,R.D., Mullahy,S.J., Bosak,S.A., McEwan,P.J.,
> > > > > McKernan,K.J., Malek,J.A., Gunaratne,P.H., Richards,S.,
> > > > > Worley,K.C., Hale,S., Garcia,A.M., Gay,L.J., Hulyk,S.W.,
> > > > > Villalon,D.K., Muzny,D.M., Sodergren,E.J., Lu,X., Gibbs,R.A.,
> > > > > Fahey,J., Helton,E., Ketteman,M., Madan,A., Rodrigues,S.,
> > > > > Sanchez,A., Whiting,M., Madan,A., Young,A.C., Shevchenko,Y.,
> > > > > Bouffard,G.G., Blakesley,R.W., Touchman,J.W., Green,E.D.,
> > > > > Dickson,M.C., Rodriguez,A.C., Grimwood,J., Schmutz,J., Myers,R.M.,
> > > > > Butterfield,Y.S., Krzywinski,M.I., Skalska,U., Smailus,D.E.,
> > > > > Schnerch,A., Schein,J.E., Jones,S.J. and Marra,M.A.
> > > > > CONSRTM Mammalian Gene Collection Program Team
> > > > > TITLE Generation and initial analysis of more than 15,000 full-length
> > > > > human and mouse cDNA sequences
> > > > > JOURNAL Proc. Natl. Acad. Sci. U.S.A. 99 (26), 16899-16903 (2002)
> > > > > PUBMED 12477932
> > > > > REFERENCE 2 (bases 1 to 838)
> > > > > CONSRTM NIH MGC Project
> > > > > TITLE Direct Submission
> > > > > JOURNAL Submitted (25-JUN-2004) National Institutes of Health, Mammalian
> > > > > Gene Collection (MGC), Bethesda, MD 20892-2590, USA
> > > > > REMARK NIH-MGC Project URL: http://mgc.nci.nih.gov
> > > > > COMMENT On Aug 4, 2004 this sequence version replaced gi:49901832.
> > > > > Contact: MGC help desk
> > > > > Email: cgapbs-r at mail.nih.gov
> > > > > Tissue Procurement: Genome Sequence Centre, British Columbia Cancer
> > > > > Center
> > > > > cDNA Library Preparation: British Columbia Cancer Research Center
> > > > > cDNA Library Arrayed by: The I.M.A.G.E. Consortium (LLNL)
> > > > > DNA Sequencing by: Genome Sequence Centre,
> > > > > BC Cancer Agency, Vancouver, BC, Canada
> > > > > info at bcgsc.bc.ca
> > > > > Martin Hirst, Thomas Zeng, Ryan Morin, Michelle Moksa, Johnson
> > > > > Pang, Diana Mah, Jing Wang, Kieth Fichter, Eric Chuah, Allen
> > > > > Delaney, Rob Kirkpatrick, Agnes Baross, Sarah Barber, Mabel
> > > > > Brown-John, Steve S. Chand, William Chow, Ryan Babakaiff, Dave
> > > > > Wong, Corey Matsuo, Jaclyn Beland, Susan Gibson, Luis delRio, Ruth
> > > > > Featherstone, Malachi Griffith, Obi Griffith, Ran Guin, Nancy Liao,
> > > > > Kim MacDonald, Mike R. Mayo, Josh Moran, Diana Palmquist, JR
> > > > > Santos, Duane Smailus, Jeff Stott, Miranda Tsai, George Yang,
> > > > > Jacquie Schein, Asim Siddiqui,Steven Jones, Rob Holt, Marco Marra.
> > > > >
> > > > > Clone distribution: MGC clone distribution information can be found
> > > > > through the I.M.A.G.E. Consortium/LLNL at: http://image.llnl.gov
> > > > > Series: IRBU Plate: 4 Row: C Column: 3.
> > > > >
> > > > > Differences found between this sequence and the human reference
> > > > > genome (build 36) are described in misc_difference features below.
> > > > > FEATURES Location/Qualifiers
> > > > > source 1..838
> > > > > /organism="Homo sapiens"
> > > > > /mol_type="mRNA"
> > > > > /db_xref="taxon:9606"
> > > > > /clone="MGC:104038 IMAGE:30915482"
> > > > > /tissue_type="Lung, PCR rescued clones"
> > > > > /clone_lib="NIH_MGC_273"
> > > > > /lab_host="DH10B"
> > > > > /note="Vector: pCR4 Topo TA with reversed insert"
> > > > > gene 1..838
> > > > > /gene="KLK14"
> > > > > /note="synonym: KLK-L6"
> > > > > /db_xref="GeneID:43847"
> > > > > /db_xref="HGNC:6362"
> > > > > /db_xref="IMGT/GENE-DB:6362"
> > > > > /db_xref="MIM:606135"
> > > > > CDS 49..804
> > > > > /gene="KLK14"
> > > > > /codon_start=1
> > > > > /product="KLK14 protein"
> > > > > /protein_id="AAH74905.1"
> > > > > /db_xref="GI:50959826"
> > > > > /db_xref="GeneID:43847"
> > > > > /db_xref="HGNC:6362"
> > > > > /db_xref="IMGT/GENE-DB:6362"
> > > > > /db_xref="MIM:606135"
> > > > > /translation="MFLLLTALQVLAIAMTRSQEDENKIIGGYTCTRSSQPWQAALLA
> > > > > GPRRRFLCGGALLSGQWVITAAHCGRPILQVALGKHNLRRWEATQQVLRVVRQVTHPN
> > > > > YNSRTHDNDLMLLQLQQPARIGRAVRPIEVTQACASPGTSCRVSGWGTISSPIARYPA
> > > > > SLQCVNINISPDEVCQKAYPRTITPGMVCAGVPQGGKDSCQGDSGGPLVCRGQLQGLV
> > > > > SWGMERCALPGYPGVYTNLCKYRSWIEETMRDK"
> > > > > misc_difference 98
> > > > > /gene="KLK14"
> > > > > /note="'G' in cDNA is 'A' in the human genome; amino acid
> > > > > difference: 'R' in cDNA, 'Q' in the human genome."
> > > > > misc_difference 133
> > > > > /gene="KLK14"
> > > > > /note="'T' in cDNA is 'C' in the human genome; amino acid
> > > > > difference: 'Y' in cDNA, 'H' in the human genome."
> > > > > ORIGIN
> > > > > 1 atgtccctga gggtcttggg ctctgggacc tggccctcag cccctaaaat gttcctcctg
> > > > > 61 ctgacagcac ttcaagtcct ggctatagcc atgacacgga gccaagagga tgagaacaag
> > > > > 121 ataattggtg gctatacgtg cacccggagc tcccagccgt ggcaggcggc cctgctggcg
> > > > > 181 ggtcccaggc gccgcttcct ctgcggaggc gccctgcttt caggccagtg ggtcatcact
> > > > > 241 gctgctcact gcggccgccc gatccttcag gttgccctgg gcaagcacaa cctgaggagg
> > > > > 301 tgggaggcca cccagcaggt gctgcgcgtg gttcgtcagg tgacgcaccc caactacaac
> > > > > 361 tcccggaccc acgacaacga cctcatgctg ctgcagctac agcagcccgc acggatcggg
> > > > > 421 agggcagtca ggcccattga ggtcacccag gcctgtgcca gccccgggac ctcctgccga
> > > > > 481 gtgtcaggct ggggaactat atccagcccc atcgccaggt accccgcctc tctgcaatgc
> > > > > 541 gtgaacatca acatctcccc ggatgaggtg tgccagaagg cctatcctag aaccatcacg
> > > > > 601 cctggcatgg tctgtgcagg agttccccag ggcgggaagg actcttgtca gggtgactct
> > > > > 661 gggggacccc tggtgtgcag aggacagctc cagggcctcg tgtcttgggg aatggagcgc
> > > > > 721 tgcgccctgc ctggctaccc cggtgtctac accaacctgt gcaagtacag aagctggatt
> > > > > 781 gaggaaacga tgcgggacaa atgatggtct tcacggtggg atggacctcg tcagctgc
> > > > > //
> > > > >
> > > > > I get the following exception:
> > > > >
> > > > > java.lang.IllegalArgumentException: Authors string cannot be null
> > > > > org.biojava.bio.BioException: Could not read sequence
> > > > > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112)
> > > > > at exonhit.parsers.GenBankParser.getSequences(GenBankParser.java:107)
> > > > > at exonhit.parsers.GenBankParser.runGBparser(GenBankParser.java:258)
> > > > > at exonhit.parsers.GenBankParser.main(GenBankParser.java:341)
> > > > > Caused by: java.lang.IllegalArgumentException: Authors string cannot be null
> > > > > at org.biojavax.DocRefAuthor$Tools.parseAuthorString(DocRefAuthor.java:76)
> > > > > at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:356)
> > > > > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109)
> > > > >
> > > > > -----------------------------------------------------------------------
> > > > >
> > > > > I'm trying to see what could be the problem with this particular
> > > > > sequence. Looks to me like the AUTHORS portion is not getting parsed
> > > > > correctly. Any ideas would be greatly appreciated!
> > > > >
> > > > --
> > > > Richard Holland (BioMart Team)
> > > > EMBL-EBI
> > > > Wellcome Trust Genome Campus
> > > > Hinxton
> > > > Cambridge CB10 1SD
> > > > UNITED KINGDOM
> > > > Tel: +44-(0)1223-494416
--
Richard Holland (BioMart Team)
EMBL-EBI
Wellcome Trust Genome Campus
Hinxton
Cambridge CB10 1SD
UNITED KINGDOM
Tel: +44-(0)1223-494416