[Biojava-l] Parsing Genbank/EMBL/XML Sequences from binary NCBI ASN.1 daily update files

Seth Johnson johnson.biotech at gmail.com
Fri Jun 2 17:04:59 UTC 2006


Hi Richard,

I made sure I have the latest source code from CVS compiled
(EMBLFormat.java & GenbankFormat.java are from 05/24/06).  I'm happy
to report that the GenBank issue is solved!
As for EMBL parsing, I apologize for not providing the stack dump
for ISSUE #1.  Here's the dump of the exception:
--------------------------------------------------------
org.biojava.bio.BioException: Could not read sequence
        at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112)
        at exonhit.parsers.GenBankParser.main(GenBankParser.java:359)
Caused by: java.lang.NumberFormatException: null
        at java.lang.Integer.parseInt(Integer.java:415)
        at java.lang.Integer.parseInt(Integer.java:497)
        at org.biojavax.bio.seq.io.EMBLFormat.readRichSequence(EMBLFormat.java:299)
        at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109)
        ... 1 more
Java Result: -1
-------------------------------------------------------
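
The "Could not read sequence" BioException in the dump above is only a wrapper; the informative part is the nested cause. Here is a minimal sketch, in plain Java, of walking the cause chain to the innermost exception. The class name RootCause and the simulated exceptions are made up for illustration and simply stand in for the BioException and NumberFormatException shown above:

```java
public class RootCause {
    /** Follows getCause() links until the innermost throwable is reached. */
    public static Throwable rootCause(Throwable t) {
        Throwable current = t;
        while (current.getCause() != null && current.getCause() != current) {
            current = current.getCause();
        }
        return current;
    }

    public static void main(String[] args) {
        // Simulated wrapper, standing in for BioException("Could not read sequence").
        Exception wrapped = new Exception("Could not read sequence",
                new NumberFormatException("null"));
        Throwable root = rootCause(wrapped);
        // Print the class and message of the innermost exception.
        System.out.println(root.getClass().getName() + ": " + root.getMessage());
    }
}
```

In the parsing loop, the same helper could be called on the caught BioException to log just the underlying problem.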
Here, again, is the code that I'm using to parse:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
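        // Imports needed by this snippet:
        //   java.io.BufferedReader, java.io.FileReader, java.io.FileNotFoundException
        //   org.biojava.bio.BioException
        //   org.biojavax.Namespace, org.biojavax.RichObjectFactory, org.biojavax.SimpleNamespace
        //   org.biojavax.bio.seq.RichSequence, org.biojavax.bio.seq.RichSequenceIterator
        //   org.biojavax.bio.taxa.NCBITaxon

        // Open the EMBL file shown below.
        BufferedReader gbBR = null;
        try {
            gbBR = new BufferedReader(
                    new FileReader("C:\\Download\\ASN2BSML\\seth_06_02.emb"));
        } catch (FileNotFoundException fnfe) {
            fnfe.printStackTrace();
            System.exit(-1);
        }
        // Namespace that the parsed sequences are assigned to.
        Namespace gbNspace = (Namespace) RichObjectFactory.getObject(
                SimpleNamespace.class, new Object[]{"gbSpace"});
        RichSequenceIterator gbSeqs =
                RichSequence.IOTools.readEMBLDNA(gbBR, gbNspace);
        while (gbSeqs.hasNext()) {
            try {
                // The exception above is thrown from this call.
                RichSequence rs = gbSeqs.nextRichSequence();
                NCBITaxon myTaxon = rs.getTaxon();
            } catch (BioException be) {
                be.printStackTrace();
                System.exit(-1);
            }
        }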
~~~~~~~~~~~~~~~~~~~~~~~~~
And here's the EMBL file that I'm trying to parse:
+++++++++++++++++++++++++
ID   DQ472184  standard; DNA; INV; 546 BP.
XX
AC   DQ472184;
XX
SV   DQ472184.1
DT   15-MAY-2006
XX
DE   Trypanosoma cruzi strain CL Brener actin-related protein 3 (ARC21) gene,
DE   complete cds.
XX
KW   .
XX
OS   Trypanosoma cruzi strain CL Brener
OC   Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma;
OC   Schizotrypanum.
XX
RN   [1]
RP   1-546
RA   De Melo L.D.B.;
RT   "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins";
RL   Unpublished.
XX
RN   [2]
RP   1-546
RA   De Melo L.D.B.;
RT   ;
RL   Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases.
RL   Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio
RL   de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ
RL   21949-900, Brazil
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..546
FT                   /organism="Trypanosoma cruzi strain CL Brener"
FT                   /mol_type="genomic DNA"
FT                   /strain="CL Brener"
FT                   /db_xref="taxon:353153"
FT   gene            <1..>546
FT                   /gene="ARC21"
FT                   /note="TcARC21"
FT   mRNA            <1..>546
FT                   /gene="ARC21"
FT                   /product="actin-related protein 3"
FT   CDS             1..546
FT                   /gene="ARC21"
FT                   /note="actin-binding protein; ARPC3 21 kDa; putative
FT                   member of Arp2/3 complex"
FT                   /codon_start=1
FT                   /product="actin-related protein 3"
FT                   /protein_id="ABF13401.1"
FT                   /db_xref="GI:93360014"
FT                   /translation="MHSRWNGYEESSLLGCGVYPLRRTSRLTPPGPAPRMDEMIEEG
FT                   EEEPQDIVDEAFYFFKPHMFFRNFPIKGAGDRVILYLTMYLHECLKKIVQLKREEAH
FT                   SVLLNYATMPFASPGEKDFPFNAFFPAGNEEEQEKWREYAKQLRLEANARLIEKVFL
FT                   FPEKDGTGNKFWMAFAKRPFLASS"
     atgcacagca ggtggaatgg gtatgaagaa agtagtcttt tgggctgcgg tgtttatccg        60
     cttcgccgca cgtcacggct cactccaccc ggccctgcac cgcggatgga tgaaatgatt       120
     gaggagggcg aagaggagcc acaagacatt gttgacgagg cattttactt ttttaagccc       180
     cacatgtttt ttcgtaattt tcccattaag ggtgctggtg atcgtgtcat tctgtacttg       240
     acgatgtacc ttcatgagtg tttgaagaaa attgtccagt tgaagcgtga agaggcccat       300
     tctgtgcttc ttaactacgc tacgatgccg tttgcatcac caggggaaaa ggactttccg       360
     tttaacgcgt ttttccctgc tgggaatgag gaggaacaag aaaaatggcg agagtatgca       420
     aaacagcttc gattggaggc caacgcacgt ctcattgaga aggtttttct ttttccagag       480
     aaggacggca ccggaaacaa gttctggatg gcgtttgcga agaggccttt cttggcttct       540
     agttag                                                                  546
//
ID   DQ472185  standard; DNA; INV; 543 BP.
XX
AC   DQ472185;
XX
SV   DQ472185.1
DT   15-MAY-2006
XX
DE   Trypanosoma cruzi strain CL Brener actin-related protein 4 (ARC20) gene,
DE   complete cds.
XX
KW   .
XX
OS   Trypanosoma cruzi strain CL Brener
OC   Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma;
OC   Schizotrypanum.
XX
RN   [1]
RP   1-543
RA   De Melo L.D.B.;
RT   "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins";
RL   Unpublished.
XX
RN   [2]
RP   1-543
RA   De Melo L.D.B.;
RT   ;
RL   Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases.
RL   Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio
RL   de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ
RL   21949-900, Brazil
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..543
FT                   /organism="Trypanosoma cruzi strain CL Brener"
FT                   /mol_type="genomic DNA"
FT                   /strain="CL Brener"
FT                   /db_xref="taxon:353153"
FT   gene            <1..>543
FT                   /gene="ARC20"
FT                   /note="TcARC20"
FT   mRNA            <1..>543
FT                   /gene="ARC20"
FT                   /product="actin-related protein 4"
FT   CDS             1..543
FT                   /gene="ARC20"
FT                   /note="actin-binding protein; ARPC4 20 kDa; putative
FT                   member of Arp2/3 complex"
FT                   /codon_start=1
FT                   /product="actin-related protein 4"
FT                   /protein_id="ABF13402.1"
FT                   /db_xref="GI:93360016"
FT                   /translation="MATAYLPYYDCIKCTLHAALCIGNYPSCTVERHNKPEVEVADH
FT                   LENNGEIKVQDFLLNPIRIVRSEQESCLIEPSINSTRISVSFLKSDAIAEIIARKYV
FT                   GFLAQRAKQFHILRKKPIPGYDISFLISHEEVETMHRNRIIQFIITFLMDIDADIAA
FT                   MKLNVNQRARRAAMEFFLALNFT"
     atggcaaccg cctatttgcc ttactacgac tgcatcaagt gcacgttgca cgcggctttg        60
     tgcatcggga attatccttc atgtaccgtg gagcgtcata ataaaccaga agttgaggtt       120
     gcagaccatc tggagaataa tggtgaaata aaagtacaag atttccttct taaccccata       180
     cgcattgtgc gttcagaaca ggaaagttgt cttattgaac ctagtataaa cagcacacgc       240
     atatctgtat cgtttctcaa gagcgacgct attgcagaga ttattgcccg aaagtacgtt       300
     ggatttttag ctcagcgagc caaacagttt cacatcttga gaaaaaagcc tattccggga       360
     tatgatataa gttttttgat ttctcacgag gaagtagaaa caatgcatag gaataggatt       420
     attcaattta taattacttt cttgatggat attgatgctg acattgctgc aatgaagttg       480
     aatgtgaatc aacgtgcacg tcgagcagcg atggaattct ttcttgcatt gaatttcaca       540
     tga                                                                     543
//
+++++++++++++++++++++++++++++++++

It looks to me like there's some kind of problem with parsing the
sequence version number. I even tried the sequence from the test directory
(AY069118.em), with the same outcome.
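
To illustrate the kind of failure I suspect: Integer.parseInt throws a NumberFormatException when it is handed null, which is what would happen if an optional regular-expression group for the version never matched. Below is a minimal sketch of pulling the accession and version out of an "SV" line with an explicit null check. This is not the EMBLFormat code; the class SvLine, the pattern, and the -1 fallback are all assumptions made up for illustration:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SvLine {
    // Hypothetical pattern: "SV", whitespace, accession, optional ".version".
    private static final Pattern SV =
            Pattern.compile("^SV\\s+([A-Za-z0-9_]+)(?:\\.(\\d+))?\\s*$");

    private static Matcher match(String line) {
        Matcher m = SV.matcher(line);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not an SV line: " + line);
        }
        return m;
    }

    /** Returns the accession part of an SV line. */
    public static String accession(String line) {
        return match(line).group(1);
    }

    /** Returns the version from an SV line, or -1 if the line carries none. */
    public static int version(String line) {
        // group(2) is null when the optional ".version" part did not match.
        String v = match(line).group(2);
        return (v == null) ? -1 : Integer.parseInt(v);
    }

    public static void main(String[] args) {
        String line = "SV   DQ472184.1";
        System.out.println(accession(line) + " version " + version(line));
    }
}
```

Without the null check, a line lacking the ".version" suffix would reach Integer.parseInt as null and fail in the same way as the dump above.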

Regards,

Seth

On 6/2/06, Richard Holland <richard.holland at ebi.ac.uk> wrote:
> Hi Seth.
>
> Your second point, about the authors string not being read correctly in
> Genbank format, has been fixed (or should have been if I got the code
> right!). Could you check the latest version of biojava-live out of CVS
> and give it another go? Basically the parser did not recognise the
> CONSRTM tag, as it is not mentioned in the sample record provided by
> NCBI, which is what I based the parser on.
>
> I've set it up now so that it reads the CONSRTM tag, but the value is
> merged with the authors tag with (consortium) appended. There will still
> be problems if the consortium value has commas in it - not sure how to
> fix this yet.
>
> Your first point is harder to solve because you did not provide a
> complete stack trace for the exceptions you are getting. The complete
> stack trace would enable me to identify exactly where things are going
> wrong and give me a better chance of fixing them. Could you send the
> stack trace, and I'll see what I can do.
>
> cheers,
> Richard
>
>
> On Thu, 2006-06-01 at 18:03 -0400, Seth Johnson wrote:
> > Hi All,
> >
> > I'm a newbie to the whole BioJava(X) API and was hoping to get some
> > clarification on several issues that I'm having.
> > I am developing a parser that would take as input "NCBI Incremental
> > ASN.1 Sequence Updates to Genbank" files (
> > ftp://ftp.ncbi.nih.gov/ncbi-asn1/daily-nc ) , gunzip them, and use the
> > ASN2GB converter (
> > ftp://ftp.ncbi.nih.gov/asn1-converters/by_program/asn2gb ) to convert
> > resulting sequences to a format parsable by BioJava(X) (
> > http://www.penguin-soft.com/penguin/man/1/asn2gb.html ). This is where
> > my problems start.
> >
> > ISSUE #1:
> > I've tried to parse all of the formats that ASN2GB outputs ( GenBank
> > (default) , EMBL, nucleotide GBSet (XML), nucleotide INSDSet (XML),
> > tiny seq (XML) ) using either BioJava or BioJavaX API.  Only GenBank
> > format is recognized by the
> > "RichSequence.IOTools.readGenbankDNA(inBuf,gbNspace)" function with
> > some exceptions that I'll describe in issue #2.  This is the code that
> > I'm using to parse, for example, the EMBL output:
> >
> > BufferedReader inBuf = new BufferedReader(new FileReader("embl_output.emb"));
> > Namespace gbNspace = (Namespace)
> > RichObjectFactory.getObject(SimpleNamespace.class, new
> > Object[]{"gbSpace"} );
> > RichSequenceIterator gbSeqs = RichSequence.IOTools.readEMBLDNA(inBuf,gbNspace);
> > while (gbSeqs.hasNext()) {
> >     try {
> >         RichSequence rs = gbSeqs.nextRichSequence();
> >         // Further processing of RichSequence object from here
> >
> >     } catch (BioException be) {
> >         be.printStackTrace();
> >     }
> > }
> >
> > The multi-sequence EMBL file looks like this:
> > ---------------------------------------------------------------------------------
> > ID   DQ472184  standard; DNA; INV; 546 BP.
> > XX
> > AC   DQ472184;
> > XX
> > SV   DQ472184.1
> > DT   15-MAY-2006
> > XX
> > DE   Trypanosoma cruzi strain CL Brener actin-related protein 3 (ARC21) gene,
> > DE   complete cds.
> > XX
> > KW   .
> > XX
> > OS   Trypanosoma cruzi strain CL Brener
> > OC   Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma;
> > OC   Schizotrypanum.
> > XX
> > RN   [1]
> > RP   1-546
> > RA   De Melo L.D.B.;
> > RT   "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins";
> > RL   Unpublished.
> > XX
> > RN   [2]
> > RP   1-546
> > RA   De Melo L.D.B.;
> > RT   ;
> > RL   Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases.
> > RL   Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio
> > RL   de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ
> > RL   21949-900, Brazil
> > XX
> > FH   Key             Location/Qualifiers
> > FH
> > FT   source          1..546
> > FT                   /organism="Trypanosoma cruzi strain CL Brener"
> > FT                   /mol_type="genomic DNA"
> > FT                   /strain="CL Brener"
> > FT                   /db_xref="taxon:353153"
> > FT   gene            <1..>546
> > FT                   /gene="ARC21"
> > FT                   /note="TcARC21"
> > FT   mRNA            <1..>546
> > FT                   /gene="ARC21"
> > FT                   /product="actin-related protein 3"
> > FT   CDS             1..546
> > FT                   /gene="ARC21"
> > FT                   /note="actin-binding protein; ARPC3 21 kDa; putative
> > FT                   member of Arp2/3 complex"
> > FT                   /codon_start=1
> > FT                   /product="actin-related protein 3"
> > FT                   /protein_id="ABF13401.1"
> > FT                   /db_xref="GI:93360014"
> > FT                   /translation="MHSRWNGYEESSLLGCGVYPLRRTSRLTPPGPAPRMDEMIEEG
> > FT                   EEEPQDIVDEAFYFFKPHMFFRNFPIKGAGDRVILYLTMYLHECLKKIVQLKREEAH
> > FT                   SVLLNYATMPFASPGEKDFPFNAFFPAGNEEEQEKWREYAKQLRLEANARLIEKVFL
> > FT                   FPEKDGTGNKFWMAFAKRPFLASS"
> >      atgcacagca ggtggaatgg gtatgaagaa agtagtcttt tgggctgcgg tgtttatccg        60
> >      cttcgccgca cgtcacggct cactccaccc ggccctgcac cgcggatgga tgaaatgatt       120
> >      gaggagggcg aagaggagcc acaagacatt gttgacgagg cattttactt ttttaagccc       180
> >      cacatgtttt ttcgtaattt tcccattaag ggtgctggtg atcgtgtcat tctgtacttg       240
> >      acgatgtacc ttcatgagtg tttgaagaaa attgtccagt tgaagcgtga agaggcccat       300
> >      tctgtgcttc ttaactacgc tacgatgccg tttgcatcac caggggaaaa ggactttccg       360
> >      tttaacgcgt ttttccctgc tgggaatgag gaggaacaag aaaaatggcg agagtatgca       420
> >      aaacagcttc gattggaggc caacgcacgt ctcattgaga aggtttttct ttttccagag       480
> >      aaggacggca ccggaaacaa gttctggatg gcgtttgcga agaggccttt cttggcttct       540
> >      agttag                                                                  546
> > //
> > ID   DQ472185  standard; DNA; INV; 543 BP.
> > XX
> > AC   DQ472185;
> > XX
> > SV   DQ472185.1
> > DT   15-MAY-2006
> > XX
> > DE   Trypanosoma cruzi strain CL Brener actin-related protein 4 (ARC20) gene,
> > DE   complete cds.
> > XX
> > KW   .
> > XX
> > OS   Trypanosoma cruzi strain CL Brener
> > OC   Eukaryota; Euglenozoa; Kinetoplastida; Trypanosomatidae; Trypanosoma;
> > OC   Schizotrypanum.
> > XX
> > RN   [1]
> > RP   1-543
> > RA   De Melo L.D.B.;
> > RT   "Actin of Trypanosoma cruzi: ubiquitous actin-binding proteins";
> > RL   Unpublished.
> > XX
> > RN   [2]
> > RP   1-543
> > RA   De Melo L.D.B.;
> > RT   ;
> > RL   Submitted (03-APR-2006) to the EMBL/GenBank/DDBJ databases.
> > RL   Instituto de Biofisica Carlos Chagas Filho, Universidade Federal do Rio
> > RL   de Janeiro, Cidade Universitaria, CCS, Bl.G, Sl.G157, Rio de Janeiro, RJ
> > RL   21949-900, Brazil
> > XX
> > FH   Key             Location/Qualifiers
> > FH
> > FT   source          1..543
> > FT                   /organism="Trypanosoma cruzi strain CL Brener"
> > FT                   /mol_type="genomic DNA"
> > FT                   /strain="CL Brener"
> > FT                   /db_xref="taxon:353153"
> > FT   gene            <1..>543
> > FT                   /gene="ARC20"
> > FT                   /note="TcARC20"
> > FT   mRNA            <1..>543
> > FT                   /gene="ARC20"
> > FT                   /product="actin-related protein 4"
> > FT   CDS             1..543
> > FT                   /gene="ARC20"
> > FT                   /note="actin-binding protein; ARPC4 20 kDa; putative
> > FT                   member of Arp2/3 complex"
> > FT                   /codon_start=1
> > FT                   /product="actin-related protein 4"
> > FT                   /protein_id="ABF13402.1"
> > FT                   /db_xref="GI:93360016"
> > FT                   /translation="MATAYLPYYDCIKCTLHAALCIGNYPSCTVERHNKPEVEVADH
> > FT                   LENNGEIKVQDFLLNPIRIVRSEQESCLIEPSINSTRISVSFLKSDAIAEIIARKYV
> > FT                   GFLAQRAKQFHILRKKPIPGYDISFLISHEEVETMHRNRIIQFIITFLMDIDADIAA
> > FT                   MKLNVNQRARRAAMEFFLALNFT"
> >      atggcaaccg cctatttgcc ttactacgac tgcatcaagt gcacgttgca cgcggctttg        60
> >      tgcatcggga attatccttc atgtaccgtg gagcgtcata ataaaccaga agttgaggtt       120
> >      gcagaccatc tggagaataa tggtgaaata aaagtacaag atttccttct taaccccata       180
> >      cgcattgtgc gttcagaaca ggaaagttgt cttattgaac ctagtataaa cagcacacgc       240
> >      atatctgtat cgtttctcaa gagcgacgct attgcagaga ttattgcccg aaagtacgtt       300
> >      ggatttttag ctcagcgagc caaacagttt cacatcttga gaaaaaagcc tattccggga       360
> >      tatgatataa gttttttgat ttctcacgag gaagtagaaa caatgcatag gaataggatt       420
> >      attcaattta taattacttt cttgatggat attgatgctg acattgctgc aatgaagttg       480
> >      aatgtgaatc aacgtgcacg tcgagcagcg atggaattct ttcttgcatt gaatttcaca       540
> >      tga                                                                     543
> > //
> > -----------------------------------------------------------------------
> > I get an exception message "Could Not Read Sequence".  The same thing
> > happens if I use the readINSDSetDNA reader instead of the readEMBLDNA one
> > with the following INSDset file (beginning of the file):
> >
> > <?xml version="1.0"?>
> > <!DOCTYPE INSDSeq PUBLIC "-//NCBI//INSD INSDSeq/EN" "INSD_INSDSeq.dtd">
> > <INSDSeq>
> >   <INSDSeq_locus>DQ022078</INSDSeq_locus>
> >   <INSDSeq_length>16729</INSDSeq_length>
> >   <INSDSeq_moltype>DNA</INSDSeq_moltype>
> >   <INSDSeq_topology>linear</INSDSeq_topology>
> >   <INSDSeq_division>ENV</INSDSeq_division>
> >   <INSDSeq_update-date>15-MAY-2006</INSDSeq_update-date>
> >   <INSDSeq_create-date>15-MAY-2006</INSDSeq_create-date>
> >   <INSDSeq_definition>Uncultured bacterium WWRS-2005 putative
> > aminoglycoside phosphotransferase (a3.001), putative oxidoreductase
> > (a3.002), putative oxidoreductase (a3.003), putative beta-lactamase
> > class C (estA3), putative permease (a3.005), putative transmembrane
> > signal peptide (a3.006), thiol-disulfide isomerase (a3.007), histone
> > acetyltransferase HPA2 (a3.008), putative enzyme (a3.009), putative
> > asparaginase (a3.010), hypothetical protein (a3.011), hypothetical
> > protein (a3.012), putative membrane protease subunit (a3.013),
> > putative haloalkane dehalogenase (a3.014), putative transcriptional
> > regulator (a3.015), putative peptidyl-dipeptidase Dcp (a3.016), and
> > hypothetical protein (a3.017) genes, complete cds</INSDSeq_definition>
> >   <INSDSeq_primary-accession>DQ022078</INSDSeq_primary-accession>
> >   <INSDSeq_other-seqids>
> >     <INSDSeqid>gb|DQ022078.1|</INSDSeqid>
> >     <INSDSeqid>gi|71842722</INSDSeqid>
> >   </INSDSeq_other-seqids>
> >   <INSDSeq_keywords>
> >     <INSDKeyword>ENV</INSDKeyword>
> >   </INSDSeq_keywords>
> >   <INSDSeq_references>
> >     <INSDReference>
> >       <INSDReference_reference>?</INSDReference_reference>
> >       <INSDReference_position>1..16729</INSDReference_position>
> >       <INSDReference_authors>
> >         <INSDAuthor>Schmeisser,C.</INSDAuthor>
> >         <INSDAuthor>Elend,C.</INSDAuthor>
> >         <INSDAuthor>Streit,W.R.</INSDAuthor>
> >       </INSDReference_authors>
> >       <INSDReference_title>Isolation and biochemical characterization
> > of two novel metagenome derived esterases</INSDReference_title>
> >       <INSDReference_journal>Appl. Environ. Microbiol. 0:0-0
> > (2006)</INSDReference_journal>
> >     </INSDReference>
> >     <INSDReference>
> >       <INSDReference_reference>?</INSDReference_reference>
> >       <INSDReference_position>1..16729</INSDReference_position>
> >       <INSDReference_authors>
> >         <INSDAuthor>Schmeisser,C.</INSDAuthor>
> >         <INSDAuthor>Elend,C.</INSDAuthor>
> >         <INSDAuthor>Streit,W.R.</INSDAuthor>
> >       </INSDReference_authors>
> >       <INSDReference_journal>Submitted (29-APR-2005) to the
> > EMBL/GenBank/DDBJ databases. Molekulare Enzymtechnologie, University
> > Duisburg-Essen, Lotharstrasse 1, Duisburg D-47057,
> > Germany</INSDReference_journal>
> >     </INSDReference>
> >   </INSDSeq_references>
> >
> > So my question is whether ASN2GB produces output that's
> > incompatible with the BioJava parsers, whether there is a problem with
> > the sequences themselves, or whether the problem lies with most of the
> > parsers. Could it be that I'm using the API incorrectly for the above
> > formats? The GenBank parser works as advertised, with the exception
> > described below:
> >
> > ISSUE #2:
> > When I try to parse GenBank files using the following code:
> >
> > BufferedReader inBuf = new BufferedReader(new FileReader("genbank_output.gb"));
> > Namespace gbNspace = (Namespace)
> > RichObjectFactory.getObject(SimpleNamespace.class, new
> > Object[]{"gbSpace"} );
> > RichSequenceIterator gbSeqs =
> > RichSequence.IOTools.readGenbankDNA(inBuf,gbNspace);
> > while (gbSeqs.hasNext()) {
> >     try {
> >         RichSequence rs = gbSeqs.nextRichSequence();
> >         // Further processing of RichSequence object from here
> >
> >     } catch (BioException be) {
> >         be.printStackTrace();
> >     }
> > }
> >
> > Genbank file in question:
> >
> > LOCUS       BC074905                 838 bp    mRNA    linear   PRI 15-APR-2006
> > DEFINITION  Homo sapiens kallikrein 14, mRNA (cDNA clone MGC:104038
> >             IMAGE:30915482), complete cds.
> > ACCESSION   BC074905
> > VERSION     BC074905.2  GI:50959825
> > KEYWORDS    MGC.
> > SOURCE      Homo sapiens (human)
> >   ORGANISM  Homo sapiens
> >             Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
> >             Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
> >             Catarrhini; Hominidae; Homo.
> > REFERENCE   1  (bases 1 to 838)
> >   AUTHORS   Strausberg,R.L., Feingold,E.A., Grouse,L.H., Derge,J.G.,
> >             Klausner,R.D., Collins,F.S., Wagner,L., Shenmen,C.M., Schuler,G.D.,
> >             Altschul,S.F., Zeeberg,B., Buetow,K.H., Schaefer,C.F., Bhat,N.K.,
> >             Hopkins,R.F., Jordan,H., Moore,T., Max,S.I., Wang,J., Hsieh,F.,
> >             Diatchenko,L., Marusina,K., Farmer,A.A., Rubin,G.M., Hong,L.,
> >             Stapleton,M., Soares,M.B., Bonaldo,M.F., Casavant,T.L.,
> >             Scheetz,T.E., Brownstein,M.J., Usdin,T.B., Toshiyuki,S.,
> >             Carninci,P., Prange,C., Raha,S.S., Loquellano,N.A., Peters,G.J.,
> >             Abramson,R.D., Mullahy,S.J., Bosak,S.A., McEwan,P.J.,
> >             McKernan,K.J., Malek,J.A., Gunaratne,P.H., Richards,S.,
> >             Worley,K.C., Hale,S., Garcia,A.M., Gay,L.J., Hulyk,S.W.,
> >             Villalon,D.K., Muzny,D.M., Sodergren,E.J., Lu,X., Gibbs,R.A.,
> >             Fahey,J., Helton,E., Ketteman,M., Madan,A., Rodrigues,S.,
> >             Sanchez,A., Whiting,M., Madan,A., Young,A.C., Shevchenko,Y.,
> >             Bouffard,G.G., Blakesley,R.W., Touchman,J.W., Green,E.D.,
> >             Dickson,M.C., Rodriguez,A.C., Grimwood,J., Schmutz,J., Myers,R.M.,
> >             Butterfield,Y.S., Krzywinski,M.I., Skalska,U., Smailus,D.E.,
> >             Schnerch,A., Schein,J.E., Jones,S.J. and Marra,M.A.
> >   CONSRTM   Mammalian Gene Collection Program Team
> >   TITLE     Generation and initial analysis of more than 15,000 full-length
> >             human and mouse cDNA sequences
> >   JOURNAL   Proc. Natl. Acad. Sci. U.S.A. 99 (26), 16899-16903 (2002)
> >    PUBMED   12477932
> > REFERENCE   2  (bases 1 to 838)
> >   CONSRTM   NIH MGC Project
> >   TITLE     Direct Submission
> >   JOURNAL   Submitted (25-JUN-2004) National Institutes of Health, Mammalian
> >             Gene Collection (MGC), Bethesda, MD 20892-2590, USA
> >   REMARK    NIH-MGC Project URL: http://mgc.nci.nih.gov
> > COMMENT     On Aug 4, 2004 this sequence version replaced gi:49901832.
> >             Contact: MGC help desk
> >             Email: cgapbs-r at mail.nih.gov
> >             Tissue Procurement: Genome Sequence Centre, British Columbia Cancer
> >             Center
> >             cDNA Library Preparation: British Columbia Cancer Research Center
> >             cDNA Library Arrayed by: The I.M.A.G.E. Consortium (LLNL)
> >             DNA Sequencing by: Genome Sequence Centre,
> >             BC Cancer Agency, Vancouver, BC, Canada
> >             info at bcgsc.bc.ca
> >             Martin Hirst, Thomas Zeng, Ryan Morin, Michelle Moksa, Johnson
> >             Pang, Diana Mah, Jing Wang, Kieth Fichter, Eric Chuah, Allen
> >             Delaney, Rob Kirkpatrick, Agnes Baross, Sarah Barber, Mabel
> >             Brown-John, Steve S. Chand, William Chow, Ryan Babakaiff, Dave
> >             Wong, Corey Matsuo, Jaclyn Beland, Susan Gibson, Luis delRio, Ruth
> >             Featherstone, Malachi Griffith, Obi Griffith, Ran Guin, Nancy Liao,
> >             Kim MacDonald,  Mike R. Mayo, Josh Moran, Diana Palmquist, JR
> >             Santos, Duane Smailus, Jeff Stott, Miranda Tsai, George Yang,
> >             Jacquie Schein, Asim Siddiqui,Steven Jones, Rob Holt, Marco Marra.
> >
> >             Clone distribution: MGC clone distribution information can be found
> >             through the I.M.A.G.E. Consortium/LLNL at: http://image.llnl.gov
> >             Series: IRBU Plate: 4 Row: C Column: 3.
> >
> >             Differences found between this sequence and the human reference
> >             genome (build 36) are described in misc_difference features below.
> > FEATURES             Location/Qualifiers
> >      source          1..838
> >                      /organism="Homo sapiens"
> >                      /mol_type="mRNA"
> >                      /db_xref="taxon:9606"
> >                      /clone="MGC:104038 IMAGE:30915482"
> >                      /tissue_type="Lung, PCR rescued clones"
> >                      /clone_lib="NIH_MGC_273"
> >                      /lab_host="DH10B"
> >                      /note="Vector: pCR4 Topo TA with reversed insert"
> >      gene            1..838
> >                      /gene="KLK14"
> >                      /note="synonym: KLK-L6"
> >                      /db_xref="GeneID:43847"
> >                      /db_xref="HGNC:6362"
> >                      /db_xref="IMGT/GENE-DB:6362"
> >                      /db_xref="MIM:606135"
> >      CDS             49..804
> >                      /gene="KLK14"
> >                      /codon_start=1
> >                      /product="KLK14 protein"
> >                      /protein_id="AAH74905.1"
> >                      /db_xref="GI:50959826"
> >                      /db_xref="GeneID:43847"
> >                      /db_xref="HGNC:6362"
> >                      /db_xref="IMGT/GENE-DB:6362"
> >                      /db_xref="MIM:606135"
> >                      /translation="MFLLLTALQVLAIAMTRSQEDENKIIGGYTCTRSSQPWQAALLA
> >                      GPRRRFLCGGALLSGQWVITAAHCGRPILQVALGKHNLRRWEATQQVLRVVRQVTHPN
> >                      YNSRTHDNDLMLLQLQQPARIGRAVRPIEVTQACASPGTSCRVSGWGTISSPIARYPA
> >                      SLQCVNINISPDEVCQKAYPRTITPGMVCAGVPQGGKDSCQGDSGGPLVCRGQLQGLV
> >                      SWGMERCALPGYPGVYTNLCKYRSWIEETMRDK"
> >      misc_difference 98
> >                      /gene="KLK14"
> >                      /note="'G' in cDNA is 'A' in the human genome; amino acid
> >                      difference: 'R' in cDNA, 'Q' in the human genome."
> >      misc_difference 133
> >                      /gene="KLK14"
> >                      /note="'T' in cDNA is 'C' in the human genome; amino acid
> >                      difference: 'Y' in cDNA, 'H' in the human genome."
> > ORIGIN
> >         1 atgtccctga gggtcttggg ctctgggacc tggccctcag cccctaaaat gttcctcctg
> >        61 ctgacagcac ttcaagtcct ggctatagcc atgacacgga gccaagagga tgagaacaag
> >       121 ataattggtg gctatacgtg cacccggagc tcccagccgt ggcaggcggc cctgctggcg
> >       181 ggtcccaggc gccgcttcct ctgcggaggc gccctgcttt caggccagtg ggtcatcact
> >       241 gctgctcact gcggccgccc gatccttcag gttgccctgg gcaagcacaa cctgaggagg
> >       301 tgggaggcca cccagcaggt gctgcgcgtg gttcgtcagg tgacgcaccc caactacaac
> >       361 tcccggaccc acgacaacga cctcatgctg ctgcagctac agcagcccgc acggatcggg
> >       421 agggcagtca ggcccattga ggtcacccag gcctgtgcca gccccgggac ctcctgccga
> >       481 gtgtcaggct ggggaactat atccagcccc atcgccaggt accccgcctc tctgcaatgc
> >       541 gtgaacatca acatctcccc ggatgaggtg tgccagaagg cctatcctag aaccatcacg
> >       601 cctggcatgg tctgtgcagg agttccccag ggcgggaagg actcttgtca gggtgactct
> >       661 gggggacccc tggtgtgcag aggacagctc cagggcctcg tgtcttgggg aatggagcgc
> >       721 tgcgccctgc ctggctaccc cggtgtctac accaacctgt gcaagtacag aagctggatt
> >       781 gaggaaacga tgcgggacaa atgatggtct tcacggtggg atggacctcg tcagctgc
> > //
> >
> > I get the following exception:
> >
> > java.lang.IllegalArgumentException: Authors string cannot be null
> > org.biojava.bio.BioException: Could not read sequence
> >         at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112)
> >         at exonhit.parsers.GenBankParser.getSequences(GenBankParser.java:107)
> >         at exonhit.parsers.GenBankParser.runGBparser(GenBankParser.java:258)
> >         at exonhit.parsers.GenBankParser.main(GenBankParser.java:341)
> > Caused by: java.lang.IllegalArgumentException: Authors string cannot be null
> >         at org.biojavax.DocRefAuthor$Tools.parseAuthorString(DocRefAuthor.java:76)
> >         at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:356)
> >         at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109)
> >
> > -----------------------------------------------------------------------
> >
> > I'm trying to see what could be the problem with this particular
> > sequence.  Looks to me like the AUTHORS portion is not getting parsed
> > correctly.  Any ideas would be greatly appreciated!
> >
> --
> Richard Holland (BioMart Team)
> EMBL-EBI
> Wellcome Trust Genome Campus
> Hinxton
> Cambridge CB10 1SD
> UNITED KINGDOM
> Tel: +44-(0)1223-494416
>
>


-- 
Best Regards,


Seth Johnson
Senior Bioinformatics Associate

Ph: (202) 470-0900
Fx: (775) 251-0358


