[Biojava-l] Genbank parser error [biojavax]

Morgane THOMAS-CHOLLIER mthomasc at vub.ac.be
Fri Feb 17 05:16:05 EST 2006


Hello Mark,

Thank you very much for your quick reply.

However, I could not find out how to get the organism information via 
the (Rich)Annotation.
Would it be possible for you to post a piece of code showing how I could 
retrieve the common name for the organism?

Sorry for insisting, but I really need this parser for my work, and I 
also really need to retrieve the organism info from the file :)
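
For reference, the lookup being asked about can be sketched without the parser at all. The following is a minimal plain-Java sketch, not BioJava or biojavax API: it assumes the flat file carries SOURCE and ORGANISM header lines laid out like the Ensembl export quoted further down, and the class and method names (GenbankOrganismSketch, readOrganism, Organism) are hypothetical names chosen for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

public class GenbankOrganismSketch {

    /** Hypothetical holder for the two header values; either may be null if the tag is absent. */
    static final class Organism {
        final String source;    // text after the SOURCE tag (often the common name)
        final String organism;  // text after the ORGANISM tag (the scientific name)

        Organism(String source, String organism) {
            this.source = source;
            this.organism = organism;
        }
    }

    /** Scans the header and stops at FEATURES, ORIGIN or the end-of-record marker. */
    static Organism readOrganism(BufferedReader in) throws IOException {
        String source = null;
        String organism = null;
        String line;
        while ((line = in.readLine()) != null) {
            if (line.startsWith("FEATURES") || line.startsWith("ORIGIN") || line.startsWith("//")) {
                break; // the header is over, nothing more to look for
            }
            String trimmed = line.trim();
            if (line.startsWith("SOURCE")) {
                source = trimmed.substring("SOURCE".length()).trim();
            } else if (!line.isEmpty() && Character.isWhitespace(line.charAt(0))
                    && trimmed.startsWith("ORGANISM")) {
                // ORGANISM is an indented sub-tag of SOURCE
                organism = trimmed.substring("ORGANISM".length()).trim();
            }
        }
        return new Organism(source, organism);
    }

    public static void main(String[] args) throws IOException {
        // Header lines copied from the Ensembl export in this thread.
        String header = String.join("\n",
                "KEYWORDS    .",
                "SOURCE      House mouse",
                " ORGANISM  Mus musculus",
                "           Eukaryota; Metazoa; Chordata; Craniata; Vertebrata;",
                "FEATURES             Location/Qualifiers");
        Organism o = readOrganism(new BufferedReader(new StringReader(header)));
        System.out.println(o.source + " / " + o.organism); // prints: House mouse / Mus musculus
    }
}
```

This only reads the header text as written in the file; it does not consult any taxonomy database, so the value returned for SOURCE is whatever the exporting system put there.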

Thank you for your help,

Morgane.


mark.schreiber at novartis.com wrote:

>I think these properties should be going to the (Rich)Annotation bundle.
>
>- Mark
>
>
>
>
>
>Morgane THOMAS-CHOLLIER <mthomasc at vub.ac.be>
>Sent by: biojava-l-bounces at portal.open-bio.org
>02/15/2006 04:56 PM
>
> 
>        To:     biojava-l at biojava.org
>        cc:     (bcc: Mark Schreiber/GP/Novartis)
>        Subject:        Re: [Biojava-l] Genbank  parser error [biojavax]
>
>
>Hello again,
>
>I have continued using the Genbank parser, but this time with Genbank 
>files coming from NCBI :)
>
>I really appreciate the example from the documentation that converts a 
>Genbank file into an EMBL file. I have to say, it is really easy to use.
>
>I nevertheless have a question concerning the Organism and Source tags. 
>Indeed, it is clear in the documentation that they are ignored by the 
>parser.
>But I do not really understand why.
>When I used the Genbank files for accession numbers AC147788 and 
>DQ158013, I was unable to get the common name of the organism or use 
>getNameHierarchy(), but I could get the taxon ID for both.
>
>Is there a way to get the common name of the organism without making a 
>remote call to NCBI with the taxon ID?
>
>Thanks for your help,
>
>Morgane.
>
>Morgane THOMAS-CHOLLIER wrote:
>
>  
>
>>Hello Mark,
>>
>>My file is indeed too large to be posted.
>>So I have exported a smaller sequence from Ensembl and tested it with 
>>the parser. The behavior is the same.
>>You will find this "Genbank"-formatted file below.
>>
>>Thanks for your help,
>>
>>Morgane.
>>
>>LOCUS       6 3498 bp DNA HTG 14-FEB-2006
>>DEFINITION  Mus musculus chromosome 6 NCBIM34 partial sequence
>>           52305503..52309000 reannotated via EnsEMBL
>>ACCESSION   chromosome:NCBIM34:6:52305503:52309000:1
>>VERSION     chromosome:NCBIM34:6:52305503:52309000:1
>>KEYWORDS    .
>>SOURCE      House mouse
>> ORGANISM  Mus musculus
>>           Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
>>           Mammalia; Eutheria; Euarchontoglires; Glires; Rodentia;
>>           Sciurognathi; Muridae; Murinae; Mus.
>>COMMENT     This sequence was annotated by the Ensembl system. Please visit the
>>           Ensembl web site, http://www.ensembl.org/ for more information.
>>COMMENT     All feature locations are relative to the first (5') base of the
>>           sequence in this file.  The sequence presented is always the
>>           forward strand of the assembly. Features that lie outside of the
>>           sequence contained in this file have clonal location coordinates in
>>           the format: .:..
>>COMMENT     The /gene indicates a unique id for a gene,
>>           /note="transcript_id=..." a unique id for a transcript, /protein_id
>>           a unique id for a peptide and note="exon_id=..." a unique id for an
>>           exon. These ids are maintained wherever possible between versions.
>>COMMENT     All the exons and transcripts in Ensembl are confirmed by
>>           similarity to either protein or cDNA sequences.
>>FEATURES             Location/Qualifiers
>>    source          1..3498
>>                    /organism="Mus musculus"
>>                    /db_xref="taxon:10090"
>>    gene            complement(506..2826)
>>                    /gene=ENSMUSG00000014704
>>    mRNA            join(complement(2261..2826),complement(506..1620))
>>                    /gene="ENSMUSG00000014704"
>>                    /note="transcript_id=ENSMUST00000014848"
>>    CDS             join(complement(2261..2639),complement(881..1620))
>>                    /gene="ENSMUSG00000014704"
>>                    /protein_id="ENSMUSP00000014848"
>>                    /note="transcript_id=ENSMUST00000014848"
>>                    /db_xref="MarkerSymbol:Hoxa2"
>>                    /db_xref="Uniprot/SWISSPROT:HXA2_MOUSE"
>>                    /db_xref="RefSeq_peptide:NP_034581.1"
>>                    /db_xref="RefSeq_dna:NM_010451.1"
>>                    /db_xref="Uniprot/SPTREMBL:Q3UYP9_MOUSE"
>>                    /db_xref="Uniprot/SPTREMBL:Q920T7_MOUSE"
>>                    /db_xref="Uniprot/SPTREMBL:Q920T9_MOUSE"
>>                    /db_xref="Uniprot/SPTREMBL:Q920U0_MOUSE"
>>                    /db_xref="Uniprot/SPTREMBL:Q920U1_MOUSE"
>>                    /db_xref="Uniprot/SPTREMBL:Q920U2_MOUSE"
>>                    /db_xref="Uniprot/SPTREMBL:Q920U3_MOUSE"
>>                    /db_xref="Uniprot/SPTREMBL:Q920U4_MOUSE"
>>                    /db_xref="Uniprot/SPTREMBL:Q920U5_MOUSE"
>>                    /db_xref="EntrezGene:15399"
>>                    /db_xref="AgilentProbe:A_51_P501803"
>>                    /db_xref="EMBL:AB039184"
>>                    /db_xref="EMBL:AB039185"
>>                    /db_xref="EMBL:AB039186"
>>                    /db_xref="EMBL:AB039187"
>>                    /db_xref="EMBL:AB039188"
>>                    /db_xref="EMBL:AB039189"
>>                    /db_xref="EMBL:AB039190"
>>                    /db_xref="EMBL:AB039191"
>>                    /db_xref="EMBL:AB039192"
>>                    /db_xref="EMBL:AK134501"
>>                    /db_xref="EMBL:M87801"
>>                    /db_xref="EMBL:M93148"
>>                    /db_xref="EMBL:M93292"
>>                    /db_xref="EMBL:M95599"
>>                    /db_xref="GO:GO:0003700"
>>                    /db_xref="GO:GO:0005634"
>>                    /db_xref="GO:GO:0006355"
>>                    /db_xref="GO:GO:0007275"
>>                    /db_xref="IPI:IPI00132242.1"
>>                    /db_xref="UniGene:Mm.131"
>>                    /db_xref="protein_id:AAA37827.1"
>>                    /db_xref="protein_id:AAA37834.1"
>>                    /db_xref="protein_id:AAA37835.1"
>>                    /db_xref="protein_id:AAA37836.1"
>>                    /db_xref="protein_id:BAB68708.1"
>>                    /db_xref="protein_id:BAB68709.1"
>>                    /db_xref="protein_id:BAB68710.1"
>>                    /db_xref="protein_id:BAB68711.1"
>>                    /db_xref="protein_id:BAB68712.1"
>>                    /db_xref="protein_id:BAB68713.1"
>>                    /db_xref="protein_id:BAB68714.1"
>>                    /db_xref="protein_id:BAB68715.1"
>>                    /db_xref="protein_id:BAB68716.1"
>>                    /db_xref="protein_id:BAE22163.1"
>>                    /db_xref="AFFY_MG_U74Av2:102643_at"
>>                    /db_xref="AFFY_MG_U74Cv2:171063_at"
>>                    /db_xref="AFFY_Mouse430A_2:1419602_at"
>>                    /db_xref="AFFY_Mouse430_2:1419602_at"
>>                    /translation="MNYEFEREIGFINSQPSLAECLTSFPPVADTFQSSSIKTSTLSH
>>                    STLIPPPFEQTIPSLNPGSHPRHGAGVGGRPKSSPAGSRGSPVPAGALQPPEYPWMKE
>>                    KKAAKKTALPPAAASTGPACLGHKESLEIADGSGGGSRRLRTAYTNTQLLELEKEFHF
>>                    NKYLCRPRRVEIAALLDLTERQVKVWFQNRRMKHKRQTQCKENQNSEGKFKNLEDSDK
>>                    VEEDEEEKSLFEQALSVSGALLEREGYTFQQNALSQQQAPNGHNGDSQTFPVSPLTSN
>>                    EKNLKHFQHQSPTVPNCLSTMGQNCGAGLNNDSPEAIEVPSLQDFNVFSTDSCLQLSD
>>                    ALSPSLPGSLDSPVDISADSFDFFTDTLTTIDLQHLNY"
>>    exon            complement(506..1620)
>>                    /note="exon_id=ENSMUSE00000387033"
>>    exon            complement(2261..2826)
>>                    /note="exon_id=ENSMUSE00000193269"
>>BASE COUNT  938 a 815 c 882 g 863 t
>>ORIGIN
>>       1 AGGAAGAGTT GGAACGTAGA TGTTTGAAAC AAATGTGTAT AAATAAATGA ATTTTTGATA
>>      61 ACTCCGTTAT TGACCTAGAA ACTAGCAGCT TGGTAAGGGA ACTCCATTCC ACTCCACTCG
>>     121 TCCTAGAACT GGAAGTTTTT GTAGGCACTT TTCCTCTCCA CACTCAAAAG CTTGGGCTAG
>>     181 GGCCAACTCA GGCTGCCCAA GCCCATTTCT ATTACTAATG TAACTCTATG GCCTGAGTCT
>>     241 CAACACTGAA AACCAAATTC ATTCCCTTAG GGGGGAAAAA TCCAAAAAAA AAAAAAAAAA
>>     301 AAGTCTTGCC AGAAGCCCTA GCACTTTCTG GTTTTCTTCT TTGTTGCTGT TTGTTGCAGG
>>     361 CTTTGAACAT GCCACCCTAA TAAAATATAT TAAGATTGAA AAGTAAATTG TGACCAGACT
>>     421 TTTATTTACC ATGTTAGACT AAAAGAAGTA TAAGAAATCA GTATGAGTCT TGAGAAAGAG
>>     481 GGGAAGAAAA AAATAAGAAA GCTACTTATA GCAAAGGAGA ATTTATTCTA CCAAAAATAC
>>     541 GCATGACAAT GCATTCTAAT GTGGTACAAA AATAAACAGA AAGTGACAAG ACAATTTATG
>>     601 GTCACTTTCT TGCAGGCCTC CTGTTTTGTT TTTCAGGAAA ATCACATAGA AGCTTGTTGG
>>     661 GTTCTGTGTA AAAACCACTT AGAACGCCAA CATAATTTGC AAGAGATGGC TTTAAAACTG
>>     721 TGTCAGGGGA GAACATTAAA CGGAAAGTCC TCAACATTTG AGAGAGTAGG GGTAGATCAA
>>     781 GAAGAAACTA AAACGAAAAT CAACTCCCAG AATAAAAGAA GGCAAAGCCA CCTGGTCAAA
>>     841 GGCGTTTTGT TTTGTGAAGC TTTGTTTTGC TTTAATGTTC TTAGTAATTC AGATGCTGTA
>>     901 GGTCGATTGT GGTGAGTGTG TCTGTAAAAA AGTCAAAGCT GTCAGCTGAG ATATCTACAG
>>     961 GACTGTCCAG GGAGCCAGGC AAGCTGGGCG ACAGTGCATC TGAAAGCTGC AGGCAGGAAT
>>    1021 CTGTGGAGAA AACATTGAAG TCCTGCAAAG AGGGGACCTC GATGGCCTCG GGACTGTCAT
>>    1081 TGTTTAGGCC AGCTCCACAG TTCTGGCCCA TTGTTGACAA GCAGTTAGGA ACAGTGGGTG
>>    1141 ACTGGTGCTG AAAATGTTTC AAATTTTTCT CATTGCTGGT TAAAGGCGAA ACTGGGAAAG
>>    1201 TTTGGGAGTC GCCATTGTGT CCATTGGGAG CCTGCTGTTG AGAGAGCGCA TTTTGCTGAA
>>    1261 AAGTGTACCC TTCCCTCTCC AGAAGGGCCC CGGAGACACT GAGGGCTTGC TCAAAGAGTG
>>    1321 ACTTCTCTTC CTCGTCTTCC TCCACTTTGT CCGAGTCCTC CAGGTTTTTA AATTTCCCTT
>>    1381 CGCTGTTTTG GTTCTCCTTG CACTGGGTTT GCCTCTTATG CTTCATTCTC CGGTTCTGAA
>>    1441 ACCACACTTT CACTTGTCTC TCGGTCAAAT CCAGCAGCGC GGCGATTTCC ACCCTGCGGG
>>    1501 GTCTGCAAAG GTACTTGTTG AAATGAAATT CCTTTTCCAG CTCCAAAAGC TGAGTGTTGG
>>    1561 TGTACGCGGT TCTCAGACGC CTGGATCCCC CGCCGCTGCC ATCAGCTATT TCCAGGGATT
>>    1621 CTGCAGAAAG GGAAACCAAC AAGAGACACA CATACAGTTG AAGGTGGAAG GGTCCGAGCA
>>    1681 GGGTTATTCC ATTGGAGCAT AAATACAGCA GAAAAGATCA ACTGCAACAA AATGGCCGCC
>>    1741 CCTGGATGCA GTGCAGCTAT TGTGCTGCCC TTCCTGGGAG CCCAGCCCGG GGAAGCCCAG
>>    1801 TCTCTTCCAC CTCCATCAAA TTCCTGCCTG TGGCTTCCCC CAACCTCTTC ATCCGGGAGC
>>    1861 AAACTTTATA TTAGCTACAA CACAATTTAT AATTAATGCA TCAGCTGCTT AGCTGAGCAA
>>    1921 GAGCGGTCTA TCACTCTTCA TTACTGTCAA AAAGCCAAAC TCTAGGACAA CTAGACAAGA
>>    1981 GGAGGTCAGT TCCAACTCAA ATAAATCATC CTACATTACA CAAGTTAGGG AAAGTGCCCC
>>    2041 CCCTCCTCAA AATATATATG TCTCATTGTG GGACTCGGGA TCTATTTTCC CCTCCACCAA
>>    2101 ACCCACTCCT GAGACCACAG GGGCATGAGA CCCGCCACCA GGCATCTCTC TCTCTCCCCC
>>    2161 TTCCCTCGAA GCTCATGGTC CCCTCCCCCA CAACCGCTCC TAGGGAAGCC CGGAGGGGGA
>>    2221 CAAGGGTCCC CGAGACCTGG GGCCAAGTCT CCGGACTGAC CTTTGTGGCC GAGGCAGGCA
>>    2281 GGGCCCGTGG AGGCGGCGGC GGGCGGCAGC GCGGTTTTCT TGGCCGCCTT CTTCTCCTTC
>>    2341 ATCCAGGGAT ACTCAGGCGG CTGCAGGGCG CCGGCAGGCA CCGGGCTGCC GCGACTGCCC
>>    2401 GCGGGGCTCG ACTTGGGGCG GCCGCCAACG CCAGCGCCGT GGCGAGGGTG ACTGCCCGGG
>>    2461 TTCAGGCTGG GAATGGTCTG CTCAAAAGGA GGAGGAATCA GTGTCGAGTG TGAAAGCGTC
>>    2521 GAGGTCTTGA TTGATGAACT TTGAAATGTA TCAGCGACAG GGGGAAAAGA TGTCAGGCAC
>>    2581 TCAGCGAGCG ACGGCTGGCT ATTGATAAAA CCAATCTCTC GCTCAAATTC GTAATTCATG
>>    2641 GCCTTCTCCT TGGAGCCCCC TCGGAGGAAA AGTTCCCTCT TTTGGAGGGG CTTTGGGGGG
>>    2701 GCAAGGCCCA GGAAAAAGGC GAGCGCGAAG GAAAAAAAAA TCTATCATAG AAGATCGCTG
>>    2761 CTGGGGTGTT TTTTTTCTAA TTCACTGATT ACAGCCGTAT GGGGACCGCG CTACTATTAA
>>    2821 ACTATTGAAT TCATGGAGAC AAGGTTGAAA TTGGACCGAA TTGGCTGTCA CATGATTGCT
>>    2881 TCTGCCCAAT GACAATTTGG GCTTTAATCA AAAGAAGCCA CTGTCTGTTT GATTGATCCA
>>    2941 AAAAAGTCAG AAAGGAACGC CTCATTGGGG GCCATCGAGG CTTTATTTAC ACTTTTTTTC
>>    3001 AGGGCAAAAA TACATATATG TGGGTGTGGA TGGCAATGCC CCGGGAGTGC GTGGGGGGCG
>>    3061 AGAGTGCCTG TTTGCCTCCT GATCTGCAAG GATCTAGTGT GCTCCCTGGA GTGTGTGTGT
>>    3121 GAGTGTGTGC GTGTGAGCCC TGCTGCCGTC CCGCCAGTGG CTGCCCTCTG CCTCCCCCGC
>>    3181 ACACTCCGCG CATTGTTTGG GACTGTCGGG AAGACGCCTC GCACCTCACA AATCATTTAA
>>    3241 GCACCTCAGC CTGACGCCTG CAGTCATTAA CAAAGTAATC CATTAATCTT CAAAGTTTTG
>>    3301 ACACCCCAGG GCCCTGCATC TCAGCCACAT AAGTTCTGCT AAGGCAAGAG AAAGGAGCAG
>>    3361 AGTGGGAGAG AGAGAGGAGA GAGGGAGAGA GGGAGAGAGG GAGAGAGAGA GAGAGAGAGA
>>    3421 GAGAGAGAGA GAGAGAGAGA GAGAGAATGA ATATTGGGGT TCACCTTTCC TCTTCCTCCT
>>    3481 CTTTTTCCAA AATCAGTT
>>//
>>
>>
>>
>>
>>mark.schreiber at novartis.com wrote:
>>
>>    
>>
>>>Hi Morgane -
>>>
>>>I have to say that doesn't look much like Genbank : )
>>>
>>>The biojavax parsers are possibly a bit brittle due to their use of 
>>>regexps to recognize key elements. It should be fixable; I think the 
>>>problem is that the parser expects a word after LOCUS, not a number. 
>>>This may not be the only problem, though. Could you post the entire 
>>>file? Or, if it is large, a representative file of smaller size?
>>>
>>>- Mark
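
To illustrate the kind of brittleness described above, here is a minimal sketch using java.util.regex. Both patterns are hypothetical and are not the expressions used in GenbankFormat.java; STRICT simply assumes a parser that wants the locus name to start with a letter, and RELAXED shows one possible loosening. The class name LocusLineSketch and the "EXAMPLE1" locus name are made up for illustration.

```java
import java.util.regex.Pattern;

public class LocusLineSketch {

    // Hypothetical strict pattern: the locus name must begin with a letter.
    static final Pattern STRICT =
            Pattern.compile("^([A-Za-z]\\S*)\\s+(\\d+)\\s+bp\\s+.*$");

    // Hypothetical relaxed pattern: any run of non-whitespace counts as the name.
    static final Pattern RELAXED =
            Pattern.compile("^(\\S+)\\s+(\\d+)\\s+bp\\s+.*$");

    /** Returns true if the text following the LOCUS tag matches the given pattern in full. */
    static boolean matches(Pattern pattern, String locusBody) {
        return pattern.matcher(locusBody).matches();
    }

    public static void main(String[] args) {
        // Body of the LOCUS line from the Ensembl export, as echoed in the ParseException.
        String ensembl = "6 489671 bp DNA HTG 13-FEB-2006";
        // Made-up line with a word-like name, for comparison only.
        String wordLike = "EXAMPLE1 489671 bp DNA HTG 13-FEB-2006";

        System.out.println(matches(STRICT, wordLike));  // prints: true
        System.out.println(matches(STRICT, ensembl));   // prints: false
        System.out.println(matches(RELAXED, ensembl));  // prints: true
    }
}
```

A numeric name such as "6" fails the strict form only because of the leading-letter requirement; as noted above, this may not be the only difference between an Ensembl export and what the parser expects.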
>>>
>>>
>>>
>>>
>>>
>>>Morgane THOMAS-CHOLLIER <mthomasc at vub.ac.be>
>>>Sent by: biojava-l-bounces at portal.open-bio.org
>>>02/14/2006 04:36 AM
>>>
>>>
>>>       To:     biojava-l at biojava.org
>>>       cc:     (bcc: Mark Schreiber/GP/Novartis)
>>>       Subject:        [Biojava-l] Genbank  parser error [biojavax]
>>>
>>>
>>>Hello,
>>>
>>>I have tried biojavax today with a view to using the Genbank file parser.
>>>
>>>My test file is a Genbank-formatted file produced by the 
>>>Ensembl export system.
>>>
>>>The head of the file is as follows:
>>>
>>>LOCUS       6 489671 bp DNA HTG 13-FEB-2006
>>>DEFINITION  Mus musculus chromosome 6 NCBIM34 partial sequence
>>>           52296503..52786173 reannotated via EnsEMBL
>>>ACCESSION   chromosome:NCBIM34:6:52296503:52786173:1
>>>VERSION     chromosome:NCBIM34:6:52296503:52786173:1
>>>
>>>I used the code provided in the biojavax docbook to parse this file.
>>>I get the following error:
>>>
>>>Exception in thread "main" org.biojava.bio.BioException: Could not read sequence
>>>   at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:111)
>>>   at org.embnet.be.biojavax.tryout.GenbankParseTest.main(GenbankParseTest.java:31)
>>>Caused by: org.biojava.bio.seq.io.ParseException: Bad locus line found: 6 489671 bp DNA HTG 13-FEB-2006
>>>   at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:229)
>>>   at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:108)
>>>   ... 1 more
>>>
>>>I had a look at GenbankFormat.java, and I guess the problem comes 
>>>from the regular expression, which does not recognize this LOCUS line 
>>>as a standard Genbank LOCUS tag.
>>>
>>>Am I wrong? Has the biojavax Genbank parser been tested on 
>>>Ensembl-exported files?
>>>
>>>Morgane.
>>>

-- 
**********************************************************
Morgane THOMAS-CHOLLIER, PHD Student (mthomasc at vub.ac.be)

Vrije Universiteit Brussels (VUB)
Laboratory of Cell Genetics
Pleinlaan 2
1050 Brussels
Belgium

Tel : +32 2 629 15 22
**********************************************************
Stop Using Internet Explorer, choose FIREFOX !


