[Biojava-l] Genbank parser error [biojavax]
Morgane THOMAS-CHOLLIER
mthomasc at vub.ac.be
Fri Feb 17 05:16:05 EST 2006
Hello Mark,
Thank you very much for your quick reply.
However, I could not find out how to get the organism information via
the (Rich)Annotation.
Would it be possible for you to post a piece of code showing how I could
retrieve the common name of the organism?
Sorry for insisting, but I really need this parser for my work, and I
also really need to retrieve the organism info from the file :)
Thank you for your help,
Morgane.
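
One way to get at the organism information without depending on how the
parser exposes it is to read the SOURCE and ORGANISM lines directly from
the flat file. What follows is only a minimal sketch in plain Java, not
BioJavaX code: the class GenbankSourceSketch and its members are
hypothetical names, and it assumes an unwrapped Genbank record in which
SOURCE starts in column 1 and fits on one line, and the ORGANISM and
lineage lines are indented.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of BioJava or BioJavaX.
public class GenbankSourceSketch {

    /** Values read from the SOURCE block of one record. */
    public static final class Source {
        public final String sourceText;    // text after the SOURCE keyword
        public final String organism;      // text after the ORGANISM keyword
        public final List<String> lineage; // taxonomy terms, outermost first

        Source(String sourceText, String organism, List<String> lineage) {
            this.sourceText = sourceText;
            this.organism = organism;
            this.lineage = Collections.unmodifiableList(lineage);
        }
    }

    /** Returns the SOURCE block of the record, or null if there is none. */
    public static Source parse(String record) {
        String sourceText = null;
        String organism = null;
        StringBuilder lineageText = new StringBuilder();
        boolean inSource = false;

        for (String line : record.split("\\r?\\n")) {
            if (line.startsWith("SOURCE")) {
                sourceText = line.substring("SOURCE".length()).trim();
                inSource = true;
            } else if (inSource && line.trim().startsWith("ORGANISM")) {
                organism = line.trim().substring("ORGANISM".length()).trim();
            } else if (inSource && !line.isEmpty()
                    && Character.isWhitespace(line.charAt(0))) {
                // Indented continuation line: treated as part of the lineage.
                lineageText.append(line.trim()).append(' ');
            } else if (inSource) {
                break; // the next top-level keyword ends the SOURCE block
            }
        }
        if (sourceText == null) {
            return null;
        }

        // Split "A; B; C." into terms and drop the trailing full stop.
        List<String> lineage = new ArrayList<String>();
        for (String term : lineageText.toString().split(";")) {
            String cleaned = term.trim();
            if (cleaned.endsWith(".")) {
                cleaned = cleaned.substring(0, cleaned.length() - 1);
            }
            if (!cleaned.isEmpty()) {
                lineage.add(cleaned);
            }
        }
        return new Source(sourceText, organism, lineage);
    }
}
```

The text after SOURCE is returned verbatim. In the Ensembl export quoted
further down it is just the common name, but NCBI files often put both the
scientific and the common name on that line, so it may need further
splitting.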
mark.schreiber at novartis.com wrote:
>I think these properties should be going to the (Rich)Annotation bundle.
>
>- Mark
>
>
>
>
>
>Morgane THOMAS-CHOLLIER <mthomasc at vub.ac.be>
>Sent by: biojava-l-bounces at portal.open-bio.org
>02/15/2006 04:56 PM
>
>
> To: biojava-l at biojava.org
> cc: (bcc: Mark Schreiber/GP/Novartis)
> Subject: Re: [Biojava-l] Genbank parser error [biojavax]
>
>
>Hello again,
>
>I have continued using the Genbank parser, but this time with Genbank
>files coming from NCBI :)
>
>I really appreciate the example from the documentation that converts a
>Genbank file into an EMBL file. I have to say, it is really easy to use.
>
>I nevertheless have a question concerning the Organism and Source tags.
>Indeed, it is clear in the documentation that they are ignored by the
>parser.
>But I do not really understand why.
>When I used the Genbank files for accession numbers AC147788 and
>DQ158013, I was unable to get the common name of the organism or use
>getNameHierarchy(), but I could get the taxon ID for both.
>
>Is there a way to get the common name of the organism without making a
>remote call to NCBI with the taxon ID?
>
>Thanks for your help,
>
>Morgane.
>
>Morgane THOMAS-CHOLLIER wrote:
>
>
>
>>Hello Mark,
>>
>>My file is indeed too large to be posted.
>>So I have exported a smaller sequence from Ensembl that I tested with
>>the parser. The behavior is the same.
>>You will find this "Genbank"-formatted file enclosed below.
>>
>>Thanks for your help,
>>
>>Morgane.
>>
>>LOCUS 6 3498 bp DNA HTG 14-FEB-2006
>>DEFINITION Mus musculus chromosome 6 NCBIM34 partial sequence
>> 52305503..52309000 reannotated via EnsEMBL
>>ACCESSION chromosome:NCBIM34:6:52305503:52309000:1
>>VERSION chromosome:NCBIM34:6:52305503:52309000:1
>>KEYWORDS .
>>SOURCE House mouse
>> ORGANISM Mus musculus
>> Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
>> Mammalia; Eutheria; Euarchontoglires; Glires; Rodentia;
>> Sciurognathi; Muridae; Murinae; Mus.
>>COMMENT This sequence was annotated by the Ensembl system. Please visit the
>> Ensembl web site, http://www.ensembl.org/ for more information.
>>COMMENT All feature locations are relative to the first (5') base of the
>> sequence in this file. The sequence presented is always the
>> forward strand of the assembly. Features that lie outside of the
>> sequence contained in this file have clonal location coordinates in
>> the format: .:..
>>COMMENT The /gene indicates a unique id for a gene,
>> /note="transcript_id=..." a unique id for a transcript, /protein_id
>> a unique id for a peptide and note="exon_id=..." a unique id for an
>> exon. These ids are maintained wherever possible between versions.
>>COMMENT All the exons and transcripts in Ensembl are confirmed by
>> similarity to either protein or cDNA sequences.
>>FEATURES Location/Qualifiers
>> source 1..3498
>> /organism="Mus musculus"
>> /db_xref="taxon:10090"
>> gene complement(506..2826)
>> /gene=ENSMUSG00000014704
>> mRNA join(complement(2261..2826),complement(506..1620))
>> /gene="ENSMUSG00000014704"
>> /note="transcript_id=ENSMUST00000014848"
>> CDS join(complement(2261..2639),complement(881..1620))
>> /gene="ENSMUSG00000014704"
>> /protein_id="ENSMUSP00000014848"
>> /note="transcript_id=ENSMUST00000014848"
>> /db_xref="MarkerSymbol:Hoxa2"
>> /db_xref="Uniprot/SWISSPROT:HXA2_MOUSE"
>> /db_xref="RefSeq_peptide:NP_034581.1"
>> /db_xref="RefSeq_dna:NM_010451.1"
>> /db_xref="Uniprot/SPTREMBL:Q3UYP9_MOUSE"
>> /db_xref="Uniprot/SPTREMBL:Q920T7_MOUSE"
>> /db_xref="Uniprot/SPTREMBL:Q920T9_MOUSE"
>> /db_xref="Uniprot/SPTREMBL:Q920U0_MOUSE"
>> /db_xref="Uniprot/SPTREMBL:Q920U1_MOUSE"
>> /db_xref="Uniprot/SPTREMBL:Q920U2_MOUSE"
>> /db_xref="Uniprot/SPTREMBL:Q920U3_MOUSE"
>> /db_xref="Uniprot/SPTREMBL:Q920U4_MOUSE"
>> /db_xref="Uniprot/SPTREMBL:Q920U5_MOUSE"
>> /db_xref="EntrezGene:15399"
>> /db_xref="AgilentProbe:A_51_P501803"
>> /db_xref="EMBL:AB039184"
>> /db_xref="EMBL:AB039185"
>> /db_xref="EMBL:AB039186"
>> /db_xref="EMBL:AB039187"
>> /db_xref="EMBL:AB039188"
>> /db_xref="EMBL:AB039189"
>> /db_xref="EMBL:AB039190"
>> /db_xref="EMBL:AB039191"
>> /db_xref="EMBL:AB039192"
>> /db_xref="EMBL:AK134501"
>> /db_xref="EMBL:M87801"
>> /db_xref="EMBL:M93148"
>> /db_xref="EMBL:M93292"
>> /db_xref="EMBL:M95599"
>> /db_xref="GO:GO:0003700"
>> /db_xref="GO:GO:0005634"
>> /db_xref="GO:GO:0006355"
>> /db_xref="GO:GO:0007275"
>> /db_xref="IPI:IPI00132242.1"
>> /db_xref="UniGene:Mm.131"
>> /db_xref="protein_id:AAA37827.1"
>> /db_xref="protein_id:AAA37834.1"
>> /db_xref="protein_id:AAA37835.1"
>> /db_xref="protein_id:AAA37836.1"
>> /db_xref="protein_id:BAB68708.1"
>> /db_xref="protein_id:BAB68709.1"
>> /db_xref="protein_id:BAB68710.1"
>> /db_xref="protein_id:BAB68711.1"
>> /db_xref="protein_id:BAB68712.1"
>> /db_xref="protein_id:BAB68713.1"
>> /db_xref="protein_id:BAB68714.1"
>> /db_xref="protein_id:BAB68715.1"
>> /db_xref="protein_id:BAB68716.1"
>> /db_xref="protein_id:BAE22163.1"
>> /db_xref="AFFY_MG_U74Av2:102643_at"
>> /db_xref="AFFY_MG_U74Cv2:171063_at"
>> /db_xref="AFFY_Mouse430A_2:1419602_at"
>> /db_xref="AFFY_Mouse430_2:1419602_at"
>> /translation="MNYEFEREIGFINSQPSLAECLTSFPPVADTFQSSSIKTSTLSH
>> STLIPPPFEQTIPSLNPGSHPRHGAGVGGRPKSSPAGSRGSPVPAGALQPPEYPWMKE
>> KKAAKKTALPPAAASTGPACLGHKESLEIADGSGGGSRRLRTAYTNTQLLELEKEFHF
>> NKYLCRPRRVEIAALLDLTERQVKVWFQNRRMKHKRQTQCKENQNSEGKFKNLEDSDK
>> VEEDEEEKSLFEQALSVSGALLEREGYTFQQNALSQQQAPNGHNGDSQTFPVSPLTSN
>> EKNLKHFQHQSPTVPNCLSTMGQNCGAGLNNDSPEAIEVPSLQDFNVFSTDSCLQLSD
>> ALSPSLPGSLDSPVDISADSFDFFTDTLTTIDLQHLNY"
>> exon complement(506..1620)
>> /note="exon_id=ENSMUSE00000387033"
>> exon complement(2261..2826)
>> /note="exon_id=ENSMUSE00000193269"
>>BASE COUNT 938 a 815 c 882 g 863 t
>>ORIGIN
>> 1 AGGAAGAGTT GGAACGTAGA TGTTTGAAAC AAATGTGTAT AAATAAATGA ATTTTTGATA
>> 61 ACTCCGTTAT TGACCTAGAA ACTAGCAGCT TGGTAAGGGA ACTCCATTCC ACTCCACTCG
>> 121 TCCTAGAACT GGAAGTTTTT GTAGGCACTT TTCCTCTCCA CACTCAAAAG CTTGGGCTAG
>> 181 GGCCAACTCA GGCTGCCCAA GCCCATTTCT ATTACTAATG TAACTCTATG GCCTGAGTCT
>> 241 CAACACTGAA AACCAAATTC ATTCCCTTAG GGGGGAAAAA TCCAAAAAAA AAAAAAAAAA
>> 301 AAGTCTTGCC AGAAGCCCTA GCACTTTCTG GTTTTCTTCT TTGTTGCTGT TTGTTGCAGG
>> 361 CTTTGAACAT GCCACCCTAA TAAAATATAT TAAGATTGAA AAGTAAATTG TGACCAGACT
>> 421 TTTATTTACC ATGTTAGACT AAAAGAAGTA TAAGAAATCA GTATGAGTCT TGAGAAAGAG
>> 481 GGGAAGAAAA AAATAAGAAA GCTACTTATA GCAAAGGAGA ATTTATTCTA CCAAAAATAC
>> 541 GCATGACAAT GCATTCTAAT GTGGTACAAA AATAAACAGA AAGTGACAAG ACAATTTATG
>> 601 GTCACTTTCT TGCAGGCCTC CTGTTTTGTT TTTCAGGAAA ATCACATAGA AGCTTGTTGG
>> 661 GTTCTGTGTA AAAACCACTT AGAACGCCAA CATAATTTGC AAGAGATGGC TTTAAAACTG
>> 721 TGTCAGGGGA GAACATTAAA CGGAAAGTCC TCAACATTTG AGAGAGTAGG GGTAGATCAA
>> 781 GAAGAAACTA AAACGAAAAT CAACTCCCAG AATAAAAGAA GGCAAAGCCA CCTGGTCAAA
>> 841 GGCGTTTTGT TTTGTGAAGC TTTGTTTTGC TTTAATGTTC TTAGTAATTC AGATGCTGTA
>> 901 GGTCGATTGT GGTGAGTGTG TCTGTAAAAA AGTCAAAGCT GTCAGCTGAG ATATCTACAG
>> 961 GACTGTCCAG GGAGCCAGGC AAGCTGGGCG ACAGTGCATC TGAAAGCTGC AGGCAGGAAT
>> 1021 CTGTGGAGAA AACATTGAAG TCCTGCAAAG AGGGGACCTC GATGGCCTCG GGACTGTCAT
>> 1081 TGTTTAGGCC AGCTCCACAG TTCTGGCCCA TTGTTGACAA GCAGTTAGGA ACAGTGGGTG
>> 1141 ACTGGTGCTG AAAATGTTTC AAATTTTTCT CATTGCTGGT TAAAGGCGAA ACTGGGAAAG
>> 1201 TTTGGGAGTC GCCATTGTGT CCATTGGGAG CCTGCTGTTG AGAGAGCGCA TTTTGCTGAA
>> 1261 AAGTGTACCC TTCCCTCTCC AGAAGGGCCC CGGAGACACT GAGGGCTTGC TCAAAGAGTG
>> 1321 ACTTCTCTTC CTCGTCTTCC TCCACTTTGT CCGAGTCCTC CAGGTTTTTA AATTTCCCTT
>> 1381 CGCTGTTTTG GTTCTCCTTG CACTGGGTTT GCCTCTTATG CTTCATTCTC CGGTTCTGAA
>> 1441 ACCACACTTT CACTTGTCTC TCGGTCAAAT CCAGCAGCGC GGCGATTTCC ACCCTGCGGG
>> 1501 GTCTGCAAAG GTACTTGTTG AAATGAAATT CCTTTTCCAG CTCCAAAAGC TGAGTGTTGG
>> 1561 TGTACGCGGT TCTCAGACGC CTGGATCCCC CGCCGCTGCC ATCAGCTATT TCCAGGGATT
>> 1621 CTGCAGAAAG GGAAACCAAC AAGAGACACA CATACAGTTG AAGGTGGAAG GGTCCGAGCA
>> 1681 GGGTTATTCC ATTGGAGCAT AAATACAGCA GAAAAGATCA ACTGCAACAA AATGGCCGCC
>> 1741 CCTGGATGCA GTGCAGCTAT TGTGCTGCCC TTCCTGGGAG CCCAGCCCGG GGAAGCCCAG
>> 1801 TCTCTTCCAC CTCCATCAAA TTCCTGCCTG TGGCTTCCCC CAACCTCTTC ATCCGGGAGC
>> 1861 AAACTTTATA TTAGCTACAA CACAATTTAT AATTAATGCA TCAGCTGCTT AGCTGAGCAA
>> 1921 GAGCGGTCTA TCACTCTTCA TTACTGTCAA AAAGCCAAAC TCTAGGACAA CTAGACAAGA
>> 1981 GGAGGTCAGT TCCAACTCAA ATAAATCATC CTACATTACA CAAGTTAGGG AAAGTGCCCC
>> 2041 CCCTCCTCAA AATATATATG TCTCATTGTG GGACTCGGGA TCTATTTTCC CCTCCACCAA
>> 2101 ACCCACTCCT GAGACCACAG GGGCATGAGA CCCGCCACCA GGCATCTCTC TCTCTCCCCC
>> 2161 TTCCCTCGAA GCTCATGGTC CCCTCCCCCA CAACCGCTCC TAGGGAAGCC CGGAGGGGGA
>> 2221 CAAGGGTCCC CGAGACCTGG GGCCAAGTCT CCGGACTGAC CTTTGTGGCC GAGGCAGGCA
>> 2281 GGGCCCGTGG AGGCGGCGGC GGGCGGCAGC GCGGTTTTCT TGGCCGCCTT CTTCTCCTTC
>> 2341 ATCCAGGGAT ACTCAGGCGG CTGCAGGGCG CCGGCAGGCA CCGGGCTGCC GCGACTGCCC
>> 2401 GCGGGGCTCG ACTTGGGGCG GCCGCCAACG CCAGCGCCGT GGCGAGGGTG ACTGCCCGGG
>> 2461 TTCAGGCTGG GAATGGTCTG CTCAAAAGGA GGAGGAATCA GTGTCGAGTG TGAAAGCGTC
>> 2521 GAGGTCTTGA TTGATGAACT TTGAAATGTA TCAGCGACAG GGGGAAAAGA TGTCAGGCAC
>> 2581 TCAGCGAGCG ACGGCTGGCT ATTGATAAAA CCAATCTCTC GCTCAAATTC GTAATTCATG
>> 2641 GCCTTCTCCT TGGAGCCCCC TCGGAGGAAA AGTTCCCTCT TTTGGAGGGG CTTTGGGGGG
>> 2701 GCAAGGCCCA GGAAAAAGGC GAGCGCGAAG GAAAAAAAAA TCTATCATAG AAGATCGCTG
>> 2761 CTGGGGTGTT TTTTTTCTAA TTCACTGATT ACAGCCGTAT GGGGACCGCG CTACTATTAA
>> 2821 ACTATTGAAT TCATGGAGAC AAGGTTGAAA TTGGACCGAA TTGGCTGTCA CATGATTGCT
>> 2881 TCTGCCCAAT GACAATTTGG GCTTTAATCA AAAGAAGCCA CTGTCTGTTT GATTGATCCA
>> 2941 AAAAAGTCAG AAAGGAACGC CTCATTGGGG GCCATCGAGG CTTTATTTAC ACTTTTTTTC
>> 3001 AGGGCAAAAA TACATATATG TGGGTGTGGA TGGCAATGCC CCGGGAGTGC GTGGGGGGCG
>> 3061 AGAGTGCCTG TTTGCCTCCT GATCTGCAAG GATCTAGTGT GCTCCCTGGA GTGTGTGTGT
>> 3121 GAGTGTGTGC GTGTGAGCCC TGCTGCCGTC CCGCCAGTGG CTGCCCTCTG CCTCCCCCGC
>> 3181 ACACTCCGCG CATTGTTTGG GACTGTCGGG AAGACGCCTC GCACCTCACA AATCATTTAA
>> 3241 GCACCTCAGC CTGACGCCTG CAGTCATTAA CAAAGTAATC CATTAATCTT CAAAGTTTTG
>> 3301 ACACCCCAGG GCCCTGCATC TCAGCCACAT AAGTTCTGCT AAGGCAAGAG AAAGGAGCAG
>> 3361 AGTGGGAGAG AGAGAGGAGA GAGGGAGAGA GGGAGAGAGG GAGAGAGAGA GAGAGAGAGA
>> 3421 GAGAGAGAGA GAGAGAGAGA GAGAGAATGA ATATTGGGGT TCACCTTTCC TCTTCCTCCT
>> 3481 CTTTTTCCAA AATCAGTT
>>//
>>
>>
>>
>>
>>mark.schreiber at novartis.com wrote:
>>
>>
>>
>>>Hi Morgane -
>>>
>>>I have to say that doesn't look much like Genbank : )
>>>
>>>The biojavax parsers are possibly a bit brittle due to their use of
>>>regexps to recognize key elements. It should be fixable; I think the
>>>problem is that the parser expects a word after LOCUS, not a number.
>>>This may not be the only problem, though. Could you post the entire
>>>file? Or, if it is large, a representative file of smaller size.
>>>
>>>- Mark
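
To illustrate the kind of brittleness described above: a pattern that
requires the locus name to start with a letter rejects the Ensembl line,
while one that accepts any non-whitespace token as the name does not. The
sketch below is an assumption made for illustration only. Neither pattern
is the regular expression actually used in GenbankFormat.java, the class
LocusLineSketch is a hypothetical name, and the field layout is simplified
to the five tokens that appear in the "Bad locus line" message.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration, not the BioJavaX implementation.
public class LocusLineSketch {

    // Shared tail: sequence length, "bp", molecule type, division, date.
    private static final String TAIL =
        "\\s+(\\d+)\\s+bp\\s+(\\S+)\\s+(\\S+)\\s+(\\d{2}-[A-Z]{3}-\\d{4})$";

    // Strict: the locus name must be a word that starts with a letter.
    public static final Pattern STRICT =
        Pattern.compile("^([A-Za-z]\\w*)" + TAIL);

    // Relaxed: any run of non-whitespace characters is accepted as the name.
    public static final Pattern RELAXED =
        Pattern.compile("^(\\S+)" + TAIL);

    /** True if the pattern matches the fields that follow the LOCUS keyword. */
    public static boolean accepts(Pattern pattern, String locusFields) {
        return pattern.matcher(locusFields.trim()).matches();
    }

    /** Sequence length taken from the relaxed pattern, or -1 if no match. */
    public static int sequenceLength(String locusFields) {
        Matcher m = RELAXED.matcher(locusFields.trim());
        return m.matches() ? Integer.parseInt(m.group(2)) : -1;
    }

    public static void main(String[] args) {
        // The fields exactly as they appear in the reported error message.
        String ensembl = "6 489671 bp DNA HTG 13-FEB-2006";
        System.out.println("strict:  " + accepts(STRICT, ensembl));
        System.out.println("relaxed: " + accepts(RELAXED, ensembl));
        System.out.println("length:  " + sequenceLength(ensembl));
    }
}
```

With the line from the error message, the strict pattern fails because the
name "6" does not start with a letter, and the relaxed pattern matches. A
real LOCUS line from NCBI can carry more fields than this, so the tail of
the pattern would need extending before it could be used on such files.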
>>>
>>>
>>>
>>>
>>>
>>>Morgane THOMAS-CHOLLIER <mthomasc at vub.ac.be>
>>>Sent by: biojava-l-bounces at portal.open-bio.org
>>>02/14/2006 04:36 AM
>>>
>>>
>>> To: biojava-l at biojava.org
>>> cc: (bcc: Mark Schreiber/GP/Novartis)
>>> Subject: [Biojava-l] Genbank parser error [biojavax]
>>>
>>>
>>>Hello,
>>>
>>>I tried biojavax today with a view to using the Genbank file parser.
>>>
>>>My test file is a Genbank-formatted file produced by the Ensembl
>>>export system.
>>>
>>>The head of the file is as follows:
>>>
>>>LOCUS 6 489671 bp DNA HTG 13-FEB-2006
>>>DEFINITION Mus musculus chromosome 6 NCBIM34 partial sequence
>>> 52296503..52786173 reannotated via EnsEMBL
>>>ACCESSION chromosome:NCBIM34:6:52296503:52786173:1
>>>VERSION chromosome:NCBIM34:6:52296503:52786173:1
>>>
>>>I used the code provided in biojavax docbook to parse this file.
>>>I get the following error :
>>>
>>>Exception in thread "main" org.biojava.bio.BioException: Could not read sequence
>>>    at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:111)
>>>    at org.embnet.be.biojavax.tryout.GenbankParseTest.main(GenbankParseTest.java:31)
>>>Caused by: org.biojava.bio.seq.io.ParseException: Bad locus line found: 6 489671 bp DNA HTG 13-FEB-2006
>>>    at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:229)
>>>    at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:108)
>>>    ... 1 more
>>>
>>>I had a look at GenbankFormat.java, and I guess the problem comes
>>>from the regular expression, which does not recognize this LOCUS line
>>>as a standard Genbank LOCUS tag.
>>>
>>>Am I wrong? Has the biojavax Genbank parser been tested on
>>>Ensembl-exported files?
>>>
>>>Morgane.
>>>
>>>
>>>
>>>
>>>
>
>
>
--
**********************************************************
Morgane THOMAS-CHOLLIER, PHD Student (mthomasc at vub.ac.be)
Vrije Universiteit Brussels (VUB)
Laboratory of Cell Genetics
Pleinlaan 2
1050 Brussels
Belgium
Tel : +32 2 629 15 22
**********************************************************
Stop Using Internet Explorer, choose FIREFOX !