[Biojava-l] Parsing INSDseq Sequences (1.3 & 1.4)

Seth Johnson johnson.biotech at gmail.com
Tue Jun 13 16:28:24 UTC 2006


Works like a charm now!!!  :)   I figured it was a typo somewhere on Friday,
but couldn't find the source.  I didn't think tag info was case sensitive.
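
For anyone finding this thread later: XML element names are case sensitive, so a lookup that differs from the file only in case matches nothing. Below is a minimal standalone sketch of that, using only the JDK DOM parser and the other-seqids block from the file quoted further down. It is not the BioJava parser code, and the class and method names (SeqIdCaseDemo, countTags, findGi) are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class SeqIdCaseDemo {
    // The other-seqids fragment from the record quoted in this thread.
    static final String XML =
        "<INSDSeq_other-seqids>"
      + "<INSDSeqid>gb|AY069118.1|</INSDSeqid>"
      + "<INSDSeqid>gi|17861571</INSDSeqid>"
      + "</INSDSeq_other-seqids>";

    // Parse an XML string into a DOM document (checked exceptions wrapped).
    static Document parse(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Count elements whose name matches tagName exactly (case sensitive).
    static int countTags(String xml, String tagName) {
        return parse(xml).getElementsByTagName(tagName).getLength();
    }

    // Return the number from the first "gi|..." seqid, or null if none.
    static String findGi(String xml) {
        NodeList ids = parse(xml).getElementsByTagName("INSDSeqid");
        for (int i = 0; i < ids.getLength(); i++) {
            String value = ids.item(i).getTextContent().trim();
            if (value.startsWith("gi|")) {
                return value.substring("gi|".length());
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(countTags(XML, "INSDSeqid")); // 2
        System.out.println(countTags(XML, "INSDSeqId")); // 0 (capital I)
        System.out.println(findGi(XML));                 // 17861571
    }
}
```

The same prefix check on "gi|" should work on the note values returned for the other-seqids term, assuming they hold the raw INSDSeqid text.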

On 6/12/06, Richard Holland <richard.holland at ebi.ac.uk> wrote:
>
> Typo in code. My fault. Try again!
>
>
>
> On Thu, 2006-06-08 at 10:23 -0400, Seth Johnson wrote:
> > I'm still getting an empty array back from this:
> >
> > Note[] myAccs = ((RichAnnotation) rs.getAnnotation())
> >         .getProperties(INSDseqFormat.Terms.getOtherSeqIdTerm());
> >
> > Here's the file that I'm parsing:
> > ~~~~~~~~~~~~~~~~~~~~~~
> > <?xml version="1.0"?>
> > <!DOCTYPE INSDSet PUBLIC "-//NCBI//INSD INSDSeq/EN"
> > "http://www.ncbi.nlm.nih.gov/dtd/INSD_INSDSeq.dtd">
> > <INSDSet>
> > <INSDSeq>
> >   <INSDSeq_locus>AY069118</INSDSeq_locus>
> >   <INSDSeq_length>1502</INSDSeq_length>
> >   <INSDSeq_strandedness>single</INSDSeq_strandedness>
> >   <INSDSeq_moltype>mRNA</INSDSeq_moltype>
> >   <INSDSeq_topology>linear</INSDSeq_topology>
> >   <INSDSeq_division>INV</INSDSeq_division>
> >   <INSDSeq_update-date>17-DEC-2001</INSDSeq_update-date>
> >   <INSDSeq_create-date>15-DEC-2001</INSDSeq_create-date>
> >   <INSDSeq_definition>Drosophila melanogaster GH13089 full length
> > cDNA</INSDSeq_definition>
> >   <INSDSeq_primary-accession>AY069118</INSDSeq_primary-accession>
> >   <INSDSeq_accession-version> AY069118.1</INSDSeq_accession-version>
> >   <INSDSeq_other-seqids>
> >     <INSDSeqid>gb|AY069118.1|</INSDSeqid>
> >     <INSDSeqid>gi|17861571</INSDSeqid>
> >   </INSDSeq_other-seqids>
> >   <INSDSeq_keywords>
> >     <INSDKeyword>FLI_CDNA</INSDKeyword>
> >   </INSDSeq_keywords>
> >   <INSDSeq_source>Drosophila melanogaster (fruit
> > fly)</INSDSeq_source>
> >   <INSDSeq_organism>Drosophila melanogaster</INSDSeq_organism>
> >   <INSDSeq_taxonomy>Eukaryota; Metazoa; Arthropoda; Hexapoda; Insecta;
> > Pterygota; Neoptera; Endopterygota; Diptera; Brachycera; Muscomorpha;
> > Ephydroidea; Drosophilidae; Drosophila</INSDSeq_taxonomy>
> >   <INSDSeq_references>
> >     <INSDReference>
> >       <INSDReference_reference>1 (bases 1 to
> > 1502)</INSDReference_reference>
> >       <INSDReference_position>1..1502</INSDReference_position>
> >       <INSDReference_authors>
> >         <INSDAuthor>Stapleton,M.</INSDAuthor>
> >         <INSDAuthor>Brokstein,P.</INSDAuthor>
> >         <INSDAuthor>Hong,L.</INSDAuthor>
> >         <INSDAuthor>Agbayani,A.</INSDAuthor>
> >         <INSDAuthor>Carlson,J.</INSDAuthor>
> >         <INSDAuthor>Champe,M.</INSDAuthor>
> >         <INSDAuthor>Chavez,C.</INSDAuthor>
> >         <INSDAuthor>Dorsett,V.</INSDAuthor>
> >         <INSDAuthor>Farfan,D.</INSDAuthor>
> >         <INSDAuthor>Frise,E.</INSDAuthor>
> >         <INSDAuthor>George,R.</INSDAuthor>
> >         <INSDAuthor>Gonzalez,M.</INSDAuthor>
> >         <INSDAuthor>Guarin,H.</INSDAuthor>
> >         <INSDAuthor>Li,P.</INSDAuthor>
> >         <INSDAuthor>Liao,G.</INSDAuthor>
> >         <INSDAuthor>Miranda,A.</INSDAuthor>
> >         <INSDAuthor>Mungall,C.J.</INSDAuthor>
> >         <INSDAuthor>Nunoo,J.</INSDAuthor>
> >         <INSDAuthor>Pacleb,J.</INSDAuthor>
> >         <INSDAuthor>Paragas,V.</INSDAuthor>
> >         <INSDAuthor>Park,S.</INSDAuthor>
> >         <INSDAuthor>Phouanenavong,S.</INSDAuthor>
> >         <INSDAuthor>Wan,K.</INSDAuthor>
> >         <INSDAuthor>Yu,C.</INSDAuthor>
> >         <INSDAuthor>Lewis,S.E.</INSDAuthor>
> >         <INSDAuthor>Rubin, G.M.</INSDAuthor>
> >         <INSDAuthor>Celniker,S.</INSDAuthor>
> >       </INSDReference_authors>
> >       <INSDReference_title>Direct Submission</INSDReference_title>
> >       <INSDReference_journal>Submitted (10-DEC-2001) Berkeley
> > Drosophila Genome Project, Lawrence Berkeley National Laboratory, One
> > Cyclotron Road, Berkeley, CA 94720, USA</INSDReference_journal>
> >     </INSDReference>
> >   </INSDSeq_references>
> >   <INSDSeq_comment>Sequence submitted by: Berkeley Drosophila Genome
> > Project Lawrence Berkeley National Laboratory Berkeley, CA 94720 This
> > clone was sequenced as part of a high-throughput process to sequence
> > clones from Drosophila Gene Collection 1 (Rubin et al., Science 2000).
> > The sequence has been subjected to integrity checks for sequence
> > accuracy, presence of a polyA tail and contiguity within 100 kb in the
> > genome. Thus we believe the sequence to reflect accurately this
> > particular cDNA clone. However, there are artifacts associated with
> > the generation of cDNA clones that may have not been detected in our
> > initial analyses such as internal priming, priming from contaminating
> > genomic DNA, retained introns due to reverse transcription of
> > unspliced precursor RNAs, and reverse transcriptase errors that result
> > in single base changes. For further information about this sequence,
> > including its location and relationship to other sequences, please
> > visit our Web site ( http://fruitfly.berkeley.edu) or send email to
> > cdna at fruitfly.berkeley.edu.</INSDSeq_comment>
> >   <INSDSeq_feature-table>
> >     <INSDFeature>
> >       <INSDFeature_key>source</INSDFeature_key>
> >       <INSDFeature_location>1..1502</INSDFeature_location>
> >       <INSDFeature_intervals>
> >         <INSDInterval>
> >           <INSDInterval_from>1</INSDInterval_from>
> >           <INSDInterval_to>1502</INSDInterval_to>
> >           <INSDInterval_accession> AY069118.1</INSDInterval_accession>
> >         </INSDInterval>
> >       </INSDFeature_intervals>
> >       <INSDFeature_quals>
> >         <INSDQualifier>
> >           <INSDQualifier_name>organism</INSDQualifier_name>
> >           <INSDQualifier_value>Drosophila
> > melanogaster</INSDQualifier_value>
> >         </INSDQualifier>
> >         <INSDQualifier>
> >           <INSDQualifier_name>mol_type</INSDQualifier_name>
> >           <INSDQualifier_value>mRNA</INSDQualifier_value>
> >         </INSDQualifier>
> >         <INSDQualifier>
> >           <INSDQualifier_name>strain</INSDQualifier_name>
> >           <INSDQualifier_value>y; cn bw sp</INSDQualifier_value>
> >         </INSDQualifier>
> >         <INSDQualifier>
> >           <INSDQualifier_name>db_xref</INSDQualifier_name>
> >           <INSDQualifier_value>taxon:7227</INSDQualifier_value>
> >         </INSDQualifier>
> >         <INSDQualifier>
> >           <INSDQualifier_name>map</INSDQualifier_name>
> >           <INSDQualifier_value>39B3-39B3</INSDQualifier_value>
> >         </INSDQualifier>
> >       </INSDFeature_quals>
> >     </INSDFeature>
> >     <INSDFeature>
> >       <INSDFeature_key>gene</INSDFeature_key>
> >       <INSDFeature_location>1..1502</INSDFeature_location>
> >       <INSDFeature_intervals>
> >         <INSDInterval>
> >           <INSDInterval_from>1</INSDInterval_from>
> >           <INSDInterval_to>1502</INSDInterval_to>
> >           <INSDInterval_accession> AY069118.1</INSDInterval_accession>
> >         </INSDInterval>
> >       </INSDFeature_intervals>
> >       <INSDFeature_quals>
> >         <INSDQualifier>
> >           <INSDQualifier_name>gene</INSDQualifier_name>
> >           <INSDQualifier_value>E2f2</INSDQualifier_value>
> >         </INSDQualifier>
> >         <INSDQualifier>
> >           <INSDQualifier_name>note</INSDQualifier_name>
> >           <INSDQualifier_value>alignment with genomic scaffold
> > AE003669</INSDQualifier_value>
> >         </INSDQualifier>
> >         <INSDQualifier>
> >           <INSDQualifier_name>db_xref</INSDQualifier_name>
> >           <INSDQualifier_value>FLYBASE:FBgn0024371</INSDQualifier_value>
> >         </INSDQualifier>
> >       </INSDFeature_quals>
> >     </INSDFeature>
> >     <INSDFeature>
> >       <INSDFeature_key>CDS</INSDFeature_key>
> >       <INSDFeature_location>189..1301</INSDFeature_location>
> >       <INSDFeature_intervals>
> >         <INSDInterval>
> >           <INSDInterval_from>189</INSDInterval_from>
> >           <INSDInterval_to>1301</INSDInterval_to>
> >           <INSDInterval_accession> AY069118.1</INSDInterval_accession>
> >         </INSDInterval>
> >       </INSDFeature_intervals>
> >       <INSDFeature_quals>
> >         <INSDQualifier>
> >           <INSDQualifier_name>gene</INSDQualifier_name>
> >           <INSDQualifier_value>E2f2</INSDQualifier_value>
> >         </INSDQualifier>
> >         <INSDQualifier>
> >           <INSDQualifier_name>note</INSDQualifier_name>
> >           <INSDQualifier_value>Longest ORF</INSDQualifier_value>
> >         </INSDQualifier>
> >         <INSDQualifier>
> >           <INSDQualifier_name>codon_start</INSDQualifier_name>
> >           <INSDQualifier_value>1</INSDQualifier_value>
> >         </INSDQualifier>
> >         <INSDQualifier>
> >           <INSDQualifier_name>transl_table</INSDQualifier_name>
> >           <INSDQualifier_value>1</INSDQualifier_value>
> >         </INSDQualifier>
> >         <INSDQualifier>
> >           <INSDQualifier_name>product</INSDQualifier_name>
> >           <INSDQualifier_value>GH13089p</INSDQualifier_value>
> >         </INSDQualifier>
> >         <INSDQualifier>
> >           <INSDQualifier_name>protein_id</INSDQualifier_name>
> >           <INSDQualifier_value>AAL39263.1</INSDQualifier_value>
> >         </INSDQualifier>
> >         <INSDQualifier>
> >           <INSDQualifier_name>db_xref</INSDQualifier_name>
> >           <INSDQualifier_value>GI:17861572</INSDQualifier_value>
> >         </INSDQualifier>
> >         <INSDQualifier>
> >           <INSDQualifier_name>db_xref</INSDQualifier_name>
> >           <INSDQualifier_value>FLYBASE:FBgn0024371</INSDQualifier_value>
> >         </INSDQualifier>
> >         <INSDQualifier>
> >           <INSDQualifier_name>translation</INSDQualifier_name>
> >
> >
> <INSDQualifier_value>MYKRKTASIVKRDSSAAGTTSSAMMMKVDSAETSVRSQSYESTPVSMDTSPDPPTPIKSPSNSQSQSQPGQQRSVGSLVLLTQKFVDLVKANEGSIDLKAATKILDVQKRRIYDITNVLEGIGLIDKGRHCSLVRWRGGGFNNAKDQENYDLARSRTNHLKMLEDDLDRQLEYAQRNLRYVMQDPSNRSYAYVTRDDLLDIFGDDSVFTIPNYDEEVDIKRNHYELAVSLDNGSAIDIRLVTNQGKSTTNPHDVDGFFDYHRLDTPSPSTSSHSSEDGNAPACAGNVITDEHGYSCNPGMKDEMKLLENELTAKIIFQNYLSGHSLRRFYPDDPNLENPPLLQLNPPQEDFNFALKSDEGICELFDVQCS</INSDQualifier_value>
>
> >         </INSDQualifier>
> >       </INSDFeature_quals>
> >     </INSDFeature>
> >   </INSDSeq_feature-table>
> >
> >
> <INSDSeq_sequence>AAGAATAGAGGGAGAATGAAAAAAATGACATAAATGGCGGAAAGCAAACCTAGCGCCAACATTCGTATTTTCGTTTAATTTTCGCTCCAAAGTGCAATTAATTCCGGCTTCTTGATCGCTGCATATTGAGTGCAGCCACGCAAAGAGTTACAAGGACAGGAGTATAGTCATCGAGTCGATTGCGGACCATGTACAAGCGCAAAACCGCGAGTATTGTTAAAAGAGACAGCTCCGCAGCGGGCACCACCTCCTCGGCTATGATGATGAAGGTGGATTCGGCTGAGACTTCGGTCCGGTCGCAGAGCTACGAGTCTACACCCGTTAGCATGGACACATCACCGGATCCTCCAACGCCAATCAAGTCTCCGTCGAATTCACAATCGCAATCGCAGCCTGGACAACAGCGCTCCGTGGGCTCACTGGTCCTGCTCACACAGAAGTTTGTGGATCTCGTGAAGGCCAACGAAGGATCCATCGACCTGAAAGCGGCAACCAAAATCTTGGACGTACAGAAGCGCCGAATATACGATATTACCAATGTTTTAGAGGGCATTGGACTAATTGATAAGGGCAGACACTGCTCCCTAGTGCGCTGGCGCGGAGGGGGCTTTAACAATGCCAAGGACCAAGAGAACTACGACCTGGCACGTAGCCGGACTAATCATTTGAAAATGTTGGAGGATGACCTAGACAGGCAACTGGAGTATGCACAGCGCAATCTGCGCTACGTTATGCAGGATCCCTCGAATAGGTCGTATGCATATGTGACACGTGATGATCTGCTGGACATCTTTGGAGATGATTCCGTATTCACAATACCTAATTATGACGAGGAAGTAGATATCAAGCGTAATCATTACGAGCTGGCCGTGTCGCTGGACAATGGCAGCGCAATTGACATTCGCCTGGTGACGAACCAAGGAAAGAGTACTACAAATCCGCACGATGTGGATGGGTTCTTTGACTATCACCGTCTGGACACGCCCTCACCCTCGACGTCGTCGCACTCCAGCGAGGATGGTAACGCTCCAGCATGCGCGGGGAACGTGATCACCGACGAGCACGGTTACTCGTGCAATCCCGGGATGAAAGATGAGATGAAACTTTTGGAGAACGAGCTGACGGCCAAGATAATCTTCCAAAATTATCTGTCCGGTCATTCGCTGCGGCGATTTTATCCCGATGATCCGAATCTAGAAAACCCGCCGCTGCTGCAGCTGAATCCTCCGCAGGAAGACTTCAACTTTGCGTTAAAAAGCGACGAAGGTATTTGCGAGCTGTTTGATGTTCAGTGCTCCTAACTGTGGAAGGGGATGTACACCTTAGGACTATAGCTACACTGCAACTGGCCGCGTGCATTGTGCAAATATTTATGATTAGTACAATTTTGACTTTGGATTTCTCTATATCGTCTAGAAATTTTTAATTAGTGTAATACCTTGTAATTTCGCAAATAACAGCAAAACCAATAAATTCGTAAATGCAAAAAAAAAAAAAAAAAA</INSDSeq_sequence>
>
> > </INSDSeq>
> > </INSDSet>
> > ~~~~~~~~~~~~~~~~~~~~~~
> >
> > On 6/8/06, Richard Holland <richard.holland at ebi.ac.uk> wrote:
> >         Yesterday I think I said I was going to add other-seqids but I
> >         forgot to do it, so I did it just now. Try it and see. Use the
> >         new INSDseqFormat.Terms.getOtherSeqIdTerm() term to find them.
> >
> >         cheers,
> >         Richard
> >
> >         On Wed, 2006-06-07 at 19:48 -0400, Seth Johnson wrote:
> >         > Hi Richard,
> >         >
> >         > I still cannot locate the GI number for the main sequence.
> >         > After I parse it with readINSDseqDNA, I then use:
> >         >
> >         >     Note[] myAccs = ((RichAnnotation) rs.getAnnotation())
> >         >             .getProperties(Terms.getAdditionalAccessionTerm());
> >         >
> >         > However, the 'myAccs' appears to be empty.  Am I on the wrong
> >         > track to get to other-seqids???
> >         >
> >         > On 6/6/06, Richard Holland <richard.holland at ebi.ac.uk> wrote:
> >         >         GenBank has a separate line for GI number, so it can be
> >         >         parsed out nicely. INSDseq does not, so you have to rely
> >         >         on the other-seqids tag and hope that one of them is the
> >         >         GI number. However it seems I have not included that tag
> >         >         in the parser, so I will include it. This will make the
> >         >         other-seqids values available through the notes with the
> >         >         term Terms.getAdditionalAccessionTerm(), but
> >         >         getIdentifier() will remain null.
> >         >
> >         >         For your second question, the tutorial makes the mistake
> >         >         in several places of saying getNoteSet(Terms.blahblah()).
> >         >         This was shorthand for:
> >         >
> >         >         rs.getAnnotation().getProperty(Terms.blahblah())
> >         >                 (for single values)
> >         >
> >         >         or
> >         >
> >         >         ((RichAnnotation)rs.getAnnotation()).getProperties(Terms.blahblah())
> >         >                 (for multiple values)
> >         >
> >         >         but never got expanded. Maybe someone can fix that one
> >         >         day... :)
> >         >
> >         >         I'm just updating INSDseq to 1.4 now. The guys next door
> >         >         gave me the details of the changes, and told me that 1.3
> >         >         is actually no longer supported by them after Friday this
> >         >         week! So I'll make it 1.4 only.
> >         >
> >         >         cheers,
> >         >         Richard
> >         >
> >         --
> >         Richard Holland (BioMart Team)
> >         EMBL-EBI
> >         Wellcome Trust Genome Campus
> >         Hinxton
> >         Cambridge CB10 1SD
> >         UNITED KINGDOM
> >         Tel: +44-(0)1223-494416
> >
> >
> >
> >
> > --
> > Best Regards,
> >
> >
> > Seth Johnson
> > Senior Bioinformatics Associate
> >
> > Ph: (202) 470-0900
> > Fx: (775) 251-0358
> --
> Richard Holland (BioMart Team)
> EMBL-EBI
> Wellcome Trust Genome Campus
> Hinxton
> Cambridge CB10 1SD
> UNITED KINGDOM
> Tel: +44-(0)1223-494416
>
>


-- 
Best Regards,


Seth Johnson
Senior Bioinformatics Associate

Ph: (202) 470-0900
Fx: (775) 251-0358





