[Biojava-l] Parsing INSDseq Sequences (1.3 & 1.4)

Richard Holland richard.holland at ebi.ac.uk
Mon Jun 12 08:37:23 UTC 2006


There was a typo in the code, my fault. Please try again!
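
While you retest: the ids you are after sit in the INSDSeq_other-seqids block of the file you posted, so it is easy to cross-check what the parser ought to hand back. Here is a minimal sketch that uses only the JDK's DOM parser, not BioJava. The class name OtherSeqIds and the helpers extract() and findGi() are made up for illustration, and treating an id that starts with "gi|" as the GI number is an assumption based on your sample record.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of BioJava: reads other-seqids straight from INSDSeq XML.
final class OtherSeqIds {

    /** Returns the trimmed text of every INSDSeqid element, in document order. */
    static List<String> extract(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // Resolve the DTD named in the DOCTYPE to an empty stream so nothing is fetched from NCBI.
        builder.setEntityResolver((publicId, systemId) -> new InputSource(new StringReader("")));
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        NodeList nodes = doc.getElementsByTagName("INSDSeqid");
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            ids.add(nodes.item(i).getTextContent().trim());
        }
        return ids;
    }

    /** Assumption: the GI number is the id written as "gi|<number>". */
    static Optional<String> findGi(List<String> ids) {
        for (String id : ids) {
            if (id.startsWith("gi|")) {
                return Optional.of(id.substring("gi|".length()));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) throws Exception {
        // Cut-down copy of the posted record; only the other-seqids block is kept.
        String xml = "<?xml version=\"1.0\"?>"
                + "<!DOCTYPE INSDSet PUBLIC \"-//NCBI//INSD INSDSeq/EN\" "
                + "\"http://www.ncbi.nlm.nih.gov/dtd/INSD_INSDSeq.dtd\">"
                + "<INSDSet><INSDSeq>"
                + "<INSDSeq_other-seqids>"
                + "<INSDSeqid>gb|AY069118.1|</INSDSeqid>"
                + "<INSDSeqid>gi|17861571</INSDSeqid> "
                + "</INSDSeq_other-seqids>"
                + "</INSDSeq></INSDSet>";
        List<String> ids = extract(xml);
        System.out.println("other-seqids: " + ids);
        System.out.println("GI: " + findGi(ids).orElse("not present"));
    }
}
```

For your record this gives the two ids gb|AY069118.1| and gi|17861571, so those are the two values the Note array should contain once the parser fix is in.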



On Thu, 2006-06-08 at 10:23 -0400, Seth Johnson wrote:
> I'm still getting an empty array back from this:
> 
> Note[] myAccs = ((RichAnnotation) rs.getAnnotation())
>         .getProperties(INSDseqFormat.Terms.getOtherSeqIdTerm());
> 
> Here's the file that I'm parsing:
> ~~~~~~~~~~~~~~~~~~~~~~ 
> <?xml version="1.0"?>
> <!DOCTYPE INSDSet PUBLIC "-//NCBI//INSD INSDSeq/EN"
> "http://www.ncbi.nlm.nih.gov/dtd/INSD_INSDSeq.dtd">
> <INSDSet>
> <INSDSeq>
>   <INSDSeq_locus>AY069118</INSDSeq_locus>
>   <INSDSeq_length>1502</INSDSeq_length>
>   <INSDSeq_strandedness>single</INSDSeq_strandedness> 
>   <INSDSeq_moltype>mRNA</INSDSeq_moltype>
>   <INSDSeq_topology>linear</INSDSeq_topology>
>   <INSDSeq_division>INV</INSDSeq_division>
>   <INSDSeq_update-date>17-DEC-2001</INSDSeq_update-date> 
>   <INSDSeq_create-date>15-DEC-2001</INSDSeq_create-date>
>   <INSDSeq_definition>Drosophila melanogaster GH13089 full length
> cDNA</INSDSeq_definition>
>   <INSDSeq_primary-accession>AY069118</INSDSeq_primary-accession> 
>   <INSDSeq_accession-version>AY069118.1</INSDSeq_accession-version>
>   <INSDSeq_other-seqids>
>     <INSDSeqid>gb|AY069118.1|</INSDSeqid>
>     <INSDSeqid>gi|17861571</INSDSeqid> 
>   </INSDSeq_other-seqids>
>   <INSDSeq_keywords>
>     <INSDKeyword>FLI_CDNA</INSDKeyword>
>   </INSDSeq_keywords>
>   <INSDSeq_source>Drosophila melanogaster (fruit
> fly)</INSDSeq_source> 
>   <INSDSeq_organism>Drosophila melanogaster</INSDSeq_organism>
>   <INSDSeq_taxonomy>Eukaryota; Metazoa; Arthropoda; Hexapoda; Insecta;
> Pterygota; Neoptera; Endopterygota; Diptera; Brachycera; Muscomorpha;
> Ephydroidea; Drosophilidae; Drosophila</INSDSeq_taxonomy> 
>   <INSDSeq_references>
>     <INSDReference>
>       <INSDReference_reference>1 (bases 1 to
> 1502)</INSDReference_reference>
>       <INSDReference_position>1..1502</INSDReference_position> 
>       <INSDReference_authors>
>         <INSDAuthor>Stapleton,M.</INSDAuthor>
>         <INSDAuthor>Brokstein,P.</INSDAuthor>
>         <INSDAuthor>Hong,L.</INSDAuthor>
>         <INSDAuthor>Agbayani,A.</INSDAuthor>
>         <INSDAuthor>Carlson,J.</INSDAuthor>
>         <INSDAuthor>Champe,M.</INSDAuthor>
>         <INSDAuthor>Chavez,C.</INSDAuthor> 
>         <INSDAuthor>Dorsett,V.</INSDAuthor>
>         <INSDAuthor>Farfan,D.</INSDAuthor>
>         <INSDAuthor>Frise,E.</INSDAuthor>
>         <INSDAuthor>George,R.</INSDAuthor> 
>         <INSDAuthor>Gonzalez,M.</INSDAuthor>
>         <INSDAuthor>Guarin,H.</INSDAuthor>
>         <INSDAuthor>Li,P.</INSDAuthor>
>         <INSDAuthor>Liao,G.</INSDAuthor> 
>         <INSDAuthor>Miranda,A.</INSDAuthor>
>         <INSDAuthor>Mungall,C.J.</INSDAuthor>
>         <INSDAuthor>Nunoo,J.</INSDAuthor>
>         <INSDAuthor>Pacleb,J.</INSDAuthor> 
>         <INSDAuthor>Paragas,V.</INSDAuthor>
>         <INSDAuthor>Park,S.</INSDAuthor>
>         <INSDAuthor>Phouanenavong,S.</INSDAuthor>
>         <INSDAuthor>Wan,K.</INSDAuthor> 
>         <INSDAuthor>Yu,C.</INSDAuthor>
>         <INSDAuthor>Lewis,S.E.</INSDAuthor>
>         <INSDAuthor>Rubin,G.M.</INSDAuthor>
>         <INSDAuthor>Celniker,S.</INSDAuthor> 
>       </INSDReference_authors>
>       <INSDReference_title>Direct Submission</INSDReference_title>
>       <INSDReference_journal>Submitted (10-DEC-2001) Berkeley
> Drosophila Genome Project, Lawrence Berkeley National Laboratory, One
> Cyclotron Road, Berkeley, CA 94720, USA</INSDReference_journal> 
>     </INSDReference>
>   </INSDSeq_references>
>   <INSDSeq_comment>Sequence submitted by: Berkeley Drosophila Genome
> Project Lawrence Berkeley National Laboratory Berkeley, CA 94720 This
> clone was sequenced as part of a high-throughput process to sequence
> clones from Drosophila Gene Collection 1 (Rubin et al., Science 2000).
> The sequence has been subjected to integrity checks for sequence
> accuracy, presence of a polyA tail and contiguity within 100 kb in the
> genome. Thus we believe the sequence to reflect accurately this
> particular cDNA clone. However, there are artifacts associated with
> the generation of cDNA clones that may have not been detected in our
> initial analyses such as internal priming, priming from contaminating
> genomic DNA, retained introns due to reverse transcription of
> unspliced precursor RNAs, and reverse transcriptase errors that result
> in single base changes. For further information about this sequence,
> including its location and relationship to other sequences, please
> visit our Web site ( http://fruitfly.berkeley.edu) or send email to
> cdna at fruitfly.berkeley.edu.</INSDSeq_comment>
>   <INSDSeq_feature-table>
>     <INSDFeature>
>       <INSDFeature_key>source</INSDFeature_key> 
>       <INSDFeature_location>1..1502</INSDFeature_location>
>       <INSDFeature_intervals>
>         <INSDInterval>
>           <INSDInterval_from>1</INSDInterval_from>
>           <INSDInterval_to>1502</INSDInterval_to> 
>           <INSDInterval_accession>AY069118.1</INSDInterval_accession>
>         </INSDInterval>
>       </INSDFeature_intervals>
>       <INSDFeature_quals>
>         <INSDQualifier> 
>           <INSDQualifier_name>organism</INSDQualifier_name>
>           <INSDQualifier_value>Drosophila
> melanogaster</INSDQualifier_value>
>         </INSDQualifier>
>         <INSDQualifier> 
>           <INSDQualifier_name>mol_type</INSDQualifier_name>
>           <INSDQualifier_value>mRNA</INSDQualifier_value>
>         </INSDQualifier>
>         <INSDQualifier>
>           <INSDQualifier_name>strain</INSDQualifier_name> 
>           <INSDQualifier_value>y; cn bw sp</INSDQualifier_value>
>         </INSDQualifier>
>         <INSDQualifier>
>           <INSDQualifier_name>db_xref</INSDQualifier_name> 
>           <INSDQualifier_value>taxon:7227</INSDQualifier_value>
>         </INSDQualifier>
>         <INSDQualifier>
>           <INSDQualifier_name>map</INSDQualifier_name>
>           <INSDQualifier_value>39B3-39B3</INSDQualifier_value> 
>         </INSDQualifier>
>       </INSDFeature_quals>
>     </INSDFeature>
>     <INSDFeature>
>       <INSDFeature_key>gene</INSDFeature_key>
>       <INSDFeature_location>1..1502</INSDFeature_location> 
>       <INSDFeature_intervals>
>         <INSDInterval>
>           <INSDInterval_from>1</INSDInterval_from>
>           <INSDInterval_to>1502</INSDInterval_to>
>           <INSDInterval_accession> AY069118.1</INSDInterval_accession>
>         </INSDInterval>
>       </INSDFeature_intervals>
>       <INSDFeature_quals>
>         <INSDQualifier>
>           <INSDQualifier_name>gene</INSDQualifier_name> 
>           <INSDQualifier_value>E2f2</INSDQualifier_value>
>         </INSDQualifier>
>         <INSDQualifier>
>           <INSDQualifier_name>note</INSDQualifier_name>
>           <INSDQualifier_value>alignment with genomic scaffold
> AE003669</INSDQualifier_value> 
>         </INSDQualifier>
>         <INSDQualifier>
>           <INSDQualifier_name>db_xref</INSDQualifier_name>
> 
> <INSDQualifier_value>FLYBASE:FBgn0024371</INSDQualifier_value> 
>         </INSDQualifier>
>       </INSDFeature_quals>
>     </INSDFeature>
>     <INSDFeature>
>       <INSDFeature_key>CDS</INSDFeature_key>
>       <INSDFeature_location>189..1301</INSDFeature_location> 
>       <INSDFeature_intervals>
>         <INSDInterval>
>           <INSDInterval_from>189</INSDInterval_from>
>           <INSDInterval_to>1301</INSDInterval_to>
>           <INSDInterval_accession> AY069118.1</INSDInterval_accession>
>         </INSDInterval>
>       </INSDFeature_intervals>
>       <INSDFeature_quals>
>         <INSDQualifier>
>           <INSDQualifier_name>gene</INSDQualifier_name> 
>           <INSDQualifier_value>E2f2</INSDQualifier_value>
>         </INSDQualifier>
>         <INSDQualifier>
>           <INSDQualifier_name>note</INSDQualifier_name>
>           <INSDQualifier_value>Longest ORF</INSDQualifier_value> 
>         </INSDQualifier>
>         <INSDQualifier>
>           <INSDQualifier_name>codon_start</INSDQualifier_name>
>           <INSDQualifier_value>1</INSDQualifier_value>
>         </INSDQualifier> 
>         <INSDQualifier>
>           <INSDQualifier_name>transl_table</INSDQualifier_name>
>           <INSDQualifier_value>1</INSDQualifier_value>
>         </INSDQualifier>
>         <INSDQualifier> 
>           <INSDQualifier_name>product</INSDQualifier_name>
>           <INSDQualifier_value>GH13089p</INSDQualifier_value>
>         </INSDQualifier>
>         <INSDQualifier>
>           <INSDQualifier_name>protein_id</INSDQualifier_name>
>           <INSDQualifier_value>AAL39263.1</INSDQualifier_value>
>         </INSDQualifier>
>         <INSDQualifier>
>           <INSDQualifier_name>db_xref</INSDQualifier_name>
>           <INSDQualifier_value>GI:17861572</INSDQualifier_value>
>         </INSDQualifier>
>         <INSDQualifier>
>           <INSDQualifier_name>db_xref</INSDQualifier_name>
> 
> <INSDQualifier_value>FLYBASE:FBgn0024371</INSDQualifier_value>
>         </INSDQualifier>
>         <INSDQualifier> 
>           <INSDQualifier_name>translation</INSDQualifier_name>
> 
> <INSDQualifier_value>MYKRKTASIVKRDSSAAGTTSSAMMMKVDSAETSVRSQSYESTPVSMDTSPDPPTPIKSPSNSQSQSQPGQQRSVGSLVLLTQKFVDLVKANEGSIDLKAATKILDVQKRRIYDITNVLEGIGLIDKGRHCSLVRWRGGGFNNAKDQENYDLARSRTNHLKMLEDDLDRQLEYAQRNLRYVMQDPSNRSYAYVTRDDLLDIFGDDSVFTIPNYDEEVDIKRNHYELAVSLDNGSAIDIRLVTNQGKSTTNPHDVDGFFDYHRLDTPSPSTSSHSSEDGNAPACAGNVITDEHGYSCNPGMKDEMKLLENELTAKIIFQNYLSGHSLRRFYPDDPNLENPPLLQLNPPQEDFNFALKSDEGICELFDVQCS</INSDQualifier_value> 
>         </INSDQualifier>
>       </INSDFeature_quals>
>     </INSDFeature>
>   </INSDSeq_feature-table>
> 
> <INSDSeq_sequence>AAGAATAGAGGGAGAATGAAAAAAATGACATAAATGGCGGAAAGCAAACCTAGCGCCAACATTCGTATTTTCGTTTAATTTTCGCTCCAAAGTGCAATTAATTCCGGCTTCTTGATCGCTGCATATTGAGTGCAGCCACGCAAAGAGTTACAAGGACAGGAGTATAGTCATCGAGTCGATTGCGGACCATGTACAAGCGCAAAACCGCGAGTATTGTTAAAAGAGACAGCTCCGCAGCGGGCACCACCTCCTCGGCTATGATGATGAAGGTGGATTCGGCTGAGACTTCGGTCCGGTCGCAGAGCTACGAGTCTACACCCGTTAGCATGGACACATCACCGGATCCTCCAACGCCAATCAAGTCTCCGTCGAATTCACAATCGCAATCGCAGCCTGGACAACAGCGCTCCGTGGGCTCACTGGTCCTGCTCACACAGAAGTTTGTGGATCTCGTGAAGGCCAACGAAGGATCCATCGACCTGAAAGCGGCAACCAAAATCTTGGACGTACAGAAGCGCCGAATATACGATATTACCAATGTTTTAGAGGGCATTGGACTAATTGATAAGGGCAGACACTGCTCCCTAGTGCGCTGGCGCGGAGGGGGCTTTAACAATGCCAAGGACCAAGAGAACTACGACCTGGCACGTAGCCGGACTAATCATTTGAAAATGTTGGAGGATGACCTAGACAGGCAACTGGAGTATGCACAGCGCAATCTGCGCTACGTTATGCAGGATCCCTCGAATAGGTCGTATGCATATGTGACACGTGATGATCTGCTGGACATCTTTGGAGATGATTCCGTATTCACAATACCTAATTATGACGAGGAAGTAGATATCAAGCGTAATCATTACGAGCTGGCCGTGTCGCTGGACAATGGCAGCGCAATTGACATTCGCCTGGTGACGAACCAAGGAAAGAGTACTACAAATCCGCACGATGTGGATGGGTTCTTTGACTATCACCGTCTGGACACGCCCTCACCCTCGACGTCGTCGCACTCCAGCGAGGATGGTAACGCTCCAGCATGCGCGGGGAACGTGATCACCGACGAGCACGGTTACTCGTGCAATCCCGGGATGAAAGATGAGATGAAACTTTTGGAGAACGAGCTGACGGCCAAGATAATCTTCCAAAATTATCTGTCCGGTCATTCGCTGCGGCGATTTTATCCCGATGATCCGAATCTAGAAAACCCGCCGCTGCTGCAGCTGAATCCTCCGCAGGAAGACTTCAACTTTGCGTTAAAAAGCGACGAAGGTATTTGCGAGCTGTTTGATGTTCAGTGCTCCTAACTGTGGAAGGGGATGTACACCTTAGGACTATAGCTACACTGCAACTGGCCGCGTGCATTGTGCAAATATTTATGATTAGTACAATTTTGACTTTGGATTTCTCTATATCGTCTAGAAATTTTTAATTAGTGTAATACCTTGTAATTTCGCAAATAACAGCAAAACCAATAAATTCGTAAATGCAAAAAAAAAAAAAAAAAA</INSDSeq_sequence> 
> </INSDSeq>
> </INSDSet>
> ~~~~~~~~~~~~~~~~~~~~~~
> 
> On 6/8/06, Richard Holland <richard.holland at ebi.ac.uk> wrote:
>         Yesterday I think I said I was going to add other-seqids but I
>         forgot to do it, so I did it just now. Try it and see. Use the
>         new INSDseqFormat.Terms.getOtherSeqIdTerm() term to find them.
>         
>         cheers,
>         Richard
>         
>         On Wed, 2006-06-07 at 19:48 -0400, Seth Johnson wrote:
>         > Hi Richard, 
>         >
>         > I still cannot locate the GI number for the main sequence.
>         > After I parse it with readINSDseqDNA, I then use:
>         >
>         >     Note[] myAccs = ((RichAnnotation) rs.getAnnotation())
>         >             .getProperties(Terms.getAdditionalAccessionTerm());
>         >
>         > However, 'myAccs' appears to be empty. Am I on the wrong
>         > track to get to other-seqids?
>         >
>         > On 6/6/06, Richard Holland <richard.holland at ebi.ac.uk> wrote:
>         >         GenBank has a separate line for GI number, so it can
>         >         be parsed out nicely. INSDseq does not, so you have to
>         >         rely on the other-seqids tag and hope that one of them
>         >         is the GI number. However it seems I have not included
>         >         that tag in the parser, so I will include it. This will
>         >         make the other-seqids values available through the
>         >         notes with the term Terms.getAdditionalAccessionTerm(),
>         >         but getIdentifier() will remain null.
>         >
>         >         For your second question, the tutorial makes the
>         >         mistake in several places of saying
>         >         getNoteSet(Terms.blahblah()). This was shorthand for:
>         >
>         >         rs.getAnnotation().getProperty(Terms.blahblah())
>         >                 (for single values)
>         >
>         >         or
>         >
>         >         ((RichAnnotation)rs.getAnnotation()).getProperties(Terms.blahblah())
>         >                 (for multiple values)
>         >
>         >         but never got expanded. Maybe someone can fix that one
>         >         day... :)
>         >
>         >         I'm just updating INSDseq to 1.4 now. The guys next
>         >         door gave me the details of the changes, and told me
>         >         that 1.3 is actually no longer supported by them after
>         >         Friday this week! So I'll make it 1.4 only.
>         >
>         >         cheers,
>         >         Richard
>         >
>         --
>         Richard Holland (BioMart Team)
>         EMBL-EBI
>         Wellcome Trust Genome Campus
>         Hinxton
>         Cambridge CB10 1SD
>         UNITED KINGDOM
>         Tel: +44-(0)1223-494416 
>         
> 
> 
> 
> -- 
> Best Regards,
> 
> 
> Seth Johnson
> Senior Bioinformatics Associate
> 
> Ph: (202) 470-0900
> Fx: (775) 251-0358
-- 
Richard Holland (BioMart Team)
EMBL-EBI
Wellcome Trust Genome Campus
Hinxton
Cambridge CB10 1SD
UNITED KINGDOM
Tel: +44-(0)1223-494416




