[Biojava-l] bioJava & swissProt parsering
Matthew Pocock
matthew_pocock@yahoo.co.uk
Fri, 17 May 2002 16:35:41 +0100
Yes.
It works for more than 90% of the entries in Swiss-Prot release 40.
SeqIOTools.readSwissprot() should do the trick. You will get back a
normal SequenceIterator object over all the swissprot entries in the
file. The entries it fails on tend to have a domain at coordinate 0,
which is an initiation site. The feature creation logic in BioJava sees
this as a feature with illegal coordinates and barfs. I keep meaning to
find a work-around for this, but never get the time. The options are:
1) throw an informative exception
2) ignore the 'bad' features
3) put feature templates for 'bad' features in a well-known annotation
property
4) relax the feature creation checks
My vote is for 3. It keeps all of the valid data available as normal
features, while still allowing access to the rest (to facilitate
round-tripping).
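
To make option 3 concrete, here is a rough sketch in plain Java. It is not
BioJava code and does not use the BioJava API: the class
FeatureTemplateSketch, the Template and Result types, and the property key
"swissprot.badFeatures" are all hypothetical names made up for illustration.
It also assumes FT lines with whitespace-separated fields and plain integer
coordinates, and the INIT_MET line in main() is an invented example of a
feature at coordinate 0.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of option 3; none of these names come from BioJava.
class FeatureTemplateSketch {

    // Assumed name for the "well-known annotation property".
    static final String BAD_FEATURES_KEY = "swissprot.badFeatures";

    // Minimal stand-in for a feature template with 1-based inclusive coordinates.
    static final class Template {
        final String type;
        final int start;
        final int end;
        final String description;

        Template(String type, int start, int end, String description) {
            this.type = type;
            this.start = start;
            this.end = end;
            this.description = description;
        }

        // Legal means the feature lies entirely within 1..seqLength.
        boolean isLegal(int seqLength) {
            return start >= 1 && start <= end && end <= seqLength;
        }
    }

    // Valid features plus an annotation map that holds the rejected templates.
    static final class Result {
        final List<Template> features = new ArrayList<>();
        final Map<String, Object> annotation = new HashMap<>();
    }

    // Parses "FT TYPE START END [DESCRIPTION]"; assumes integer coordinates only.
    static Template parse(String line) {
        String[] f = line.trim().split("\\s+", 5);
        if (f.length < 4 || !f[0].equals("FT")) {
            throw new IllegalArgumentException("Not a feature line: " + line);
        }
        String description = f.length == 5 ? f[4] : "";
        return new Template(f[1], Integer.parseInt(f[2]), Integer.parseInt(f[3]), description);
    }

    // Keeps legal features as features; parks the rest under BAD_FEATURES_KEY.
    static Result split(List<String> ftLines, int seqLength) {
        Result result = new Result();
        List<Template> bad = new ArrayList<>();
        for (String line : ftLines) {
            Template t = parse(line);
            if (t.isLegal(seqLength)) {
                result.features.add(t);
            } else {
                bad.add(t);
            }
        }
        result.annotation.put(BAD_FEATURES_KEY, bad);
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "FT INIT_MET 0 0", // invented line with a coordinate of 0
            "FT DOMAIN 77 88 ASP/GLU-RICH (ACIDIC).",
            "FT BINDING 858 858 UBIQUITIN (BY SIMILARITY).");
        Result r = split(lines, 889);
        List<?> bad = (List<?>) r.annotation.get(BAD_FEATURES_KEY);
        System.out.println("valid=" + r.features.size() + " bad=" + bad.size());
    }
}
```

The point of the design is that nothing is thrown away: a writer that wants
to round-trip the entry can look up the property and emit the parked
templates alongside the normal features.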
Matthew
wanner.de@pg.com wrote:
> Hi,
>
> Has anyone used BioJava to parse a Swiss-Prot record? I've attached an
> example below. Its format looks quite different from either GenBank or
> RefSeq.
>
> thx,
> Dave
>
>
> ID 100K_RAT STANDARD; PRT; 889 AA.
> AC Q62671;
> DT 01-NOV-1997 (Rel. 35, Created)
> DT 01-NOV-1997 (Rel. 35, Last sequence update)
> DT 16-OCT-2001 (Rel. 40, Last annotation update)
> DE 100 kDa protein (ENZYME: 6.3.2.-).
> OS Rattus norvegicus (Rat).
> OC Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
> OC Mammalia; Eutheria; Rodentia; Sciurognathi; Muridae; Murinae; Rattus.
> OX NCBI_TaxID=10116;
> RN [1]
> RP SEQUENCE FROM N.A.
> RC STRAIN=WISTAR; TISSUE=Testis;
> RX MEDLINE=92253337: PubMed=1533713;
> RA Mueller D., Rehbein M., Baumeister H., Richter D.;
> RT "Molecular characterization of a novel rat protein structurally
> RT related to poly(A) binding proteins and the 70K protein of the U1
> RT small nuclear ribonucleoprotein particle (snRNP).";
> RL Nucleic Acids Res. 20:1471-1475(1992).
> RN [2]
> RP ERRATUM.
> RA Mueller D., Rehbein M., Baumeister H., Richter D.;
> RL Nucleic Acids Res. 20:2624-2624(1992).
> CC -!- FUNCTION: E3 UBIQUITIN-PROTEIN LIGASE WHICH ACCEPTS UBIQUITIN FROM
> CC AN E2 UBIQUITIN-CONJUGATING ENZYME IN THE FORM OF A THIOESTER AND
> CC THEN DIRECTLY TRANSFERS THE UBIQUITIN TO TARGETED SUBSTRATES (BY
> CC SIMILARITY). THIS PROTEIN MAY BE INVOLVED IN MATURATION AND/OR
> CC POST-TRANSCRIPTIONAL REGULATION OF MRNA.
> CC -!- TISSUE SPECIFICITY: HIGHEST LEVELS FOUND IN TESTIS. ALSO PRESENT
> CC IN LIVER, KIDNEY, LUNG AND BRAIN.
> CC -!- DEVELOPMENTAL STAGE: IN EARLY POST-NATAL LIFE, EXPRESSION IN
> CC THE TESTIS INCREASES TO REACH A MAXIMUM AROUND DAY 28.
> CC -!- MISCELLANEOUS: A CYSTEINE RESIDUE IS REQUIRED FOR
> CC UBIQUITIN-THIOLESTER FORMATION.
> CC -!- SIMILARITY: A CENTRAL REGION (AA 485-514) IS SIMILAR TO THE
> CC C-TERMINAL DOMAINS OF MAMMALIAN AND YEAST POLY (A) RNA BINDING
> CC PROTEINS (PABP).
> CC -!- SIMILARITY: CONTAINS MIXED-CHARGE DOMAINS SIMILAR TO RNA-BINDING
> CC PROTEINS.
> CC -!- SIMILARITY: CONTAINS 1 HECT-TYPE E3 UBIQUITIN-PROTEIN LIGASE
> CC DOMAIN.
> CC --------------------------------------------------------------------------
> CC This SWISS-PROT entry is copyright. It is produced through a collaboration
> CC between the Swiss Institute of Bioinformatics and the EMBL outstation -
> CC the European Bioinformatics Institute. There are no restrictions on its
> CC use by non-profit institutions as long as its content is in no way
> CC modified and this statement is not removed. Usage by and for commercial
> CC entities requires a license agreement (See http://www.isb-sib.ch/announce/
> CC or send an email to license@isb-sib.ch).
> CC --------------------------------------------------------------------------
> DR EMBL: X64411
> DR InterPro: IPR000569
> DR IPR002004
> DR Pfam: PF00632
> DR PF00658
> DR SMART: SM00119
> DR SM00517
> DR PROSITE: PS50237
> KW Ubiquitin conjugation; Ligase.
> FT DOMAIN 77 88 ASP/GLU-RICH (ACIDIC).
> FT DOMAIN 127 150 PRO-RICH.
> FT DOMAIN 420 439 ARG/GLU-RICH (MIXED CHARGE).
> FT DOMAIN 448 457 ARG/ASP-RICH (MIXED CHARGE).
> FT DOMAIN 485 514 PABP-LIKE.
> FT DOMAIN 579 590 ASP/GLU-RICH (ACIDIC).
> FT DOMAIN 786 889 HECT.
> FT DOMAIN 827 847 PRO-RICH.
> FT BINDING 858 858 UBIQUITIN (BY SIMILARITY).
> SQ SEQUENCE 889 AA; 100368 MW; ABD7E3CD53961B78 CRC64;
> MMSARGDFLN YALSLMRSHN DEHSDVLPVL DVCSLKHVAY VFQALIYWIK AMNQQTTLDT
> PQLERKRTRE LLELGIDNED SEHENDDDTS QSATLNDKDD ESLPAETGQN HPFFRRSDSM
> TFLGCIPPNP FEVPLAEAIP LADQPHLLQP NARKEDLFGR PSQGLYSSSA GSGKCLVEVT
> MDRNCLEVLP TKMSYAANLK NVMNMQNRQK KAGEDQSMLA EEADSSKPGP SAHDVAAQLK
> SSLLAEIGLT ESEGPPLTSF RPQCSFMGMV ISHDMLLGRW RLSLELFGRV FMEDVGAEPG
> SILTELGGFE VKESKFRREM EKLRNQQSRD LSLEVDRDRD LLIQQTMRQL NNHFGRRCAT
> TPMAVHRVKV TFKDEPGEGS GVARSFYTAI AQAFLSNEKL PNLDCIQNAN KGTHTSLMQR
> LRNRGERDRE REREREMRRS SGLRAGSRRD RDRDFRRQLS IDTRPFRPAS EGNPSDDPDP
> LPAHRQALGE RLYPRVQAMQ PAFASKITGM LLELSPAQLL LLLASEDSLR ARVEEAMELI
> VAHGRENGAD SILDLGLLDS SEKVQENRKR HGSSRSVVDM DLDDTDDGDD NAPLFYQPGK
> RGFYTPRPGK NTEARLNCFR NIGRILGLCL LQNELCPITL NRHVIKVLLG RKVNWHDFAF
> FDPVMYESLR QLILASQSSD ADAVFSAMDL AFAVDLCKEE GGGQVELIPN GVNIPVTPQN
> VYEYVRKYAE HRMLVVAEQP LHAMRKGLLD VLPKNSLEDL TAEDFRLLVN GCGEVNVQML
> ISFTSFNDES GENAEKLLQF KRWFWSIVER MSMTERQDLV YFWTSSPSLP ASEEGFQPMP
> SITIRPPDDQ HLPTANTCIS RLYVPLYSSK QILKQKLLLA IKTKNFGFV
> //