[Biojava-l] BioJava & SwissProt parsing
wanner.de@pg.com
Fri, 17 May 2002 10:56:01 -0400
Hi,
Has anyone used BioJava to parse a SwissProt record? I've attached an example
below; its format looks quite different from
either GenBank or RefSeq.
thx,
Dave
ID 100K_RAT STANDARD; PRT; 889 AA.
AC Q62671;
DT 01-NOV-1997 (Rel. 35, Created)
DT 01-NOV-1997 (Rel. 35, Last sequence update)
DT 16-OCT-2001 (Rel. 40, Last annotation update)
DE 100 kDa protein (ENZYME: 6.3.2.-).
OS Rattus norvegicus (Rat).
OC Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC Mammalia; Eutheria; Rodentia; Sciurognathi; Muridae; Murinae; Rattus.
OX NCBI_TaxID=10116;
RN [1]
RP SEQUENCE FROM N.A.
RC STRAIN=WISTAR; TISSUE=Testis;
RX MEDLINE=92253337: PubMed=1533713;
RA Mueller D., Rehbein M., Baumeister H., Richter D.;
RT "Molecular characterization of a novel rat protein structurally
RT related to poly(A) binding proteins and the 70K protein of the U1
RT small nuclear ribonucleoprotein particle (snRNP).";
RL Nucleic Acids Res. 20:1471-1475(1992).
RN [2]
RP ERRATUM.
RA Mueller D., Rehbein M., Baumeister H., Richter D.;
RL Nucleic Acids Res. 20:2624-2624(1992).
CC -!- FUNCTION: E3 UBIQUITIN-PROTEIN LIGASE WHICH ACCEPTS UBIQUITIN FROM
CC AN E2 UBIQUITIN-CONJUGATING ENZYME IN THE FORM OF A THIOESTER AND
CC THEN DIRECTLY TRANSFERS THE UBIQUITIN TO TARGETED SUBSTRATES (BY
CC SIMILARITY). THIS PROTEIN MAY BE INVOLVED IN MATURATION AND/OR
CC POST-TRANSCRIPTIONAL REGULATION OF MRNA.
CC -!- TISSUE SPECIFICITY: HIGHEST LEVELS FOUND IN TESTIS. ALSO PRESENT
CC IN LIVER, KIDNEY, LUNG AND BRAIN.
CC -!- DEVELOPMENTAL STAGE: IN EARLY POST-NATAL LIFE, EXPRESSION IN
CC THE TESTIS INCREASES TO REACH A MAXIMUM AROUND DAY 28.
CC -!- MISCELLANEOUS: A CYSTEINE RESIDUE IS REQUIRED FOR
CC UBIQUITIN-THIOLESTER FORMATION.
CC -!- SIMILARITY: A CENTRAL REGION (AA 485-514) IS SIMILAR TO THE
CC C-TERMINAL DOMAINS OF MAMMALIAN AND YEAST POLY (A) RNA BINDING
CC PROTEINS (PABP).
CC -!- SIMILARITY: CONTAINS MIXED-CHARGE DOMAINS SIMILAR TO RNA-BINDING
CC PROTEINS.
CC -!- SIMILARITY: CONTAINS 1 HECT-TYPE E3 UBIQUITIN-PROTEIN LIGASE
CC DOMAIN.
CC --------------------------------------------------------------------------
CC This SWISS-PROT entry is copyright. It is produced through a collaboration
CC between the Swiss Institute of Bioinformatics and the EMBL outstation -
CC the European Bioinformatics Institute. There are no restrictions on its
CC use by non-profit institutions as long as its content is in no way
CC modified and this statement is not removed. Usage by and for commercial
CC entities requires a license agreement (See http://www.isb-sib.ch/announce/
CC or send an email to license@isb-sib.ch).
CC --------------------------------------------------------------------------
DR EMBL: X64411
DR InterPro: IPR000569
DR IPR002004
DR Pfam: PF00632
DR PF00658
DR SMART: SM00119
DR SM00517
DR PROSITE: PS50237
KW Ubiquitin conjugation; Ligase.
FT DOMAIN 77 88 ASP/GLU-RICH (ACIDIC).
FT DOMAIN 127 150 PRO-RICH.
FT DOMAIN 420 439 ARG/GLU-RICH (MIXED CHARGE).
FT DOMAIN 448 457 ARG/ASP-RICH (MIXED CHARGE).
FT DOMAIN 485 514 PABP-LIKE.
FT DOMAIN 579 590 ASP/GLU-RICH (ACIDIC).
FT DOMAIN 786 889 HECT.
FT DOMAIN 827 847 PRO-RICH.
FT BINDING 858 858 UBIQUITIN (BY SIMILARITY).
SQ SEQUENCE 889 AA; 100368 MW; ABD7E3CD53961B78 CRC64;
MMSARGDFLN YALSLMRSHN DEHSDVLPVL DVCSLKHVAY VFQALIYWIK AMNQQTTLDT
PQLERKRTRE LLELGIDNED SEHENDDDTS QSATLNDKDD ESLPAETGQN HPFFRRSDSM
TFLGCIPPNP FEVPLAEAIP LADQPHLLQP NARKEDLFGR PSQGLYSSSA GSGKCLVEVT
MDRNCLEVLP TKMSYAANLK NVMNMQNRQK KAGEDQSMLA EEADSSKPGP SAHDVAAQLK
SSLLAEIGLT ESEGPPLTSF RPQCSFMGMV ISHDMLLGRW RLSLELFGRV FMEDVGAEPG
SILTELGGFE VKESKFRREM EKLRNQQSRD LSLEVDRDRD LLIQQTMRQL NNHFGRRCAT
TPMAVHRVKV TFKDEPGEGS GVARSFYTAI AQAFLSNEKL PNLDCIQNAN KGTHTSLMQR
LRNRGERDRE REREREMRRS SGLRAGSRRD RDRDFRRQLS IDTRPFRPAS EGNPSDDPDP
LPAHRQALGE RLYPRVQAMQ PAFASKITGM LLELSPAQLL LLLASEDSLR ARVEEAMELI
VAHGRENGAD SILDLGLLDS SEKVQENRKR HGSSRSVVDM DLDDTDDGDD NAPLFYQPGK
RGFYTPRPGK NTEARLNCFR NIGRILGLCL LQNELCPITL NRHVIKVLLG RKVNWHDFAF
FDPVMYESLR QLILASQSSD ADAVFSAMDL AFAVDLCKEE GGGQVELIPN GVNIPVTPQN
VYEYVRKYAE HRMLVVAEQP LHAMRKGLLD VLPKNSLEDL TAEDFRLLVN GCGEVNVQML
ISFTSFNDES GENAEKLLQF KRWFWSIVER MSMTERQDLV YFWTSSPSLP ASEEGFQPMP
SITIRPPDDQ HLPTANTCIS RLYVPLYSSK QILKQKLLLA IKTKNFGFV
//
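
For reference, the layout of the record above can be read line by line: each line starts with a two-letter code (ID, AC, DE, FT, ...), the residues follow the SQ line, and "//" ends the entry. The following is only a minimal plain-Java sketch of that idea, not BioJava's own parser or API; the class name SwissProtSketch, the parse method, and the "SEQ" key are hypothetical names chosen for illustration, and it assumes the single-space layout shown in this message.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; this is not part of BioJava.
class SwissProtSketch {

    /**
     * Groups the text of each line under its two-letter line code.
     * The residues that follow the SQ line are joined, with spaces removed,
     * and stored under the made-up key "SEQ".
     */
    static Map<String, List<String>> parse(List<String> lines) {
        Map<String, List<String>> fields = new LinkedHashMap<>();
        StringBuilder seq = new StringBuilder();
        boolean inSequence = false;

        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;
            }
            if (line.startsWith("//")) {
                break; // end-of-entry marker
            }
            if (inSequence) {
                // Sequence lines carry no line code, only blocks of residues.
                seq.append(line.replaceAll("\\s+", ""));
                continue;
            }
            String code = line.length() >= 2 ? line.substring(0, 2) : line;
            String body = line.length() > 2 ? line.substring(2).trim() : "";
            fields.computeIfAbsent(code, k -> new ArrayList<>()).add(body);
            if (code.equals("SQ")) {
                inSequence = true; // everything up to "//" is sequence data
            }
        }
        fields.computeIfAbsent("SEQ", k -> new ArrayList<>()).add(seq.toString());
        return fields;
    }

    public static void main(String[] args) {
        // A few lines copied from the record above.
        List<String> record = Arrays.asList(
            "ID 100K_RAT STANDARD; PRT; 889 AA.",
            "AC Q62671;",
            "OS Rattus norvegicus (Rat).",
            "SQ SEQUENCE 889 AA; 100368 MW; ABD7E3CD53961B78 CRC64;",
            "MMSARGDFLN YALSLMRSHN",
            "//");
        Map<String, List<String>> f = parse(record);
        System.out.println(f.get("ID").get(0).split("\\s+")[0]); // prints 100K_RAT
        System.out.println(f.get("SEQ").get(0)); // prints MMSARGDFLNYALSLMRSHN
    }
}
```

Repeated codes such as DT, CC, DR and FT end up as lists in file order, so multi-line comments and the feature table stay together and can be post-processed separately.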