[Bioperl-l] HG-U133a annotation csv (HG-U133A_annot.csv)

D.Enrique ESCOBAR ESPINOZA escobarebio at yahoo.com
Fri Dec 10 15:59:37 EST 2004


I am having a hell of a time trying to parse the annotation file with a
regular expression.
The problem is that the file contains fields separated by commas;
each field starts and ends with a double quote,
and some fields also contain ';' and ',' characters.
An example of the file is at the end of this mail.
Can someone help and give me a trick for parsing the lines of this
file?
It has 38 fields, and Excel does not even open it correctly;
if I try to save it back to a CSV file,
it makes a complete mess.
Thanks in advance.
"Probe Set ID","GeneChip Array","Species Scientific Name","Annotation
Date","Sequence Type","Sequence Source","Transcript ID","Target
Description","Representative Public ID","UniGene ID","Genome
Version","Alignments","Gene Title","Gene Symbol","Chromosomal
Location","Unigene Cluster
Type","Ensembl","LocusLink","SwissProt","EC","OMIM","RefSeq Protein
ID","RefSeq Transcript ID","FlyBase","AGI","WormBase","MGI Name","RGD
Name","SGD accession number","Gene Ontology Biological Process","Gene
Ontology Cellular Component","Gene Ontology Molecular
Function","Pathway","Protein Families","Protein
Domains","InterPro","Trans Membrane","QTL","Annotation
Description","Annotation Transcript Cluster","Transcript
Assignments","Annotation Notes"
"1007_s_at","Human Genome U133A Array","Homo sapiens","Oct 11,
2004","Exemplar sequence","Affymetrix Proprietary
Database","U48705mRNA"," U48705 /FEATURE=mRNA /DEFINITION=HSU48705
Human receptor tyrosine kinase DDR gene, complete cds
","U48705","Hs.423573","May 2004 (NCBI 35)","chr6:30964144-30975910
(+) // 95.63 // p21.33","discoidin domain receptor family, member
1","DDR1","chr6p21.3","full length","ENSG00000137332","780","BAC85426
/// Q08345 /// Q96T61 /// Q96T62","EC:2.7.1.112","600408","NP_001945
/// NP_054699 /// NP_054700","NM_001954 /// NM_013993 ///
NM_013994","---","---","---","---","---","---","6468 // protein amino
acid phosphorylation // inferred from electronic annotation /// 7155
// cell adhesion // traceable author statement /// 7169 //
transmembrane receptor protein tyrosine kinase signaling pathway //
inferred from electronic annotation","5887 // integral to plasma
membrane // traceable author statement /// 16020 // membrane //
inferred from electronic annotation","4674 // protein
serine/threonine kinase activity // inferred from electronic
annotation /// 4714 // transmembrane receptor protein tyrosine kinase
activity // traceable author statement /// 4872 // receptor activity
// inferred from electronic annotation /// 5524 // ATP binding //
inferred from electronic annotation /// 16740 // transferase activity
// inferred from electronic annotation","---","ec // ZA70_HUMAN //
ZA70_HUMAN EC:2.7.1.112:TYROSINE-PROTEIN KINASE ZAP-70 (EC 2.7.1.112)
(70 KDA ZETA-ASSOCIATED PROTEIN) (SYK-RELATED TYROSINE KINASE). //
2.0E-65 /// Hanks // DDR // HUMRTK_1 (DDR) KINASES:5.11.1 | PTK Group
B membrane spanning protein tyrosine kinases.PTK XX DDR/TKT family
.DDR // 1.0E-156","scop // d1kexa_ // d1kexa_ SCOP:b.18.1.2:| B1
domain of neuropilin-1 // 5.0E-42","IPR000421 // Coagulation factor
5/8 type C domain (FA58C) /// IPR000719 // Protein
kinase","NP_054700.1 // span:417-439 // numtm:1","---","This probe
set was annotated using the Matching Probes based pipeline to a Locus
Link identifier using 1 transcripts. // false // Matching Probes //
A","NM_013994(16)","ENST00000259875 // cdna:known
chromosome:NCBI34:6:30958112:30974184:1 // ensembl // 16 // --- ///
NM_013994 // Homo sapiens discoidin domain receptor family, member 1
(DDR1), transcript variant 3, mRNA. // refseq // 16 //
---","ENST00000325423 // ensembl // 1 // Negative Strand Matching
Probes /// ENST00000340208 // ensembl // 1 // Negative Strand
Matching Probes /// GENSCAN00000025013 // ensembl // 1 // Negative
Strand Matching Probes /// BC026341 // gb // 1 // Negative Strand
Matching Probes /// S57212 // gb // 1 // Negative Strand Matching
Probes"
"1053_at","Human Genome U133A Array","Homo sapiens","Oct 11,
2004","Exemplar sequence","GenBank","M87338"," M87338 /FEATURE=
/DEFINITION=HUMA1SBU Human replication factor C, 40-kDa subunit (A1)
mRNA, complete cds ","M87338","Hs.139226","May 2004 (NCBI
35)","chr7:73090653-73113383 (-) // 70.86 // q11.23","replication
factor C (activator 1) 2, 40kDa","RFC2","chr7q11.23","full
length","ENSG00000049541","5982","AAP35707 ///
P35250","---","600404","NP_002905 /// NP_852136","NM_002914 ///
NM_181471","---","---","---","---","---","---","6260 // DNA
replication // inferred from electronic annotation","5634 // nucleus
// inferred from electronic annotation /// 5663 // DNA replication
factor C complex // traceable author statement","166 // nucleotide
binding // inferred from electronic annotation /// 3677 // DNA
binding // inferred from electronic annotation /// 5524 // ATP
binding // traceable author statement","DNA_replication //
GenMAPP","ec // KAD2_HUMAN // KAD2_HUMAN EC:2.7.4.3:ADENYLATE KINASE
ISOENZYME 2, MITOCHONDRIAL (EC 2.7.4.3) (ATP-AMP TRANSPHOSPHORYLASE).
// 8.2","scop // d1nrjb_ // d1nrjb_ SCOP:c.37.1.8:| Signal
recognition particle receptor beta-subunit //
0.024","---","---","---","This probe set was annotated using the
Matching Probes based pipeline to a Locus Link identifier using 2
transcripts. // false // Matching Probes //
A","M87338(15),NM_181471(12)","ENST00000055077 // cdna:known
chromosome:NCBI34:7:73057931:73080835:-1 // ensembl // 12 // --- ///
ENST00000275627 // cdna:known
chromosome:NCBI34:7:73057931:73080835:-1 // ensembl // 12 // --- ///
M87338 // Human replication factor C, 40-kDa subunit (A1) mRNA,
complete cds. // gb // 15 // --- /// NM_181471 // Homo sapiens
replication factor C (activator 1) 2, 40kDa (RFC2), transcript
variant 1, mRNA. // refseq // 12 // ---","GENSCAN00000014431 //
ensembl // 8 // Cross Hyb Matching Probes"
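
For lines like the ones above, a quote-aware CSV parser is usually easier than a hand-written regular expression, because it treats commas inside double quotes as part of the field. The following is only a minimal sketch in Python using the standard csv module; the function names parse_annotation and split_multi are made up for illustration, and the in-memory sample is a shortened stand-in built from values in the second record above, not the real file layout.

```python
import csv
import io


def parse_annotation(handle):
    """Yield one dict per probe set, keyed by the column names in the header row.

    csv.reader honours the double quotes, so a comma inside a quoted field
    (for example "Oct 11, 2004") does not split the field.
    """
    reader = csv.reader(handle, delimiter=",", quotechar='"')
    header = next(reader)
    for row in reader:
        yield dict(zip(header, row))


def split_multi(value):
    """Split a multi-valued field on the ' /// ' separator seen in the sample.

    The '---' placeholder is treated as "no value" (an assumption based on
    the sample records).
    """
    if value.strip() == "---":
        return []
    return [part.strip() for part in value.split("///")]


# Shortened stand-in for the real file: four columns instead of the full set.
sample = io.StringIO(
    '"Probe Set ID","Annotation Date","RefSeq Protein ID",'
    '"Annotation Transcript Cluster"\n'
    '"1053_at","Oct 11, 2004","NP_002905 /// NP_852136",'
    '"M87338(15),NM_181471(12)"\n'
)

records = list(parse_annotation(sample))
for record in records:
    print(record["Probe Set ID"], split_multi(record["RefSeq Protein ID"]))
```

To read the real file, the StringIO object would be replaced by something like open("HG-U133A_annot.csv", newline=""), assuming the file on disk uses the same quoting as the sample in this mail.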


=====
--------------------------------------------------
D.Enrique ESCOBAR ESPINOZA (B.Sc.) 
http://www.iro.umontreal.ca/~escobard/
http://adn.bioinfo.uqam.ca/~escd07097301/
ICQ#: 201778618
-------------------------------------------------
1487, Boul. St-Joseph Est Apt4
Tel:  (514) 523-8398
Montreal QC Canada
H2J 1M6


		

