From escobarebio at yahoo.com Fri Dec 10 15:59:37 2004
From: escobarebio at yahoo.com (D.Enrique ESCOBAR ESPINOZA)
Date: Fri Dec 10 15:57:31 2004
Subject: [Bioperl-microarray] HG-U133a annotation csv (HG-U133A_annot.csv)
Message-ID: <20041210205938.77559.qmail@web11505.mail.yahoo.com>

I'm having a hell of a time trying to parse the annotation file with a regular expression. The problem is that the file contains fields separated by commas; each field starts and ends with a double quote, and some fields also contain ';' and ',' characters. An example of the file is at the end of this mail. Can someone help and give me a trick for parsing the lines of this file? It has 38 fields, and Excel does not even open it correctly; if I try to save it back to a CSV file, it makes a complete mess.

Thanks in advance.

"Probe Set ID","GeneChip Array","Species Scientific Name","Annotation Date","Sequence Type","Sequence Source","Transcript ID","Target Description","Representative Public ID","UniGene ID","Genome Version","Alignments","Gene Title","Gene Symbol","Chromosomal Location","Unigene Cluster Type","Ensembl","LocusLink","SwissProt","EC","OMIM","RefSeq Protein ID","RefSeq Transcript ID","FlyBase","AGI","WormBase","MGI Name","RGD Name","SGD accession number","Gene Ontology Biological Process","Gene Ontology Cellular Component","Gene Ontology Molecular Function","Pathway","Protein Families","Protein Domains","InterPro","Trans Membrane","QTL","Annotation Description","Annotation Transcript Cluster","Transcript Assignments","Annotation Notes"
"1007_s_at","Human Genome U133A Array","Homo sapiens","Oct 11, 2004","Exemplar sequence","Affymetrix Proprietary Database","U48705mRNA"," U48705 /FEATURE=mRNA /DEFINITION=HSU48705 Human receptor tyrosine kinase DDR gene, complete cds ","U48705","Hs.423573","May 2004 (NCBI 35)","chr6:30964144-30975910 (+) // 95.63 // p21.33","discoidin domain receptor family, member 1","DDR1","chr6p21.3","full length","ENSG00000137332","780","BAC85426 /// Q08345 /// Q96T61 /// Q96T62","EC:2.7.1.112","600408","NP_001945 /// NP_054699 /// NP_054700","NM_001954 /// NM_013993 /// NM_013994","---","---","---","---","---","---","6468 // protein amino acid phosphorylation // inferred from electronic annotation /// 7155 // cell adhesion // traceable author statement /// 7169 // transmembrane receptor protein tyrosine kinase signaling pathway // inferred from electronic annotation","5887 // integral to plasma membrane // traceable author statement /// 16020 // membrane // inferred from electronic annotation","4674 // protein serine/threonine kinase activity // inferred from electronic annotation /// 4714 // transmembrane receptor protein tyrosine kinase activity // traceable author statement /// 4872 // receptor activity // inferred from electronic annotation /// 5524 // ATP binding // inferred from electronic annotation /// 16740 // transferase activity // inferred from electronic annotation","---","ec // ZA70_HUMAN // ZA70_HUMAN EC:2.7.1.112:TYROSINE-PROTEIN KINASE ZAP-70 (EC 2.7.1.112) (70 KDA ZETA-ASSOCIATED PROTEIN) (SYK-RELATED TYROSINE KINASE). // 2.0E-65 /// Hanks // DDR // HUMRTK_1 (DDR) KINASES:5.11.1 | PTK Group B membrane spanning protein tyrosine kinases.PTK XX DDR/TKT family .DDR // 1.0E-156","scop // d1kexa_ // d1kexa_ SCOP:b.18.1.2:| B1 domain of neuropilin-1 // 5.0E-42","IPR000421 // Coagulation factor 5/8 type C domain (FA58C) /// IPR000719 // Protein kinase","NP_054700.1 // span:417-439 // numtm:1","---","This probe set was annotated using the Matching Probes based pipeline to a Locus Link identifier using 1 transcripts. // false // Matching Probes // A","NM_013994(16)","ENST00000259875 // cdna:known chromosome:NCBI34:6:30958112:30974184:1 // ensembl // 16 // --- /// NM_013994 // Homo sapiens discoidin domain receptor family, member 1 (DDR1), transcript variant 3, mRNA. // refseq // 16 // ---","ENST00000325423 // ensembl // 1 // Negative Strand Matching Probes /// ENST00000340208 // ensembl // 1 // Negative Strand Matching Probes /// GENSCAN00000025013 // ensembl // 1 // Negative Strand Matching Probes /// BC026341 // gb // 1 // Negative Strand Matching Probes /// S57212 // gb // 1 // Negative Strand Matching Probes"
"1053_at","Human Genome U133A Array","Homo sapiens","Oct 11, 2004","Exemplar sequence","GenBank","M87338"," M87338 /FEATURE= /DEFINITION=HUMA1SBU Human replication factor C, 40-kDa subunit (A1) mRNA, complete cds ","M87338","Hs.139226","May 2004 (NCBI 35)","chr7:73090653-73113383 (-) // 70.86 // q11.23","replication factor C (activator 1) 2, 40kDa","RFC2","chr7q11.23","full length","ENSG00000049541","5982","AAP35707 /// P35250","---","600404","NP_002905 /// NP_852136","NM_002914 /// NM_181471","---","---","---","---","---","---","6260 // DNA replication // inferred from electronic annotation","5634 // nucleus // inferred from electronic annotation /// 5663 // DNA replication factor C complex // traceable author statement","166 // nucleotide binding // inferred from electronic annotation /// 3677 // DNA binding // inferred from electronic annotation /// 5524 // ATP binding // traceable author statement","DNA_replication // GenMAPP","ec // KAD2_HUMAN // KAD2_HUMAN EC:2.7.4.3:ADENYLATE KINASE ISOENZYME 2, MITOCHONDRIAL (EC 2.7.4.3) (ATP-AMP TRANSPHOSPHORYLASE). // 8.2","scop // d1nrjb_ // d1nrjb_ SCOP:c.37.1.8:| Signal recognition particle receptor beta-subunit // 0.024","---","---","---","This probe set was annotated using the Matching Probes based pipeline to a Locus Link identifier using 2 transcripts. // false // Matching Probes // A","M87338(15),NM_181471(12)","ENST00000055077 // cdna:known chromosome:NCBI34:7:73057931:73080835:-1 // ensembl // 12 // --- /// ENST00000275627 // cdna:known chromosome:NCBI34:7:73057931:73080835:-1 // ensembl // 12 // --- /// M87338 // Human replication factor C, 40-kDa subunit (A1) mRNA, complete cds. // gb // 15 // --- /// NM_181471 // Homo sapiens replication factor C (activator 1) 2, 40kDa (RFC2), transcript variant 1, mRNA. // refseq // 12 // ---","GENSCAN00000014431 // ensembl // 8 // Cross Hyb Matching Probes"

=====
--------------------------------------------------
D.Enrique ESCOBAR ESPINOZA (B.Sc.)
http://www.iro.umontreal.ca/~escobard/
http://adn.bioinfo.uqam.ca/~escd07097301/
ICQ#: 201778618
-------------------------------------------------
1487, Boul. St-Joseph Est Apt4
Tel: (514) 523-8398
Montreal QC Canada H2J 1M6
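
One possible approach to the question above, offered as a minimal sketch and not a definitive answer: a quote-aware CSV parser is usually more reliable than a regular expression for a file like this, because it treats the commas and semicolons inside double-quoted fields as data. The example below uses Python's standard `csv` module (a swapped-in technique, not something the original post mentions) on a shortened four-column excerpt of the sample rows. The helper names `read_annotation` and `split_multi` are hypothetical, and the assumption that multi-valued fields are joined with ` /// ` and that `---` marks an empty field comes only from the sample rows shown.

```python
import csv
import io

# Shortened excerpt (four of the columns) taken from the sample rows in the post.
# Every field is double-quoted, and some fields contain ',' inside the quotes.
SAMPLE = (
    '"Probe Set ID","Annotation Date","Gene Title","RefSeq Protein ID"\n'
    '"1007_s_at","Oct 11, 2004","discoidin domain receptor family, member 1",'
    '"NP_001945 /// NP_054699 /// NP_054700"\n'
    '"1053_at","Oct 11, 2004","replication factor C (activator 1) 2, 40kDa",'
    '"NP_002905 /// NP_852136"\n'
)


def read_annotation(handle):
    """Return one dict per probe set, keyed by the column names in the header row.

    Hypothetical helper: csv.DictReader with the default dialect already
    understands double-quoted fields that contain commas.
    """
    return list(csv.DictReader(handle))


def split_multi(value, sep="///"):
    """Split a multi-valued field; '---' is assumed to mean 'no value'."""
    if value.strip() == "---":
        return []
    return [part.strip() for part in value.split(sep)]


rows = read_annotation(io.StringIO(SAMPLE))
for row in rows:
    # Print the probe set ID, the gene title, and the list of RefSeq protein IDs.
    print(row["Probe Set ID"], row["Gene Title"],
          split_multi(row["RefSeq Protein ID"]), sep="\t")
```

For the real file, the same function could be called on `open("HG-U133A_annot.csv", newline="")` in place of the `io.StringIO` object. In Perl, the Text::CSV module from CPAN plays a comparable role.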