[Bioperl-l] HG-U133a annotation csv (HG-U133A_annot.csv)
Peter Robinson
Peter.Robinson at t-online.de
Fri Dec 10 16:34:47 EST 2004
# Pull out each double-quoted field in turn, one per output line.
while ($line =~ m/"(.*?)"/g) {
    print "$1\n";
}
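The "?" makes the * non-greedy, so each match stops at the next double
quote and $1 holds only the text between one pair of quotes. The commas
between the entries fall outside the quotes and are simply skipped, while
any ',' or ';' inside a field is kept. This assumes that no field
contains a double quote of its own.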
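If you want the fields in an array rather than printed, the same regex can be wrapped in a small subroutine. This is only a sketch: the name parse_quoted_fields and the shortened sample record are made up for illustration, and it still assumes that each record sits on a single line and that no field contains a double quote.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the contents of every double-quoted field found on one line.
# Assumption: no field contains an embedded double quote.
sub parse_quoted_fields {
    my ($line) = @_;
    my @fields;
    # In a while condition, /g resumes after the previous match each time,
    # so the commas between the quoted fields are skipped.
    while ($line =~ m/"(.*?)"/g) {
        push @fields, $1;
    }
    return @fields;
}

# Shortened, hypothetical record; note the comma inside the date field.
my $line = '"1007_s_at","Human Genome U133A Array","Homo sapiens","Oct 11, 2004","DDR1"';
my @fields = parse_quoted_fields($line);

print scalar(@fields), " fields\n";
print "$_\n" for @fields;
```

For the real file you would call the subroutine once per line read from the file handle and could check that every record yields the same number of fields as the header line.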
HTH
Peter
On Fri, 2004-12-10 at 21:59, D.Enrique ESCOBAR ESPINOZA wrote:
> I am having a hell of a time trying to parse the annotation file with a
> regular expression.
> The problem is that the file contains fields separated by commas;
> each field starts and ends with a double quote,
> and some fields also contain ';' and ','.
> An example of the file is at the end of this mail.
> Can someone help and give me a trick for parsing the lines of this
> file?
> It has 38 fields, and Excel does not even open it correctly,
> and if I try to save it back to a CSV file,
> it makes a complete mess.
> Thanks in advance.
> "Probe Set ID","GeneChip Array","Species Scientific Name","Annotation
> Date","Sequence Type","Sequence Source","Transcript ID","Target
> Description","Representative Public ID","UniGene ID","Genome
> Version","Alignments","Gene Title","Gene Symbol","Chromosomal
> Location","Unigene Cluster
> Type","Ensembl","LocusLink","SwissProt","EC","OMIM","RefSeq Protein
> ID","RefSeq Transcript ID","FlyBase","AGI","WormBase","MGI Name","RGD
> Name","SGD accession number","Gene Ontology Biological Process","Gene
> Ontology Cellular Component","Gene Ontology Molecular
> Function","Pathway","Protein Families","Protein
> Domains","InterPro","Trans Membrane","QTL","Annotation
> Description","Annotation Transcript Cluster","Transcript
> Assignments","Annotation Notes"
> "1007_s_at","Human Genome U133A Array","Homo sapiens","Oct 11,
> 2004","Exemplar sequence","Affymetrix Proprietary
> Database","U48705mRNA"," U48705 /FEATURE=mRNA /DEFINITION=HSU48705
> Human receptor tyrosine kinase DDR gene, complete cds
> ","U48705","Hs.423573","May 2004 (NCBI 35)","chr6:30964144-30975910
> (+) // 95.63 // p21.33","discoidin domain receptor family, member
> 1","DDR1","chr6p21.3","full length","ENSG00000137332","780","BAC85426
> /// Q08345 /// Q96T61 /// Q96T62","EC:2.7.1.112","600408","NP_001945
> /// NP_054699 /// NP_054700","NM_001954 /// NM_013993 ///
> NM_013994","---","---","---","---","---","---","6468 // protein amino
> acid phosphorylation // inferred from electronic annotation /// 7155
> // cell adhesion // traceable author statement /// 7169 //
> transmembrane receptor protein tyrosine kinase signaling pathway //
> inferred from electronic annotation","5887 // integral to plasma
> membrane // traceable author statement /// 16020 // membrane //
> inferred from electronic annotation","4674 // protein
> serine/threonine kinase activity // inferred from electronic
> annotation /// 4714 // transmembrane receptor protein tyrosine kinase
> activity // traceable author statement /// 4872 // receptor activity
> // inferred from electronic annotation /// 5524 // ATP binding //
> inferred from electronic annotation /// 16740 // transferase activity
> // inferred from electronic annotation","---","ec // ZA70_HUMAN //
> ZA70_HUMAN EC:2.7.1.112:TYROSINE-PROTEIN KINASE ZAP-70 (EC 2.7.1.112)
> (70 KDA ZETA-ASSOCIATED PROTEIN) (SYK-RELATED TYROSINE KINASE). //
> 2.0E-65 /// Hanks // DDR // HUMRTK_1 (DDR) KINASES:5.11.1 | PTK Group
> B membrane spanning protein tyrosine kinases.PTK XX DDR/TKT family
> .DDR // 1.0E-156","scop // d1kexa_ // d1kexa_ SCOP:b.18.1.2:| B1
> domain of neuropilin-1 // 5.0E-42","IPR000421 // Coagulation factor
> 5/8 type C domain (FA58C) /// IPR000719 // Protein
> kinase","NP_054700.1 // span:417-439 // numtm:1","---","This probe
> set was annotated using the Matching Probes based pipeline to a Locus
> Link identifier using 1 transcripts. // false // Matching Probes //
> A","NM_013994(16)","ENST00000259875 // cdna:known
> chromosome:NCBI34:6:30958112:30974184:1 // ensembl // 16 // --- ///
> NM_013994 // Homo sapiens discoidin domain receptor family, member 1
> (DDR1), transcript variant 3, mRNA. // refseq // 16 //
> ---","ENST00000325423 // ensembl // 1 // Negative Strand Matching
> Probes /// ENST00000340208 // ensembl // 1 // Negative Strand
> Matching Probes /// GENSCAN00000025013 // ensembl // 1 // Negative
> Strand Matching Probes /// BC026341 // gb // 1 // Negative Strand
> Matching Probes /// S57212 // gb // 1 // Negative Strand Matching
> Probes"
> "1053_at","Human Genome U133A Array","Homo sapiens","Oct 11,
> 2004","Exemplar sequence","GenBank","M87338"," M87338 /FEATURE=
> /DEFINITION=HUMA1SBU Human replication factor C, 40-kDa subunit (A1)
> mRNA, complete cds ","M87338","Hs.139226","May 2004 (NCBI
> 35)","chr7:73090653-73113383 (-) // 70.86 // q11.23","replication
> factor C (activator 1) 2, 40kDa","RFC2","chr7q11.23","full
> length","ENSG00000049541","5982","AAP35707 ///
> P35250","---","600404","NP_002905 /// NP_852136","NM_002914 ///
> NM_181471","---","---","---","---","---","---","6260 // DNA
> replication // inferred from electronic annotation","5634 // nucleus
> // inferred from electronic annotation /// 5663 // DNA replication
> factor C complex // traceable author statement","166 // nucleotide
> binding // inferred from electronic annotation /// 3677 // DNA
> binding // inferred from electronic annotation /// 5524 // ATP
> binding // traceable author statement","DNA_replication //
> GenMAPP","ec // KAD2_HUMAN // KAD2_HUMAN EC:2.7.4.3:ADENYLATE KINASE
> ISOENZYME 2, MITOCHONDRIAL (EC 2.7.4.3) (ATP-AMP TRANSPHOSPHORYLASE).
> // 8.2","scop // d1nrjb_ // d1nrjb_ SCOP:c.37.1.8:| Signal
> recognition particle receptor beta-subunit //
> 0.024","---","---","---","This probe set was annotated using the
> Matching Probes based pipeline to a Locus Link identifier using 2
> transcripts. // false // Matching Probes //
> A","M87338(15),NM_181471(12)","ENST00000055077 // cdna:known
> chromosome:NCBI34:7:73057931:73080835:-1 // ensembl // 12 // --- ///
> ENST00000275627 // cdna:known
> chromosome:NCBI34:7:73057931:73080835:-1 // ensembl // 12 // --- ///
> M87338 // Human replication factor C, 40-kDa subunit (A1) mRNA,
> complete cds. // gb // 15 // --- /// NM_181471 // Homo sapiens
> replication factor C (activator 1) 2, 40kDa (RFC2), transcript
> variant 1, mRNA. // refseq // 12 // ---","GENSCAN00000014431 //
> ensembl // 8 // Cross Hyb Matching Probes"
>
>
> =====
> --------------------------------------------------
> D.Enrique ESCOBAR ESPINOZA (B.Sc.)
> http://www.iro.umontreal.ca/~escobard/
> http://adn.bioinfo.uqam.ca/~escd07097301/
> ICQ#: 201778618
> -------------------------------------------------
> 1487, Boul. St-Joseph Est Apt4
> Tel: (514) 523-8398
> Montreal QC Canada
> H2J 1M6
--
Peter N. Robinson
peter.robinson at t-online.de
peter.robinson at charite.de
http://www.charite.de/ch/medgen/robinson/