[Bioperl-l] HG-U133a annotation csv (HG-U133A_annot.csv)

Fri Dec 10 16:49:31 EST 2004

The module Text::CSV_XS could be used as well - it does a pretty good 
job with mixed quoted and non-quoted fields.
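
As a rough illustration of that suggestion, here is a minimal sketch. It
assumes Text::CSV_XS is installed from CPAN. The two records are shortened
stand-ins built from values in the sample quoted below, not real lines of
the file, and the choice of which columns to print is only for show.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV_XS;

# binary => 1 lets fields contain embedded newlines and non-ASCII bytes.
my $csv = Text::CSV_XS->new({ binary => 1 })
    or die "Cannot use Text::CSV_XS: " . Text::CSV_XS->error_diag();

# Shortened, illustrative records (assumption: the real file has many
# more columns per record).
my @lines = (
    '"1007_s_at","Oct 11, 2004","BAC85426 /// Q08345","DDR1"',
    '"1053_at","Oct 11, 2004","M87338(15),NM_181471(12)","RFC2"',
);

my @rows;
for my $line (@lines) {
    # parse() returns false if the line is not valid CSV.
    $csv->parse($line)
        or die "Parse failed: " . $csv->error_diag();
    push @rows, [ $csv->fields() ];
}

# Tab-separated output: first column (probe set ID) and last column.
print join("\t", $_->[0], $_->[-1]), "\n" for @rows;
```

Commas inside a quoted field (as in "Oct 11, 2004") stay inside that field.
When reading the real file, calling $csv->getline($fh) on an open filehandle
instead of parse() should also cope with a quoted field that spans several
lines.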

-jason

On Dec 10, 2004, at 4:34 PM, Peter Robinson wrote:

> # Print each double-quoted field on its own line.
> while ($line =~ m/"(.*?)"/g) {
> 	print "$1\n";
> }
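> The "?" makes * non-greedy, so each match stops at the next double
> quote and captures only the text between one pair of quotes. The commas
> between the fields are simply skipped. Note that this only works as long
> as no field contains a double quote of its own.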
>
> HTH
>
> Peter
>
>
> On Fri, 2004-12-10 at 21:59, D.Enrique ESCOBAR ESPINOZA wrote:
>> I am having a hell of a time trying to parse the annotation file with a
>> regular expression.
>> The problem is that the file contains fields separated by commas;
>> each field starts and ends with a double quote,
>> and some fields also contain ';' and ',' characters.
>> An example of the file is at the end of this mail.
>> Can someone help and give me a trick for parsing the lines of this
>> file?
>> It has 38 fields, and Excel does not even open it correctly,
>> and if I try to save it back to a CSV file,
>> it makes a complete mess.
>> Thanks in advance.
>> "Probe Set ID","GeneChip Array","Species Scientific Name","Annotation
>> Date","Sequence Type","Sequence Source","Transcript ID","Target
>> Description","Representative Public ID","UniGene ID","Genome
>> Version","Alignments","Gene Title","Gene Symbol","Chromosomal
>> Location","Unigene Cluster
>> Type","Ensembl","LocusLink","SwissProt","EC","OMIM","RefSeq Protein
>> ID","RefSeq Transcript ID","FlyBase","AGI","WormBase","MGI Name","RGD
>> Name","SGD accession number","Gene Ontology Biological Process","Gene
>> Ontology Cellular Component","Gene Ontology Molecular
>> Function","Pathway","Protein Families","Protein
>> Domains","InterPro","Trans Membrane","QTL","Annotation
>> Description","Annotation Transcript Cluster","Transcript
>> Assignments","Annotation Notes"
>> "1007_s_at","Human Genome U133A Array","Homo sapiens","Oct 11,
>> 2004","Exemplar sequence","Affymetrix Proprietary
>> Database","U48705mRNA"," U48705 /FEATURE=mRNA /DEFINITION=HSU48705
>> Human receptor tyrosine kinase DDR gene, complete cds
>> ","U48705","Hs.423573","May 2004 (NCBI 35)","chr6:30964144-30975910
>> (+) // 95.63 // p21.33","discoidin domain receptor family, member
>> 1","DDR1","chr6p21.3","full length","ENSG00000137332","780","BAC85426
>> /// Q08345 /// Q96T61 /// Q96T62","EC:2.7.1.112","600408","NP_001945
>> /// NP_054699 /// NP_054700","NM_001954 /// NM_013993 ///
>> NM_013994","---","---","---","---","---","---","6468 // protein amino
>> acid phosphorylation // inferred from electronic annotation /// 7155
>> // cell adhesion // traceable author statement /// 7169 //
>> transmembrane receptor protein tyrosine kinase signaling pathway //
>> inferred from electronic annotation","5887 // integral to plasma
>> membrane // traceable author statement /// 16020 // membrane //
>> inferred from electronic annotation","4674 // protein
>> serine/threonine kinase activity // inferred from electronic
>> annotation /// 4714 // transmembrane receptor protein tyrosine kinase
>> activity // traceable author statement /// 4872 // receptor activity
>> // inferred from electronic annotation /// 5524 // ATP binding //
>> inferred from electronic annotation /// 16740 // transferase activity
>> // inferred from electronic annotation","---","ec // ZA70_HUMAN //
>> ZA70_HUMAN EC:2.7.1.112:TYROSINE-PROTEIN KINASE ZAP-70 (EC 2.7.1.112)
>> (70 KDA ZETA-ASSOCIATED PROTEIN) (SYK-RELATED TYROSINE KINASE). //
>> 2.0E-65 /// Hanks // DDR // HUMRTK_1 (DDR) KINASES:5.11.1 | PTK Group
>> B membrane spanning protein tyrosine kinases.PTK XX DDR/TKT family
>> .DDR // 1.0E-156","scop // d1kexa_ // d1kexa_ SCOP:b.18.1.2:| B1
>> domain of neuropilin-1 // 5.0E-42","IPR000421 // Coagulation factor
>> 5/8 type C domain (FA58C) /// IPR000719 // Protein
>> kinase","NP_054700.1 // span:417-439 // numtm:1","---","This probe
>> set was annotated using the Matching Probes based pipeline to a Locus
>> Link identifier using 1 transcripts. // false // Matching Probes //
>> A","NM_013994(16)","ENST00000259875 // cdna:known
>> chromosome:NCBI34:6:30958112:30974184:1 // ensembl // 16 // --- ///
>> NM_013994 // Homo sapiens discoidin domain receptor family, member 1
>> (DDR1), transcript variant 3, mRNA. // refseq // 16 //
>> ---","ENST00000325423 // ensembl // 1 // Negative Strand Matching
>> Probes /// ENST00000340208 // ensembl // 1 // Negative Strand
>> Matching Probes /// GENSCAN00000025013 // ensembl // 1 // Negative
>> Strand Matching Probes /// BC026341 // gb // 1 // Negative Strand
>> Matching Probes /// S57212 // gb // 1 // Negative Strand Matching
>> Probes"
>> "1053_at","Human Genome U133A Array","Homo sapiens","Oct 11,
>> 2004","Exemplar sequence","GenBank","M87338"," M87338 /FEATURE=
>> /DEFINITION=HUMA1SBU Human replication factor C, 40-kDa subunit (A1)
>> mRNA, complete cds ","M87338","Hs.139226","May 2004 (NCBI
>> 35)","chr7:73090653-73113383 (-) // 70.86 // q11.23","replication
>> factor C (activator 1) 2, 40kDa","RFC2","chr7q11.23","full
>> length","ENSG00000049541","5982","AAP35707 ///
>> P35250","---","600404","NP_002905 /// NP_852136","NM_002914 ///
>> NM_181471","---","---","---","---","---","---","6260 // DNA
>> replication // inferred from electronic annotation","5634 // nucleus
>> // inferred from electronic annotation /// 5663 // DNA replication
>> factor C complex // traceable author statement","166 // nucleotide
>> binding // inferred from electronic annotation /// 3677 // DNA
>> binding // inferred from electronic annotation /// 5524 // ATP
>> binding // traceable author statement","DNA_replication //
>> GenMAPP","ec // KAD2_HUMAN // KAD2_HUMAN EC:2.7.4.3:ADENYLATE KINASE
>> ISOENZYME 2, MITOCHONDRIAL (EC 2.7.4.3) (ATP-AMP TRANSPHOSPHORYLASE).
>> // 8.2","scop // d1nrjb_ // d1nrjb_ SCOP:c.37.1.8:| Signal
>> recognition particle receptor beta-subunit //
>> 0.024","---","---","---","This probe set was annotated using the
>> Matching Probes based pipeline to a Locus Link identifier using 2
>> transcripts. // false // Matching Probes //
>> A","M87338(15),NM_181471(12)","ENST00000055077 // cdna:known
>> chromosome:NCBI34:7:73057931:73080835:-1 // ensembl // 12 // --- ///
>> ENST00000275627 // cdna:known
>> chromosome:NCBI34:7:73057931:73080835:-1 // ensembl // 12 // --- ///
>> M87338 // Human replication factor C, 40-kDa subunit (A1) mRNA,
>> complete cds. // gb // 15 // --- /// NM_181471 // Homo sapiens
>> replication factor C (activator 1) 2, 40kDa (RFC2), transcript
>> variant 1, mRNA. // refseq // 12 // ---","GENSCAN00000014431 //
>> ensembl // 8 // Cross Hyb Matching Probes"
>>
>>
>> =====
>> --------------------------------------------------
>> D.Enrique ESCOBAR ESPINOZA (B.Sc.)
>> http://www.iro.umontreal.ca/~escobard/
>> http://adn.bioinfo.uqam.ca/~escd07097301/
>> ICQ#: 201778618
>> -------------------------------------------------
>> 1487, Boul. St-Joseph Est Apt4
>> Tel:  (514) 523-8398
>> Montreal QC Canada
>> H2J 1M6
>>
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at portal.open-bio.org
>> http://portal.open-bio.org/mailman/listinfo/bioperl-l
> -- 
> Peter N. Robinson
> peter.robinson at t-online.de
> peter.robinson at charite.de
> http://www.charite.de/ch/medgen/robinson/
--
Jason Stajich
jason.stajich at duke.edu
http://www.duke.edu/~jes12/