[Bioperl-l] parsing DEFINITION field in GenBank entries?
Dave Lewis
ddlewis3@worldnet.att.net
Sun, 24 Jun 2001 07:37:08 -0500
Hi - I'm wondering if anyone has attempted to parse (using Perl or
otherwise) the DEFINITION field of GenBank entries. Here are some examples:
DEFINITION  Papio hamadryas cynocephalus MHC class II antigen DQ-alpha MHC-DQA
            gene (MHC-DQA*AMB-2 allele), exon 2 and partial cds.
DEFINITION  Homo sapiens inducible
            6-phosphofructo-2-kinase/fructose-2,6-bisphosphatase (IPFK2) gene,
            partial cds.
DEFINITION  Homo sapiens SBBI12 mRNA, complete cds.
DEFINITION  Sequence 22 from Patent WO0100669.
The field is only semi-formatted, so this would in general be a heuristic
pattern-matching / natural language processing problem. It wouldn't be
possible to do perfectly, but one might be able to do a reasonable job of
pulling out gene names and gene symbols when they are there, or at least
eliminating parts of the entry that aren't those things.
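
To make the idea concrete, here is a minimal sketch of the kind of heuristic
I have in mind, written in Python rather than Perl. Everything in it is an
assumption made for illustration: the function name parse_definition, the
regular expressions, and the idea of passing in the organism name (taken from
the record's ORGANISM line) so it can be stripped from the front. It is tuned
only to the four examples above and is not an existing BioPerl routine.

```python
import re

# Hypothetical patterns, derived only from the four sample DEFINITION lines.
PATENT_RE = re.compile(r"^Sequence \d+ from Patent\b")
# A parenthesised token without spaces, optionally followed by " allele",
# is treated as a gene symbol, e.g. "(IPFK2)" or "(MHC-DQA*AMB-2 allele)".
SYMBOL_RE = re.compile(r"\s*\(([^\s()]+)(?: allele)?\)")
# Trailing qualifiers such as ", exon 2 and partial cds" or ", complete cds".
TAIL_RE = re.compile(r",\s*(?:exon|complete cds|partial cds).*$")
# Molecule/feature word left at the end of the name.
KIND_RE = re.compile(r"\s+(gene|mRNA)$")


def parse_definition(definition, organism=None):
    """Heuristically split a DEFINITION string into name, symbols and kind."""
    # Join wrapped continuation lines and drop the final period.
    text = " ".join(definition.split()).rstrip(".")
    result = {"organism": organism, "name": None, "symbols": [], "kind": None}

    # Patent entries carry no gene name at all.
    if PATENT_RE.match(text):
        result["kind"] = "patent"
        return result

    # Strip the organism if the caller supplied it (assumed to come from
    # the ORGANISM line of the same record).
    if organism and text.startswith(organism + " "):
        text = text[len(organism) + 1:]

    # Pull out parenthesised symbols, then remove them from the text.
    result["symbols"] = SYMBOL_RE.findall(text)
    text = SYMBOL_RE.sub("", text)

    # Drop the trailing "exon ... / partial cds / complete cds" part.
    text = TAIL_RE.sub("", text)

    # Record and remove a trailing "gene" or "mRNA".
    match = KIND_RE.search(text)
    if match:
        result["kind"] = match.group(1)
        text = text[:match.start()]

    result["name"] = text.strip() or None
    return result
```

Calling parse_definition("Homo sapiens SBBI12 mRNA, complete cds.",
organism="Homo sapiens") yields the name "SBBI12" with kind "mRNA" and no
symbols. Entries that do not follow these few patterns will come back with
the unparsed remainder in "name", which is where the real heuristic work
would start.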
Regards, Dave
David D. Lewis, Ph.D.
858 W. Armitage Ave., #296
Chicago, IL 60614 USA
ph. 773-975-0304; fax 773 442-0262
http://www.DavidDLewis.com