[Bioperl-l] quotes in features

Michael Muratet mam at torchconcepts.com
Thu Jul 17 15:26:44 EDT 2003


On Thu, 17 Jul 2003 17:53:49 +0100 (BST)
Ewan Birney <birney at ebi.ac.uk> wrote:

> 
> 
> On Thu, 17 Jul 2003, Michael Muratet wrote:
> 
> > Greetings
> >
> > I found the following entry in gbpri1.seq.gz
> >
> > LOCUS       AB078028                 510 bp    mRNA    linear   PRI 17-JUL-2002
> > DEFINITION  Homo sapiens ATF3deltaZip2exonD'DE'E gene for ATF3deltaZip2,
> >             partial cds.
> >                      /gene="ATF3deltaZip2exonD'DE'E"
> >      CDS             <1..60
> >                      /gene="ATF3deltaZip2exonD'DE'E"
> >                      /codon_start=1
> >
> > Embedded quotes are a problem for those of us who try to automatically
> > parse and/or store in databases the information in the DEFINITION, CDS,
> > or /gene fields. We can deal with them, but adding code for special
> > cases (and figuring out what those cases are) is time-consuming. I'd
> > like to propose a standard that says that strings that represent
> > names, genes, etc., contain no spaces, quotes, or non-printing
> > characters, or anything else that might be construed as a delimiter
> > in Perl, C, Java, SQL, etc.
> >
> 
> Mike - this is a good point, but the Feature table has very
> long-established rules about quoting etc., and we are not going to be
> changing those. If the Bioperl parser falls over on these guys, then
> this is a Bioperl error which we should fix.
> 
> 
> There is simply no sense in (or pragmatic way of) changing the feature
> table parsing rules.
> 
> 
> 
> > Thank you.
> >
> > Mike
> >

Ewan

In this particular case, bioperl extracted the name correctly (i.e.,
with the quotes still embedded) and it was MySQL (or the DBD interface)
that had the problem with the quotes. (And thanks again to all the
people who make bioperl and DBD possible.) I had a problem a few weeks
back with an embedded space in a DEFINITION line which bioperl did not
parse correctly, but NCBI agreed to replace the space with an underscore
in the offending record. I copied the list on this one because I thought
people should be aware of the issue if they weren't already. 
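
For anyone who hits the same thing, here is a minimal sketch of the usual
workaround: pass the value through a placeholder instead of pasting it into
the SQL string. It is written in Python with the standard sqlite3 module as a
stand-in for MySQL, and the table and column names are made up for
illustration. Perl's DBI offers the same "?" placeholder mechanism through
prepare/execute, so the idea should carry over, though I have not shown that
here.

```python
import sqlite3

# Gene name from the GenBank entry quoted above; it contains single quotes.
gene = "ATF3deltaZip2exonD'DE'E"

# In-memory SQLite database, used only as a stand-in for MySQL.
conn = sqlite3.connect(":memory:")
# Hypothetical table and column names, chosen for this example.
conn.execute("CREATE TABLE feature (accession TEXT, gene TEXT)")

# Naive approach: paste the value into the SQL text. The embedded single
# quotes end the string literal early, so the statement fails to parse.
naive_failed = False
try:
    conn.execute(
        "INSERT INTO feature (accession, gene) VALUES ('AB078028', '" + gene + "')"
    )
except sqlite3.OperationalError:
    naive_failed = True

# Placeholder approach: the values travel separately from the SQL text,
# so no manual escaping of the quotes is needed.
conn.execute(
    "INSERT INTO feature (accession, gene) VALUES (?, ?)", ("AB078028", gene)
)

stored = conn.execute(
    "SELECT gene FROM feature WHERE accession = ?", ("AB078028",)
).fetchone()[0]

print("naive insert failed:", naive_failed)
print("round trip intact:", stored == gene)
```

With placeholders the name is stored exactly as the parser extracted it, so
no special-case code for quotes is required on the database side.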

I agree that we'll never get them to change the rules about features.
The rules we have seem to work OK 99.99% of the time. I'm guessing that
in this case, the single quotes should be read as 'prime', and I'll
incorporate Brian's suggestion (thanks, Brian). Still, you'd think that
with 52 letters and ten digits we could come up with all the gene names
we need... ;-)
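
To make the quoting rule concrete, here is a rough sketch (in Python, not
Bioperl's actual parser) of reading a single-line qualifier. It assumes the
feature table convention that a free-text value is delimited by double
quotes, that a double quote inside the value is written as two consecutive
double quotes, and that single quotes are ordinary characters. The function
name parse_qualifier is made up, and values continued over several lines are
not handled.

```python
def parse_qualifier(line):
    """Split a one-line qualifier such as /gene="..." into (name, value)."""
    text = line.strip()
    if not text.startswith("/"):
        raise ValueError("not a qualifier line: %r" % line)
    name, sep, raw = text[1:].partition("=")
    if not sep:
        # Qualifier with no value at all.
        return name, None
    if len(raw) >= 2 and raw.startswith('"') and raw.endswith('"'):
        # Drop the delimiting double quotes and undo the "" escape.
        # Single quotes are left untouched because they are not delimiters.
        raw = raw[1:-1].replace('""', '"')
    return name, raw


name, value = parse_qualifier("/gene=\"ATF3deltaZip2exonD'DE'E\"")
print(name, value)
print(parse_qualifier("/codon_start=1"))
```

Under these assumptions the primes in the gene name come through unchanged,
which matches what Bioperl did with this record.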

Cheers

Mike

