[Bioperl-l] Re: [BioSQL-l] Swissprot Problems

Thu Aug 19 15:51:19 EDT 2004

Thanks, this is helpful. So it appears we need to add recognition of  
the RG line to the swissprot SeqIO parser. I just committed a fix to  
the main trunk. There should be a test as well, but I haven't had the  
time yet; everybody feel free to step in. I also added code to deal with  
this on writing out and therefore also to Bio::Annotation::Reference.
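
For anyone following along, here is a minimal sketch of the idea in
Python rather than Perl. It is not the code committed to the trunk, and
parse_reference_block and EXAMPLE_BLOCK are hypothetical names used only
for illustration. It assumes the usual Swiss-Prot layout of a two-letter
line code followed by three spaces, collects the RG and RA lines of one
reference block, and falls back to the RG value when no RA line is
present, as suggested in the quoted discussion below.

```python
# Reference block of Q53391 as quoted later in this thread.
EXAMPLE_BLOCK = [
    "RN   [1]",
    "RP   SEQUENCE FROM N.A.",
    "RC   STRAIN=KB7;",
    "RX   MEDLINE=94103636; PubMed=7903973;",
    "RG   INTERNATIONAL PSEUDOMONAS AERUGINOSA TYPING STUDY GROUP;",
    'RT   "A multicenter comparison of methods for typing strains of Pseudomonas',
    'RT   aeruginosa predominantly from patients with cystic fibrosis.";',
    "RL   J. Infect. Dis. 169:134-142(1994).",
]


def parse_reference_block(lines):
    """Collect RG, RA, RT and RL values from one reference block.

    Hypothetical helper: assumes a two-letter line code in columns 1-2
    and the data starting in column 6.
    """
    fields = {"RG": [], "RA": [], "RT": [], "RL": []}
    for line in lines:
        code, value = line[:2], line[5:].strip()
        if code in fields:
            fields[code].append(value)

    # Continuation lines of the same type are joined with a space.
    group = " ".join(fields["RG"]).rstrip(";")
    authors = " ".join(fields["RA"]).rstrip(";")
    title = " ".join(fields["RT"]).rstrip(";").strip('"')
    location = " ".join(fields["RL"])

    return {
        "group": group or None,
        # Fall back to the reference group when there is no RA line.
        "authors": authors or group or None,
        "title": title or None,
        "location": location or None,
    }


reference = parse_reference_block(EXAMPLE_BLOCK)
print(reference["authors"])
```

With a fallback like this the authors value is no longer empty for
entries that carry only an RG line.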

I was being lazy and didn't check the manual myself, and sure enough  
paid the price ...

	-hilmar

On Aug 19, 2004, at 1:38 AM, Dave Howorth wrote:

> Raphael A. Bauer wrote:
>> just an interesting thing from Swiss-Prot:
>> If we want to load the latest Swiss-Prot flatfile with
>> load_seqdatabase.pl we get the following error (normally our
>> load_seqdatabase.pl works fine):
>> -------------------- WARNING ---------------------
>> MSG: insert in Bio::DB::BioSQL::ReferenceAdaptor (driver) failed, values were
>> ("",""A multicenter comparison of methods for typing strains of Pseudomonas
>> aeruginosa predominantly from patients with cystic fibrosis."","J. Infect.
>> Dis. 169:134-142(1994).","CRC-237261AF859664D3","","") FKs (613823)
>> ERROR:  null value in column "authors" violates not-null constraint
>> ---------------------------------------------------
>> Could not store Q53391:
>> ------------- EXCEPTION  -------------
>> ...
>> And that is what I would expect, because Q53391 has no RA line.
>> The Swiss-Prot manual says:
>> RG    Reference group      Once or more (Optional if RA line)
>> RA    Reference authors    Once or more (Optional if RG line)
>> ...
>> ...
>> so I don't know how one can deal with this, because it is a clear
>> violation of the Swiss-Prot manual statements and therefore a violation
>> of the BioSQL schema definition (authors NOT NULL)...
>> We will remove the NOT NULL statements from the authors line in the
>> BioSQL schema to deal with this..
>
> Hilmar Lapp replied:
> >> We will remove the NOT NULL statements from the authors line in the
> >> BioSQL schema to deal with this..
> > Yep, I'll do this in the repository too.
>
>
> I'm a little confused by this. I'm interested in learning a bit about  
> these entries so I went to browse the entry  
> <http://www.ebi.uniprot.org/uniprot-srv/flatView.do?proteinId=FMK7_PSEAE&pager.offset=0>
> The relevant section seems to be:
>
> RN   [1]
> RP   SEQUENCE FROM N.A.
> RC   STRAIN=KB7;
> RX   MEDLINE=94103636; PubMed=7903973;
> RG   INTERNATIONAL PSEUDOMONAS AERUGINOSA TYPING STUDY GROUP;
> RT   "A multicenter comparison of methods for typing strains of Pseudomonas
> RT   aeruginosa predominantly from patients with cystic fibrosis.";
> RL   J. Infect. Dis. 169:134-142(1994).
>
> Then I went to the user manual  
> <http://www.expasy.org/sprot/userman.html#Ref_line> where the relevant  
> text seems to be:
>
>     3.10.5. The RG line
>
> The Reference Group (RG) line lists the consortium name associated  
> with a given citation. The RG line is mainly used in submission  
> reference blocks, but can also be used in paper references, if the  
> working group is cited as an author in the paper. RG line and RA line  
> (Reference Author) can be present in the same reference block; at  
> least one RG or RA line is mandatory per reference block. An example  
> of the use of RG lines is shown below:
>
> RG   The mouse genome sequencing consortium;
>
>     3.10.6. The RA line
>
> The RA (Reference Author) lines list the authors of the paper (or  
> other work) cited. The RA line is present in most references, but  
> might be missing in references that cite a reference group (see RG  
> line). At least one RG or RA line is mandatory per reference block.
> --------------------
>
> So it seems to me that the record is valid according to the spec and  
> records do not need to have an RA line if they do have an RG line. It  
> is probably appropriate to use the value of the RG field as the  
> authors field in the database. Or am I missing something?
>
>
> >> Any better ideas?
> > No. There's not much you can do if people violate their own specs.
>
> There is another possible way to deal with errant records that violate  
> the spec. That is to maintain an exception dictionary. That is, for  
> each record that would fail validation, make a curated patch that can  
> be applied to the record before validation. Clearly this can be a lot  
> of work unless the initial record quality is already high. Submitting  
> the exceptions back to the originating institution is good to do as  
> well :)
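
A rough sketch of such an exception dictionary, in Python and purely
illustrative: the accession "EXAMPLE1", the patch contents, and the
names PATCHES, apply_patches and validate are all hypothetical, and
validate only mimics the NOT NULL constraint on authors discussed in
this thread.

```python
# Hypothetical curated patches, keyed by accession. Each patch maps a
# field name to the value that should replace the one in the record.
PATCHES = {
    "EXAMPLE1": {"authors": "Example Consortium"},
}


def apply_patches(accession, record, patches=PATCHES):
    """Return a copy of the record with any curated patch applied."""
    patched = dict(record)
    patched.update(patches.get(accession, {}))
    return patched


def validate(record):
    """Mimic the authors NOT NULL constraint: reject empty authors."""
    return bool(record.get("authors"))


record = {"authors": "", "title": "Some title."}
print(validate(record))                             # raw record
print(validate(apply_patches("EXAMPLE1", record)))  # after patching
```

Records without an entry in the dictionary pass through unchanged, so
only the curated exceptions need maintaining.
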
>
> Cheers, Dave
> -- 
> Dave Howorth
> MRC Centre for Protein Engineering
> Hills Road, Cambridge, CB2 2QH
> 01223 252960
>
>
-- 
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------