[Bioperl-l] bioperl-db -- qualifier values

Keith Allen kallen@paragen.com
Fri, 22 Mar 2002 11:30:00 -0500


Hi guys,

I'm trying to work out some differences between the way that embl
and swissprot data are handled when loaded into the biosql db
(using the biosqldb_1.6.sql code to build the db and the 1.6 version
of load_seqdatabase.pl and bioperl-1.0alpha2-rc).

Step 1 was loading all of the embl human data.  After fishing around
for a bit I found that there is an EC_number ontology term in the
ontology_term table, which means you can construct a query that
gives you a list of all human proteins in the db that have an associated
EC number (and the number came out looking pretty reasonable).

Step 2 was loading all of Swissprot, which also went without a hitch.

So at this point I wanted to get all of the swissprot human sequences
that had an associated EC number, and compare that list to the embl
list.  Alas, mine was a luckless venture, and it took a while for me to
work out that it wasn't a problem with my SQL query: there were in
fact no EC numbers generated as qualifiers and stored in
bioentry_qualifier_value from the swissprot data.  In fact, if you load
a separate database with the swissprot data (instead of loading both
embl and swissprot into the same db and using the tags in the
biodatabase table to tell the data apart), the ontology_term table ends
up with just a third as many entries, and EC_number is not one of them.

I haven't worked all the way through the code to see why this is
happening, but it looks like the difference is where the info is stored
in the records.  In embl, the EC number ends up as a note in the feature
table, like so:

FT                   /note="triosephosphate isomerase (EC 5.3.1.1); NCBI gi:

This means that this part of the record is passed off to
Bio::SeqFeature::Generic and everything gets handled just the way
you'd like.

On the other hand, swissprot puts this information in the description
line rather than the feature table, like so:

DE   Triosephosphate isomerase (EC 5.3.1.1) (TIM).

which means, if I'm reading the code right, it never gets out of
Bio::SeqIO::swiss for further parsing, and the whole line gets loaded
into the database unchanged.  So for this example (Swissprot accession
P00938), the qualifier_value stored in the bioentry_qualifier_value
table corresponding to the ontology term "description" is:

Triosephosphate isomerase (EC 5.3.1.1) (TIM).

which would be the raw result of the regex in swiss.pm,
/^DE\s+(\S.*\S)/.
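
To make that concrete, here is a small illustration of what that pattern
captures.  It is written in Python rather than Perl and is not taken from
BioPerl; the function name parse_de_line is made up for the example, and
only the regex itself comes from the parser.

```python
import re

# Same pattern as the DE regex quoted above, written for Python's re module.
DE_PATTERN = re.compile(r"^DE\s+(\S.*\S)")


def parse_de_line(line):
    """Return the text captured from a DE line, or None if it does not match.

    Hypothetical helper for illustration only; it mirrors the regex,
    not the surrounding parser logic.
    """
    match = DE_PATTERN.match(line)
    return match.group(1) if match else None


description = parse_de_line("DE   Triosephosphate isomerase (EC 5.3.1.1) (TIM).")
print(description)
```

The capture group takes everything after the "DE" tag, so the EC number
stays embedded in the free-text description instead of becoming its own
qualifier.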


So, I think that important info is being lost here (i.e., filed someplace
where we can't find it), but I'm not sure what the appropriate fix is.
This is basically a case of feature table info that is being provided
outside of the feature table, so it almost seems like the description
line parsing code in swiss.pm needs to be looking for this, and then
pass it off to the feature table code if it finds it.

right?
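
As a rough sketch of the kind of extra parsing step I have in mind, here
is one way to pull "(EC n.n.n.n)" out of a description string and file it
under its own EC_number key.  This is Python for illustration only, not a
patch: the function names and the dictionary layout are assumptions, and
nothing here corresponds to existing BioPerl or bioperl-db calls.

```python
import re

# EC numbers have four dot-separated fields; trailing fields may be "-"
# when the classification is incomplete (e.g. "EC 3.4.-.-").
EC_PATTERN = re.compile(r"\(EC\s+(\d+(?:\.(?:\d+|-)){3})\)")


def extract_ec_numbers(description):
    """Return every EC number found in parentheses in the description."""
    return EC_PATTERN.findall(description)


def description_to_qualifiers(description):
    """Build a qualifier dict (hypothetical layout) from a description.

    The description is kept whole, and any EC numbers found in it are
    also stored under a separate "EC_number" key.
    """
    qualifiers = {"description": [description]}
    ec_numbers = extract_ec_numbers(description)
    if ec_numbers:
        qualifiers["EC_number"] = ec_numbers
    return qualifiers


qualifiers = description_to_qualifiers(
    "Triosephosphate isomerase (EC 5.3.1.1) (TIM)."
)
print(qualifiers["EC_number"])
```

The same pattern also matches the embl /note text shown earlier, so both
sources would end up with a comparable EC_number entry.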