[Bioperl-l] BioSQL, bioperl-db and UniGene

Thu Jan 5 14:02:38 EST 2006

On Jan 5, 2006, at 5:26 AM, Marc Saric wrote:

> I am currently writing an app which should map microarray probe
> sequences to target sequences. It should do so in a generalized manner
> (i.e. any microarray against an arbitrary sequence-database). Currently
> I need UniGene for Zebrafish (Dr.*) and several Oligonucleotide libs,
> among them an Affymetrix array.

First off, you have seen the TIGR RESOURCERER application
(http://www.tigr.org/tigr-scripts/magic/r1.pl), right?

> [...]
> 1st question:
>
> Due to the fact that the loader does not like raw FASTA-files,

The loader likes all formats that Bio::SeqIO likes, so it doesn't 
harbor any disdain for FASTA format. The only problem is that FASTA 
format doesn't designate fields for accession, version, and name but 
rather leaves it up to the file producer. This can be easily solved by 
writing a custom SeqProcessor as pointed out several times before, for 
instance:

http://portal.open-bio.org/pipermail/bioperl-l/2004-June/016204.html
http://portal.open-bio.org/pipermail/bioperl-l/2005-August/019579.html
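
To make the idea concrete, here is a rough, language-neutral sketch (in
Python, not Bioperl) of the parsing step such a processor would have to
perform. It assumes the description line carries "/key=value" tags such as
/gb=, /gi= and /ug=; that layout, the helper name parse_header and the
example header are all assumptions for illustration, so check them against
your actual Dr.seq.all file.

```python
import re

# Hypothetical helper (not part of Bioperl): pull identifier fields out of a
# FASTA description line, since the format itself does not designate them.
# ASSUMPTION: the header carries "/key=value" tags such as /gb=, /gi=, /ug=;
# adjust the pattern to whatever your file producer actually emits.
TAG = re.compile(r"/(\w+)=(\S+)")

def parse_header(line):
    """Return accession, version, gi and cluster where present, else None."""
    tags = dict(TAG.findall(line))
    # Split "ACCESSION.VERSION"; a missing version yields an empty string.
    accession, _, version = tags.get("gb", "").partition(".")
    return {
        "accession": accession or None,
        "version": int(version) if version.isdigit() else None,
        "gi": tags.get("gi"),
        "cluster": tags.get("ug"),
    }

# Made-up header, for illustration only.
example = ">gnl|UG|Dr#S0000001 Example clone /gb=AB000001.2 /gi=1234567 /ug=Dr.1 /len=500"
fields = parse_header(example)
```

In a real SeqProcessor the extracted values would be assigned to the
sequence object before it is handed on to the loader.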

>  what
> would be the most elegant/efficient way of loading all sequence-files
> for the UniGene build as well (normally provided in a FASTA-file called
> *.seq.all, Dr.seq.all in my case). And how to associate them with the
> cluster data (i.e. there are already entries in bioentry for all
> sequences, but they are missing the sequence data and most of their
> detail annotation, so this might be some kind of update).

See above for the format issue. As for automatically updating your 
sequences, use --lookup and possibly other update-related options for 
load_seqdatabase.pl (see its POD).

>
> 2nd question:
>
> What would be the best way of integrating BLAT/GMAP (same format as
> BLAT) results. I'm thinking about parsing the file and writing the
> mapping-results as an annotation into the database, linked to each
> probe-sequence. Data would include the hit(s) found for each probe,
> whether it hits more than one cluster and possibly some additional
> notes.
>
> From there I would write out a report or custom sequence file for use in
> other tools.
>
> If possible I would also like to accumulate annotations (like mapping
> against different UniGene builds over time).

I'm not sure exactly what your question is. Note that you can attach 
anything you like to sequences in the database, e.g., features and
annotations.

You can do so using Bioperl pretty easily. The sequence of steps is 
basically, 1) retrieve sequence object, 2) add annotation and/or 
features, 3) call $pseq->store(), and commit with $pseq->commit().

There are some pertinent code fragments in
http://www.open-bio.org/bosc2003/slides/Persistent_Bioperl_BOSC03.pdf

Let me know if this doesn't answer your question.

>
> 3rd question:
>
> Due to the fact that UniGene changes frequently, I would like to have
> some kind of versioning, so that I can keep old versions of UniGene as a
> backup and add new ones (i.e. not only keeping the mapping results but
> also keeping all the source sequences).
>
> If I understand it right, the load_seqdatabase script does not support
> this and has no (command-line) option for overriding the "database" name
> (i.e. for UniGene it will always be set to "UniGene" in biodatabase and
> thus overwrite old versions)?

Yes - the reason is that an instance of Bio::Cluster::Unigene will
default its namespace to 'UniGene' if none is provided by the caller,
and the Unigene parser doesn't provide one. load_seqdatabase itself
doesn't touch the namespace of the object if it's been set already.

I'm not quite happy with this myself, as it basically takes control
away from the user. I do think load_seqdatabase.pl's policy is correct,
but maybe the right thing for Bio::Cluster::Unigene to do is not to
default a non-mandatory value if none is provided. What if I just
propose making that change?

Regardless of this, what you can do before loading a new UniGene
version is rename the existing namespace to something that includes the
version. All entries will then be created fresh under the then-new
namespace 'UniGene'.
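
The rename itself amounts to a single UPDATE on the biodatabase table.
Below is a minimal sketch in Python, using an in-memory SQLite table as a
stand-in for a real BioSQL instance. The cut-down table definition, the
helper name archive_namespace and the label 'UniGene_build_old' are
assumptions for illustration only; in practice you would run the
equivalent statement against your own database and use the real build
number.

```python
import sqlite3

# Stand-in for a BioSQL instance: only the two columns needed here.
# ASSUMPTION: the real biodatabase table has more columns and lives in your
# production RDBMS; in-memory SQLite is used purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE biodatabase ("
    " biodatabase_id INTEGER PRIMARY KEY,"
    " name TEXT NOT NULL UNIQUE)"
)
conn.execute("INSERT INTO biodatabase (name) VALUES ('UniGene')")

def archive_namespace(conn, current, archived):
    """Rename namespace `current` to `archived`; return the rows changed."""
    cur = conn.execute(
        "UPDATE biodatabase SET name = ? WHERE name = ?", (archived, current)
    )
    conn.commit()
    return cur.rowcount

# 'UniGene_build_old' is a hypothetical label; use the real build number.
changed = archive_namespace(conn, "UniGene", "UniGene_build_old")
names = [row[0] for row in conn.execute("SELECT name FROM biodatabase")]
```

Because entries refer to their namespace through biodatabase_id rather than
by name, the existing entries stay attached to the renamed namespace, and
the next load creates a new 'UniGene' row.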

Note that source sequences do not change because UniGene changes - 
there will be new cluster members and other member sequences will be 
retired from the cluster, but their sequences only change if the 
respective GenBank sequence changes, which will not only increment the 
version but also lead to a new GI number, which basically means a new 
cluster member (as they are referenced by GI number).

>
> Do you see any fundamental problems here for versioning the data (except
> storage space)?

No, not at all.

Let me know if I didn't address your questions.

	-hilmar

>
> Thanks in advance.
>
> Links:
>
> ProbeLynx http://koch.pathogenomics.ca/probelynx/
> D.rerio UniGene: 
> http://www.ncbi.nlm.nih.gov/UniGene/UGOrg.cgi?TAXID=7955
>
>
> -- 
> Bye,
>
> Marc Saric
>
-- 
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------