[Bioperl-l] Packages retrieving online alignment sequences

Sun Aug 8 06:12:41 UTC 2010

On 7 August 2010 23:07, Chris Fields <cjfields at illinois.edu> wrote:

>
> A simpler method could be introduced, but I can see that being potentially
> brittle in the long run.  A naked alphanumeric string doesn't reveal much
> about what it is at face value w/o knowing database/service-specific
> behavior.  And then we're reliant on that behavior not changing, which we
> can't guarantee (this has bitten us in the past).  What would one do if NCBI
> (for instance) allowed accessions derived completely of digits, or
> conversely a unique ID with mixed alphanumerics?
>
> Using methods specific for ID/acc at least guarantees a behavior on the
> backend w/o guessing, and if there is no danger of overlap (a service
> accepts either/or) one could simply be an alias of the other.
>

Thanks for the clarification on IDs vs accessions. As long as the behavior
and distinction are well-documented, I'm sure it won't make too much of a
difference.

My main concern was just that having two similar methods -- with no clearly
laid out distinction between the two and one of them only supported by half
of the implementing subclasses -- might confuse potential users.

As a point of reference: both Rfam and Pfam allow either an ID or an
accession in their front-page search interface (http://www.pfam.org /
http://www.rfam.org/). In fact, they seem to entirely hide the distinction
between ID and Accession from the end user; nowhere on the Rfam page for an
individual result is it clear which string is the accession and which is the
ID (http://rfam.sanger.ac.uk/family/snoZ107_R87).

Thus, a potential user of the Rfam module wouldn't know whether to call the
get_by_ID or get_by_Accession method, even after looking at the Rfam page
for his / her desired alignment!

As you can probably tell, I'm all in favor of a unified search whenever
feasible / possible. :-)
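
To make the unified-search idea concrete, here is a minimal sketch in Python (not BioPerl code). Everything in it is an assumption for illustration: the FamilyStore class, its method names, the accession pattern, and the sample record are all made up. The pattern guess is only used to decide which lookup to try first; the fallback means a wrong guess costs one extra lookup rather than a failed search, which sidesteps the brittleness concern about inferring the type from the string alone.

```python
import re

# Hypothetical pattern for accessions such as "RF" or "PF" followed by
# five digits. Used only to pick the lookup order, never to reject a query.
ACCESSION_RE = re.compile(r"^(RF|PF)\d{5}$")


class FamilyStore:
    """Toy in-memory stand-in for a remote alignment database."""

    def __init__(self, records):
        # records: iterable of (accession, family_id, alignment) tuples
        records = list(records)
        self._by_acc = {acc: aln for acc, _fid, aln in records}
        self._by_id = {fid: aln for _acc, fid, aln in records}

    def get_by_accession(self, acc):
        return self._by_acc.get(acc)

    def get_by_id(self, family_id):
        return self._by_id.get(family_id)

    def get(self, query):
        """Unified lookup: try the likelier method first, then the other."""
        if ACCESSION_RE.match(query):
            order = (self.get_by_accession, self.get_by_id)
        else:
            order = (self.get_by_id, self.get_by_accession)
        for lookup in order:
            result = lookup(query)
            if result is not None:
                return result
        raise KeyError(query)


# Made-up record, purely for demonstration.
store = FamilyStore([("RF99999", "example_family", "ACGU-ACGU")])
same = store.get("RF99999") == store.get("example_family")
```

The specific get_by_accession / get_by_id methods stay available for callers who know which kind of string they hold, so the unified get is a convenience layer rather than a replacement.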

> As for writing up an adaptor to ensembl outside of its API, overall I
> don't think it's a bad idea, but if it's possible maybe start without
> reinventing things, then move to direct SQL.  Unless it's easier to use SQL.
>
>
For fetching Ensembl's gene family alignments, querying the database with SQL
directly will be easiest. The alignments don't tend to get unreasonably large
in terms of memory -- I think the biggest are ~700 sequences with a few
thousand alignment columns or so -- and it's a simple table join or two to get
both the tree and alignment from the database.
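
As a rough illustration of the "table join or two" approach, the sketch below uses Python's sqlite3 module with an in-memory database so that it runs standalone. The table and column names (member, family_member, aligned_seq) and the sample rows are hypothetical and are not the real Ensembl schema; the real database is also not SQLite. Only the shape of the query is the point.

```python
import sqlite3

# In-memory database with a hypothetical two-table layout.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE member (
        member_id INTEGER PRIMARY KEY,
        stable_id TEXT
    );
    CREATE TABLE family_member (
        family_id   INTEGER,
        member_id   INTEGER,
        aligned_seq TEXT
    );
    INSERT INTO member VALUES (1, 'geneA'), (2, 'geneB'), (3, 'geneC');
    INSERT INTO family_member VALUES
        (10, 1, 'AC-GT'),
        (10, 2, 'ACGGT'),
        (20, 3, 'TTTT');
    """
)


def fetch_family_alignment(conn, family_id):
    """Return {sequence name: aligned sequence} for one family."""
    rows = conn.execute(
        """
        SELECT m.stable_id, fm.aligned_seq
        FROM family_member AS fm
        JOIN member AS m ON m.member_id = fm.member_id
        WHERE fm.family_id = ?
        ORDER BY m.stable_id
        """,
        (family_id,),
    ).fetchall()
    return dict(rows)


alignment = fetch_family_alignment(conn, 10)
```

Loading the whole result into a dict is reasonable here only because family alignments are small; it would not be for genome-scale data.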

For genomic alignments, I agree that a more memory-efficient and/or lazy
backend would be necessary. And it's pretty much impossible to get those
alignments out of the Ensembl tables without using their API.
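
One possible reading of "lazy backend" is an iterator that fetches one window of the alignment at a time, so that only the block currently in use is held in memory. The Python sketch below shows that pattern only; iter_blocks and fake_fetch are hypothetical names, and fake_fetch stands in for whatever call into the real API would return a slice of the alignment.

```python
from typing import Callable, Iterator, Tuple


def iter_blocks(
    fetch_block: Callable[[int, int], str],
    start: int,
    end: int,
    block_size: int,
) -> Iterator[Tuple[int, int, str]]:
    """Yield (block_start, block_end, block) one window at a time.

    fetch_block is only called when the consumer asks for the next block.
    """
    pos = start
    while pos < end:
        block_end = min(pos + block_size, end)
        yield pos, block_end, fetch_block(pos, block_end)
        pos = block_end


calls = []  # records which windows were actually fetched


def fake_fetch(s: int, e: int) -> str:
    """Stand-in for a real backend call; returns placeholder columns."""
    calls.append((s, e))
    return "N" * (e - s)


blocks = iter_blocks(fake_fetch, 0, 250, 100)
first = next(blocks)  # triggers exactly one fetch
```

Because the generator does no work until it is advanced, a caller that stops after the first few blocks never pays for the rest of the region.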

--greg