[DAS] DAS and bacterial genomes

Thu Aug 19 18:32:31 UTC 2010

I do wonder if there are two somewhat-disjoint problems here:

   1. Defining exactly what was sequenced by a given project.

   2. Coming up with a scheme of globally-unique identifiers for genome
      assemblies which can be applied reasonably consistently.

(1) is hard.  I absolutely agree that this needs serious thought (and is
maybe a moving target right now anyway).

I'm going to suggest that (2) is sufficient to get DAS working fairly well,
and might be a rather easier situation.  Currently a coordinate system is
defined by up to four properties (quoting from Andy's draft spec):

   - The *category* or type of object. For example a chromosome, contig or
   protein sequence.
   - The *authority* responsible for defining the coordinate system. For
   example NCBI or UniProt.
   - The *version*, for coordinate systems containing entities that are not
   versioned (e.g. genomic assemblies).
   - The *species*, for coordinate systems containing only entities from a
   single organism.

The authority is the name of the organization "responsible" for the
assembly.  So if two different groups sequence the same organism, their
assemblies already get distinct coordinate systems (same species, different
authority), and we can unambiguously tell them apart.
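
As a rough illustration of that point (a sketch only, not anything taken from
the draft spec): if a coordinate system is treated as a plain tuple of the
four properties quoted above, two assemblies of the same organism from
different groups compare as different. The class name, the field layout, and
the authority names "GroupA" and "GroupB" are all made up for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CoordinateSystem:
    """Hypothetical model of the four properties quoted from the draft spec."""
    category: str                   # e.g. "chromosome", "contig"
    authority: str                  # organization responsible for the assembly
    version: Optional[str] = None   # only when the entities are not versioned
    species: Optional[str] = None   # only for single-organism systems


# Two made-up groups sequencing the same organism.
group_a = CoordinateSystem("chromosome", "GroupA", "1", "Plasmodium vivax")
group_b = CoordinateSystem("chromosome", "GroupB", "1", "Plasmodium vivax")

# Same category, version and species, but the authority differs,
# so the two coordinate systems are distinct identifiers.
print(group_a == group_b)
print(len({group_a, group_b}))
```

Because the dataclass is frozen, instances are hashable and can be used
directly as dictionary keys, for example to look up which annotation sources
are registered against a given coordinate system.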

One might hope that it would be reasonably straightforward to project
annotations between closely related strains via alignments?

Not saying that the full semantic "what did they sequence" isn't worth
solving...  but I'm not convinced it's needed to get DAS working.

                 Thomas.

On Thu, Aug 19, 2010 at 10:56 AM, Ewan Birney <birney at ebi.ac.uk> wrote:

>
> Just to repeat :
>
>  I always think this should be easy and then I get educated by Paul:
>
>  I think each time one thinks about "just moving it down a level" (e.g. to
> strain) there are cases in which two people have submitted assemblies with
> the same "strain tax id" that clearly aren't the same (e.g. there is a big
> insertion of something). The whole thing keeps moving down a notch.
>
>  The right thing here is to assign tracking identifiers to assembly series
> independently of the strain assignments, and track assemblies separately
> (but obviously with relationships) to strains.
>
>  Paul has met most (?all) of the use cases and understands this better than
> me. I think we should wait for Paul to weigh in here - it's just always a
> bit more complicated than you think ;)
>
>
>
>
> On 19 Aug 2010, at 00:10, Andy Jenkinson wrote:
>
>  On 18 Aug 2010, at 20:47, Adam Witney wrote:
>>
>>> As mentioned elsewhere in this thread, the problem of distinguishing an
>>>> individual from its taxon is not limited to bacteria. Does  the 1000 genomes
>>>> project use the assembly as a surrogate for isolate/individual?
>>>>
>>>
>>> No it's not unique to bacteria, indeed I notice that Plasmodium vivax is
>>> already in the DAS registry, does this not suffer the same problems? i.e.
>>> which strain are the coordinates referring to?
>>>
>>
>> So far a coordinate system always refers to the species or strain
>> identified by its taxonomy ID. As you say, strains DO have their own NCBI
>> taxonomy ID. It may be that this is not the case for a strain that someone
>> wants to annotate, but I have yet to see an actual example. There is the
>> wider question of how to handle individuals though. I can't comment on how
>> 1000 genomes do this as I've only seen these data expressed as variations
>> annotated upon the reference assembly, but my feeling is that if annotations
>> of an individual were needed then it could/would be done using the assembly
>> paradigm as a surrogate.
>> _______________________________________________
>> DAS mailing list
>> DAS at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/das
>>
>