[Bioperl-l] module for unflattening GenBank/EMBL/DDBJ records

Wed Jun 18 12:45:50 EDT 2003

Wow.  I'm glad that someone has finally bitten the bullet and done this.  I 
can't wait to put it through its paces!

Lincoln

On Tuesday 17 June 2003 11:33 pm, Chris Mungall wrote:
> I have a module (and corresponding .t test) ready to commit to bioperl.
> I'm including the pod docs below.
>
> It is called Bio::Tools::GenBankCollector - is this the correct namespace?
> good name?
>
> Should I commit this? Main trunk or branch?
>
> What is an acceptable size for a t/data file? Is ~400kb fine? Presumably I
> should do a "cvs add -kb" (it is a genbank record).
>
> The main impetus for writing this is to get legacy genbank data into the
> chado relational database in a useful form; it could of course be used for
> loading biosql, or dumping GFF3, or as a component in the building of full
> Bio::SeqFeature::Gene::* objects from GenBank.
>
> Here's the pod docs:
> ---------------------------------------------------------------
>
> =head1 NAME
>
> Bio::Tools::GenBankCollector - Unflattens a flat list of genbank-sourced
> features
>
> =head1 SYNOPSIS
>
>   # standard / generic use - unflatten a genbank record
>   use Bio::SeqIO;
>   use Bio::Tools::GenBankCollector;
>
>   # first fetch a genbank SeqI object
>   $seqio =
>     Bio::SeqIO->new(-file=>'AE003644.gbk',
>                     -format=>'GenBank');
>   $seq = $seqio->next_seq();
>
>   # generate a collector object
>   $collector = Bio::Tools::GenBankCollector->new;
>
>   # get top level unflattended SeqFeatureI objects
>   @top_sfs = $collector->unflatten_seq(-seq=>$seq);
>
>
> =head1 DESCRIPTION
>
> Most GenBank entries for annotated genomic DNA contain a B<flat> list
> of features. These features can be parsed into a flat list of
> Bio::SeqFeatureI objects using the standard Bio::SeqIO
> classes. However, it is often desirable to B<unflatten> this list into
> something resembling actual gene models, whereby genes, mRNAs and CDSs
> are linked according to the nature of the gene model.
>
> The BioPerl object model allows us to store these kind of associations
> in containment hierarchies (any SeqFeatureI object can contain nested
> SeqFeatureI objects). The Bio::Tools::GenBankCollector object
> facilitates construction of these hierarchies from the underlying
> GenBank flat-feature-list representation.
>
> For example, if you were to look at a typical GenBank DNA entry, say,
> B<AE003644>, you would see a flat list of features:
>
>   gene
>   mRNA CG4491-RA
>   CDS CG4491-PA
>   gene
>   tRNA tRNA-Pro
>   gene
>   mRNA CG32954-RA
>   mRNA CG32954-RC
>   mRNA CG32954-RB
>   CDS CG32954-PA
>   CDS CG32954-PB
>   CDS CG32954-PC
>
> (shown as [type . product-name] pairs)
>
> We would like to convert the above list into the B<containment
> hierarchy>, shown below:
>
>   gene
>     mRNA CG4491-RA
>       CDS CG4491-PA
>   gene
>     tRNA tRNA-Pro
>   gene
>     mRNA CG32954-RA
>       CDS CG32954-PA
>     mRNA CG32954-RC
>       CDS CG32954-PC
>     mRNA CG32954-RB
>       CDS CG32954-PB
>
> We do this using a call on a Bio::Tools::GenBankCollector object
>
>   @sfs = $collector->unflatten_seq(-seq=>$seq);
>
> This would return a list of the 'top level' (i.e. containing)
> SeqFeatureI objects - in this case, genes.
>
> The containment hierarchy can be accessed using the get_SeqFeature()
> call on any feature object - see L<Bio::SeqFeature::FeatureHolderI>
>
> Once you have built the hierarchy, you can do stuff like turn the
> features into rich feature objects (eg
> L<Bio::SeqFeature::Gene::GeneStructure) or convert to a suitable
> format such as GFF3 or chadoxml (after mapping to the Sequence
> Ontology); this step is not described here.
>
> Due to the quixotic nature of how features are stored in
> GenBank/EMBL/DDBJ, there is no guarantee that the default behaviour of
> this module will produce perfect results. Sometimes it is hard or
> impossible to build a correct containment hierarchy if the information
> provided is simply too lossy, as is often the cse. If you care deeply
> about your data, you should always manually inspect the resulting
> containment hierarchy; you may have to customise the algorithm for
> building the hierarchy, or even manually tweak the resulting
> hierarchy. This is explained in more detail below.
>
> However, if you are satisfied with the default behaviour, then you do
> not need to read any further.
>
> =head1 ALGORITHM
>
> This is the default algorithm; you should be able to override any part
> of it to customise.
>
> =head2 Partitioning into groups
>
> First of all the flat feature list is partitioned into B<group>s.
>
> The default way of doing this is to use the 'gene' attribute; if we
> look at two features from accession AE003644:
>
>      gene            20111..23268
>                      /gene="noc"
>                      /locus_tag="CG4491"
>                      /note="last curated on Thu Dec 13 16:51:32 PST 2001"
>                      /map="35B2-35B2"
>                      /db_xref="FLYBASE:FBgn0005771"
>      mRNA            join(20111..20584,20887..23268)
>                      /gene="noc"
>                      /locus_tag="CG4491"
>                      /product="CG4491-RA"
>                      /db_xref="FLYBASE:FBgn0005771"
>
> Both these features share the same /gene tag which is "noc", so they
> correspond to the same gene model (the CDS feature is not shown, but
> this also has a /gene="noc").
>
> Not all groups need to correspond to gene models, but this is the most
> common use case; later on we shall describe how to customise this.
>
> Sometimes other tags have to be used; for instance, if you look at the
> entire record for AE003644 you will see you actually need the use the
> /locus_tag attribute. This attribute is actually not present in most
> records!
>
> You can override this like this:
>
>   $collection->unflatten_seq(-seq=>$seq, group_tag=>'locus_tag');
>
> =head2 Resolving the containment mapping
>
> After the grouping is done, we end up with a list of groups which
> probably contain features of type 'gene', 'mRNA', 'CDS'.
>
> Each group is itself flat; we need to add an extra level of
> organisation. Usually this is because different spliceforms
> (represented by the 'mRNA' feature) can give rise to different
> translations (represented by the 'CDS' feature). We want to correctly
> associate mRNAs to CDSs.
>
> We want to go from a group like this:
>
>   [ gene mRNA mRNA mRNA CDS CDS CDS ]
>
> to a containment hierarchy like this:
>
>   gene
>     mRNA
>       CDS
>     mRNA
>       CDS
>     mRNA
>       CDS
>
> In which each CDS corresponds to its containing mRNA.
>
> How can we do this? The bad news is that there is no guaranteed way of
> doing this correctly for all of GenBank. Occasionally the submission
> will have been done in such a way as to reconstruct the containment
> hierarchy. However, this is not consistent across databank entries, so
> no generic solution can be provided witin bioperl. This module does
> provide the framework within which you can customise a solution for
> the particular dataset you are interested in - see later.
>
> The good news is that there is an inference we can do that should
> produce pretty good results most of the time. It uses splice
> coordinate data - this is the default behaviour of this module.
>
> =head2 Using splice site coordinates to infer containment
>
> If an mRNA is to be the container for a CDS, then the splice site
> coordinates of the CDS must fit inside the splice site coordinates of
> the mRNA.
>
> Ambiguities can still arise, but the results produced should still be
> reasonable and consistent at the sequence level. For example
>
>   mRNA    XXX---XX--XXXXXX--XXXX         join(1..3,7..8,11..16,19..23)
>   mRNA    XXX-------XXXXXX--XXXX         join(1..3,11..16,19..23)
>   CDS                 XXXX--XX           join(13..16,19..20)
>   CDS                 XXXX--XX           join(13..16,19..20)
>
> [obviously the positions have been scaled down]
>
> We cannot unambiguously match mRNA with CDS based on splice sites,
> since both CDS share the splice site locations 16^17 and
> 18^19. However, the consequences of making a wrong match are probably
> not that severe. Any annotation data attached to the first CDS is
> probably identical to the seconds CDS, other than identifiers.
>
> The default behaviour of this module is to make an arbitrary call
> where it is ambiguous (the mapping will always be bijective).
>
> [NOTE: not tested on EMBL data, which may not be bijective; ie two
> mRNAs can share the same CDS??]
>
> Of course, if you are dealing with an organism with no alternate
> splicing, you have nothing to worry about here! There is no ambiguity
> possible, so you will always get a tree that looks like this:
>
>   gene
>     mRNA
>       CDS
>
> =head1 ADVANCED
>
> =head2 Customising the grouping of features
>
> The default behaviour is suited mostly to building models of protein
> coding genes and noncoding genes from genbank genomic DNA submissions.
>
> You can change the tag used to partition the feature by passing in a
> different group_tag argument - see the unflatten_seq() method
>
> Other behaviour may be desirable. For example, even though SNPs
> (features of type 'variation' in GenBank) are not actually part of the
> gene model, it may be desirable to group SNPs that overlap or are
> nearby gene models.
>
> It should certainly be possible to extend this module to do
> this. However, I have yet to do this!
>
> In the meantime, you could write your own grouping subroutine, and
> feed the results into unflatten_groups() [see the method documentation
> below]
>
>
> =head2 Customising the resolution of the containment hierarchy
>
> Once the flat list of features has been partitioned into groups, the
> method unflatten_group() is called on each group to build a tree.
>
> The algorithm for doing this is described above; ambiguities are
> resolved by using splice coordinates. As discussed, this can be
> ambiguous.
>
> Some submissions may contain information in tags/attributes that hint
> as to the mapping that needs to be made between the features.
>
> For example, with the Drosophila Melanogaster release 3 submission, we
> see that CDS features in alternately spliced mRNAs have a form like
> this:
>
>      CDS             join(145588..145686,145752..146156,146227..146493)
>                      /locus_tag="CG32954"
>                      /note="CG32954 gene product from transcript
> CG32954-RA" ^^^^^^^^^^^^^^^^^^^^^^^^^^^ /codon_start=1
>                      /product="CG32954-PA"
>                      /protein_id="AAF53403.1"
>                      /db_xref="GI:7298167"
>                      /db_xref="FLYBASE:FBgn0052954"
>                     
> /translation="MSFTLTNKNVIFVAGLGGIGLDTSKELLKRDLKNLVILDRIENP..."
>
> Here the /note tag provides the clue we need to link CDS to mRNA
> (highlighted with ^^^^). We just need to find the mRNA with the tag
>
>   /product="CG32954-RA"
>
> I have no idea how consistent this practice is across submissions; it
> is consistent for the fruitfly genome submission.
>
> We can customise the behaviour of unflatten_group() by providing our
> own resolver method. This obviously requires a bit of extra
> programming, but there is no way to get around this.
>
> Here is an example of how to pass in your own resolver; this example
> basically checks the parent (container) /product tag to see if it
> matches the required string in the child (contained) /note tag.
>
>        $collector->unflatten_seq(-seq=>$seq,
>                                  -group_tag=>'locus_tag',
>                                  -resolver_method=>sub {
>                                      my $self = shift;
>                                      my ($sf, @candidate_container_sfs) =
> @_; if ($sf->has_tag('note')) {
>                                          my @notes =
> $sf->get_tag_values('note'); my @trnames = map {/from transcript\s+(.*)/;
> $1} @notes; @trnames = grep {$_} @trnames; my $trname;
>                                          if (@trnames == 0) {
>                                              $self->throw("UNRESOLVABLE");
>                                          }
>                                          elsif (@trnames == 1) {
>                                              $trname = $trnames[0];
>                                          }
>                                          else {
>                                              $self->throw("AMBIGUOUS:
> @trnames"); }
>                                          my @container_sfs =
>                                            grep {
>                                                my ($product) =
>                                                  $_->has_tag('product') ?
>                                                   
> $_->get_tag_values('product') : ('');
>                                                $product eq $trname;
>                                            } @candidate_container_sfs;
>                                          if (@container_sfs == 0) {
>                                              $self->throw("UNRESOLVABLE");
>                                          }
>                                          elsif (@container_sfs == 1) {
>                                              # we got it!
>                                              return $container_sfs[0];
>                                          }
>                                          else {
>                                              $self->throw("AMBIGUOUS");
>                                          }
>                                      }
>                                  });
>
> the resolver method is only called when there is more than one spliceform.
>
> =head2 Parsing mRNA records
>
> Some of the entries in sequence databanks are for mRNA sequences as
> well as genomic DNA. We may want to build models from these too.
>
> NOT YET DONE - IN PROGRESS!!!
>
> Open question - what would these look like?
>
> =cut
>
> =head2 seq
>
>  Title   : seq
>  Usage   : $obj->seq($newval)
>  Function:
>  Example :
>  Returns : value of seq (a Bio::SeqI)
>  Args    : on set, new value (a Bio::SeqI, optional)
>
> The Bio::SeqI object should hold a flat list of Bio::SeqFeatureI
> objects; this is the list that will be unflattened.
>
> =cut
>
> =head2 group_tag
>
>  Title   : group_tag
>  Usage   : $obj->group_tag($newval)
>  Function:
>  Example :
>  Returns : value of group_tag (a scalar)
>  Args    : on set, new value (a scalar or undef, optional)
>
> This is the tag that will be used to collect elements from the flat
> feature list into groups; for instance, if we look at two typical
> GenBank features:
>
>      gene            20111..23268
>                      /gene="noc"
>                      /locus_tag="CG4491"
>                      /note="last curated on Thu Dec 13 16:51:32 PST 2001"
>                      /map="35B2-35B2"
>                      /db_xref="FLYBASE:FBgn0005771"
>      mRNA            join(20111..20584,20887..23268)
>                      /gene="noc"
>                      /locus_tag="CG4491"
>                      /product="CG4491-RA"
>                      /db_xref="FLYBASE:FBgn0005771"
>
> We can see that these comprise the same gene model because they share
> the same /gene attribute; we want to collect these together in groups.
>
> Setting group_tag is optional. The default is to use 'gene'. In the
> example above, we could also use /locus_tag
>
> =cut
>
>
> =head2 unflatten_seq
>
>  Title   : unflatten_seq
>  Usage   : @sfs = $collector->unflatten_seq($seq);
>  Function: turns a flat list of features into a list of holder features
>  Example :
>  Returns : list of Bio::SeqFeatureI objects
>  Args    : see below
>
> Arguments
>
>   -seq   :          a Bio::SeqI object; must contain Bio::SeqFeatureI
> objects
>
>   -resolver_method: a CODE reference
>                     see the documentation above for an example of
>                     a subroutine that can be used to resolve hierarchies
>                     within groups
>
>   -group_tag:       a string
>                     [ see the group_tag() method ]
>                     this overrides the default group_tag which is 'gene'
>
> =cut
>
>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at portal.open-bio.org
> http://portal.open-bio.org/mailman/listinfo/bioperl-l

-- 
========================================================================
Lincoln D. Stein                           Cold Spring Harbor Laboratory
lstein at cshl.org			                  Cold Spring Harbor, NY
========================================================================