[Biojava-l] Extract non-gene regions

Thu Apr 24 04:09:46 UTC 2008

On Thu, 24 Apr 2008, Mark Schreiber wrote:

> Hi Florian -
>
> There are at least two approaches. You are on the right track with
> making a union of all gene locations.  The compound location that
> results from the Union will contain all the nucleotides that are
> coding. You can then iterate through each nucleotide in the genome and
> find out if the union contains the nucleotide. If it doesn't then it
> is non coding.  This is surprisingly rapid as the comparisons are
> simple.  The pseudo code would be something like...
>
> // initialize this by making a union of all locations of CDS or Gene
> // Features.
> RichLocation coding;
>
> RichSequence genome; // read from file or database
>
> // you might need to be a bit more sophisticated for a circular genome
> for(int i = 1; i <= genome.length(); i++){
>     if( ! genome.contains(i){
>          //you have a non-coding nucleotide.
>     }
> }

Typo? I think the test should be on coding rather than genome (and the
parenthesis needs closing):

  if (!coding.contains(i)) {
    // you have a non-coding nucleotide.
  }
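
For anyone who wants to try the idea without a BioJava install, here is
a minimal library-free sketch of the same per-nucleotide scan. It is
only an illustration under stated assumptions: the class
NonCodingScan, the method nonCodingPositions and the int[]{start, end}
representation of a block are all made up for this example, and a
java.util.BitSet stands in for the coding.contains(i) test on the union
location. Coordinates are 1-based and inclusive, and the genome is
treated as linear.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical helper, not part of BioJava.
class NonCodingScan {

    /** Returns every 1-based position in 1..genomeLength that no block covers. */
    static List<Integer> nonCodingPositions(int[][] codingBlocks, int genomeLength) {
        // The BitSet plays the role of the union of all gene locations.
        BitSet coding = new BitSet(genomeLength + 1);
        for (int[] block : codingBlocks) {
            // BitSet.set(from, to) excludes 'to', so add 1 to keep the end inclusive.
            coding.set(block[0], block[1] + 1);
        }

        List<Integer> result = new ArrayList<>();
        for (int i = 1; i <= genomeLength; i++) {
            if (!coding.get(i)) {
                result.add(i); // a non-coding nucleotide
            }
        }
        return result;
    }

    public static void main(String[] args) {
        int[][] blocks = { {3, 5}, {8, 9} };
        System.out.println(nonCodingPositions(blocks, 10)); // prints [1, 2, 6, 7, 10]
    }
}
```

With real data the blocks would come from the gene features rather than
a hard-coded array.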

> The other approach is to use the blockIterator() method of the
> compound location that results from the union of coding sequences.
> This will output each contiguous chunk of coding sequence. If you know
> the length of the sequence then you can rapidly figure out the
> intervening pieces.
>
> For example, if the block iterator tells you that [10..50], [90..100],
> [350..380] are coding and you know the genome is of length 400 then
> you can quickly derive [1..9], [51..89], [101..349] and [381..400] are
> non-coding.  Again it is more complicated for circular sequences and
> more complex if you consider the opposite strand of a gene (the gene
> shadow) to be non-coding. Unfortunately there is no convenience method
> to do this but if you code something up it would be great to put it in
> the cookbook so others can re-use it.
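
The gap arithmetic in the example above can be sketched roughly like
this. Again this is plain Java, not BioJava API: NonCodingBlocks,
complement and format are hypothetical names, and blocks are modelled
as int[]{start, end} pairs. The sketch assumes the blocks arrive sorted
by start position, that the sequence is linear, and that strand is
ignored, so the circular and gene-shadow cases are not handled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of BioJava.
class NonCodingBlocks {

    /** Returns the stretches of 1..genomeLength that lie between the given blocks. */
    static List<int[]> complement(List<int[]> codingBlocks, int genomeLength) {
        List<int[]> gaps = new ArrayList<>();
        int next = 1; // first position not yet accounted for
        for (int[] block : codingBlocks) {
            if (block[0] > next) {
                gaps.add(new int[] { next, block[0] - 1 });
            }
            // max() keeps the cursor correct if two blocks happen to overlap
            next = Math.max(next, block[1] + 1);
        }
        if (next <= genomeLength) {
            gaps.add(new int[] { next, genomeLength });
        }
        return gaps;
    }

    /** Formats blocks as "[a..b], [c..d]". */
    static String format(List<int[]> blocks) {
        StringBuilder sb = new StringBuilder();
        for (int[] b : blocks) {
            if (sb.length() > 0) {
                sb.append(", ");
            }
            sb.append('[').append(b[0]).append("..").append(b[1]).append(']');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<int[]> coding = Arrays.asList(
                new int[] { 10, 50 }, new int[] { 90, 100 }, new int[] { 350, 380 });
        // prints [1..9], [51..89], [101..349], [381..400]
        System.out.println(format(complement(coding, 400)));
    }
}
```

This touches each block once, so the cost grows with the number of
genes rather than with the length of the genome.
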
>
> - Mark
>
> You could actually make point locations of all the non-coding
> nucleotides and then merge the whole lot at the end into a compound
> location of non-coding regions.
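
The merging step might look something like the sketch below, which
collapses sorted point positions into contiguous runs. It is only an
illustration in plain Java: MergePoints and mergeRuns are hypothetical
names, positions are assumed to be sorted in ascending order, and the
resulting int[]{start, end} pairs stand in for the blocks of a compound
location.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of BioJava.
class MergePoints {

    /** Collapses ascending positions into inclusive {start, end} runs. */
    static List<int[]> mergeRuns(int[] sortedPositions) {
        List<int[]> runs = new ArrayList<>();
        if (sortedPositions.length == 0) {
            return runs;
        }
        int start = sortedPositions[0];
        int prev = start;
        for (int k = 1; k < sortedPositions.length; k++) {
            int p = sortedPositions[k];
            if (p > prev + 1) {
                // a gap: close the current run and open a new one
                runs.add(new int[] { start, prev });
                start = p;
            }
            prev = p;
        }
        runs.add(new int[] { start, prev }); // close the final run
        return runs;
    }

    public static void main(String[] args) {
        for (int[] run : mergeRuns(new int[] { 1, 2, 6, 7, 10 })) {
            System.out.println(run[0] + ".." + run[1]);
        }
    }
}
```

For the positions 1, 2, 6, 7, 10 this yields the runs 1..2, 6..7 and
10..10.
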
>
> On Wed, Apr 23, 2008 at 9:49 PM, Florian Schatz <mail at florianschatz.de> wrote:
> > Hello,
> >
> >  I am new to BioJava and worked a lot with it in the last few weeks. I hope
> > this is the right place for questions, if not please tell me.
> >
> >  I want to get the nucleotide sequence outside the genes of a GenBank file.
> > So everything that is not marked by a 'gene' feature.  Unfortunately, there
> > is no subtract or exclude function for the Location class. Any hints?
> >
> >  Btw: union() of location worked fine for extracting nucleotides of the genes
> > only.
> >
> >  Best,
> >  Florian
> >  _______________________________________________
> >  Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> >  http://lists.open-bio.org/mailman/listinfo/biojava-l
> >