[Biojava-l] Extract non-gene regions

Mark Schreiber markjschreiber at gmail.com
Thu Apr 24 02:29:12 UTC 2008


Hi Florian -

There are at least two approaches. You are on the right track with
making a union of all gene locations.  The compound location that
results from the Union will contain all the nucleotides that are
coding. You can then iterate through each nucleotide in the genome and
find out if the union contains the nucleotide. If it doesn't then it
is non-coding.  This is surprisingly rapid as the comparisons are
simple.  The pseudocode would be something like:

RichLocation coding; // initialize this by making a union of all
                     // locations of CDS or Gene features

RichSequence genome; // read from file or database

// you might need to be a bit more sophisticated for a circular genome
for (int i = 1; i <= genome.length(); i++) {
    if (!coding.contains(i)) {
        // you have a non-coding nucleotide
    }
}

The other approach is to use the blockIterator() method of the
compound location that results from the union of coding sequences.
This will output each contiguous chunk of coding sequence. If you know
the length of the sequence then you can rapidly figure out the
intervening pieces.

For example, if the block iterator tells you that [10..50], [90..100],
[350..380] are coding and you know the genome is of length 400 then
you can quickly derive [1..9], [51..89], [101..349] and [381..400] are
non-coding.  Again it is more complicated for circular sequences and
more complex if you consider the opposite strand of a gene (the gene
shadow) to be non-coding. Unfortunately there is no convenience method
to do this but if you code something up it would be great to put it in
the cookbook so others can re-use it.
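
As a starting point, the gap arithmetic in the example above could be
sketched as follows. This is plain Java with no BioJava dependency: the
class name NonCodingGaps, the method gaps() and the int[]{start, end}
representation are all made up for illustration, and it assumes a linear
sequence with blocks that are already sorted and non-overlapping (what you
would expect from a merged union). In practice you would fill the list from
the min and max of each block that blockIterator() returns.

```java
import java.util.ArrayList;
import java.util.List;

public class NonCodingGaps {

    /**
     * Returns the 1-based, inclusive [start, end] intervals that are not
     * covered by any coding block. Assumes the blocks are sorted by start
     * and do not overlap, and that the sequence is linear.
     */
    public static List<int[]> gaps(List<int[]> codingBlocks, int sequenceLength) {
        List<int[]> result = new ArrayList<>();
        int nextUncovered = 1; // first position not yet accounted for
        for (int[] block : codingBlocks) {
            if (block[0] > nextUncovered) {
                // everything between the previous block and this one is a gap
                result.add(new int[] {nextUncovered, block[0] - 1});
            }
            nextUncovered = Math.max(nextUncovered, block[1] + 1);
        }
        if (nextUncovered <= sequenceLength) {
            // trailing gap after the last coding block
            result.add(new int[] {nextUncovered, sequenceLength});
        }
        return result;
    }
}
```

Called with the blocks [10..50], [90..100], [350..380] and a length of 400,
this yields the four non-coding intervals listed above. It does not handle
circular sequences or strand.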

- Mark

You could actually make point locations of all the non-coding
nucleotides and then merge the whole lot at the end into a compound
location of non-coding sequence.
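
A rough sketch of that merging step, again in plain Java rather than with
BioJava location objects: it walks the sequence once and collapses
consecutive non-coding positions into [start, end] runs as it goes. The
class NonCodingRuns and the isCoding predicate are hypothetical names; with
BioJava the predicate would presumably be something like
pos -> coding.contains(pos), and each run could then be turned into a
location.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

public class NonCodingRuns {

    /**
     * Scans positions 1..sequenceLength and returns the 1-based, inclusive
     * [start, end] runs of positions for which isCoding is false.
     * Assumes a linear sequence.
     */
    public static List<int[]> nonCodingRuns(IntPredicate isCoding, int sequenceLength) {
        List<int[]> runs = new ArrayList<>();
        int runStart = -1; // -1 means no non-coding run is currently open
        for (int pos = 1; pos <= sequenceLength; pos++) {
            if (!isCoding.test(pos)) {
                if (runStart < 0) {
                    runStart = pos; // open a new non-coding run
                }
            } else if (runStart >= 0) {
                runs.add(new int[] {runStart, pos - 1}); // close the open run
                runStart = -1;
            }
        }
        if (runStart >= 0) {
            runs.add(new int[] {runStart, sequenceLength}); // run reaches the end
        }
        return runs;
    }
}
```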

On Wed, Apr 23, 2008 at 9:49 PM, Florian Schatz <mail at florianschatz.de> wrote:
> Hello,
>
>  I am new to biojava and have worked with it a lot in the last few weeks. I hope
> this is the right place for questions; if not, please tell me.
>
>  I want to get the nucleotide sequence outside the genes of a GenBank file,
> that is, everything that is not marked by a 'gene' feature.  Unfortunately, there
> is no subtract or exclude function for the Location class. Any hints?
>
>  Btw: union() of locations worked fine for extracting the nucleotides of the genes
> only.
>
>  Best,
>  Florian
>  _______________________________________________
>  Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>  http://lists.open-bio.org/mailman/listinfo/biojava-l
>


