[Biojava-l] Extract non-gene regions

Thu Apr 24 12:09:24 UTC 2008

Hello,

I tried that, but it is as slow as a version operating on Strings.
However, I created a Cookbook entry:
http://biojava.org/wiki/BioJava:Cookbook:Sequence:ExtractGeneRegions

Is there a better way to get a Sequence from a SymbolList than:

Sequence newsequence = DNATools.createDNASequence(symbolL.seqString(), "New Sequence");

Best,
Florian

On 24.04.2008 at 04:29, Mark Schreiber wrote:
> Hi Florian -
>
> There are at least two approaches. You are on the right track with
> making a union of all gene locations.  The compound location that
> results from the Union will contain all the nucleotides that are
> coding. You can then iterate through each nucleotide in the genome and
> find out if the union contains the nucleotide. If it doesn't then it
> is non coding.  This is surprisingly rapid as the comparisons are
> simple.  The pseudo code would be something like...
>
> RichLocation coding; // initialize this by making a union of all
>                      // locations of CDS or Gene features
>
> RichSequence genome; // read from file or database
>
> // you might need to be a bit more sophisticated for a circular genome
> for (int i = 1; i <= genome.length(); i++) {
>     if (!coding.contains(i)) {
>         // you have a non-coding nucleotide
>     }
> }
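The per-position scan in the pseudo code above can be tried out without BioJava at all. The following is only a minimal sketch in plain Java: coding regions are assumed to be given as inclusive, 1-based [start, end] pairs, and the class and method names (NonCodingScan, nonCodingPositions) are made up for illustration, not part of any library.

```java
import java.util.ArrayList;
import java.util.List;

class NonCodingScan {
    /**
     * Returns every 1-based position in 1..genomeLength that is not covered
     * by any of the inclusive [start, end] coding ranges.
     * Assumes a linear (non-circular) genome.
     */
    static List<Integer> nonCodingPositions(int[][] coding, int genomeLength) {
        boolean[] covered = new boolean[genomeLength + 1]; // index 0 is unused
        for (int[] range : coding) {
            int from = Math.max(1, range[0]);
            int to = Math.min(genomeLength, range[1]);
            for (int p = from; p <= to; p++) {
                covered[p] = true; // mark the union of all coding ranges
            }
        }
        List<Integer> result = new ArrayList<>();
        for (int i = 1; i <= genomeLength; i++) {
            if (!covered[i]) {
                result.add(i); // a non-coding nucleotide
            }
        }
        return result;
    }
}
```

For example, with coding ranges [3..5] and [8..9] on a genome of length 10, the call returns the positions 1, 2, 6, 7 and 10.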
>
> The other approach is to use the blockIterator() method of the
> compound location that results from the union of coding sequences.
> This will output each contiguous chunk of coding sequence. If you know
> the length of the sequence then you can rapidly figure out the
> intervening pieces.
>
> For example, if the block iterator tells you that [10..50], [90..100],
> [350..380] are coding and you know the genome is of length 400 then
> you can quickly derive [1..9], [51..89], [101..349] and [381..400] are
> non-coding.  Again it is more complicated for circular sequences and
> more complex if you consider the opposite strand of a gene (the gene
> shadow) to be non-coding. Unfortunately there is no convenience method
> to do this but if you code something up it would be great to put it in
> the cookbook so others can re-use it.
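The gap calculation described above can be sketched as follows. This is plain Java rather than BioJava code: the blocks are assumed to be inclusive, 1-based [start, end] pairs (such as the ones a block iterator would report), the genome is assumed to be linear, and the names NonCodingGaps and gaps are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class NonCodingGaps {
    /**
     * Given coding blocks as inclusive [start, end] pairs, returns the
     * intervening non-coding blocks within 1..genomeLength.
     * Overlapping or unsorted input blocks are tolerated.
     */
    static List<int[]> gaps(int[][] codingBlocks, int genomeLength) {
        int[][] sorted = codingBlocks.clone();
        Arrays.sort(sorted, Comparator.comparingInt(b -> b[0])); // order by start
        List<int[]> result = new ArrayList<>();
        int next = 1; // first position not yet known to be coding
        for (int[] block : sorted) {
            if (block[0] > next) {
                result.add(new int[] {next, block[0] - 1}); // gap before this block
            }
            next = Math.max(next, block[1] + 1);
        }
        if (next <= genomeLength) {
            result.add(new int[] {next, genomeLength}); // trailing gap
        }
        return result;
    }
}
```

Called with the blocks [10..50], [90..100], [350..380] and a genome length of 400, this yields [1..9], [51..89], [101..349] and [381..400], matching the example in the text. Circular sequences and strand handling are deliberately left out.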
>
> - Mark
>
> You could actually make point locations of all the non-coding
> nucleotides and then merge the whole lot at the end into a compound
> location of non-coding nucleotides.
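Merging single positions into contiguous blocks, as suggested above, might look roughly like this. Again a plain-Java sketch rather than BioJava's own merging: the positions are assumed to be 1-based and sorted in ascending order, and PointMerge and mergePoints are illustrative names only.

```java
import java.util.ArrayList;
import java.util.List;

class PointMerge {
    /**
     * Collapses ascending positions into inclusive [start, end] blocks,
     * so that 1, 2, 3, 7, 8 becomes [1..3], [7..8].
     */
    static List<int[]> mergePoints(List<Integer> sortedPositions) {
        List<int[]> blocks = new ArrayList<>();
        for (int p : sortedPositions) {
            int[] last = blocks.isEmpty() ? null : blocks.get(blocks.size() - 1);
            if (last != null && last[1] >= p - 1) {
                last[1] = p; // adjacent (or duplicate) position: extend the block
            } else {
                blocks.add(new int[] {p, p}); // start a new block
            }
        }
        return blocks;
    }
}
```

Feeding it the positions 1, 2, 6, 7, 10 gives the blocks [1..2], [6..7] and [10..10].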
>
> On Wed, Apr 23, 2008 at 9:49 PM, Florian Schatz  
> <mail at florianschatz.de> wrote:
>> Hello,
>>
>>  I am new to BioJava and have worked with it a lot in the last few weeks.
>> I hope this is the right place for questions; if not, please tell me.
>>
>>  I want to get the nucleotide sequence outside the genes of a GenBank file,
>> so everything that is not marked by a 'gene' feature. Unfortunately, there
>> is no subtract or exclude function for the Location class. Any hints?
>>
>>  Btw: union() of Location worked fine for extracting the nucleotides of
>> the genes only.
>>
>>  Best,
>>  Florian
>>  _______________________________________________
>>  Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>  http://lists.open-bio.org/mailman/listinfo/biojava-l
>>