[Biojava-l] Extract non-gene regions

Mark Schreiber markjschreiber at gmail.com
Thu Apr 24 12:47:59 UTC 2008


Hi -

While Sequences and SymbolLists offer many advantages over Strings or
character arrays, speed is not one of them.

You can create a Sequence using the SequenceFactory implementations,
which are much more efficient than converting to a String and back to
symbols again; that round trip is a very expensive operation. From memory,
SimpleRichSequence may even have a constructor that takes a SymbolList
and a name. There should be no need to convert to a String and back.

Also, do you need a Sequence when a SymbolList may contain all the
information you need?

Finally, the Edit operations you use in your wiki example will cause
quite a big performance hit; your comment seems to allude to this. It
would be better to collect all the non-coding points (i), compile
them into a compound location, and then extract the SymbolList for that
location all in one go.

- Mark
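
As a rough illustration of the "collect the points, then extract in one go" idea, here is a minimal sketch in plain Java. It deliberately avoids the BioJava API so that it stays self-contained: the class and method names (OnePassExtraction, mergePoints, extract) are made up for this example, a String stands in for the SymbolList, and the positions are assumed to be 1-based and strictly ascending. In BioJava the merged ranges would instead become one compound location.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of BioJava.
final class OnePassExtraction {

    /** Merges strictly ascending 1-based positions into contiguous {start, end} ranges. */
    static List<int[]> mergePoints(List<Integer> positions) {
        List<int[]> ranges = new ArrayList<>();
        for (int p : positions) {
            int[] last = ranges.isEmpty() ? null : ranges.get(ranges.size() - 1);
            if (last != null && p == last[1] + 1) {
                last[1] = p;                   // p directly follows the current range: extend it
            } else {
                ranges.add(new int[] {p, p});  // gap before p: start a new range
            }
        }
        return ranges;
    }

    /** Copies every range into one buffer, in order, with no intermediate edits. */
    static String extract(String sequence, List<int[]> ranges) {
        StringBuilder out = new StringBuilder();
        for (int[] r : ranges) {
            // 1-based inclusive [start, end] maps to 0-based half-open [start - 1, end)
            out.append(sequence, r[0] - 1, r[1]);
        }
        return out.toString();
    }
}
```

The point of the design is that the sequence is read once and the result is built once, rather than being modified by a series of edits.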

On Thu, Apr 24, 2008 at 8:09 PM, Florian Schatz <mail at florianschatz.de> wrote:
> Hello,
>
>  I tried that, but it is as slow as a version operating on Strings. However, I
> created a Cookbook entry:
>  http://biojava.org/wiki/BioJava:Cookbook:Sequence:ExtractGeneRegions
>
>  Is there a better way to get a Sequence from a SymbolList than:
>
>  Sequence newsequence = DNATools.createDNASequence(symbolL.seqString(), "New Sequence");
>
>
>  Best,
>  Florian
>
>  On 24.04.2008 at 04:29, Mark Schreiber wrote:
>
> > Hi Florian -
> >
> > There are at least two approaches. You are on the right track with
> > making a union of all gene locations.  The compound location that
> > results from the Union will contain all the nucleotides that are
> > coding. You can then iterate through each nucleotide in the genome and
> > find out if the union contains the nucleotide. If it doesn't then it
> > is non coding.  This is surprisingly rapid as the comparisons are
> > simple.  The pseudo code would be something like...
> >
> > RichLocation coding; // initialize this by making a union of all
> >                      // locations of CDS or Gene features
> >
> > RichSequence genome; // read from file or database
> >
> > // you might need to be a bit more sophisticated for a circular genome
> > for (int i = 1; i <= genome.length(); i++) {
> >    if (!coding.contains(i)) {
> >         // you have a non-coding nucleotide
> >    }
> > }
> >
> > The other approach is to use the blockIterator() method of the
> > compound location that results from the union of coding sequences.
> > This will output each contiguous chunk of coding sequence. If you know
> > the length of the sequence then you can rapidly figure out the
> > intervening pieces.
> >
> > For example, if the block iterator tells you that [10..50], [90..100],
> > [350..380] are coding and you know the genome is of length 400 then
> > you can quickly derive [1..9], [51..89], [101..349] and [381..400] are
> > non-coding.  Again it is more complicated for circular sequences and
> > more complex if you consider the opposite strand of a gene (the gene
> > shadow) to be non-coding. Unfortunately there is no convenience method
> > to do this but if you code something up it would be great to put it in
> > the cookbook so others can re-use it.
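
The block arithmetic in the example above can be sketched in plain Java. This is only a sketch under stated assumptions: the genome is linear, the coding blocks arrive sorted by start position (as one would hope from a block iterator over a union, though that is an assumption here), and coordinates are 1-based and inclusive. The names NonCodingBlocks and complement are invented for the illustration and are not BioJava API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of BioJava.
final class NonCodingBlocks {

    /**
     * Returns the stretches not covered by any coding block, as {start, end}
     * pairs (1-based, inclusive). Assumes a linear genome and blocks sorted by start.
     */
    static List<int[]> complement(List<int[]> codingBlocks, int genomeLength) {
        List<int[]> nonCoding = new ArrayList<>();
        int nextFree = 1; // first position not yet covered by a coding block
        for (int[] block : codingBlocks) {
            if (block[0] > nextFree) {
                // everything between the previous block and this one is non-coding
                nonCoding.add(new int[] {nextFree, block[0] - 1});
            }
            nextFree = Math.max(nextFree, block[1] + 1);
        }
        if (nextFree <= genomeLength) {
            nonCoding.add(new int[] {nextFree, genomeLength}); // tail after the last block
        }
        return nonCoding;
    }
}
```

Called with the blocks [10..50], [90..100], [350..380] and a genome length of 400, this yields the four non-coding blocks listed above. A circular genome or strand-aware handling would need extra logic that this sketch does not attempt.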
> >
> > - Mark
> >
> > You could actually make point locations of all the non-coding
> > nucleotides and then merge the whole lot at the end into a compound
> > location of non-coding nucleotides.
> >
> > On Wed, Apr 23, 2008 at 9:49 PM, Florian Schatz <mail at florianschatz.de> wrote:
> >
> > > Hello,
> > >
> > >  I am new to biojava and worked a lot with it in the last few weeks. I hope
> > > this is the right place for questions; if not, please tell me.
> > >
> > >  I want to get the nucleotide sequence outside the genes of a GenBank file,
> > > so everything that is not marked by a 'gene' feature. Unfortunately, there
> > > is no subtract or exclude function for the Location class. Any hints?
> > >
> > >  Btw: union() of Location worked fine for extracting the nucleotides of the
> > > genes only.
> > >
> > >  Best,
> > >  Florian
> > >  _______________________________________________
> > >  Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> > >  http://lists.open-bio.org/mailman/listinfo/biojava-l
> > >
> > >
> >
>
>
>


