[Biojava-l] Off Topic: Solutions to evenly overlapping substring problem??
stefan.grzybek at novartis.com
Tue Nov 4 13:51:52 UTC 2008
Hi Mark and David,
I have a slightly different approach:
n = L / N; minimum number of bp in a single interval
o = (l-n) / (N-1); additional overlap possibility (bp) for each interval -
maybe there is a better description for this?
The overlap between consecutive intervals is then (l-n+o). You can start with
the interval 1 to l and, for each iteration, advance the start point by l
minus that overlap, i.e. by (n-o).
It might be that the last segment needs to be a bit longer or shorter than
For your example, n = 1076 ("floor"), o = 158 ("floor"), (l-n+o) = 1582.
#segment, start, end, overlap with previous segment
1 s: 1 e: 2500 - overlap: 0
2 s: 919 e: 3418 - overlap: 1582
3 s: 1837 e: 4336 - overlap: 1582
4 s: 2755 e: 5254 - overlap: 1582
5 s: 3673 e: 6172 - overlap: 1582
6 s: 4591 e: 7090 - overlap: 1582
7 s: 5509 e: 8008 - overlap: 1582
8 s: 6427 e: 8926 - overlap: 1582
9 s: 7345 e: 9844 - overlap: 1582
10 s: 8263 e: 10762 - overlap: 1582
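
A minimal Python sketch of the calculation above, for illustration only. The function name `floor_overlap_segments` and the tuple layout are assumptions made for this example; the arithmetic follows the n, o and (l-n+o) definitions given earlier, with both divisions floored.

```python
def floor_overlap_segments(L, l, N):
    """Return (start, end, overlap_with_previous) tuples, 1-based inclusive.

    L: total sequence length, l: fragment length, N: number of fragments.
    Hypothetical helper; assumes N >= 2 and floors both divisions.
    """
    if N < 2:
        raise ValueError("need at least two fragments")
    n = L // N                 # minimum bp per interval ("floor")
    o = (l - n) // (N - 1)     # additional overlap per interval ("floor")
    overlap = l - n + o        # overlap between consecutive fragments
    step = l - overlap         # start-point increment, equal to n - o
    segments = []
    for i in range(N):
        start = 1 + i * step
        end = start + l - 1
        segments.append((start, end, 0 if i == 0 else overlap))
    return segments


if __name__ == "__main__":
    for idx, (s, e, ov) in enumerate(floor_overlap_segments(10763, 2500, 10), 1):
        print(f"{idx} s: {s} e: {e} - overlap: {ov}")
```

With L = 10763, l = 2500 and N = 10 the last fragment ends at 10762, one base short of the sequence end, so the final fragment has to absorb the remainder.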
This approach seems to work reasonably well also with l=1500, l=3500,
"Mark Schreiber" <markjschreiber at gmail.com>
Sent by: biojava-l-bounces at lists.open-bio.org
holland at eaglegenomics.com
"biojava-l at biojava.org" <biojava-l at biojava.org>
Re: [Biojava-l] Off Topic: Solutions to evenly overlapping substring
This is what I thought as well but if you use that number to generate
the sub strings it doesn't work. The value 1585 works (with one
character left over). I'm not sure how to make that into a
generalizable formula though.
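
One possible generalization, offered as a sketch rather than a settled answer: N fragments of length l with a constant overlap O span N*l - (N-1)*O bases, so covering L exactly needs O = (N*l - L) / (N-1). For L = 10763, l = 2500, N = 10 that is 14237 / 9, roughly 1581.9, which is not an integer, so no single overlap value fits exactly. The Python below spreads the remainder by flooring each start position instead; the names `even_fragments` and `overlaps` are hypothetical.

```python
def even_fragments(L, l, N):
    """Return N (start, end) pairs, 1-based inclusive, each of length l.

    The first fragment starts at 1 and the last ends at L. Start positions
    are spaced as evenly as integer arithmetic allows, so neighbouring
    overlaps differ by at most one base. Hypothetical helper.
    """
    if N < 2 or l > L or N * l < L:
        raise ValueError("fragments cannot cover the sequence")
    slack = L - l  # total distance the start point has to travel
    fragments = []
    for i in range(N):
        start = 1 + (i * slack) // (N - 1)
        fragments.append((start, start + l - 1))
    return fragments


def overlaps(fragments):
    """Overlap (in bases) between each fragment and the one after it."""
    return [prev_end - start + 1
            for (_, prev_end), (start, _) in zip(fragments, fragments[1:])]


if __name__ == "__main__":
    frags = even_fragments(10763, 2500, 10)
    print(frags[0], frags[-1])
    print(overlaps(frags))
```

For the 10763 bp example this gives eight overlaps of 1582 and a final overlap of 1581, with the last fragment ending exactly at 10763.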
On Tue, Nov 4, 2008 at 6:49 PM, Richard Holland
<holland at eaglegenomics.com> wrote:
> It's a maths problem.
> Length of total sequence = L
> Number of overlapping sequences required = N
> Number of overlaps required = N-1
> Length of each overlapping sequence required = S
> Offset for each overlapping sequence = length of one non-overlapping
> sequence = L/(N-1) = X
> Overlap = O = S - X
> In your case this gives:
> X = 10763 / (10-1)
> = 10763 / 9
> = 1196 (rounded up)
> O = 2500 - 1196
> = 1304
> So you would start at the beginning, take S bases, then move along X
> bases and take the next S, and so on... your first sequence would be
> 1..2500, your second would be 1197..3696, your third would be
> 2393..4892, etc. etc., and each one would then overlap the next by
> 2008/11/4 Mark Schreiber <markjschreiber at gmail.com>:
>> Hi -
>> Does anyone know how to solve this problem?
>> I have a piece of DNA which is 10763 bp long. I want to divide this up
>> evenly into 10 fragments each of 2500bp in length. What is the overlap
>> required between each fragment?
>> Or more generally, for a sequence of length L, how much overlap O is
>> required to generate N fragments of length l (where N and l are fixed)?
>> A solution would be most appreciated. Extra points for coding it in
>> biojava and posting it on the cookbook!!
>> - Mark
>> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> Richard Holland, BSc MBCS
> Finance Director, Eagle Genomics Ltd
> M: +44 7500 438846 | E: holland at eaglegenomics.com