# [Biojava-l] Off Topic: Solutions to evenly overlapping substring problem??

stefan.grzybek at novartis.com stefan.grzybek at novartis.com
Tue Nov 4 13:51:52 UTC 2008

```
Hi Mark and David,

I have a slightly different approach:

n = L / N; minimum number of bp in a single interval
o = (l-n) / (N-1); additional overlap (bp) available for each interval -
maybe there is a better description for this?

Then you can start with the interval 1 to l and, for each iteration, add
the step (n-o) to the start point, so that consecutive intervals overlap by
(l-n+o).
It might be that the last segment needs to be a bit longer or shorter than
l.

For your example, n = 1076 ("floor"), o = 158 ("floor"), (l-n+o) = 1582.

#segment, start, end, overlap with previous segment
1  s: 1     e: 2500  - overlap: 0
2  s: 919   e: 3418  - overlap: 1582
3  s: 1837  e: 4336  - overlap: 1582
4  s: 2755  e: 5254  - overlap: 1582
5  s: 3673  e: 6172  - overlap: 1582
6  s: 4591  e: 7090  - overlap: 1582
7  s: 5509  e: 8008  - overlap: 1582
8  s: 6427  e: 8926  - overlap: 1582
9  s: 7345  e: 9844  - overlap: 1582
10 s: 8263  e: 10762 - overlap: 1582

This approach seems to work reasonably well also with l=1500, l=3500,
l=4500.
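
# A minimal Python sketch of the scheme described above (not BioJava).
# The function name and signature are illustrative assumptions, not taken
# from this thread. It assumes floor division, N >= 2, a step of (n - o)
# between start points, and 1-based inclusive coordinates.
def overlapping_intervals(L, N, l):
    """Return (start, end, overlap_with_previous) for N intervals of length l."""
    n = L // N                # minimum number of bp per interval (floor)
    o = (l - n) // (N - 1)    # additional overlap per interval (floor)
    step = n - o              # distance between consecutive start points
    overlap = l - step        # same value as l - n + o
    return [(1 + i * step, i * step + l, overlap if i else 0)
            for i in range(N)]

# Prints one line per segment for the example L=10763, N=10, l=2500.
for i, (s, e, ov) in enumerate(overlapping_intervals(10763, 10, 2500), 1):
    print(f"{i} s: {s} e: {e} - overlap: {ov}")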

Best regards,
Stefan

"Mark Schreiber" <markjschreiber at gmail.com>
Sent by: biojava-l-bounces at lists.open-bio.org
04-11-2008 13:58

To
holland at eaglegenomics.com
cc
"biojava-l at biojava.org" <biojava-l at biojava.org>
Subject
Re: [Biojava-l] Off Topic: Solutions to evenly overlapping substring problem??

Hi -

This is what I thought as well but if you use that number to generate
the sub strings it doesn't work. The value 1585 works (with one
character left over). I'm not sure how to make that into a
generalizable formula though.

- Mark

On Tue, Nov 4, 2008 at 6:49 PM, Richard Holland
<holland at eaglegenomics.com> wrote:
> It's a maths problem.
>
> Length of total sequence = L
>
> Number of overlapping sequences required = N
>
> Number of overlaps required = N-1
>
> Length of each overlapping sequence required =S
>
> Offset for each overlapping sequence = length of one non-overlapping
> sequence = L/(N-1) = X
>
> Overlap = O = S - X
>
> In your case this gives:
>
> X = 10763 / (10-1)
> = 10763  / 9
> = 1196 (rounded up)
> O = 2500 - 1196
> = 1304
>
> So you would start at the beginning, take S bases, then move along X
> bases and take the next S, and so on... your first sequence would be
> 2393..4893, etc. etc., and each one would then overlap the next by
> 1304.
>
> cheers,
> Richard
>
>
> 2008/11/4 Mark Schreiber <markjschreiber at gmail.com>:
>> Hi -
>>
>> Does anyone know how to solve this problem?
>>
>> I have a piece of DNA which is 10763 bp long. I want to divide this up
>> evenly into 10 fragments each of 2500bp in length. What is the overlap
>> required between each fragment?
>>
>> Or more generally, for a sequence of length L, how much overlap O is
>> required to generate N fragments of length l (where N and l are fixed)?
>>
>> A solution would be most appreciated. Extra points for coding it in
>> biojava and posting it on the cookbook!!
>>
>> - Mark
>> _______________________________________________
>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>
>
>
>
> --
> Richard Holland, BSc MBCS
> Finance Director, Eagle Genomics Ltd
> M: +44 7500 438846 | E: holland at eaglegenomics.com
> http://www.eaglegenomics.com/
>
