[BioPython] [Fwd: Advice on optimum data structure for billion long list?]
Mark blobby Robinson
m.1.robinson@herts.ac.uk
Wed, 16 May 2001 11:23:46 +0100
Hey Brad,
Thanks for taking the time to think about my problem. As it turns out,
my current implementation is pretty much exactly as you suggested. I
have been finding somewhere in the region of 260 million or so different
combinations, and the only reason I haven't found more is cos I can't
handle more than that yet. I am starting to get the feeling I am
attempting too much, and am going to have to compromise and filter some
candidate patterns out at an earlier stage and hope I don't lose any I
am interested in.
Thanks again
Blobby
Brad Chapman wrote:
> Hey Blobby;
>
>> I am building a program that is pattern searching in DNA sequences and
>> generating a list of combinations of 3 patterns that meet certain
>> criteria. My problem is that this list could potentially get as large as
>> ~1.4 billion entries. Now originally I was using a dictionary with the
>> key as a 15-character string (the patterns concatenated) and the value simply a
>> count of the number of hits for that pattern combination.
>
>
> Just a random idea that popped into my head, but is it possible that
> most of the combinations of the 3 patterns are never actually found?
> I'm not sure if this would be the case for your particular problem
> without knowing anything about it, but if it is, a potential solution
> presented in "The Quick Python Book" by Harms and McDonald is
> sparse matrices.
>
> If you think of the 3 patterns as making up a three dimensional
> matrix, you could encode this matrix in a Python dictionary using
> tuples for keys, like:
>
> pattern_dict[("pattern 1", "pattern 2", "pattern 3")] = hit_count
>
> You would only add a pattern to the dictionary if it ever matches, and
> has a hit_count bigger than zero. If most elements are zero, then this
> might reduce the size of the dictionary you have to deal with to
> something smaller and more manageable.
>
> Hope this might help some.
> Brad
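
A minimal sketch of the sparse-dictionary idea described above, assuming the search step yields one 3-tuple of pattern strings per hit. The function name count_combinations and the example patterns are hypothetical, made up for illustration; only pattern_dict and the tuple-key layout come from the message itself.

```python
from collections import defaultdict


def count_combinations(hits):
    """Count how often each (pattern1, pattern2, pattern3) combination occurs.

    `hits` is any iterable of 3-item sequences of pattern strings. A key is
    created only for combinations that are actually seen, so combinations
    with zero hits take up no space.
    """
    counts = defaultdict(int)
    for combo in hits:
        counts[tuple(combo)] += 1
    # Return a plain dict so that a later lookup of an unseen key
    # cannot silently insert it (a defaultdict would).
    return dict(counts)


# Hypothetical 5-base patterns, invented for this example.
observed = [
    ("ACGTA", "TTGCA", "GGATC"),
    ("ACGTA", "TTGCA", "GGATC"),
    ("CCCTA", "TTGCA", "AATGC"),
]

pattern_dict = count_combinations(observed)

# Read counts with .get() so that unseen combinations report zero.
seen_count = pattern_dict.get(("ACGTA", "TTGCA", "GGATC"), 0)
unseen_count = pattern_dict.get(("AAAAA", "AAAAA", "AAAAA"), 0)
```

This only saves memory when most combinations never occur. If hundreds of millions of distinct combinations do occur, the dictionary still has to hold one entry per combination.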