[Bioperl-l] Assembly package and phredPhrap tools
Robson Francisco de Souza
rfsouza@citri.iq.usp.br
Mon, 5 Nov 2001 13:06:13 -0200 (BRST)
Hi!
Hello everyone, I have just subscribed to this mailing list and I
would like to ask a few questions and share some thoughts.
I have been working on a Perl module to load information from
phrap .ace files, phd files and some other things in the universe of
phredPhrap's data. Although I haven't coded it following bioperl's
programming model, nor used bioperl's objects in its implementation, I
would like to move my module to bioperl's approach. I was actually
thinking of merging my code with Chad's code, but I believe that is going
to be hard, so I would like to hear from you (all of you,
and especially Chad) first.
In my implementation, the .ace file information describing an assembly
is represented as a tree-like data structure:
(PACKAGE Assembly):
  assembly (HASH reference):
    files (HASH reference):
      ace_file (SCALAR = ALL.fasta.screen.ace.1)
    number_of_contigs (SCALAR = 521)
    total_number_of_reads (SCALAR = 188362)
  contigs (ARRAY reference):
    0 (SCALAR = )
    1 (HASH reference):
      consensus (SCALAR = aggggcnnnctattatcgatccctctgtaaacacxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
      length (SCALAR = 804)
      number_of_reads (SCALAR = 1)
      number_of_segments (SCALAR = 1)
      orientation (SCALAR = U)
      quality (SCALAR = 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
      reads (HASH reference):
        A0QR5701B11.b (HASH reference):
          align_clipping_end (SCALAR = 804)
          align_clipping_start (SCALAR = 741)
          end (SCALAR = 804)
          length (SCALAR = 804)
          number_of_read_info_items (SCALAR = 0)
          number_of_tags (SCALAR = 1)
          orientation (SCALAR = U)
          padded_end (SCALAR = 804)
          padded_start (SCALAR = 1)
          qual_clipping_end (SCALAR = -1)
          qual_clipping_start (SCALAR = -1)
          sequence (SCALAR = aggggcnnnctattatcgatccctctgtaaacacxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
          start (SCALAR = 1)
    2 (HASH reference):
      consensus (SCALAR = gcggggtattatgatxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxttgtgggttcttggtcagctcgct
      length (SCALAR = 932)
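To make the shape of this tree concrete, here is a minimal sketch of how such a structure could be built and walked with plain Perl references. It is only an illustration, not the module's actual code: the nesting of the top-level keys is inferred from the dump, most fields are left out, and the helper name read_names is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical miniature of the dumped tree; most fields omitted for brevity.
# Assumption: 'assembly' and 'contigs' are sibling keys of the object.
my $asm = {
    assembly => {
        files                 => { ace_file => 'ALL.fasta.screen.ace.1' },
        number_of_contigs     => 521,
        total_number_of_reads => 188362,
    },
    contigs => [
        undef,    # slot 0 is empty in the dump, so contig numbers match indices
        {
            length          => 804,
            number_of_reads => 1,
            orientation     => 'U',
            reads           => {
                'A0QR5701B11.b' => {
                    start        => 1,
                    end          => 804,
                    padded_start => 1,
                    padded_end   => 804,
                    orientation  => 'U',
                },
            },
        },
    ],
};

# Return the sorted names of all reads in contig number $n
# (an empty list if that contig does not exist).
sub read_names {
    my ($tree, $n) = @_;
    my $contig = $tree->{contigs}[$n] or return;
    return sort keys %{ $contig->{reads} };
}

print join(', ', read_names($asm, 1)), "\n";    # prints A0QR5701B11.b
```

Every leaf here is a plain scalar, so a sequence is just a string and carries no methods of its own.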
As you can see, all sequences are stored as strings (the same is true for
quality values). Now, I was thinking: if I change this representation to
bioperl objects, what would it look like? Or, more generally, what is the
best way to represent DNA sequence assembly data in the bioperl
framework? I thought that storing contigs as UnivAln objects and
contig data in tables might be a good idea.
Anyway, I would like to know what you are doing on this
subject. I can send you my code anytime so that you can see what I have done
and how it could help. Most importantly, some of the methods I implemented
in my module reproduce consed's functions, like finding LCQs or single-stranded
regions, and they could be used by an assembly module.
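As a rough illustration of the kind of method meant here, the sketch below scans a space-separated quality string, like the one stored per contig, and reports runs of positions below a cutoff. It assumes that an LCQ means a low consensus quality region, and it is neither consed's algorithm nor the code from my module; the function name low_quality_regions and the cutoff of 20 are hypothetical.

```perl
use strict;
use warnings;

# Find runs of consecutive positions whose quality is below $threshold.
# Returns a list of [start, end] pairs, 1-based and inclusive.
sub low_quality_regions {
    my ($quality_string, $threshold) = @_;
    my @q = split ' ', $quality_string;
    my (@regions, $start);
    for my $i (0 .. $#q) {
        if ($q[$i] < $threshold) {
            $start = $i + 1 unless defined $start;    # open a new run
        }
        elsif (defined $start) {
            push @regions, [$start, $i];    # run ended at the previous position
            undef $start;
        }
    }
    # Close a run that extends to the last position.
    push @regions, [$start, scalar @q] if defined $start;
    return @regions;
}

my @regions = low_quality_regions('0 0 30 40 10 5 50', 20);
print join(' ', map { "$_->[0]-$_->[1]" } @regions), "\n";    # prints 1-2 5-6
```

Single-stranded regions could be found with the same kind of scan, using per-position strand coverage computed from the reads' start, end and orientation instead of the quality values.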
Well, I hope this starts a discussion :).
Best regards,
Robson