[Biojava-dev] Fwd: Assembly data reading

Mon Jul 20 20:11:52 UTC 2009

Hi Paolo,

Not sure if you got a response to your mail off list. If there is
sufficient interest from the people working on processing the output
of the various sequencers, it would be great if those people would
work together to get a new biojava module started. Most probably
somebody needs to take initiative and lead the development, otherwise
it won't happen.

Cheers,
Andreas

On Tue, Jul 14, 2009 at 9:08 AM, Paolo Pavan <paolo.pavan at gmail.com> wrote:
> Dear all,
> I spent a day on a quick search to get a clearer picture of the
> situation.
> • I found the specification of the .sff file format in the 454
> instrument manual; it is fully described and seems sufficient to
> build a reader.
> • However, on a more careful reading it seems that a *.sff file
> carries no information about the automatic contig assembly and only
> stores flowgram data for the individual reads (unlike a *.ace file).
> • Two hidden binary files can be found in a 454 gsAssembler project
> folder: .ChordMatrixMetadata and .SeqCacheMetadata. They are not
> described in the manual, but the former seems to contain nucleotide
> data and the latter read names. They are big enough to contain that
> kind of data; the problem is that we don't know how to parse them.
> • We need to decide on an in-memory structure in which to store the
> information read. I agree with the "memory mapping" solution, perhaps
> implemented with a Map object that associates the name of each read
> with its location in the file.
> • The parser class should then expose methods to:
>        1) iterate through the reads, although this may be expensive and
> should be avoidable
>        2) access a read sequence by name
> • If the parser should handle the assembled contigs too (which depends
> on solving the problem in the third bullet point), it should
> expose methods to:
>        1) iterate through contig names
>        2) iterate through contig consensus sequences
>        3) access a consensus sequence by name (a sub-problem of point 2)
>        4) access arbitrary aligned portions ("slices") of the assembly,
> given start and end positions, returning an alignment object
> • Any more suggestions?
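
The accessor methods listed above could be sketched roughly as the interfaces below. This is only an illustration under stated assumptions: `ReadSource`, `ContigSource`, `InMemoryContigSource`, and every method name are hypothetical and are not part of BioJava. Slice coordinates are assumed to be 1-based and inclusive, and the slice returns a plain consensus substring rather than the alignment object the list asks for.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical read-level API (illustrative names, not BioJava's). */
interface ReadSource {
    Iterator<String> readNames();        // iterate through reads, by name
    String readSequence(String name);    // access a read sequence by name
}

/** Hypothetical contig-level API covering the contig points in the list. */
interface ContigSource {
    Iterator<String> contigNames();
    String consensus(String contigName);
    /** Consensus slice; start and end are assumed 1-based and inclusive. */
    String consensusSlice(String contigName, int start, int end);
}

/** Toy in-memory implementation, used only to exercise the interface. */
final class InMemoryContigSource implements ContigSource {
    private final Map<String, String> consensusByName = new LinkedHashMap<>();

    void add(String name, String consensus) {
        consensusByName.put(name, consensus);
    }

    @Override
    public Iterator<String> contigNames() {
        return consensusByName.keySet().iterator();
    }

    @Override
    public String consensus(String contigName) {
        String s = consensusByName.get(contigName);
        if (s == null) {
            throw new IllegalArgumentException("Unknown contig: " + contigName);
        }
        return s;
    }

    @Override
    public String consensusSlice(String contigName, int start, int end) {
        String s = consensus(contigName);
        if (start < 1 || end > s.length() || start > end) {
            throw new IndexOutOfBoundsException("Bad slice " + start + ".." + end);
        }
        // Convert 1-based inclusive coordinates to substring's 0-based half-open range.
        return s.substring(start - 1, end);
    }
}
```

A file-backed implementation would keep the same interface but look names up in an offset index instead of holding every sequence in a map.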
> I would be glad to get involved in the biojava community through this
> project and I could give it a try, but first of all I should say that
> I'm not a guru like most of the people here ( :-p ). To tell the truth,
> the job my company has given me is a different one, and if a
> workaround exists I should honestly choose it.
> So let me think a bit about starting such an adventure. If I can
> combine my job with contributing to the community's growth, I'll be
> happy to share my work! Any suggestions are welcome.
>
> Bye bye,
> Paolo
>
>
> 2009/7/13 Mark Schreiber <markjschreiber at gmail.com>:
>> I would agree that there is a strong need for this kind of thing in biojava.
>>
>> As Richard says you probably can't fit it in memory so you may want to
>> memory map it. There are classes in the java.nio package that can help a
>> lot with this.
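
For reference, memory-mapping a file with the standard `java.nio` classes might look like the sketch below. `MappedFileDemo` and its methods are hypothetical names, and the temp file with ASCII bases stands in for real sequencer output. A single mapping is limited to 2 GB, so a large assembly file would need several mapped regions.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class MappedFileDemo {

    /** Maps the whole file read-only. Files over 2 GB need several mappings. */
    static MappedByteBuffer mapReadOnly(Path file) {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            // The mapping remains valid after the channel is closed.
            return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Copies length bytes starting at an absolute offset out of the mapping. */
    static byte[] slice(MappedByteBuffer buf, int offset, int length) {
        byte[] out = new byte[length];
        for (int i = 0; i < length; i++) {
            out[i] = buf.get(offset + i); // absolute get, does not move the position
        }
        return out;
    }

    /** Writes a small temp file and reads four bases back through the mapping. */
    static String demo() {
        try {
            Path tmp = Files.createTempFile("reads", ".txt");
            tmp.toFile().deleteOnExit();
            Files.write(tmp, "ACGTACGTTTGA".getBytes(StandardCharsets.US_ASCII));
            MappedByteBuffer buf = mapReadOnly(tmp);
            return new String(slice(buf, 4, 4), StandardCharsets.US_ASCII);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The operating system pages the mapped data in and out on demand, so only the regions actually touched occupy physical memory.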
>>
>> Also I have had some success with in-memory compression of large files using
>> LZ compression. Essentially the memory representation of the file is LZ
>> compressed and compression and decompression are handled on the fly. Again
>> there are Java utility classes that can help.
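
One way to try the in-memory compression idea is with `java.util.zip.Deflater` and `Inflater`, which implement DEFLATE (LZ77 plus Huffman coding). This is a guess at the kind of utility class meant here, not necessarily the one Mark used. `CompressedSequence` is a hypothetical name, and sequences are assumed to be plain ASCII.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Holds a sequence compressed in memory and inflates it on request. */
final class CompressedSequence {
    private final byte[] compressed;
    private final int originalLength;

    CompressedSequence(String sequence) {
        byte[] raw = sequence.getBytes(StandardCharsets.US_ASCII);
        originalLength = raw.length;

        Deflater deflater = new Deflater(Deflater.BEST_SPEED);
        deflater.setInput(raw);
        deflater.finish();

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(chunk);
            out.write(chunk, 0, n);
        }
        deflater.end(); // release native resources
        compressed = out.toByteArray();
    }

    /** Decompresses on the fly; the uncompressed copy is not retained. */
    String sequence() {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        byte[] raw = new byte[originalLength];
        try {
            int off = 0;
            while (off < raw.length && !inflater.finished()) {
                int n = inflater.inflate(raw, off, raw.length - off);
                if (n == 0) {
                    break; // no progress possible, avoid spinning
                }
                off += n;
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("Corrupt compressed data", e);
        } finally {
            inflater.end();
        }
        return new String(raw, StandardCharsets.US_ASCII);
    }

    int compressedSize() {
        return compressed.length;
    }
}
```

The trade-off is CPU time on every access in exchange for a smaller heap footprint; how much is saved depends on how repetitive the data is.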
>>
>> - Mark
>>
>> On Mon, Jul 13, 2009 at 1:20 PM, Richard Holland <holland at eaglegenomics.com>
>> wrote:
>>>
>>> Nothing within BJ can parse the 454 .sff files directly. However I think
>>> there is a growing need for it so if anyone is willing to contribute
>>> code, it would be very welcome.
>>>
>>> There is also no .ace parser, although in 2007 someone volunteered to
>>> write one but nothing happened, and there was a previous post (many
>>> years ago!) from someone else who already had some working code but
>>> again nothing seems to have happened:
>>>
>>> http://portal.open-bio.org/pipermail/biojava-l/2001-June/001283.html
>>> http://lists.open-bio.org/pipermail/biojava-l/2007-July/005900.html
>>>
>>> So to start with, someone (perhaps yourself? that would be nice! :) )
>>> needs to volunteer to write either a .ace or .sff parser, or both.
>>>
>>> The thing to bear in mind with 454 contigs as you rightly point out is
>>> the sheer size of the things. The requirement to keep them entirely in
>>> memory is likely to be unworkable as it would leave little room for
>>> anything else to run on your average machine. I would suggest either
>>> memory-mapping the file itself, or parsing and writing out a
>>> memory-mapped summary file containing the bits of data you're interested
>>> in. (Memory-mapping is where you keep an index in memory indicating
>>> where in the file each record is, so that when you need to access them
>>> you load them on-the-fly from the file and drop them out of memory again
>>> immediately after use. An accelerated form of this is to put the loaded
>>> records into some kind of LRU cache which holds only the most recently
>>> accessed records and then check that cache first to see if you've
>>> already loaded the record before accessing the file directly.)
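
The scheme described above (an in-memory index of record positions, loading on demand, and an LRU cache checked first) can be sketched as follows. This is a minimal sketch under assumptions: `IndexedRecordStore`, `register`, `ofRecords`, and `diskLoads` are hypothetical names, and records are assumed to be ASCII with a known offset and length. It uses `RandomAccessFile` seeks rather than an OS-level mapping, and builds the LRU cache from a `LinkedHashMap` in access order.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

final class IndexedRecordStore {
    private final RandomAccessFile file;
    // Record name -> {offset in file, length in bytes}.
    private final Map<String, long[]> index = new HashMap<>();
    private final Map<String, String> cache;
    int diskLoads = 0; // counts file reads, for demonstration only

    IndexedRecordStore(RandomAccessFile file, final int cacheSize) {
        this.file = file;
        // accessOrder = true makes iteration order least-recently-used first.
        this.cache = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > cacheSize;
            }
        };
    }

    void register(String name, long offset, int length) {
        index.put(name, new long[] {offset, length});
    }

    /** Checks the cache first, then falls back to reading from the file. */
    String get(String name) {
        String hit = cache.get(name);
        if (hit != null) {
            return hit;
        }
        long[] loc = index.get(name);
        if (loc == null) {
            throw new IllegalArgumentException("Unknown record: " + name);
        }
        try {
            byte[] buf = new byte[(int) loc[1]];
            file.seek(loc[0]);
            file.readFully(buf);
            diskLoads++;
            String record = new String(buf, StandardCharsets.US_ASCII);
            cache.put(name, record);
            return record;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Demo helper: writes records to a temp file, indexing them as it goes. */
    static IndexedRecordStore ofRecords(Map<String, String> records, int cacheSize) {
        try {
            File tmp = File.createTempFile("records", ".dat");
            tmp.deleteOnExit();
            RandomAccessFile raf = new RandomAccessFile(tmp, "rw");
            IndexedRecordStore store = new IndexedRecordStore(raf, cacheSize);
            for (Map.Entry<String, String> rec : records.entrySet()) {
                byte[] bytes = rec.getValue().getBytes(StandardCharsets.US_ASCII);
                store.register(rec.getKey(), raf.getFilePointer(), bytes.length);
                raf.write(bytes);
            }
            return store;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Only the index and the few cached records stay on the heap; a real .ace or .sff reader would fill the index during a single pass over the file.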
>>>
>>> cheers,
>>> Richard
>>>
>>>
>>> On Sun, 2009-07-12 at 23:41 +0200, Paolo Pavan wrote:
>>> > Hi,
>>> > I would like to repost, with some adjustments, a question I asked some
>>> > time ago, because this may be the more appropriate list; apologies for
>>> > the repetition.
>>> > Can someone kindly give me some advice?
>>> >
>>> > thank you in advance,
>>> > Paolo
>>> >
>>> >
>>> > ---------- Forwarded message ----------
>>> > From: Paolo Pavan <paolo.pavan at gmail.com>
>>> > Date: 2009/7/9
>>> > Subject: Assembly data reading
>>> > To: Biojava-l at lists.open-bio.org
>>> >
>>> >
>>> > Hi everybody,
>>> > I'm almost new to this topic. I would like to know if there is
>>> > something that can help me load data from a large 454 contig into my
>>> > Java program. I also need to keep in memory and access the data from
>>> > the single reads forming the contig.
>>> > I suppose this information is in a *.sff file; if it is not
>>> > possible to load such a file, it would be fine to load a *.ace (phrap)
>>> > data file, which I also have.
>>> > Many thanks for any suggestion you can give me!
>>> >
>>> > Greetings,
>>> > Paolo
>>> > _______________________________________________
>>> > biojava-dev mailing list
>>> > biojava-dev at lists.open-bio.org
>>> > http://lists.open-bio.org/mailman/listinfo/biojava-dev
>>> --
>>> Richard Holland, BSc MBCS
>>> Operations and Delivery Director, Eagle Genomics Ltd
>>> T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
>>> http://www.eaglegenomics.com/
>>>
>>
>>
>
>