[BioRuby] [Wg-phyloinformatics] Update on phyloXML support for BioRuby project

Thu May 21 04:02:11 UTC 2009

Hi:

Thanks for the detailed replies by Hilmar and Chris!
I think it is a very good idea to keep very large trees in mind, and
possibly to implement a solution that loads only the requested nodes
into memory (as described by Hilmar and Chris) if there is enough time
left at the end of the project.
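
To make the "load only the requested nodes" idea concrete, here is a rough
Ruby sketch of the start-end file points approach. Everything in it is
hypothetical (the LazyClade class, the index_clades helper, the regex
standing in for a real streaming XML reader), and it assumes plain
<clade> tags without attributes. For brevity the indexing pass reads the
whole input once; a real implementation would presumably index while
streaming.

```ruby
require 'stringio'

# Hypothetical lazy node: it remembers only where its <clade> element sits
# in the file and reads that slice when asked.
class LazyClade
  attr_reader :start_pos, :end_pos

  def initialize(io, start_pos, end_pos)
    @io = io
    @start_pos = start_pos
    @end_pos = end_pos
  end

  # Fetch the raw XML for this clade from the underlying IO on demand.
  def raw_xml
    @io.seek(@start_pos)
    @io.read(@end_pos - @start_pos)
  end
end

# Light indexing pass: record start/end byte offsets for every clade.
# Toy shortcut: the text is read once here and then discarded; only the
# integer offsets are kept in the returned nodes.
def index_clades(io)
  io.rewind
  text = io.read.b          # binary copy, so match offsets are byte offsets
  open_starts = []          # start offsets of clades not yet closed
  nodes = []
  text.scan(%r{<clade>|</clade>}) do
    m = Regexp.last_match
    if m[0] == '<clade>'
      open_starts.push(m.begin(0))
    else
      nodes << LazyClade.new(io, open_starts.pop, m.end(0))
    end
  end
  nodes                     # ordered by closing tag, innermost first
end

io = StringIO.new('<clade><name>root</name><clade><name>A</name></clade></clade>')
nodes = index_clades(io)
puts nodes.first.raw_xml
```

With a File in place of the StringIO, each node would cost two integers
until raw_xml is called. Mutable nodes are the hard part, as Chris notes,
since any edit would invalidate the stored offsets.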

Re "It's tricky with re: to a number of aspects, but it can be done.  For  
instance, if one wanted to modify the created nodes (i.e. if the nodes  
are mutable), or creating a generic Lazy set of classes capable of   
dealing with multiple formats."

How would you do post-order or pre-order iteration of nodes? Wouldn't
you have to go back and forth in the file?
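
Thinking about it a bit more, one possible partial answer: because clades
are nested in the XML, a single forward pass sees the start tags in
pre-order and the end tags in post-order. A small Ruby sketch of that
observation follows. The traversal_orders helper and the sample input are
made up for illustration, the regex is only a stand-in for a real
streaming parser, and it assumes each clade's <name> comes before its
child clades.

```ruby
# Toy phyloXML-like input (no attributes, every clade has a name).
SAMPLE = '<clade><name>root</name>' \
         '<clade><name>A</name></clade>' \
         '<clade><name>B</name></clade>' \
         '</clade>'

# Walk the tags once, front to back, and collect both traversal orders.
def traversal_orders(xml)
  pre_order   = []
  post_order  = []
  open_clades = []   # names of clades whose end tag is still ahead

  xml.scan(%r{<clade>|</clade>|<name>([^<]*)</name>}) do |(name)|
    case Regexp.last_match[0]
    when '<clade>'
      open_clades.push(nil)             # name not seen yet
    when '</clade>'
      post_order << open_clades.pop     # node is complete at its end tag
    else
      open_clades[-1] = name            # <name> belongs to innermost clade
      pre_order << name                 # node is first visited here
    end
  end
  [pre_order, post_order]
end

pre, post = traversal_orders(SAMPLE)
puts "pre-order:  #{pre.join(' ')}"
puts "post-order: #{post.join(' ')}"
```

So these two orders might not need any seeking at all. Other access
patterns (level-order, or jumping from a node to its parent or sibling)
probably would, unless an offset index is built first.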

CZ

Chris Fields wrote:
> On May 20, 2009, at 8:22 AM, Hilmar Lapp wrote:
>
>   
>> On May 19, 2009, at 5:54 PM, Christian M Zmasek wrote:
>>
>>     
>>> I think it is perfectly acceptable to expect to have enough memory
>>> to keep at least
>>> one tree in memory
>>>       
>> Sounds like a good and perfectly reasonable starting point to me too.
>> It's also the way other toolkits (such as BioPerl) work.
>>
>> Having said that, I don't find it inconceivable that we may be working
>> with trees in the near future that don't fit into memory for a 1GB RAM
>> machine if they are richly decorated (which is something that phyloXML
>> wants to enable, isn't it?). Solving that to me though seems to be a
>> question of writing an appropriate Tree implementation that happens to
>> store most of the data on disk rather than in memory, and not an issue
>> for how to write a parser. Ideally though, the parser uses a factory
>> for creating the (tree and/or node) objects, so that later it can be
>> made to use an on-disk Tree implementation simply by passing it
>> another factory. I.e., ideally the parser would not assume and hard-
>> code the Tree implementation class.
>>
>> Just my $0.02.
>>
>> 	-hilmar
>>     
>
> This could be implemented in a lazy way or using lightweight objects.   
> The Tree object itself contains the XML parser or a reference thereof  
> (probably LibXML Reader-based) and creates the relevant nodes as  
> needed.  The only thing needed would be some light parsing to indicate  
> start-end file points.
>
> It's tricky with re: to a number of aspects, but it can be done.  For  
> instance, if one wanted to modify the created nodes (i.e. if the nodes  
> are mutable), or creating a generic Lazy set of classes capable of   
> dealing with multiple formats.
>
> Just in case anyone's wondering, I have been thinking along these  
> lines for a while re: BioPerl, Bio::Seq, and very large files... ;>
>
> chris
>