[Bioperl-l] Placement of LargePrimarySeq

Ewan Birney birney@ebi.ac.uk
Sun, 17 Sep 2000 23:07:51 +0100 (BST)


Tomorrow I have to do some comparisons of very large sequence files
(around chromosome 1 size, if people are interested...). Although I could
potentially use bioperl sequences on a machine with a huge amount of real
memory, I decided to make a quick module that stores a sequence in a
file in /tmp/ and then executes the subseq command by using seek and read
commands.
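
For illustration only, the seek-and-read idea can be sketched as below. This is written in Python rather than Perl, and the class name LargeSeqOnDisk and its method add_sequence are hypothetical; it is not the Bio::LargePrimarySeq code. It assumes the scratch file holds raw residues with no header or newlines, so a 1-based position maps directly to a byte offset.

```python
import os
import tempfile


class LargeSeqOnDisk:
    """Hypothetical file-backed sequence; not the BioPerl API."""

    def __init__(self, tmpdir=None):
        # Scratch file holding only residues (no header, no newlines).
        # tmpdir=None uses the system temp directory, e.g. /tmp.
        fd, self.path = tempfile.mkstemp(prefix="largeseq_", dir=tmpdir)
        self._fh = os.fdopen(fd, "w+b")
        self.length = 0

    def add_sequence(self, chunk):
        # Append residues at the end of the file.
        self._fh.seek(0, os.SEEK_END)
        self._fh.write(chunk.encode("ascii"))
        self.length += len(chunk)

    def subseq(self, start, end):
        # 1-based, inclusive coordinates.
        if start < 1 or end > self.length or start > end:
            raise ValueError("subseq(%d, %d) out of range" % (start, end))
        self._fh.seek(start - 1)  # byte offset of the first residue
        return self._fh.read(end - start + 1).decode("ascii")

    def close(self):
        # Remove the scratch file when done.
        self._fh.close()
        os.remove(self.path)
```

Only the requested slice is ever held in memory, so the cost of a subseq call depends on the slice length and not on the size of the whole sequence.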

I have this object as Bio::LargePrimarySeq. Does anyone have any
objections to having this object in the Bio:: area directly, or should
I put it somewhere else? (Basically, what do people feel about cluttering
up the top level Bio:: area, or should I make a Bio::Seq:: directory?)
NB - there might be some other extensions, like Bio::CachePrimarySeq, which
can cache subseq calls to improve performance for LargePrimarySeq and
the Ensembl database equivalents...
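
As a rough sketch of what caching subseq calls could look like (again in Python, with hypothetical names CachingSeq and CountingBackend; the post does not say how Bio::CachePrimarySeq would work), a wrapper can remember recent results keyed by (start, end) and evict the least recently used entry. Exact-match caching is an assumption made here for brevity; a block-based cache would be another option.

```python
from collections import OrderedDict


class CachingSeq:
    """Hypothetical wrapper that remembers recent subseq results (LRU)."""

    def __init__(self, backend, max_entries=128):
        self.backend = backend          # any object with subseq(start, end)
        self.max_entries = max_entries
        self._cache = OrderedDict()

    def subseq(self, start, end):
        key = (start, end)
        if key in self._cache:
            self._cache.move_to_end(key)     # mark as most recently used
            return self._cache[key]
        value = self.backend.subseq(start, end)
        self._cache[key] = value
        if len(self._cache) > self.max_entries:
            self._cache.popitem(last=False)  # evict least recently used
        return value


class CountingBackend:
    """Toy in-memory backend that counts how often it is asked."""

    def __init__(self, residues):
        self.residues = residues
        self.calls = 0

    def subseq(self, start, end):
        self.calls += 1
        return self.residues[start - 1:end]  # 1-based, inclusive
```

The same wrapper would sit in front of a file-backed or database-backed sequence, since it only relies on the backend having a subseq method.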


I need to write a SeqIO system for making this object and also for writing
out very large fasta files (it should step through the sequence one MB at
a time using the subseq method, rather than getting the whole thing out as
a seq). Options:

	(a) make a new Bio::SeqIO::bigfasta module, and ->next_seq would
make sequences with LargePrimarySeq and ->write_seq would write with
this subseq method

	(b) parameterise Bio::SeqIO::fasta for both of these. (This has to
handle the boring "don't use $/" stuff, as reading can't put everything
between '>' characters into a string; the whole point is not to have the
entire sequence as a string in memory.)

I prefer (a) to (b).
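
To make the reading and writing requirements concrete, here is a minimal Python sketch of both directions. The function names write_large_fasta and read_large_fasta, the 60-column line width, and the assumption that the sequence object exposes .length and a 1-based .subseq(start, end) are all illustrative; none of this is the Bio::SeqIO interface.

```python
CHUNK = 1_000_000   # residues fetched per subseq call (one MB, as in the text)
WIDTH = 60          # residues per FASTA line (an assumption)


def write_large_fasta(out, seq_id, seq, chunk=CHUNK, width=WIDTH):
    """Write seq to out, pulling at most `chunk` residues at a time."""
    out.write(">%s\n" % seq_id)
    pending = ""     # residues left over that do not yet fill a line
    start = 1
    while start <= seq.length:
        end = min(start + chunk - 1, seq.length)
        buf = pending + seq.subseq(start, end)
        full = len(buf) - len(buf) % width   # length covered by whole lines
        for i in range(0, full, width):
            out.write(buf[i:i + width] + "\n")
        pending = buf[full:]
        start = end + 1
    if pending:
        out.write(pending + "\n")


def read_large_fasta(handle, add_sequence):
    """Read one record line by line, passing residues to add_sequence().

    Returns the sequence id, or None if no header line is found.
    Limitation: the header of a following record is consumed on exit.
    """
    seq_id = None
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if seq_id is not None:
                break                # next record starts; stop here
            parts = line[1:].split()
            seq_id = parts[0] if parts else ""
        elif seq_id is not None and line:
            add_sequence(line)       # one line of residues at a time
    return seq_id
```

Reading line by line, instead of splitting the input on '>', is what keeps the whole record from ever being held as one string; the add_sequence callback would typically append to a file-backed sequence object.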



I've got to do this tomorrow, so if people have a view, make sure that view
gets back to me soon....




Of course this is all main trunk stuff, not on the branch.





-----------------------------------------------------------------
Ewan Birney. Mobile: +44 (0)7970 151230, Work: +44 1223 494420
<birney@ebi.ac.uk>. 
-----------------------------------------------------------------