[Bioperl-l] concatenate two embl sequence files
Heikki Lehvaslaiho
heikki at sanbi.ac.za
Fri Jan 27 10:06:52 UTC 2006
On Thursday 26 January 2006 15:18, Roy Chaudhuri wrote:
> Heikki Lehvaslaiho wrote:
> > Thanks Roy!
> >
> > I'll check to code in tomorrow when I am less sleepy and can go through
> > the code in detail. In principle the code looks good. It definitely needs
> > tests. If you have written any please do post them.
>
> Not too sure about how to go about writing tests, any suggestions?
I've committed the code and tests. See t/SeqUtils.t. The idea is to test all
methods and a reasonable portion of all edge values to be sure that the
method works as it should.
Note that the code does not create a new sequence object. It modifies the
existing one. Therefore it is best not to return that object. The users would
assign that to a variable that points to the same structure and get confused.
The method now returns true upon completeion.
Creating a new sequence object is problematic because one needs to add one
more dependency (e.g. Clone) and will not work anyway if the sequence
implementation is using a database back end. It is better the way you have
written it.
I added code to move over the annotations from secondary sequences, but did
not do anything remove duplicates if the same sequence gets added twice. I
wrote a note about this so that users know to be prepared if that affects
them.
> It did occur to me that my _coordAdjust method could be adapted to allow
> the Bio::Seq trunc method to retain sequence features (since there's no
> reason why the $add argument can't be negative). This would probably
> need a bit more work to cope with the situation where a feature overlaps
> the trunc coordinates, for example if we truncate to coordinates 1..400,
> but there's a feature 300..500. I guess the 'correct' behaviour might be
> to convert that feature to a fuzzy location of 300..>400? Or is it
> acceptable to have features with coordinates outside of a sequence?
No feature coordinates should always be within the sequence. Fuzzy is the
correct solution to this.
> If we did that then an obvious test would be to cat a sequence to
> itself, then trunc to retain just the second half of the new sequence
> and see if you got back what you started with.
Go ahead an try it!
> > A few more checks to make sure seq_>alphabet is the same in all sequences
> > might be a good idea.
>
> That's easy to implement. Just put the line:
> $self->throw('Trying to concatenate sequences with different alphabets:
> '.$seq->display_id.' ('.$seq->alphabet.') and ' .$_->display_id.'
> ('.$_->alphabet.')') unless $_->alphabet eq $seq->alphabet;
>
> at the start of the for(@seqs) loop of the cat subroutine.
Added.
Thanks,
-Heikki
> Roy.
> --
> Dr. Roy Chaudhuri
> Bioinformatics Research Fellow
> Division of Immunity and Infection
> University of Birmingham, U.K.
>
> http://xbase.bham.ac.uk
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
--
______ _/ _/_____________________________________________________
_/ _/
_/ _/ _/ Heikki Lehvaslaiho heikki at_sanbi _ac _za
_/_/_/_/_/ Associate Professor skype: heikki_lehvaslaiho
_/ _/ _/ SANBI, South African National Bioinformatics Institute
_/ _/ _/ University of Western Cape, South Africa
_/ Phone: +27 21 959 2096 FAX: +27 21 959 2512
___ _/_/_/_/_/________________________________________________________
More information about the Bioperl-l
mailing list