[Bioperl-l] Simple question on reading a .msf file

Terry Jones tc.jones at jones.tc
Sun Dec 5 13:30:36 EST 2004


Hi. I'm new to bioperl (though not to perl itself). I have what I
think should be a simple task, and yet after reading the FAQ and
various POD and web pages, I can't make it work. I'm probably doing
something fundamentally wrong.

I've been sent a file with a .msf suffix. The start of the file is at
the bottom of this email. A bit of googling leads me to think I have a
multiple sequence alignment file of type msf, as produced by a tool
like clustalw.

I want to extract the sequences, do various things to them, write them
out as fasta, etc. After some reading and experimenting, it seems that
to read the file I should do

  my $in = Bio::AlignIO->new(-file => $file, -format => 'msf');

This works, but it seems there is a problem because a subsequent call
to $in->next_aln() results in an exception:

	-------------------- WARNING ---------------------
	MSG: seq doesn't validate, mismatch is 1
	---------------------------------------------------

	------------- EXCEPTION  -------------
	MSG: Attempting to set the sequence to [TTCC
	(some lines of DNA bases deleted here)
	ACGGTTGG~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~] which does not look healthy
	STACK Bio::PrimarySeq::seq /Library/Perl/5.8.1/Bio/PrimarySeq.pm:264
	STACK Bio::PrimarySeq::new /Library/Perl/5.8.1/Bio/PrimarySeq.pm:214
	STACK Bio::LocatableSeq::new /Library/Perl/5.8.1/Bio/LocatableSeq.pm:100
	STACK Bio::AlignIO::msf::next_aln /Library/Perl/5.8.1/Bio/AlignIO/msf.pm:131
	STACK toplevel ./simple.pl:9


So it seems to me that I'm going about this the wrong way. Other .msf
files I've seen, from other sources, also contain ~ characters, and I
can't read these as above either.

Is there something obviously wrong in what I'm doing?

One thing that makes me feel I'm partly on the right track is that if
I simply remove all the ~ characters from the .msf file, then next_aln
works and I can call $s = $aln->each_seq() on that and then $s->seq()
on that, and get out sequences. But this is surely wrong, as deleting
all the ~ characters changes the sequences.

I've tried

  $in->gap_char('~');

but this fails, telling me 

  Can't locate object method "gap_char" via package "Bio::AlignIO::msf" at ./simple.pl line 11.

gap_char is defined in Bio::SimpleAlign, but when I try using
Bio::SimpleAlign instead of Bio::AlignIO, the method next_aln() is no
longer available. So that's not right either.

Can someone tell me what I'm doing wrong here?

Regards,
Terry.


--------------------------------------------------------------------
The start of my .msf file:

!!NA_MULTIPLE_ALIGNMENT 1.0
PileUp of: *.ha1

 Symbol comparison table: GenRunData:pileupdna.cmp  CompCheck: 3341

                   GapWeight: 5
             GapLengthWeight: 1 

 file.msf  MSF: 1106  Type: N  November 12, 2003 11:38  Check: 9322 ..

 Name: jjtll       Len:  1032  Check: 2l50  Weight:  1.00
 Name: ltkaa       Len:  1032  Check: 1129  Weight:  1.00


More information about the Bioperl-l mailing list