[Bioperl-l] Bio::SeqIO::tab deletes gap characters when reading sequences, which is inconvenient
Tim White
exceptlowang at gmail.com
Wed Apr 18 00:00:08 UTC 2012
Hi,
Bio::SeqIO::tab (what you get when specifying -format => 'tab' to
Bio::SeqIO->new()) is perfect for converting sequences into a
one-per-line format, so that standard line-oriented UNIX tools (grep,
comm etc.) work as expected. Except... I just discovered that it
deletes gap ("-") characters when reading sequences, so it can't be used
to round-trip any files that contain these. This is a source of grief
as I frequently work with FASTA files that contain aligned sequences,
and thus gap characters.
This is all because the next_seq() function in Bio::SeqIO::tab.pm
contains the line:
$seq =~ s/\W//g;
which removes all non-word characters (everything except letters,
digits and underscore) from the sequence data.
IMHO it would be *much* better if this was changed to:
$seq =~ s/\s//g;
which simply removes all whitespace characters (particularly including
the \r that often appears at the ends of lines in text files that have
visited Windows), enabling gap characters (and, for example, periods and
asterisks) to be preserved. Alternatively, you could simply get rid of
this line of code and allow whitespace characters through.
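To make the difference concrete, here is a minimal sketch that applies both substitutions to a made-up aligned sequence string. The sample data and variable names are hypothetical; only the two regexes come from tab.pm and from the proposal above, and the snippet doesn't load BioPerl at all.

```perl
use strict;
use warnings;

# Hypothetical aligned sequence containing gaps, a period, a terminal
# asterisk and a Windows-style line ending. Not taken from a real file.
my $raw = "ACG--T.N*\r\n";

# Current behaviour: strip every non-word character
# (anything other than letters, digits and underscore).
(my $current = $raw) =~ s/\W//g;

# Proposed behaviour: strip whitespace only.
(my $proposed = $raw) =~ s/\s//g;

print "current:  $current\n";     # prints "current:  ACGTN"
print "proposed: $proposed\n";    # prints "proposed: ACG--T.N*"
```

With \W the gaps, period and asterisk are lost along with the line ending; with \s only the \r\n goes.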
I'm not sure whether this counts as a "bug", as a cursory search didn't
turn up any docs explaining precisely which characters are and aren't
preserved by classes implementing Bio::SeqIO. But it's certainly
inconsistent: at least Bio::SeqIO::fasta and Bio::SeqIO::table (the
latter with columns and delimiters set up appropriately) allow
round-tripping of files containing gap characters. It's also extremely
inconvenient for me personally, and I suspect for others. Assuming no harm would be done
by making the above change, what's the best thing to do to get this
changed? I've simply edited my own local copy of tab.pm to make the
above change, but obviously if others agree I'd like to get the change
done upstream.
Thanks,
Tim