[Bioperl-l] Bio::SeqIO::tab deletes gap characters when reading sequences, which is inconvenient

Tim White exceptlowang at gmail.com
Thu May 17 11:07:50 UTC 2012


Wonderful, thanks Chris!

Tim

On Fri, May 11, 2012 at 8:56 AM, Fields, Christopher J <cjfields at illinois.edu> wrote:

> Tim,
>
> This one got stuck in my drafts folder :P
>
> Easy enough to do.  I've added this in to the master branch, commit
> eece9dd.
>
> chris
>
> On Apr 17, 2012, at 6:59 PM, Tim White wrote:
>
> > Hi,
> >
> > Bio::SeqIO::tab (what you get when specifying -format => 'tab' to
> > Bio::SeqIO->new()) is perfect for converting sequences into a
> > one-per-line format, so that standard line-oriented UNIX tools (grep,
> > comm etc.) work as expected.  Except...  I just discovered that it
> > deletes gap ("-") characters when reading sequences, so it can't be
> > used to round-trip any files that contain these.  This is a source of
> > grief as I frequently work with FASTA files that contain aligned
> > sequences, and thus gap characters.
> >
> > This is all because the next_seq() function in Bio::SeqIO::tab.pm contains the line:
> >
> > $seq =~ s/\W//g;
> >
> > which removes all non-word characters (everything except letters,
> > digits and underscores) from the sequence data.  IMHO it would be
> > *much* better if this was changed to:
> >
> > $seq =~ s/\s//g;
> >
> > which simply removes all whitespace characters (particularly including
> > the \r that often appears at the ends of lines on text files that have
> > visited Windows), enabling gap characters (and, for example, periods
> > and asterisks) to be preserved.  Alternatively, you could simply get
> > rid of this line of code and allow whitespace characters through.
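
The difference between the two substitutions discussed above can be seen in a minimal standalone Perl sketch. The sample sequence and the helper names strip_nonword and strip_whitespace are made up for illustration; they are not part of Bio::SeqIO::tab, and the sketch does not reproduce the rest of next_seq().

```perl
use strict;
use warnings;

# Hypothetical helpers for illustration only; not part of BioPerl.

# Mimics the original line: drop every non-word character
# (anything other than letters, digits and underscore).
sub strip_nonword {
    my ($seq) = @_;
    $seq =~ s/\W//g;
    return $seq;
}

# Mimics the proposed line: drop whitespace only.
sub strip_whitespace {
    my ($seq) = @_;
    $seq =~ s/\s//g;
    return $seq;
}

# An aligned sequence with gaps, a period, an asterisk and a
# Windows-style line ending.
my $raw = "AC-GT.N-*\r\n";

print strip_nonword($raw), "\n";     # gaps, period and asterisk are lost
print strip_whitespace($raw), "\n";  # only the trailing \r\n is removed
```

With \W the gap characters disappear along with the line ending, whereas \s leaves the alignment columns intact and still strips the stray \r.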
> >
> > I'm not sure whether this counts as a "bug", as a cursory search didn't
> > turn up any docs explaining precisely what characters are and aren't
> > preserved by classes implementing Bio::SeqIO, but it's certainly
> > inconsistent (at least Bio::SeqIO::fasta, and Bio::SeqIO::table, with
> > columns and delimiters set up appropriately, allow round-tripping of
> > files containing gap characters) as well as extremely inconvenient for
> > me personally, and I suspect for others.  Assuming no harm would be
> > done by making the above change, what's the best thing to do to get
> > this changed?  I've simply edited my own local copy of tab.pm to make
> > the above change, but obviously if others agree I'd like to get the
> > change done upstream.
> >
> > Thanks,
> > Tim
> >
> > _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
>


