[Biopython] File format autodetection.

Lenna Peterson arklenna at gmail.com
Tue Jun 24 18:59:34 UTC 2014


On Tue, Jun 24, 2014 at 2:00 PM, Ivan Gregoretti <ivangreg at gmail.com> wrote:

> Indeed, the STDIN stream is the challenge. That is why I thought that
> the question was worth documenting in the Biopython list.
>
> Would anybody mind showing how peekline() is used? I tried using it on
> a SeqIO.parse generator but I get an error:
>
> AttributeError: 'generator' object has no attribute 'peekline'
>

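peekline() is a method of UndoHandle, not of the generator returned by
SeqIO.parse. Wrap the input handle in an UndoHandle first and call
peekline() on that wrapper, before handing it to the parser.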
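
Something along these lines might work as a starting point. It is only a
sketch: guess_fasta_or_fastq is a made-up helper name, the ">" / "@" test
is a simple heuristic rather than real format detection, and the fallback
class is a hypothetical stand-in (not Biopython code) that only runs if
Bio.File.UndoHandle cannot be imported.

```python
import io

try:
    # Available in the Biopython release used in this thread (1.61).
    from Bio.File import UndoHandle
except ImportError:
    # Hypothetical stand-in that mimics only the part of UndoHandle
    # needed here: peek at one line without consuming it.
    class UndoHandle(object):
        def __init__(self, handle):
            self._handle = handle
            self._saved = []

        def peekline(self):
            # Read one line, remember it, and return it.
            if not self._saved:
                self._saved.append(self._handle.readline())
            return self._saved[0]

        def readline(self):
            # Hand back the remembered line first, then read on as usual.
            if self._saved:
                return self._saved.pop(0)
            return self._handle.readline()


def guess_fasta_or_fastq(handle):
    """Guess the format from the first line (heuristic, assumed helper)."""
    first = handle.peekline()
    if first.startswith(">"):
        return "fasta"
    if first.startswith("@"):
        return "fastq"
    raise ValueError("Cannot tell FASTA from FASTQ: %r" % first)


if __name__ == "__main__":
    # In the real script, sys.stdin would take the place of StringIO.
    handle = UndoHandle(io.StringIO(u"@read1\nACGT\n+\nIIII\n"))
    print(guess_fasta_or_fastq(handle))
    # The peeked line has not been consumed.
    print(handle.readline().rstrip())
```

With the real UndoHandle, the wrapped handle (not the raw stdin) is what
you would then pass on, e.g. SeqIO.parse(handle, fmt), so the parser still
sees the first line.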

Cheers,

Lenna



>
> I am using Biopython 1.61 and Python 2.7.3 on linux 64bit.
>
> Thank you,
>
> Ivan
>
>
>
>
> Ivan Gregoretti, PhD
> Bioinformatics
>
>
> On Tue, Jun 24, 2014 at 1:41 PM, Fields, Christopher J
> <cjfields at illinois.edu> wrote:
> > On Jun 24, 2014, at 11:54 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> >
> >> Hi Ivan,
> >>
> >> Biopython's SeqIO does not (and will not) do automatic file
> >> format detection, it is just too hard to get right so instead
> >> that's the user's task:
> >>
> >> Zen of Python: Explicit is better than implicit.
> >> http://legacy.python.org/dev/peps/pep-0020/
> >>
> >> (BioPerl's SeqIO can do format guessing)
> >
> > (somewhat)
> >
> > You are welcome to try it, but Bio::Tools::GuessSeqFormat is IMHO one of
> > the misbegotten step-children of Bioperl; if you delve into it, you’ll find
> > it also tries to guess whether something is a sequence or an alignment
> > file.  My general feeling is that if you don’t know the source of your data
> > (and from that the format) then there is only so much we can do to help.
> > Doing so from STDIN is even trickier.
> >
> > So, it’s there, it works in most cases so we keep it around, but caveat
> > emptor.  We don’t really maintain that module beyond very routine bug
> > fixes.
> >
> >> Your use case is one which highlights a technical reason
> >> why this is hard - you are using stdin, a read-once handle.
> >> You cannot peek at the file, guess the format, seek back to
> >> the beginning, and then give the handle to a specific parser.
> >>
> >> You could use Biopython's UndoHandle here, but it will
> >> impose a (modest) performance overhead.
> >>
> >> from Bio.File import UndoHandle
> >> help(UndoHandle)
> >>
> >> e.g. Use the .peekline() method to spot FASTA vs FASTQ?
> >>
> >> Peter
> >
> > That seems like a pretty reasonable option.
> >
> > chris
> >
> >> On Tue, Jun 24, 2014 at 5:16 PM, Ivan Gregoretti <ivangreg at gmail.com> wrote:
> >>> Hello Biopythoneers,
> >>>
> >>> The question:
> >>>
> >>> What is the strategy currently used for file format autodetection?
> >>>
> >>>
> >>> The context:
> >>>
> >>> I have written a command line program that gets a stream of FASTQ data
> >>> and reports how many records are contained. You can visualise it like
> >>> this
> >>>
> >>> zcat myfile.fq.gz | fxcounttags.py -i /dev/stdin -o /dev/stdout > myfile.counts
> >>>
> >>> That works fine for FASTQ but I need to extend the functionality to
> >>> FASTA streams. How would you write fxcounttags.py to detect
> >>> FASTQ/FASTA?
> >>>
> >>> Thank you,
> >>>
> >>> Ivan
> >>>
> >>>
> >>>
> >>> Ivan Gregoretti, PhD
> >>> Bioinformatics
> >>> _______________________________________________
> >>> Biopython mailing list  -  Biopython at mailman.open-bio.org
> >>> http://mailman.open-bio.org/mailman/listinfo/biopython