[Bioperl-l] need help with phrap parser

Fri Dec 8 14:17:02 UTC 2006

neeti somaiya wrote:
> Can anyone point me to a Phrap parser which parses the ace file to extract
> what reads make up each contig (eg. read_a and read_b make contig1; read_d
> read_e and read_z make contig2, and other information of the reads (like
> whether the read is complemented or not with respect to the contig, what
> region of the contig does each read contribute etc), basically the AF and BS
> lines of the ACE output.
>
>   
neeti,

    To find the reads that went into each contig, you do *not* want the BS tagged records. My understanding is that BS is just what consed uses to populate its consensus line from the ace file. 
I write this because of an email sent me by David Gordon in 2001 included here 
without his permission:

> > Phrap writes BS lines which
> > indicate, for each consensus position, which read phrap uses at that
> > position to become the consensus.  These BS ("base segments") are 
> > manipulated by Consed when there are changes to the assembly, such as
> > joins, tears, removing reads, or changing the consensus.
>   
    The simplest way is:

egrep '^CO|AF|RD' acefilename

if you are on a unix system. Or with perl

while (<>) {
    print if (/^CO|AF|RD/);
}

But then you would need to parse the fields of interest. You get the 
position/strand in the contig from AF, then you get the length of the 
read from RD.

There does look like there is a part of bioperl that meant to perform 
this task--including Bio::Assembly::IO::ace but it looks like it was 
started, but never completed.