[Bioperl-l] Parsing FASTA m10 output

Mon Apr 23 13:49:45 UTC 2007

Aaron,

I find -m 10 defined way back in the fasta2 notes:

--------------------------------------------------------------
Changes with 2.0x4  (January, 1996)

The major change in with 2.0x4 is the ability to get a parseable
output from FASTA/TFASTA/SSEARCH.  This can be done using output
option -m 10.  ...
--------------------------------------------------------------

It goes on to define the format in more detail (which is nice to have
around!).  It's possible -m 10 wasn't implemented for fasta3 until
recently, but I find references to it in the various fasta3 notes going
back to at least 2001, so maybe it just wasn't compiled in by default
until recently?  As far as I can tell, the extra '#' line was added to
all output in 2002.

We could just have SearchIO::fasta fall back to default parsing if
'#' isn't present.  The default format and m10 are different enough
that we probably want to separate m10 parsing into its own parser
subroutine so we don't screw with the default parsing too much.
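
A minimal sketch of that fallback, in Python purely for illustration
(the real change would live in the Perl SearchIO::fasta module).  The
names detect_output_format and choose_parser are hypothetical, and the
sketch assumes the '#' echo line reproduces the -m flag the way it does
in the examples quoted further down this thread:

```python
import re

# Hypothetical helper names; this is not the SearchIO::fasta code.
# Matches "-m10", "-m 10", "-m9", ... when the flag starts a token.
_M_FLAG = re.compile(r"(?:^|\s)-m\s*(\d+)\b")


def detect_output_format(first_line):
    """Guess the FASTA output format from the first line of a report.

    Returns "m<N>" when a '#' command-line echo carries an -m flag,
    otherwise "default".  Old reports without the echo line also end
    up as "default".
    """
    if not first_line.startswith("#"):
        return "default"
    match = _M_FLAG.search(first_line)
    return "m" + match.group(1) if match else "default"


def choose_parser(first_line):
    """Route m10 reports to their own parser, everything else to the default."""
    if detect_output_format(first_line) == "m10":
        return "parse_m10"
    return "parse_default"
```

Sending everything that isn't m10 (including -m9) to the default
parser is a simplification made for this sketch, not something settled
in the thread.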

chris

On Apr 23, 2007, at 8:29 AM, aaron.j.mackey at gsk.com wrote:

> Since -m10 is newer than PGM_DOC, you should be fine to use the first
> line to detect m10, when that first line exists (when it does not, the
> format cannot be m10, unless someone has re-compiled FASTA with an
> undefined PGM_DOC).
>
> -Aaron
>
> bioperl-l-bounces at lists.open-bio.org wrote on 04/23/2007 08:46:40 AM:
>
>> That's true, but older versions of fasta don't do this.  For
>> instance, the example files in the bioperl distribution in t/data
>> (HUMBETGLOA.FASTA, cysprot1.fasta, cysprot_vs_gadfly.fasta) lack this
>> line.
>>
>> From the fasta changelog:
>>
>> -------------------------------------------------------------
>>>> Nov 14-22, 2002  CVS fa34t20b6
>>
>> Include compile-time define (-DPGM_DOC) that causes all the fasta
>> programs to provide the same command line echo that is provided by
>> the PVM and MPI parallel programs.  Thus, if you run the program:
>>
>>      fasta34_t -q -S gtt1_drome.aa /slib/swissprot 12
>>
>> the first lines of output from FASTA will be:
>>
>>      # fasta34_t -q gtt1_drome.aa /slib/swissprot
>>       FASTA searches a protein or DNA sequence data bank
>>       version 3.4t20 Nov 10, 2002
>>      Please cite:
>>       W.R. Pearson & D.J. Lipman PNAS (1988) 85:2444-2448
>>
>> This has been turned on by default in most FASTA Makefiles.
>> -------------------------------------------------------------
>>
>> We could only support newer fasta output (newer than the above
>> version) since there have been several bug fixes and changes to
>> output; not sure how everyone else feels about this.
>>
>> chris
>>
>> On Apr 23, 2007, at 4:45 AM, Ioannis Kirmitzoglou wrote:
>>
>>> I don't know about older versions, but the latest version of FASTA
>>> starts its output with a line similar to one of these:
>>> # fasta34.exe -m9 -d0 -Q test.faa test.faa
>>> # fasta34.exe -m10 -Q test.faa test.faa
>>>
>>> This very first line is also the only one in the output that starts
>>> with '#'.
>>> Isn't this an easy way to determine the output type?
>>>
>>>
>>> -- 
>>>
>>> *Ioannis Kirmitzoglou*, MSc
>>> PhD. Student,
>>> Bioinformatics Research Laboratory
>>> Department of Biological Sciences
>>> University of Cyprus
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
>> Christopher Fields
>> Postdoctoral Researcher
>> Lab of Dr. Robert Switzer
>> Dept of Biochemistry
>> University of Illinois Urbana-Champaign
>>
>>
>
>

Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign