[Bioperl-l] new modules for searching for patterns in fasta-files

Amir Karger akarger at CGR.Harvard.edu
Tue Aug 9 15:20:34 EDT 2005


> From: markus.riester at student.uni-tuebingen.de 
> "Aaron J. Mackey" <amackey at pcbi.upenn.edu> schrieb:
> 
> > Out of curiosity, are your patterns allowed to cross newlines
> > embedded in the FASTA file?  This is the typical problem with using
> > grep/agrep directly with sequence files ...
>
> With a cheap trick, yes: split the FASTA file into two files,
> IDs in one file and sequences (one per line) in the second.


I wrote a simple one-liner to convert FASTA to three tab-separated columns:
ID (without the '>'), description, and concatenated sequence. That way you
don't have to worry about keeping the two files tied together, but agrep
should still find things only in the concatenated sequence. (Unless somebody
mean put a sequence into the description column.) As an added bonus, it means
you can throw a FASTA into Excel for sorting, filtering, etc., or merge it
with a gene list pretty easily.
It's at
http://cgr.harvard.edu/cbg/scriptome/Tools/Change.html#new__change_a_fasta_file_into_tabular_format__change_fasta_to_tab_
along with the tab-to-FASTA converter and a couple of sentences describing
potential gotchas (e.g., any tabs in the description get lost).
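
For anyone who just wants the idea without following the link, here is a
rough sketch of the FASTA-to-tab conversion in Python. This is not the
one-liner from the Scriptome page; the function names (fasta_to_tab,
format_row) are made up for illustration, and I'm assuming that the ID is
everything up to the first whitespace in the header and that tabs in the
description are replaced by spaces so each row keeps exactly three columns.

```python
import io


def format_row(header, chunks):
    """Build one 'id<TAB>desc<TAB>sequence' row (hypothetical helper)."""
    # Assumption: tabs in the header become spaces so the row keeps 3 columns.
    header = header.replace("\t", " ")
    parts = header.split(None, 1)  # ID is the text before the first whitespace
    seq_id = parts[0] if parts else ""
    desc = parts[1] if len(parts) > 1 else ""
    return "\t".join((seq_id, desc, "".join(chunks)))


def fasta_to_tab(lines):
    """Yield one tab-separated row per FASTA record from an iterable of lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield format_row(header, chunks)
            header, chunks = line[1:], []
        elif header is not None and line.strip():
            # Sequence line: drop the embedded newline by collecting pieces.
            chunks.append(line.strip())
    if header is not None:
        yield format_row(header, chunks)


if __name__ == "__main__":
    example = ">seq1 test protein\nACGT\nTTGA\n>seq2\nGGCC\n"
    for row in fasta_to_tab(io.StringIO(example)):
        print(row)
```

Since each record ends up on a single line, a pattern that would have
straddled a line break in the original file now sits in one unbroken string
in the third column.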

> 
> This should be OK for cDNA/protein FASTA files (but I am currently
> writing tests; maybe some serious problems with the chars-per-line
> limitations will show up, but it did look good in some first tests).

Can you tell me what you mean by this?

-Amir Karger

