[Bioperl-l] Re: removing duplicate fasta records

Ewan Birney birney@ebi.ac.uk
Wed, 18 Dec 2002 14:32:42 +0000 (GMT)

On Wed, 18 Dec 2002, Lincoln Stein wrote:

> Ummm, what's wrong with it?  It's a model of lucidity compared to awk.
> My version was designed to show how to do this with bioperl, which I thought 
> was what the original inquiry was asking.  You can do this with sort, uniq 
> and sed too.

Noooo. Please. Gawk... Sed... Sort... Painful flashbacks coming back to

... Back in '97 I was "bequeathed" management of a "database" (Pfam) which
was structured as a series of directories in UNIX and had a large file
with a series of (about 30) different 5- or 6-line wrapped sed/sort/gawk
pipelines and a comment about what the (combined) command was meant to do.

In different scenarios, the data manager (me at that point) incanted these
lines by copying and pasting in a particular directory (with deft
line-wrap control, and don't hit newline too early...) and watched the
resulting screen. It was rather as if you were the Perl interpreter (there
were actual "branch conditions" in the text) stepping through code and
executing it yourself using UNIX tools.

Needless to say, when it broke, it broke in... ummm... interesting to 
debug ways, and required a deep knowledge about the sort -k syntax (read 
the manpage and you will appreciate it).

Much to the horror of the person who left this for me I replaced the whole
thing in Perl and moved from a directory of UNIX files to a directory of
RCS files in UNIX (I hadn't met relational databases then and I thought
those were only for "professional" software engineers, and I stayed clear
of them. Stupid in retrospect). It did provide marginally better data
recovery at the very least. (I don't think Erik has ever forgiven me for
not using his beloved gawk 5-liners). Looking back on it, I am not that
proud of what I wrote, but it did work, and it is still used in areas of
Pfam today (a scary thought), though Kevin might have gutted most of my
code by now.

.... and that is where I came across a project called "bioperl" which was 
then a loose collection of websites with principally Steve Chervitz's BLAST 
parser and the Seq.pm object (lots of people) and I said "hey, I don't
want to have to write my own sequence object, and these guys look pretty 
sane", and then Chris D said "I have a spare 486 running linux and a DSL 
line in my bedroom" and we initialised the cvs repository, cvs imported the 
BLAST parser and the Seq object, created the Bio:: space ... 

and 5 years later, here we are.