[Bioperl-l] Request for advice and pointers on a project to help biologists do simple formatting and analysis

Thu Mar 10 10:08:27 EST 2005

[snipped throughout for "brevity"]

> From: Andreas Kahari [mailto:ak at ebi.ac.uk] 
> 
> I'm not quite sure what this has to do with bioperl...

1. From http://www.bioperl.org: "The Bioperl server provides an online
resource for modules, scripts, and web links for developers of Perl-based
software for life science research." I assumed bioperl-l was for discussions
of doing Bio with Perl.

2. I asked in my original mail: "Are there any other lists I should post
these questions to?" but no one has suggested any lists or newsgroups yet.

3. My original mail also said, "take advantage of existing tools' APIs: perl
-MBio::Perl -e '...'"  

> On Wed, Mar 09, 2005 at 01:46:17PM -0500, Amir Karger wrote:
> 
> > >Amir Karger wrote:
> > >> I was thinking it would be useful to have a 
> > >> toolkit of outrageously simple
> > >> Perl one-liners.  Here's one:
> 
> http://www.oreilly.com/catalog/cookbook/

How many biologists who don't use Perl will read the Perl cookbook? Or were
you just making a suggestion of where I could take scripts from?

Actually, looking through the table of contents, I see only a few recipes
that would fit.  In any case, writing the scripts is not the hard part; it's
knowing which scripts will be useful and helping biologists find the right
ones to solve their particular problems.

> > I know that many of the tasks proposed for the Scriptome can be done
> > with grep, sed, cut, Word, or Excel.  But how many experimental
> > biologists are familiar with Unix cut? I think not many, because they
> > have other things to worry about.
> 
> Hmmm, comparing 'cut' and 'sed' with Word and Excel?  Oh well.

I'm not comparing the quality of sed vs. Find/Replace. Most biologists (at
least here) prefer Windows. They already use Excel to look at their data.
Excel has functions to do simple data analysis, but my impression is that
few biologists use those functions.

> The philosophy of Unix utilities is to do only one thing,
> but to do it very well.  In the case with the 'sort' utility
> for example, it will most likely use an out-of-core sorting
> algorithm to cope with files larger than the available memory
> of the machine, and will probably be a fair bit quicker and
> flexible than your own implementation.

The Scriptome is not aiming at sorting gigabyte files; does a biologist want
to sort an entire GenBank file? I think much more often they'll want to sort
lists of genes (or similar) that are smaller than 10 MB.  On files that
small, the choice of sorting algorithm doesn't matter. If they do try to sort
too big a file, the script will break, and they'll need to try a different
tool. I'm not claiming that my solution will solve every conceivable task,
just the easy ones.

> I do understand that there is a need for integrated utilities
> with easy-to-press buttons, and I won't try to put you off
> working on those kind of projects, but...
> 
> What would an experimental biologist, who is not familiar with
> 'sort', 'cut' or 'join', do with a Perl script that implemented
> those functionalities?

sort, cut, or join files! I don't think I understand your question.
An experimental biologist who knows just a little Unix can take a sorting
script, paste it to the command line, and use it.  We're talking about use
cases where the biologist knows exactly what they want to do - sort a file,
merge files together, pull out the 8th column from the data into a new file,
etc. - but not how to implement a solution.

Who knows? Maybe eventually we'll decide to put "sort -u file1 file2" as a
"script". But we wouldn't want to use *only* Unix commands because that
ignores all the stuff Unix can't (easily) do.  

>  Wouldn't it be better to provide a
> high-level interface to common tasks, like parsing the output
> from various programs and providing simple ways of accessing
> and manipulating sequence features etc.

That's exactly what I want to do. My interface is searching for a tool on a
website and pasting it onto the Unix command line.  

>  If you find ways to
> expand the application area of BioPerl, or if you rationalize
> and improve existing BioPerl code, then I'm sure the BioPerl
> maintainers would be happy to consider committing your code to
> the project.

I believe my project is complementary to Bioperl's bioscripts, but it aims
at a different set of tasks, namely, tasks that are so simple that
Bioperlers haven't bothered to commit the scripts to CVS. If I want to count
how many microarray hits have names and how many just have CG numbers, I'll
do it in a Perl one-liner that takes 3 minutes to write and maybe 10 for
debugging and formatting. Why bother committing that to CVS? Well, an
experimental biologist in my group gave me that exact example, and told me
she spent 20 minutes counting and double-checking. If she had had 1000 hits
instead of 100, she would have needed hours to count.  More likely, she
would have just given up.

To put it another way, I'm aiming to make hard things possible -
specifically things that are hard for biologists who aren't programmers.
Bioperl, on the other hand, is focusing on things that are hard (or hard to
do right, or at least annoying) even for programmers.

I am making at least a couple of assumptions about the niche I'm aiming for,
namely people who know how to use the command line but don't know Perl:
1. There are many such people (or enough to care about).
2. They will be able to put the "atomic" scripts together to solve real
problems (first join two files with one script, then sort with another, then
remove duplicates with a third).

I may be wrong about either of these.  It may be that even with the
Scriptome tools, you have to "think like a programmer" to do these sorts of
tasks, and that many biologists' brains just don't work that way. But I
think it's worth trying.

-Amir