[Bioperl-l] Genome scanning questions/strategies

Robert Bradbury robert.bradbury at gmail.com
Tue Sep 15 08:05:22 UTC 2009


I have several applications which require scanning multiple genomes.  In some
cases I can get away with scanning the protein sequences; in other cases I
need to scan the mRNA or, in the worst case, the DNA sequences themselves.  I
have most of the available genomes on my hard drive, but in cases where they
are not complete or undergo frequent revisions, I may need to interface
through the GenBank | Ensembl | JGI (or other?) databases.

Some of the applications are basic counting statistics:
1) How many proteins?
2) How many amino acids in the proteins?
3) What are the species-specific codon frequencies in the coding sequences?
4) What fraction of the genome is ncRNA, junk DNA, etc.?
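
For concreteness, the first three counts above could be sketched roughly as
follows. This is only an illustration in plain Python (not BioPerl), assuming
the proteins and coding sequences are available as FASTA text; the function
names parse_fasta, protein_stats and codon_frequencies are hypothetical, and
codons containing ambiguity codes are simply skipped.

```python
from collections import Counter

def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def protein_stats(records):
    """Return (number of proteins, total amino acids), ignoring a trailing '*'."""
    n_proteins = n_residues = 0
    for _, seq in records:
        n_proteins += 1
        n_residues += len(seq.rstrip("*"))
    return n_proteins, n_residues

def codon_frequencies(records):
    """Return {codon: fraction} over the in-frame codons of all CDS records."""
    counts = Counter()
    for _, seq in records:
        seq = seq.upper().replace("U", "T")  # accept mRNA as well as DNA
        for i in range(0, len(seq) - len(seq) % 3, 3):
            codon = seq[i:i + 3]
            if set(codon) <= set("ACGT"):  # assumption: skip ambiguous codons
                counts[codon] += 1
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()} if total else {}
```

Reading from disk would look like protein_stats(parse_fasta(open(path))),
where path is whatever local proteome file is at hand.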

Other applications involve some functional analysis, e.g. find all specified
protein domains of interest (presumably some HMM matching or equivalent),
find all signal sequences (nuclear targeting, mitochondrial targeting, ER
targeting, etc.), find all mRNA restriction enzyme cut sites, etc.
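
As an illustration of the simplest of these scans, the restriction-site case
reduces to exact substring matching. The sketch below is plain Python rather
than BioPerl; the function name find_sites and the three-enzyme table are
assumptions made for the example, and the positions reported are 0-based
starts of the recognition site, not the exact cut position within it.

```python
# Recognition sequences for three common enzymes. All three are palindromic,
# so scanning one strand is sufficient for them.
SITES = {"EcoRI": "GAATTC", "BamHI": "GGATCC", "HindIII": "AAGCTT"}

def find_sites(seq, sites=SITES):
    """Return {enzyme: [0-based start positions]} for each recognition site."""
    seq = seq.upper().replace("U", "T")  # treat mRNA input as its DNA equivalent
    hits = {}
    for name, site in sites.items():
        positions = []
        start = seq.find(site)
        while start != -1:
            positions.append(start)
            start = seq.find(site, start + 1)  # step by 1 to allow overlaps
        hits[name] = positions
    return hits
```

Domain and signal-sequence searches need profile or HMM methods and are not
covered by this kind of exact matching.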

Questions are:
1) Are there "remote" functions that use genome center "supercomputers"
(other than say Remote Blast) that can be used for some of these purposes
and are interfaced in some way to BioPerl?
2) Will I incur genome center wrath by running all my queries "remotely"
(i.e. I do the computing, but they handle the database retrieval & network
distribution)?  If not, what is a good "max query frequency"? [I'm on a DSL
line, so I can't push most servers very hard from an I/O standpoint.]
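
To make the "max query frequency" idea concrete, a client-side throttle could
look roughly like the sketch below (plain Python, not BioPerl). The names
Throttle and fetch_all are hypothetical, fetch stands in for whatever
retrieval call is actually used, and the minimum interval is a placeholder to
be set from each provider's published usage policy, not a value recommended
by any genome center.

```python
import time

class Throttle:
    """Enforce a minimum interval (in seconds) between successive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock   # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None     # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now

def fetch_all(ids, fetch, throttle):
    """Call fetch(seq_id) for each id, pausing between calls as needed."""
    results = {}
    for seq_id in ids:
        throttle.wait()
        results[seq_id] = fetch(seq_id)
    return results
```

For example, fetch_all(ids, fetch, Throttle(1.0)) would issue at most one
request per second regardless of how fast the local machine could go.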

Finally, is there any "archive of experience" documenting how
information-system limits constrain various bioinformatics applications?
I.e., in terms of I/O and/or CPU requirements, is BLAST <
HMM-domain-searching < inter-genome signal scanning/matching?  This relates
to the question of when home-based bioinformaticians need to consider
switching from DSL to cable to FIOS, and of whether 1/3/4/6/8-core machines
or clusters can handle the workload.

Thank you,
Robert Bradbury


