[EMBOSS] Seqret slowness.....

Peter Rice pmr at ebi.ac.uk
Thu Oct 15 10:47:55 UTC 2009


Richard Rothery wrote:
> Hi,
>
> I have been trying to update my sequence datasets using the seqret program.
> Step 1 is that I blast my sequence against uniprot using the EXPASY server.
> Unfortunately, because of the recent explosion of duplicate data
> ("environmental samples"), it is now necessary to download 1-2K sequences
> and then filter out the random "environmental" sample-derived fragments etc.
> Step 2 is assembling a list in gnumeric and exporting it as a multiline text
> file in the format "uniprot:accession". Step 3 is using the command "seqret
> @filename.txt". This is extraordinarily slow. It takes >12 hours to download
> a fasta file containing 3K sequences. Is there a way of speeding this up? I
> used to be able to download directly from EXPASY, but the site now only
> allows about 200-odd sequences to be selected and downloaded at a time.

Perhaps we can download in batches ... what is your EMBOSS database 
definition for uniprot?

Also, how do you select the 1-2K sequences?
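
As a rough illustration of the batching idea, here is a minimal Python sketch
that splits a list file of "uniprot:accession" lines into smaller list files.
The function names, the "batch" file prefix and the batch size of 200 are all
assumptions made for this example; none of them come from EMBOSS. Whether
smaller batches actually run faster will depend on the database definition.

```python
from pathlib import Path


def split_list_file(lines, batch_size=200):
    """Group non-empty list-file lines (e.g. 'uniprot:P12345') into batches."""
    entries = [line.strip() for line in lines if line.strip()]
    return [entries[i:i + batch_size] for i in range(0, len(entries), batch_size)]


def write_batches(list_path, batch_size=200, prefix="batch"):
    """Write each batch to its own list file and return the new paths.

    The output names (batch_001.txt, batch_002.txt, ...) are hypothetical.
    """
    lines = Path(list_path).read_text().splitlines()
    paths = []
    for n, batch in enumerate(split_list_file(lines, batch_size), start=1):
        out = Path(f"{prefix}_{n:03d}.txt")
        out.write_text("\n".join(batch) + "\n")
        paths.append(out)
    return paths
```

Each resulting file could then be given to seqret in the same way as the
original list, for example "seqret @batch_001.txt".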

> Note that filtering sequence sets is very fast with the program cd-hit. This
> takes about 10 seconds on an old P4 machine to remove sequences from the set
> with >90% identity to any other, for example.

Would this be a useful addition to EMBOSS? For example, an application 
that reads selected entries from a database (uniprot) and filters them?

The output would be a sequence file containing the uniprot entries you 
need (so no need for the final seqret step retrieving sequences again).
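
To make the filtering step concrete, here is a minimal Python sketch that
drops FASTA records whose header line mentions a keyword. It is not cd-hit and
not an existing EMBOSS application. It assumes that the word "environmental"
appears in the header of the unwanted entries, which may not hold for every
record, and the function names are made up for this example.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line.strip():
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def drop_matching(records, keyword="environmental"):
    """Keep records whose header does not contain keyword (case-insensitive).

    Assumption: unwanted entries can be recognised from the header text alone.
    """
    keyword = keyword.lower()
    return [(h, s) for h, s in records if keyword not in h.lower()]
```

For example, drop_matching(read_fasta(open("hits.fasta"))) would return the
records to keep, where "hits.fasta" is a placeholder file name.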

> I do not have the resources to install and index local databases.

No problem - we try to support users in exactly your situation.

regards,

Peter Rice


