[EMBOSS] sequence retrieval

Zheng Jin Tu ztu at msi.umn.edu
Tue Jun 10 21:54:06 UTC 2008


This is a very common request from the biological
user community, especially microarray users.
They have a list of IDs (Affy IDs or accession numbers)
from a microarray data analysis, and they want the
matching sequences from a FASTA file such as an
Affymetrix library xxx.sif file.

To retrieve sequences by ID with EMBOSS, the EMBOSS
admin needs to index the database first.

NCBI fastacmd is another option for getting
sequences quickly, especially from a large FASTA
sequence file such as nt or nr.
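
For the fastacmd route, here is a rough Python sketch of a
per-ID lookup. It assumes fastacmd is on the PATH and that the
database was already built with formatdb. The function names
and file paths are made up for illustration, and treating
"no output" as "ID not found" is my assumption, not documented
fastacmd behaviour.

```python
import subprocess


def fastacmd_args(database, seq_id):
    """Build the command line: fastacmd -d database -s ID."""
    return ["fastacmd", "-d", database, "-s", seq_id]


def fetch_ids(database, id_path, out_path):
    """Look up each ID in id_path and append the hits to out_path.

    Returns the list of IDs that produced no output (assumed missing).
    """
    missing = []
    with open(id_path) as ids, open(out_path, "a") as out:
        for line in ids:
            seq_id = line.strip()
            if not seq_id:
                continue  # skip blank lines in the ID list
            result = subprocess.run(
                fastacmd_args(database, seq_id),
                capture_output=True,
                text=True,
            )
            if result.stdout:
                out.write(result.stdout)
            else:
                missing.append(seq_id)
    return missing
```

Something like fetch_ids("database", "ids.txt", "outsequence")
would then mirror the ">> outsequence" append, one ID at a time.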

A Perl script would be useful for batch sequence
retrieval. It would read an input file with the
list of IDs line by line, then do one of:

1): fastacmd -d database -s ID >> outsequence    # ncbi formatdb case

2): seqret .....                                 # EMBOSS case

3): Or just loop over the sequence file with a found/not-found
flag: match each ID against the FASTA header line ">id ...",
and output the sequence lines while the flag is on. This is
fine when the sequence file is relatively small, as in the
microarray case.
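
Option 3 could look roughly like this in Python (instead of
Perl). It is only a sketch: it assumes the ID is the first
whitespace-delimited token after ">" in the header, which may
not hold for every Affymetrix file, and the function and file
names are hypothetical.

```python
import sys


def load_ids(path):
    """Read one ID per line, ignoring blank lines."""
    with open(path) as handle:
        return {line.strip() for line in handle if line.strip()}


def filter_fasta(lines, wanted):
    """Yield the lines of every record whose header ID is in `wanted`."""
    keep = False  # the found/not-found flag
    for line in lines:
        if line.startswith(">"):
            # Assumption: the ID is the first token after '>'.
            fields = line[1:].split()
            keep = bool(fields) and fields[0] in wanted
        if keep:
            yield line


if __name__ == "__main__" and len(sys.argv) == 3:
    # usage: python pick_fasta.py ids.txt sequences.fasta > out.fasta
    wanted_ids = load_ids(sys.argv[1])
    with open(sys.argv[2]) as fasta:
        sys.stdout.writelines(filter_fasta(fasta, wanted_ids))
```

The ID list is held in a set, so each header costs one lookup
and the sequence file is read only once.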


Thanks, TU

--------------------------------------------------
On Tue, 10 Jun 2008, Sean MacEachern wrote:

> Hi Jay, 
> 
> Just wondering if you have considered the tools from NCBI. If you were to
> dload the blast bundle, I think blast-2.2.17 is the most current release,
> you can use formatdb to create a blastable database of your fasta seqs that
> you can use for blasting using one of the blast programs or retrieving using
> fastacmd.
> 
> I'm not sure what emboss application you are attempting to use but you could
> probably use a for loop to automate some procedure
> 
> Eg.
> 
> for i in $(cat seqIDs.txt); do fastacmd -d blastdb -s "$i" > seq.fsa &&
> primer3 -input seq.fsa -output "${i}_out.primers"; done
> 
> Depending on what you want to do something like that might work for you...
> 
> Cheers,
> Sean
> 
> 
> On 6/10/08 4:15 PM, "Jay" <jboddu at uiuc.edu> wrote:
> 
> > Daniel:
> > I tried seqret in different ways.
> > My problem is EMBOSS is not recognizing my master sequence file (which is in
> > fasta form) as my private database. Even after I did the indexing using
> > dbifasta.
> > When seqret is asking me to input sequence(s), I am not able to figure out
> > what exactly it accepts.
> > I tried dbname:ID, dbname:@listfile.
> > I also tried a crude way of copy pasting my master file and listfile in
> > "embl" folder in EMBOSSwin folder and try the same syntax (embl:ID,
> > > embl:@listfile, etc.).
> > These did not work.
> > I am assuming that my master file is not being recognized as a private DB.
> > I wanted to define my database in .embossrc file. I could not figure this
> > out either.
> > Jay
> > 
> > -----Original Message-----
> > From: Daniel Barker [mailto:db60 at st-andrews.ac.uk]
> > Sent: Tuesday, June 10, 2008 1:50 PM
> > To: rls at ebi.ac.uk
> > Cc: Peter Rice; Jay; emboss at lists.open-bio.org
> > Subject: Re: [EMBOSS] sequence retrieval
> > 
> > Dear Jay,
> > 
> > Are you simply trying to extract specific sequences from a Fasta-format
> > file? The EMBOSS program to do it is seqret, or maybe seqretsplit:
> > 
> > http://emboss.sourceforge.net/apps/release/5.0/emboss/apps/seqret.html
> > 
> > http://emboss.sourceforge.net/apps/release/5.0/emboss/apps/seqretsplit.html
> > 
> > As Peter Rice suggests, you can do stuff to speed the access up, but
> > it'll work without that.
> > 
> > Best regards,
> > 
> > Daniel
> 
> 
