[BioPython] Can't download a FASTA file from NCBI to BLAST

Tue Jun 19 13:50:08 UTC 2007

Hi again Peter,

You are correct in your assumptions as to what I'm trying to accomplish.  I
have a habit of pulling random code from different places when I'm at a loss
for how to do something, when I can't find documentation or examples.  The
Seq5 and Seq3 were residual from a previous script where I was pulling out
overlapping 50mers.  Regardless... I added your code to my script, and
changed the format for the search parameter to "fasta", but I'm getting the
following error:

\      Traceback (most recent call last):
\        File "<pyshell#4>", line 1, in <module>
\          Get_FASTA_Seq("NC_001653")
\        File "C:/Python25/FASTAtry2.py", line 17, in Get_FASTA_Seq
\          SeqIO.write(gb_records, fasta_file, "fasta")
\        File "C:\Python25\lib\site-packages\Bio\SeqIO\__init__.py", line
214, in write
\          writer_class(handle).write_file(sequences)
\        File "C:\Python25\Lib\site-packages\Bio\SeqIO\Interfaces.py", line
243, in write_file
\          self.write_records(records)
\        File "C:\Python25\Lib\site-packages\Bio\SeqIO\Interfaces.py", line
232, in write_records
\          self.write_record(record)
\        File "C:\Python25\Lib\site-packages\Bio\SeqIO\FastaIO.py", line
103, in write_record
\          id = self.clean(record.id)
\      AttributeError: 'str' object has no attribute 'id'

Am I not supposed to use GenBank.search_for, or NCBIDictionary, or do I need
to parse the raw output first.  I did try to use the GenBank RecordParser,
as well as the Feature Parser set to output as "fasta", but I get the same
error?  My current  (+your) modified code is:

\     def Get_FASTA_Seq(NC_ID):
\          i = NC_ID
\
\    ## Search for Viruses based on TXID, and count number of hits ##
\
\         from Bio import GenBank
\         from Bio import SeqIO
\
\         gi_list = GenBank.search_for(i)
\         ncbi_dict = GenBank.NCBIDictionary("nucleotide","fasta")
\
\      ## Extract individual Sequence from NCBI for each TXID ##
\
\         gb_records = [ncbi_dict[gi] for gi in gi_list]
\         fasta_file = open("c:\\Current_Query.fasta","w")
\         SeqIO.write(gb_records, fasta_file, "fasta")
\         fasta_file.close()

Thank you for your insights.  Sorry to be such a noob.

-Roger

On 6/19/07, Peter <biopython at maubp.freeserve.co.uk> wrote:
>
> On 6/19/07, Roger Barrette <rwbarrette at gmail.com> wrote:
> > I am trying to set up a script to automatically go into NCBI and
> retrieve
> > individual FASTA file based on a list of accession numbers (either gi or
> > NC).  The code that I have written gets the sequences and saves the
> file,
> > but when I run a blast against the file, it doesn't work, Am I not using
> the
> > correct parser for preparing to save the file for blasting? I tried to
> set
> > the format to "fasta", but I was getting errors saying that gi_list[0]
> > doesn't contain the arguement 'data.seq'.  I also tried the arguement
> > .sequence, and it gave me the same errror.  I realize I'm not currently
> > calling the file in as a FASTA, but this is the only way I've been able
> to
> > even automate the record retrieval process for the long series of
> Blasting
> > that I have to do.    I have a separate function for calling the Blast,
> > but it works fine with manually downloaded FASTA files, so the
> > problem appears to be here.  Any suggestions for a fix, or even a better
> way
> > to do this would be greatly appreciated. Thanks.   My code is:
>
> You seem to have tried a lot of things and its difficult to follow
> exactly what you are trying to do. I think you want to:
>
> (1) start with a list of gi numbers
> (2) get the matching sequence data online from the NCBI
> (3) save these as a fasta file
> (4) call blast using this fasta file as the input query
>
> Anyway - I've included a bit of code for (2) and (3) at the end of this
> email.
>
> > def Get_FASTA_Seq(NC_ID):
> >
> >     i = NC_ID
> >
> > ## Search for Viruses based on TXID
> >
> >     from Bio import GenBank
> >     gi_list = GenBank.search_for(i)
> >     ncbi_dict = GenBank.NCBIDictionary("nucleotide","genbank")
> >
> >     fasta_file = open("c:\Current_Query.gbk", "w")
>
> Its a bit odd to save a fasta file with a gbk extension, fasta is more
> common. And there may be a problem with the single unescaped slash,
> try r"c:\Current_Query.gbk" or "c:\\Current_Query.gbk".
>
> Remember, in python the slash is used for things like \n (new line),
> \t (tab) etc.
>
> > ## Extract individual Sequence from NCBI based on gi# or NC#  ##
> >
> >     gb_record = ncbi_dict[gi_list[0]]
> >     record_parser = GenBank.FeatureParser()
> >     ncbi_dict = GenBank.NCBIDictionary("nucleotide","genbank", parser =
> > record_parser)
> >     gb_seqrecord = ncbi_dict[gi_list[0]]
> >
> >     SeqValue = ncbi_dict[gi_list[0]].seq.data
> >     NameValue = ncbi_dict[gi_list[0]].annotations["organism"]
> >     Length = len(SeqValue)
> >     Seq5 = 0
> >     Seq3 = Seq5 + Length
> >
> >     print NameValue
> >     print Length
> >     print SeqValue
> >
> > ## Write sequences into the FASTA file ##
> >
> >     fasta_file.write(">" + i + " " + NameValue + "\n")
> >     for j in range(0, len(SeqValue[Seq5:Seq3]), Length):
> >         fasta_file.write(SeqValue[Seq5:Seq3])
> >         fasta_file.write("\n")
> > ## Close and Save the FASTA file ##
> >     fasta_file.close()
>
> That is all very complicated - why mess about with Seq5 and Seq4 when
> you seem to want the whole sequence anyway?
>
> Have you opened the output file in a text editor to check it looks
> sensible?
>
> If you can construct a list SeqRecords, why not write the file using
> Bio.SeqIO (Biopython 1.43 or later) like this:
>
> ...
> gb_records = [ncbi_dict[gi] for gi in gi_list]
> from Bio import SeqIO
> fasta_file = open("c:\\Current_Query.fasta","w")
> SeqIO.write(gb_records, fasta_file, "fasta")
> fasta_file.close()
>
> Peter
>