[Bioperl-l] searchio tags

Liam Elbourne lelbourn at cbms.mq.edu.au
Wed Nov 5 09:04:38 UTC 2008


Thanks Jason and Chris,

I ended up iterating through the features as I needed the locus_tag  
anyway - bit slow but gets there in the end...

I've ccd this to the bioperl list as suggested, and will subscribe soon.

Regards,
Liam.


On 05/11/2008, at 6:00 AM, Chris Fields wrote:

> I've already looked into that using TBLASTN output from the web  
> interface, but I found no relevant lines in HSP sections in XML  
> formatted reports (see below).  It is possible the URLAPI interface  
> is returning different XML data (I've seen stranger things from  
> NCBI).  I'll check into that when I can.  BTW, the default read  
> format for RemoteBlast is still text IIRC (we never changed it to  
> XML), so is it possible Liam is referring to the default (text)  
> output, and not XML?
>
> Also, I have managed to get this working in GenericHSP, will  
> probably commit momentarily with added tests.  Liam, if you have XML  
> output that contains the relevant data could you attach it to an  
> email so I can try parsing it out?
>
> Liam, it might be best to ask these questions on the mail list in  
> case Jason or I can't get to them immediately.  For SeqIO features,  
> in order to retrieve a particular feature region you would need to  
> either iterate through or grep the relevant features and retrieve  
> the sequence using $feature->seq.  If I want surrounding sequence I  
> grab start/end from the feature, then +/- the extra bp and use $seq- 
> >subseq.
>
> -c
>
> Example Text Hit/HSP block:
>
> >emb|AM408590.1| Mycobacterium bovis BCG Pasteur 1173P2, complete  
> genome
> Length=4374522
>
> Features in this part of subject sequence:
>  Conserved hypothetical protein
>  Probable pyrimidine operon regulatory protein pyrR
>
> Score =  372 bits (955),  Expect = 3e-101, Method: Compositional  
> matrix adjust.
> Identities = 193/193 (100%), Positives = 193/193 (100%), Gaps =  
> 0/193 (0%)
> Frame = +3
>
> Query  1         
> MGAAGDAAIGRESRELMSAADVGRTISRIAHQIIEKTALDDPVGPDAPRVVLLGIPTRGV  60
>                
> MGAAGDAAIGRESRELMSAADVGRTISRIAHQIIEKTALDDPVGPDAPRVVLLGIPTRGV
> Sbjct  1577814   
> MGAAGDAAIGRESRELMSAADVGRTISRIAHQIIEKTALDDPVGPDAPRVVLLGIPTRGV  1577993
>
> Query  61        
> TLANRLAGNITEYSGIHVGHGALDITLYRDDLMIKPPRPLASTSIPAGGIDDALVILVDD  120
>                
> TLANRLAGNITEYSGIHVGHGALDITLYRDDLMIKPPRPLASTSIPAGGIDDALVILVDD
> Sbjct  1577994   
> TLANRLAGNITEYSGIHVGHGALDITLYRDDLMIKPPRPLASTSIPAGGIDDALVILVDD  1578173
>
> Query  121       
> VLYSGRSVRSALDALRDVGRPRAVQLAVLVDRGHRELPLRADYVGKNVPTSRSESVHVRL  180
>                
> VLYSGRSVRSALDALRDVGRPRAVQLAVLVDRGHRELPLRADYVGKNVPTSRSESVHVRL
> Sbjct  1578174   
> VLYSGRSVRSALDALRDVGRPRAVQLAVLVDRGHRELPLRADYVGKNVPTSRSESVHVRL  1578353
>
> Query  181      REHDGRDGVVISR  193
>               REHDGRDGVVISR
> Sbjct  1578354  REHDGRDGVVISR  1578392
>
> Matching XML Hit/HSP block:
>
>       <Hit>
>         <Hit_num>2</Hit_num>
>         <Hit_id>gi|121491530|emb|AM408590.1|</Hit_id>
>         <Hit_def>Mycobacterium bovis BCG Pasteur 1173P2, complete  
> genome</Hit_def>
>         <Hit_accession>AM408590</Hit_accession>
>         <Hit_len>4374522</Hit_len>
>         <Hit_hsps>
>           <Hsp>
>             <Hsp_num>1</Hsp_num>
>             <Hsp_bit-score>372.474</Hsp_bit-score>
>             <Hsp_score>955</Hsp_score>
>             <Hsp_evalue>3.27024e-101</Hsp_evalue>
>             <Hsp_query-from>1</Hsp_query-from>
>             <Hsp_query-to>193</Hsp_query-to>
>             <Hsp_hit-from>1577814</Hsp_hit-from>
>             <Hsp_hit-to>1578392</Hsp_hit-to>
>             <Hsp_query-frame>0</Hsp_query-frame>
>             <Hsp_hit-frame>3</Hsp_hit-frame>
>             <Hsp_identity>193</Hsp_identity>
>             <Hsp_positive>193</Hsp_positive>
>             <Hsp_gaps>0</Hsp_gaps>
>             <Hsp_align-len>193</Hsp_align-len>
>              
> < 
> Hsp_qseq 
> > 
> MGAAGDAAIGRESRELMSAADVGRTISRIAHQIIEKTALDDPVGPDAPRVVLLGIPTRGVTLANRLAGNITEYSGIHVGHGALDITLYRDDLMIKPPRPLASTSIPAGGIDDALVILVDDVLYSGRSVRSALDALRDVGRPRAVQLAVLVDRGHRELPLRADYVGKNVPTSRSESVHVRLREHDGRDGVVISR 
> </Hsp_qseq>
>              
> < 
> Hsp_hseq 
> > 
> MGAAGDAAIGRESRELMSAADVGRTISRIAHQIIEKTALDDPVGPDAPRVVLLGIPTRGVTLANRLAGNITEYSGIHVGHGALDITLYRDDLMIKPPRPLASTSIPAGGIDDALVILVDDVLYSGRSVRSALDALRDVGRPRAVQLAVLVDRGHRELPLRADYVGKNVPTSRSESVHVRLREHDGRDGVVISR 
> </Hsp_hseq>
>              
> < 
> Hsp_midline 
> > 
> MGAAGDAAIGRESRELMSAADVGRTISRIAHQIIEKTALDDPVGPDAPRVVLLGIPTRGVTLANRLAGNITEYSGIHVGHGALDITLYRDDLMIKPPRPLASTSIPAGGIDDALVILVDDVLYSGRSVRSALDALRDVGRPRAVQLAVLVDRGHRELPLRADYVGKNVPTSRSESVHVRLREHDGRDGVVISR 
> </Hsp_midline>
>           </Hsp>
>         </Hit_hsps>
>       </Hit>
>
> On Nov 4, 2008, at 11:55 AM, Jason Stajich wrote:
>
>> FYI
>>
>> -
>> Jason Stajich
>> Sent from my iPod
>>
>> Begin forwarded message:
>>
>>> From: Liam Elbourne <lelbourn at cbms.mq.edu.au>
>>> Date: November 3, 2008 10:50:35 PM PST
>>> To: Jason Stajich <jason at bioperl.org>
>>> Subject: Re: searchio tags
>>>
>>> Jason,
>>>
>>> Thanks for the very rapid response! Excuse the circularity, but  
>>> NCBI must now be including this info in the XML format because  
>>> RemoteBlast had it in my blast results, or is there a fatal  
>>> antipodean past pub time flaw in my logic?
>>>
>>> In any event I'll adopt your suggestion to check back against  
>>> SeqIO (and check out DB Cache, that's a new one for me...).
>>>
>>> Is there a method in SeqIO that allows one to go straight to a  
>>> particular region of the sequence (ie something like "$seq- 
>>> >features_after($curr_hsp->start('hit')") or do I have to iterate  
>>> through  all the features until $seq->current-feature->start >=  
>>> $curr_hsp->start('hit') if you'll excuse (and understand) the  
>>> pseudo-bioperl code?
>>>
>>> Regards,
>>> Liam.
>>>
>>> On 04/11/2008, at 5:07 PM, Jason Stajich wrote:
>>>
>>>> Liam -
>>>>
>>>> Sorry - we don't parse this out of the report -- that information  
>>>> is only something that is included in the online-only BLAST  
>>>> report and is not a standard part of the format.  The RemoteBlast  
>>>> will typically get the XML-only version of the report which to my  
>>>> knowledge does not contain this information.  If it is now being  
>>>> included the parser could be updated to also parse this out, but  
>>>> I am not sure that it is.
>>>> Sorry - this is just a limit of how the data has been  
>>>> traditionally available and NCBI makes this kind of thing  
>>>> available only in an HTML form that is not-standardized.
>>>>
>>>> Your best bet is to get the sequence accession for your hit, then  
>>>> obtain the sequence as a genbank record from remote genbank (and/ 
>>>> or also cache this sequence in your local DB, there is a DB Cache  
>>>> in BioPerl for just this sort of thing) or a local genbank - then  
>>>> check and see if your HIT location is close to any of the  
>>>> features you are interested in from the annotation of the record.
>>>>
>>>> Hope that gets you pointed in the right direction, may be worth  
>>>> sending a note to the bioperl list as well to see if other people  
>>>> have similar needs and maybe a better custom solution can be  
>>>> cooked up.
>>>>
>>>> Best wishes,
>>>> -jason
>>>> On Nov 3, 2008, at 9:00 PM, Liam Elbourne wrote:
>>>>
>>>>> Hi Jason,
>>>>>
>>>>> I'm trying to parse a bunch of blast reports produced by  
>>>>> Bio::Tools::Run::RemoteBlast using SearchIO. Everything is fine  
>>>>> except that the:
>>>>>
>>>>> "Features in this part of subject sequence:"
>>>>>
>>>>> information doesn't seem to included in any of the objects  
>>>>> produced by SearchIO - the only method I could find that looked  
>>>>> vaguely promising was the locus() method for  
>>>>> Bio::Search::Hit::HitI, but this isn't initialised.
>>>>>
>>>>> I'm trying to match a set of primers against the genome they  
>>>>> were designed to target to check that they are associated with  
>>>>> the loci they were intended to targeted (as it appears in  
>>>>> retrospect that they have been 'mislabeled'), and if not, which  
>>>>> loci they are near or in. I've included an example blast report  
>>>>> at the end of the email, where the information I'm still trying  
>>>>> to extract (everything else being extractable) is "hypothetical  
>>>>> protein".
>>>>>
>>>>> Sorry to trouble you with such a trivial issue, if you are not  
>>>>> the right person to contact (and I appreciate that this is not a  
>>>>> bug as such, except so far as it is a 'bug' in my knowledge of  
>>>>> bioperl!) please tell me who would be. I've attached the code  
>>>>> I'm using too for what that is worth.
>>>>>
>>>>>
>>>>> <blast_parse.pl>
>>>>>
>>>>> Regards,
>>>>> Liam Elbourne.
>>>>> Research Fellow
>>>>> Paulsen Lab
>>>>> Macquarie University
>>>>> Sydney Australia.
>>>>>
>>>>> blast report below
>>>>> ****************************
>>>>> BLASTN 2.2.18+
>>>>> Reference: Stephen F. Altschul, Thomas L. Madden, Alejandro
>>>>> A. Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and
>>>>> David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new
>>>>> generation of protein database search programs", Nucleic
>>>>> Acids Res. 25:3389-3402.
>>>>>
>>>>>
>>>>> RID: GZ1WTPWW016
>>>>>
>>>>>
>>>>> Database: NCBI Genomic Reference Sequences
>>>>>         11,340 sequences; 45,795,413,781 total letters
>>>>>
>>>>> Query=  BR0042_r
>>>>> Length=25
>>>>>
>>>>>
>>>>>                                                                  
>>>>> Score     E
>>>>> Sequences producing significant  
>>>>> alignments:                       (Bits)  Value
>>>>>
>>>>> ref|NC_004310.3|  Brucella suis 1330 chromosome I, complete  
>>>>> se...  46.4    3e-07
>>>>>
>>>>> ALIGNMENTS
>>>>> >ref|NC_004310.3| Brucella suis 1330 chromosome I, complete  
>>>>> sequence
>>>>> gb|AE014291.4| Brucella suis 1330 chromosome I, complete sequence
>>>>> Length=2107794
>>>>>
>>>>> Features in this part of subject sequence:
>>>>> hypothetical protein
>>>>>
>>>>> Score = 46.4 bits (50),  Expect = 3e-07
>>>>> Identities = 25/25 (100%), Gaps = 0/25 (0%)
>>>>> Strand=Plus/Minus
>>>>>
>>>>> Query  1       GTTTTTCGAGCCGCCCTTTTTGCCC  25
>>>>>             |||||||||||||||||||||||||
>>>>> Sbjct  494130  GTTTTTCGAGCCGCCCTTTTTGCCC  494106
>>>>>
>>>>>
>>>>> Database: NCBI Genomic Reference Sequences
>>>>>  Posted date:  Nov 2, 2008  5:47 PM
>>>>> Number of letters in database: 3,315,177
>>>>> Number of sequences in database:  2
>>>>>
>>>>> Lambda     K      H
>>>>> 0.634    0.408    0.912
>>>>> Gapped
>>>>> Lambda     K      H
>>>>> 0.634    0.408    0.912
>>>>> Matrix: blastn matrix:2 -3
>>>>> Gap Penalties: Existence: 5, Extension: 2
>>>>> Number of Sequences: 2
>>>>> Number of Hits to DB: 0
>>>>> Number of extensions: 0
>>>>> Number of successful extensions: 0
>>>>> Number of sequences better than 1e-05: 0
>>>>> Number of HSP's better than 1e-05 without gapping: 0
>>>>> Number of HSP's gapped: 0
>>>>> Number of HSP's successfully gapped: 0
>>>>> Length of query: 25
>>>>> Length of database: 3315177
>>>>> Length adjustment: 18
>>>>> Effective length of query: 7
>>>>> Effective length of database: 3315141
>>>>> Effective search space: 23205987
>>>>> Effective search space used: 23205987
>>>>> A: 0
>>>>> X1: 22 (20.1 bits)
>>>>> X2: 33 (29.8 bits)
>>>>> X3: 110 (99.2 bits)
>>>>> S1: 33 (31.0 bits)
>>>>> S2: 45 (41.9 bits)
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>> Jason Stajich
>>>> jason at bioperl.org
>>>>
>>>>
>>>>
>>>
>
> Christopher Fields
> Postdoctoral Researcher
> Lab of Dr. Marie-Claude Hofmann
> College of Veterinary Medicine
> University of Illinois Urbana-Champaign
>
>
>
>




More information about the Bioperl-l mailing list