[Bioperl-l] Bioperl-l Digest, Vol 117, Issue 13

Fri Feb 1 14:25:34 UTC 2013

Hi Jason,

Thanks for the detailed feedback.  The real reason I had to write my own parser is that even with close, repeated support from NCBI we couldn't get XML output with short_web_blast.pl because the parameter that turns on XML output was not functioning (they've probably fixed it by now), and I had to crank out a parser asap to support a job talk.

I don't think the upstream and downstream feature reports are particularly useful, because in mammals the flanking features tend to be so far away that they are not likely to be biologically relevant.  But the internal motif reports are useful, maybe especially if you are blasting short reads, like I was.  A 16-mer preserved domain hit is really good if you're blasting 18-mer Illumina short reads.

As far as my involvement goes, I got diagnosed with cancer on Wednesday, so I'll be taking a step back until next week's surgery and taking a lot of deep breaths.  On the other hand, this just makes me more motivated: I've been thinking a lot about time, and timely contributions, the last two days.

Cheers,
Dan

________________________________
 From: Jason Stajich <jason.stajich at gmail.com>
To: Dan kilburn <dr_kilburn59 at yahoo.com> 
Cc: "bioperl-l at lists.open-bio.org" <bioperl-l at lists.open-bio.org> 
Sent: Friday, February 1, 2013 1:58 AM
Subject: Re: [Bioperl-l] Bioperl-l Digest, Vol 117, Issue 13

Dan - 

I think the answer is yes if others are doing it - I am not in a position to be much of a main coder.

I don't know which format you are referring to, or whether you had to write something for the text BLAST changes or something else.  Specific bug reports on formats that aren't working are always helpful.  The XML format has been pretty stable, so I would suggest using it if you are simply parsing reports rather than reading them.

Chris posted instructions on how to contribute, and the move to GitHub simplifies this.  Having to write a whole new parser seems a bit severe; I hope that in the future people can raise problems sooner. If I hit a wall with something, I usually write the code to fix it and contribute it back. I don't play follow-the-format-changes with the tools anymore, but hopefully others like yourself can make those contributions.

If you are referring to my response to the question below, I don't think anyone will try to support NCBI's additional markup for upstream and downstream features, as laid out in the text reports, without some serious effort. Perhaps in the future that information will be reported in the XML format and thus be more parseable.
best wishes,
Jason

On Jan 30, 2013, at 1:40 PM, Dan kilburn <dr_kilburn59 at yahoo.com> wrote:

>Hi Jason,
>
>Are there any plans to keep SearchIO up to date with NCBI BLAST? I know they change formats ridiculously often, but I had to write my own parser to get sequence identity, which I would rather not have done. I realize that this job would be a big load on anyone who takes it, but it's so fundamental. Maybe I can help.
>
>--Dan
>Sent from my iPhone
>
>On Jan 30, 2013, at 12:00 PM, bioperl-l-request at lists.open-bio.org wrote:
>
>
>>Today's Topics:
>>
>> 1. Re:  Parsing Blast-Report extracting "Features flanking    .."
>>    (Jason Stajich)
>>
>>
>>----------------------------------------------------------------------
>>
>>Message: 1
>>Date: Tue, 29 Jan 2013 11:00:16 -0800
>>From: Jason Stajich <jason.stajich at gmail.com>
>>Subject: Re: [Bioperl-l] Parsing Blast-Report extracting "Features
>>  flanking    .."
>>To: buschj at hhu.de
>>Cc: bioperl-l at lists.open-bio.org
>>Message-ID: <6E83E3F3-C304-4DC4-9A11-FE1CA90F207D at gmail.com>
>>Content-Type: text/plain;    charset=us-ascii
>>
>>We don't parse the NCBI feature info from the BLAST reports per your query. To look up a specific feature you can use Bio::DB::GenBank to query for sequence from a specific feature by accession number - see the HOWTOs for that.
>>
>>However, most people use tools that generate SAM/BAM files with short reads - then you can use a tool like bedtools to find overlaps of reads with the locations of features.
>>
>>Basically:
>>- download the genome and GFF for Arabidopsis
>>- align your sRNA reads to the genome with a short-read aligner (bowtie, bwa, or others)
>>- convert your SAM file to BAM with SAMtools or Picard
>>- compare the locations of features with the reads using BEDTools to get expression summaries or individual reads
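To make the last step of the list above concrete, here is a minimal sketch in plain Python of what an overlap comparison computes. It is not BEDTools and does not read SAM/BAM or GFF files; the `Interval` type, the function names, and the half-open coordinate convention are all assumptions made for illustration, and the nested loop is only suitable for small inputs.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    """A named region on a chromosome (hypothetical type for this sketch)."""
    chrom: str
    start: int  # assumed 0-based, inclusive
    end: int    # assumed exclusive
    name: str


def overlaps(a: Interval, b: Interval) -> bool:
    """True if the two intervals share at least one base on the same chromosome."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def count_reads_per_feature(features, reads):
    """Count, for each feature, how many reads overlap it (simple O(n*m) scan)."""
    counts = {feature.name: 0 for feature in features}
    for read in reads:
        for feature in features:
            if overlaps(read, feature):
                counts[feature.name] += 1
    return counts
```

A real analysis would use a dedicated tool on sorted or indexed inputs; the sketch only shows the comparison that yields a per-feature expression summary.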
>>
>>
>>On Jan 25, 2013, at 2:20 AM, jobu <buschj at hhu.de> wrote:
>>
>>
>>On 22.01.2013 19:03, Mgavi Brathwaite wrote:
>>>
>>>What upstream and downstream elements are you interested in?
>>>>
>>>
>>>I've got a huge pile of short RNA reads.
>>>Part of the question now is whether those RNA fragments originate from
>>>siRNA events,
>>>or may represent miRNAs / parts of pre-miRNAs.
>>>
>>>So I did an online  blast search against database nt.
>>>The resulting report quite often just gives subject information like this:
>>>
>>>-----
>>>
>>>gb|CP002686.1| Arabidopsis thaliana chromosome 3, complete sequence
>>>>Length=23459830
>>>-----
>>>
>>>Now I would like to get the hit's neighbouring regions for further
>>>analysis.
>>>Preferably I would like to do that in an automated way, but I guess the only
>>>possible action with this kind of subject gi | description would be to
>>>fetch the entire chromosomal sequence?
>>>
>>>However,
>>>right below the line above, the report states more precisely:
>>>
>>>------
>>>Features flanking this part of subject sequence:
>>>8872 bp at 5' side: cytochrome P450 90B1
>>>402 bp at 3' side: U1 small nuclear ribonucleoprotein-70K
>>>------
>>>
>>>Still, I would like to have the possibility to automatically fetch the
>>>subject's sequence(s).
>>>As of now I think parsing the report with SearchIO won't let me acquire
>>>that information, because SearchIO does not recognize report sections
>>>like those.
>>>
>>>I hope I did not miss any of SearchIO's capabilities, but I could not
>>>find any method covering my wish?!
>>>
>>>Right now maybe the only way to get the information I want is to
>>>construct my own parser and write its output into a separate file, which
>>>I could in turn read into a hash before processing the Blast-Report
>>>with SearchIO, combining both sets of data for further automated work.
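The separate-parser idea described above could be sketched roughly as follows, here in Python rather than Perl. This is only an illustration under stated assumptions: the regular expression is modelled on the two sample "bp at 5' side" / "bp at 3' side" lines quoted earlier in this message, subject definition lines are assumed to start with ">" as in a plain-text BLAST report, and the function and key names are hypothetical. Other layouts NCBI may emit are not handled.

```python
import re

# Pattern modelled on the sample lines in this thread, e.g.
#   "8872 bp at 5' side: cytochrome P450 90B1"
FLANK_RE = re.compile(r"^\s*(\d+) bp at ([35])' side: (.+?)\s*$")


def parse_flanking_features(lines):
    """Map each subject ID to the flanking-feature entries found beneath it.

    Assumes subject definition lines start with '>' and that the first
    whitespace-delimited token after '>' is the subject ID.
    """
    flanks = {}
    subject = None
    for line in lines:
        if line.startswith(">"):
            parts = line[1:].split()
            if parts:
                subject = parts[0]
                flanks.setdefault(subject, [])
            continue
        match = FLANK_RE.match(line)
        if match and subject is not None:
            flanks[subject].append({
                "distance_bp": int(match.group(1)),
                "side": match.group(2) + "'",
                "feature": match.group(3),
            })
    return flanks
```

The resulting dictionary is keyed by subject ID, so it could be joined with hits from a regular report parser. A subject with several HSPs may collect several entries, which this sketch does not tie to individual HSPs.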
>>>
>>>I am aware though that even successfully getting the flanking features
>>>would leave me with the more or less wide  intergenic gap my hsp is
>>>located in.
>>>
>>>However I'm in need of a way to get the flanking features including
>>>their annotation and the region spanning between them.
>>>But I hope I do not have to get complete sequences to accomplish that,
>>>as this would be kind of an overkill.
>>>
>>>with kind regards
>>>Jochen
>>>
>>>
>>>
>>>_______________________________________________
>>>Bioperl-l mailing list
>>>Bioperl-l at lists.open-bio.org
>>>http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>
>>Jason Stajich
>>jason.stajich at gmail.com
>>jason at bioperl.org
>>

Jason Stajich
jason.stajich at gmail.com
jason at bioperl.org