[Bioperl-l] parsing blast report with long description
Frank Schwach
fs5 at sanger.ac.uk
Mon May 17 08:38:18 UTC 2010
I think you should avoid those long IDs anyway, especially because they
contain spaces. That may cause problems further down the line, as many
programs use a pattern like />(\S+)/ to extract the identifier and so
keep only the text up to the first space. I would build a small database
for your files and use unique database identifiers in your FASTA files.
That will make it easier in the future to collect, for example, all
sequences from a certain region. If you want to avoid that, you could
have two files: a FASTA file using numbers as IDs, and a file that maps
those numbers to sample descriptions, i.e. a simple flat-file database.
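
A minimal sketch of the two-file idea, in case it helps. It is not a
tested pipeline: the sub name renumber_fasta, the "seq" prefix and the
six-digit ID width are all made up for illustration, and the header in
the sample record is a truncated copy of the one from your mail. The sub
swaps every header for a short numeric ID and returns the mapping so the
full description can be looked up later.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Replace every FASTA header with a short, space-free ID.
# Returns the rewritten FASTA text and an array ref of
# [new ID, original header] pairs (the flat-file "database").
# The 'seq' prefix and the zero-padded width are arbitrary choices.
sub renumber_fasta {
    my ($fasta, $prefix) = @_;
    $prefix = 'seq' unless defined $prefix;
    my $out = '';
    my @map;
    my $n = 0;
    for my $line (split /\n/, $fasta) {
        if ($line =~ /^>(.*)/) {
            # Header line: keep the original text only in the map.
            my $id = sprintf '%s%06d', $prefix, ++$n;
            push @map, [$id, $1];
            $out .= ">$id\n";
        }
        else {
            # Sequence line: copy through unchanged.
            $out .= "$line\n";
        }
    }
    return ($out, \@map);
}

# Sample record; the header is shortened from the one in the question.
my $fasta = <<'END';
>SMPL_IDI_1105131728043 /GS026/SMPL_READ_1095454077952/SMPL_READ_1095454041540/TI_1000008216887/Open Ocean/Galapagos Islands
IHWWLFEVGQKGFLNFSWCFGQVFKRLEHVCIRPKYVPYSSNLYRDSVKTLETPMWRRNSMRVFLKGSLFAVSLIASGAV
END

my ($new_fasta, $map) = renumber_fasta($fasta);
print $new_fasta;

# Tab-separated mapping: new ID, then the original header untouched.
print join("\t", @$_), "\n" for @$map;
```

After BLASTing against the renumbered file, the hit name in the report
(what $hit->name should return in SearchIO) is the short ID, and the
original description comes from the mapping file rather than from the
line-wrapped text in the report, so the tags can be split on '/' without
stray spaces.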
Frank
On Thu, 2010-05-13 at 11:07 -0400, shalabh sharma wrote:
> Hi All,
> I need some help in parsing blast output.
> I have an in-house database that contains sequences with really long
> descriptions.
>
> >SMPL_IDI_1105131728043
> /GS026/SMPL_READ_1095454077952/SMPL_READ_1095454041540/TI_1000008216887/Open
> Ocean/Galapagos Islands/134 miles NE of Galapagos/Ecuador/0.1 -
> 0.8/1d15'51N"/90d17'42W"/2 m/2386 m/0.22 ug-kg/32.6 psu/27.8 C/2-1-04
> IHWWLFEVGQKGFLNFSWCFGQVFKRLEHVCIRPKYVPYSSNLYRDSVKTLETPMWRRNSMRVFLKGSLFAVSLIASGAV
>
> So my blast report looks like this:
>
> .....
> .....
> >SMPL_IDI_1105131728043
> /GS026/SMPL_READ_1095454077952/SMPL_READ_1095454041540/TI_100000821
> 6887/Open Ocean/Galapagos Islands/134 miles NE of
> Galapagos/Ecuador/0.1 - 0.8/1d15'51N"/90d17'42W"/2
> m/2386 m/0.22 ug-kg/32.6 psu/27.8 C/2-1-04
> Length = 213
>
> Score = 124 bits (310), Expect = 5e-27, Method: Compositional matrix
> adjust.
> Identities = 62/155 (40%), Positives = 96/155 (61%), Gaps = 1/155 (0%)
> .....
> .....
>
> (Note that the tag "TI_1000008216887" is split across two lines.)
>
> I am using SearchIO to parse this report. I then parse the
> description field again to get all the tags, like this:
> ....
> ....
> my $desc = $hit->description;
> my @f = split('/',$desc);
> for (my $i = 0; $i < scalar @f; $i++) { print OUT "$f[$i]\t"; }
> .....
> .....
>
>
> The parsed report is otherwise perfect, but the field with
> TI_1000008216887 has a space in it: "TI_100000821 6887".
>
> I would really appreciate it if anyone could help me out.
>
> Thanks
> Shalabh Sharma
--
The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.