[Bioperl-l] retrieval of PRELIMINARY uniprot sequences using Bio::Registry fails
Brian Osborne
osborne1 at optonline.net
Wed Sep 6 16:41:50 UTC 2006
Chris,
Yes, I saw this but was waiting for Daniel's sample.
division() is not a great way to set this value, since it's meant for
taxonomic "divisions" (e.g. "PRI" in GenBank). On the other hand, what else
is there? authority() doesn't seem right either. What about:
$seq->seq_version($DATA_CLASS)
None of them is ideal, but this is the closest, in my opinion. Then
"Swiss-Prot" and "TrEMBL" could be set by namespace() or authority().
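
To make the round-trip problem concrete, here is a minimal sketch of reading the data class from the new ID line and writing it back out instead of hardcoding STANDARD. It is plain Python rather than BioPerl, so none of it is the actual Bio::SeqIO::swiss code. The names parse_id_line, format_id_line and IdLine are made up for illustration, and the single-space output layout is an assumption (the real flat files pad the columns).

```python
import re
from typing import NamedTuple

# Layout quoted in this thread:
#   ID ENTRY_NAME DATA_CLASS; MOLECULE_TYPE; SEQUENCE_LENGTH.
# Whitespace between fields is matched loosely because real files pad columns.
ID_LINE = re.compile(
    r"^ID\s+(?P<entry_name>\S+)\s+(?P<data_class>[^;\s]+);\s+"
    r"(?P<molecule_type>[^;\s]+);\s+(?P<length>\d+)\s+AA\.\s*$"
)


class IdLine(NamedTuple):
    """Hypothetical container for the four ID line fields."""
    entry_name: str
    data_class: str      # e.g. STANDARD or PRELIMINARY
    molecule_type: str   # e.g. PRT
    length: int          # sequence length in amino acids


def parse_id_line(line: str) -> IdLine:
    """Split an ID line into its fields, keeping the data class as found."""
    match = ID_LINE.match(line)
    if match is None:
        raise ValueError(f"unrecognized ID line: {line!r}")
    return IdLine(
        match["entry_name"],
        match["data_class"],
        match["molecule_type"],
        int(match["length"]),
    )


def format_id_line(rec: IdLine) -> str:
    """Write the ID line using the stored data class, not a fixed string."""
    return (
        f"ID {rec.entry_name} {rec.data_class}; "
        f"{rec.molecule_type}; {rec.length} AA."
    )


if __name__ == "__main__":
    rec = parse_id_line("ID Q5XPV6 PRELIMINARY; PRT; 231 AA.")
    print(rec.data_class)
    print(format_id_line(rec))
```

Whichever accessor ends up holding the value, the point is the same: the writer has to read it back from the sequence object rather than assume STANDARD.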
Brian O.
On 9/6/06 10:59 AM, "Chris Fields" <cjfields at uiuc.edu> wrote:
> Brian,
>
> I have found the issue with Bio::SeqIO::swiss; apparently UniProt has
> switched to using the following ID line format:
>
> ID ENTRY_NAME DATA_CLASS; MOLECULE_TYPE; SEQUENCE_LENGTH.
>
> For SwissProt IDs:
>
> ID CYC_BOVIN STANDARD; PRT; 104 AA.
> ID GIA2_GIALA STANDARD; PRT; 296 AA.
>
> For TrEMBL (preliminary protein):
>
> ID Q5XPV6 PRELIMINARY; PRT; 231 AA.
>
> SeqIO 'swiss' sequence output currently uses the first (SwissProt) version;
> it's hardcoded in a sprintf() statement. I guess TrEMBL didn't have a
> designation before, so this complicates things a little.
>
> There are a few other (small) formatting differences I have also found which
> we could update fairly easily.
>
> In the section of the release notes describing differences between
> SwissProt/EMBL format, this is listed:
>
> * EMBL entry ID lines have an additional three-letter taxonomic division
> 'token' inserted between the data class and the molecule type;
>
> I suppose we could use division() to store 'STANDARD' and 'PRELIMINARY' (or
> 'Swiss-Prot' and 'TrEMBL' if that's nicer).
>
> Chris
>
>> -----Original Message-----
>> From: bioperl-l-bounces at lists.open-bio.org
>> [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Daniel Lang
>> Sent: Wednesday, September 06, 2006 4:12 AM
>> To: bioperl-l at lists.open-bio.org
>> Subject: Re: [Bioperl-l] retrieval of PRELIMINARY uniprot sequences using
>> Bio::Registry fails
>>
>> Hi Brian,
>>
>> I'm now iterating over all uniprot_trembl sequences and recording those
>> for which retrieval fails. Let's see if STANDARDs also fail...
>>
>> How is the second field of the swissprot ID line handled anyway? Because
>> PRELIMINARYs end up as STANDARD when being parsed by Bio::SeqIO::swiss.
>>
>> On the other hand, I'm still confused about why there's no error or
>> warning when the retrieval fails. Can you give me a hint which modules
>> (besides swiss.pm) to look at?
>>
>> Cheers,
>> Daniel
>>
>> Brian Osborne wrote:
>>> Daniel,
>>>
>>> Well, if you can isolate the bug please add it to bugzilla.
>>>
>>> Brian O.
>>>
>>>
>>> On 9/5/06 5:57 AM, "Daniel Lang" <daniel.lang at biologie.uni-freiburg.de>
>>> wrote:
>>>
>>>> Hi Brian,
>>>>
>>>> sorry for the belated response!
>>>> I've compiled you a set of 100 PRELIMINARY entries from the latest
>>>> uniprot_trembl release. I've tried to reproduce the bug using only
>>>> these as input to build an index, but (sadly) all of them can be
>>>> retrieved using the latest checkout :-(
>>>> Maybe it's not connected to these entries after all, but to the size
>>>> or some other feature of the uniprot distribution?
>>>> I now could make it work using the 1.5.1 release.
>>>>
>>>> Originally, I built the index using the flat protocol; when I try bdb
>>>> and bioperl-live, even more problems occur:
>>>>
>>>> bp_bioflat_index.pl --dbname sw -i bdb -f swiss -l . -c uniprot_sprot.dat
>>>>
>>>> ------------- EXCEPTION -------------
>>>> MSG: The lineage 'Eukaryota, Metazoa, Chordata, Craniata, Vertebrata,
>>>> Euteleostomi, Amphibia, Batrachia, Anura, Mesobatrachia, Pipoidea,
>>>> Pipidae, Xenopodinae, Xenopus, Silurana, Xenopus, tropicalis' had two
>>>> non-consecutive nodes with the same name. Can't cope!
>>>> STACK Bio::DB::Taxonomy::list::add_lineage
>>>> /home/lang/bioperl/bioperl-live/Bio/DB/Taxonomy/list.pm:163
>>>> STACK Bio::DB::Taxonomy::list::new
>>>> /home/lang/bioperl/bioperl-live/Bio/DB/Taxonomy/list.pm:100
>>>> STACK Bio::DB::Taxonomy::new
>>>> /home/lang/bioperl/bioperl-live/Bio/DB/Taxonomy.pm:106
>>>> STACK Bio::Species::classification
>>>> /home/lang/bioperl/bioperl-live/Bio/Species.pm:171
>>>> STACK Bio::SeqIO::swiss::_read_swissprot_Species
>>>> /home/lang/bioperl/bioperl-live/Bio/SeqIO/swiss.pm:1049
>>>> STACK Bio::SeqIO::swiss::next_seq
>>>> /home/lang/bioperl/bioperl-live/Bio/SeqIO/swiss.pm:240
>>>> STACK Bio::DB::Flat::parse_one_record
>>>> /home/lang/bioperl/bioperl-live/Bio/DB/Flat.pm:333
>>>> STACK Bio::DB::Flat::BDB::_index_file
>>>> /home/lang/bioperl/bioperl-live/Bio/DB/Flat/BDB.pm:235
>>>> STACK Bio::DB::Flat::BDB::build_index
>>>> /home/lang/bioperl/bioperl-live/Bio/DB/Flat/BDB.pm:218
>>>> STACK toplevel
>>>> /share/apps/bioperl/bioperl-live/scripts_temp/bp_bioflat_index.pl:113
>>>>
>>>> But I think this is connected to the new changes to taxonomy handling
>>>> in Bio::Taxon...
>>>> I'm unsure whether to submit this separately, but I could also provide
>>>> an example of such a swissprot entry that causes this error.
>>>>
>>>> Thanks, again.
>>>>
>>>> Daniel
>>>>
>>>> Brian Osborne wrote:
>>>>> Daniel,
>>>>>
>>>>> Bug, presumably in SeqIO/swiss.pm. Can you send me a small file with
>>>>> such a PRELIMINARY entry?
>>>>>
>>>>> Brian O.
>>>>>
>>>>>
>>>>> On 9/1/06 6:11 AM, "Daniel Lang" <daniel.lang at biologie.uni-freiburg.de>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> when using Bio::Registry (bioperl-live) to fetch uniprot entries from
>>>>>> local indexed uniprot *.dats, I had to realize that several entries
>>>>>> could not be retrieved despite the fact that they are present in the
>>>>>> files! A closer look reveals that they are of status PRELIMINARY:
>>>>>>
>>>>>> uniprot_trembl.dat:ID Q16EZ1_AEDAE PRELIMINARY; PRT; 222 AA.
>>>>>>
>>>>>> A grep for PRELIMINARY finds nothing anywhere in my cvs checkout.
>>>>>> I also can't retrieve the sequences from the online database defined
>>>>>> as follows:
>>>>>> [swissprot_ebi]
>>>>>> protocol=biofetch
>>>>>> location=http://www.ebi.ac.uk/cgi-bin/dbfetch
>>>>>> dbname=swall
>>>>>>
>>>>>> Is this a bug or a feature? If it's a feature, how can I bypass it?
>>>>>>
>>>>>> Thanks in advance,
>>>>>> Daniel
>>>>>
>>>>
>>>>
>>
>>
>>
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l