[Bioperl-l] IMGT parsing
Elia Stupka
elia at tigem.it
Tue Feb 22 03:55:01 EST 2005
> Bearing in mind that the only alteration to the original code I have
> made is a slight modification of the regular expression, there should
> be no reason why the method should not read large files of EMBL- or
> IMGT- formatted sequences.
I think Brian was not referring to performance issues. What he meant is
that only by parsing a large part of EMBL can you rest assured that your
modification will not cause problems when parsing other records in EMBL,
which, by definition, contain all sorts of oddities.
When I initially wrote EMBL.pm I would take it through full runs of
EMBL to find the odd guy (after thousands of ok records) that would
make my new regexp fall over ;)
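
A minimal sketch of that kind of check, assuming the ID lines have already
been pulled out of a flat file. The two patterns are copied from this thread;
the helper names parse_id and disagreements are hypothetical, and this is not
the script that was used when EMBL.pm was written.

```perl
use strict;
use warnings;

# ID-line patterns copied from the thread: the original embl.pm one and
# the proposed replacement.
my $old_re = qr/^ID\s+(\S+)\s+\S+\;\s+([^;]+)\;\s+(\S+)\;/;
my $new_re = qr/^ID\s+(\S+).*;\s+(\S+);\s+(\S+);/;

# Hypothetical helper: returns "name|mol|div", or "" if the pattern fails.
sub parse_id {
    my ($re, $line) = @_;
    my @fields = $line =~ $re;
    return @fields ? join('|', @fields) : '';
}

# Hypothetical helper: returns the ID lines on which the two patterns
# give different results, so they can be inspected by hand.
sub disagreements {
    my @lines = @_;
    return grep { parse_id($old_re, $_) ne parse_id($new_re, $_) }
           grep { /^ID\s/ } @lines;
}

# The two example ID lines quoted in this thread.
my @sample = (
    'ID TRBG361 standard; mRNA; PLN; 1859 BP.',
    'ID MMTCRGBV1 IMGT/LIGM annotation : by annotators; RNA; ROD; 290 BP.',
);

print "$_\n" for disagreements(@sample);
```

On the two ID lines quoted in this thread only the IMGT line is reported,
because the old pattern does not match it at all. For a real test, replace
@sample with the lines of a large EMBL release file and look at every line
that comes out.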
Cheers,
Elia
>
> I would make the following alteration to the regular expression:
>
> ($name, $mol, $div) = ($line =~ /^ID\s+(\S+).*;\s+(\S+);\s+(\S+);/);
>
> This should be 'better' Perl than my previous attempt.
>
> Cheers,
>
> Alex Brown
>
> On Friday, February 18, 2005, at 12:37 PM, Brian Osborne wrote:
>
>> Alex,
>>
>> Have you tested this change by reading through some large file of
>> EMBL-formatted sequences? The more you've tested this the happier I'd
>> be to change embl.pm for you.
>>
>> Brian O.
>>
>> -----Original Message-----
>> From: bioperl-l-bounces at portal.open-bio.org
>> [mailto:bioperl-l-bounces at portal.open-bio.org]On Behalf Of Alex Brown
>> Sent: Friday, February 18, 2005 6:16 AM
>> To: bioperl-l at portal.open-bio.org
>> Subject: [Bioperl-l] IMGT parsing
>>
>>
>> Hi.
>>
>> I had a small problem using Bio::SeqIO (in BioPerl 1.4) to parse the
>> IMGT flat file database. Although IMGT uses an EMBL-like format,
>> Bio::SeqIO was unable to extract display_id(), which is a bit of a
>> nuisance when converting between formats. This is due to a difference
>> between the ID line of the EMBL and the IMGT formats:
>>
>> EMBL -
>> ID TRBG361 standard; mRNA; PLN; 1859 BP.
>>
>> IMGT -
>> ID MMTCRGBV1 IMGT/LIGM annotation : by annotators; RNA; ROD; 290 BP.
>>
>> The following modification to embl.pm seems to allow correct parsing
>> of both formats. Change the lines:
>>
>> $line =~ /^ID\s+\S+/ || $self->throw("EMBL stream with no ID. Not embl in my book");
>> $line =~ /^ID\s+(\S+)\s+\S+\;\s+([^;]+)\;\s+(\S+)\;/;
>> $name = $1;
>> $mol = $2;
>> $div = $3;
>> if(! $name) {
>> $name = "unknown id";
>> }
>>
>> to :
>>
>> $line =~ /^ID\s+\S+/ || $self->throw("EMBL stream with no ID. Not embl in my book");
>> # this is the new line to replace the above, allowing IMGT records to be read as well
>> ($name, $mol, $div) = ($line =~ /^ID\s*(\S*).*;\s*(\S*);\s*(\S*);/);
>> if(! $name) {
>> $name = "unknown id";
>> }
>>
>> Hope this is useful.
>>
>> Alex Brown.
>>
>> PS. BACK-UP embl.pm before changing.
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at portal.open-bio.org
>> http://portal.open-bio.org/mailman/listinfo/bioperl-l
>>
>
---
Telethon Institute of Genetics and Medicine
Via Pietro Castellino, 111
80131 Napoli
Tel. +39 081 6132 335
Fax. +39 081 6132 351