[Bioperl-l] IMGT parsing

Alex Brown abrown at tech.mrc.ac.uk
Mon Feb 21 12:08:00 EST 2005


Hi.

Thanks for the quick response to my original e-mail. I have tested my 
modification on a small file of EMBL-formatted sequences and on a small 
file of IMGT-formatted sequences, both with success.

Since the only change I have made to the original code is a slight 
modification of the regular expression, there should be no reason why 
the method should not read large files of EMBL- or IMGT-formatted 
sequences.

I would make the following alteration to the regular expression:

($name, $mol, $div) = ($line =~ /^ID\s+(\S+).*;\s+(\S+);\s+(\S+);/);

This should be 'better' Perl than my previous attempt: using + instead 
of * means that each captured field must be non-empty.
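
For anyone who wants to check the patterns outside of embl.pm, here is a 
minimal standalone sketch. It runs the original embl.pm pattern and the 
proposed pattern against the two ID lines quoted in this thread. Two 
things in it are assumptions for illustration and do not come from the 
thread: the 'two_word' ID line (a made-up record whose molecule field 
contains a space) and the 'variant' pattern (the proposed pattern with 
the original [^;]+ kept for the molecule field). The helper parse_id() 
is also hypothetical and is not part of BioPerl.

```perl
use strict;
use warnings;

# ID lines: the first two are quoted in this thread; the third is a
# made-up (hypothetical) line whose molecule field contains a space.
my %id_line = (
    embl     => 'ID   TRBG361    standard; mRNA; PLN; 1859 BP.',
    imgt     => 'ID   MMTCRGBV1 IMGT/LIGM annotation : by annotators; RNA; ROD; 290 BP.',
    two_word => 'ID   XX000001   standard; genomic DNA; PLN; 100 BP.',
);

my %pattern = (
    # pattern from embl.pm as quoted in the original message
    original => qr/^ID\s+(\S+)\s+\S+\;\s+([^;]+)\;\s+(\S+)\;/,
    # pattern proposed in this message
    proposed => qr/^ID\s+(\S+).*;\s+(\S+);\s+(\S+);/,
    # assumption, not from the thread: keep [^;]+ for the molecule field
    variant  => qr/^ID\s+(\S+).*;\s+([^;]+);\s+(\S+);/,
);

# Hypothetical helper: returns "name|mol|div", or 'no match' when the
# pattern fails (a failed match in list context yields an empty list).
sub parse_id {
    my ($re, $line) = @_;
    my @fields = $line =~ $re;
    return @fields ? join('|', @fields) : 'no match';
}

for my $p (qw(original proposed variant)) {
    for my $l (qw(embl imgt two_word)) {
        printf "%-8s %-8s %s\n", $p, $l, parse_id($pattern{$p}, $id_line{$l});
    }
}
```

On the two lines from this thread, the proposed pattern captures the 
name, molecule and division, while the original pattern does not match 
the IMGT line at all. The made-up 'two_word' line is not matched by the 
proposed pattern, because (\S+) cannot span a space; if such ID lines 
occur in real data, something like the 'variant' pattern may be worth 
testing before changing embl.pm.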

Cheers,

Alex Brown

On Friday, February 18, 2005, at 12:37 PM, Brian Osborne wrote:

> Alex,
>
> Have you tested this change by reading through some large file of
> EMBL-formatted sequences? The more you've tested this the happier I'd
> be to change embl.pm for you.
>
> Brian O.
>
> -----Original Message-----
> From: bioperl-l-bounces at portal.open-bio.org
> [mailto:bioperl-l-bounces at portal.open-bio.org]On Behalf Of Alex Brown
> Sent: Friday, February 18, 2005 6:16 AM
> To: bioperl-l at portal.open-bio.org
> Subject: [Bioperl-l] IMGT parsing
>
>
> Hi.
>
> I had a small problem using Bio::SeqIO (in BioPerl 1.4) to parse the
> IMGT flat file database. Although IMGT uses an EMBL-like format,
> Bio::SeqIO was unable to extract display_id(), which is a bit of a
> nuisance when converting between formats. This is due to a difference
> between the ID line of the EMBL and the IMGT formats:
>
> EMBL -
> ID   TRBG361    standard; mRNA; PLN; 1859 BP.
>
> IMGT -
> ID   MMTCRGBV1 IMGT/LIGM annotation : by annotators; RNA; ROD; 290 BP.
>
> The following modification to embl.pm seems to allow correct parsing
> of both formats. Change the lines:
>
>     $line =~ /^ID\s+\S+/
>         || $self->throw("EMBL stream with no ID. Not embl in my book");
>     $line =~ /^ID\s+(\S+)\s+\S+\;\s+([^;]+)\;\s+(\S+)\;/;
>     $name = $1;
>     $mol = $2;
>     $div = $3;
>     if(! $name) {
>         $name = "unknown id";
>     }
>
> to :
>
>     $line =~ /^ID\s+\S+/
>         || $self->throw("EMBL stream with no ID. Not embl in my book");
>     # this is the new line to replace the above, allowing IMGT records
>     # to be read as well
>     ($name, $mol, $div) = ($line =~ /^ID\s*(\S*).*;\s*(\S*);\s*(\S*);/);
>     if(! $name) {
>         $name = "unknown id";
>     }
>
> Hope this is useful.
>
> Alex Brown.
>
> PS. BACK-UP embl.pm before changing.
>


