[Bioperl-l] Can't parse bacterial strain from EMBL OS or RC lines

Jason Stajich jason.stajich at duke.edu
Tue May 2 18:36:08 UTC 2006


This is really a limitation of the EMBL/GenBank format.

See this thread:
http://lists.open-bio.org/pipermail/bioperl-l/2006-March/021068.html

or on GMANE
http://comments.gmane.org/gmane.comp.lang.perl.bio.general/10557

I don't know whether any of this has really been resolved, so hopefully
James will speak up if he's implemented anything.
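
In the meantime, one possible stopgap is to read the strain straight
from the raw OS and RC lines instead of going through Bio::SeqIO. What
follows is only a minimal sketch, assuming the lines look like the two
examples quoted below (plus the "(strain X)" form that some OS lines
use). The helper name strains_from_entry is made up for illustration;
it is not a BioPerl method, and it has not been tried on a real TrEMBL
release.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper, not part of BioPerl: return the strain name(s)
# found in the raw text of a single flat-file entry.
sub strains_from_entry {
    my ($entry) = @_;
    my (@strains, %seen);

    for my $line (split /\n/, $entry) {
        my @found;
        if ($line =~ /^RC\s+(.*)$/) {
            # Assumption: RC holds ';'-separated TOKEN=value pairs,
            # possibly ending in '.', e.g. "RC STRAIN=ABC123."
            for my $token (split /;\s*/, $1) {
                push @found, $1 if $token =~ /^STRAIN=(.+?)\.?\s*$/;
            }
        }
        elsif ($line =~ /^OS\s+.*\(strain\s+([^)]+)\)/) {
            push @found, $1;    # "OS Genus species (strain X)."
        }
        elsif ($line =~ /^OS\s+.*\bstrain\s+(.+?)\.?\s*$/) {
            push @found, $1;    # "OS Genus species strain X."
        }
        # Keep each strain once, in order of first appearance.
        push @strains, grep { !$seen{$_}++ } @found;
    }
    return @strains;
}

# Small in-memory demo built from the example lines in the question.
my $demo = <<'END';
ID   TEST_ENTRY
OS   Somegenus somespecies subsp. somesubspecies strain ABC123.
RC   STRAIN=ABC123.
//
END

print join(', ', strains_from_entry($demo)), "\n";
```

To run this over a whole file, one option is to set $/ to "//\n" so
that each read returns one complete entry, and then keep only the
entries whose strain matches the one you want.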

-jason
On May 2, 2006, at 7:41 AM, Mark A. Miller wrote:

> Hello all.
>
> I have a recently downloaded UniProt/TrEMBL flat file.  I am trying to
> make FASTA subset files for some bacterial strains.  I haven't been
> able to parse out the strain information from the OS or RC lines.
> These lines typically look like:
>
> OS Somegenus somespecies subsp. somesubspecies strain ABC123.
> RC STRAIN=ABC123.
>
> I'm not especially good with Perl, and I'm definitely weak when it
> comes to OOP.
>
> I have included some code I pasted together from various pages on the
> bioperl wiki.  In addition to the wiki, I have been making use of
> www.pasteur.fr/recherche/unites/sis/formation/bioperl/ch02s02.html
>
> The code I have so far reports the species but not the subspecies or
> variant.  I have also tried to walk through all of the feature,
> annotation and reference objects but I still can't seem to parse out
> the information I need.  (For brevity, the example I'm including below
> only lists the code I used for the annotation objects.)  Also, this
> code only prints the information...  I know that I'll have to write a
> FASTA sequence object separately.
>
> Any suggestions?
>
> Thanks,
> Mark
>
> ---   ---   ---
>
>
> #!/usr/bin/perl
> use strict;
> use warnings;
> use Bio::SeqIO;
>
> my $usage  = "getaccs.pl file format\n";
> my $file   = shift or die $usage;
> my $format = shift or die $usage;
>
> my $inseq = Bio::SeqIO->new(-file   => "<$file",
>                             -format => $format);
>
> while (my $seq = $inseq->next_seq) {
>
>   # Species information; some accessors may return undef
>   my $species_object = $seq->species;
>   my $species_string = $species_object->species;
>   my $variant_string = $species_object->variant     // '';
>   my $common_string  = $species_object->common_name // '';
>   my $sub_string     = $species_object->sub_species // '';
>   my $binomial       = $species_object->binomial('FULL');
>
>   print "display   ",$seq->display_id,"\n";
>   print "accession ",$seq->accession_number,"\n";
>   print "desc      ",$seq->desc,"\n";
>
>   print "species   ",$species_string,"\n";
>   print "variant   ",$variant_string,"\n";
>   print "common    ",$common_string,"\n";
>   print "sub       ",$sub_string,"\n";
>   print "binomial  ",$binomial,"\n";
>
>   print $seq->seq,"\n";
>
>   # Walk every annotation attached to the sequence
>   my $anno_collection = $seq->annotation;
>   for my $key ( $anno_collection->get_all_annotation_keys ) {
>     my @annotations = $anno_collection->get_Annotations($key);
>     for my $value ( @annotations ) {
>       print "tagname : ", $value->tagname, "\n";
>       # $value is a Bio::AnnotationI object and has an "as_text" method
>       print "  annotation value: ", $value->as_text, "\n";
>
>       # Reference annotations can also be dumped field by field
>       if ($value->tagname eq "reference") {
>         my $hash_ref = $value->hash_tree;
>         for my $field (keys %{$hash_ref}) {
>           print $field,": ",$hash_ref->{$field} // '',"\n";
>         }
>       }
>     }
>   }
>   print "\n";
> }
>
> exit;
>
>
> ---   ---   ---   ---   ---   ---   ---   ---
>
> Mark A. Miller
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

--
Jason Stajich
Duke University
http://www.duke.edu/~jes12




