[Bioperl-l] parsing GenBank file

Chris Fields cjfields at illinois.edu
Wed May 5 15:32:41 UTC 2010


Shalabh,

What is the source of this file?  It's not from GenBank; if I look up the parent sequence using Bio::DB::GenBank it works fine:

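use Modern::Perl;
use Bio::DB::GenBank;

# Accession of the parent sequence named in the failing record
my $id = 'AP010656';

my $gb = Bio::DB::GenBank->new();

# Fetch the record from GenBank by accession
my $seq = $gb->get_Seq_by_acc($id);

# Print the lineage as a comma-separated list
say join(',',$seq->species->classification);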
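
For a local file, the error you quote below (Can't call method "classification" on an undefined value) suggests that $seq->species returned undef for that record. A minimal sketch of a loop that skips such records instead of dying; the helper name lineage_fields and taking the file name from the command line are my assumptions, not part of your script:

```perl
use strict;
use warnings;

# Hypothetical helper: returns the record ID followed by the lineage,
# or an empty list when the parser produced no species object.
sub lineage_fields {
    my ($seq) = @_;
    my $species = $seq->species;
    return unless defined $species;
    return ($seq->display_id, $species->classification);
}

# Only runs when a GenBank file is given on the command line.
if (@ARGV) {
    require Bio::SeqIO;
    my $in = Bio::SeqIO->new(-file => $ARGV[0], -format => 'genbank');
    while (my $seq = $in->next_seq) {
        my @fields = lineage_fields($seq);
        if (@fields) {
            print join(' ', @fields), "\n";
        }
        else {
            # Report the record instead of calling a method on undef
            warn 'No species parsed for ', $seq->display_id, "; skipped\n";
        }
    }
}
```

Records reported as skipped can then be inspected one at a time.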

chris

On May 5, 2010, at 9:46 AM, shalabh sharma wrote:

> Hi Torsten,
>                Thanks for pointing that out. But this is just a warning;
> it will not break the script. I found the point where the script is
> breaking.
> It breaks with this message:
> Can't call method "classification" on an undefined value at parseGB.pl line
> 9, <GEN0> line 10067733.
> 
> So the script breaks when it reaches this record:
> 
> LOCUS       S001198291              1521 bp    rRNA    linear   BCT 02-Feb-2009
> DEFINITION  Candidatus Azobacteroides pseudotrichonymphae genomovar. CFP2.
> ACCESSION   AP010656 REGION: 61786..63306
> PROJECT     GenomeProject:29025
> SOURCE      Candidatus Azobacteroides pseudotrichonymphae genomovar. CFP2
> ORGANISM  Candidatus Azobacteroides pseudotrichonymphae genomovar. CFP2
>           Root; Bacteria; "Bacteroidetes"; "Bacteroidia"; "Bacteroidales";
>           "Porphyromonadaceae"; unclassified_"Porphyromonadaceae".
> REFERENCE   1 (bases 1 to 1521)
> AUTHORS   Toyoda A., Hongoh Y., Toh H., Hattori M., Ohkuma M., Sakaki Y.;
> TITLE     ;
> JOURNAL   Submitted (21-MAR-2008) to the EMBL/GenBank/DDBJ databases.
>           Contact:Atsushi Toyoda National Institute of Genetics, Comparative
>           Genomics Laboratory; Yata 1111, Mishima, Shizuoka 411-8540, Japan
> REFERENCE   2
> AUTHORS   Hongoh Y., Sharma V.K., Prakash T., Noda S., Toh H., Taylor T.D.,
>           Kudo T., Sakaki Y., Toyoda A., Hattori M., Ohkuma M.;
> 
> It is unable to parse this record, but I don't understand why. The only
> reason I can think of is the organism's name, which is very long compared
> to the others.
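
One way to test that hypothesis is to pull the failing record out into its own small file and parse it alone. A rough sketch in plain Perl, assuming records end with a line containing only // (as in the GenBank flat-file format) and Unix line endings; the sub name extract_record and the two command-line arguments are illustrative:

```perl
use strict;
use warnings;

# Return the first record whose LOCUS line carries the given ID,
# or undef/empty list when no such record exists.
sub extract_record {
    my ($path, $locus_id) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    local $/ = "//\n";    # read one whole record per iteration
    while (my $record = <$fh>) {
        $record =~ s/\A\s+//;    # drop blank lines between records
        return $record if $record =~ /\ALOCUS\s+\Q$locus_id\E\s/;
    }
    return;
}

# Usage (hypothetical): script.pl <genbank-file> <locus-id>
if (@ARGV == 2) {
    my $record = extract_record(@ARGV);
    if (defined $record) {
        print $record;
    }
    else {
        warn "No record found for $ARGV[1]\n";
    }
}
```

For example, perl extract_record.pl rdp.gb S001198291 > one_record.gb, where both file names are placeholders. If the single-record file fails in the same way, the length of the file is ruled out and the record's own formatting is the place to look.
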
> 
> Thanks
> Shalabh
> 
> 
> 
> On Wed, May 5, 2010 at 3:48 AM, Torsten Seemann <torsten.seemann at infotech.monash.edu.au> wrote:
> 
>>>    I have a huge GenBank file (downloaded from RDP, containing all
>>> bacterial 16S). I just want to parse the RDP ID (in LOCUS) and the
>>> organism's lineage (in ORGANISM).
>>> I am getting output like:
>>> S000107505 uncultured Acidobacteria bacterium Geothrix Holophagaceae
>>> Holophagales Holophagae "Acidobacteria" Bacteria Root
>>> This is exactly the output I want, but I am missing a lot of records (they
>>> are there in the GenBank file but not in my output).
>>> I also got a warning during parsing:
>>> --------------------- WARNING ---------------------
>>> MSG: Unbalanced quote in:
>>> /db_xref="taxon:35783" /germline"
>>> /mol_type="genomic DNA"
>>> /organism="Enterococcus sp."
>>> /strain="LMG12316"No further qualifiers will be added for this feature
>>> ---------------------------------------------------
>>> So I was wondering: is this warning causing that problem, or
>>> am I doing something wrong?
>> 
>> "Unbalanced quote" means there is not an even number (multiple of 2) of
>> double-quote (") symbols around the tag's value. I can see that the
>> first line (below) looks problematic:
>> 
>> YOU HAVE:
>> 
>> /db_xref="taxon:35783" /germline"
>> 
>> SHOULD BE:
>> 
>> /db_xref="taxon:35783"
>> /germline
>> 
>> I suspect there is a problem either with RDP's GenBank producer, or
>> Bioperl is having a problem with the "germline" qualifier, which is a
>> 'null valued' qualifier like /pseudo: it takes no ="value" string. (I
>> think in Bioperl this is handled by setting the value to "_no_value"
>> ?)
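
The even-number rule can be illustrated with a few lines of plain Perl. This is only a sketch of the idea, not Bioperl's own check; the sub name quotes_balanced is made up for the example, and it assumes it is handed a complete qualifier (or group of qualifiers) rather than one line of a value that continues on the next line:

```perl
use strict;
use warnings;

# True when the text contains an even number of double-quote characters.
sub quotes_balanced {
    my ($text) = @_;
    my $count = ($text =~ tr/"//);    # tr with an empty replacement only counts
    return $count % 2 == 0;
}

# The problematic line from the warning, followed by the two corrected lines
my @samples = (
    '/db_xref="taxon:35783" /germline"',
    '/db_xref="taxon:35783"',
    '/germline',
);

for my $line (@samples) {
    printf "%-40s %s\n", $line,
        quotes_balanced($line) ? 'balanced' : 'UNBALANCED';
}
```

Running something like this over the qualifiers of the records that go missing would show whether the stray quote is confined to /germline entries.
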
>> 
>> http://www.ncbi.nlm.nih.gov/collab/FT/
>> 
>> Qualifier       /germline
>> Definition      the sequence presented in the entry has not undergone somatic
>>                 rearrangement as part of an adaptive immune response; it is the
>>                 unrearranged sequence that was inherited from the parental
>>                 germline
>> Value format    none
>> Example         /germline
>> Comment         /germline should not be used to indicate that the source of
>>                 the sequence is a gamete or germ cell;
>>                 /germline and /rearranged cannot be used in the same source
>>                 feature;
>>                 /germline and /rearranged should only be used for molecules that
>>                 can undergo somatic rearrangements as part of an adaptive immune
>>                 response; these are the T-cell receptor (TCR) and immunoglobulin
>>                 loci in the jawed vertebrates, and the unrelated variable
>>                 lymphocyte receptor (VLR) locus in the jawless fish (lampreys
>>                 and hagfish);
>>                 /germline and /rearranged should not be used outside of the
>>                 Craniata (taxid=89593)
>> 
>> 
>> --Torsten Seemann
>> --Victorian Bioinformatics Consortium, Dept. Microbiology, Monash
>> University, AUSTRALIA
>> 
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l




