[Bioperl-l] parsing GenBank file

shalabh sharma shalabh.sharma7 at gmail.com
Wed May 5 15:38:11 UTC 2010


Hi Chris,
              I downloaded this file from RDP; it contains all bacterial 16S sequences.

Thanks
Shalabh


On Wed, May 5, 2010 at 11:32 AM, Chris Fields <cjfields at illinois.edu> wrote:

> Shalabh,
>
> What is the source of this file?  It's not from GenBank; if I look up the
> parent sequence using Bio::DB::GenBank it works fine:
>
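> use Modern::Perl;
> use Bio::DB::GenBank;
>
> # Accession taken from the ACCESSION line of the failing record
> my $id = 'AP010656';
>
> # Fetch the record directly from GenBank (requires network access)
> my $gb = Bio::DB::GenBank->new();
>
> my $seq = $gb->get_Seq_by_acc($id);
>
> # Print the taxonomic classification as a comma-separated list
> say join(',',$seq->species->classification);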
>
> chris
>
> On May 5, 2010, at 9:46 AM, shalabh sharma wrote:
>
> > Hi Torsten,
> >                Thanks for pointing that out. But this is just a warning;
> > it will not break the script. I found the point where the script is
> > breaking. It breaks with this message:
> > Can't call method "classification" on an undefined value at parseGB.pl line
> > 9, <GEN0> line 10067733.
> >
> > So the script breaks when it reaches this record:
> >
> > LOCUS       S001198291              1521 bp    rRNA    linear   BCT 02-Feb-2009
> > DEFINITION  Candidatus Azobacteroides pseudotrichonymphae genomovar. CFP2.
> > ACCESSION   AP010656 REGION: 61786..63306
> > PROJECT     GenomeProject:29025
> > SOURCE      Candidatus Azobacteroides pseudotrichonymphae genomovar. CFP2
> > ORGANISM  Candidatus Azobacteroides pseudotrichonymphae genomovar. CFP2
> >           Root; Bacteria; "Bacteroidetes"; "Bacteroidia"; "Bacteroidales";
> >           "Porphyromonadaceae"; unclassified_"Porphyromonadaceae".
> > REFERENCE   1 (bases 1 to 1521)
> > AUTHORS   Toyoda A., Hongoh Y., Toh H., Hattori M., Ohkuma M., Sakaki Y.;
> > TITLE     ;
> > JOURNAL   Submitted (21-MAR-2008) to the EMBL/GenBank/DDBJ databases.
> >           Contact:Atsushi Toyoda National Institute of Genetics, Comparative
> >           Genomics Laboratory; Yata 1111, Mishima, Shizuoka 411-8540, Japan
> > REFERENCE   2
> > AUTHORS   Hongoh Y., Sharma V.K., Prakash T., Noda S., Toh H., Taylor T.D.,
> >           Kudo T., Sakaki Y., Toyoda A., Hattori M., Ohkuma M.;
> >
> > The script is unable to parse this record, but I don't understand why.
> > The only reason I can think of is the organism's name, which is very long
> > compared to the others.
> >
> > Thanks
> > Shalabh
> >
> >
> >
> > On Wed, May 5, 2010 at 3:48 AM, Torsten Seemann <
> > torsten.seemann at infotech.monash.edu.au> wrote:
> >
> >>>    I have a huge GenBank file (downloaded from RDP, containing all
> >>> bacterial 16S). I just want to parse the RDP ID (in LOCUS) and the
> >>> organism's lineage (in ORGANISM).
> >>> I am getting output like:
> >>> S000107505 uncultured Acidobacteria bacterium Geothrix Holophagaceae
> >>> Holophagales Holophagae "Acidobacteria" Bacteria Root
> >>> This is exactly the output I want, but I am missing a lot of records
> >>> (they are in the GenBank file but not in my output).
> >>> I also got a warning during parsing:
> >>> --------------------- WARNING ---------------------
> >>> MSG: Unbalanced quote in:
> >>> /db_xref="taxon:35783" /germline"
> >>> /mol_type="genomic DNA"
> >>> /organism="Enterococcus sp."
> >>> /strain="LMG12316"No further qualifiers will be added for this feature
> >>> ---------------------------------------------------
> >>> So I was wondering: is this warning causing the problem, or am I doing
> >>> something wrong?
> >>
> >> "Unbalanced quote" means there is not an even number (a multiple of 2)
> >> of double-quote (") symbols around the tag's value. I can see that the
> >> first line (below) looks problematic:
> >>
> >> YOU HAVE:
> >>
> >> /db_xref="taxon:35783" /germline"
> >>
> >> SHOULD BE:
> >>
> >> /db_xref="taxon:35783"
> >> /germline
> >>
> >> I suspect there is a problem either with RDP's GenBank producer, or
> >> Bioperl is having a problem with the "germline" qualifier, which is a
> >> 'null valued' qualifier like /pseudo - it takes no ="value" string. (I
> >> think in Bioperl this is handled by setting the value to "_no_value"
> >> ?)
> >>
> >> http://www.ncbi.nlm.nih.gov/collab/FT/
> >>
> >> Qualifier       /germline
> >> Definition      the sequence presented in the entry has not undergone somatic
> >>                 rearrangement as part of an adaptive immune response; it is the
> >>                 unrearranged sequence that was inherited from the parental
> >>                 germline
> >> Value format    none
> >> Example         /germline
> >> Comment         /germline should not be used to indicate that the source of
> >>                 the sequence is a gamete or germ cell;
> >>                 /germline and /rearranged cannot be used in the same source
> >>                 feature;
> >>                 /germline and /rearranged should only be used for molecules that
> >>                 can undergo somatic rearrangements as part of an adaptive immune
> >>                 response; these are the T-cell receptor (TCR) and immunoglobulin
> >>                 loci in the jawed vertebrates, and the unrelated variable
> >>                 lymphocyte receptor (VLR) locus in the jawless fish (lampreys
> >>                 and hagfish);
> >>                 /germline and /rearranged should not be used outside of the
> >>                 Craniata (taxid=89593)
> >>
> >>
> >> --Torsten Seemann
> >> --Victorian Bioinformatics Consortium, Dept. Microbiology, Monash
> >> University, AUSTRALIA
> >>
>
>
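
For readers following this thread: the failing record carries an RDP-style lineage under ORGANISM, and the parser returned no species object for it. If only the LOCUS ID and the lineage are needed, one possible workaround is to read those two fields straight from the text. The sketch below is not BioPerl code; locus_and_lineage is a hypothetical helper, and it assumes that the organism name fits on the ORGANISM line itself, that the lineage is the indented text following it, and that records end with a "//" line.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of BioPerl): given one GenBank-style record
# as a string, return (locus_id, \@lineage) with the lineage ordered as in
# the file (Root first). Returns an empty list if either field is missing.
sub locus_and_lineage {
    my ($record) = @_;

    # The identifier is the first token after the LOCUS keyword.
    my ($locus) = $record =~ /^LOCUS\s+(\S+)/m or return;

    # Assumption: the organism name occupies the ORGANISM line only, and the
    # indented lines that follow it hold the semicolon-separated lineage.
    my ($block) = $record =~ /^[ \t]*ORGANISM[^\n]*\n((?:[ \t]+\S[^\n]*\n?)*)/m
        or return;

    $block =~ s/\s+/ /g;         # unwrap continuation lines
    $block =~ s/^\s+|\s+$//g;    # trim leading and trailing whitespace
    $block =~ s/\.$//;           # drop the terminating period

    my @ranks = grep { length } split /\s*;\s*/, $block;
    return unless @ranks;
    return ($locus, \@ranks);
}

# Stream a multi-record file one record at a time when a path is given.
if (@ARGV) {
    local $/ = "//\n";    # assumption: each record ends with a "//" line
    open my $fh, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    while (my $record = <$fh>) {
        my ($id, $ranks) = locus_and_lineage($record) or next;
        # Most specific rank first, similar to the output quoted above.
        print join("\t", $id, reverse @$ranks), "\n";
    }
    close $fh;
}
```

Reading with the record separator set to "//\n" keeps only one record in memory at a time, which matters for a file of this size. Records that lack either field are skipped silently here; printing their LOCUS lines instead would show which entries are being dropped.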

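The "Unbalanced quote" rule Torsten describes can be illustrated with a simple count of double-quote characters. This is only a sketch of the idea, not BioPerl's actual implementation; quotes_balanced is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: true when the text contains an even number of
# double-quote characters, false otherwise.
sub quotes_balanced {
    my ($text) = @_;
    my $count = () = $text =~ /"/g;    # count every double-quote character
    return $count % 2 == 0;
}

# The two qualifier lines discussed in the thread.
for my $line ('/db_xref="taxon:35783"', '/db_xref="taxon:35783" /germline"') {
    printf "%-40s %s\n", $line, quotes_balanced($line) ? 'balanced' : 'unbalanced';
}
```

A check like this could be run over the qualifier lines of the RDP file beforehand to list the entries that will trigger the warning.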

