[Bioperl-l] parsing GenBank file

Wed May 5 16:10:33 UTC 2010

Hi Chris,
            I will do that. So how can I solve my problem? Do you have any
suggestions?
I am thinking of taking all the accessions from the file I have and using
Bio::DB::GenBank to get the classification.
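
A minimal sketch of that idea, assuming the RDP file keeps one LOCUS and one
ACCESSION line per record, as in the record quoted further down. The sub names
extract_accessions and fetch_classification are my own and not part of
BioPerl; the only BioPerl calls are the ones from Chris's snippet
(get_Seq_by_acc, species, classification). It pairs each RDP id with its
accession, then asks NCBI for the lineage:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Pair the RDP id on each LOCUS line with the first token of the following
# ACCESSION line. Returns a list of [rdp_id, accession] pairs in file order.
sub extract_accessions {
    my ($fh) = @_;
    my (@pairs, $locus);
    while (my $line = <$fh>) {
        if ($line =~ /^LOCUS\s+(\S+)/) {
            $locus = $1;
        }
        elsif ($line =~ /^ACCESSION\s+(\S+)/ && defined $locus) {
            # Only the first token is kept; "REGION: 61786..63306" is dropped.
            push @pairs, [$locus, $1];
            undef $locus;
        }
    }
    return @pairs;
}

# Look up the lineage for one accession at NCBI. Returns an empty list if the
# fetch fails or the record has no species object.
sub fetch_classification {
    my ($gb, $acc) = @_;
    my $seq = eval { $gb->get_Seq_by_acc($acc) };
    return unless $seq && $seq->species;
    return $seq->species->classification;
}

# Only run the lookup when a file name is given on the command line.
if (@ARGV) {
    # Loaded lazily so the extraction step works without BioPerl installed.
    require Bio::DB::GenBank;
    my $gb = Bio::DB::GenBank->new();
    open my $fh, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    for my $pair (extract_accessions($fh)) {
        my ($rdp_id, $acc) = @$pair;
        my @lineage = fetch_classification($gb, $acc);
        print join("\t", $rdp_id, $acc, @lineage ? join(' ', @lineage) : 'NA'), "\n";
    }
    close $fh;
}
```

It would be run with the GenBank file as the only argument. Two caveats: one
lookup per record is likely to be slow for a file this size, and the lineage
NCBI returns can differ from the one RDP assigns (compare the two ORGANISM
blocks in Chris's reply below).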

Thanks
shalabh

On Wed, May 5, 2010 at 12:01 PM, Chris Fields <cjfields at illinois.edu> wrote:

> Shalabh,
>
> There are several problems with this file that make it somewhat
> non-GenBank-like.  It does parse (it has seq data), but the parser doesn't
> catch the SOURCE/ORGANISM block because of the non-canonical way the
> classification is displayed:
>
> SOURCE      Candidatus Azobacteroides pseudotrichonymphae genomovar. CFP2
> ORGANISM  Candidatus Azobacteroides pseudotrichonymphae genomovar. CFP2
>          Root; Bacteria; "Bacteroidetes"; "Bacteroidia"; "Bacteroidales";
>          "Porphyromonadaceae"; unclassified_"Porphyromonadaceae".
>
> It's different enough from the NCBI version (from here:
> http://www.ncbi.nlm.nih.gov/nuccore/212548595) that it's probably breaking
> the parser:
>
> SOURCE      Candidatus Azobacteroides pseudotrichonymphae genomovar. CFP2
> ORGANISM  Candidatus Azobacteroides pseudotrichonymphae genomovar. CFP2
>           Bacteria; Bacteroidetes; Bacteroidia; Bacteroidales; Candidatus
>          Azobacteroides.
>
> Please file this as a bug; we can take a look at it.  The format is a bit
> non-standard, so I can't promise it'll be fixed unless it's fairly easy to
> do.
>
> chris
>
> On May 5, 2010, at 10:38 AM, shalabh sharma wrote:
>
> > Hi Chris,
> >            I downloaded this file from RDP; it contains all bacterial 16S.
> >
> > Thanks
> > Shalabh
> >
> >
> > On Wed, May 5, 2010 at 11:32 AM, Chris Fields <cjfields at illinois.edu> wrote:
> >
> >> Shalabh,
> >>
> >> What is the source of this file?  It's not from GenBank; if I look up the
> >> parent sequence using Bio::DB::GenBank it works fine:
> >>
> >> use Modern::Perl;
> >> use Bio::DB::GenBank;
> >>
> >> my $id = 'AP010656';
> >>
> >> my $gb = Bio::DB::GenBank->new();
> >>
> >> my $seq = $gb->get_Seq_by_acc($id);
> >>
> >> say join(',',$seq->species->classification);
> >>
> >> chris
> >>
> >> On May 5, 2010, at 9:46 AM, shalabh sharma wrote:
> >>
> >>> Hi Torsten,
> >>>             Thanks for pointing that out. But this is just a warning;
> >>> it will not break the script. I found the point where the script is
> >>> breaking.
> >>> It's breaking and giving this message:
> >>> Can't call method "classification" on an undefined value at parseGB.pl line 9, <GEN0> line 10067733.
> >>>
> >>> So the script breaks when it reaches this record:
> >>>
> >>> LOCUS       S001198291              1521 bp    rRNA    linear   BCT 02-Feb-2009
> >>> DEFINITION  Candidatus Azobacteroides pseudotrichonymphae genomovar. CFP2.
> >>> ACCESSION   AP010656 REGION: 61786..63306
> >>> PROJECT     GenomeProject:29025
> >>> SOURCE      Candidatus Azobacteroides pseudotrichonymphae genomovar. CFP2
> >>> ORGANISM  Candidatus Azobacteroides pseudotrichonymphae genomovar. CFP2
> >>>        Root; Bacteria; "Bacteroidetes"; "Bacteroidia"; "Bacteroidales";
> >>>        "Porphyromonadaceae"; unclassified_"Porphyromonadaceae".
> >>> REFERENCE   1 (bases 1 to 1521)
> >>> AUTHORS   Toyoda A., Hongoh Y., Toh H., Hattori M., Ohkuma M., Sakaki Y.;
> >>> TITLE     ;
> >>> JOURNAL   Submitted (21-MAR-2008) to the EMBL/GenBank/DDBJ databases.
> >>>        Contact:Atsushi Toyoda National Institute of Genetics, Comparative
> >>>        Genomics Laboratory; Yata 1111, Mishima, Shizuoka 411-8540, Japan
> >>> REFERENCE   2
> >>> AUTHORS   Hongoh Y., Sharma V.K., Prakash T., Noda S., Toh H., Taylor T.D.,
> >>>        Kudo T., Sakaki Y., Toyoda A., Hattori M., Ohkuma M.;
> >>>
> >>> It is unable to parse this record, but I don't understand why.
> >>> The only reason I can think of is the organism's name, which is very
> >>> long compared to the others.
> >>>
> >>> Thanks
> >>> Shalabh
> >>>
> >>>
> >>>
> >>> On Wed, May 5, 2010 at 3:48 AM, Torsten Seemann <torsten.seemann at infotech.monash.edu.au> wrote:
> >>>
> >>>>> I have a huge GenBank file (downloaded from RDP, containing all
> >>>>> bacterial 16S). I just want to parse the RDP id (in LOCUS) and the
> >>>>> organism's lineage (in ORGANISM).
> >>>>> I am getting output like:
> >>>>> S000107505 uncultured Acidobacteria bacterium Geothrix Holophagaceae
> >>>>> Holophagales Holophagae "Acidobacteria" Bacteria Root
> >>>>> This is exactly the output I want, but I am missing a lot of records
> >>>>> (they are there in the GenBank file but not in my output).
> >>>>> I also got a warning during parsing:
> >>>>> --------------------- WARNING ---------------------
> >>>>> MSG: Unbalanced quote in:
> >>>>> /db_xref="taxon:35783" /germline"
> >>>>> /mol_type="genomic DNA"
> >>>>> /organism="Enterococcus sp."
> >>>>> /strain="LMG12316"No further qualifiers will be added for this feature
> >>>>> ---------------------------------------------------
> >>>>> So I was wondering: is this warning message causing the problem, or
> >>>>> am I doing something wrong?
> >>>>
> >>>> "Unbalanced quote" means there is not an even number (multiple of 2) of
> >>>> double-quote (") symbols around the tag's value. I can see that the
> >>>> first line (below) looks problematic:
> >>>>
> >>>> YOU HAVE:
> >>>>
> >>>> /db_xref="taxon:35783" /germline"
> >>>>
> >>>> SHOULD BE:
> >>>>
> >>>> /db_xref="taxon:35783"
> >>>> /germline
> >>>>
> >>>> I suspect there is a problem either with RDP's GenBank producer, or
> >>>> Bioperl is having a problem with the "germline" qualifier, which is a
> >>>> 'null valued' qualifier like /pseudo: it takes no ="value" string. (I
> >>>> think in Bioperl this is handled by setting the value to "_no_value"
> >>>> ?)
> >>>>
> >>>> http://www.ncbi.nlm.nih.gov/collab/FT/
> >>>>
> >>>> Qualifier       /germline
> >>>> Definition      the sequence presented in the entry has not undergone somatic
> >>>>           rearrangement as part of an adaptive immune response; it is the
> >>>>           unrearranged sequence that was inherited from the parental
> >>>>           germline
> >>>> Value format    none
> >>>> Example         /germline
> >>>> Comment         /germline should not be used to indicate that the source of
> >>>>           the sequence is a gamete or germ cell;
> >>>>           /germline and /rearranged cannot be used in the same source
> >>>>           feature;
> >>>>           /germline and /rearranged should only be used for molecules that
> >>>>           can undergo somatic rearrangements as part of an adaptive immune
> >>>>           response; these are the T-cell receptor (TCR) and immunoglobulin
> >>>>           loci in the jawed vertebrates, and the unrelated variable
> >>>>           lymphocyte receptor (VLR) locus in the jawless fish (lampreys
> >>>>           and hagfish);
> >>>>           /germline and /rearranged should not be used outside of the
> >>>>           Craniata (taxid=89593)
> >>>>
> >>>>
> >>>> --Torsten Seemann
> >>>> --Victorian Bioinformatics Consortium, Dept. Microbiology, Monash
> >>>> University, AUSTRALIA
> >>>>
> >>> _______________________________________________
> >>> Bioperl-l mailing list
> >>> Bioperl-l at lists.open-bio.org
> >>> http://lists.open-bio.org/mailman/listinfo/bioperl-l