[Bioperl-l] parsing GenBank file

shalabh sharma shalabh.sharma7 at gmail.com
Wed May 5 14:46:19 UTC 2010


Hi Torsten,
Thanks for pointing that out. But that is just a warning; it will not
break the script. I found the point where the script breaks.
It dies with this message:

Can't call method "classification" on an undefined value at parseGB.pl line 9, <GEN0> line 10067733.

The script breaks when it reaches this record:

LOCUS       S001198291              1521 bp    rRNA    linear   BCT 02-Feb-2009
DEFINITION  Candidatus Azobacteroides pseudotrichonymphae genomovar. CFP2.
ACCESSION   AP010656 REGION: 61786..63306
PROJECT     GenomeProject:29025
SOURCE      Candidatus Azobacteroides pseudotrichonymphae genomovar. CFP2
  ORGANISM  Candidatus Azobacteroides pseudotrichonymphae genomovar. CFP2
            Root; Bacteria; "Bacteroidetes"; "Bacteroidia"; "Bacteroidales";
            "Porphyromonadaceae"; unclassified_"Porphyromonadaceae".
REFERENCE   1 (bases 1 to 1521)
  AUTHORS   Toyoda A., Hongoh Y., Toh H., Hattori M., Ohkuma M., Sakaki Y.;
  TITLE     ;
  JOURNAL   Submitted (21-MAR-2008) to the EMBL/GenBank/DDBJ databases.
            Contact:Atsushi Toyoda National Institute of Genetics, Comparative
            Genomics Laboratory; Yata 1111, Mishima, Shizuoka 411-8540, Japan
REFERENCE   2
  AUTHORS   Hongoh Y., Sharma V.K., Prakash T., Noda S., Toh H., Taylor T.D.,
            Kudo T., Sakaki Y., Toyoda A., Hattori M., Ohkuma M.;

It is unable to parse this record, but I don't understand why. The only
reason I can think of is the organism's name, which is very long compared
to the others.

Thanks
Shalabh



On Wed, May 5, 2010 at 3:48 AM, Torsten Seemann <torsten.seemann at infotech.monash.edu.au> wrote:

> > I have a huge GenBank file (downloaded from RDP, containing all
> > bacterial 16S). I just want to parse the RDP id (in LOCUS) and the
> > organism's lineage (in ORGANISM).
> > I am getting output like:
> > S000107505 uncultured Acidobacteria bacterium Geothrix Holophagaceae
> > Holophagales Holophagae "Acidobacteria" Bacteria Root
> > This is the exact output I want, but I am missing a lot of records
> > (they are there in the GenBank file but not in my output).
> > I also got a warning during parsing:
> > --------------------- WARNING ---------------------
> > MSG: Unbalanced quote in:
> > /db_xref="taxon:35783" /germline"
> > /mol_type="genomic DNA"
> > /organism="Enterococcus sp."
> > /strain="LMG12316"No further qualifiers will be added for this feature
> > ---------------------------------------------------
> > So I was just wondering: is this warning message causing that problem,
> > or am I doing something wrong?
>
> "Unbalanced quote" means there is not an even number (a multiple of 2)
> of double-quote (") symbols around the tag's value. I can see that the
> first line (below) looks problematic:
>
> YOU HAVE:
>
> /db_xref="taxon:35783" /germline"
>
> SHOULD BE:
>
> /db_xref="taxon:35783"
> /germline
>
> I suspect there is a problem either with RDP's GenBank producer, or
> Bioperl is having a problem with the "germline" qualifier, which is a
> 'null valued' qualifier like /pseudo - it takes no ="value" string. (I
> think in Bioperl this is handled by setting the value to "_no_value"
> ?)
>
> http://www.ncbi.nlm.nih.gov/collab/FT/
>
> Qualifier       /germline
> Definition      the sequence presented in the entry has not undergone
>                 somatic rearrangement as part of an adaptive immune
>                 response; it is the unrearranged sequence that was
>                 inherited from the parental germline
> Value format    none
> Example         /germline
> Comment         /germline should not be used to indicate that the source
>                 of the sequence is a gamete or germ cell;
>                 /germline and /rearranged cannot be used in the same
>                 source feature;
>                 /germline and /rearranged should only be used for
>                 molecules that can undergo somatic rearrangements as part
>                 of an adaptive immune response; these are the T-cell
>                 receptor (TCR) and immunoglobulin loci in the jawed
>                 vertebrates, and the unrelated variable lymphocyte
>                 receptor (VLR) locus in the jawless fish (lampreys and
>                 hagfish);
>                 /germline and /rearranged should not be used outside of
>                 the Craniata (taxid=89593)
>
>
> --Torsten Seemann
> --Victorian Bioinformatics Consortium, Dept. Microbiology, Monash
> University, AUSTRALIA
>
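
The even-number rule for double quotes described in the quoted reply can be
checked mechanically. The sketch below is in Python and only illustrates
the rule; it is not how Bioperl implements the check, and the function name
is made up for the example.

```python
def has_balanced_quotes(qualifier_text):
    """Return True if the text contains an even number of double quotes."""
    return qualifier_text.count('"') % 2 == 0


if __name__ == "__main__":
    # The problematic line from the warning, then the two corrected lines.
    for text in ('/db_xref="taxon:35783" /germline"',
                 '/db_xref="taxon:35783"',
                 '/germline'):
        print(has_balanced_quotes(text), text)
```

Running a check like this over the qualifier lines of the RDP file would show
how many records carry the same stray quote after /germline.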


