[Bioperl-l] Trying to get a mysql DB from genbank flat files
Raphael LaFrance
rafe@scinq.org
Thu, 15 Nov 2001 14:30:50 -0500
Here's the record that crashes it. I'm adding to an empty DB... if that
matters.
==============================Cut
Here========================================
GBBCT1.SEQ Genetic Sequence Data Bank
October 15 2001
NCBI-GenBank Flat File Release 126
Bacterial Sequences (Part 1)
17495 loci, 103824523 bases, from 17495 reported sequences
LOCUS AARPOB2 871 bp DNA BCT
03-FEB-2000
DEFINITION Abiotrophia adiacens RNA polymerase beta subunit (rpoB)
gene,
partial cds.
ACCESSION AF194508
VERSION AF194508.1 GI:6449110
KEYWORDS .
SEGMENT 2 of 2
SOURCE Granulicatella adiacens.
ORGANISM Granulicatella adiacens
Bacteria; Firmicutes; Bacillus/Clostridium group;
Granulicatella.
REFERENCE 1 (bases 1 to 871)
AUTHORS Majewski,J., Zawadzki,P., Pickerill,P., Cohan,F.M. and
Dowson,C.G.
TITLE Barriers to genetic exchange between bacterial species:
Streptococcus pneumoniae transformation
JOURNAL J. Bacteriol. 182 (4), 1016-1023 (2000)
MEDLINE 20115546
PUBMED 10648528
REFERENCE 2 (bases 1 to 871)
AUTHORS Majewski,J., Zawadzki,P., Pickerill,P., Cohan,F.M. and
Dowson,C.G.
TITLE Direct Submission
JOURNAL Submitted (13-OCT-1999) Biology, Wesleyan University, Church
Street, Middletown, CT 06457, USA
FEATURES Location/Qualifiers
source 1..871
/organism="Granulicatella adiacens"
/strain="ATCC 49175"
/db_xref="taxon:46124"
/db_xref="ATCC:49175"
gene order(AF194507.1:<1..510,1..>871)
/gene="rpoB"
CDS <1..>871
/gene="rpoB"
/codon_start=1
/transl_table=11
/product="RNA polymerase beta subunit"
/protein_id="AAF08832.1"
/db_xref="GI:6449113"
/translation="GELTYKRRLSALGPGGLTRDRAGYEVRDVHYSHYGRMCPIETPE
GPNIGLINSLSTYAKINKYGFIETPYRRVDWNTHKVTDKIDYLTADEEDSFVVAQANS
PLNEDGSFVNDVVMARYVSENLEVPVERVDYMDVSPKQVVAVATACIPFLENDDSNRA
LMGANMQRQAVPLLNPKAPFIGTGMEYVSAHDSGVALLCKRDGVVEFVDAKEVRVRTA
DGSLDTYHITKFHGSNAGMCYNQRPIVAQGDKVVKGEILADGPSMEKGELALGQNVLV
AFMTWEGYIYEDAV"
BASE COUNT 263 a 174 c 206 g 228 t
ORIGIN
1 ggagagttaa catacaaacg ccgtctatca gcgttaggac ctggtggttt
gactcgtgac
61 cgtgctggat atgaagttcg tgacgttcac tattctcact atggtcgtat
gtgtcctatc
121 gaaactcctg aaggaccaaa catcgggttg atcaacagct tatcaaccta
tgcgaagatc
181 aataaatatg gtttcatcga aactccatac cgtcgtgtag actggaacac
tcataaagtt
241 acagataaaa ttgactactt aacagctgac gaagaagata gcttcgtagt
agcgcaagca
301 aactctccat taaatgaaga tggaagcttc gtgaatgatg ttgttatggc
gcgttacgta
361 tctgaaaact tagaagtgcc agtagaacgc gttgactata tggacgtttc
tccaaaacaa
421 gtagttgcag ttgcgacagc atgtatcccg ttcttagaaa acgacgactc
aaaccgtgcg
481 ttgatgggtg cgaacatgca acgtcaagct gttccattgt taaatccaaa
agcaccattc
541 atcggtacag gtatggaata cgtatctgca catgactcag gggttgcctt
gttatgtaaa
601 cgtgatggtg tagtcgaatt cgttgatgct aaagaagtac gtgtacgtac
agctgatggc
661 tcattagata cttaccacat cactaagttc cacggatcaa acgcgggtat
gtgttacaac
721 caacgtccaa tcgtggcaca aggggataaa gtcgttaaag gcgaaatcct
agcagatgga
781 ccttctatgg aaaaaggtga attagcatta ggacaaaacg ttctagtagc
gttcatgact
841 tgggaaggtt acatctacga ggatgcggtt a
//
================================Cut
Here========================================
sorry for the slow turnaround.
rafe
Raphael LaFrance wrote:
>
> Sorry for delay in replying... meetings all morning.
>
> Yup, crazy chars abound in the Genbank data. That quote() function is
> sorely missing in my program & a much better solution than my 1st try,
> which was rather brute force-ish. I also like to strip mutil-whitespace
> chars down to a single space char.
>
> Oh yea, you probably already knew this but... on quoted data like in the
> feature's qualifiers you cannot depend on the "/" being a reliable
> delimiter you have to look for the closing quote via a strip of ending
> white space & a check of the last char found, well last two if you think
> there might be a \" sequence in the data itself, which I didn't find in
> the 60 or so files I stripped but you never know.
>
> Maybe new bugs will be by using the substr function, I've seen some odd
> things in the data set. This, however, is a peculiar record where format
> is the same but the fields may be missing & there are no easy delimiters
> to go after. I also didn't post that the date was thrown off too. Same
> problem tho.
>
> I'm of course I am willing to send the entire rippers but I'm not sure
> how much use they'll be (aside from comic relief :D) given that I'm a
> total newbie.
>
> Apologies. I'm trying to track down the nasty record but the obvious n+1
> isn't crashing it. More as soon as I get out of the next set of
> meetings.
>
> rafe
>
> "Osborne, Brian" wrote:
> >
> > Wilfred,
> >
> > Yes. And there is also the quote() method, which escapes offending
> > characters. However, I don't know what the complete offending character set
> > is and how it overlaps with the set of annoying characters in the
> > description, it might not help here.
> >
> > quote
> > $sql = $dbh->quote($string);
> > This method escapes special characters (quotation marks, etc.) from strings
> > and adds the required outer quotation marks. May not be able to handle all
> > types of input (i.e binary data).
> >
> > Example using the above methods:
> > #!/usr/bin/perl -w
> > use DBI;
> > use strict;
> > my $dbh = DBI->connect("DBI:mysql:contacts",undef,undef)
> > or die "Unable to connect to contacts Database: $dbh->errstr\n";
> > my $sth = $dbh->prepare("SELECT uid FROM contact WHERE last_name =
> > 'Flaherty'");
> > $sth->execute or die "Unable to execute query: $dbh->errstr\n";
> > my $row = $sth->fetchrow_arrayref;
> > my $uid = $row->[0];
> > $sth->finish;
> > my $newname = $dbh->quote("The Flahertys'");
> > my $statement =qq(UPDATE contact SET last_name = '$newname'
> > WHERE uid = $uid);
> > my $rc = $dbh->do($statement) or die "Unable to prepare/execute $statement:
> > $dbh->errstr\n";
> > print "$rc rows were updated\n";
> > $dbh->disconnect;
> > exit;
> >
> > Brian O.
> >
> > -----Original Message-----
> > From: Wilfred Li, Ph.D. [mailto:wilfred@sdsc.edu]
> > Sent: Thursday, November 15, 2001 12:41 PM
> > To: bioperl-l@bioperl.org
> > Subject: RE: [Bioperl-l] Trying to get a mysql DB from genbank flat
> > files
> >
> > >thanks for your mail, you have magically bumped into our daily nightmare,
> > >that is that very often people put crazy characters in the description
> > >lines, and we have to find ways of backslashing all of them otherwise
> > >they will break mysql statements. Could you mail me the offending record?
> > Hi,
> >
> > If bind variables are used in place of a plain insert statement, many of
> > the special characters will be taken care of by perl DBI. e.g.
> >
> > $sth->prepare("insert ... values (?, ..., ?)");
> > $sth->execute($id, ..., $kw);
> >
> > or use
> >
> > $dbh->do("insert ... values (?, ..., ?)", $id, ..., $kw);
> >
> > to combine the two steps into one.
> >
> > I had the problem with SeqAdaptor.pm when parsing SwissProt.
> >
> > Wilfred
> >
> > _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l@bioperl.org
> > http://bioperl.org/mailman/listinfo/bioperl-l
> > _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l@bioperl.org
> > http://bioperl.org/mailman/listinfo/bioperl-l