[Bioperl-l] error with files containing multiple BSML formatted entries

Lincoln Stein lstein at cshl.org
Mon Mar 24 11:17:25 EST 2003


Each BSML-formatted sequence must occupy one and only one file.
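
If the entries are already concatenated in one file, one way to get there is to split the file at each XML declaration before handing the pieces to the converter. What follows is only a rough sketch, not part of Bioperl: the sub names (split_xml_documents, write_documents) and the output naming scheme are made up for illustration, and it assumes every entry begins with its own "<?xml ...?>" declaration, as in the sample quoted below.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split a string holding several concatenated XML documents into one
# string per document. Assumption: every document starts with its own
# "<?xml ...?>" declaration.
sub split_xml_documents {
    my ($text) = @_;
    # The lookahead splits in front of each declaration without
    # consuming it, so every chunk keeps its "<?xml" marker.
    my @docs = grep { /\S/ } split /(?=<\?xml\s)/, $text;
    return @docs;
}

# Write each document to its own file: <prefix>.1.bsml, <prefix>.2.bsml, ...
# (hypothetical naming scheme). Returns the list of file names written.
sub write_documents {
    my ($prefix, @docs) = @_;
    my @files;
    my $n = 0;
    for my $doc (@docs) {
        my $name = sprintf "%s.%d.bsml", $prefix, ++$n;
        open(my $out, '>', $name) or die "Cannot write $name: $!\n";
        print {$out} $doc;
        close($out) or die "Cannot close $name: $!\n";
        push @files, $name;
    }
    return @files;
}

# Usage: perl split_bsml.pl multi.bsml outprefix
if (@ARGV == 2) {
    my ($in_file, $prefix) = @ARGV;
    open(my $in, '<', $in_file) or die "Cannot open $in_file: $!\n";
    my $text = do { local $/; <$in> };    # slurp the whole file
    close($in);
    print "$_\n" for write_documents($prefix, split_xml_documents($text));
}
```

Each file written this way holds a single document, so it can be opened on its own with Bio::SeqIO->new('-file' => $name, '-format' => 'bsml') and read with next_seq() as in the script below.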

Lincoln

On Friday 14 March 2003 02:25 pm, Kevin Clancy wrote:
> Hi Folks
> I have written a program very similar to the universal reformatter
> program in the tutorial FAQ on the site. When I run a file of bsml
> formatted sequences through the converter I get the following error:
> Junk after document element at line 81, column 0, byte 4550 at
> /usr/lib/perl5/site_perl/5.6.0/i586-linux/XML/Parser.pm line 185
>
> Line 81 in the file of bsml formatted sequences corresponds to the
> beginning of a new bsml document and looks like this:
> <?xml version="1.0"?>
>
> I've included the program I'm working with for reference. Simply take a
> file of GenBank sequences and run them against the program - it will
> produce a file of bsml formatted sequences. Then take the file of bsml
> formatted sequences and run them against the program. I've also included
> some of the bsml formatted file to run against the program.
>
> My question is: is this a bug, or should I process the bsml file into
> separate entries before submitting to the sequence reformatter segment of
> the program?
>
> Thanks in advance for the help.
>
> kevin
>
> #!/usr/bin/perl -w
> ########################################################################
> ###############
> # DataConverter.pl
> #
> #Program to convert genbank, swissprot or fasta files to bsml, or bsml
> #files to genbank
> #run program, input path and file name (/home/kclancy/frog.nt)
> #program checks for input data type and whether file is prematurely
> #terminated
> #if all is well, it converts the data as described above.
> #
> ########################################################################
> ################
> use strict;
> use Bio::SeqIO;
>
> my $fh;
> my $offset;
> my ($fileType, $fileEnd);
> my ($returnType, $outFile);
> my $library = <STDIN>;
> chomp $library;
> my $libraryOut = "/home/kclancy/seqout";
> my (@extensions) = qw(bsml fsa gb swp);
>
> $fh = open_file($library);
>
> $fileType = determine_datatype($fh);
>
> unless ($fileType =~ /genbank/ || $fileType =~ /bsml/ || $fileType =~
> /swissprot/ || $fileType =~ /fasta/) {
>     print STDERR "Unrecognised input file type.\n";
>     exit;
> }
>
> $fileEnd = end_check($fileType, $fh);
> if ($fileEnd == 1) {
>     print STDERR "Premature end of file.\n";
>     exit;
> }
>
> if ($fileType =~ /bsml/) {
>     $returnType = "genbank";
>     $outFile = "$libraryOut.$extensions[2]";
> } else {
>     $returnType = "bsml";
>     $outFile = "$libraryOut.$extensions[0]";
> }
>
> my $in = Bio::SeqIO->new('-file' => "$library",
> 		      '-format' => "$fileType" );
> print "$outFile\n$returnType\n";
> my $out = Bio::SeqIO->new('-file' => ">$outFile",
> 		      '-format' => "$returnType");
>
> while (my $seq = $in->next_seq()) {
>     $out->write_seq($seq);
> }
>
> exit;
>
> ########################################################################
> #######
> #subroutines
> ########################################################################
> #######
>
> #open_file
> #
> #given a file name, open a file handle
>
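> sub open_file {
>     my ($fileName) = @_;
>     my ($fh);
>
>     # three-argument open: characters in the file name are never
>     # interpreted as an open mode
>     unless(open($fh, '<', $fileName)) {
> 	print STDERR "Cannot open file $fileName: $!\n";
> 	exit 1;
>     }
>     return $fh;
> }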
>
> #determine_datatype
> #
> #given a file handle, return the type of the data file
>
> sub determine_datatype {
>     my ($fh) = @_;
>
>     # Read one line at a time (scalar context). A list-context read
>     # would slurp the whole file and keep only its first line.
>     while (my $line = <$fh>) {
> 	next if $line =~ /^\s*$/;    # skip blank lines
> 	return "genbank"   if $line =~ /^LOCUS/;
> 	return "fasta"     if $line =~ /^>/;
> 	return "swissprot" if $line =~ /^ID/;
> 	return "bsml"      if $line =~ /</;
>     }
>     return "";    # no recognisable line found
> }
>
> #end_check
> #
> #check an input file and file type to see if last line ends correctly
> #return value
> sub end_check {
>     my ($type, $fh) = @_;
>
>     # Rewind first: determine_datatype has already read from this handle.
>     seek($fh, 0, 0);
>     my $lastLine = "";
>     while (my $line = <$fh>) {
> 	chomp $line;
> 	$lastLine = $line if $line =~ /\S/;    # remember last non-blank line
>     }
>
>     # return 0 if the file ends correctly for its type, 1 otherwise
>     if ($type eq "bsml") {
> 	return ($lastLine =~ m/^<\/Bsml>/) ? 0 : 1;
>     } elsif ($type eq "genbank" || $type eq "swissprot") {
> 	return ($lastLine =~ m/^\/\//) ? 0 : 1;
>     } else {
> 	# fasta: a header line with no sequence after it is incomplete
> 	return ($lastLine =~ m/^>/) ? 1 : 0;
>     }
> }
>
> ########################################################################
> #######################
> <?xml version="1.0"?>
>
> <!DOCTYPE Bsml SYSTEM "http://www.labbook.com/dtd/bsml2_2.dtd">
>
> <Bsml>
> <Definitions>
> <Sequences>
> <Sequence length="1373" title="AF374473" id="SEQ14150744"
> ic-acckey="AF374473" representation="raw">
> <Attribute content="This file generated to BSML 2.2 standards - joins
> will be collapsed to a single feature enclosing all members of the join"
> name="comment"/>
> <Attribute content="Xenopus laevis LIM-only protein LMO-2 mRNA, complete
> cds." name="description"/>
> <Attribute content="1" name="version"/>
> <Attribute content="VRT" name="division"/>
> <Attribute content="14150744" name="primary_id"/>
> <Attribute content="African clawed frog" name="common_name"/>
> <Attribute content="Xenopus" name="genus"/>
> <Attribute content="laevis" name="species"/>
> <Attribute content="laevis Xenopus Xenopodinae Pipidae Pipoidea
> Mesobatrachia Anura Batrachia Amphibia Euteleostomi Vertebrata Craniata
> Chordata Metazoa Eukaryota" name="classification"/>
> <Feature-tables>
> <Feature-table title="Sequence References">
> <Reference>
> <Attribute content="1373" name="end"/>
> <Attribute content="1" name="start"/>
> <RefAuthors>
> Mead,P.E., Deconinck,A.E., Huber,T.L., Orkin,S.H. and
> Zon,L.I.</RefAuthors>
> <RefTitle>
> Primitive hematopoiesis in the Xenopus embryo: the synergistic role of
> LMO-2, SCL and GATA-binding proteins</RefTitle>
> <RefJournal>
> Development (2001) In press</RefJournal>
> </Reference>
> <Reference>
> <Attribute content="1373" name="end"/>
> <Attribute content="1" name="start"/>
> <RefAuthors>
> Mead,P.E., Deconinck,A.E., Huber,T.L., Orkin,S.H. and
> Zon,L.I.</RefAuthors>
> <RefTitle>
> Direct Submission</RefTitle>
> <RefJournal>
> Submitted (26-APR-2001) Pathology, St. Jude Children's Research
> Hospital, 332 North Lauderdale, Room D4047C, Memphis, TN 38103,
> USA</RefJournal>
> </Reference>
> </Feature-table>
> <Feature-table title="Features">
> <Feature class="source" id="FEAT-io0"
> value-type="EMBL/GenBank/SwissProt">
> <Interval-loc endpos="1373" startpos="1"/>
> <Qualifier value="Xenopus laevis" value-type="organism"/>
> <Qualifier value="taxon:8355" value-type="db_xref"/>
> </Feature>
> <Feature class="cds" id="FEAT-io1" value-type="EMBL/GenBank/SwissProt">
> <Interval-loc endpos="726" startpos="250"/>
> <Qualifier value="LIM-only protein LMO-2" value-type="product"/>
> <Qualifier value="AAK54614.1" value-type="protein_id"/>
> <Qualifier value="1" value-type="codon_start"/>
> <Qualifier value="GI:14150745" value-type="db_xref"/>
> <Qualifier
> value="MSSAIERKSLDPADEPVDEVLQIPPSLLTCGGCQQSIGDRYFLKAIDQYWHEDCLSCDLCGCRLG
> EVGRRLYYKLGRKLCRRDYLRLFGQDGLCASCDNRIRAYEMTMRVKDKVYHLECFKCAACQKHFCVGDRYLL
> INSDIVCEQDIYEWTKLSEMM" value-type="translation"/>
> </Feature>
> </Feature-table>
> </Feature-tables>
> <Seq-data>
> GAATTCGGCACGAGCTGCAGCAGGAGAGCAGCTCTCATACACACATACAACTGGCTGGGGAATTACACAGGA
> ATGGAGAAGAGACAGCAGGGGGCTCTCGCCTGTTCCTCAGGGATAACAGCCTGTGAGATTTGGGGCTACAGG
> GAGCTGAGCTTGTGACTGGCATATGCAAAGGAGAGGTCCAGCGGAGCTAATCCATTTGGCAGAAGAATCAGG
> GGACAATAGGAGGAGGCGCACCAGATTCTGCAAATGTCATCAGCTATAGAGAGAAAGAGCCTGGACCCAGCA
> GATGAGCCGGTGGATGAAGTTCTGCAGATCCCCCCCTCACTGCTGACATGTGGGGGCTGCCAGCAAAGCATC
> GGGGATCGCTATTTCTTGAAGGCCATCGATCAGTACTGGCATGAAGACTGCCTGAGCTGTGACCTGTGTGGG
> TGCCGGCTGGGGGAAGTCGGAAGGAGACTTTACTATAAACTGGGGCGCAAATTGTGCCGAAGGGACTACCTC
> AGGCTGTTCGGCCAGGACGGACTGTGCGCCTCTTGCGATAATCGTATCCGAGCCTACGAGATGACCATGAGG
> GTGAAGGACAAAGTGTACCACCTGGAGTGCTTCAAGTGTGCCGCCTGCCAGAAGCACTTCTGCGTGGGTGAC
> CGCTACCTGCTCATTAACTCGGACATTGTGTGTGAGCAGGACATCTACGAATGGACCAAGCTCAGTGAGATG
> ATGTAGCACTGGCAGAATTTCGTGGAGTGCCCAACGATGGACATCAGCCACACGTTCACTAAACGCACTGTA
> ACTGACAACACAGCTACGACTGACAAACCAACATCTTTCCATTGGAACAACAGGTTTAGGTCTAGAGACTAT
> TAGCACTATATGGTCTTATGCACTGAAATAACACTCACACGTACAATGTAAATCCAGAAGGCCATTCTGGTG
> ATCCTACTGTCATTATTGGTGCTTTTGTCTTTATTGAAGCAACACATGGATTATTGCGACATCATAGTGGTG
> AAGCTTTAAAAGAGTAAGATCCCATAGAACCCCTTAATTGGAGAGAGTGGTTTTGGGGGTTCTGCAGTAGAA
> AGGGACACCTCAATGGTACAACCTGGACACCACCTAATTCTATGGACAACACTGTATAAACTCATTTATATC
> ATGGAACACTCGTCCTGAAAAATCTATGCAAAATGTAAAGTTATCAGGGAAGTGTGGTTTTTCTTACAGTAT
> ATTAAGTAACTGGCAAAAAGGGTCAATCTTGCCCTGATCGACTATTGAACTTGACTTGGCACTGAGTGGTAA
> CAGACGATTAGTTTGGGAAAGGGACAAGGGAATAAAGATGTTTTTTTTTGTGGTTAAAAAAAAAAAAAAAAA
> AAAAA</Seq-data>
> </Sequence>
> </Sequences>
> </Definitions>
> <Display>
> <Styles>
> <Style type="text/css">
> Interval-widget { display : &quot;1&quot;; }
> Feature { display-auto : &quot;1&quot;; }</Style>
> </Styles>
> <Page>
> <Screen height="5.5" width="7.8">
> <!-- Must close <Screen>
>  explicitly -->
> </Screen>
> <View seqref="SEQ14150744" title="AF374473" title1="{NAME}"
> title2="{LENGTH} {UNIT}">
> <View-line-widget hcenter="4.6" linear-length="5.8" shape="horizontal"/>
> <View-axis-widget/>
> </View>
> </Page>
> </Display>
> </Bsml>
>
> <?xml version="1.0"?>
>
> <!DOCTYPE Bsml SYSTEM "http://www.labbook.com/dtd/bsml2_2.dtd">
>
> <Bsml>
> <Definitions>
> <Sequences>
> <Sequence length="1891" title="AY028920" id="SEQ13488612"
> ic-acckey="AY028920" representation="raw">
> <Attribute content="This file generated to BSML 2.2 standards - joins
> will be collapsed to a single feature enclosing all members of the join"
> name="comment"/>
> <Attribute content="Xenopus laevis proline-rich Vg1 mRNA-binding protein
> mRNA, complete cds." name="description"/>
> <Attribute content="1" name="version"/>
> <Attribute content="VRT" name="division"/>
> <Attribute content="13488612" name="primary_id"/>
> <Attribute content="African clawed frog" name="common_name"/>
> <Attribute content="Xenopus" name="genus"/>
> <Attribute content="laevis" name="species"/>
> <Attribute content="laevis Xenopus Xenopodinae Pipidae Pipoidea
> Mesobatrachia Anura Batrachia Amphibia Euteleostomi Vertebrata Craniata
> Chordata Metazoa Eukaryota" name="classification"/>
> <Feature-tables>
> <Feature-table title="Sequence References">
> <Reference dbxref="21231177">
> <Attribute content="1891" name="end"/>
> <Attribute content="11331596" name="pubmed"/>
> <Attribute content="1" name="start"/>
> <RefAuthors>
> Zhao,Wm, Jiang,C., Kroll,T.T. and Huber,P.W.</RefAuthors>
> <RefTitle>
> A proline-rich protein binds to the localization element of Xenopus Vg1
> mRNA and to ligands involved in actin polymerization</RefTitle>
> <RefJournal>
> EMBO J. 20 (9), 2315-2325 (2001)</RefJournal>
> </Reference>
> <Reference>
> <Attribute content="1891" name="end"/>
> <Attribute content="1" name="start"/>
> <RefAuthors>
> Jiang,C., Zhao,W. and Huber,P.W.</RefAuthors>
> <RefTitle>
> Direct Submission</RefTitle>
> <RefJournal>
> Submitted (20-MAR-2001) Chemistry and Biochemistry, University of Notre
> Dame, Notre Dame, IN 46556, USA</RefJournal>
> </Reference>
> </Feature-table>
> <Feature-table title="Features">
> <Feature class="source" id="FEAT-io2"
> value-type="EMBL/GenBank/SwissProt">
> <Interval-loc endpos="1891" startpos="1"/>
> <Qualifier value="Xenopus laevis" value-type="organism"/>
> <Qualifier value="taxon:8355" value-type="db_xref"/>
> </Feature>
> <Feature class="cds" id="FEAT-io3" value-type="EMBL/GenBank/SwissProt">
> <Interval-loc endpos="1199" startpos="117"/>
> <Qualifier value="proline-rich Vg1 mRNA-binding protein"
> value-type="product"/>
> <Qualifier value="AAK26172.1" value-type="protein_id"/>
> <Qualifier value="1" value-type="codon_start"/>
> <Qualifier value="Prrp" value-type="note"/>
> <Qualifier value="GI:13488613" value-type="db_xref"/>
> <Qualifier
> value="MNNQGGDEIGKLFVGGLDWSTTQETLRSYFSQYGEVVDCVIMKDKTTNQSRGFGFVKFKDPNCVG
> TVLASRPHTLDGRNIDPKPCTPRGMQPERSRPREGWQQKEPRTENSRSNKIFVGGIPHNCGETELKEYFNRF
> GVVTEVVMIYDAEKQRPRGFGFITFEDEQSVDQAVNMHFHDIMGKKVEVKRAEPRDSKSQTPGPPGSNQWGS
> RAMQSTANGWTGQPPQTWQGYSPQGMWMPTGQTIGGYGQPAGRGGPPPPPSFAPFLVSTTPGPFPPPQGFPP
> GYATPPPFGYGYGPPPPPPDQFVSSGVPPPPGTPGAAPLAFPPPPGQSAQDLSKPPSGQQDFPFSQFGNACF
> VKLSEWI" value-type="translation"/>
> </Feature>
> </Feature-table>
> </Feature-tables>
> <Seq-data>
> GTACGTGATGACGTTCCGTTCCACCCCCTGTATGGAAACCCCGTAGTCTAGCGCCGTCTTACCGTTGGCTGG
> CTTGGTAGGAGAAGCTGCAGAGACCGGAGAGGGGTGACGGAGTTATGAACAACCAAGGCGGGGACGAGATCG
> GAAAGCTCTTTGTTGGTGGCCTTGACTGGAGCACAACGCAGGAAACCCTGCGCAGCTATTTTTCTCAGTATG
> GAGAAGTCGTAGACTGTGTAATAATGAAAGATAAAACAACAAATCAGTCAAGAGGCTTTGGCTTTGTCAAAT
> TTAAGGATCCCAATTGTGTAGGAACAGTCTTAGCCAGCAGACCACACACACTGGATGGCCGGAATATTGATC
> CAAAGCCATGTACCCCTCGAGGAATGCAGCCTGAAAGAAGCCGGCCACGAGAAGGCTGGCAGCAAAAAGAAC
> CCAGAACTGAAAACAGTAGGTCCAACAAAATTTTTGTGGGAGGAATTCCACACAACTGTGGAGAAACTGAAC
> TGAAGGAATATTTTAATAGATTTGGAGTGGTAACTGAGGTGGTTATGATATATGATGCCGAAAAACAGAGGC
> CAAGAGGTTTTGGATTTATAACTTTCGAGGACGAACAATCAGTGGACCAGGCTGTCAACATGCATTTTCACG
> ACATCATGGGCAAAAAAGTTGAAGTCAAACGGGCAGAACCACGTGATAGCAAAAGCCAAACTCCAGGACCTC
> CTGGATCAAACCAATGGGGAAGCCGAGCAATGCAAAGCACAGCTAATGGATGGACAGGGCAACCTCCTCAAA
> CATGGCAGGGCTACAGTCCACAAGGTATGTGGATGCCAACAGGACAGACGATTGGCGGGTATGGGCAACCTG
> CAGGCCGGGGGGGTCCTCCTCCACCACCTTCCTTTGCGCCATTCCTAGTGTCAACAACCCCTGGACCTTTCC
> CACCACCGCAGGGCTTTCCTCCTGGATATGCCACACCGCCTCCCTTTGGCTATGGCTATGGGCCACCCCCTC
> CCCCTCCTGATCAGTTCGTCTCTTCAGGAGTTCCACCACCTCCTGGTACACCAGGAGCAGCACCGTTAGCGT
> TCCCTCCTCCCCCAGGGCAGTCAGCTCAGGATCTGAGCAAACCCCCCAGTGGTCAGCAGGATTTTCCTTTTA
> GCCAGTTTGGAAATGCCTGCTTTGTGAAATTGTCCGAGTGGATTTGACCTCTCGAAGTCCATCAAAGGTTAT
> GGACAAGACATGAGTGGGTTTGGGCAAAGTTTCCCGGACCTCAACCAGCAGCCCCCTTACACGACAGGACCC
> TCCACACCAGCATCGGGAGGCCCAGCCGCCGTGGGAAGTGGTTTGGGCAGAGGACAGAATCACAATGTACAA
> GGATTCCATCCCTACAGGCGTTAATATGAGGCAGGCAGAATTGTGCAAAGTGGCGATGAGCTGATGCTTCCA
> CCTCTGAACTTGTGACAATCACGTGGTGAAGACACATCCTGCTTATAGACTTATAGTTCTATGTTTGAAGGA
> GAAGCTGTGGGTATTTGAACTACAGTTTTCAGATCTTTCTCTGACCCATCAGCACAAATAAAGCCAATGAGT
> CACTGGTTCCAAACAGGGTTTGAAAACATCTGCAGCTTTAATGGAACTCTTCAGGTTTAATTTGGGGTTTGT
> TTTTGTTCTTTTTATTTAGTTTTTTGTTTTGGGAGGGATATTTCTGAGCCTTTGTTTTACCATATAGTAAAC
> TTTTATGTTTAAAGATGAAAATATATACATTTACAGATTGTGAATTTTTAAAAAATGAATTTTCTACTATGT
> ATTCAGGTTTATTTTTTAATTTAATGGCAGGGTTTCCGGTGACACTGGAGTTCAGATTTTAACTCCTTGCTT
> CTAAAAAAAAAAAAAAAAA</Seq-data>
> </Sequence>
> </Sequences>
> </Definitions>
> <Display>
> <Styles>
> <Style type="text/css">
> Interval-widget { display : &quot;1&quot;; }
> Feature { display-auto : &quot;1&quot;; }</Style>
> </Styles>
> <Page>
> <Screen height="5.5" width="7.8">
> <!-- Must close <Screen>
>  explicitly -->
> </Screen>
> <View seqref="SEQ13488612" title="AY028920" title1="{NAME}"
> title2="{LENGTH} {UNIT}">
> <View-line-widget hcenter="4.6" linear-length="5.8" shape="horizontal"/>
> <View-axis-widget/>
> </View>
> </Page>
> </Display>
> </Bsml>
>
>
> Kevin Clancy, PhD
> Senior Bioinformatic Scientist
> InforMax, Inc.,
> 433 Park Point Drive,
> Suite 275,
> Golden, CO 80401
> Direct phone line: (720) 746 3707
> Cell Phone: (240) 417 8604
> Direct email: kclancy at informaxinc.com
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at bioperl.org
> http://bioperl.org/mailman/listinfo/bioperl-l

-- 
========================================================================
Lincoln D. Stein                           Cold Spring Harbor Laboratory
lstein at cshl.org			                  Cold Spring Harbor, NY
========================================================================



