[Bioperl-l] error with files containing multiple BSML formatted entries

Kevin Clancy kclancy at informaxinc.com
Fri Mar 14 12:25:04 EST 2003


Hi Folks
I have written a program very similar to the universal reformatter
program in the tutorial FAQ on the site. When I run a file of BSML
formatted sequences through the converter, I get the following error:
Junk after document element at line 81, column 0, byte 4550 at
/usr/lib/perl5/site_perl/5.6.0/i586-linux/XML/Parser.pm line 185

Line 81 in the file of bsml formatted sequences corresponds to the
beginning of a new bsml document and looks like this:
<?xml version="1.0"?>

I've included the program I'm working with for reference. Simply take a
file of GenBank sequences and run them against the program - it will
produce a file of bsml formatted sequences. Then take the file of bsml
formatted sequences and run them against the program. I've also included
some of the bsml formatted file to run against the program. 

My question is: is this a bug, or should I split the bsml file into
separate entries before submitting it to the sequence reformatter
segment of the program?
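
For reference, the pre-splitting option could look something like the
sketch below. An XML document may contain only one root element, so a
parser that reads several concatenated documents as one will object to
whatever follows the first closing </Bsml> tag. The sketch assumes that
every entry in the file starts with its own <?xml ...?> declaration, as
in the sample further down. The name split_xml_documents is a
hypothetical helper, not part of BioPerl.

```perl
use strict;
use warnings;

# Split text holding several concatenated XML documents into one string
# per document. Assumption: each document opens with its own
# <?xml ...?> declaration. (Hypothetical helper, not a BioPerl function.)
sub split_xml_documents {
    my ($text) = @_;

    # The zero-width lookahead splits *before* each declaration, so the
    # declaration stays attached to the document it opens.
    my @docs = split /(?=<\?xml\b)/, $text;

    # Discard fragments that contain only whitespace.
    return grep { /\S/ } @docs;
}
```

Each returned string could then be written to its own file and handed
to Bio::SeqIO with '-format' => 'bsml', one file at a time.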

Thanks in advance for the help.

kevin

#!/usr/bin/perl -w
########################################################################
# DataConverter.pl
#
#Program to convert genbank, swissprot or fasta files to bsml or bsml
#files to genbank
#run program, input path and file name (/home/kclancy/frog.nt)
#program checks for input data type and whether file is prematurely
#terminated
#if all is well, it converts the data as described above.
#
########################################################################
use strict;
use Bio::SeqIO;

my $fh;
my $offset;
my ($fileType, $fileEnd);
my ($returnType, $outFile);
my $library = <STDIN>;
chomp $library;
my $libraryOut = "/home/kclancy/seqout";
my (@extensions) = qw(bsml fsa gb swp);

$fh = open_file($library);

$fileType = determine_datatype($fh);

unless ($fileType =~ /^(?:genbank|bsml|swissprot|fasta)$/) {
    print STDERR "Unrecognised input file type.\n";
    exit;
}

$fileEnd = end_check($fileType, $fh);
if ($fileEnd == 1) {
    print STDERR "Premature end of file.\n";
    exit;
}

if ($fileType =~ /bsml/) {
    $returnType = "genbank";
    $outFile = "$libraryOut.$extensions[2]";
} else {
    $returnType = "bsml";
    $outFile = "$libraryOut.$extensions[0]";
}

my $in = Bio::SeqIO->new('-file' => "$library",
		      '-format' => "$fileType" );
print "$outFile\n$returnType\n";
my $out = Bio::SeqIO->new('-file' => ">$outFile",
		      '-format' => "$returnType");

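# Scalar assignment, so the loop stops as soon as next_seq returns a
# false value at the end of the input
while (my $seq = $in->next_seq()) {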
    $out->write_seq($seq);
}

exit;

########################################################################
#######
#subroutines
########################################################################
#######

#open_file
#
#given a file name, open a file handle

sub open_file {
    my ($fileName) = @_;
    my $fh;

    # Three-argument open, so the name is never interpreted as a mode
    unless (open($fh, '<', $fileName)) {
        print STDERR "Cannot open file $fileName: $!\n";
        exit;
    }
    return $fh;
}

#determine_datatype
#
#given a file handle, return the type of the data file

sub determine_datatype {
    my ($fh) = @_;

    # Read line by line; "my ($file) = <$fh>" would keep only the first
    # line while still consuming the whole handle.
    while (my $line = <$fh>) {
        if ($line =~ /^LOCUS/) {
            return "genbank";
        } elsif ($line =~ /^>/) {
            return "fasta";
        } elsif ($line =~ /^ID/) {
            return "swissprot";
        } elsif ($line =~ /^\s*$/) {
            next;                       # skip blank lines
        } elsif ($line =~ /</) {
            return "bsml";
        }
    }
    return "";                          # no recognised format
}
    
#end_check
#
#check an input file and file type to see if last line ends correctly
#return 0 if the file ends correctly, 1 if it is prematurely terminated
sub end_check {
    my ($type, $fh) = @_;

    # Rewind first: determine_datatype has already read from this handle
    seek($fh, 0, 0);
    my @lines = grep { /\S/ } <$fh>;    # drop blank lines
    return 1 unless @lines;             # nothing to convert
    my $lastLine = $lines[-1];

    if ($type eq "bsml") {
        return ($lastLine =~ m/^<\/Bsml>/) ? 0 : 1;
    } elsif ($type eq "genbank" || $type eq "swissprot") {
        return ($lastLine =~ m/^\/\//) ? 0 : 1;    # records end with //
    } else {
        return ($lastLine =~ m/^>/) ? 1 : 0;       # fasta header with no sequence
    }
}

########################################################################
#######################
<?xml version="1.0"?>

<!DOCTYPE Bsml SYSTEM "http://www.labbook.com/dtd/bsml2_2.dtd">

<Bsml>
<Definitions>
<Sequences>
<Sequence length="1373" title="AF374473" id="SEQ14150744"
ic-acckey="AF374473" representation="raw">
<Attribute content="This file generated to BSML 2.2 standards - joins
will be collapsed to a single feature enclosing all members of the join"
name="comment"/>
<Attribute content="Xenopus laevis LIM-only protein LMO-2 mRNA, complete
cds." name="description"/>
<Attribute content="1" name="version"/>
<Attribute content="VRT" name="division"/>
<Attribute content="14150744" name="primary_id"/>
<Attribute content="African clawed frog" name="common_name"/>
<Attribute content="Xenopus" name="genus"/>
<Attribute content="laevis" name="species"/>
<Attribute content="laevis Xenopus Xenopodinae Pipidae Pipoidea
Mesobatrachia Anura Batrachia Amphibia Euteleostomi Vertebrata Craniata
Chordata Metazoa Eukaryota" name="classification"/>
<Feature-tables>
<Feature-table title="Sequence References">
<Reference>
<Attribute content="1373" name="end"/>
<Attribute content="1" name="start"/>
<RefAuthors>
Mead,P.E., Deconinck,A.E., Huber,T.L., Orkin,S.H. and
Zon,L.I.</RefAuthors>
<RefTitle>
Primitive hematopoiesis in the Xenopus embryo: the synergistic role of
LMO-2, SCL and GATA-binding proteins</RefTitle>
<RefJournal>
Development (2001) In press</RefJournal>
</Reference>
<Reference>
<Attribute content="1373" name="end"/>
<Attribute content="1" name="start"/>
<RefAuthors>
Mead,P.E., Deconinck,A.E., Huber,T.L., Orkin,S.H. and
Zon,L.I.</RefAuthors>
<RefTitle>
Direct Submission</RefTitle>
<RefJournal>
Submitted (26-APR-2001) Pathology, St. Jude Children's Research
Hospital, 332 North Lauderdale, Room D4047C, Memphis, TN 38103,
USA</RefJournal>
</Reference>
</Feature-table>
<Feature-table title="Features">
<Feature class="source" id="FEAT-io0"
value-type="EMBL/GenBank/SwissProt">
<Interval-loc endpos="1373" startpos="1"/>
<Qualifier value="Xenopus laevis" value-type="organism"/>
<Qualifier value="taxon:8355" value-type="db_xref"/>
</Feature>
<Feature class="cds" id="FEAT-io1" value-type="EMBL/GenBank/SwissProt">
<Interval-loc endpos="726" startpos="250"/>
<Qualifier value="LIM-only protein LMO-2" value-type="product"/>
<Qualifier value="AAK54614.1" value-type="protein_id"/>
<Qualifier value="1" value-type="codon_start"/>
<Qualifier value="GI:14150745" value-type="db_xref"/>
<Qualifier
value="MSSAIERKSLDPADEPVDEVLQIPPSLLTCGGCQQSIGDRYFLKAIDQYWHEDCLSCDLCGCRLG
EVGRRLYYKLGRKLCRRDYLRLFGQDGLCASCDNRIRAYEMTMRVKDKVYHLECFKCAACQKHFCVGDRYLL
INSDIVCEQDIYEWTKLSEMM" value-type="translation"/>
</Feature>
</Feature-table>
</Feature-tables>
<Seq-data>
GAATTCGGCACGAGCTGCAGCAGGAGAGCAGCTCTCATACACACATACAACTGGCTGGGGAATTACACAGGA
ATGGAGAAGAGACAGCAGGGGGCTCTCGCCTGTTCCTCAGGGATAACAGCCTGTGAGATTTGGGGCTACAGG
GAGCTGAGCTTGTGACTGGCATATGCAAAGGAGAGGTCCAGCGGAGCTAATCCATTTGGCAGAAGAATCAGG
GGACAATAGGAGGAGGCGCACCAGATTCTGCAAATGTCATCAGCTATAGAGAGAAAGAGCCTGGACCCAGCA
GATGAGCCGGTGGATGAAGTTCTGCAGATCCCCCCCTCACTGCTGACATGTGGGGGCTGCCAGCAAAGCATC
GGGGATCGCTATTTCTTGAAGGCCATCGATCAGTACTGGCATGAAGACTGCCTGAGCTGTGACCTGTGTGGG
TGCCGGCTGGGGGAAGTCGGAAGGAGACTTTACTATAAACTGGGGCGCAAATTGTGCCGAAGGGACTACCTC
AGGCTGTTCGGCCAGGACGGACTGTGCGCCTCTTGCGATAATCGTATCCGAGCCTACGAGATGACCATGAGG
GTGAAGGACAAAGTGTACCACCTGGAGTGCTTCAAGTGTGCCGCCTGCCAGAAGCACTTCTGCGTGGGTGAC
CGCTACCTGCTCATTAACTCGGACATTGTGTGTGAGCAGGACATCTACGAATGGACCAAGCTCAGTGAGATG
ATGTAGCACTGGCAGAATTTCGTGGAGTGCCCAACGATGGACATCAGCCACACGTTCACTAAACGCACTGTA
ACTGACAACACAGCTACGACTGACAAACCAACATCTTTCCATTGGAACAACAGGTTTAGGTCTAGAGACTAT
TAGCACTATATGGTCTTATGCACTGAAATAACACTCACACGTACAATGTAAATCCAGAAGGCCATTCTGGTG
ATCCTACTGTCATTATTGGTGCTTTTGTCTTTATTGAAGCAACACATGGATTATTGCGACATCATAGTGGTG
AAGCTTTAAAAGAGTAAGATCCCATAGAACCCCTTAATTGGAGAGAGTGGTTTTGGGGGTTCTGCAGTAGAA
AGGGACACCTCAATGGTACAACCTGGACACCACCTAATTCTATGGACAACACTGTATAAACTCATTTATATC
ATGGAACACTCGTCCTGAAAAATCTATGCAAAATGTAAAGTTATCAGGGAAGTGTGGTTTTTCTTACAGTAT
ATTAAGTAACTGGCAAAAAGGGTCAATCTTGCCCTGATCGACTATTGAACTTGACTTGGCACTGAGTGGTAA
CAGACGATTAGTTTGGGAAAGGGACAAGGGAATAAAGATGTTTTTTTTTGTGGTTAAAAAAAAAAAAAAAAA
AAAAA</Seq-data>
</Sequence>
</Sequences>
</Definitions>
<Display>
<Styles>
<Style type="text/css">
Interval-widget { display : &quot;1&quot;; }
Feature { display-auto : &quot;1&quot;; }</Style>
</Styles>
<Page>
<Screen height="5.5" width="7.8">
<!-- Must close <Screen>
 explicitly -->
</Screen>
<View seqref="SEQ14150744" title="AF374473" title1="{NAME}"
title2="{LENGTH} {UNIT}">
<View-line-widget hcenter="4.6" linear-length="5.8" shape="horizontal"/>
<View-axis-widget/>
</View>
</Page>
</Display>
</Bsml>

<?xml version="1.0"?>

<!DOCTYPE Bsml SYSTEM "http://www.labbook.com/dtd/bsml2_2.dtd">

<Bsml>
<Definitions>
<Sequences>
<Sequence length="1891" title="AY028920" id="SEQ13488612"
ic-acckey="AY028920" representation="raw">
<Attribute content="This file generated to BSML 2.2 standards - joins
will be collapsed to a single feature enclosing all members of the join"
name="comment"/>
<Attribute content="Xenopus laevis proline-rich Vg1 mRNA-binding protein
mRNA, complete cds." name="description"/>
<Attribute content="1" name="version"/>
<Attribute content="VRT" name="division"/>
<Attribute content="13488612" name="primary_id"/>
<Attribute content="African clawed frog" name="common_name"/>
<Attribute content="Xenopus" name="genus"/>
<Attribute content="laevis" name="species"/>
<Attribute content="laevis Xenopus Xenopodinae Pipidae Pipoidea
Mesobatrachia Anura Batrachia Amphibia Euteleostomi Vertebrata Craniata
Chordata Metazoa Eukaryota" name="classification"/>
<Feature-tables>
<Feature-table title="Sequence References">
<Reference dbxref="21231177">
<Attribute content="1891" name="end"/>
<Attribute content="11331596" name="pubmed"/>
<Attribute content="1" name="start"/>
<RefAuthors>
Zhao,Wm, Jiang,C., Kroll,T.T. and Huber,P.W.</RefAuthors>
<RefTitle>
A proline-rich protein binds to the localization element of Xenopus Vg1
mRNA and to ligands involved in actin polymerization</RefTitle>
<RefJournal>
EMBO J. 20 (9), 2315-2325 (2001)</RefJournal>
</Reference>
<Reference>
<Attribute content="1891" name="end"/>
<Attribute content="1" name="start"/>
<RefAuthors>
Jiang,C., Zhao,W. and Huber,P.W.</RefAuthors>
<RefTitle>
Direct Submission</RefTitle>
<RefJournal>
Submitted (20-MAR-2001) Chemistry and Biochemistry, University of Notre
Dame, Notre Dame, IN 46556, USA</RefJournal>
</Reference>
</Feature-table>
<Feature-table title="Features">
<Feature class="source" id="FEAT-io2"
value-type="EMBL/GenBank/SwissProt">
<Interval-loc endpos="1891" startpos="1"/>
<Qualifier value="Xenopus laevis" value-type="organism"/>
<Qualifier value="taxon:8355" value-type="db_xref"/>
</Feature>
<Feature class="cds" id="FEAT-io3" value-type="EMBL/GenBank/SwissProt">
<Interval-loc endpos="1199" startpos="117"/>
<Qualifier value="proline-rich Vg1 mRNA-binding protein"
value-type="product"/>
<Qualifier value="AAK26172.1" value-type="protein_id"/>
<Qualifier value="1" value-type="codon_start"/>
<Qualifier value="Prrp" value-type="note"/>
<Qualifier value="GI:13488613" value-type="db_xref"/>
<Qualifier
value="MNNQGGDEIGKLFVGGLDWSTTQETLRSYFSQYGEVVDCVIMKDKTTNQSRGFGFVKFKDPNCVG
TVLASRPHTLDGRNIDPKPCTPRGMQPERSRPREGWQQKEPRTENSRSNKIFVGGIPHNCGETELKEYFNRF
GVVTEVVMIYDAEKQRPRGFGFITFEDEQSVDQAVNMHFHDIMGKKVEVKRAEPRDSKSQTPGPPGSNQWGS
RAMQSTANGWTGQPPQTWQGYSPQGMWMPTGQTIGGYGQPAGRGGPPPPPSFAPFLVSTTPGPFPPPQGFPP
GYATPPPFGYGYGPPPPPPDQFVSSGVPPPPGTPGAAPLAFPPPPGQSAQDLSKPPSGQQDFPFSQFGNACF
VKLSEWI" value-type="translation"/>
</Feature>
</Feature-table>
</Feature-tables>
<Seq-data>
GTACGTGATGACGTTCCGTTCCACCCCCTGTATGGAAACCCCGTAGTCTAGCGCCGTCTTACCGTTGGCTGG
CTTGGTAGGAGAAGCTGCAGAGACCGGAGAGGGGTGACGGAGTTATGAACAACCAAGGCGGGGACGAGATCG
GAAAGCTCTTTGTTGGTGGCCTTGACTGGAGCACAACGCAGGAAACCCTGCGCAGCTATTTTTCTCAGTATG
GAGAAGTCGTAGACTGTGTAATAATGAAAGATAAAACAACAAATCAGTCAAGAGGCTTTGGCTTTGTCAAAT
TTAAGGATCCCAATTGTGTAGGAACAGTCTTAGCCAGCAGACCACACACACTGGATGGCCGGAATATTGATC
CAAAGCCATGTACCCCTCGAGGAATGCAGCCTGAAAGAAGCCGGCCACGAGAAGGCTGGCAGCAAAAAGAAC
CCAGAACTGAAAACAGTAGGTCCAACAAAATTTTTGTGGGAGGAATTCCACACAACTGTGGAGAAACTGAAC
TGAAGGAATATTTTAATAGATTTGGAGTGGTAACTGAGGTGGTTATGATATATGATGCCGAAAAACAGAGGC
CAAGAGGTTTTGGATTTATAACTTTCGAGGACGAACAATCAGTGGACCAGGCTGTCAACATGCATTTTCACG
ACATCATGGGCAAAAAAGTTGAAGTCAAACGGGCAGAACCACGTGATAGCAAAAGCCAAACTCCAGGACCTC
CTGGATCAAACCAATGGGGAAGCCGAGCAATGCAAAGCACAGCTAATGGATGGACAGGGCAACCTCCTCAAA
CATGGCAGGGCTACAGTCCACAAGGTATGTGGATGCCAACAGGACAGACGATTGGCGGGTATGGGCAACCTG
CAGGCCGGGGGGGTCCTCCTCCACCACCTTCCTTTGCGCCATTCCTAGTGTCAACAACCCCTGGACCTTTCC
CACCACCGCAGGGCTTTCCTCCTGGATATGCCACACCGCCTCCCTTTGGCTATGGCTATGGGCCACCCCCTC
CCCCTCCTGATCAGTTCGTCTCTTCAGGAGTTCCACCACCTCCTGGTACACCAGGAGCAGCACCGTTAGCGT
TCCCTCCTCCCCCAGGGCAGTCAGCTCAGGATCTGAGCAAACCCCCCAGTGGTCAGCAGGATTTTCCTTTTA
GCCAGTTTGGAAATGCCTGCTTTGTGAAATTGTCCGAGTGGATTTGACCTCTCGAAGTCCATCAAAGGTTAT
GGACAAGACATGAGTGGGTTTGGGCAAAGTTTCCCGGACCTCAACCAGCAGCCCCCTTACACGACAGGACCC
TCCACACCAGCATCGGGAGGCCCAGCCGCCGTGGGAAGTGGTTTGGGCAGAGGACAGAATCACAATGTACAA
GGATTCCATCCCTACAGGCGTTAATATGAGGCAGGCAGAATTGTGCAAAGTGGCGATGAGCTGATGCTTCCA
CCTCTGAACTTGTGACAATCACGTGGTGAAGACACATCCTGCTTATAGACTTATAGTTCTATGTTTGAAGGA
GAAGCTGTGGGTATTTGAACTACAGTTTTCAGATCTTTCTCTGACCCATCAGCACAAATAAAGCCAATGAGT
CACTGGTTCCAAACAGGGTTTGAAAACATCTGCAGCTTTAATGGAACTCTTCAGGTTTAATTTGGGGTTTGT
TTTTGTTCTTTTTATTTAGTTTTTTGTTTTGGGAGGGATATTTCTGAGCCTTTGTTTTACCATATAGTAAAC
TTTTATGTTTAAAGATGAAAATATATACATTTACAGATTGTGAATTTTTAAAAAATGAATTTTCTACTATGT
ATTCAGGTTTATTTTTTAATTTAATGGCAGGGTTTCCGGTGACACTGGAGTTCAGATTTTAACTCCTTGCTT
CTAAAAAAAAAAAAAAAAA</Seq-data>
</Sequence>
</Sequences>
</Definitions>
<Display>
<Styles>
<Style type="text/css">
Interval-widget { display : &quot;1&quot;; }
Feature { display-auto : &quot;1&quot;; }</Style>
</Styles>
<Page>
<Screen height="5.5" width="7.8">
<!-- Must close <Screen>
 explicitly -->
</Screen>
<View seqref="SEQ13488612" title="AY028920" title1="{NAME}"
title2="{LENGTH} {UNIT}">
<View-line-widget hcenter="4.6" linear-length="5.8" shape="horizontal"/>
<View-axis-widget/>
</View>
</Page>
</Display>
</Bsml>


Kevin Clancy, PhD
Senior Bioinformatic Scientist
InforMax, Inc.,
433 Park Point Drive,
Suite 275,
Golden, CO 80401
Direct phone line: (720) 746 3707
Cell Phone: (240) 417 8604
Direct email: kclancy at informaxinc.com 


