[Biojava-l] [Biojava-dev] Request for help!

Richard Holland holland at ebi.ac.uk
Thu Jul 5 07:40:14 UTC 2007


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

"\n" is used 262 times in 76 different locations:

src/org/biojava/bio/alignment/NeedlemanWunsch.java
src/org/biojava/bio/alignment/SequenceAlignment.java
src/org/biojava/bio/alignment/SmithWaterman.java
src/org/biojava/bio/alignment/SubstitutionMatrix.java
src/org/biojava/bio/chromatogram/graphic/ChromatogramGraphic.java
src/org/biojava/bio/dist/AbstractDistribution.java
src/org/biojava/bio/dp/onehead/SingleDP.java
src/org/biojava/bio/dp/twohead/DPInterpreter.java
src/org/biojava/bio/dp/XmlMarkovModel.java
src/org/biojava/bio/gui/sequence/ImageMap.java
src/org/biojava/bio/program/abi/ABIFParser.java
src/org/biojava/bio/program/blast2html/AbstractAlignmentStyler.java
src/org/biojava/bio/program/blast2html/HTMLRenderer.java
src/org/biojava/bio/program/das/dasalignment/Alignment.java
src/org/biojava/bio/program/das/FeatureRequestManager.java
src/org/biojava/bio/program/sax/BlastLikeAlignmentSAXParser.java
src/org/biojava/bio/program/sax/ClustalWAlignmentSAXParser.java
src/org/biojava/bio/program/sax/FastaSequenceSAXParser.java
src/org/biojava/bio/program/sax/NeedleAlignmentSAXParser.java
src/org/biojava/bio/search/KnuthMorrisPrattSearch.java
src/org/biojava/bio/seq/db/BioIndex.java
src/org/biojava/bio/seq/db/GenbankSequenceDB.java
src/org/biojava/bio/seq/db/TabIndexStore.java
src/org/biojava/bio/seq/io/agave/AGAVEBioSeqHandler.java
src/org/biojava/bio/seq/io/agave/AGAVEContigHandler.java
src/org/biojava/bio/seq/io/agave/AGAVEDbId.java
src/org/biojava/bio/seq/io/agave/AGAVEKeywordPropHandler.java
src/org/biojava/bio/seq/io/agave/AGAVEMapLocation.java
src/org/biojava/bio/seq/io/agave/AGAVEMapPosition.java
src/org/biojava/bio/seq/io/agave/AGAVEMatchRegion.java
src/org/biojava/bio/seq/io/agave/AGAVEProperty.java
src/org/biojava/bio/seq/io/agave/AGAVEQueryRegion.java
src/org/biojava/bio/seq/io/agave/AGAVERelatedAnnot.java
src/org/biojava/bio/seq/io/agave/AGAVESeqPropHandler.java
src/org/biojava/bio/seq/io/agave/AgaveWriter.java
src/org/biojava/bio/seq/io/agave/AGAVEXref.java
src/org/biojava/bio/seq/io/agave/AGAVEXrefs.java
src/org/biojava/bio/seq/io/agave/Embl2AgaveAnnotFilter.java
src/org/biojava/bio/seq/io/FastaFormat.java
src/org/biojava/bio/seq/io/GenbankFileFormer.java
src/org/biojava/bio/seq/io/ParseException.java
src/org/biojava/bio/structure/align/pairwise/AlternativeAlignment.java
src/org/biojava/bio/structure/ChainImpl.java
src/org/biojava/bio/structure/io/FileConvert.java
src/org/biojava/bio/structure/StructureImpl.java
src/org/biojava/bio/symbol/AbstractSimpleBasisSymbol.java
src/org/biojava/bio/symbol/AlphabetManager.java
src/org/biojava/bio/symbol/DoubleAlphabet.java
src/org/biojava/bio/symbol/IntegerAlphabet.java
src/org/biojava/bio/symbol/SimpleAlignment.java
src/org/biojava/stats/svm/tools/TrainRegression.java
src/org/biojava/utils/automata/DfaBuilder.java
src/org/biojava/utils/automata/FiniteAutomaton.java
src/org/biojava/utils/automata/PatternMaker.java
src/org/biojava/utils/candy/CandyEntry.java
src/org/biojava/utils/ChangeSupport.java
src/org/biojava/utils/ExecRunner.java
src/org/biojava/utils/io/CountedBufferedReader.java
src/org/biojava/utils/ParserException.java
src/org/biojava/utils/StaticMemberPlaceHolder.java
src/org/biojavax/bio/db/ncbi/GenbankRichSequenceDB.java
src/org/biojavax/bio/db/ncbi/GenpeptRichSequenceDB.java
src/org/biojavax/bio/phylo/io/nexus/CharactersBlockParser.java
src/org/biojavax/bio/phylo/io/nexus/DistancesBlockParser.java
src/org/biojavax/bio/phylo/io/nexus/NexusFileFormat.java
src/org/biojavax/bio/phylo/MultipleHitCorrection.java
src/org/biojavax/bio/seq/io/DebuggingRichSeqIOListener.java
src/org/biojavax/bio/seq/io/EMBLFormat.java
src/org/biojavax/bio/seq/io/FastaFormat.java
src/org/biojavax/bio/seq/io/GenbankFormat.java
src/org/biojavax/bio/seq/io/UniProtCommentParser.java
src/org/biojavax/bio/seq/io/UniProtFormat.java
src/org/biojavax/bio/taxa/SimpleNCBITaxonName.java
src/org/biojavax/utils/StringTools.java
src/org/biojavax/utils/XMLTools.java

Not all of these are 'bad' newlines - but still, it's a lot to search
through. I've put it on my list of to-do things for when I'm bored.

cheers,
Richard



Mark Schreiber wrote:
> Slightly related to this ...
> 
> It might be worth making a quick check of the biojava code base to see
> how often a "\n" appears in the source code.
> 
> - Mark
> 
> On 7/4/07, Richard Holland <holland at ebi.ac.uk> wrote:
> The problem was that I was using the newline in a tokenizer, which
> needed to recognize and return the newline symbols themselves (the
> Nexus format is new-line sensitive). Hence I had to deal with files that
> may not use the system line separator.
> 
> cheers,
> Richard
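
A minimal sketch of the kind of tokenizer described above: one that returns
the line-end symbols as tokens of their own, whichever convention the file
uses. The class and method names are hypothetical and this is not the actual
BioJava Nexus tokenizer; it only assumes java.util.regex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not the BioJava Nexus tokenizer.
class EolTokenizer {
    // Alternation order matters: "\r\n" must be tried before a lone "\r",
    // otherwise a Windows line end would come back as two tokens.
    private static final Pattern TOKEN =
            Pattern.compile("\\r\\n|\\r|\\n|[^\\r\\n]+");

    // Splits text into runs of ordinary characters and line-end tokens,
    // keeping each line end exactly as it appeared in the input.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }
}
```

Because the line ends are matched by the pattern rather than compared with
System.getProperty("line.separator"), the result does not depend on the
platform the parser runs on.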
> 
> Andy Yates wrote:
>>>> BufferedWriter.newLine() will always use the value of
>>>> System.getProperty("line.separator"); however, BufferedReader knows that
>>>> an end of line can be \r\n, \r or \n, so in Java land it is perfectly legal
>>>> to have any common line terminator & still write files in an OS-specific
>>>> manner.
>>>>
>>>> I sent a regex to Rich which he improved on but the net result is the
>>>> extraction of the EOL regardless of which one it is.
>>>>
>>>> I'm not 100% sure where the problem lies. So long as the parsers use
>>>> BufferedReader for their text file reading (which they all seem to do)
>>>> this shouldn't have been a problem. In fact this is the line from the
>>>> BufferedReader.readLine() documentation in the JDK:
>>>>
>>>> "Read a line of text. A line is considered to be terminated by any one
>>>> of a line feed ('\n'), a carriage return ('\r'), or a carriage return
>>>> followed immediately by a linefeed."
>>>>
>>>> Very strange, but the regex sounds like it was a pragmatic solution.
>>>>
>>>> Andy
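
The readLine() behaviour quoted above can be checked with a short sketch
like the following. The class name is hypothetical; the input deliberately
mixes all three line-end conventions in one string.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; only standard java.io calls are used.
class ReadLineDemo {
    // Reads every line from the text; readLine() strips the terminator,
    // whether it was "\r\n", "\r" or "\n".
    static List<String> readLines(String text) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }
}
```

Note that readLine() discards the terminator, so a parser that needs the
line-end symbols themselves cannot recover them this way.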
>>>>
>>>> Mark Schreiber wrote:
>>>>> BufferedWriter provides a newLine() method that writes a line
>>>>> separator but I'm not sure if that gives you a different result or
>>>>> not.
>>>>>
>>>>> This may be a JVM bug that needs to be submitted to Sun.
>>>>>
>>>>> As a very ugly work around it is possible to determine the OS from the
>>>>> System object as well.
>>>>>
>>>>> - Mark
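
On the newLine() question, a small sketch, assuming a StringWriter as the
sink so the output can be inspected. The class name is hypothetical.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;

// Hypothetical demo class; only standard java.io calls are used.
class NewLineDemo {
    // Writes two lines separated by newLine() and returns the raw text.
    static String writeTwoLines() {
        StringWriter out = new StringWriter();
        try (BufferedWriter writer = new BufferedWriter(out)) {
            writer.write("first");
            writer.newLine(); // writes the platform line separator
            writer.write("second");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }
}
```

The separator written is the platform one, so newLine() helps with writing
files in the local style but tells you nothing about the line ends of a file
you are reading.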
>>>>>
>>>>> On 7/4/07, Hilmar Lapp <hlapp at gmx.net> wrote:
>>>>>> In Perl it is easy enough to regex-replace s/\r\n/\n/g and s/\r/\n/g
>>>>>> though I'm not sure this wouldn't incur too much overhead in Java.
>>>>>>
>>>>>> You can certainly detect the eol character(s) by line.indexOf('\r');
>>>>>> if found and the following character is '\n' you have DOS/Win-style
>>>>>> line endings, and otherwise if found it is Mac-style.
>>>>>>
>>>>>> However, this all seems like a lot of trouble to go through if all
>>>>>> that one would need to ask of people is to make sure that the file
>>>>>> matches the native eol style of the platform, which is really trivial
>>>>>> to achieve.
>>>>>>
>>>>>> For example, to convert Win-style line endings to Unix:
>>>>>>
>>>>>>         $ perl -pi -e 's/\r//g;' <your-files-here>
>>>>>>
>>>>>> and from Mac to Unix:
>>>>>>
>>>>>>         $ perl -pi -e 's/\r/\n/g;' <your-files-here>
>>>>>>
>>>>>> I have these and other simple conversions defined as aliases in
>>>>>> my .profile, and don't really ever worry about writing lots of code
>>>>>> to accommodate arbitrary line endings :-)
>>>>>>
>>>>>> -hilmar
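
For comparison, the indexOf('\r') detection and the two Perl conversions
might look roughly like this in Java. This is a sketch under the assumption
that the whole text fits in memory; the class and method names are
hypothetical.

```java
// Hypothetical helper for illustration only.
class EolDetector {
    // Returns the first line-end convention found in the text:
    // "\r\n" (DOS/Windows), "\r" (old Mac), "\n" (Unix), or null if none.
    static String detect(String text) {
        int cr = text.indexOf('\r');
        if (cr >= 0) {
            // A '\n' directly after the '\r' means a DOS/Windows line end.
            boolean lfFollows = cr + 1 < text.length()
                    && text.charAt(cr + 1) == '\n';
            return lfFollows ? "\r\n" : "\r";
        }
        return text.indexOf('\n') >= 0 ? "\n" : null;
    }

    // Converts every line end to "\n". The "\r\n" pairs must be handled
    // first so that they do not turn into two line feeds.
    static String normalize(String text) {
        return text.replace("\r\n", "\n").replace('\r', '\n');
    }
}
```

detect() only looks at the first carriage return, so a file with mixed line
ends is reported by whichever convention appears first.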
>>>>>>
>>>>>> On Jul 4, 2007, at 4:06 AM, Richard Holland wrote:
>>>>>>
>>>> Hi guys.
>>>>
>>>> I need help with a programming question!
>>>>
>>>> In Java, you can find out the line-end symbol that the JRE is using by
>>>> calling:
>>>>
>>>>    System.getProperty("line.separator");
>>>>
>>>> On *nix this returns "\n", for instance.
>>>>
>>>> Our file parsers all rely on this to return the symbol to break lines
>>>> at when parsing files. This usually works fine.
>>>>
>>>> BUT... on Windows machines, for certain files, it does not appear to
>>>> work! I suspect that these text files were generated on a *nix machine
>>>> then transferred by copying files across file systems using native copy
>>>> commands, or using binary FTP so that the system retained the *nix
>>>> line-end symbols instead of replacing them with the local line-end
>>>> symbols as it would have done if they were transferred in text mode via
>>>> FTP.
>>>>
>>>> I don't have access to a Windows machine I can test on, but I suspect
>>>> that the fix is quite a simple one and boils down to replacing the
>>>> System.getProperty() call with something more intelligent.
>>>>
>>>> Is there any regex or similar thing we can use to spot _all_ kinds of
>>>> line-end symbols in text files regardless of the platform the file was
>>>> created on or the platform the parser is being run on?
>>>>
>>>> (For information, the only two users who have reported problems like
>>>> this are both using Nexus files - I'm not sure what tool generated them
>>>> though. The Nexus parser uses the same rules as all the other parsers in
>>>> BioJava so I don't think there's anything specifically wrong with it as
>>>> opposed to say the GenBank or FASTA parsers.)
>>>>
>>>> cheers,
>>>> Richard
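
One possible answer to the regex question, as a hedged sketch: a pattern that
matches any of the three common line ends, used here to split text into
lines. The class and method names are hypothetical, and this is not
necessarily the regex that ended up in BioJava.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
class EolSplit {
    // "\r\n" is listed first so a Windows line end is matched as one unit.
    private static final Pattern EOL = Pattern.compile("\\r\\n|\\r|\\n");

    // Splits on any line end. The limit of -1 keeps trailing empty lines,
    // so "x\n" yields two elements: "x" and "".
    static String[] splitLines(String text) {
        return EOL.split(text, -1);
    }
}
```

On Java 8 and later the predefined \R linebreak matcher in
java.util.regex.Pattern covers the same three cases plus some Unicode line
separators, which may or may not be wanted for a given file format.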
>>>>
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>>> --
>>>>>> ===========================================================
>>>>>> : Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
>>>>>> ===========================================================
>>>>> _______________________________________________
>>>>> biojava-dev mailing list
>>>>> biojava-dev at lists.open-bio.org
>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-dev

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFGjKBd4C5LeMEKA/QRAuARAJsGmSZpdOEuNyYDNn0Xn1rBA6KBjgCeLr8s
qkMnk1CwoMnqBT0RCwQjuSI=
=X9+G
-----END PGP SIGNATURE-----


