[Biojava-l] GSoC 2012: New File Parsers for BioJava

Tue Mar 27 11:53:59 UTC 2012

Hello,

My name is Mihaiu Nicolae. I'm a 2nd-year Computer Science student at the Politehnica University of Bucharest, and I'm very interested in working on the "New File Parsers for BioJava" project. I chose BioJava because it blends two passions of mine: coding and biology. Back in high school, biology was one of my favourite subjects; I had a very good teacher from whom I learned a lot, and I finished every year with a grade of 10.

About my knowledge and experience

- A year and a half of experience with Java, which has become my first choice for coding. I currently do all my tasks and homework in Java, and I'm developing a bot for aichallenge [1] in Java as a university project. I'm also working on a small personal project, a memory test game, also written in Java.
- 5 years of C/C++
- web: HTML, PHP, CSS, MySQL - made a module for my school's website 

Some thoughts and questions about the project 

- I took a look at your sources and saw that you already have parsers for many file formats, such as FASTA, FASTQ, PDB, and mmCIF. What are the priorities for the new parsers? Which one is needed most?
- Should we choose only one parser to work on for this project, or is the expectation to implement more than one?

Questions about the "Coding exercise"

- About the "ambiguous characters": let's say we have ambiguous DNA. For these two sequences, "ACTATATCGG" and "ATGKMCGW", should one FASTA output file contain the sequence "ACTATATCGG" and another one contain "ATGKMCGW"?
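
To make the question concrete, the split described above could be sketched roughly as follows. This is only an illustration of one possible reading of the exercise: it assumes that "ambiguous" means any symbol other than A, C, G or T (for example the IUPAC codes K, M and W), and the class and method names are hypothetical, not BioJava API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of BioJava.
final class AmbiguitySplit {

    // Assumption: only A, C, G and T count as unambiguous DNA symbols.
    static boolean isAmbiguous(String sequence) {
        for (int i = 0; i < sequence.length(); i++) {
            char c = Character.toUpperCase(sequence.charAt(i));
            if (c != 'A' && c != 'C' && c != 'G' && c != 'T') {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Each list stands in for one FASTA output file.
        List<String> unambiguous = new ArrayList<>();
        List<String> ambiguous = new ArrayList<>();
        for (String s : new String[] {"ACTATATCGG", "ATGKMCGW"}) {
            (isAmbiguous(s) ? ambiguous : unambiguous).add(s);
        }
        System.out.println("unambiguous: " + unambiguous);
        System.out.println("ambiguous: " + ambiguous);
    }
}
```

Under this reading, "ACTATATCGG" would go to the first output and "ATGKMCGW" to the second.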

- What do you mean by large in “be capable of reading large files”? Later, under “Submission”, it says “the test data file named data.fasta up to 10Kb in size”. Should I understand that 10Kb is the limit for a “large file”?
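
For context on why the size limit matters, here is a minimal sketch of reading FASTA input one line at a time, so that memory use does not depend on the file size. It assumes the usual convention that a record header starts with ">". The class and method names are hypothetical, and the sketch only counts records rather than parsing them.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical sketch for illustration only; not part of BioJava.
final class FastaStream {

    // Counts records by streaming lines, never holding the whole input in memory.
    static int countRecords(Reader source) {
        int records = 0;
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Assumption: every record header line starts with '>'.
                if (line.startsWith(">")) {
                    records++;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }

    public static void main(String[] args) {
        String sample = ">seq1\nACTATATCGG\n>seq2\nATGKMCGW\n";
        System.out.println(countRecords(new StringReader(sample)));
    }
}
```

With this approach a 10Kb file and a much larger one would be handled the same way, which is why the intended meaning of "large" is worth confirming.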

Best regards,
Nicolae

[1] http://aichallenge.org