[GSoC] Fwd: Standardizing Organizations Keywords

Hilmar Lapp hlapp at drycafe.net
Sun Feb 24 16:57:56 UTC 2013


Any other suggestions for adding/fixing DBpedia (= Wikipedia articles) keywords for OBF, or literal string keywords? I'd be happy to add suggestions before I issue a pull request. Alternatively, everyone is of course welcome to fork their own copy and submit a change request.

https://github.com/hlapp/dbpedia-spotlight-gsoc/compare/master...obf-fixes

	-hilmar

Begin forwarded message:

> From: "Pablo N. Mendes" <pablomendes at gmail.com>
> Subject: Re: [GSoC Mentors] Re: Standardizing Organizations Keywords
> Date: February 22, 2013 1:53:11 PM EST
> To: Joel Sherrill <joel.sherrill at gmail.com>
> Cc: Stefan Seefeld <stefan at codesourcery.com>, google-summer-of-code-mentors-list at googlegroups.com, Max Jakob <max.jakob at gmail.com>, Joachim Daiber <daiber.joachim at gmail.com>
> 
> 
> Thanks for the feedback Joel. TL;DR: If the community is interested, we can regenerate better tags and you all can contribute with the necessary fixes/additions.
> 
> As this thread may quickly become unwelcome here, let's move the discussion to our issue tracker?
> https://github.com/pablomendes/dbpedia-spotlight-gsoc/issues
> 
> Now the longer message, answering your questions:
> 
> how could a set of keywords be checked automatically and then the organization admin choose to use them or not.
> 
> Easy! That's one of the perks of being among Open Source folks, right? I pointed out in my earlier e-mail the files that our system created automatically [1][2]. All you need to do is git clone our repo, make the fixes in any text editor, and submit a pull request on GitHub. I think it was Jim from Apertium who has already done this for some orgs. We'd be glad to merge the fixes and update our engine.
> 
> But I don't know how to make your engine happy. 
> 
> I can try to help you to understand what is happening there. But first I need to find out which org is yours. After some searching, I figured it must be RTEMS? If not, just let me know. But these are the tags that we have for RTEMS for 2011:
> 
> <http://dbpedia.org/resource/Thread_%28computer_science%29> .
> <http://dbpedia.org/resource/Embedded_system> .
> <http://dbpedia.org/resource/Shapeshifter_%28band%29> .
> <http://dbpedia.org/resource/C%2B%2B> .
> <http://dbpedia.org/resource/POSIX> .
> <http://dbpedia.org/resource/Ada_%28programming_language%29> .
> <http://dbpedia.org/resource/Operating_system>
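
The tag URIs above are percent-encoded, which makes them hard to scan by eye. A minimal sketch for turning each one back into a readable Wikipedia-style title follows. The function name `dbpedia_uri_to_label` is made up for illustration and is not part of DBpedia Spotlight; it only assumes the URIs follow the `http://dbpedia.org/resource/<Title>` pattern shown in the list.

```python
from urllib.parse import unquote, urlsplit

RESOURCE_PREFIX = "/resource/"


def dbpedia_uri_to_label(line: str) -> str:
    """Convert one line of the tag list into a readable title."""
    # Drop the optional trailing " ." terminator and the angle brackets.
    uri = line.strip().rstrip(".").strip().strip("<>")
    path = urlsplit(uri).path
    if not path.startswith(RESOURCE_PREFIX):
        raise ValueError(f"not a DBpedia resource URI: {uri!r}")
    # Percent-decode the title, then show underscores as spaces.
    return unquote(path[len(RESOURCE_PREFIX):]).replace("_", " ")


if __name__ == "__main__":
    for tag in (
        "<http://dbpedia.org/resource/Thread_%28computer_science%29> .",
        "<http://dbpedia.org/resource/C%2B%2B> .",
        "<http://dbpedia.org/resource/Operating_system>",
    ):
        print(dbpedia_uri_to_label(tag))
```

Printing the labels this way makes an odd tag such as the band entry easy to spot before editing the file.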
> 
> Clearly some wrong ones, but also some good ones. So searching for Operating System should definitely have retrieved something. I think that I only loaded the 2012 tags. For 2012, I have no tags for your project at all. This is probably because an error happened in our crawler when processing your page, and we missed those tags. We seem to be able to process it now, as you can see here:
> http://spotlight.dbpedia.org/rest/annotate?url=http://www.rtems.org/wiki/index.php/Open_Projects
> 
> Even better, our new code (result of Jo Daiber's work from GSoC 2012) seems to perform much better:
> http://spotlight.sztaki.hu:2222/rest/annotate?url=http://www.rtems.org/wiki/index.php/Open_Projects
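
Both links above use the same pattern: an `annotate` endpoint that takes the page to process in a `url` query parameter. A small sketch of building such a request for any ideas page is shown here. The helper names `annotate_url` and `fetch_annotations` are invented for this example, the response format is not described in this thread, and the servers named in the links may no longer be reachable.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the link above; treat its availability as an assumption.
SPOTLIGHT_ANNOTATE = "http://spotlight.dbpedia.org/rest/annotate"


def annotate_url(page_url: str, endpoint: str = SPOTLIGHT_ANNOTATE) -> str:
    """Build the annotate request URL, percent-encoding the page URL."""
    return f"{endpoint}?{urlencode({'url': page_url})}"


def fetch_annotations(page_url: str, timeout: float = 30.0) -> str:
    """Fetch the raw response body; parsing is left to the caller."""
    with urlopen(annotate_url(page_url), timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    print(annotate_url("http://www.rtems.org/wiki/index.php/Open_Projects"))
```

Encoding the page URL is a defensive choice: an ideas page whose own address contains `?` or `&` would otherwise be cut off when the server parses the query string.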
> 
> We can regenerate the tags and have the community helping with some cleanup/additions for their orgs, if anybody is interested.
> 
> Cheers,
> Pablo
> 
> [1] https://raw.github.com/pablomendes/dbpedia-spotlight-gsoc/master/data/gsoc-projects-2011.nt
> [2] https://raw.github.com/pablomendes/dbpedia-spotlight-gsoc/master/data/gsoc-projects-2012.nt
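
For anyone who wants to review the tags for their own org before sending fixes, here is a rough sketch of grouping the lines of an N-Triples (`.nt`) file by subject. It assumes each line has the shape `<subject> <predicate> <object> .` with URI values only. The actual subjects and predicates used in the two files are not shown in this thread, so the `example.org` URIs in `SAMPLE` are placeholders, and `tags_by_project` is a hypothetical helper.

```python
import re
from collections import defaultdict

# Matches a triple whose subject, predicate and object are all URIs.
TRIPLE = re.compile(r"^<([^>]+)>\s+<([^>]+)>\s+<([^>]+)>\s*\.\s*$")


def tags_by_project(lines):
    """Group object URIs by subject URI, keeping file order."""
    grouped = defaultdict(list)
    for line in lines:
        match = TRIPLE.match(line.strip())
        if match is None:
            continue  # skip blank lines, comments and literal-valued triples
        subject, _predicate, obj = match.groups()
        grouped[subject].append(obj)
    return dict(grouped)


# Placeholder data: the subject and predicate URIs are invented.
SAMPLE = [
    "<http://example.org/org/rtems> <http://example.org/tag> <http://dbpedia.org/resource/POSIX> .",
    "<http://example.org/org/rtems> <http://example.org/tag> <http://dbpedia.org/resource/Embedded_system> .",
    "",
    "<http://example.org/org/obf> <http://example.org/tag> <http://dbpedia.org/resource/Bioinformatics> .",
]

if __name__ == "__main__":
    for project, tags in tags_by_project(SAMPLE).items():
        print(project, len(tags))
```

An open file handle can be passed directly, since the function only iterates over lines.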
> 
> 
> On Fri, Feb 22, 2013 at 5:08 PM, Joel Sherrill <joel.sherrill at gmail.com> wrote:
> This is very cool!!!  
> 
> It certainly provides more information about the tags but it still suffers from issues like
> me not putting commas between the tags, which results in gibberish instead of a set
> of Wikipedia pages, various forms of "big data", and strange or useless tags. [1]
> 
> But as a rule, if your tag doesn't match something related at Wikipedia, then 
> it very likely isn't a useful tag. That's a pretty good rule.
> 
> But I don't know how to make your engine happy. Using your search page, I 
> couldn't even find my own project even though we were in both projects. When I
> typed "operating system", I missed most OS GSoC participants. And when
> I got RTOS, I only got a few closed source products.
> 
> It is too late for 2013, but how could a set of keywords be checked automatically
> and then the organization admin choose to use them or not?
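
One way to picture such an automatic check, following the rule that a tag with no matching Wikipedia page is probably not useful: split the keyword field on commas and report which entries match a known title. This is only a sketch under assumptions. `known_titles` stands in for whatever title list is available (for example, titles taken from the DBpedia tags), and both function names are hypothetical.

```python
def split_keywords(raw: str) -> list[str]:
    """Split a comma-separated keyword field and drop empty entries."""
    return [kw.strip() for kw in raw.split(",") if kw.strip()]


def check_keywords(raw: str, known_titles) -> tuple[list[str], list[str]]:
    """Return (matched canonical titles, unmatched keywords)."""
    # Case-insensitive lookup from a keyword to its canonical title.
    canonical = {title.casefold(): title for title in known_titles}
    matched, unmatched = [], []
    for keyword in split_keywords(raw):
        title = canonical.get(keyword.casefold())
        if title is None:
            unmatched.append(keyword)
        else:
            matched.append(title)
    return matched, unmatched


if __name__ == "__main__":
    titles = ["Operating system", "POSIX", "Embedded system"]
    print(check_keywords("operating system, posix, awesome sauce", titles))
    # Missing commas turn several keywords into one unmatched string.
    print(check_keywords("operating system posix", titles))
```

The unmatched list is what an org admin would review: each entry is either a typo, a missing comma, or a tag with no page behind it, and the admin can still choose to keep it.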
> 
> --joel
> RTEMS
> 
> [1] I am personally interested in the "Awesome_God" project. :)
> 
> On Fri, Feb 22, 2013 at 2:53 AM, Pablo N. Mendes <pablomendes at gmail.com> wrote:
> 
> Well, what about just using Wikipedia pages as topics? All major languages, algorithms, techniques, technologies, etc. are there already. That is the idea of the DBpedia project: use each Wikipedia page as describing a "thing", and subsequently classify these things according to a taxonomy. I don't see why we would need to repeat that work.
> 
> Moreover, our tool (DBpedia Spotlight) has the ability to churn through your "project ideas" pages and suggest DBpedia "tags" based on the content of those pages. Since this information extraction process is automated and tackles a difficult problem, the tool makes mistakes. But, no problem, updating this list should be less work than just creating a taxonomy from scratch and then tagging each project manually.
> 
> By provocation from Jimmy O'Reagan, I sat down together with Max Jakob and hacked something together. Here are the tags that we have extracted for 2011 and 2012 projects:
> https://github.com/pablomendes/dbpedia-spotlight-gsoc/tree/master/data
> 
> We later sat down with Jo Daiber and also prettied up a little search application to help users find projects and to demonstrate what other advantages you may gain by using DBpedia tags rather than just plain old "string" tags.
> http://spotlight.dbpedia.org/gsoc-searcher/
> 
> The function described there as "Expand" uses some knowledge that we have about the tags to try and model "relatedness", so that we can suggest extra related tags for the students to search.
> 
> Everything was hacked together quickly, so it is still very preliminary. But I think it demonstrates the core ideas, if you are open minded. :)
> 
> What do you think?
> 
> Cheers,
> Pablo
> 
> 
> On Wed, Feb 20, 2013 at 9:59 PM, Stefan Seefeld <stefan at codesourcery.com> wrote:
> On 02/20/2013 03:46 PM, Aaron Meurer wrote:
> > An ideal situation would be if there were some kind of StackOverflow
> > type system where keywords autocompleted against what has already been
> > used, and synonyms could be defined. I don't know if the Melange team
> > has the desire and/or resources to do something like this.
> 
> Indeed, a shared taxonomy would help greatly. And there are a few
> precedents used in the FLOSS world to pick from, such as:
> https://pypi.python.org/pypi?%3Aaction=list_classifiers
> 
> I still struggle to understand why the GSoC (meta-)project needed its
> own web application. It could have picked one with such features already
> in place.
> 
> But that's a separate discussion.
> 
>         Stefan
> 
> --
> Stefan Seefeld
> stefan at codesourcery.com
> 
> 
> 
> -- 
> 
> Pablo N. Mendes
> http://pablomendes.com
> 
> -- 
> You received this message because you are subscribed to the Google Groups "Google Summer of Code Mentors List" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to google-summer-of-code-mentors-list+unsubscribe at googlegroups.com.
> To post to this group, send email to google-summer-of-code-mentors-list at googlegroups.com.
> Visit this group at http://groups.google.com/group/google-summer-of-code-mentors-list?hl=en-US.
> For more options, visit https://groups.google.com/groups/opt_out.
>  
>  
> 
> 
> 
> 

-- 
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at drycafe dot net :
===========================================================







