Wednesday, September 1, 2010

Coling 2010 Workshop The People’s Web Meets NLP: Collaboratively Constructed Semantic Resources

Saturday, 28th August 2010, was a long and eventful day: I attended a workshop co-located with COLING 2010 in Beijing, China. The workshop was organized by the Ubiquitous Knowledge Processing (UKP) Lab in the Department of Computer Science at the Technische Universität Darmstadt, Germany. Its main theme was collaboratively constructed semantic resources and their role and influence in today's Natural Language Processing research. Although this was only the second edition of the workshop and the community is still small, with several new faces this time, the way the discussions were carried out offers much promise for this community.

The following diagram gives a pictorial representation of what the workshop was all about:

Natural language processing has long faced what we can call a knowledge acquisition bottleneck. Earlier, this bottleneck was addressed through semantic and lexical resources developed by experts, WordNet being the classic example. With the emergence of Web 2.0, however, the scenario has changed completely, and the focus of the NLP community has moved from the classical resources to collaboratively constructed semantic resources (CCSRs), Wikipedia being the most popular example. This is why we now see an increasing number of research publications on such themes at reputed conferences such as CIKM, WWW, and EMNLP.

As the diagram above shows, the workshop focused on research that uses CCSRs to enhance and further NLP and, the other way around, on research that uses NLP to improve CCSRs.

The best thing about the workshop was that it welcomed researchers from diverse domains: some this time, with the promise of more next time. In fact, the invited talk by Professor Tat-Seng Chua of the National University of Singapore also came from a different area, namely community-based question answering. The papers presented fell into two categories:
  • Those using collaboratively constructed resources as sources of lexical semantic information for NLP purposes such as information retrieval, named entity recognition, or keyword extraction.
  • Those using NLP techniques to improve the resources or to extract and analyze different types of lexical semantic information from them.
Overall, 8 papers were presented at the workshop: 5 on Wikipedia, 1 on Amazon Mechanical Turk, 1 on translation resources, and 1 on the blogosphere.

The paper presentations were followed by an extremely interactive and informative discussion on the theme of the workshop: collaboratively constructed semantic resources. In my opinion every attendee had something to take away from the discussion. Below I share all the questions raised during the discussion along with the thoughts of the attendees. Readers are welcome to chip in with their comments, thoughts, and suggestions for the mutual benefit of the whole community.

The first question concerned the scope of the workshop's name, i.e. whether the name "Collaboratively Constructed Semantic Resources" is too wide or too narrow. The very first suggestion came from the invited speaker, Professor Tat-Seng Chua, who proposed the term Community Based Semantic Resources instead. Two other attendees suggested the terms crowd sourcing and wisdom of the crowds, as these are more popular within this area and are used by workshops co-located with other reputed conferences. However, one of the workshop chairs, Dr. Torsten Zesch, put forward a valid argument against both terms. It is hard to agree that Wikipedia counts as a crowd-sourced resource, so that term may narrow the scope of the workshop too much. Wisdom of the crowds, on the other hand, is a very widespread term for Web 2.0 tasks and many other conferences use it, but it is too broad a theme given that the workshop centers on using the resources for NLP tasks and using NLP to improve collaboratively constructed semantic resources.

The next question was a very important one, as it bridges the NLP research of yesterday and today: what is the relation between expert-made and collaboratively constructed resources? Are they complementary or are they different? The workshop chair, Professor Dr. Iryna Gurevych, explained the question further: for many years the NLP community has relied on classical lexical and semantic resources, and they have served us well, but with Web 2.0 CCSRs are in widespread use. Do we still need the classical resources, and should we spend our efforts on improving them? On this question almost all the attendees were in agreement that the quality and correctness issues in CCSRs have to be addressed. Wikipedia, for example, has some quality and correctness issues, and there is very little work on addressing them. Compared with each other, CCSRs are better in terms of coverage and the classical resources are better in terms of quality, so the need is to bring the quality of classical, expert-made resources into CCSRs. For this we need to provide incentives that guide the crowds who generate the CCSRs. Mechanical Turk, for example, has a nice way of ensuring quality through monetary incentives, and the research community needs to think of more ways to ensure that high-quality CCSRs are produced.

The third question focused on the various types of CCSRs and their classification: what are the most valuable collaboratively constructed semantic resources, and how can we classify them? The types of CCSRs mentioned by the attendees were Wikipedia, Twitter and social networks, forums and CQA sites, YouTube, Flickr, and the wiki family, e.g. Wiktionary, Wikiversity, and Wikitravel. As for the value of any CCSR, the attendees agreed that coverage and the number of people involved in creating the resource are the determining factors. As for the most valuable CCSRs, Wikipedia's importance cannot be denied, and there is ample evidence for it. For the future of CCSRs, Twitter may emerge as an extremely valuable resource, as it is a whole wealth of knowledge waiting to be mined. Moreover, Twitter has managed to attract the attention of the research community in a very short span of time, as can be seen from the number of research publications using Twitter as a source of data. Reputed publication venues such as WWW, CIKM, SIGKDD, and COLING now have many papers on Twitter, and if the NLP community directs its efforts towards this resource properly, it can be used as a very effective and useful CCSR. As for classification, a well-grounded classification of CCSRs does not exist yet, and this may itself be a research problem within this area, since classification is multi-dimensional. One significant point raised when comparing Wikipedia and Twitter is that on Wikipedia the users' intention is to build a single, useful resource, whereas on Twitter people share their thoughts and messages separately in 140-character tweets. This implies that the nature of each CCSR is different, and it is important to take this into account.

The fourth question concerned the impact CCSRs are having: where do CCSRs have the largest impact, and do they really make a difference? Everyone agreed that the impact of CCSRs is huge, as they have eased the knowledge acquisition bottleneck for researchers and data is now a free resource. Since freely available resources empower people, CCSRs are worth the effort, and directing research efforts towards them will be beneficial in the future as well. People start to think in new ways with new resources, and this benefits the whole community. In this direction, one of the attendees raised an important point: commercial giants such as search engines may already be using CCSRs for their tasks, and this may also be a significant business secret for them. All this implies that CCSRs have great potential to heavily impact both research and commercial applications, and the community needs to think about more ways of creating and improving such resources. One example is "games with a purpose", which give people a leisure incentive rather than a monetary one for creating the resources. Google's Image Labeler is one such application, which Google uses to generate image tags and thereby improve its image search.

The fifth and last question concerned the different research areas that have an interest in CCSRs: which scientific communities have collaboratively constructed semantic resources as a distinct topic, and which fields other than Computational Linguistics/Natural Language Processing/Human Language Technologies should we collaborate with regarding CCSRs? In my opinion this question has a broad range of answers. Some of the answers discussed during the workshop suggested collaborating with people from the social sciences, as "Social Science meets Computer Science" has recently become a prominent emerging theme. Moreover, people from the computer networking domain can also provide useful insights with respect to CCSRs. In the future we may therefore see a broad range of researchers gathering to work collaboratively on collaboratively constructed semantic resources.

The workshop was a great experience for me, and I look forward to attending it and presenting my work there in the future as well. Readers are invited to leave any comments or suggestions regarding collaboratively constructed semantic resources.
