Saturday, August 27, 2011

Visit to Russia: RuSSIR/EDBT Summer School

Although I microblogged constantly on Twitter during my trip to Russia, nothing replaces a detailed blog post when it comes to coverage. I definitely wish to have an archive of details for myself and for students of Information Retrieval (and, of course, related areas) around the world. Together with my husband and colleague Muhammad Atif Qureshi, I visited St. Petersburg, Russia from 14th August, 2011 to 20th August, 2011 to attend the prestigious Russian Summer School in Information Retrieval (RuSSIR), which was co-located with the Russian Young Scientists' Conference where we presented our research work. This year's RuSSIR was quite special as the EDBT summer school was also co-located with it, and as such the breadth and depth of the lectures presented at the school were immense. Here is a brief overview of the lecture sessions that I attended, along with some good news for students in Karachi, Pakistan.

SocM Session: The Social Mining session was conducted by two well-known industry people, namely Vladimir Gorovoy of Yandex and Yana Volkovich of Barcelona Media Innovation Center. It was highly interactive and hands-on, with a practical recommendation task for which students were provided with a real dataset from Yandex Market. Here is a link for students who wish to try it out: Yandex Market practical task from RuSSIR. The session fundamentally covered various aspects of mining social media data. It began with a very apt observation borrowed from Google's analytics evangelist Avinash Kaushik: "Social media is the hot thing today, almost every one seems excited to get involved in it but no one actually knows how." The session covered that "how" with a glimpse into graph mining methods (PageRank, TunkRank and TwitterRank being some examples), models for opinion mining of reviews left by customers, social media engagement metrics and social innovation platforms for the future. In short, it was an extremely engaging and knowledge-enriched session, particularly helpful for social media analytics students. I learned a lot during the course of this session and am particularly thankful to Dr. Yana Volkovich for some of her wonderful suggestions that will really help me in my own research.
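
For students who have not met these graph mining methods before, here is a minimal Python sketch of the basic PageRank idea. It is only an illustration under stated assumptions: the function name `pagerank`, the toy `follows` graph and the parameter values are hypothetical and were not part of the session material, and TunkRank and TwitterRank build their own user-influence models on top of this kind of computation.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Basic PageRank by power iteration.

    graph: dict mapping each node to the list of nodes it links to
    (hypothetical input format chosen for this sketch).
    Returns a dict mapping node -> score; the scores sum to 1.
    """
    # Collect every node, including ones that only appear as link targets.
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    n = len(nodes)

    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node receives the "random jump" share first.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = graph.get(node, [])
            if targets:
                # Split this node's score equally among its outgoing links.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A node with no outgoing links spreads its score evenly.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Toy "who follows whom" graph, invented for illustration only.
follows = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}
for user, score in sorted(pagerank(follows).items(), key=lambda p: -p[1]):
    print(user, round(score, 3))
```

In this toy graph, user "a" is followed by two users while "d" is followed by nobody, so "a" ends up with the highest score and "d" with the lowest.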

Plenary Session (Knowledge Harvesting from Web Sources): I found this session very informative and full of pointers for new research ideas, although it was a bit away from my own research area. Gerhard Weikum (Research Director at Max-Planck Institute for Informatics (MPII) in Saarbruecken, Germany) presented a comprehensive overview of research methodologies that can turn the Web into a large-scale Knowledge Base. A few examples of such Knowledge Bases are DBpedia, KnowItAll, ReadTheWeb, and YAGO-NAGA, as well as industrial ones such as Freebase and Trueknowledge. The tutorial presented research methodologies along the avenue of knowledge harvesting, with examples of work on the unification of WordNet and Wikipedia in YAGO, the identification of a long tail of instances of entity classes through harvesting textual snippets on the Web, and entity search through language model ranking. Overall the session was intense and the slides quite heavy, with lots and lots of natural language processing material, but it was definitely a great learning activity from the point of view of tools to use for your own research.

SentA Session: This session was one of the most exciting ones for me as my own research centers on Sentiment Analysis. Professor Mike Thelwall, who heads the Statistical Cybermetrics Research Group at the University of Wolverhampton, delivered the talks in this session, which mostly centered on his research group's sentiment strength detection tool, SentiStrength. We were also taken through a live demonstration of the tool, after which Professor Thelwall explained in detail its various features along with the underlying algorithms and its experimental evaluations. The SentiStrength team has done a pretty good job of managing this tool, and the best thing about it is that the word list marked with each word's positive/negative strength is publicly available for research purposes. During this session students were also introduced to machine learning methods for Sentiment Analysis, with detailed explanations of feature selection, gold standard creation and 10-fold cross validation. To sum up, this session was extremely useful for students wishing to make a career in Sentiment Analysis, and I especially thank Professor Mike for his valuable suggestions on various aspects of the field.
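
For readers new to the evaluation side, the 10-fold cross validation mentioned above can be sketched in a few lines of plain Python. This is a generic illustration and not code from the session or from SentiStrength: the function name `k_fold_splits` and the fixed `seed` are assumptions made for the example, and it assumes there are at least as many labelled items as folds.

```python
import random


def k_fold_splits(n_items, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation.

    The n_items labelled examples (e.g. a gold standard of annotated texts)
    are shuffled once and dealt into k folds. Each fold serves as the test
    set exactly once while the remaining k - 1 folds form the training set.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # fixed seed keeps runs repeatable
    folds = [indices[i::k] for i in range(k)]  # deal indices round-robin
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Example: 25 annotated texts split into 10 folds.
for train, test in k_fold_splits(25, k=10):
    print(len(train), "training items,", len(test), "test items")
```

A classifier would be trained on each `train` set and scored on the matching `test` set, and the ten scores averaged, so that every annotated item is used for testing exactly once.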

ColIR Session: This was a short session conducted by Chirag Shah of Rutgers University. It touched on a completely new dimension within the field of Information Retrieval, namely Information Retrieval facilitated through collaboration. According to Professor Chirag Shah, with the emergence of collaborative Web platforms, information retrieval has also moved in a completely new direction. The traditional view of IR is that it is an individual activity; the Collaborative IR community challenges this notion by describing it as a co-ordinated activity, and they have demonstrated their ideas in both theory and practice. This session covered both the theory and practice behind collaborative IR situations, systems, and evaluation techniques.

TopK Session: This session, presented by the two charming ladies Sihem Amer-Yahia and Julia Stoyanovich, was simply fantastic. We were introduced to a whole new approach to solving some of the toughest problems in social media, and this approach comes from the old, classical database field. The session mainly centered on top-k processing, one of the well-known methods for ranked retrieval within the DB-IR research community, which was presented in a unique manner with a special focus on applying it to search and information discovery on the Social Web. Such applications were discussed from two significant viewpoints: 1) efficiency (minimizing both space and time requirements) and 2) user satisfaction. Both researchers presented a comprehensive overview of their papers published in top Database and Information Retrieval conferences: VLDB, ICWSM, SIGMOD and ACM HT. Their research within the efficiency dimension was based on incorporating upper bounds into classical top-k algorithms (the threshold algorithm and the no-random-access algorithm) in order to minimize time and space complexity. Their research within the user satisfaction dimension presented the fundamental idea of scaling up user studies to thousands of users by leveraging crowd-sourcing platforms such as Amazon Mechanical Turk. Currently I am reading these papers to look for dimensions that can be applied to my own research in Social Media Analytics.
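
To give a feel for the classical threshold algorithm named above, here is a minimal Python sketch of its textbook form. It is not the presenters' method (their papers add further upper bounds for social applications); the function name `threshold_top_k`, the input format and the sample scores are assumptions for illustration. The sketch assumes non-negative scores, a sum as the aggregate function, and a score of 0 for an item missing from a list.

```python
import heapq


def threshold_top_k(lists, k):
    """Textbook threshold algorithm with sum as the aggregate function.

    lists: one dict per ranking criterion, mapping item -> score >= 0
    (hypothetical input format for this sketch).
    Returns the k best (item, total_score) pairs, best first.
    """
    # Sorted access: each list ordered by descending score.
    sorted_lists = [sorted(l.items(), key=lambda p: -p[1]) for l in lists]
    totals = {}  # full aggregate score of every item seen so far
    max_depth = max(len(s) for s in sorted_lists)

    for depth in range(max_depth):
        threshold = 0.0
        for s in sorted_lists:
            if depth < len(s):
                item, score = s[depth]
                threshold += score
                if item not in totals:
                    # Random access: look the item up in every list.
                    totals[item] = sum(l.get(item, 0.0) for l in lists)
        top = heapq.nlargest(k, totals.items(), key=lambda p: p[1])
        # No unseen item can score above the threshold, so once the k-th
        # best seen item reaches it, the scan can stop early.
        if len(top) == k and top[-1][1] >= threshold:
            return top
    return heapq.nlargest(k, totals.items(), key=lambda p: p[1])


# Invented scores: relevance to a query and popularity among friends.
relevance = {"a": 0.9, "b": 0.8, "c": 0.1}
popularity = {"a": 0.2, "b": 0.7, "c": 0.9}
print(threshold_top_k([relevance, popularity], k=2))
```

The efficiency gain comes from the early stop: on long lists the scan usually ends well before every item has been read, which is the kind of saving the efficiency viewpoint above is concerned with.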

Here is an archive of tweets during my attendance at RuSSIR:

#RuSSIR sessions kick off with interesting presentation on Social Media Mining by @yvolkovich and Vladimir Gorovoy

Not many people know abt. a social network exclusively devoted to travel and hospitality: CouchSurfing

Can an online social network build enough trust to allow strangers to sleep on each others’ couches: Adamic's paper

"The Web today is the largest knowledge encyclopaedia - we need it to turn it into a comprehensive Database" - Gerhard Weikum at #RuSSIR

In a very interesting talk by Mike Thelwall explaining the working of the famous sentiment analysis tool SentiStrength #RuSSIR

Automatic sentiment analysis has more or less the same accuracy as human sentiment analysis due to complexity of problem - Mike Thelwall

A look into inside of Yandex Market by @vgorovoy in session of Social Media Mining

Interesting talks in TopK session at #RuSSIR: essentially about converting social media research problems to traditional database problems

Researcher from Barcelona Media Innovation Center explains the science of social media mining #RuSSIR

Mention of work of KAIST's @sbmoon in #RuSSIR in Social Media mining lecture

Andrey Plakhov explains how entity-oriented search works at Yandex: Russia's search engine that has larger market share than Google Russia

Wonder where this rule came from #RuSSIR #Yandex

Sihem Amer-Yahia of Qatar Computing Research Institute continues day 3 of session on TopK Processing for Social Applications

Wonderful graphic by @yvolkovich on visualization of social media conversations during Spain protests #RuSSIR

Take-home from ColIR session: Science is all about collaboration unlike the Humanities #RuSSIR

AlJazeera English tracking information of users who visit the site for improved user experience - Sihem of QCRI at #RuSSIR

SearchTogether by Microsoft Research takes user-mediated Collaborative Information Retrieval one step ahead #RuSSIR

ColIR session: reason behind failure of Google Wave was the difficulty of the system requiring a 60-minute video tutorial #RuSSIR

Take-home of TopK #RuSSIR session: Social Web is full of challenges, our online social experience will be as good as we researchers make it

A week of super-duper learning and knowledge-sharing, intense discussions and lots of research take-aways. Hats off to #RuSSIR team!!

In short, Russia is a wonderful place to visit and St. Petersburg is mind-blowing. Russian people are extremely hospitable and friendly, and what's best about them is their love and passion for Mathematics. All in all, Russia is a great place to visit if you are a Computer Science researcher, as it is full of wonderful Computer Scientists, both established researchers and young, aspiring students of science.

Finally, I am glad to announce that the Web Science group at the Institute of Business Administration will conduct an open seminar which will educate Pakistani students in some of the above-mentioned topics. Feel free to contact me with any suggestions for the seminar, or any topic you wish to include.

Thursday, July 28, 2011

Thoughts on Computer Science's 'Sputnik Moment'

I have not had the chance to blog of late. The past few months have been extraordinarily busy with lots of research ideas in the pipeline, and I, along with my colleague and husband, am also into teaching now with the newly introduced "Introduction to Web Science and Technology" course at the Faculty of Computer Science, Institute of Business Administration. It's been a great experience working in Pakistan, trying to bring the Computer Science research culture here on par with international standards: it is a tough but all the same fascinating journey.

Today I am writing at the request of a student who asked for my thoughts on the debate being conducted in the New York Times on the topic of "Computer Science's Sputnik Moment". It all began when I shared one aspect of this debate on my Facebook wall: the viewpoint of Dr. Ed Lazowska (University of Washington), who believes Computer Science to be central to our future. What particularly appealed to me was his statement below:

For students who want to change the world, there is no field with greater impact or leverage than computer science. Just take a look at the 2010 report by the President's Council of Advisers on Science and Technology, which characterized computer science as “arguably unique among all fields of science and engineering in the breadth of its impact.”

I received a private message from a student who disagreed with this viewpoint and shared Vivek Wadhwa's arguments from the same debate. The student, who happens to be an alumnus of FAST-NUCES, wanted to know my view on the famous "tech bubble." The premise behind his argument was that today's students are flocking to Computer Science out of a passion to become the next Zuckerberg, and that the driving factor behind the rise in Computer Science grads is gimmicky social media applications which, in spite of being a major innovation, are a bubble. The premise is no doubt strong; however, the point being missed here is the difference between a scientist's approach and a technologist's approach. Wadhwa lacks the insight necessary to grasp the point being made by Dr. Lazowska, which is that Computer Science as a whole new science has the potential to impact almost all other fields of science: it is indispensable for society today.

Wadhwa is an entrepreneur turned academic, and this in my opinion may be one of the reasons he fails to grasp the essence of Computer Science as a whole. It is true that a large chunk of today's students run after the sparkling thing called social media, but it often happens that their perception of Computer Science changes once they explore the theoretical marvels of this field. A glaring example of this is the Web Science course I am conducting at the Institute of Business Administration. Initially students did not understand what the course was about and what they would be learning in it, for sadly the Web to them meant ASP, PHP, HTML and nothing beyond that. Once we began teaching the Web from a scientific perspective, students were simply amazed; we are now at the point where they think beyond SEO and are well aware of the science behind search engines. The point is that students may not see the real depth in science at first, but that is not just their fault: those responsible for Science curricula should be doing things the right way, and this will definitely make a difference.

Secondly, the point is not about lasting careers or high-paying jobs: it's about making a difference to the world through Computer Science. The point is about pursuing Computer Science because your country needs you and not because you need a mere job! That's what's meant by a "Sputnik moment." Look at the reports that Lazowska links to: Computer Science is a key to the future due to its vast potential to deliver in areas that matter to our countries, such as the health sector, the energy sector, the military surveillance sector and many others. I can go on and on, but what is really disturbing is the naive approach of our students, who have limited life goals and no vision on a broader scale.

Furthermore the examples that Dr. Lazowska quotes are of Noam Chomsky, Watson and Crick. Obviously, these people were not new kids on the block aiming to become the next Zuckerberg, and were not simply running after some social media setup. They were scientists with a vision: a vision to further knowledge so that it serves as a foundation for generations to come. Many of the Google tools you play with and spread on your social networks would not have even existed without Computer Scientists like Dennis Ritchie, Ken Thompson, and Brian Kernighan.

I would love to hear thoughts on this, particularly from Pakistani Computer Science circles, be it students or researchers. It is hard to get people in Pakistan engaged in a knowledgeable debate, and this is true even for people who have done their PhDs or postdocs, but it's always worth a try. So feel free to add your viewpoint in the comments section.

Tuesday, May 3, 2011

Interacting with Pakistani Students: Some Tips for Taking Up a Research Career

Almost every week or two I receive emails from students around the world asking for help with their research work and for tips on getting into a research career. However, there is a marked difference between the emails that I receive from Pakistani students and the ones that I receive from students in other parts of the world. European students in particular usually request my Master's thesis or papers, and at times ask questions about the techniques we use in our papers. Similarly, students from Korea, China, Hong Kong, Taiwan, Egypt, and Malaysia ask brilliant questions about research and are more focused on a specific topic; in fact they even suggest novel extensions to existing work, including pointers to useful techniques we can incorporate in our work. In short, they have already identified a research path for themselves and work along that path, with their questions aimed at getting guidance on their chosen topic. On the other hand, most of the students from Pakistan have this single question: please suggest me some research topic or research idea?

Today I feel the urge to write specifically to address this question from Pakistani students, as I feel this issue has to be taken up carefully. My advice for such students is very simple: no one can tell you a research topic of your interest. I am sure Pakistani CS students will find this answer slightly confusing, so I will elaborate further. Just as nobody can tell you what your favorite food is, no one can tell you what area of research you should pick, for that depends completely upon your likes and dislikes. The fundamental problem with such a question is that the students do not even narrow down the research area or domain within which they want to work, and instead put the question to others: please suggest a research topic for me. It would be understandable if they had at least narrowed down the research area in which they wish to work. Dear students, please remember one thing: if your research area is handed to you by someone else, you may be able to finish the task at hand, but you will never discover the passion that research needs. You will never enjoy your research, and research without enjoyment can never attain fruitful results.

"If you fancy a career as a researcher, you'll spend tens of thousands of hours on work over the next 10 years. The only way you're ever gonna spend 10,000 hours on research is only when you truly deeply love it. If something really engages you and makes you happy, then you will put in the kind of energy and time necessary to become an expert at it." - Click for Source

This is not to blame or strongly condemn the students. My point is to convey the mistake our students make, and I do not blame them for this state of affairs. In a country where education is more of a corporate business, where Computer Science education in particular is hijacked by technologists who know nothing about science, and where Professors do not know international standards of research and are not even aware of the best academic conferences of their field, such confusion among students is bound to exist. The problem is clearly a lack of guidance for the students, and not many people wish to do anything about it; in fact there are some "technology experts" who are even cashing in on this "lack of guidance" for their own fame and publicity. The state of affairs is so pathetic that our students do not even know what a research paper is, let alone read one, and hence they fail to grasp the whole point of scientific research. When a student has no idea where to begin, how can he/she get any idea about a research topic?

Here I list some tips based on my research experience. These are specifically for students who wish to do research but have no clear idea of how to proceed.

1. Narrow down your research area: if you do not know which research areas exist within the broad field of Computer Science then no worries: simply visit the web sites of the Computer Science departments of famous research universities such as MIT, Stanford, Berkeley, CMU, Cambridge, Oxford, CUHK, ANU, KAIST etc. and browse to their research sections, where you will find many research areas listed. Do not just get fascinated by the name of a particular research area; read more about it and then decide whether the area interests you or not.

2a. After step 1, i.e. identification of your research area, find out the conferences/journals that are well-known for that particular area. This task will not be hard either: use DBLP, a Computer Science bibliography web site that lists reputed conferences and journals. The name of the conference/journal will pretty much tell you whether it covers the field you have identified or not.
2b. In addition to step 2a, search for the names of famous research groups working in your identified research area. For instance, if the field you have narrowed down is Social Computing, simply search for "Research Groups Social Computing" and then browse the works of the well-known groups of that domain.

3. After listing the conferences and journals within the research area of your interest, read the most popular and latest papers of those conferences. For example, anyone interested in distributed systems would immediately discover Google's MapReduce paper as the de-facto distributed computing standard and should read that. Another significant factor to look for is the number of citations a paper has received: read the most cited papers first to get a grip on the topic. Google Scholar will help you find the number of citations for a paper.

After having read 20-30 papers you will definitely come up with a crude idea. Refining that idea will of course require discussions with your advisor or senior researchers; in fact you can even email the authors of some of the papers you read. Researchers love to share and increase knowledge, for that is the whole point of research: unlike the corporate, commercial world, the research world does not like to hide things, for it is all about knowledge-sharing, and a researcher who does not share his/her knowledge is never looked upon with respect.

Another handy tool that can help immensely in research is Twitter. Although it is known as a social networking or micro-blogging service, many in the scientific community regard it as the new journal archive. Some of the groups you identify within your areas of interest will be active on Twitter, and you can follow them there for updates, for their latest works, and often for useful reading material that can help you a lot in your research. But a note of caution: don't bother them with silly questions like "please tell me a topic of research". They are mature researchers with top-quality students, and anyone who comes to them with such questions will be looked upon as an alien, so this is where you have to be extra cautious.

Feel free to email me with any questions, and I will be glad to help. Please remember that a research career seems attractive on the surface, but it requires more hard work than you would normally have to do in the software house or technology culture of Pakistan, because there are no ready-made sweets in research: crafting and scientific knowledge discovery are what you would have to master, which of course requires years and years of effort.

Sunday, March 27, 2011

Invited Talk in Research Universities of Malaysia: The Web Goes Social

As I write this blog post, I am packing and getting ready to leave Malaysia, where I came to give invited research talks at three of its best schools, namely UTM, MMU and UPM. It was a very hectic but nevertheless fun-filled trip, with lots of interaction with faculty and students. Many of the faculty members at the Malaysian universities gave an overwhelming response and inquired about directions for research using social media. It was a great learning experience for me to answer their questions, and also to have research discussions that can prove fruitful for joint research collaborations in future.

At UTM I presented my work on Web crawlers done in the Database and Multimedia lab of KAIST as part of my Master's thesis, while at the other two universities I presented my work on Social Web Mining which has been done in collaboration with different institutes of Pakistan.

As many researchers know, social media is not just a toy for the masses but also a pool of Web data with which Web mining researchers can further their research in various dimensions. The dimension I took up in the talks was related to my research area, "Web Search", and it focused on what social media tools have to offer to search engines and how social media, with its vast pool of data, can serve as an effective enhancement tool for search engines. Beyond that, I talked about how my research (in particular on Twitter) can provide an effective news monitoring platform that can be useful for media outlets, journalists, political organizations and even governments. I then focused on two works by my research group in this dimension: 1) blogosphere clustering, and 2) Twitter as a real-time news analysis service.

Slides are attached here. I also managed to make a video during the last talk at UPM (Universiti Putra Malaysia), and those interested can listen to the talk while viewing the slides. As before, interested students/researchers may contact me personally if they wish to work in any of these areas, as my research group is now actively seeking students to work on research publications in this domain.

The talk is in four parts which I have included in this post:

Sunday, March 20, 2011

Good Bye Korea: Memories of Database and Multimedia Lab of KAIST

Tonight happens to be my second-to-last night in Korea, and Korea for me was for the most part limited to my lab, the Database and Multimedia Lab of KAIST. Even tonight, as I am about to leave the land of Kimchi, I am working in the lab doing some final experiments for my Web crawling paper. I still have lots and lots of work remaining: cleaning the home, packing many small things, and yet here I am in the lab doing final jobs for my paper.

It is said that the path of a graduate student, and that too in a subject as innovative as Computer Science, is really hectic and requires a lot of patience. I experienced this first-hand during my Master's degree at KAIST. Though it has been a journey of immense stress and pressure, it has been enjoyable and fun all along. In particular, life in the Database and Multimedia lab has a culture of its own. Even in the "kaali raats" (the term used by Database and Multimedia Lab members to refer to a night spent entirely in the lab), we had fun, and I will surely miss each and every member of this family of mine. I may or may not come back to Korea, that is still undecided, but with this video I wish to thank my family at KAIST. We have had some extremely tough times but we have been a family, and the times spent in this lab are among my most precious memories; I learned a lot here, both about scientific research and about enjoying work.

Friday, January 7, 2011

Co-relating News and Tweets: "Ins and Outs of News Twitter as a Real-Time News Analysis Service"

The Web has seen a massive transformation with its read-only nature diminishing more and more and evolving into a read-write nature. Social networks are one of the driving forces behind this transformation and hence, the Social Web can be seen as a fundamental source of more and more UGC (user-generated content).

The phenomenon of "UGC" has also had a significant impact in the domain of Web Search, which happens to be my area of research: a study conducted in 2010 puts Facebook ahead of Google in terms of Web site hits. Many of the major search engines such as Google, Yahoo and Bing are now looking at means of incorporating the Social Web into their search results. The WWW 2010 paper titled "Anatomy of a Large-Scale Social Search Engine" describes the phenomenon in considerable detail and I recommend it as a must-read to those interested in the field. In fact, the team behind this paper created a social search system, Aardvark, that has now been acquired by Google.

Despite the tremendous importance and attention being given to the concept of social search, one significant domain within this area has not yet been explored much, and this recent paper by me and my research group attempts to explore it. In this paper we present a system which aims to detect hot news items in real time by taking into account user popularity and temporal features. We present a prototype of the approach using the popular microblogging service "Twitter" and report the results of some initial evaluations of our approach.

The proposed system analyzes real-time news by using data from Twitter. We give a description of news services, followed by an architecture for assessing news popularity. The architecture is built upon a Web crawling framework and a news parser, followed by the application of natural language processing techniques to the news data, which is then finally linked with the Twitter Search API. At the user interface end, we use a simple timeline-based visualization to showcase the popularity of news across time. Furthermore, data from the popular news service was crawled on a daily basis over a period of 10 days and analyzed for correlation with tweets; this analysis reveals interesting results such as the news bias exhibited by news services. Below is the paper, which can be downloaded as well.
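
To make the idea of a popularity timeline concrete for students, here is a rough Python sketch of how tweets could be matched to a news headline and counted per day. It is an illustration only and not the implementation from the paper: the keyword-overlap matching, the tiny stopword list, the function names `keywords` and `popularity_timeline`, the `min_overlap` parameter and the sample data are all assumptions, and the tweets are assumed to have been fetched already rather than pulled from the Twitter Search API.

```python
from collections import Counter
from datetime import date

# Tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"a", "an", "the", "of", "in", "on", "to", "for", "and", "at", "is"}


def keywords(text):
    """Lower-case the text, strip punctuation and drop stopwords."""
    words = (w.strip(".,:;!?\"'()").lower() for w in text.split())
    return {w for w in words if w and w not in STOPWORDS}


def popularity_timeline(headline, tweets, min_overlap=2):
    """Count, per day, the tweets that share keywords with a headline.

    tweets: iterable of (day, text) pairs that were already collected.
    A tweet counts as being about the news item if it shares at least
    min_overlap keywords with the headline (a crude assumption).
    """
    keys = keywords(headline)
    timeline = Counter()
    for day, text in tweets:
        if len(keys & keywords(text)) >= min_overlap:
            timeline[day] += 1
    return timeline


# Invented sample data for illustration.
sample_tweets = [
    (date(2011, 1, 1), "Floods in northern areas again"),
    (date(2011, 1, 1), "Northern province floods: thousands displaced"),
    (date(2011, 1, 2), "Nice weather today"),
    (date(2011, 1, 2), "province hit by floods"),
]
timeline = popularity_timeline("Floods hit the northern province", sample_tweets)
for day in sorted(timeline):
    print(day, timeline[day])
```

Plotting such per-day counts against time gives the kind of timeline view described above, and comparing the counts across headlines is one simple way to see which stories attract attention on Twitter.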

The paper will be published in the proceedings of the workshop "Visual Interfaces to the Social and Semantic Web (VISSW 2011)", co-located with the International Conference on Intelligent User Interfaces (IUI 2011) to be held at Stanford University in February, 2011. I am sharing it on my blog at the request of some students who have shown a lot of interest in the field. For further details/questions/feedback, a personal email would be preferred. Also, students willing to work with our research group in this dimension may contact me in person. I will also be uploading the slides and talk for this paper soon.