Saturday, November 10, 2012

Mining Tweets of World Cup T20 Match between India and Pakistan: Interesting Insights from Social Network Analysis

I have been absent from this space I call my blog for quite some time now, and not without reason. The past few months have been extremely busy, with lots of traveling (Milan, Venice, Rome, Nijmegen and Copenhagen, all within three to four months) and, of course, the never-ending paper submissions. As I explained in my previous post on online education initiatives, I am also taking Computer Science courses on Coursera, and this semester I took up a very interesting one: Social Network Analysis by Lada Adamic of the University of Michigan. Although I have taught some aspects of Social Network Analysis myself during a summer course at the Faculty of Computer Science at IBA, Karachi, I found this course intriguing, and the way Lada enriched it with cool applications of SNA was simply amazing.

As an optional part of this course, students could submit a programming project, and I thought this was the perfect opportunity to submit a part of the TweetCric project being undertaken by our research group. It's always good to get early feedback on your work in order to find useful, innovative directions, and hence I decided to blog about my Social Network Analysis project. Readers are welcome to suggest new directions or leave feedback in the comments to help us with this project. The following is a description of the project for interested readers of my blog:

Social media applications have considerably influenced the lives of millions, and every day a huge number of updates are posted to social networks such as Facebook and Twitter. As of March 2012, more than 400 million tweets were being posted on Twitter each day. The volume of tweets rises significantly during sporting events, as many sports fans now use social media as part of their viewing experience. Users find this experience fun, as the following Facebook status update from the recent World Cup T20 match between India and Pakistan shows:

"Facebook comments are more interesting than the match. Already more than two pages of comments. Looks like PakInd Vs Facebook"

Interestingly, the huge amount of content produced during sporting events can be used to analyse players' performance, and sports managers can draw on that analysis when deciding future strategy; in this way the notion of crowd-sourced sports critics can be realized in practice. Researchers have already begun to explore this huge volume of user-generated content for research problems such as event detection and video annotation for sports summaries [1, 2]. We argue that this crowd-sourced content can also be put to use for sports strategy analysts and decision-makers. In this work, we use social network analysis to highlight the significant players of the match, and we analyse why social network analysis methods single out these players.

Social Network Modeling
During the epic match held on September 30th, 2012, we gathered tweets using the Twitter Search API. A Python script queried the Search API at half-hour intervals, collecting fresh tweets as the match progressed. In total we collected a sample of 43,450 tweets with the hashtag #PakvsInd during the match.
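
The polling loop described above can be sketched roughly as follows. This is only an illustration, not our actual script: `collect_tweets` and its `fetch` argument are hypothetical names, `fetch` stands in for whatever call queries the Search API, and the sketch assumes each tweet is a dictionary carrying a unique `id`. Successive queries tend to return overlapping results, so tweets are kept once per id.

```python
import time


def collect_tweets(fetch, rounds, interval_seconds=1800, sleep=time.sleep):
    """Call `fetch` once per round and keep each tweet once, keyed by its id.

    `fetch` is a hypothetical callable returning a list of tweet dicts;
    it is a placeholder for the real Search API query.
    """
    seen = {}
    for i in range(rounds):
        for tweet in fetch():
            # setdefault keeps the first copy and ignores later duplicates
            seen.setdefault(tweet["id"], tweet)
        if i < rounds - 1:
            sleep(interval_seconds)  # half an hour between queries by default
    return list(seen.values())


# Demo with canned batches instead of a live API, and no real waiting.
batches = iter([
    [{"id": 1, "text": "hafeez goes #pakvsind"},
     {"id": 2, "text": "kohli on fire #pakvsind"}],
    [{"id": 2, "text": "kohli on fire #pakvsind"},
     {"id": 3, "text": "afridi! #pakvsind"}],
])
collected = collect_tweets(lambda: next(batches), rounds=2,
                           sleep=lambda seconds: None)
print([t["id"] for t in collected])
```

Passing `sleep` as a parameter is just a convenience so the loop can be tried out without waiting half an hour per round.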

We modeled the social graph of the players and commentators using the text content of the tweets. First, using Wikipedia and ESPN CricInfo as external resources, we compiled a list of players and commentators relevant to the India-Pakistan cricket match. This list was then used to detect tweets containing a mention of any player or commentator; the following list shows some sample tweets:

  1. hafeez goes, 15 from 28 balls.. idiot, wasted his time big time. game over. #pakvsind
  2. like the world cup of pakistani batsmen falling against yuvraj. kamran also departs edging to dhoni. pak 56/4 after 9. #pakvsind
  3. rt @maria_memon: rt @maria_memon: afridi! quit playing games with our hearts....our hearts....#pakvsind
  4. hafeez is the reason for todays batting performance.... after nazir he put all of the team under pressure! #pakvsind
  5. is dhoni trying to piss off pakistanis by bringing in kohli? #pakvsind
We now explain how we formulate the nodes and edges in our social network of players and commentators. Each player/commentator is treated as a node, and an edge is drawn between two players/commentators if they co-occur in a tweet. As an example, consider tweet 2 above: according to our model it yields an edge between each pair among yuvraj, kamran and dhoni. In total, 8,587 tweets (19.8%) contained a mention of some player or commentator.
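
As a rough sketch of this modeling step, the snippet below builds co-occurrence edges from the five sample tweets. Several things here are assumptions made for illustration: the short `NAMES` set (the real list was compiled from Wikipedia and ESPN CricInfo), the exact-token matching, the edge weights (number of tweets in which a pair co-occurs), and the function names.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical, abbreviated name list used only for this illustration.
NAMES = {"hafeez", "yuvraj", "kamran", "dhoni", "afridi", "nazir", "kohli"}


def mentioned_names(tweet, names=NAMES):
    """Return the sorted names that appear as whole words in the tweet."""
    tokens = set(re.findall(r"[a-z]+", tweet.lower()))
    return sorted(tokens & names)


def build_cooccurrence_edges(tweets, names=NAMES):
    """Count, for every pair of names, the tweets mentioning both."""
    edges = Counter()
    tweets_with_mention = 0
    for tweet in tweets:
        found = mentioned_names(tweet, names)
        if found:
            tweets_with_mention += 1
        # every unordered pair of co-mentioned names becomes an edge
        for a, b in combinations(found, 2):
            edges[(a, b)] += 1
    return edges, tweets_with_mention


sample = [
    "hafeez goes, 15 from 28 balls.. idiot, wasted his time big time. game over. #pakvsind",
    "like the world cup of pakistani batsmen falling against yuvraj. kamran also departs edging to dhoni. pak 56/4 after 9. #pakvsind",
    "rt @maria_memon: rt @maria_memon: afridi! quit playing games with our hearts....our hearts....#pakvsind",
    "hafeez is the reason for todays batting performance.... after nazir he put all of the team under pressure! #pakvsind",
    "is dhoni trying to piss off pakistanis by bringing in kohli? #pakvsind",
]
edges, tweets_with_mention = build_cooccurrence_edges(sample)
for pair, weight in sorted(edges.items()):
    print(pair, weight)
```

Tweets 1 and 3 mention a single name each, so they count towards the share of tweets with a mention but contribute no edge.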

The following figure shows the visualization obtained from this social network (Gephi was used to generate the graph):

Modeled social network of players/commentators during World Cup T20 India-Pakistan match
As is clear from the figure, there are three communities within this social network. Nodes are sized according to betweenness centrality, and Hafeez is the node with the highest betweenness: he was the captain of the Pakistani team in that match, and, as per most of the tweets, Pakistan lost due to his poor captaincy, poor field placements and poor batting. The node with the second-highest betweenness, Kohli, was man of the match and top-scored, leading India to a comfortable victory. Hence, social network analysis can give important insights into sporting events. Natural language processing, as an alternative approach, seems to lack the precision and efficiency that social network analysis offers. Given the low scalability and speed of natural language processing alone, our team has long argued for a hybrid approach that combines natural language processing and social network analysis to address research questions in the fields of Information Retrieval and Web Information Systems [3].
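
For readers unfamiliar with the measure used to size the nodes: betweenness centrality counts how many shortest paths between other pairs of nodes pass through a given node. Below is a minimal sketch using Brandes' algorithm on an unweighted, undirected toy graph. The toy edges are hypothetical and much smaller than the real network, and the scores are unnormalized, so they are not on the same scale as whatever Gephi reported for our graph.

```python
from collections import deque


def betweenness_centrality(adj):
    """Brandes' algorithm; `adj` maps each node to a set of neighbours."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        stack = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:  # first time w is reached
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:  # v lies on a shortest path to w
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating dependencies.
        delta = {v: 0.0 for v in adj}
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # In an undirected graph every pair is visited from both ends.
    return {v: c / 2 for v, c in bc.items()}


# Hypothetical toy edges, for illustration only.
toy_edges = [("dhoni", "kamran"), ("dhoni", "yuvraj"), ("kamran", "yuvraj"),
             ("hafeez", "nazir"), ("dhoni", "kohli"), ("hafeez", "dhoni")]
adj = {}
for a, b in toy_edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

scores = betweenness_centrality(adj)
for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(name, score)
```

In this toy graph dhoni scores 8.0 and hafeez 4.0, simply because of how the hypothetical edges were chosen; the ranking in the real network is the one described above.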

We now analyse the communities within this dataset. The community shown in blue mostly comprises Indian players, so it makes sense that they form a separate community. However, the inclusion of Misbah and Ajmal in this community is surprising, since both are Pakistani players. Further analysis reveals why: Ajmal took the important wicket of Sehwag, which pulled him into that community, and Misbah was mentioned once together with Ajmal, which pulled him in too. The community in dark green consists for the most part of Pakistani players, with the exception of Dhoni, the Indian team captain; this is due to Twitterers comparing Hafeez's captaincy with Dhoni's, which pulled Dhoni into that community. Lastly, the community in aqua green represents players who did not play in the match, with the exception of Afridi: Pakistani cricket fans tweeted suggestions that he be dropped, mentioning him alongside those not playing the match.
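
A note on how a grouping like this can be scored. This post does not go into which community detection method was run in Gephi, so the following is only an assumption-laden sketch: it computes Newman's modularity, a common quality score for a partition, on a hypothetical toy graph. The function name, the toy edges and the two-way split are all made up for illustration.

```python
def modularity(edges, community_of):
    """Modularity of a partition of an unweighted, undirected graph.

    Uses Q = sum over communities of (L_c / m - (d_c / (2 * m)) ** 2), where
    m is the number of edges, L_c the edges inside community c and d_c the
    total degree of its nodes.
    """
    m = len(edges)
    internal = {}
    degree_sum = {}
    for a, b in edges:
        ca, cb = community_of[a], community_of[b]
        # each endpoint adds one to the degree total of its own community
        degree_sum[ca] = degree_sum.get(ca, 0) + 1
        degree_sum[cb] = degree_sum.get(cb, 0) + 1
        if ca == cb:
            internal[ca] = internal.get(ca, 0) + 1
    return sum(internal.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree_sum.items())


# Hypothetical toy edges and partitions, for illustration only.
toy_edges = [("dhoni", "kamran"), ("dhoni", "yuvraj"), ("kamran", "yuvraj"),
             ("hafeez", "nazir"), ("dhoni", "kohli"), ("hafeez", "dhoni")]
split = {"dhoni": "A", "kamran": "A", "yuvraj": "A", "kohli": "A",
         "hafeez": "B", "nazir": "B"}
single = {name: "A" for name in split}

q_split = modularity(toy_edges, split)
q_single = modularity(toy_edges, single)
print(round(q_split, 3), round(q_single, 3))
```

A higher score means more edges fall inside the groups than would be expected by chance given the node degrees; putting everything into one group always scores zero, which is why the two-way split of the toy graph comes out ahead.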

Lastly, as I mentioned at the beginning of this post, any feedback or ideas are welcome. Interested students who want to join this project are requested to contact me personally via email or social networks.

[1] J. Nichols, J. Mahmud, and C. Drews. Summarizing sporting events using twitter. In Proceedings of the 2012 ACM international conference on Intelligent User Interfaces, IUI ’12, pages 189–198, New York, NY, USA, 2012. ACM.
[2] A. Tang and S. Boring. #epicplay: crowd-sourcing sports video highlights. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’12, pages 1569–1572, New York, NY, USA, 2012. ACM.
[3] A. Younus, M. Qureshi, F. Asar, M. Azam, M. Saeed, and N. Touheed. What do the average twitterers say: A twitter model for public opinion analysis in the face of major political events. In 2011 International Conference on Advances in Social Networks Analysis and Mining, pages 618–623. IEEE, 2011.


  1. Interesting analysis. A couple of things come to mind.

    First of all... so what you've done is map the colocations of players. Now consider an alternate experiment. Suppose you were to take a list of sentiment words (horrible, terrible, disaster, wonderful, excellent etc.) and graph the colocations of players -> sentiment words. E.g. in this case you'd see Hafeez being colocated with negative sentiment mostly. It would be fun to see such a representation (maybe not too much fun for Hafeez, admittedly).

    Secondly, I'm wondering what your process might contribute in other fields. At a higher level, what you've done is identify a group of subjects (players) in a topic (Pak vs Ind T20 match). Then you've done the centrality and community analysis based on colocations of mentions of these subjects. What kind of analysis can we do if we forget about cricket and instead consider political candidates (subjects) in an election (topic), for instance? I'm not sure. It's a novel representation though, which makes it interesting ;)