Università degli Studi dell'Insubria

Dipartimento di Scienze Teoriche e Applicate - DiSTA

Twitter TAF dataset: Detecting topically anomalous friendships.

As the popularity and usage of social media have exploded over the years, mining social media data for different purposes has become an important endeavor. In this paper, we analyze the data interest patterns of social network users over time to understand individual and collective user behavior on social networks.
By using two snapshots of the Twitter network, we first cluster users according to their data interests. Then, by using these groups in addition to the past interests of individual users, we detect emerging topics in groups, as well as differing interests of individual users within clusters. Building on these results, we propose novel anomaly metrics to identify users whose data interests diverge from collective behavior. Experiments on a large, real-world dataset show the effectiveness of our approach.

We are sharing a Twitter dataset in which new friendships of users were analyzed to find anomalous ones. The detected anomalous friendships were presented to Amazon Mechanical Turk users for validation. For detection, we examined the topical interests of users over time, using LDA with the number of topics set to 100. Each user is represented by two topic vectors. The first vector is called a tweetProfileVector, where the tweets of a user were the LDA input. The second vector is called a bioProfileVector, where the bio text of a user was the LDA input. Section 4.1 of our paper (given below) clarifies how this process was carried out.

Our anomaly detection results were verified in 12,081 reviews by 287 MTurkers.
Overall, we downloaded ~10M Twitter accounts and used 3M of them in our computations. MTurkers were asked about 4K friendships of 1.9K users.


The dataset has 6 files.

graph2009.txt: Friends of users in 2009. In Twitter terminology, the friends of a user are the accounts that the user follows. This part of the data is taken from Kwak et al.


5954192    202003,13475802,9904812.

User 5954192 follows users 202003, 13475802, and 9904812.

graph2013.txt: Friends of users in 2013.


5954192    202003,13475802,9904812.
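The two graph files can be compared to recover the new friendships of a user. Below is a minimal Python sketch, assuming that each row holds a user id, whitespace, and a comma-separated list of friend ids, and that the trailing period in the sample row is not part of the data. The function names parse_graph_line and new_friends are hypothetical and are not part of the dataset or of our code.

```python
def parse_graph_line(line):
    """Parse one graph row into (user_id, list_of_friend_ids)."""
    parts = line.split(maxsplit=1)
    user_id = int(parts[0])
    # Assumption: the trailing period seen in the sample row is not data.
    friends_field = parts[1].strip().rstrip(".") if len(parts) > 1 else ""
    friends = [int(f) for f in friends_field.split(",") if f.strip()]
    return user_id, friends


def new_friends(friends_2009, friends_2013):
    """Return the users followed in 2013 that were not followed in 2009."""
    return sorted(set(friends_2013) - set(friends_2009))


# Sample row as shown for graph2009.txt.
user, old = parse_graph_line("5954192    202003,13475802,9904812.")
# Hypothetical 2013 row for the same user with one made-up extra friend (42).
_, new = parse_graph_line("5954192    202003,13475802,9904812,42")
print(user, new_friends(old, new))
```

A set difference is used here because only friendships that appear between the two snapshots matter; friendships dropped between 2009 and 2013 are ignored by this sketch.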


tweetProfileVectors.txt: [tweetLDA] LDA representation of 2M Twitter users by their tweets.


1139248118    70    0.20    53    0.799

User 1139248118 has topics 70 and 53 with probabilities 0.20 and 0.799, respectively. Topic probabilities should sum to approximately 1.0.

bioProfileVectors.txt: [bioLDA] LDA representation of 2M Twitter users by their bio texts.

9839318214    75    0.45    4    0.55

User 9839318214 has topics 75 and 4 with probabilities 0.45 and 0.55, respectively. Topic probabilities should sum to 1.0.
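Rows of tweetProfileVectors.txt and bioProfileVectors.txt can be read as a user id followed by alternating topic and probability fields. The following Python sketch assumes whitespace-separated fields, as in the sample rows. The names parse_profile_line and is_normalized, and the tolerance of 0.01, are hypothetical choices made for illustration.

```python
def parse_profile_line(line):
    """Parse one profile row into (user_id, {topic_id: probability})."""
    fields = line.split()
    user_id = int(fields[0])
    rest = fields[1:]
    if len(rest) % 2 != 0:
        raise ValueError("expected alternating topic/probability fields")
    topics = {int(t): float(p) for t, p in zip(rest[0::2], rest[1::2])}
    return user_id, topics


def is_normalized(topics, tol=0.01):
    """Check that probabilities sum to 1.0 within a tolerance (assumed 0.01)."""
    return abs(sum(topics.values()) - 1.0) <= tol


# Sample row as shown for tweetProfileVectors.txt.
user_id, topics = parse_profile_line("1139248118    70    0.20    53    0.799")
print(user_id, topics, is_normalized(topics))
```

The tolerance is there because the listed probabilities appear to be rounded, so a row may sum to slightly less or more than 1.0.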

MTurkdecisions.txt: MTurkers’ decisions on friendships (i.e., whether a friendship is anomalous or not).


9525212    30313925    i5cql07hi    no    Developers need to attend to politics to grow their business.    2013-10-29 13:00:01

MTurker i5cql07hi (the MTurker id is anonymized here) was shown that user 9525212 has a new friend, 30313925, and was asked to tell us whether this friendship is an anomaly (‘yes’) or not (‘no’). An MTurker who cannot decide can choose not enough data to tell (‘possible’). In this row, we see that the MTurker chose ‘no’ and gave the explanation “Developers need to attend to politics to grow their business”. The decision was recorded at 2013-10-29 13:00:01.
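A decision row can be split into its six fields as sketched below in Python. The field separator is an assumption: the sketch accepts either a tab or a run of two or more whitespace characters, so an explanation that itself contains a double space would be split incorrectly. The Decision type and its field names are hypothetical labels for the columns described above.

```python
import re
from datetime import datetime
from typing import NamedTuple


class Decision(NamedTuple):
    user_id: int
    friend_id: int
    mturker_id: str
    label: str          # 'yes', 'no' or 'possible'
    explanation: str    # may be empty
    recorded_at: datetime


def parse_decision_line(line):
    """Parse one decision row; the separator is assumed, not documented."""
    fields = re.split(r"\t|\s{2,}", line.strip())
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}")
    user, friend, mturker, label, explanation, stamp = fields
    if label not in {"yes", "no", "possible"}:
        raise ValueError(f"unexpected label: {label!r}")
    when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
    return Decision(int(user), int(friend), mturker, label, explanation, when)


row = ("9525212    30313925    i5cql07hi    no    "
       "Developers need to attend to politics to grow their business.    "
       "2013-10-29 13:00:01")
decision = parse_decision_line(row)
print(decision.label, decision.recorded_at)
```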

bio.txt: Full bio information of the 4K Twitter users whose friendships were shown to MTurkers. We provide this data because MTurkers saw some shared attributes (e.g., same city, language) between user-friend pairs, and this affected their decisions.


jack    952    2337560    20750    13378    en    Pacific Time (US & Canada)    San Francisco    A sailor, a tailor

Observations on the data. tweetLDA is more reliable than bioLDA:
Overall, bioLDA results show that technical words, such as official and Facebook, produce topics that describe the function of an account (e.g., Official Twitter feed for Senator Frank R. Lautenberg), but these topics fail to give us a sense of what the Twitter profiles are about. On the other hand, tweetLDA is more useful for understanding the characteristics of users, because tweetLDA uses more information (i.e., multiple tweets instead of a short bio) and more diverse text (i.e., multiple tweets about different topics).

IMPORTANT: Some users do not have bio texts or tweets. For such users, the corresponding bioLDA or tweetLDA vector does not exist in the data files.

In MTurkdecisions.txt we provide all decisions from MTurkers, but in our paper we filtered out MTurkers who cheated or finished the task without giving explanations. You can use the start and finish times of MTurkers in the file to decide which MTurkers’ work to include in your analysis. For example, if an MTurker finished the task in less than 15 minutes, you can ignore that MTurker’s work.
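One possible way to apply such a time filter is sketched below in Python. It assumes that the start and finish times of an MTurker can be approximated by the earliest and latest decision timestamps recorded for that MTurker; this approximation, the function name mturkers_to_keep, and the sample ids are hypothetical, and this is not necessarily the filter used in our paper.

```python
from datetime import datetime, timedelta


def mturkers_to_keep(records, min_duration=timedelta(minutes=15)):
    """Return ids of MTurkers whose decisions span at least min_duration.

    records: iterable of (mturker_id, datetime) pairs, one per decision.
    Assumption: first/last decision times stand in for start/finish times.
    """
    first, last = {}, {}
    for mturker_id, stamp in records:
        if mturker_id not in first or stamp < first[mturker_id]:
            first[mturker_id] = stamp
        if mturker_id not in last or stamp > last[mturker_id]:
            last[mturker_id] = stamp
    return {m for m in first if last[m] - first[m] >= min_duration}


# Made-up timestamps for two hypothetical MTurkers.
records = [
    ("workerA", datetime(2013, 10, 29, 13, 0, 1)),
    ("workerA", datetime(2013, 10, 29, 13, 40, 0)),
    ("workerB", datetime(2013, 10, 29, 14, 0, 0)),
    ("workerB", datetime(2013, 10, 29, 14, 5, 0)),
]
print(mturkers_to_keep(records))
```

Note that an MTurker with a single recorded decision has a span of zero and is dropped by this sketch, which may or may not be desirable for your analysis.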

Topics in bioLDA and tweetLDA are different from each other; the same topic number refers to a different set of words in each LDA model. We did not include topic descriptions in the dataset for privacy reasons.

The dataset was used in the “Detecting Anomalies in Social Network Data Consumption” article (under submission) by Cuneyt Gurcan Akcora, Barbara Carminati, Elena Ferrari.

Due to Twitter’s new policy, we are no longer able to share this dataset.