Detecting cheaters on Twitter via ego-net degree distributions of followers

If you know me beyond this post, you know that I’m currently a master’s candidate at the Oxford Internet Institute. As part of this, I get to sit around and ask fun questions about the internet on a regular basis. Two days ago, I was in my optional law class with my professor and advisor, Viktor Mayer-Schönberger, and our class was debating the problem of trust online. Essentially, we found ourselves asking this question: how do we make people behave nicely online? The general routes seemed to be things like real identification (Google taking your cell number for reference), costs of entry and/or exit (leaving 4chan costs nothing, while leaving your Reddit score behind means something), and assertions of trust (eBay scores), among other things.

From here, we started talking about ways to assert trust, and one simple system came up: using the fingerprint of your social network as a unique identification of who you are without explicitly saying who you are. This got me thinking about an interesting concept again: that the network topology you have cultivated on Facebook/Twitter is, in all likelihood, highly distinctive and possibly globally unique to just you.

Let’s consider a practical example: say we want to globally identify my Twitter account using attributes of my account. I have, today, 313 followers on Twitter, and I follow 135 people. These two numbers alone reduce the possible number of accounts I could be: rather than any old account, I’m part of a small cohort of users with this 313/135 set of values. Furthermore, we can look at those 313 followers, and then look at who they co-follow with me. In other words, if we restrict the network to these 313 people, who else do they follow within it? This is an ego-net, or my 1.5-degree network. Now, a simple metric about these people is the degree distribution. Let’s say @Gary follows me, and also follows 10 people who follow me as well. This means that Gary has a degree of 10. So, obviously, everyone’s going to have a slightly different distribution: I’ll have one follower at 100 co-followings, one at 76, one at 56, one at 54, and so on, which will be distinct to my followers as compared to those of someone like my wonderful friend @peeinears, Ian Pearce.
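To make the degree idea concrete, here is a minimal sketch on made-up data. The account names, the `egonet_degrees` and `fingerprint` helpers, and the friend count of 2 are all hypothetical and only for illustration. It assumes we already have, for each follower, the list of accounts that follower follows.

```ruby
# Ego-net degree: for each follower, count how many of the accounts
# they follow are themselves followers of the ego account.
def egonet_degrees(follows)
  followers = follows.keys
  follows.each_with_object({}) do |(follower, followings), degrees|
    degrees[follower] = (followings & followers).length
  end
end

# A crude "fingerprint" (an assumption, not a vetted identifier):
# follower count, friend count, and the degrees sorted high to low.
def fingerprint(follower_count, friend_count, degrees)
  [follower_count, friend_count, degrees.values.sort.reverse]
end

# Hypothetical toy data: follower => accounts that follower follows.
follows = {
  "gary"  => ["ian", "devin", "someone_else"],
  "ian"   => ["gary"],
  "devin" => ["gary", "ian", "outsider"]
}

degrees = egonet_degrees(follows)
puts degrees.inspect
puts fingerprint(follows.length, 2, degrees).inspect
```

In this toy network gary and devin each end up with a degree of 2 and ian with a degree of 1, since accounts outside the follower set are ignored.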

So, we have some cool unique identification system, or at least something that is relatively unique. But, being incapable of focusing on one thing at a time, I started thinking about other applications of this information: what else can you use this fingerprint for? One possible application I thought of was part of Viktor’s thing, albeit from a different angle: determining the trust of a user based on the ego-net degree distribution. Imagine that Ian Pearce and Devin and Gary all have ego-net degree distributions (assuming they have at least 2 followers). These values are going to be potentially unique to each user; for instance, it is unlikely any of us have the exact same number of friends/followers, much less the same connections. If we plot these degree distributions side by side, one per account, they will differ, but not absolutely. That is to say, Ian and Devin and Gary will be different, but not hugely so: because they are in the same society on the same website and probably have pretty similar communication patterns, we would expect a “usual” degree distribution. Clearly, this is something that lends itself more easily to Twitter than Facebook (since Facebook requires mutual ties rather than directional ties), but in the case of Twitter, you can look at these degree distributions to see if someone uses the platform for an “unusual” or statistically improbable activity. In other words, you may be able to use the degree distribution of an ego-net as a proxy measure of the degree to which a user fits within standard patterns. People who fall way out of the standard distributions may be doing something fundamentally different.
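One rough way to sketch "falls way out of the standard distributions" is to compare each account's median standardized degree against the median across all accounts. This is only an illustration, not the method behind the chart in this post: the `median` and `unusual_accounts` helpers, the account names, and the factor of 10 are all assumptions picked for the example.

```ruby
# Median of a non-empty list of numbers.
def median(values)
  sorted = values.sort
  mid = sorted.length / 2
  sorted.length.odd? ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0
end

# distributions: account name => list of standardized ego-net degrees
# (each follower's degree divided by the account's follower count).
# Flags accounts whose median is more than `factor` times smaller or
# larger than the typical (median-of-medians) value. The default factor
# is an arbitrary assumption, not a calibrated threshold.
def unusual_accounts(distributions, factor = 10.0)
  medians = distributions.transform_values { |d| median(d) }
  typical = median(medians.values)
  medians.select { |_name, m| m * factor < typical || m > typical * factor }.keys
end

# Hypothetical toy data.
distributions = {
  "ian"     => [0.10, 0.20, 0.30],
  "devin"   => [0.15, 0.20, 0.25],
  "suspect" => [0.001, 0.002, 0.003]
}

puts unusual_accounts(distributions).inspect
```

With these made-up numbers only "suspect" is flagged, because its median is a hundred times smaller than the others. A real version would need far more accounts and a defensible threshold.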

This all comes down to an interest I had in someone’s account: I was confused as to how someone was able to have 20k+ followers without having done some huge thing; in fact, by Twitter standards he’s a bigger deal than Ethan Zuckerman. Feels wrong. So, could I show that his account is followed by a huge army of bots through this technique? Turns out you can. And it turns out that these 20,000 accounts are actually part of a gigantic farm of follower accounts, created solely to inflate the egos and follower counts of those willing to pay a price. So, what does a normal follower distribution look like compared to these cases? Let’s look at the graph. The chart shows the distributions for several accounts of varying size. Clearly, one of these things is not like the others: the account in question is actually included in the dataset, but is not visible simply because its followers’ degrees, standardized against the number of followers the account has, are so out of whack. Admittedly, it’s highly likely that an account with 20,000 followers behaves differently than 300-follower accounts, so it’s not necessarily a fair comparison. When we consider the log graph, however, the results are very stark: something is clearly different about one of these accounts. While more work has to be done to make this model actually work, I think something useful is here, perhaps useful enough that Twitter may want to implement some system like this. Alternatively, someone could write a program to do this formally so that we can temper the perceived trust of an account by this distribution, establishing a “real” trust, like Klout but without bullshit and gamification: just a real way to deal with people.
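For anyone wondering what "standardized against the number of followers" and "the log graph" mean in practice, here is a small sketch on toy numbers. The `standardized_percentiles` and `log_scale` helpers and the floor value are assumptions for illustration; the percentile lookup mirrors the approach in the script at the end of this post, but this is not that script.

```ruby
# Read off the ego-net degree at evenly spaced percentiles and divide by
# the account's follower count, so accounts of different sizes can share
# one chart. Returns an empty list when there are no degrees.
def standardized_percentiles(degrees, follower_count, steps = 100)
  return [] if degrees.empty?
  sorted = degrees.sort
  (1..steps).map do |step|
    index = (step / steps.to_f * sorted.length).ceil - 1
    sorted[index] / follower_count.to_f
  end
end

# Log-transform values for plotting. Zeros are clamped to a small floor
# first (an arbitrary assumption), because log10(0) is -Infinity.
def log_scale(values, floor = 1e-6)
  values.map { |v| Math.log10([v, floor].max) }
end

# Hypothetical toy account: 4 followers with degrees 0..3, 4 percentile steps.
curve = standardized_percentiles([0, 1, 2, 3], 4, 4)
puts curve.inspect
puts log_scale(curve).inspect
```

On a linear axis, a curve that sits a few orders of magnitude below the others collapses onto zero. On the log-transformed values the gap becomes a visible vertical offset.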

A note about how this little study was done: I snowballed some accounts around my account, so this is in no way a real study and I don’t intend it to be read as such. The accounts’ owners were not consulted, so guys, please don’t be angry I included you. Also, the code is available for anyone who wants to try it out on their own (although as of 2015 it's likely got some serious smells, both from Ruby changing quite a bit and from my being a much worse coder a few years back):

require 'twitter'  
require 'json'  
class Array  
  def chunk(pieces=2)
    len = self.length
    return [] if len == 0
    mid = (len/pieces)
    chunks = []
    start = 0
    1.upto(pieces) do |i|
      last = start+mid
      last = last-1 unless len%pieces >= i
      # Parenthesize the fallback: `<<` binds tighter than `||`
      chunks << (self[start..last] || [])
      start = last+1
    end
    chunks
  end

  def percentile(percentile)
    self.sort[(percentile * self.length).ceil - 1]
  end
end

def grab_ids(id, direction="follower")  
  cursor = -1
  ids = []
  while cursor != 0
    data = Twitter.send("#{direction}_ids", id, {:cursor => cursor})
    rls = Twitter.rate_limit_status
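    # Spread remaining calls over the rate-limit window; guard against
    # zero remaining hits and a reset time that has already passed
    sleep([(rls.reset_time - Time.now) / [rls.remaining_hits, 1].max, 0].max)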
    ids += data.attrs["ids"]
    cursor = data.attrs["next_cursor"]
  end
  return ids
end

def grab_egonet  
  threads = []
  @follower_ids.chunk(80).each do |ids|
    threads << Thread.new do
      ids.each do |id|
        begin
          @egonet[id.to_s] = grab_ids(id, "friend")&@follower_ids
          puts "Grabbed #{id}...\n"
        rescue
          next
        end
      end
    end
  end
  threads.each(&:join)
end

def make_csvs  
  f = File.open("basic_report.csv", "w")
  f.write("screen_name,followers,friends,tweets,age\n")
  @screen_names.each do |screen_name|
    results = []
    results << screen_name
    results << @basic_results[screen_name]["followers"]
    results << @basic_results[screen_name]["friends"]
    results << @basic_results[screen_name]["tweets"]
    results << @basic_results[screen_name]["age"]
    f.write(results.join(",")+"\n")
  end
  f.close
  f = File.open("distribution_report.csv", "w")
  f.write("percentile,#{@screen_names.join(",")}\n")
  0.upto(99) do |index|
    # Parentheses matter: index+1/100.0 would evaluate as index + 0.01
    percentile = (index+1)/100.0
    row = [percentile,@screen_names.collect{|sn| @distribution_results[sn][index]}].flatten
    f.write("#{row.join(",")}\n")
  end
  f.close
end

def basic_attributes  
  return {"followers" => @user.followers, "friends" => @user.friends, "tweets" => @user.statuses_count, "age" => Time.now-@user.created_at}
end

def calculate_distribution  
  distribution = @egonet.values.collect{|v| v.length}
  standardized_set = []
  1.upto(100) do |percentile|
    percentile = percentile/100.0
    standardized_set << distribution.percentile(percentile)/@user.followers.to_f
  end
  return standardized_set
end

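# Fall back to the default accounts when no argument is given
# (calling split on a nil ARGV[0] would raise NoMethodError)
@screen_names = ARGV[0] ? ARGV[0].split(",") : ["robhawkes", "BurcuBaykurt", "stefanbazan", "CaptSolo"]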

Twitter.consumer_key = "CONSUMERKEY"  
Twitter.consumer_secret = "SECRET"  
Twitter.oauth_token = "TOKEN"  
Twitter.oauth_token_secret = "SECRET"  
@distribution_results = {}
@basic_results = {}
@screen_names.each do |screen_name|
  @user = Twitter.user(screen_name)
  @egonet = {}
  @follower_ids = grab_ids(@user.id)
  grab_egonet
  @distribution_results[screen_name] = calculate_distribution
  @basic_results[screen_name] = basic_attributes
end  
make_csvs