I have been working on a really exciting project with Ian Pearce and Max Darham that attempts to visualize Wikipedia.
Our idea was this:
How are articles on Wikipedia organically organized? For instance, the article on Bicycles will likely contain links to the inventor of the modern bicycle, to rubber, steel, and the other parts, to the countries that were important in advancing bicycle technology, and so forth. If we treat each article as a vertex and each link as an edge connecting it to another article, could we make a program that visually maps this data? We basically wanted to build a prototype data-mining project and then use the results to draw specific conclusions about how Wikipedia looks, works, and is organized, and whether that structure bears any similarity to other known information networks.
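To make the vertex-and-edge idea concrete, here is a minimal sketch (not our actual crawler) of the representation we had in mind: a hash whose keys are articles and whose values are the articles they link to. The article names here are invented for illustration and are not taken from our dataset.

# Minimal sketch of the adjacency idea: keys are articles (vertices),
# values are the articles they link to (edges). Names are illustrative only.
adjacency = {
  "Bicycle"        => ["Rubber", "Steel", "Karl von Drais"],
  "Rubber"         => ["Steel"],
  "Steel"          => [],
  "Karl von Drais" => ["Bicycle"]
}

# Each key/value pair yields one directed edge per linked article.
adjacency.each do |article, links|
  links.each { |target| puts "#{article} -> #{target}" }
end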
Our plan was this:
We wanted to build a Ruby program that visited a random article. For the English-language Wikipedia, you can do this via http://en.wikipedia.org/wiki/Special:Random.
Then we wanted the program to store all the links from that page, remember the page and all of its links, and then recurse: visit each linked page, collect its links, move on to the pages those link to, and so forth, until we had a huge array of all the articles and the articles linked to within each one. We then wrote a little method that turned this information into a Python dictionary and used Sage to create visuals of the data. Below are a few examples of the graphs we have generated so far; these are the largest connected components, i.e. the largest groups of articles that link to one another. There are other articles, but they don't link to anything and nothing links to them, so they are kind of boring. Check out the visualizations of the data below, as well as large cuts of the code used to make these work!
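Before the visuals and the full code, here is a rough, simplified sketch of that conversion step, with invented article IDs; the real version is the print_for_sage_string method in the Ruby code further down.

# Simplified sketch of the export step. Keys are numeric article IDs,
# values are the IDs of the articles they link to (all invented here).
links_by_id = {
  0 => [1, 2],
  1 => [2],
  2 => []
}

# Serialize into a Python dict literal, e.g. "{0: [1, 2], 1: [2], 2: []}",
# which can be saved to a file and handed to Sage's Graph() or DiGraph().
pairs = links_by_id.map { |id, links| "#{id}: [#{links.join(', ')}]" }
puts "{" + pairs.join(", ") + "}"

Sage's Graph() and DiGraph() constructors accept this kind of dictionary-of-lists directly, which is what the eval(...) call at the top of the Sage script below reconstructs.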
[Graph visualizations of the largest connected components, posted on imgur.com]
Actual Sage code used to make it all work:
#NOTE: the following language sets are available on my server: als ang bar ga ia kg li nov nrm pdc qu sco vls war
g = eval(open(get_remote_file('http://devingaffney.com/files/wiki_data/kg/kg_data.txt')).read())
wiki_graph = Graph(g)
wiki_digraph = DiGraph(g)
wg_conncomp = wiki_graph.connected_components_subgraphs()
wdg_conncomp = wiki_digraph.connected_components_subgraphs()
print ""
print "GRAPH"
print "Total articles:", wiki_graph.num_verts()
dh = wiki_graph.degree_histogram()
dh_plot = list_plot(dh, plotjoined=True)
gcount = 0
print "Degree histogram:"
for x in dh:
    gcount = gcount+1
    if x != 0:
        print gcount, x
dh_plot.axes_range(0,200,0,60)
dh_plot.show()
graph_degree_total = 0
for deg in wiki_graph.degree_iterator():
    graph_degree_total = deg+graph_degree_total
graph_average_degree = float(graph_degree_total)/wiki_graph.num_verts()
print ""
print "GRAPH : LARGEST CONNECTED COMPONENT"
print "Total articles in largest connected component:", wg_conncomp[0].num_verts()
print "Diameter of largest connected component:", wg_conncomp[0].diameter()
array = []
counter = 0
for x in wg_conncomp[0].vertices():
    for y in wg_conncomp[0].vertices():
        length = wg_conncomp[0].shortest_path_length(x,y)
        array.append(length)
for x in array:
    counter = x+counter
average_path_length = float(counter)/len(array)
print "Average path length:", average_path_length
print "Clustering Average:", wg_conncomp[0].clustering_average()
print "Degree Total:", graph_average_degree
print "Number of Cliques", wiki_graph.clique_number()
wg_conncomp[0].show(vertex_size=2, fontsize=2, figsize=[75,75],
filename='wikipedia_crcl.png', layout="circular")
print ""
print "DIGRAPH"
dh = wiki_digraph.degree_histogram()
dh_plot = list_plot(dh, plotjoined=True)
dgcount = 0
print "Degree histogram:"
for x in dh:
    dgcount = dgcount+1
    if x != 0:
        print dgcount, x
digraph_degree_total = 0
for deg in wiki_digraph.out_degree_iterator():
    digraph_degree_total = deg+digraph_degree_total
digraph_average_degree = float(digraph_degree_total)/wiki_digraph.num_verts()
dh_plot.axes_range(0,200,0,60)
dh_plot.show()
print ""
print "DIGRAPH : LARGEST CONNECTED COMPONENT"
print "Total articles in largest connected component:", wdg_conncomp[0].num_verts()
print "Clustering Average:", wdg_conncomp[0].clustering_average()
print "Degree Total:", digraph_average_degree
wdg_conncomp[0].show(vertex_size=2, fontsize=2, figsize=[75,75],
filename='digraph_wikipedia.png')
And the Ruby libraries used to originally collect the datasets provided above:
class Wikipedia
attr_accessor :articles, :articles_by_hash
def initialize()
@articles = []
@articles_by_hash = []
end
def make_article_object(article_array)
if !@articles_by_hash.include?(article_array[0].hash) # If the article is being encountered
@articles_by_hash << article_array[0].hash # for the first time, then add its hash
new_article = Article.new(article_array[0]) # to the list of hashes and make a new
new_article.analyzed = true # Article => new_article.
links_array = article_array[1] # For the article's links, make an array
links_array.each do |linked_article| # and for each link => linked_article:
if !@articles_by_hash.include? linked_article.hash # if that link is being encountered for
@articles_by_hash << linked_article.hash # the first time, add its hash, make a
new_linked_article = Article.new(linked_article) # new Article => new_linked_article, add
@articles << new_linked_article # it to the Wiki's @articles, and add it
new_article.links << new_linked_article # to the new_article's @links
else
@articles.each do |article|
if article.hash_number == linked_article.hash
new_article.links << article
break
end
end
end
end
new_article.analyzed = true
@articles << new_article
else
@articles.each do |article|
if (article.hash_number == article_array[0].hash) and !article.analyzed
links_array = article_array[1]
links_array.each do |linked_article|
if !@articles_by_hash.include? linked_article.hash
@articles_by_hash << linked_article.hash
new_linked_article = Article.new(linked_article)
@articles << new_linked_article
article.links << new_linked_article
else
@articles.each do |existing_article|
if existing_article.hash_number == linked_article.hash
article.links << existing_article
break
end
end
end
end
article.analyzed = true
end
end
end
end
def print_for_sage_string
graph_string = "{"
@articles.each do |article|
graph_string = graph_string + article.hash_number.to_s + ": ["
article.links.each do |article_link|
graph_string = graph_string + article_link.hash_number.to_s + ", "
end
graph_string = graph_string + "], "
end
graph_string = graph_string + "}"
graph_string = graph_string.gsub(", ]", "]")
graph_string = graph_string.gsub(", }", "}")
print "\n"
print graph_string
print "\n\n"
end
def print_for_sage
print "{"
@articles.each do |article|
print article.hash_number.to_s + ": ["
article.links.each do |article_link|
print article_link.hash_number.to_s + ", "
end
print "], "
end
print "}\n\n"
end
def print_list
@articles.each do |article|
print "\n" + " => [" + article.name + "] >> "
if article.links.length == 0
print "nothing"
end
article.links.each do |article_link|
print "[" + article_link.name + "], "
end
end
print "\n\n"
end
def hash_reformat
i = 0
@articles.each do |article|
article.hash_number = i
i = i + 1
break if i == @articles.length
end
end
def find_title_by_hash(int) #find_title_by_hash(6), not ("6")
@articles.each do |article|
if article.hash_number == int
return article.name
end
end
end
def find_hash_by_title(article_name)
article_name.downcase!
salsa_verde = 0
@articles.each do |article|
if article.name.downcase.include?(article_name)
print article.name + ": " + article.hash_number.to_s + "\n"
end
salsa_verde = salsa_verde+1
if salsa_verde == @articles.length
print "Went through all articles.\n"
break
end
end
end
def label_for_sage
print "def label_wiki(wiki_graph):\n"
@articles.each do |article|
print " wiki_graph.set_vertex(" + article.hash_number.to_s + ", \"" + article.name + "\")\n"
end
end
end
require 'rubygems'
require 'hpricot'
require 'open-uri'
require 'cgi'
class LinkGrabz
attr_accessor :body, :link_array, :visited, :title, :url, :language_prefix, :url_prefix
def initialize(language_prefix)
@language_prefix = language_prefix
@url_prefix = "http://localhost/~ian/wikipedia/" + @language_prefix + "/articles"
@body = ""
@link_array = []
@url = ""
end
def special_grabber
html = Hpricot(open("http://"+@language_prefix+".wikipedia.org/wiki/Special:Random"))
if !html.at('//div[@class="printfooter"]').nil? and !html.at('//div[@class="printfooter"]').children.nil?
article_name = html.at('//div[@class="printfooter"]').children.select{|e| e}
article_name = article_name[1].inner_html
elsif !html.at('//title[@=empty()]').nil? and !html.at('//title[@=empty()]').children.nil?
article_name = html.at('//title[@=empty()]').children.select{|e| e}
article_name = article_name[1].inner_html
else
#special_grabber
return # neither selector matched; bail out (break is not valid outside a loop)
end
article_name = article_name.gsub("http://"+@language_prefix+".wikipedia.org/wiki/","")
if article_name.length > 9
if article_name[0,1].include?("%")
if article_name[3,1].include?("%")
if article_name[6,1].include?("%")
grablink(@url_prefix + "/" + CGI::unescape(article_name[0,3]) + "/" + CGI::unescape(article_name[3,3]) + "/" + CGI::unescape(article_name[6,3]) + "/" + article_name + ".html")
else grablink(@url_prefix + "/" + CGI::unescape(article_name[0,3]) + "/" + CGI::unescape(article_name[3,3]) + "/" + article_name[4,1].downcase + "/" + article_name + ".html")
end
elsif article_name[4,1]
grablink(@url_prefix + "/" + article_name[0,1].downcase + "/" + article_name[3,1].downcase + "/" + CGI::unescape(article_name[4,3]) + "/" + article_name + ".html")
else grablink(@url_prefix + "/" + CGI::unescape(article_name[0,3]) + "/" + article_name[3,1].downcase + "/" + article_name[4,1].downcase + "/" + article_name + ".html")
end
elsif article_name[1,1].include?("%")
if article_name[4,1].include?("%")
grablink(@url_prefix + "/" + article_name[0,1].downcase + "/" + CGI::unescape(article_name[1,3]) + "/" + CGI::unescape(article_name[4,3]) + "/" + article_name + ".html")
else grablink(@url_prefix + "/" + article_name[0,1].downcase + "/" + CGI::unescape(article_name[1,3]) + "/" + article_name[4,1].downcase + "/" + article_name + ".html")
end
elsif article_name[2,1].include?("%")
grablink(@url_prefix + "/" + article_name[0,1].downcase + "/" + article_name[1,1].downcase + "/" + CGI::unescape(article_name[3,3]) + "/" + article_name + ".html")
else grablink(@url_prefix + "/" + article_name[0,1].downcase + "/" + article_name[1,1].downcase + "/" + article_name[2,1].downcase + "/" + article_name + ".html")
end
else
special_grabber
end
end
def grablink(url)
if !Hpricot(open(url)).nil?
html = Hpricot(open(url)) #open site
@body = html.search("//div[@id='content']")
@url = url
@title = @url.gsub(@url_prefix,"")
else
print "\nEncountered broken link.\n"
end
rescue OpenURI::HTTPError
special_grabber
rescue URI::InvalidURIError
special_grabber
end
def extract
@link_array.clear
(@body/"a[@href]").each do |url|
new_url = url.attributes['href'].match(/(\/.\/.\/.\/.*)/).to_s
if new_url.length != 0 and !new_url.include?("~") and !new_url.include?("#") and !new_url.include?("%7E")
@link_array << new_url
end
end
@link_array = @link_array.uniq
end
def export
return [@title, @link_array]
end
end
class Article
attr_accessor :name, :hash_number, :links, :analyzed
def initialize(name)
@name = name
@hash_number = name.hash
@links = []
@analyzed = false
end
end
@wiki = Wikipedia.new()
def run(language_prefix, article_number_limit, special_grabber_limit)
@test = LinkGrabz.new(language_prefix)
@stop = 0
while @wiki.articles.length < article_number_limit and @stop < special_grabber_limit
@test.special_grabber
if !@wiki.articles_by_hash.include? @test.title.hash
@stop = 0
@articles_to_be_analyzed = []
print "\nRandomly grabbed: " + @test.title
@test.extract
@wiki.make_article_object(@test.export)
print "\nExported to @wiki.\n\nSorting unanalyzed articles..."
@wiki.articles.each do |article|
if !article.analyzed
if article.links.length > 0
article.analyzed = true
else
@articles_to_be_analyzed << article
end
end
end
while @articles_to_be_analyzed.length > 0
i = 1
atba_length = @articles_to_be_analyzed.length.to_s
@articles_to_be_analyzed.each do |article|
@test.grablink(@test.url_prefix + article.name)
@test.extract
@wiki.make_article_object(@test.export)
article.analyzed = true
print "\nAnalyzed article " + i.to_s + " of " + atba_length + " unanalyzed articles: " + article.name
i += 1
end
old_atba = @articles_to_be_analyzed
@articles_to_be_analyzed = []
print "\n\nResorting unanalyzed articles..."
@wiki.articles.each do |article|
if !article.analyzed
if article.links.length > 0
article.analyzed = true
else
@articles_to_be_analyzed << article
end
end
end
break if old_atba == @articles_to_be_analyzed
print "\n\nTotal articles encountered: " + @wiki.articles.length.to_s + "\n"
end
else
print "Article already analyzed: " + @test.title
@stop += 1
end
print "\n\nTotal articles encountered: " + @wiki.articles.length.to_s + "\n"
end
@wiki.hash_reformat
end
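Finally, a hypothetical driver session showing how these pieces fit together; the language prefix and the limits below are example values, not necessarily the ones we used, and the crawl assumes the local article mirror configured in LinkGrabz is available.

# Hypothetical usage: crawl the 'kg' language set until 500 articles have been
# collected, or 25 consecutive random grabs land on already-seen articles.
run("kg", 500, 25)

# Dump the adjacency data as a Python dict literal for Sage's Graph()/DiGraph(),
# and emit a label_wiki() helper that attaches article names to the vertices.
@wiki.print_for_sage_string
@wiki.label_for_sage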