Where in the world are you?

Mark Graham, Scott Hale, and I have written an article about the geolinguistic contours of Twitter, mainly to explore and improve the methodological choices available to researchers working with Twitter's geographic and language metadata. A full version of the paper is available, and the abstract is provided below.

Abstract

The movements of ideas and content between locations and languages are unquestionably crucial concerns to researchers of the information age, and Twitter has emerged as a central, global platform on which hundreds of millions of people share knowledge and information. A variety of research has attempted to harvest locational and linguistic metadata from tweets in order to understand important questions related to the 300 million tweets that flow through the platform each day. However, much of this work is carried out with only limited understandings of how best to work with the spatial and linguistic contexts in which the information was produced.
Furthermore, standard, well-accepted practices have yet to emerge. As such, this paper studies the reliability of key methods used to determine language and location of content in Twitter. It compares three automated language identification packages to Twitter’s user interface language setting and to a human coding of languages in order to identify common sources of disagreement. The paper also demonstrates that in many cases user-entered profile locations differ from the physical locations users are actually tweeting from. As such, these open-ended, user-generated, profile locations cannot be used as useful proxies for the physical locations from which information is published to Twitter.
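
To make the comparison in the abstract concrete, here is a minimal sketch of how agreement between several language labelings of the same tweets could be measured. It is not the code used in the paper: the function names, the source names (`human`, `detector_a`, `ui_setting`), and the toy labels are all hypothetical and exist only for illustration.

```python
from itertools import combinations


def agreement_rate(labels_a, labels_b):
    """Fraction of items on which two label sequences agree."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label sequences must have the same length")
    if not labels_a:
        raise ValueError("label sequences must not be empty")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def pairwise_agreement(sources):
    """Agreement rate for every pair of label sources.

    `sources` maps a source name to its list of language codes,
    one code per tweet, in the same tweet order for every source.
    """
    return {
        (name_a, name_b): agreement_rate(sources[name_a], sources[name_b])
        for name_a, name_b in combinations(sources, 2)
    }


def disagreement_indices(sources, reference):
    """Indices of tweets where any source differs from the reference source."""
    ref_labels = sources[reference]
    return [
        i
        for i, ref in enumerate(ref_labels)
        if any(labels[i] != ref for labels in sources.values())
    ]


if __name__ == "__main__":
    # Hypothetical toy labels, not data from the paper.
    toy = {
        "human": ["en", "ja", "pt", "es", "en"],
        "detector_a": ["en", "ja", "es", "es", "en"],
        "ui_setting": ["en", "en", "pt", "es", "en"],
    }
    for pair, rate in pairwise_agreement(toy).items():
        print(pair, rate)
    print(disagreement_indices(toy, reference="human"))
```

Raw agreement is the simplest possible measure; a chance-corrected statistic may be more appropriate when a few languages dominate the sample. The disagreement indices are the tweets worth inspecting by hand to find common sources of error.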

where_in_the_world.pdf (615 KB)

