For the past year, I've been increasingly focused on what I have come to call "sociotechnical security." Whereas "technical security" seeks to identify and remove unintended flaws in the architecture of platforms, "sociotechnical security" is about identifying and removing the incentives for worst-faith users to abuse the explicit intent of platform affordances. Making this "sociotechnical" distinction brings into the frame a lot of issues not typically considered security issues, but that are proving to be existential threats to a range of businesses: social platforms have misinformation problems due (in part) to fake accounts spreading it, online marketplaces face algorithmic manipulation from sellers jockeying for position, and platforms with weak security around analytics face all sorts of ad and impression-count fraud.
Today I want to share an exploit I spent the last week investigating, which I'm calling "SubstackDB" after Substack, where I first identified the problem. Platforms tend to prefer low-friction interfaces, and tend to afford users increasing flexibility in the affordances they provide. Substack's WYSIWYG editor for drafting posts is overly optimistic in assuming good faith in user behavior, and it is exposed to a huge flaw: because there is no validation of files uploaded into the editor, and because the upload functionality has no verification scheme beyond requiring an active user session, their file server can be hijacked for any arbitrary use case. Substack is far from the only company facing this issue; Discord suffers from a nearly identical problem. As a proof of concept, I used the unpublished file-upload APIs of both Substack and Discord to store copies of GPT2 on their servers, and I provide the scripts needed to load those models and verify that they execute. Additionally, I am providing a Ruby implementation of SubstackDB which, given a valid username and password, allows a user to upload and download any file of any size.
This is a picture of Substack's WYSIWYG post editor with an example image uploaded. Here's a look at the cURL request that uploaded the image:
And the response:
It turns out that this "bucketeer" name refers to a Heroku file server plugin, which presumably also indicates that Substack is at least partially hosted on Heroku. Regardless, through trial and error, I determined that only a very small portion of the above cURL request is required to send a file:
[BASE 64 BYTES] refers to literally any content that is Base64 encoded. In our case it is an image, but any data encoded in Base64 will be treated as valid input. Through more trial and error, I determined that there seems to be no upper limit to the size of an uploaded file, though in practice size is limited by timeout errors that ultimately invalidate the request. Further, this "image" upload functionality returns the original file byte-for-byte, so no compression occurs between upload and receiving the final URL. Because of this, we can store any other data relatively easily, simply declaring the "type" of the content to be one of the valid types required by the upload endpoint.
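To illustrate the core trick (the field names below are hypothetical stand-ins, not Substack's actual schema): arbitrary bytes are Base64-encoded and labeled with a permitted image MIME type, and because the server stores the payload byte-for-byte, decoding recovers the original data exactly:

```python
import base64

def wrap_as_image_payload(data: bytes, mime: str = "image/png") -> dict:
    """Disguise arbitrary bytes as an "image" upload.

    The endpoint only checks the declared MIME type, not the content,
    so any Base64 payload passes validation. (Field names here are
    illustrative, not Substack's actual schema.)
    """
    return {
        "type": mime,
        "content": base64.b64encode(data).decode("ascii"),
    }

def unwrap_payload(payload: dict) -> bytes:
    """Recover the original bytes from a stored "image"."""
    return base64.b64decode(payload["content"])

# Any bytes survive the round trip unchanged:
original = b"\x00\x01definitely not a PNG\xff"
assert unwrap_payload(wrap_as_image_payload(original)) == original
```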
To prove that any arbitrary content could be uploaded, I downloaded a copy of the GPT2 "medium" model via aitextgen, split the .bin file containing the model into several hundred smaller files of equal size, and then uploaded those to Substack's endpoint. Finally, I wrote a script that reconstructs the model using a final "manifest" JSON file, also stored on Substack:
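The chunking scheme itself is straightforward. Here is a sketch of the split-and-reconstruct logic with the network step omitted (in the real script each chunk was uploaded and the manifest recorded the returned URLs; here chunks stay in memory, and the manifest format is my own invention for illustration):

```python
import hashlib
import json

def split_file(data: bytes, chunk_size: int):
    """Split raw bytes into ordered chunks plus a JSON manifest.

    The manifest records the chunk count and a checksum of the whole
    file so reconstruction can be verified.
    """
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    manifest = json.dumps({
        "num_chunks": len(chunks),
        "sha256": hashlib.sha256(data).hexdigest(),
    })
    return chunks, manifest

def reconstruct(chunks, manifest_json: str) -> bytes:
    """Reassemble the original file and verify it against the manifest."""
    manifest = json.loads(manifest_json)
    data = b"".join(chunks[:manifest["num_chunks"]])
    assert hashlib.sha256(data).hexdigest() == manifest["sha256"]
    return data

# Stand-in for the GPT2 .bin file:
model = bytes(range(256)) * 10_000
chunks, manifest = split_file(model, chunk_size=4096)
assert reconstruct(chunks, manifest) == model
```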
Discord's vulnerability seems less a bug than an intentional feature, but it is still ripe for abuse. Using a throwaway account and a throwaway server, I was granted a set of credentials; with those credentials, I was able to slightly alter the script I used to upload GPT2.
Discord appears to treat non-image uploads as more of a first-order object in its system: clearly, there is some intent to allow users to upload files of some nature. What is likely outside that intent, however, is automating this affordance to send gigabytes of content through the platform in a relatively short time frame. Ultimately, I was able to produce a script nearly identical to the one deployed in the Substack case.
Finally, to prove out the concept of using this type of vulnerability as an arbitrary file server, I wrote a generalized SubstackDB class which, in this version of the script, takes as input a username, password, and filepath, and prints out whether or not the contents of that filepath, once read, uploaded, and downloaded, are identical to the original source file. In practice, one could use this script as a drop-in replacement for many classic file-store APIs.
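The shape of that interface can be sketched as follows. This is not the actual SubstackDB implementation (and not its Ruby source); the authenticated upload endpoint is replaced by a pluggable in-memory transport so the read/upload/download/compare flow can be shown self-contained. All class and method names here are my own:

```python
import hashlib

class MemoryTransport:
    """In-memory stand-in for the remote file server; a real HTTP
    client exposing put()/get() could be dropped in instead."""
    def __init__(self):
        self.blobs = {}
    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        self.blobs[key] = data
        return key
    def get(self, key: str) -> bytes:
        return self.blobs[key]

class FileStoreDB:
    """Minimal sketch of a SubstackDB-style arbitrary file store."""

    def __init__(self, username: str, password: str, transport):
        # Credentials would be exchanged for a session; unused by the stub.
        self.username = username
        self.password = password
        self.transport = transport

    def upload(self, path: str) -> str:
        with open(path, "rb") as f:
            return self.transport.put(f.read())

    def download(self, key: str) -> bytes:
        return self.transport.get(key)

    def verify(self, path: str, key: str) -> bool:
        """Check the stored copy is byte-for-byte identical to the source."""
        with open(path, "rb") as f:
            local = hashlib.sha256(f.read()).hexdigest()
        remote = hashlib.sha256(self.download(key)).hexdigest()
        return local == remote
```

With the stub swapped for a real transport, this reads, uploads, downloads, and compares exactly as described above.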
These are just two examples of a general problem with upload validation. Of note, I also explored exploits like this on Meetup, Indiegogo, Gumroad, and a few others, and while it was likely still technically possible to pull off a similar stunt, it was in no way worth the time it would take to fully reverse engineer their implementations. Generally, those platforms employed one-time-use tokens to validate uploads on a per-upload basis, which proved too much of a pain to defeat. The point, however, is that this is a demonstration of a systemic issue: by assuming best-faith use, platforms grant worst-faith users the unintended "sociotechnical" affordance of a free file server at the platform's expense.