AI and Data Scraping on the Archive

With the proliferation of AI tools in recent months, many fans have voiced concerns regarding data scraping and AI-generated works, and how these developments can affect AO3. We share your concerns. We’d like to explain what we’ve been doing to combat data scraping and what our current policies on the subject of AI are.

Data scraping and AO3 fanworks

We’ve put in place certain technical measures to hinder large-scale data scraping on AO3, such as rate limiting, and we’re constantly monitoring our traffic for signs of abusive data collection. We do not make exceptions to these measures for researchers or those wishing to create datasets. However, we don’t have a policy against responsible data collection, such as that done by academic researchers, fans backing up works to the Wayback Machine, or Google’s search indexing. Putting systems in place that attempt to block all scraping would be difficult or impossible without also blocking legitimate uses of the site.
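For readers unfamiliar with the term, rate limiting caps how many requests a single client can make in a given period. The sketch below shows one common approach, a token bucket. It is an illustration only: the class name, capacity, and refill rate are hypothetical, and nothing here describes how AO3 actually implements its limits.

```python
import time


class TokenBucket:
    """Hypothetical per-client rate limiter (not AO3's actual implementation)."""

    def __init__(self, capacity, refill_per_second, now=None):
        self.capacity = capacity                    # maximum burst size
        self.refill_per_second = refill_per_second  # sustained request rate
        self.tokens = float(capacity)               # start with a full bucket
        self.last = time.monotonic() if now is None else now

    def allow(self, now=None):
        """Return True if a request may proceed, False if it should be rejected."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.last)
        # Refill tokens for the time that has passed, up to the capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A limiter like this lets an ordinary reader load a few pages in quick succession while slowing down a client that requests pages continuously, which is why it hinders bulk scraping without blocking normal use.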

With that said, it is an unfortunate reality that anything that is publicly available online can be used for reasons other than its initial intended purposes. In many cases, AI data collection traffic relies on the same techniques as the legitimate use cases above.

Once we became aware that data from AO3 was being included in the Common Crawl dataset — which is used to train AI such as ChatGPT — we put code in place in December 2022 requesting Common Crawl not scrape the Archive again.
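The announcement does not say what form this code took. A common way to make such a request is a robots.txt rule addressed to a crawler's user agent; Common Crawl's crawler identifies itself as CCBot. The sketch below assumes that approach and uses Python's standard urllib.robotparser to show how a well-behaved crawler would interpret such a rule. The robots.txt contents, the second rule, and the example URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, assumed for illustration only.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /hypothetical-private-path/
"""


def can_fetch(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# CCBot is asked to stay away from every path; other crawlers are not.
print(can_fetch("CCBot", "https://archiveofourown.org/works/1"))
print(can_fetch("ExampleBot", "https://archiveofourown.org/works/1"))
```

Note that a robots.txt rule is a request, not a technical barrier: it only works for crawlers that choose to honor it, which matches the announcement's wording of "requesting" that Common Crawl not scrape the Archive.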

We cannot go back in time to stop data collection that already occurred, or remove AO3’s content from existing datasets, as much as we may dislike that it happened. All we can do is attempt to reduce such collection in the future. The Archive’s development team will continue to be on the lookout for individual scrapers collecting AO3 data, and to take action as needed.

Likewise, our Legal committee has served and will continue to serve the OTW mission of protecting fanworks from legal challenge and commercial exploitation. This includes their position that users should be allowed to opt out from having their works incorporated into AI training sets, a position that they have presented to the U.S. Copyright Office. They, too, will continue to keep pace with this developing field.

What can I do to avoid data scraping?

You may want to restrict your work to Archive users only. While this will not block every potential scraper, it should provide some protection against large-scale scraping.

AI-generated works and AO3 policies

At the moment, there is nothing in our Terms of Service that prohibits fanworks that are fully or partly generated with AI tools from being posted to the AO3, if they otherwise qualify as fanworks.

Our goals as an organization include maximum inclusivity of fanworks. This means not only the best fanworks, or the most popular fanworks, but all the fanworks that we can preserve. If fans are using AI to generate fanworks, then our current position is that this is also a type of work that is within our mandate to preserve.

Depending on the circumstances, AI-generated works could violate our anti-spam policies (e.g. if a creator posts a significant number in a short time). If you’re uncertain whether a work violates our Terms of Service, you may always report it to our Policy & Abuse team using the link at the bottom of any page, and they can investigate.

This statement reflects AO3’s policy at the time of writing, as we wanted to be transparent with our users about what our current stance is and what can be done – and is being done – to mitigate scraping for AI datasets. However, these policies are also under discussion internally among AO3 volunteers. If we agree on changes to these in the future, those will be announced publicly; additionally, if there are any proposed changes to the AO3 Terms of Service, they will be made available for public comment as is required of any and all changes to our Terms of Service.

We hope that this helps to make things more clear – this is a complicated situation, and we’re doing our very best to address it in a way that doesn’t compromise AO3’s principles of maximum fanwork inclusivity or legitimate uses of the site. As discussions and approaches evolve, we will keep our users updated.

Announcement, Archive of Our Own
  1. dead commented:

    So, will AO3 keep on legally championing AI as transformative work? This is the worst, and honestly, the most concerning part of this. If AI output is termed as ‘transformative’ rather than generative work legally (and it is generative, since there’s no human element involved), it will spell devastation for working-class creatives all over the world, as it then strengthens its position as ‘FAIR USE’, which it in no shape or form is when it’s threatening the income of millions of people. I do not want to be a part of such exploitation and as such cannot continue to support OTW in any form, until such a time that they change their policies regarding this.

  2. amarisllis commented:

    As much as I hate AI, I’m satisfied with this response. I’m glad to see that steps to limit scraping have already been taken. Though I don’t believe that AI-generated works should be classified as legitimate artistic products, I would be *very* hesitant to see a ban on AI generated work on the site as I am not at all confident in the ability to do it without catching a lot of legitimate fic in the crossfire. And moreover, it wouldn’t actually work; it would still be there, just untagged. AI detection tools exist, but they are *very* imperfect at this stage and the inaccuracy rates are too high for me to be comfortable relying on them. Moreover, this also does not fall in line well with AO3’s permissive content policies and would likely lead to enforcement problems and persecution of a lot of legitimate writers as a result. What I *would* like to see however, is an AI-generated warning label similar to the other major Archive warnings so that users may filter out AI-generated works as needed. Given technical and legal limitations, and given the lack of perfect ability to detect ai-generated work at this stage (the technology is simply not there yet), this is the best-case scenario.

  3. DL commented:

    On April 19th, Ms. Rosenblatt, acting as the official representative of the OTW, told the US government’s copyright office that the OTW has “heard from fans” who are excited about the use of AI. Could you tell us how these fans communicated their feelings to the OTW? Has there been any quantitative polling on the feeling of OTW members, AO3 users, or fandom at large? While I don’t disagree with the basic legal stance the OTW has taken (I lack the legal expertise to presume to try!), I AM very concerned that the OTW is speaking for fandom without seeming to have a good idea of what fandom wants to say.

    • Ring commented:

      I’m very curious about this, and would like to hear where she got feedback and what the context was. A huge number of tools fall under the catchall term “AI,” and not all of them are designed to generate variations on training material.

  4. heather commented:

    “If fans are using AI to generate fanworks, then our current position is that this is also a type of work that is within our mandate to preserve.”

    Okay, but why though? What is your justification? What about procedurally-generated work constitutes some sort of transformative artistic thought on the part of an author or artist? These programs are making statistical guesses based off reams of data. It is not thinking. It is not making some sort of observation or commentary off the original work. It’s just … generating text.

    Look, I do think that in an internet that is getting continually sanded down and sanitized, the OTW’s commitment to letting people post erotica Is Good Actually. Beyond that, though, you do not seem to have an ethical thought in your collective body.

    The training of large language models threatens jobs across the industries that feed fandom in the first place–the entertainment industry wants to institute AI writing to kill writers’ rooms (and by extension, their UNION). Short fiction magazines are overrun with A.I. submissions, making it harder for new authors to break into publishing. Tor has already published a Christopher Paolini book with AI-generated cover art.

    Meanwhile big tech is paying workers in the global south a PITTANCE to moderate the data that goes into these programs and keep them marketable. As in, these are the people who end up watching beheading videos all day and get paid $1.50 an hour to do it. And! On top of that! The processing power requires a lot of energy *and water to keep it cool*.

    So you’ve got a program that exists to eliminate jobs–the very jobs that inspire and motivate Fandom, that thing you so wish to protect–which is in turn subsidized by nightmarish underpaid labor, that wrecks the environment in a rapidly deteriorating climate, but you step back and decide that it’s no ethics just right to allow its output on the archive. Wow. Great. I’m glad you’re taking such a strong stand for fandom here.

    • this commented:

      100% agree

    • Plantzawa commented:

      Absolutely agree 100%. Do better OTW. We’re watching you.

    • fuck AI commented:

      1000% this

    • RogueSareth is disappointed but old enough to not be surprised commented:

      Absolutely 100% agreed. This wishy washy response from the OTW is disappointing and far below the standard of practice I expect from them. They also did not address or acknowledge Betsy being their official rep to the US government in Ai discussions.

      None of this is a good sign. I don’t want AO3 to go the way of FF.net or LJ and fandom to have to rebuild *again*, but if OTW doesn’t start taking a stronger anti-AI stance that’s where we’re headed.

    • Odamaki commented:

      100% agree and thank you for putting it into words for me.

      I post my works on AO3 on the express understanding that they are protected, both from people who want to quash fandom works in general but also against those who would capitalise and monetise work I undertake. Right now, I am free to take 6 coffee shop AUs at random and manually combine them to produce a ‘new’ work but I think I would rightly be called out for it. If I used AI to write this comment for me, it would rightly be dismissed as spam.

      Why are we enabling machines to churn out poor imitations of creativity we have spent years cultivating and giving it the same value? Why are human writers being pushed into a position of making fodder for computers?

      OTW must make a clear and unambiguous statement that they do not support this. The archive is OUR own, not AIs.

      I have a decade of fanfic on AO3, I don’t want to have to find a new home for it, but if the only place it is safe is my personal harddrive, then that is where it will be removed to if I cannot trust OTW to defend my value as a creator.

    • Meg commented:

      Very well put! I’m most concerned about the real people who are being abused in the data set creation, but “AI” generated text and images are not just bad fanworks, they are not fanworks at all. They do not fall within the OTW’s purview. I’m really disappointed with OTW’s response on this. I realise it may be difficult to enforce a ban on machine generated works but you don’t have to *encourage* this garbage.

  5. thosenearandfarwars commented:

    Great. Please focus more on updating the harassment policy, the abuse report system, and the TOS around hate speech.

  6. Cassie commented:

    I respect the position of allowing AI-written works (even if I disagree) but can they be reported if they’re not tagged as AI-written? I don’t want to support AI works when I can be supporting legitimate artists.

  7. fts commented:

    An AI work is not a fanwork and Ao3 needs to clearly define a hard line between them before considering any change to ToS.

    This impacts all users and we need a definitive definition that can be agreed upon or debated: AI policy needs to be treated with all the transparency of elections.

  8. DL commented:

    This Time article revealed ChatGPT’s dataset was managed by workers in Kenya, paid very low wages. In at least one case, a worker was required to read underage noncon Batman fanfic, which had been removed from its fannish context and presented with no warnings, to a worker who could not freely choose not to read the work without risking her job. https://time.com/6247678/openai-chatgpt-kenya-workers/ AO3 has a robust tagging and warning system to allow readers to know what we’re getting into. We can find the porn we like and avoid the porn we don’t, and we aren’t being threatened with unemployment for choosing not to read something that squicks or triggers us. That fanfic is being used by corporations in a way that strips the community standards we’ve put around it and may traumatize exploited workers seems like a topic the OTW would be well-positioned to address. A particular use of fannish content need not be infringing to be unethical or undesirable.

  9. ladydragon76 commented:

    I am glad to see limits to scraping are a consideration and something the team is on top of. I’d like to add my voice to those who do not want AI created fics allowed in the Archive though. My reasoning is this: Someone came in, stole (scraped) our work and now an AI is going to regurgitate work REAL people did. How’s it artistic or creative to tell a computer program to write you a story about X & Y where Z happens? Where is the work or effort in that? The answer is, of course, it’s in the people who wrote all those stories in the first place before they were taken without permission or even notification that the site would be scraped. Some computer is going to use MY work, something I cried or bled my heart onto the page for, but that’s ok? No. It’s… cheating. It’s theft and regurgitation without effort. It’s cheap. Is it cool that programming is this advanced? Yeah, sure, and as soon as an AI is sentient, it can write for itself. Until then, it’s this grey area legal theft that isn’t at all in the same vein as someone sitting down, using their imagination and CREATING something for themselves the way every living person on this site does.

  10. Gloria commented:

    Sorry, but “users should be allowed to OPT OUT from having their works incorporated into AI training sets” is ridiculous.

    Users should be allowed to ‘OPT IN’. That should be the default. Not the other way around. This seriously puts in doubt the sincerity of your intentions.

    This is very transparent: Opt Out models are based on the strategy that users will only be able to withdraw consent if they know how to do it, and are motivated to do so. Meanwhile, most users will live in ignorance that they’re giving tacit approval for their work to be used to train AI.

  11. M commented:

    Thanks very much for this thorough and honest statement – it answers many questions. I’d appreciate it a lot if OTW continues to update us in this manner.

  12. long time fandom enthusiast commented:

    The inclusion of AI fanfics is really disheartening, especially in very small fandoms. Where exactly do you think the content being “generated” comes from in a small community? It is legitimate plagiarism of the few top authors driving that fandom forward. I cannot express enough how *AGAINST* this policy I am.

    Really, really disappointed in you, OTW.

  13. leen commented:

    Thank you for the info and for continuing to protect the interests of fans online.

  14. d commented:

    Vehemently disagree with the idea that AI-generated works can be considered fanworks – they are created by a machine scraping datasets for associations. There’s nothing creative or meaningful about them. It’s an astonishingly poor position to take and astonishingly at odds with the rest of the creative community’s understanding of the value of these works – it’s a major reason the WGA is on strike and the subject of several lawsuits, and AI-generated works have negatively affected writers and artists in creative employment across the globe. Given the overlap between fan creatives and our professional counterparts, it’s pretty insulting for the OTW to treat AI-generated works on the same level. While it may be very difficult to know when a fic is AI-generated, it would at the very least discourage people from posting them and take an ethical stance in protecting creative endeavors. I strongly urge the OTW to reconsider this stance – or to make it a topic up to public vote to members in a future election.

  15. Taenith @ AO3 commented:

    Legal Chair Betsy Rosenblatt still needs to resign. You say her views don’t reflect the OTW’s, and that she is one of 900+ volunteers. Yet, you entrusted her to represent the OTW and its views to the US government. And she spoke in an OTW capacity to the Association of Research Libraries, which subsequently published the interview you highlighted in OTW Signal! How does any of this make sense?!?!

    I do understand she possesses extensive professional qualifications in terms of intellectual property law. But she seems completely out-of-touch with the experiences and concerns of the vast majority of OTW’s donors & userbase regarding our subcultural understanding of the term “fanworks” and the very real ethical/environmental consequences of AI-generated texts or images.

    I very much appreciate the greater clarity about what is being done to prevent scraping. I know that comprehensive legal and/or technical solutions to this issue do not yet exist. I urge OTW leadership to invest time and money pursuing such solutions.

    However, I strongly disagree with your current characterization of AI-generated text/images as fanworks. A software program cannot be a “fan” of anything. It cannot feel protective towards a child character in a horror franchise and decide to create an alternate reality where the child is rescued and eventually defeats the monsters. It cannot exist as a sexual minority in a world full of cishet people, wonder if the two “best friends” in a space mecha show have feelings that go WAY beyond friendship, and decide to create an alternate reality where those characters’ feelings DO blossom into romance.

    I understand the legal and technical definition of “transformative works/fair use” etc do not match up with our fannish subcultural understanding of “fanwork”. But you were created to uphold FANdom and FANworks. Please think about this!!!

    Finally: please add “generated by AI” to the list of top-level warnings like Major Character Death. If users don’t want others to know that the text they submitted was generated by AI, then “Choose not to use Archive Warnings” would still cover that scenario. While I personally would prefer AI-generated text not be hosted at all on the Archive, I appreciate that it would be nearly impossible to prevent them from being posted entirely. This way, those of us who don’t want to interact with them in any way will not have to do so.

    • Mel commented:

      Agreed, this is very bizarre. This wasn’t a random volunteer posting to their personal Twitter account, this was the Legal Chair speaking in an OTW-related capacity later quoted in formal OTW communication.

    • L8 commented:

      “A software program cannot be a “fan” of anything.”

      As Taenith said above.

      That sentence needs to be the tagline for why AI generated fanworks should be disallowed.

  16. Sei-notti commented:

    I too would like to leave my name down as a voice against the inclusion of wholly AI-generated fic; how can you stand against individuals scraping content to feed into AI, but permit them to show the fruits of the countless items that existed in the archives prior to strengthening the defenses against it?
    Numerous people have posted to their social media about how they’ve taken abandoned stories – or ones they think to be abandoned, or even just those they dislike the ending to – and had AI generate the next chapter or a different ending to it. Do you want to embolden them by saying they could post content spat out by a computer, ‘inspired by’ another fan’s hours of work tailoring each word?
    There will not be a perfect way to prevent the content from being added, but to not even pretend there is an honor system of ‘this work was created by a genuine, human fan without the use of AI or other generative technology’ is a joke.

  17. H. commented:

    So Ao3 is about to become a junkyard with no one but strat bots? Because that’s what will inevitably happen like it happened to other sites that allowed AI “art”.
    There is nothing, literally nothing, fannish in AI. It’s not fanwork. It’s a bot generated text posted specifically *because* someone didn’t have the fannish drive to create anything on their own.

  18. seaara commented:

    Right can we calm the hysteria a little guys, they literally say “at the time of the writing” and “internal discussions are ongoing” because this is a new problem in all our lives and there isn’t an easy solution, as evidenced by varying comments who disagree on how to solve this issue—despite all agreeing that they hate ai. If people in the comments can’t wholly agree then members of the otw team will be the same. It’s been less than 2 days since the backlash started and all the otw have other actual real life full time jobs so can we let them try to work on something that might be agreeable without us all being at their throats for the time it takes them to do it.

    • Mel commented:

      If there are internal discussions, then it’s worth them getting ongoing member input. If we can’t all agree, why not bring it to a vote?

  19. Firepup commented:

    I’m disappointed Legal’s recommendation was an “Opt Out”. Not even an “Opt in”? Opt Outs feel inherently predatory. This is a weak stance.

    Please argue that AI is unethical without consent, credit, and/or compensation to scraped creators. Also, if fanfiction is nonprofit, nothing scraped from us should be allowed to garner income either. It puts all fans, who were obeying the laws, in legal jeopardy through no action or fault of our own. AI shouldn’t be allowed to plagiarize people’s work to splice into commercial works in the first place! Literally nothing trained with the Common Crawl database should ever be allowed to be used for more than personal non-monetizable assets BECAUSE our non-monetizable work is in it. Yet people have and will use this database to sell books with AI!

    People are saying AI like ChatGPT “allegedly” take from databases without permission. Meanwhile, you’re sitting on proof Common Crawl infringed on the consent of roughly 5,740,000 of your own users. It’s not a “‘Whoopsie’ looks like they stole everything from us prior to December 2022!” Legal take these people to court!

    AI should not be allowed to “transform” our nonprofit work, swiped from under our nose with no consent, into commercial viability.

    • long time fandom enthusiast commented:

      ^this

  20. apprepuff commented:

    We appreciate that you’re trying to protect us from scraping, but ❰❰❰AI-generated work is NOT FANWORK because ROBOTS CANNOT BE FANS.❱❱❱ When AI becomes sentient, then it can participate in fandom, but AI “fics” are NOT fanwork. They are regurgitated Frankensteins of STOLEN HUMAN WORK.