AI and Data Scraping on the Archive

With the proliferation of AI tools in recent months, many fans have voiced concerns regarding data scraping and AI-generated works, and how these developments can affect AO3. We share your concerns. We’d like to share what we’ve been doing to combat data scraping and what our current policies on the subject of AI are.

Data scraping and AO3 fanworks

We’ve put in place certain technical measures to hinder large-scale data scraping on AO3, such as rate limiting, and we’re constantly monitoring our traffic for signs of abusive data collection. We do not make exceptions to these measures for researchers or those wishing to create datasets. However, we don’t have a policy against responsible data collection, such as that done by academic researchers, by fans backing up works to the Wayback Machine, or by Google’s search indexing. Putting systems in place that attempt to block all scraping would be difficult or impossible without also blocking legitimate uses of the site.
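For readers unfamiliar with the term, rate limiting can be illustrated with a token bucket. The following is a minimal sketch only, not AO3's actual implementation: the class name `TokenBucket` and the parameters `rate`, `capacity`, and `clock` are hypothetical and chosen purely for illustration.

```python
import time


class TokenBucket:
    """Illustrative token-bucket limiter (hypothetical, not AO3's code)."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate            # tokens added back per second
        self.capacity = capacity    # maximum burst size
        self.tokens = float(capacity)
        self.clock = clock          # injectable so the sketch can be tested
        self.last = clock()

    def allow(self):
        """Return True if one more request may proceed right now."""
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

In a sketch like this, each client would get its own bucket: ordinary browsing rarely empties it, while a scraper requesting pages in rapid succession runs out of tokens and is turned away until the bucket refills.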

With that said, it is an unfortunate reality that anything that is publicly available online can be used for reasons other than its initial intended purposes. In many cases, AI data collection traffic relies on the same techniques as the legitimate use cases above.

Once we became aware that data from AO3 was being included in the Common Crawl dataset — which is used to train AI such as ChatGPT — we put code in place in December 2022 requesting Common Crawl not scrape the Archive again.
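A request of this kind is typically made through a site's robots.txt file, which well-behaved crawlers check before fetching pages. The sketch below is an assumption about what such a rule could look like, not a copy of AO3's actual file: `CCBot` is the user agent Common Crawl's crawler identifies itself with, while the rules in `ROBOTS_TXT`, the helper `is_allowed`, and the example work URL are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: ask Common Crawl's crawler (CCBot) to stay out,
# while leaving every other user agent unrestricted.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: *
Disallow:
"""


def is_allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Check whether robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# The work URL below is a made-up example.
print(is_allowed("CCBot", "https://archiveofourown.org/works/123"))
print(is_allowed("Mozilla/5.0", "https://archiveofourown.org/works/123"))
```

Note that robots.txt is only a request. It relies on the crawler choosing to honor it, which is why it complements measures like rate limiting rather than replacing them.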

We cannot go back in time to stop data collection that already occurred, or remove AO3’s content from existing datasets, as much as we may dislike that it happened. All we can do is attempt to reduce such collection in the future. The Archive’s development team will continue to be on the lookout for individual scrapers collecting AO3 data, and to take action as needed.

Likewise, our Legal committee has served and will continue to serve the OTW mission of protecting fanworks from legal challenge and commercial exploitation. This includes their position that users should be allowed to opt out from having their works incorporated into AI training sets, a position that they have presented to the U.S. Copyright Office. They, too, will continue to keep pace with this developing field.

What can I do to avoid data scraping?

You may want to restrict your work to Archive users only. While this will not block every potential scraper, it should provide some protection against large-scale scraping.

AI-generated works and AO3 policies

At the moment, there is nothing in our Terms of Service that prohibits fanworks that are fully or partly generated with AI tools from being posted to the AO3, if they otherwise qualify as fanworks.

Our goals as an organization include maximum inclusivity of fanworks. This means not only the best fanworks, or the most popular fanworks, but all the fanworks that we can preserve. If fans are using AI to generate fanworks, then our current position is that this is also a type of work that is within our mandate to preserve.

Depending on the circumstances, AI-generated works could violate our anti-spam policies (e.g. if a creator posts a significant number in a short time). If you’re uncertain whether a work violates our Terms of Service, you may always report it to our Policy & Abuse team using the link at the bottom of any page, and they can investigate.

This statement reflects AO3’s policy at the time of writing, as we wanted to be transparent with our users about what our current stance is and what can be done – and is being done – to mitigate scraping for AI datasets. However, these policies are also under discussion internally among AO3 volunteers. If we agree on changes to these in the future, those will be announced publicly; additionally, if there are any proposed changes to the AO3 Terms of Service, they will be made available for public comment as is required of any and all changes to our Terms of Service.

We hope that this helps to make things more clear – this is a complicated situation, and we’re doing our very best to address it in a way that doesn’t compromise AO3’s principles of maximum fanwork inclusivity or legitimate uses of the site. As discussions and approaches evolve, we will keep our users updated.

Announcement, Archive of Our Own
  1. helloliriels commented:

    so glad some attempts are being made to combat this. feels so invasive.

    • Unicron commented:

      AI is not creativity. It is regenerating what already existed. By definition it cannot be transformative

    • AI is a 4chan psyop. commented:

      Computers can not be fans of anything period. Only humans can be a fan. Ban AI or suffer the consequences of LiveJournal.

  2. chocolatepot commented:

    Thanks for this update! Can you speak to why your legal chair previously spoke positively about what AO3 could do for AI training?

    • chocolatepot commented:

      (To clarify, I don’t mean this in a hostile manner, I just want to understand the context of the earlier remarks.)

    • commenter commented:

      She didn’t. Please re-read the interview. She said that AI trained on fanfiction would exhibit a “more contemporary, broad, inclusive, and diverse set of ideas” than AI trained only on works created before 1927 – which is true. And she said that legally, the act of training an AI should be considered separately from the legality of AI outputs – a logical position also held by organizations such as the Electronic Frontier Foundation.

      She said nothing whatsoever about AO3 doing anything to support AI training.

      • nicole commented:

        thank you for saying this!!! she made a couple of jokes that people seem to be taking as evidence that she loves ai and doesn’t care about fan authors’ concerns. but she doesn’t say anything of the sort

        • betsy is that you? commented:

          Um she literally said that what Stability AI did was fair use. That’s as pro AI as you can go.

          • Aster commented:

            “Fair use” is a legal position, not a moral one. It’s not “pro ai” to say that there is no legal grounds to prevent it.

      • chocolatepot commented:

        I think you’ve misunderstood me. What you just summarized there is positivity about AO3’s influence on AI. I’m not accusing her of the things everyone else is (deliberately sending fics to be scraped or something), but it’s basic reading comprehension to see that she was being positive rather than negative or even neutral.

        I’m not particularly worried about this, I just found it really odd that she would effectively take a stance for AO3 before the site as a whole did.

  3. Elizabeth commented:

    I’m still interested why the person in the article (I can’t think of her name and really don’t care to) thought it was such fun and such a laugh. That is not something I want to hear from someone who is involved in the legal aspects of the site.

  4. marithlizard commented:

    Thanks for the update! No surprises, but I hope it will help to calm the fears.

  5. Dreamin commented:

    I don’t know why this couldn’t have been posted instead of the non-apology. While I thank you for the clarification, it still leaves the question of Betsy and the conflict between her position within the OTW and her support of AI. She needs to resign over the conflict of interest.

    • nicole commented:

      that’s not what conflict of interest means. you are assuming she has a difference in opinion (we don’t know if she does). a conflict of interest would be if she was part of a company that wanted to scrape ao3 for profit.

  6. pnutbutter commented:

    I do hope to see firmer restrictions on AI use in the future. Fanworks are a labor of love, and it pains me to see the personalized aspect of fanfiction being stripped away in favor of fast and easy production. I care a lot more about WHO wrote something, as they bring their own style and flavor to a fanfiction, than I care about how fast they wrote it, or how funny the idea is. I would prefer to keep AI-generated fiction away from the archive and the PEOPLE who have made it the amazing collection of works that it is.

  7. mythicaltunes27 commented:

    You can’t say in one breath that you are against scraping of fanfic for legal reasons and then say that it’s okay for people to then post that theft on the archive.

  8. Mel commented:

    “If fans are using AI to generate fanworks, then our current position is that this is also a type of work that is within our mandate to preserve.”

    A computer is not a fan. If it is generated by a computer, it is not a fanwork. It was not a work created by a fan.

    I appreciate the steps OTW is taking to prevent scraping, but I believe the above is still a bad call and hope that OTW changes it.

    • SLWalker commented:

      This. A computer or chatbot is not a person or a fan and therefore the works are not fanworks. But they likely are theft.

      • ApocalypticRomantic commented:

        I wholeheartedly agree.

        • Trepkos commented:

          Please do not allow AI generated items – I won’t call them works – to be included on this site. I agree with those above who say that such works are essentially stolen fanworks.

    • Dani commented:

      I understand your feelings on this, but there is currently no way to effectively distinguish AI created work from human created work. Existing detectors are helpful, but not foolproof and so using them would risk the works of many human writers being caught in the filter and not being able to share their work on AO3. This will continue to be the case as these AI programs get better. For AO3 to implement a policy banning AI content on their site would be implementing a policy they are unable to enforce and would only result in people posting AI content to AO3 without tagging or labelling it as such. I would much rather it be permitted so that people who create content using AI will have no reason to pretend they wrote it themselves and thus allow the rest of us to filter those tags out of our search results.

      • Mel commented:

        Policy and enforcement are separate questions. This is like saying “please don’t make littering illegal, we shouldn’t shoot people for littering”. There are other enforcement possibilities that aren’t as aggressive as what you’re imagining.

        • Dani commented:

          If we are unable to tell with 100% certainty what text is AI generated and what is human written (and collaborations, which is what we are most likely to see), then there is no possibility for any kind of enforcement, no matter if it’s lax or aggressive.

    • G commented:

      Thank you

  9. RaraeAves commented:

    There’s a significant difference between wanting to save as many fanworks as possible, and saving machine-made versions that blatantly steal from the aforementioned fanworks. While you can’t undo the scraping that’s already been done, the Archive/OTW could do more to discourage its use, including preventing the posting of products from scraped data. This is disappointing to say the least.
    In the meantime, what are members meant to make of Betsy’s enthusiasm for AI in the interview that initially raised alarms? How are writers who are anti-theft supposed to feel comfortable with her continued involvement in the organization?

    • EchoEkhi commented:

      You should take comfort in that Prof. Betsy Rosenblatt is a subject-matter expert and a professor at the University of Tulsa College of Law. She knows more about this field than all the rest of us combined, and she is the best possible person to chair the legal committee and stands the best chance to persuade legislators on OTW’s behalf.

      • Mel commented:

        This is an appeal to authority. While expertise is valuable in understanding the possible outcomes of a decision, expertise alone does not determine one’s values.

        • EchoEkhi commented:

          But her values don’t have any bearing on this matter. The legal chair is a technocratic role and not supposed to be democratically accountable, and she only advises the Board of Directors with no power over OTW policy.

          • OK, but... commented:

            She still advises the board who decide policy. No one’s questioning her field of expertise, we’re questioning her field of influence.

        • commenter commented:

          An appeal to authority is only a logical fallacy when the authority’s expertise is irrelevant to the issue at hand. Prof. Rosenblatt is an expert in Intellectual Property law and therefore her expertise is extremely relevant to issues of IP and fanworks.

          • Mel commented:

            Her expertise is relevant to understanding the legal landscape, but it’s ridiculous to say that only people with expertise should be allowed to determine societal values. It’s like saying only experienced marksmen should be allowed to have stances on firearms.

          • yolkcheeks commented:

            “it’s ridiculous to say that only people with expertise should be allowed to determine societal values”
            I must have missed where that was said (by Prof. Rosenblatt or otherwise), would you mind pointing me to it?

      • Ring commented:

        I am not questioning her legal expertise, but she did not accurately describe how the models used to generate content work, and she drew a frankly bizarre conclusion about the impact fanfic could have on them. These models have been trained on such a large volume of text that if they’re still biased–and they are–the only way to correct that bias is probably direct human intervention. This is demonstrated pretty much every time people discover biased results that can be reliably reproduced; it’s usually fixed shortly afterward, because actual human beings go in and correct its behavior under those circumstances. The example she gave of models not recognizing certain professions as being open to all genders was brought up very recently with a blatant example; despite the fact that “women can be firefighters” almost definitely exists as a set of words among the billions ChatGPT was trained on, it did not actually learn from this that women can be firefighters. If you ask it, “Can women be firefighters?” it will probably tell you yes; what people discovered was that (simplifying) if you use different pronouns to refer to a firefighter and a teacher and ask it which one did what, it will confidently say that the firefighter took whichever actions correspond to he/him pronouns and assign the teacher she/her pronouns, even if that makes no sense in context. All it “knows” about how to answer that question is what answer is most likely to be used in similar text. Not to be utterly flippant, but the only thing large language models will “learn” from fanfic is how to get who tops wrong if you want the opposite of what the majority of your fandom prefers.

      • Odamaki commented:

        I take no comfort in this at all. She may be an expert in her field but that is no guarantee she is using that with the interests of the community in mind. She has generalised her own personal opinion and failed to serve in the manner she ought to have by consulting with the community she represents first. What she said was ‘I’m totally fine with you stealing my clients’ work’ while we were at the table expecting her to do very much the opposite.
        Experts can be nice people who are good at their jobs but they are not immune to getting it wrong. Experts are not immune to acting in a way which is technically correct and yet entirely malign. Experts can be wholly blind to areas of non-expertise where different fields overlap with their own.
        The outcry at the interview demonstrates that Betsy is poorly placed to represent. If she was my legal counsel in any other circumstance, I’d already be briefing her replacement.

  10. shadowmaat commented:

    Thank you for the clarifications. I’m reassured. And I know it’d be difficult/impossible to ban AI-generated fics since there’s no concrete way to tell if something was reconstituted by a computer or written by a human. Sure, if they boast in the tags/summary, but otherwise? I hope things will settle down a bit now and sorry for any additional headaches y’all’ve been having lately.

    • chocolatepot commented:

      There are detectors like Zero GPT and GPT Zero that can tell if text is ai-generated with a high degree of accuracy! It would be trivial for the Abuse Team to check any fics flagged by users with them.

      • amarisllis commented:

        These tools exist but their accuracy is still *highly* suspect. They are not magic bullets and currently are pretty insufficient in terms of what they do/don’t flag. It’s not something that you should rely on at this stage. At best it’s a strong signal, at worst a coin flip.

      • Rose commented:

        ZeroGPT? You mean that thing that doesn’t work and when someone tested it with the US Constitution, the result was that it thought it was ninety-something percent AI generated? That AI detection tool?

        And other “could it be AI” tests wouldn’t work either, as some users ‘hoard’ (write a bunch of fics over the course of a while then upload them in a short space of time), some users are uploading a bunch of their own fics from elsewhere (also creating the “lots of posts in a short space of time”), and others have a grammar and sentence structure that could be misconstrued for AI.

        Since the de-dress-up-gamification of AI put it into the eyes and hands of malicious parties, resulting in the data scraping we see today, I’ve disliked AI generated content as much as the next guy, but an attempted crack-down could lead to cases like the one where an artist was harassed off of (I think it was) an art subreddit for having a style that was “too close to AI” (think that painterly anime style), even though they had proof that they drew the image from scratch.

  11. MangoTea commented:

    “If fans are using AI to generate fanworks, then our current position is that this is also a type of work that is within our mandate to preserve.”

    AI generated fanworks should fall in the same category that plagiarized works do.

    Also, want a spam problem? This is how you GET a spam problem. You’ve just told spammers to go ahead and try to shoot their shot.

  12. azimuthal commented:

    While this is good to hear and I’m glad that you’re finally clearly communicating with the users about what’s been going on, and that you’ll continue to do so- it’s also something that feels insufficient.

    I understand AO3’s position about maximum inclusivity- but there’s also the fact that AO3 was built on preserving fannish content and protecting writers. A machine is not a fan. A machine built on the non consensual labour of the very writers you profess to stand for is not something you should be including in your definition of fanwork or transformative work. There’s also the fact that we did not consent to our donations going towards you legally recognising AI outputs as fanwork and advocating for them being such when speaking to government officials. Opt out is also not a remotely sufficient solution in regards to consent when this puts all the onus on not being exploited onto the writers themselves– when so many people are not aware of their exploitation/have passed away and thus cannot consent. The ONLY thing you should be asking for is clear and ethical OPT IN. Otherwise it’s a hollow and insufficient gesture.

    This is not even getting into the fact that these models are so clearly built for profit and to replace every creative profession, which goes against your policy of non-commercial fanwork being fair use. You even say in your post that you would protect users against commercial exploitation. The output of these models is just one part. The fact remains that these models themselves are built on commercial exploitation and until there is an ethical AI, there is no place for them, or the works produced by them, in fandom.

    I also get that there is not yet any magical detecting machine for AI, but all we ask is that you do not officially endorse and include these works. Copying this from a petition that worded this very well-

    ” -We are NOT asking ao3 to implement some magical tech detection system that scans and analyzes every fic uploaded. This is not a witch hunt.

    -We WANT ao3 to update their TOS and FAQ to reflect that they don’t allow AI works to be hosted on the archive, they don’t condone AI generated works (fics, comments, etc).

    -We DON’T WANT AO3 to implement a reporting tool for works suspected to be written by AI. This is just going to be another tool to harass and abuse writers, will overwhelm support, things which we absolutely do not want.

    -We believe IN GOOD FAITH that every writer on AO3 is a human writer. Yes, it means that some works will “sneak in” to the system, but that’s on your writer’s conscience.

    The ban on uploading AI-generated works can be as simple as changing a few lines on the FAQ/TOS to reflect AO3’s stance, to putting a small checkbox before posting that YES THIS IS WRITTEN BY A REAL HUMAN.
    e.g. YouTube and other user-generated-content platforms have checkboxes or other verification to show that you’re the person uploading.”

    Again, reiterating, that maximum inclusivity does not trump your other ethos, which is protecting your writers.

  13. extremelydisappointed commented:

    I am highly disappointed in OTW’s response to this. Not only will this still allow AI owners like those of Chat to steal from fanfic authors, but it can and will call into question the legitimacy of Fair Use and potentially cause creators to abandon their work in an attempt not to get sued by corporations for using their properties in fan work.

    OTW’s entire stance on AI generated fanfiction should be “It is not allowed”, and they should be finding better solutions to prevent fanfic authors’ works from being stolen by greedy corporations who want to justify shutting down fandom communities.

    I will be withdrawing my support of OTW until they choose to do right by the authors who use their site. And I hope others will follow me in this boycott.

  14. Longtime Ao3 writer commented:

    AI is not a co-author, and the fact is all AI is made of stolen data. How is that ethical? How is it okay that a machine is going to be okay to make FAN works? Machines are not fans, and they have no business in a FANWORK archive. I’m disgusted by the abject disregard for creators who are victims of exploitation and the use of something that THE VERY INDUSTRIES WE ARE FANS OF BOYCOTT. The writers strike highlights WHY AI content has 0 place in ANY creative spaces and I look forward to the OTW team doing the right thing and unilaterally banning all AI aided or generated work.

  15. RogueSareth is still unimpressed Ai fic is not fanwork fucking period commented:

    This is at least an attempt to address the issue, but it certainly doesn’t explain why you would send a pro Ai lawyer as your official rep to the US government. Or quote part of her interview so gleefully in Signal like it’s a good thing.

    Also NO, Ai generated fic IS NOT AND NEVER WILL BE FANWORK. It is generating those fic off stolen work.

    If you are still allowing work that was created using the stolen work of actual human authors you are not protecting writers. Period.

    This is an attempt but you need to put in some more effort if you want people’s trust back.

  16. onereyofstarlight commented:

    This update is both appreciated and well overdue. The methods put in place to prevent such large scale scraping are welcome news although I do understand that the ability of OTW to prevent this entirely is limited. I still firmly disagree that AI generated works fall under the umbrella of fanworks for the simple reason that these works are not made by fans. Even if they have been edited by fans, this process still relies on stealing from the hard work of other fans and is not acceptable to me. I will certainly be needing to consider the two positions presented here carefully as I do not believe they are congruent with each other and am still concerned that when push comes to shove OTW will fall on the side of AI and not fans, especially given the recent comments that have come to light.

  17. PP commented:

    Thank you for the clarification on these issues. However, I do want to voice my strong disagreement with the idea that AI-generated items should be considered fanwork. For one, AO3 was scraped prior to December 2022 as stated, which means these AI-generated items have a high chance of having used the AO3 data that was collected without consent. Two, I strongly believe that fanwork must involve individual or group labor and skill of some kind beyond typing a prompt into a chatbot. Regardless of the subjective quality of what the AI generates, it should not be allowed on the Archive. I urge the OTW to reconsider their stance on this.

  18. b commented:

    i have two thoughts on this:
    1. your legal chair is not some random volunteer and i want to hear an apology from her for conflating her personal views on AI with the goals of the OTW. if her views don’t represent the OTW, why were they posted at all? calling her one volunteer out of 900+ severely downplays her position. if it was a tag wrangler (also one volunteer out of 900+) no one would give a damn.
    2. allowing AI work to be posted is hypocritical to the idea of not allowing scraping. AI work is not fanwork and you can’t call it fanwork just because a fan told chatGPT to write about homestuck.
    and i’ll say what i did on the initial post again: if this had happened before the donation drive, i would never have donated. this is not consistent with what i expect from the OTW.

    • blcwriter commented:

      ^This.

    • Anti-AI commented:

      Same.

    • commenter commented:

      The only people conflating the comments of the Legal Chair with the organizational goals of the OTW are outraged commenters. She never claimed to represent the views of the OTW in her interview.

      • g commented:

        she was actively representing the OTW, idiot

  19. Active Writer commented:

    Thank you for this update. There’s a lot of important information included here that I think provides clarity. I also understand the complications with preventing AI-generated work from being posted to the archive.

    The one issue still remaining is the context of the interview itself and the way it was presented as enthusiasm for having fanworks contribute to AI learning.

    I’m sure Betsy is doing other important work for OTW, and IF the interview did not accurately represent her stance on AI, then people need to know so we can regain that trust.

    Right now, it certainly seems like she truly is enthusiastic about AI as a writing tool, and if that is the case, it’s worrying to have that difference in opinions between the legal team’s chair and the organization’s current policy – which you state could change.

  20. Lif commented:

    AI is plagiarism, and not legitimate fan work. This sucks. I will not be using AO3 in the foreseeable future, and forget any donations from me.