
      OUTFOX: LLM-Generated Essay Detection Through In-Context Learning with Adversarially Generated Examples

      Proceedings of the AAAI Conference on Artificial Intelligence
      Association for the Advancement of Artificial Intelligence (AAAI)


          Abstract

Large Language Models (LLMs) have achieved human-level fluency in text generation, making it difficult to distinguish between human-written and LLM-generated texts. This poses a growing risk of LLM misuse and demands detectors that can identify LLM-generated texts. However, existing detectors lack robustness against attacks: simply paraphrasing LLM-generated texts degrades their detection accuracy. Furthermore, a malicious user might deliberately evade a detector by exploiting its detection results, a setting not considered in previous studies. In this paper, we propose OUTFOX, a framework that improves the robustness of LLM-generated-text detectors by allowing both the detector and the attacker to consider each other's output. In this framework, the attacker uses the detector's prediction labels as examples for in-context learning and adversarially generates essays that are harder to detect, while the detector uses the adversarially generated essays as in-context learning examples to learn to detect essays from a strong attacker. Experiments in the domain of student essays show that the proposed detector improves detection performance on attacker-generated texts by up to +41.3 points F1-score. The proposed detector also achieves state-of-the-art detection performance on non-attacked texts, up to 96.9 points F1-score, beating existing detectors. Finally, the proposed attacker drastically degrades detector performance by up to -57.0 points F1-score, massively outperforming the baseline paraphrasing method for evading detection.
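The attacker/detector loop the abstract describes can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function `llm` is a hypothetical stand-in for a real language-model call (the paper prompts actual LLMs), and the prompt formats and example data below are invented for illustration.

```python
# Sketch of the OUTFOX-style in-context loop: the attacker conditions on the
# detector's past prediction labels; the detector conditions on the attacker's
# adversarially generated essays. `llm` is a stub so the flow is runnable.

def llm(prompt: str) -> str:
    # Stub standing in for an LLM completion call (hypothetical).
    # Returns a label when asked to label, otherwise an "essay".
    if prompt.splitlines()[-1].endswith("Label:"):
        return "lm"
    return "generated essay text"

def detect(essay: str, examples: list[tuple[str, str]]) -> str:
    """Detector: uses (essay, label) pairs -- including adversarially
    generated essays -- as in-context examples, then labels the target."""
    shots = "\n".join(f"Essay: {e}\nLabel: {lab}" for e, lab in examples)
    return llm(f"{shots}\nEssay: {essay}\nLabel:")

def attack(topic: str, detections: list[tuple[str, str]]) -> str:
    """Attacker: conditions on the detector's prediction labels to write
    an essay that is harder to detect."""
    shots = "\n".join(f"Essay: {e}\nDetected as: {lab}" for e, lab in detections)
    return llm(f"{shots}\nWrite an essay on '{topic}' that evades detection.")

# One adversarial round: the attacker sees earlier detection labels,
# and the detector then sees the attacker's new essay as an example.
essay = attack("school uniforms", [("earlier essay", "lm")])
label = detect(essay, [(essay, "lm")])
```

With real LLM calls in place of the stub, iterating this round is what lets each side adapt to the other's output.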


          Author and article information

Journal: Proceedings of the AAAI Conference on Artificial Intelligence (AAAI)
Publisher: Association for the Advancement of Artificial Intelligence (AAAI)
ISSN: 2374-3468, 2159-5399
Publication dates: March 25 2024; March 24 2024
Volume 38, Issue 19, pp. 21258–21266
DOI: 10.1609/aaai.v38i19.30120
© 2024
