Basically what the title says. I know online providers like GPTZero exist, but when dealing with sensitive documents I'd prefer to keep things in-house. A lot of people like to talk big about open-source models for generating content, but I feel the detection side isn't discussed nearly as much.

I wonder whether this kind of local capability could be stitched into a browser plugin. Hell, it doesn't even need to be a hosted service on my home network; a local on-machine app would be fine. Being able to host it as a service and use it from other machines would be interesting, though.

I haven't been able to give this a proper search yet, but the first-glance results are either people trying to evade these detectors or people trying to locally host language models.

  • MartianSands@sh.itjust.works · 63 up · 6 days ago

    Be cautious about trusting AI-detection tools; they're not much better than the AI they're trying to detect, because they're just as prone to false positives and false negatives as the models they claim to catch.

    It's also inherently an arms race: if a tool existed that could easily and reliably detect AI-generated content, the AI companies would simply fold it into their training, and the models would quickly learn to defeat it. They also wouldn't be worrying about their training data being contaminated by the output of existing AI, which is becoming a genuine problem right now.

    • iii@mander.xyz · 18 up · 6 days ago

      if a tool existed that could easily and reliably detect AI-generated content, the AI companies would simply fold it into their training

      Generative Adversarial Networks are an example of that idea in action.
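
      A minimal toy sketch of that loop in PyTorch (made-up 1D data, nothing to do with text detection specifically): the discriminator is the "detector", and because it sits inside the training loop, the generator ends up optimized specifically to beat it.

      ```python
      import torch
      import torch.nn as nn

      # "Real" data stands in for human-written content; the generator learns to mimic it.
      def real_batch(n):
          return torch.randn(n, 1) * 1.25 + 4.0

      def noise_batch(n):
          return torch.randn(n, 8)

      G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))                 # generator
      D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())   # "detector"

      opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
      opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
      bce = nn.BCELoss()

      for step in range(2000):
          # 1) Train the detector to separate real from generated samples.
          real, fake = real_batch(64), G(noise_batch(64)).detach()
          loss_d = bce(D(real), torch.ones(64, 1)) + bce(D(fake), torch.zeros(64, 1))
          opt_d.zero_grad(); loss_d.backward(); opt_d.step()

          # 2) Train the generator to make the detector call its output "real".
          fake = G(noise_batch(64))
          loss_g = bce(D(fake), torch.ones(64, 1))
          opt_g.zero_grad(); loss_g.backward(); opt_g.step()
      ```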

  • null_dot@lemmy.dbzer0.com · 28 up · 6 days ago

    There are no decent GPT-detection tools.

    If there were, they'd be locally hosted language models, and you'd need a reasonable GPU.

    • ggtdbz@lemmy.dbzer0.com (OP) · 1 up · 1 day ago

      I think I should have been clearer: this is exactly what I'm asking about. I'm somewhat surprised by the reaction this post got; this seems like a very normal thing to want to host.

      It doesn't help that some people here are replying as if I were asking how to locally host the "trick" of feeding a chatbot some text and asking it whether it's machine-generated. Ideally, the software I'm looking for would hold a bank of LLMs and do some statistical analysis of how likely a block of tokens is to have been generated by them; it would probably need quantized models just to run at a reasonable speed. For example, it would feed the first x tokens in, look at the probability distribution for the next token, compare that to the actual next token in the block, and so on (roughly like the sketch below).
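
      Something like this is the rough shape of it, as a sketch only (GPT-2 via Hugging Face transformers as a stand-in scorer; a real tool would presumably repeat this across its bank of models and do something smarter with the scores):

      ```python
      import torch
      from transformers import AutoModelForCausalLM, AutoTokenizer

      # Stand-in scorer; swap in whichever (quantized) local models the bank contains.
      tok = AutoTokenizer.from_pretrained("gpt2")
      model = AutoModelForCausalLM.from_pretrained("gpt2")
      model.eval()

      def average_surprise(text: str) -> float:
          """Mean negative log-probability the model assigns to each actual next token."""
          ids = tok(text, return_tensors="pt").input_ids
          with torch.no_grad():
              logits = model(ids).logits
          log_probs = torch.log_softmax(logits[0, :-1], dim=-1)   # predictions for positions 1..n
          actual_next = ids[0, 1:]                                # the tokens that actually follow
          per_token = log_probs[torch.arange(actual_next.numel()), actual_next]
          return -per_token.mean().item()  # lower = text looks more "expected" to this model

      # Unusually low surprise *hints* at machine-generated text; it is a weak signal, not proof.
      print(average_surprise("Paste the block of text you want to score here."))
      ```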

      Maybe this is already a thing and I just don’t know the jargon for it. I’m pretty sure I’m more informed about how these transformer algorithms work than the average user of them, but only just.

        • null_dot@lemmy.dbzer0.com · 1 up · 22 hours ago

        Sorry, I'm still not really sure what you're asking for.

        I use Open WebUI, which is the worst name ever, but it's a web UI for interacting with chat-style gen-AI models.

        You can install that locally and point it at any of the models hosted remotely by an inference provider.

        So you host the UI, but someone else does the GPU-intensive "inference".

        There seem to be some models for this task available on Hugging Face, like this one:

        https://huggingface.co/fakespot-ai/roberta-base-ai-text-detection-v1

        The difficulty may be finding a model that's actually hosted by an inference provider. Most of the models on Hugging Face are just the raw model files, which you download and run locally; the popular ones are hosted by inference providers, so you can point a query at their API and get a response.
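
        If you'd rather skip the inference provider entirely, something like this should work locally, assuming it's a standard sequence-classification checkpoint (the exact label names come from the model card, not from me):

        ```python
        from transformers import pipeline

        # Downloads the weights once, then runs entirely on your machine
        # (CPU is fine for a RoBERTa-base sized model).
        detector = pipeline(
            "text-classification",
            model="fakespot-ai/roberta-base-ai-text-detection-v1",
        )

        result = detector("Paste the paragraph you want to check here.")
        print(result)  # e.g. [{'label': ..., 'score': ...}]; label names depend on the model
        ```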

        As an aside, it's possible or even likely that you know more about how gen AI works than I do, but I think this "probability table for the next token" view comes from earlier generations of models. Or it might be a foundational concept with a lot more sophistication layered on top now; I genuinely don't know. I'm super interested in these technologies, but there's a lot to learn.

  • Eheran@lemmy.world · 16 up / 1 down · edited · 6 days ago

    What do you want to achieve with it? They were still (super) unreliable last time I checked. Unreliable as in "you might as well roll a die". Oh, and they all claim to be world-leading, the best, etc.

  • splendoruranium@infosec.pub · 8 up / 1 down · edited · 6 days ago

    Basically what the title says. I know online providers like GPTZero exist, but when dealing with sensitive documents I'd prefer to keep things in-house. A lot of people like to talk big about open-source models for generating content, but I feel the detection side isn't discussed nearly as much.
    I wonder whether this kind of local capability could be stitched into a browser plugin. Hell, it doesn't even need to be a hosted service on my home network; a local on-machine app would be fine. Being able to host it as a service and use it from other machines would be interesting, though.
    I haven't been able to give this a proper search yet, but the first-glance results are either people trying to evade these detectors or people trying to locally host language models.

    In general it’s a fool’s errand, I’m afraid. What’s the specific context in which you’re trying to apply this?

  • sobchak@programming.dev · 2 up / 1 down · 6 days ago

    Excessive use of em-dashes, emoji, and other characters that aren't on a standard keyboard. I think these companies deliberately have the models generate this stuff so it's easily detectable (so they can avoid training on their own slop).
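
    As a crude illustration of that heuristic (a hint at best, since plenty of humans type these characters too):

    ```python
    # Count characters that rarely come off a standard keyboard: em-dashes,
    # curly quotes, ellipsis characters, and emoji-range code points.
    UNUSUAL = set("—–“”‘’…")

    def unusual_char_ratio(text: str) -> float:
        flagged = sum(1 for ch in text if ch in UNUSUAL or ord(ch) >= 0x1F300)
        return flagged / max(len(text), 1)

    print(unusual_char_ratio("Great question — let’s dive in! 🚀"))
    ```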

  • Melvin_Ferd@lemmy.world · 1 up / 15 down · 6 days ago

    Guys I need something that can check if someone is using autocorrect. Stoopid clankers.

  • falseWhite@lemmy.world · 4 up / 4 down · 6 days ago

    I would guess you'd need a model that's at least as powerful and smart as the one that created the content. Otherwise it's like asking a 10-year-old to proofread an article written by an adult.