
How to Know If You Are Reading AI-Generated Text?

Greetings to all cheaters out there, we (will) see YOU

Ștefan Pleșca

Published in ILLUMINATION

6 min read · Dec 19, 2022

Intro

We live in the AI revolution, and unless you have been living under a rock for the last five years, you probably know about OpenAI’s GPT.

Just a refresher:

“The OpenAI API can be applied to virtually any task that involves understanding or generating natural language or code. We offer a spectrum of models with different levels of power suitable for different tasks, as well as the ability to fine-tune your own custom models. These models can be used for everything from content generation to semantic search and classification.” — from here

To translate the above into plain English, this API allows the user to generate “natural language texts” that can easily be mistaken for human-written text. (The applicability of this technology is not limited to generating human-like text, but that aspect is our main focus here.)

This is extraordinary for research, for investigating language as an entity, for exploring different ways to solve problems, and much more.

This is NOT the solution for creative writing!

To emphasize a little more why AI is not suited for generating creative content, I made a short list of thoughts/arguments:

No connection, or only a fake connection, to a potential reader
“Creativity” in AI amounts to lowering the predictability of each word’s position in a sentence while trying, on a best-effort basis, to keep the construction logical
Total lack of disruptive human behavior (this is my favorite: I love to feel and understand the emotion of the writer, and in most cases it is a phrase or an emphasis that acts as the writer’s signature)

I think that people who come to Medium want to know and feel the human behind the texts and understand their tears, smiles, uncertainties, and every emotion in between.

Using an AI to write “your creative materials” on Medium is hurtful and dishonest, and it cheats the entire idea of the platform (or of any creativity-oriented platform).

I think this is enough for now. Let’s see some examples and some tools to analyze them.

Some examples

All the examples below come from different published articles by a single user, MR. I will give three short samples that we will analyze further. (MR is an abbreviation used to protect the identity of this user.)

“Extraterrestrial life, also known as alien life, is the existence of life beyond Earth and the search for such life within the universe. While there is no definitive evidence of extraterrestrial life, the possibility of its existence has captivated the imagination of humans for centuries. From ancient myths and legends to modern scientific endeavors, the search for extraterrestrial life has inspired countless works of art, literature, and science fiction.” — MR

“Japan is an archipelago of four main islands and several smaller ones located in East Asia, in the northwest Pacific Ocean. The four main islands are Hokkaido, Honshu, Shikoku, and Kyushu. Japan is about the size of California and is located east of the Korean Peninsula and north of the island of Taiwan.” — MR

“More than 7,000 islands make up the Southeast Asian archipelago of the Philippines. It is situated in the western Pacific Ocean and is bordered by the South China Sea to the west, the Philippine Sea to the east, and the Celebes Sea to the southwest. The country is located along the Pacific Ring of Fire, which makes it prone to earthquakes and volcanic eruptions.” — MR

The samples above are short, but for the testing I used the full articles. The links to the articles are missing on purpose: I don’t want to drive any traffic to this account or disclose information about the user.

Some tools to work with

GLTR

“The GLTR demo enables forensic inspection of the visual footprint of a language model on input text to detect whether a text could be real or fake. It is a collaborative effort between Hendrik Strobelt, Sebastian Gehrmann, and Alexander Rush from the MIT-IBM Watson AI lab and Harvard NLP.” — from here

If we run the samples through GLTR we will get this:

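GLTR-generated screenshots of the three pieces in full

Green highlighting marks words that were among the language model’s top 10 predictions for that position. The more green a text shows, the more predictable it is to the model, and the more likely it is to be AI-generated. To understand more, we can cite the authors: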

“Each text is analyzed by how likely each word would be the predicted word given the context to the left. If the actual used word would be in the Top 10 predicted words the background is colored green, for Top 100 in yellow, Top 1000 red, otherwise violet. Try some sample texts from below and see for yourself if you can spot the difference between machine generated text and human generated text or try your own.
The histograms show some statistic about the text: Frac(p) describes the fraction of probability for the actual word divided by the maximum probability of any word at this position. The Top 10 entropy describes the entropy along the top 10 results for each word.” — from here
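The coloring rule and the two statistics from the quote can be sketched in a few lines. This is only an illustration under stated assumptions, not GLTR's actual code: the function names `rank_bucket` and `analyze_position` are hypothetical, the next-word probabilities are made-up numbers rather than the output of a real language model, and renormalizing the top 10 probabilities before computing the entropy is my own choice.

```python
import math

def rank_bucket(rank: int) -> str:
    """Map the 1-based rank of the actual word to the color bucket from the quote."""
    if rank <= 10:
        return "green"
    if rank <= 100:
        return "yellow"
    if rank <= 1000:
        return "red"
    return "violet"

def analyze_position(probs: dict, actual: str) -> dict:
    """Score one word against a (hypothetical) next-word distribution.

    probs maps candidate words to their probability at this position;
    actual is the word the author really used.
    """
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    words = [word for word, _ in ranked]
    # A word the model did not list at all is ranked just past the end.
    rank = words.index(actual) + 1 if actual in probs else len(words) + 1
    # Frac(p): probability of the actual word divided by the maximum probability.
    p_max = ranked[0][1]
    frac_p = probs.get(actual, 0.0) / p_max
    # Assumption: renormalize the top 10 so the entropy is over a proper distribution.
    top10 = [p for _, p in ranked[:10] if p > 0]
    total = sum(top10)
    entropy = -sum((p / total) * math.log(p / total) for p in top10)
    return {
        "rank": rank,
        "color": rank_bucket(rank),
        "frac_p": frac_p,
        "top10_entropy": entropy,
    }

# Toy distribution for the word after "The cat sat on the" (made-up numbers).
toy = {"mat": 0.5, "floor": 0.25, "sofa": 0.25}
result = analyze_position(toy, "floor")
print(result["rank"], result["color"], result["frac_p"])
```

A real tool would take `probs` from a language model at every position in the text. A text whose words almost all land in the green bucket is one the model finds very predictable.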

For a more comprehensive understanding of GLTR please deep dive into their documentation here.

After this test, the samples look AI-generated, but in case we are still undecided, let’s try another tool.

OpenAI Detector

“This is an online demo of the GPT-2 output detector model, based on the 🤗/Transformers implementation of RoBERTa. Enter some text in the text box; the predicted probabilities will be displayed below. The results start to get reliable after around 50 tokens.” — from here
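For readers who prefer to run such a detector locally instead of through the demo page, here is a minimal sketch using the Transformers `pipeline` API. Several things are assumptions on my part and should be checked before relying on it: the model identifier `openai-community/roberta-base-openai-detector`, the label names "Fake" and "Real", and the helper names `fake_probability` and `detect`, which are hypothetical.

```python
def fake_probability(result: dict) -> float:
    """Turn one classifier result like {"label": "Fake", "score": 0.99} into P(fake)."""
    score = float(result["score"])
    # Assumption: the classifier is binary with labels "Fake" and "Real".
    return score if result["label"].lower() == "fake" else 1.0 - score

def detect(text: str, model_name: str = "openai-community/roberta-base-openai-detector") -> float:
    """Return the estimated probability that text is machine-generated."""
    # Imported lazily so the helper above works without transformers installed.
    from transformers import pipeline

    classifier = pipeline("text-classification", model=model_name)
    # truncation=True keeps the input within the model's maximum length.
    return fake_probability(classifier(text, truncation=True)[0])
```

Calling `detect("some paragraph of text")` downloads the model on first use, so it needs the `transformers` package, a backend such as PyTorch, and network access. As the quote notes, very short inputs give unreliable results.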

If we run the samples through the OpenAI detector (not in full, because that would have overwhelmed the system, but a large portion of each), we get this:

The tool is very explicit: every test scored above 99% fake, meaning the content was most likely generated with GPT.
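Since a full article can be too long for the detector in one go, one option is to split it into smaller pieces and check each piece separately. The sketch below is only an assumption about how one might do that: the helper name `chunk_words` and the sizes (300 words per chunk, at least 50 words in the last chunk) are hypothetical, and words are used as a rough stand-in for tokens.

```python
def chunk_words(text: str, size: int = 300, min_size: int = 50) -> list:
    """Split text into chunks of about `size` words each.

    A final chunk shorter than `min_size` words is merged into the previous
    chunk, because very short inputs give unreliable detector results.
    """
    words = text.split()
    chunks = [words[i:i + size] for i in range(0, len(words), size)]
    if len(chunks) > 1 and len(chunks[-1]) < min_size:
        tail = chunks.pop()
        chunks[-1].extend(tail)
    return [" ".join(chunk) for chunk in chunks]

# Example with a dummy 620-word article: the 20-word tail joins the second chunk.
article = " ".join(["word"] * 620)
print([len(chunk.split()) for chunk in chunk_words(article)])
```

Each chunk can then be pasted into the detector on its own, and the scores compared across the article.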

Just to be fair, I will also check one of my own articles with both tools and show you the results.

The article tested: “Domain(s) relevance and portability: Some of my struggles and ways to avoid them” (faun.pub)

Checking the article above with both tools gave these results:

Screenshots generated with GLTR and OpenAI detector

You can see from the GLTR result that my text comes close to something an AI could have written (I deliberately ran this test on one of my more descriptive pieces), but the OpenAI Detector rated it as genuine.

These are only two simple, free tools for checking whether what you are reading is AI-generated or human-written. If you have the resources, you can also build your own using the OpenAI framework and APIs, or get insights from Originality.ai. (Those solutions are not free.)

For Medium friends, instead of a conclusion

Hi Medium and Medium Staff, I am sure that you are already preparing an API or integration to scan for such content. (I am sure the example above is not the only one floating around.) I would like to learn about your approach to this. I want to congratulate you on all your great work keeping the “garden” clean, and to shout out to everyone who is trying to cheat us with AI: we, the readers, will SEE you.
