A few months ago, the Avaaz platform released a report warning of the massive amount of fake news about covid circulating undetected on social networks. These contents, amplified by the media that make them go viral, are causing another pandemic, which the WHO has called an "infodemic", capable of producing all kinds of misunderstandings and deception regarding the virus. Moreover, part of this news, despite its human appearance, is created en masse using mathematical text-generation models based on artificial neural networks. However, the same ideas and mathematical models can also be used in the opposite direction, and they are key to projects for detecting false content.
The problem of automatic text generation – that is, getting computers to speak or write coherently in natural languages, like English or Spanish – is linked to the origins of computer science, since it allows the machine and the human user to communicate easily. The first systems – like the chatbot ELIZA (created in 1964), which emulated a psychologist, or the software Racter (1984), which produced one of the first novels written (almost) without human intervention – generated sentences by applying a set of rules, called formal grammars.
Despite notable advances in this field over the decades, the results were unconvincing. Improving them required a paradigm shift in natural language processing, which came with the turn of the century and the rise of big data. Instead of requiring manually entered grammar rules, the new models process huge amounts of text with big-data techniques in order to learn the linguistic patterns themselves. Thus machines, although they do not understand language, are capable of reproducing the most typical patterns that appear in natural languages.
For this, these systems start from the so-called distributional hypothesis, popularized by the linguist John Rupert Firth in the 1950s, according to which the meaning of a word is given by the other words that usually accompany it (its neighbors). Imagine, for example, that we want a machine to extract the meaning of the word "dog" by studying the presence on the Internet of three phrases: "dogs have muzzles", "dogs bark" and "dogs sew scarves". To do this, it could consider all the text available on the Internet (in Spanish) and see which of these phrases appears more frequently. Surely the first two are much more common than the third; that is to say, the word "dog" is usually accompanied by "muzzle" and "bark", and not by "sew". Applying the distributional hypothesis, a dog is therefore "something" that has a muzzle and barks, but does not sew.
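The neighbor-counting idea can be sketched in a few lines of Python. The tiny corpus below is invented for illustration, standing in for "all the text on the Internet":

```python
from collections import Counter

# Toy corpus (made-up sentences standing in for Internet-scale text).
corpus = [
    "dogs have muzzles",
    "dogs bark",
    "dogs bark at strangers",
    "dogs have muzzles and tails",
    "dogs sew scarves",
]

# Count which words co-occur with "dogs" across the corpus.
neighbors = Counter()
for sentence in corpus:
    words = sentence.split()
    if "dogs" in words:
        neighbors.update(w for w in words if w != "dogs")

# Frequent neighbors ("bark", "muzzles") approximate the meaning of "dogs";
# rare ones ("sew") do not.
print(neighbors.most_common(3))
```

Real systems refine this raw counting in many ways, but the principle is the same: meaning is read off from co-occurrence statistics.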
This is how language models (LMs) are built, and with them the meanings of words, which are nothing more than frequent patterns across all the natural text considered by the machine. LMs are the basic components of current text-generation systems, which generate sentences by predicting the next word, given a series of previous words, using ideas from probability and statistics. In the example above, the model will predict that, after "the dog", the word "barks" is more likely to appear than the word "sews".
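A minimal version of this prediction is a bigram model, which estimates the probability of the next word from counts of word pairs. The corpus here is again hypothetical; real LMs condition on much longer contexts:

```python
from collections import defaultdict, Counter

# Toy corpus; "." marks sentence boundaries.
corpus = "the dog barks . the dog barks loudly . the dog sews ."
tokens = corpus.split()

# Count how often each word follows each previous word.
bigrams = defaultdict(Counter)
for prev, nxt in zip(tokens, tokens[1:]):
    bigrams[prev][nxt] += 1

def next_word_prob(prev, nxt):
    """P(next word | previous word), estimated from counts."""
    counts = bigrams[prev]
    return counts[nxt] / sum(counts.values())

# After "dog", "barks" (2 of 3 occurrences) beats "sews" (1 of 3).
print(next_word_prob("dog", "barks"))
print(next_word_prob("dog", "sews"))
```

Generating text then amounts to repeatedly sampling the next word from these conditional distributions.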
Mathematically, these systems represent each word as a vector, the so-called word embedding, of about 300 dimensions. The most widely used system for computing them is called word2vec. In this geometric space, similar words are close (thus "dog" would be closer to "bark" than to "sew"), operations can be carried out between vectors, and new ones can be generated. Among the most powerful models to date are GPT-2 and its successor GPT-3, from the company OpenAI, which generate texts of surprising quality. So much so that in 2019 OpenAI initially withheld its generation system for fear it would be misused to produce fake news. Despite this precaution, today the use of models of this type for text generation is widespread and not easy to detect. We suggest that readers try to guess, from these music product reviews, which are genuine and which were generated by a model similar to OpenAI's. Hint: half are of one type and half of the other.
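The notion of "closeness" between embeddings is usually measured with cosine similarity. The 3-dimensional vectors below are made up for illustration; real word2vec embeddings have around 300 dimensions and are learned from text:

```python
import numpy as np

# Invented low-dimensional "embeddings" (illustrative values only).
emb = {
    "dog":  np.array([0.9, 0.8, 0.1]),
    "bark": np.array([0.8, 0.9, 0.2]),
    "sew":  np.array([0.1, 0.2, 0.9]),
}

def cosine(a, b):
    """Cosine similarity: 1 for parallel vectors, 0 for orthogonal ones."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# "dog" lies closer to "bark" than to "sew" in this space.
print(cosine(emb["dog"], emb["bark"]))
print(cosine(emb["dog"], emb["sew"]))
```

The famous vector arithmetic of word2vec ("king" - "man" + "woman" ≈ "queen") is carried out with exactly these kinds of operations on the learned vectors.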
Against this, new models such as GLTR try to identify even the most sophisticated automatic texts. They use mathematical tools similar to the previous ones, which categorize words by color according to how likely they are: green (if they are among the 10 most plausible in that context, for that model), yellow (top 100), red (top 1,000), and purple for the rest. To evaluate whether a text is fake, the model counts the number of words of each color: if the number of green words is very high, it is very likely that the text was generated by a machine; if, on the contrary, most are less likely yellow, red or purple words, it may have been written by a human.
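The bucketing step can be sketched as follows. The word ranks here are invented for illustration; a real system would obtain each word's rank from a language model such as GPT-2:

```python
from collections import Counter

def color(rank):
    """GLTR-style color bucket for a word, given its rank in the
    model's predicted distribution for that context."""
    if rank <= 10:
        return "green"
    if rank <= 100:
        return "yellow"
    if rank <= 1000:
        return "red"
    return "purple"

def color_profile(ranks):
    """Count how many words of a text fall in each color bucket."""
    return Counter(color(r) for r in ranks)

# Hypothetical ranks: machine text uses mostly top-ranked (green) words...
machine_ranks = [1, 3, 2, 7, 1, 4, 9, 2]
# ...while human text mixes in less predictable choices.
human_ranks = [5, 120, 30, 900, 2, 4500, 60, 15]

print(color_profile(machine_ranks))
print(color_profile(human_ranks))
```

A high proportion of green words is thus evidence of machine generation; a broader mix of colors suggests a human author.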
According to recent results, the success of this tool is considerable: without it, evaluators discriminate human-written news from machine-generated news with 54.2% accuracy; with it, the rate rises to 72.3%. However, this figure will surely have changed by the time this article is published: in the context of the infodemic, we are living through an accelerated arms race to design, on the one hand, the best generative text models and, on the other, the corresponding detectors.
Víctor Gallego and Alberto Redondo are predoctoral researchers at ICMAT. Ágata A. Timón García-Longoria is responsible for communication and outreach at ICMAT.
Coffee and theorems is a section dedicated to mathematics and the environment in which it is created, coordinated by the Institute of Mathematical Sciences (ICMAT), in which researchers and members of the center describe the latest advances in this discipline, share meeting points between mathematics and other social and cultural expressions, and remember those who marked its development and knew how to transform coffee into theorems. The name evokes the definition by the Hungarian mathematician Alfréd Rényi: "A mathematician is a machine that transforms coffee into theorems."
Editing and coordination: Ágata A. Timón García-Longoria (ICMAT)