Disinfo Demasked by Narrative Detection!
The Project
This project is a contribution to addressing the global threat of information disorders affecting countries in the Global South. It analyses propaganda and disinformation targeting the Arabic-speaking world and explores the potential of AI and LLMs to unveil their underlying narratives. It uses Russian state-driven propaganda and disinformation as a case study. The findings and LLM experiments can also be used to analyse and detect other sources of propaganda and disinformation.
Collaboration
The project was initiated by DW Akademie, Deutsche Welle’s center for international media development. It was implemented in collaboration with Deutsche Welle’s Research & Cooperations Unit (DW ReCo) and with SocialLab as the technology partner.
Researchers
The driving force behind the project's research, methodology and findings is a group of experts in data science and AI, geopolitics, disinformation (especially Russian disinformation spread in the Global South), journalism and fact-checking. This interdisciplinary group comes from several countries in the Middle East, Asia, Africa and America.
Findings & Data
On this website, you can follow the whole project process step by step and access all research findings and project outputs. Please note: the datasets produced in the project can be accessed only via the contact field, after sharing key information with us.
Project Flow
1- Project Kick-off: Setting the scope, milestones and goals
- Why this Project
- Why Russian Disinformation in the MENA Region
- Why Social Listening Technologies and AI
2- Defining the research scope
3- Data Collection
- Dataset 2: 270K Records from Telegram Channel of Russia Today Arabic (AR)
4- Data Selection and Preparation
- The posts for human labelling were selected from the dataset of 4000 posts based on top views and forwards.
-> The 500 posts with the most engagement were selected and divided equally among the researchers.
- Dataset 4: 4000 translated unique records from Sputnik and RT – top views and forwards – ENG
- Dataset 5: 500 posts with the most engagement, prepared for human labelling (combined dataset Sputnik and RT)
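The selection step described above can be sketched roughly as follows. This is an illustration only, not the project's code: the field names `views` and `forwards`, the definition of engagement as their sum, and the round-robin assignment to researchers are all assumptions.

```python
from typing import Dict, List

# Hypothetical record layout: {"id": ..., "views": ..., "forwards": ...}
Post = Dict[str, int]


def select_top_posts(posts: List[Post], n: int = 500) -> List[Post]:
    """Return the n posts with the highest engagement (assumed here: views + forwards)."""
    return sorted(posts, key=lambda p: p["views"] + p["forwards"], reverse=True)[:n]


def divide_among(posts: List[Post], researchers: List[str]) -> Dict[str, List[Post]]:
    """Deal posts out round-robin so every researcher gets an (almost) equal share."""
    shares: Dict[str, List[Post]] = {name: [] for name in researchers}
    for i, post in enumerate(posts):
        shares[researchers[i % len(researchers)]].append(post)
    return shares
```

Dealing posts out round-robin from the sorted list also spreads the highest-engagement posts across researchers instead of giving them all to one person.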
5- Development of the labelling system (narrative coding methodology)
- Labelling System and Coding Instructions
- See all details in the Research Methodology
6- Human Labelling and Research findings: Top Russian Narratives targeting Arabic-speaking audiences
- Top 20 narratives with the most impact on Arabic-speaking audiences spread via Telegram Channels of Sputnik and RT (Check Narrative)
7- Feedback-Loop: Expert Meet-up in Berlin
The group confirmed the importance of developing AI models for narrative detection as a promising approach to tackle the issue of disinformation or FIMI (Foreign Information Manipulation and Interference).
8- Machine Learning Experiment: automated labelling of 2K Posts
- The 275 human-annotated posts were prepared for five-fold cross-validation by generating five 80%/20% splits.
- Another ~1500 posts were selected from the unlabelled remainder of the previously selected 4000 posts.
The posts were prepared by removing all information except the English translation of the text and the empty fields for labelling.
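The preparation described above can be sketched in a few lines. This is a hedged illustration rather than the project's implementation: the field names `text_en`, `theme` and `sub_theme`, the fixed random seed and the shuffling strategy are assumptions.

```python
import random
from typing import Dict, List, Tuple


def prepare_for_labelling(post: Dict[str, str]) -> Dict[str, str]:
    """Keep only the English translation and add empty label fields (names are hypothetical)."""
    return {"text_en": post["text_en"], "theme": "", "sub_theme": ""}


def five_fold_splits(n_items: int, k: int = 5, seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Return k (train_indices, test_indices) pairs; each test fold holds about 1/k of the data."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k disjoint folds of near-equal size
    splits = []
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


# 275 annotated posts give five splits of 220 training and 55 test posts each.
splits = five_fold_splits(275)
```

Because the folds are disjoint, every annotated post appears in exactly one test set, so each post is scored once across the five runs.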
- Unseen posts were labelled in blocks of 10, using 15 examples from the human-annotated data to illustrate the coding methodology.
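The batching described above can be illustrated with a minimal sketch. The prompt layout, the function names and the example field names (`text`, `theme`, `sub_theme`) are assumptions, not the prompts used in the project; the call to the language model itself is left out.

```python
from typing import Dict, Iterator, List


def blocks(items: List[str], size: int = 10) -> Iterator[List[str]]:
    """Yield consecutive blocks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_prompt(instructions: str, examples: List[Dict[str, str]], block: List[str]) -> str:
    """Assemble one few-shot prompt: instructions, labelled examples, then unlabelled posts."""
    parts = [instructions, "", "Examples:"]
    for ex in examples:
        parts.append(f"Post: {ex['text']}\nTheme: {ex['theme']}\nSub-theme: {ex['sub_theme']}")
    parts.append("")
    parts.append("Label the following posts:")
    for i, text in enumerate(block, start=1):
        parts.append(f"{i}. {text}")
    return "\n".join(parts)
```

Each block of 10 unseen posts would get its own prompt, with the same 15 human-annotated examples repeated in every prompt.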
1. The quality of the machine labelling of the 275 human-labelled data points was estimated by comparing the predicted themes and sub-themes to the human answers. The scores used for evaluation are Accuracy, Precision, Recall and F1-Score.
2. The previously unseen, machine-labelled data points were manually validated by the research team. The validation was qualitative: each label was rated on a scale from (partially) correct to (partially) incorrect, with an explanation of what went wrong.
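The four scores named above can be computed as in the sketch below. Macro-averaging over labels is an assumption here; the text does not say how per-label precision, recall and F1 were aggregated in the experiment.

```python
from typing import Dict, List


def evaluate(y_true: List[str], y_pred: List[str]) -> Dict[str, float]:
    """Accuracy plus macro-averaged precision, recall and F1 over all observed labels."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    labels = sorted(set(y_true) | set(y_pred))
    totals = {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)  # true positives
        fp = sum(t != label and p == label for t, p in pairs)  # false positives
        fn = sum(t == label and p != label for t, p in pairs)  # false negatives
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        totals["precision"] += precision
        totals["recall"] += recall
        totals["f1"] += f1
    scores = {name: value / len(labels) for name, value in totals.items()}
    scores["accuracy"] = sum(t == p for t, p in pairs) / len(pairs)
    return scores
```

In a five-fold setup, the function would be applied to the test portion of each split and the five results averaged.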
- Dataset 8: Machine labelling of ~1500 additional posts for human evaluation
- Technical report including detailed explanation to reproduce the experiment
- Codes and Prompts used in the ML experimentation
Research Conclusion
Russian narratives targeting Arabic-speaking audiences
The findings indicate a strategic use of Sputnik and Russia Today to disseminate narratives among Arabic-speaking audiences that favour Russian perspectives, especially in geopolitical conflicts.
Russia Today: Themes ranged from promoting RT Arabic's live broadcasts to discussions on Russia's historical interactions with global figures like Saddam Hussein. Messages often portrayed Russia in a positive or defensive light, particularly in the context of international criticism or conflict.
Sputnik: Consistent focus on military and geopolitical themes. Several messages featured footage from the Russian Ministry of Defense, suggesting an intent to showcase Russian military operations positively.
In both channels: Emphasis on military successes and the portrayal of opposing forces, particularly in the context of the Ukraine conflict, and a recurrent portrayal of Russia as a resilient force and a beacon of stability.
The in-depth data analysis highlighted the 20 narratives with the most impact on the audiences of those channels (see the narratives here).
Potential of ML models for narrative detection
- Confirmation that structured, labelled databases are critical to unlocking the potential of AI for narrative detection.
- Early experiments show that LLMs have the potential to recognize narratives in texts.
- This holds even when the model is given only instructions and a few examples (few-shot prompting).
- The potential for fine-tuning may be even higher, but this remains to be determined.
- The first experiment highlights the potential to perform narrative recognition using clear instructions and less data, without the need for very large datasets.
- It is important to include all relevant information in the prompt, such as the time frame, to ensure that the model is able to place the text in the correct context.
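The last point can be illustrated with a small helper that prepends contextual information to a prompt. The function name, its parameters and the wording of the context line are hypothetical, not taken from the project's prompts.

```python
from datetime import date


def add_context(prompt_body: str, source: str, start: date, end: date) -> str:
    """Prepend the source and publication time frame so the model can place the posts in context."""
    header = (
        f"Context: the posts below were published by {source} "
        f"between {start.isoformat()} and {end.isoformat()}."
    )
    return f"{header}\n\n{prompt_body}"
```

Keeping the context in a separate header makes it easy to reuse the same instructions and examples for data from a different channel or period.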