Abusive and Threatening Language Detection Task in Urdu
CICLing 2021 track @ FIRE 2021 co-hosted with ODS SoC 2021
Dear Participants and Interested Community,
The leaderboard part of the competition is over as of Aug 29, 2021. However, even if you have not participated yet, you can still participate.
We encourage EVERYONE (ODS SoC leaderboard participants as well as those who missed the leaderboard deadline) to submit their technical reports, to be published in the [FIRE 2021] Hate Speech and Offensive Content Identification track, before the deadline on September 25, 2021.
Please see the instructions on submissions.
For your convenience, we make available the datasets along with the ground truth annotations ("correct labels") to the Test datasets for both subtasks:
The results on the Private Leaderboard have already been published on the respective pages for Subtasks A and B.
When writing a paper or producing a software application, tool, or interface based on the datasets or baseline systems provided on this website, it is necessary to properly cite the source. Below is the citation for the corresponding paper:
Maaz Amjad, Alisa Zhila, Grigori Sidorov, Andrey Labunets, Sabur Butt, Hamza Imam Amjad, Oxana Vitman, and Alexander Gelbukh (2021). Overview of Abusive and Threatening Language Detection in Urdu at FIRE 2021. In CEUR Workshop Proceedings.
Task Description
As social media platforms have grown in reach and importance, the effects of their misuse have become increasingly serious. In particular, numerous posts contain abusive language directed at certain users and thereby worsen users' experience of communicating on such platforms, while other posts contain actual threats that potentially put platform users in danger. The Urdu language has more than 230 million speakers worldwide, with vast representation on social networks and digital media.
We encourage participants to propose methods that can automatically detect threats and abuse in the Urdu language, to help prevent violence and other serious consequences.
To the best of our knowledge, this is the first shared task on abusive language detection in Urdu.
Important Dates
July 19 – training and public test data release; submission platform opens
August 27 - submission deadline
August 28 - results announced (for private test set)
September 3-5 - presentations at ODS Summer of Code 2021 festival [optional]
October 4, 2021 [EXTENDED from September 25] - technical report submission for publication in Working Notes FIRE 2021
October 5 - review notifications
October 15, 2021 [EXTENDED from October 10] - camera-ready due
16-20 December - FIRE 2021 (Online Event)
NOTE: All dates are End-of-Day Pacific Time Zone
DATASETS
The task is divided into two subtasks.
Participants in this year’s shared task can choose to participate in either one or both subtasks.
Sub-task A:
Sub-task A focuses on detecting abusive language in tweets written in Urdu. This is a binary classification task in which participating systems are required to classify tweets into two classes, namely: Abusive and Non-Abusive.
Abusive - This Twitter post contains any abusive content.
Non-Abusive - This Twitter post does not contain any abusive or profane content.
We follow Twitter's definition of abusive content: comments directed at an individual or group in order to harass, intimidate, or silence someone else's voice.
Dataset for Subtask A
For participation in Sub-Task A, please proceed here
For downloading the dataset, please proceed here
Sub-task B:
Sub-task B focuses on detecting threatening language in tweets written in Urdu. This is a binary classification task in which participating systems are required to classify tweets into two classes, namely: Threatening and Non-Threatening.
Threatening - This Twitter post contains any threatening content.
Non-Threatening - This Twitter post does not contain any threatening or profane content.
We follow Twitter's definition of threatening content: posts directed at an individual or group that threaten violent acts, threaten to kill or inflict serious physical harm, intimidate, or use violent language.
Dataset for Subtask B
For participation in Sub-Task B, please proceed here
For downloading the dataset, please proceed here
Submission Instructions
Task Results Submission
Participants may form teams of up to 7 people, including all co-authors of the paper (if the team wishes to submit a technical report).
Task link and result submission instructions TBD on July 19, 2021
The format and limit of result submissions TBD
When using the provided datasets, please cite our paper as follows:
Maaz Amjad, Alisa Zhila, Grigori Sidorov, Andrey Labunets, Sabur Butt, Hamza Imam Amjad, Oxana Vitman, and Alexander Gelbukh (2021). Overview of Abusive and Threatening Language Detection in Urdu at FIRE 2021. In CEUR Workshop Proceedings.
Publication Instructions
Technical Report Submission
After the result submission deadline, participants are invited to submit an abstract and a technical report paper with a brief description of their approach and experiments for publication in the FIRE 2021 Proceedings. All the working notes will be published in CEUR Workshop Proceedings.
For teams wishing to submit their technical reports, please, also register here:
Guidelines for Authors
---------------------------------
Note: Even if you are not yet sure whether you will submit a report, we recommend registering early. You can withdraw later.
Technical reports MUST be accompanied by the system code.
The code of the final system, an abstract, and a technical report should be submitted in a zip file to UrduAbs2021 (at) CICLing.org (please name the folder with your team name).
Teams participating in more than one track can submit one or two corresponding technical reports depending on whether the systems for each sub-task are sufficiently different.
Technical reports should be between 5 and 9 pages long. We encourage you to write longer papers (up to 9 pages) as long as the content justifies the length.
All technical report submissions should be in single column CEUR format. Authors should use one of the CEUR Templates below:
Word and Latex: http://ceur-ws.org/Vol-XXX/CEURART.zip
Copyright Agreement: Each group also has to submit a copyright agreement form. This year there are two different forms:
a. copyright-ntp This agreement is to be filled by teams who have not used any third-party data/resource for their working notes. This will be the case for most of the teams. For downloading copyright-ntp, please proceed here
b. copyright-tp In case a group has used some third party material, they have to submit this form, which states that they have obtained the necessary permissions for use of such material. For downloading copyright-tp, please proceed here
The partially filled copyright agreements are attached to this email. CEUR mandates that the forms be signed by hand on a printout. In case some participants do not have access to a printer (e.g. the institute is closed due to COVID-19), an alternate approach is mentioned here. A partially or incorrectly filled, or digitally signed, copyright form is the most common error that authors make, so please pay special attention to it. Kindly ensure that the date, place, and title are not missing from the copyright agreement forms. For more details please read: http://ceur-ws.org/HOWTOSUBMIT.html#AUTHORAGREEMENT
Each team can submit ONLY ONE working note across all subtasks of a given track. This condition, however, does not apply to sub-tracks within HASOC: a team can submit to both "HASOC - Dravidian Languages" and "HASOC - Abusive and Threatening language detection in Urdu", but cannot submit multiple working notes for subtasks within these sub-tracks.
These will be published with CEUR as has been the trend for the last several years. We need the following from each team:
a) Working notes
b) Copyright agreement
Common Issues
The following are some common errors that authors should avoid:
- Inclusion of names of teams and tracks in titles of working notes is highly discouraged.
For instance, titles like "Avengers@AILA: Legal document retrieval" should instead be written as "Legal Document Retrieval".
- Author names should not have any prefixes like Dr., Prof., etc.
- Copyright information within the footer of the first page should be "Forum for Information Retrieval Evaluation, December 13-17, 2021, India" and should not be changed to include track names or any other details/modifications.
Instruction for the Submission
---------------------------------
Papers should be submitted through the following link: https://easychair.org/conferences/?conf=fire21
Participants have to register at EasyChair. After registration, the following page should appear. Select "Make a new submission":
Choose the right track to submit the paper:
Evaluation plan
We will use Accuracy and F1 to rank the results.
ROC-AUC will also be reported for teams that provide confidence scores, but it will not affect the final ranking.
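For teams that want to check their own predictions against the released Test labels, the three metrics named above can be sketched in plain Python as follows. This is only an illustrative sketch, not the organizers' scoring script: it assumes labels are encoded as 1 (Abusive / Threatening) and 0 (the negative class), that F1 is computed for the positive class, and that confidence scores are higher for more likely positives. The function names and the toy values are hypothetical.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of tweets whose predicted label equals the gold label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def f1_positive(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> float:
    """F1 of the positive class (assumption: the ranking F1 is computed this way)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc(y_true: Sequence[int], scores: Sequence[float], positive: int = 1) -> float:
    """ROC-AUC as the probability that a random positive tweet receives a
    higher confidence score than a random negative one (ties count as 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy values for illustration only (not taken from the task data).
gold = [1, 0, 1, 1, 0, 0]
pred = [1, 0, 0, 1, 1, 0]
conf = [0.9, 0.2, 0.4, 0.8, 0.6, 0.1]

print("Accuracy:", round(accuracy(gold, pred), 4))
print("F1:", round(f1_positive(gold, pred), 4))
print("ROC-AUC:", round(roc_auc(gold, conf), 4))
```

The pairwise ROC-AUC loop is quadratic in the number of tweets, which is fine for a quick sanity check; for large test sets a library implementation would be the more practical choice.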
Results and Ranking
Organizing Team
Task Chairs
Maaz Amjad
maazamjad (at) phystech (dot) edu
https://nlp.cic.ipn.mx/maazamjad
PhD student, Center for Computing Research (CIC), Instituto Politécnico Nacional (IPN), Mexico.
Alexander Gelbukh
gelbukh (at) gelbukh (dot) com https://www.gelbukh.com
Full Professor, Center for Computing Research (CIC), Instituto Politécnico Nacional (IPN), Mexico.
Grigori Sidorov
sidorov (at) cic (dot) ipn (dot) mx http://www.cic.ipn.mx/~sidorov
Full Professor, Center for Computing Research (CIC), Instituto Politécnico Nacional (IPN), Mexico.
Alisa Zhila
alisa (dot) zhila (at) gmail (dot) com https://nlp.cic.ipn.mx/~alisa
Independent Researcher
Sabur Butt
sabur (at) nlp (dot) cic (dot) ipn (dot) mx
PhD student, Center for Computing Research (CIC), Instituto Politécnico Nacional (IPN), Mexico.
Oxana Vitman
oksana (dot) vittmann (at) gmail (dot) com
Master's student, Center for Computing Research (CIC), Instituto Politécnico Nacional (IPN), Mexico.
Ulyana Astanina
ODS.ai/DataSouls
Hamza Imam Amjad
hamzaimamamjad (at) phystech (dot) edu
Master's student, Moscow Institute of Physics and Technology, Russia.
Alexey Natekin
ODS.ai/DataSouls
Andrey Labunets
Independent Researcher
Contact the program committee via email:
UrduAbs2021@cicling.org