Abusive and Threatening Language Detection Task in Urdu

CICLing 2021 track @ FIRE 2021 co-hosted with ODS SoC 2021

Dear Participants and Interested Community,

The leaderboard part of the competition is over as of Aug 29, 2021. However, you can still take part even if you have not participated so far.

We encourage EVERYONE (ODS SoC leaderboard participants as well as those who missed the leaderboard deadline) to submit their technical reports for publication in the [FIRE 2021] Hate Speech and Offensive Content Identification track before the deadline on September 25, 2021.

Please see the instructions on submissions.

For your convenience, we have made the datasets available, along with the ground truth annotations ("correct labels") for the test datasets of both subtasks:

Abusive Task

Threat Task

The results on the Private Leaderboard have already been published on the respective pages for Subtasks A and B.

When writing a paper or producing a software application, tool, or interface based on the datasets or baseline systems provided on this website, you must cite the source properly. The citation for the corresponding paper is:

Maaz Amjad, Alisa Zhila, Grigori Sidorov, Andrey Labunets, Sabur Butt, Hamza Imam Amjad, Oxana Vitman, and Alexander Gelbukh (2021). Overview of Abusive and Threatening Language Detection in Urdu at FIRE 2021. In CEUR Workshop Proceedings.

Task Description

As social media platforms have grown in reach and importance, the effects of their misuse have become increasingly severe. In particular, numerous posts contain abusive language directed at certain users and thus worsen users’ experience of communicating on such platforms, while other posts contain actual threats that can put platform users in danger. Urdu has more than 230 million speakers worldwide and is widely represented on social networks and digital media.

We encourage participants to propose methods that automatically detect threats and abuse in the Urdu language, to help prevent violence and other serious consequences.

To the best of our knowledge, this is the first shared task on abusive language detection in Urdu.

Important Dates

July 19 – training and public test data release; submission platform opens

August 27 - submission deadline

August 28 - results announced (for private test set)

September 3-5 - presentations at ODS Summer of Code 2021 festival [optional]

~~September 25~~ October 04, 2021 [EXTENDED] - technical report submission for publication in Working Notes FIRE 2021

October 5 - review notifications

~~October 10~~ October 15, 2021 [EXTENDED] - camera-ready due

16-20 December - FIRE 2021 (Online Event)

NOTE: All dates are End-of-Day Pacific Time Zone

DATASETS

The task is divided into two subtasks.

Participants in this year’s shared task can choose to participate in either one or both subtasks.

Sub-task A:

Sub-task A focuses on detecting abusive language in Urdu-language tweets. This is a binary classification task in which participating systems are required to classify tweets into two classes: Abusive and Non-Abusive.

Abusive - This Twitter post contains any abusive content.

Non-Abusive - This Twitter post does not contain any abusive or profane content.

We followed Twitter's definition of abusive content: comments directed at an individual or group in order to harass, intimidate, or silence someone else’s voice.
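Because Sub-task A is a binary classification problem, system output can be compared with the gold labels using precision, recall, and F1 on the Abusive class. The sketch below is only an illustration: the official evaluation metric and submission format are not stated in this section, so the choice of F1, the helper name `binary_f1`, and the toy labels are all assumptions.

```python
from typing import Dict, Sequence

# Class names as defined for Sub-task A.
POSITIVE = "Abusive"
NEGATIVE = "Non-Abusive"


def binary_f1(gold: Sequence[str], pred: Sequence[str],
              positive: str = POSITIVE) -> Dict[str, float]:
    """Precision, recall, and F1 for the positive class (hypothetical helper)."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy labels invented for illustration; they are not taken from the dataset.
gold = [POSITIVE, NEGATIVE, POSITIVE, NEGATIVE]
pred = [POSITIVE, POSITIVE, NEGATIVE, NEGATIVE]
scores = binary_f1(gold, pred)
```

The same helper would apply to Sub-task B by passing `positive="Threatening"`.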

Dataset for Subtask A

For participation in Sub-Task A, please proceed here

For downloading the dataset, please proceed here

Sub-task B:

Sub-task B focuses on detecting threatening language in Urdu-language tweets. This is a binary classification task in which participating systems are required to classify tweets into two classes: Threatening and Non-Threatening.

Threatening - This Twitter post contains any threatening content.

Non-Threatening - This Twitter post does not contain any threatening or profane content.

We followed Twitter's definition of threatening content: posts directed at an individual or group that threaten violent acts, threaten to kill or inflict serious physical harm, intimidate, or use violent language.

Dataset for Subtask B

For participation in Sub-Task B, please proceed here

For downloading the dataset, please proceed here

Submission Instructions

Task Results Submission

Participants may form teams of up to 7 people, including all co-authors of the paper (if the team wishes to submit a technical report).
Task link and result submission instructions: TBD on July 19, 2021.
The format and limit of result submissions: TBD.
When using the provided datasets, please cite our paper as follows:

Maaz Amjad, Alisa Zhila, Grigori Sidorov, Andrey Labunets, Sabur Butt, Hamza Imam Amjad, Oxana Vitman, and Alexander Gelbukh (2021). Overview of Abusive and Threatening Language Detection in Urdu at FIRE 2021. In CEUR Workshop Proceedings.

Publication Instructions

Technical Report Submission

After the result submission deadline, participants are invited to submit an abstract and a technical report paper with a brief description of their approach and experiments for publication in the FIRE 2021 Proceedings. All the working notes will be published in CEUR Workshop Proceedings.

Teams wishing to submit technical reports should also register here:

REGISTRATION

Guidelines for Authors

---------------------------------

Note: Even if you are not sure whether you will submit a report, please register early. You can withdraw later.

Technical reports MUST be accompanied by the system code.
The code of the final system, an abstract, and a technical report should be submitted in a zip file to UrduAbs2021 (at) CICLing.org (please name the folder with your team name).
Teams participating in more than one sub-task can submit one or two corresponding technical reports, depending on whether the systems for each sub-task are sufficiently different.
Technical reports should be between 5 and 9 pages long. We encourage longer papers (up to 9 pages) as long as the content justifies the length.
All technical report submissions should be in single-column CEUR format. Authors should use one of the CEUR templates below:

Overleaf: https://www.overleaf.com/read/gwhxnqcghhdt
Word and LaTeX: http://ceur-ws.org/Vol-XXX/CEURART.zip

a. copyright-ntp This agreement is to be filled by teams who have not used any third-party data/resource for their working notes. This will be the case for most of the teams. For downloading copyright-ntp, please proceed here

b. copyright-tp If a group has used any third-party material, it has to submit this form, which states that the group has obtained the necessary permissions to use such material. For downloading copyright-tp, please proceed here

The partially filled copyright agreements are attached to this email. CEUR mandates that the forms be printed and physically signed. If some participants do not have access to a printer (e.g. the institute is closed due to COVID-19), an alternative approach is described here. A partially or incorrectly filled copyright form, or a digitally signed one, is the most common error authors make, so please pay special attention to it. Please ensure that the date, place, and title are not missing from the copyright agreement forms. For more details please read: http://ceur-ws.org/HOWTOSUBMIT.html#AUTHORAGREEMENT

Each team can submit ONLY ONE working note across all subtasks of a given track. This condition, however, does not apply to sub-tracks within HASOC. A team can therefore submit to both "HASOC - Dravidian Languages" and "HASOC - Abusive and Threatening language detection in Urdu", but cannot submit multiple working notes for subtasks within these sub-tracks.

The working notes will be published with CEUR, as in the last several years. We need the following from each team:

a) Working notes

b) Copyright agreement

Common Issues

The following are common errors that authors should avoid:

- Inclusion of names of teams and tracks in titles of working notes is highly discouraged.

For instance, a title like "Avengers@AILA: Legal document retrieval" should instead be written as "Legal Document Retrieval".

- Author names should not have any prefixes like Dr., Prof., etc.

- Copyright information within the footer of the first page should be "Forum for Information Retrieval Evaluation, December 13-17, 2021, India" and should not be changed to include track names or any other details/modifications.
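The first two checks in this list are mechanical enough to run before submitting. The following is a rough sketch under stated assumptions, not an official validator: the function name `check_working_note`, the regular expression for a "Team@Track:" title lead, and the short prefix list are all hypothetical and cover only the examples given above.

```python
import re
from typing import List

# Prefixes named in the guidelines ("Dr., Prof., etc."); this list is an
# assumption and is not exhaustive.
NAME_PREFIXES = ("dr.", "prof.")

# Matches a title that opens with "Team@Track:", as in the
# "Avengers@AILA: ..." example from the guidelines.
TEAM_TRACK_LEAD = re.compile(r"^\s*\S+@\S+\s*:")


def check_working_note(title: str, authors: List[str]) -> List[str]:
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    if TEAM_TRACK_LEAD.match(title):
        problems.append("title starts with a team@track prefix")
    for name in authors:
        if name.strip().lower().startswith(NAME_PREFIXES):
            problems.append(f"author name has a prefix: {name}")
    return problems


# Example inputs; the author names are invented placeholders.
bad = check_working_note("Avengers@AILA: Legal document retrieval",
                         ["Dr. Jane Doe", "John Roe"])
good = check_working_note("Legal Document Retrieval", ["Jane Doe", "John Roe"])
```

A check like this only catches the literal patterns shown; the copyright footer and the remaining rules still need to be verified by hand.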

Instruction for the Submission

---------------------------------

Papers should be submitted through the following link: https://easychair.org/conferences/?conf=fire21

Participants have to register at EasyChair. After registration, the following page should appear. Select "Make a new submission":