A computational analysis of Hinglish cyberbullying detection using machine learning techniques
Author(s): Manish Joshi, Dhirendra Pandey and Vandana Pandey
Abstract: The prevalence of cyberbullying on social media platforms has highlighted the need for effective detection systems, especially in multilingual regions such as India, where code-mixed language is common. Hinglish, a blend of Hindi and English, is widely used in online communication yet remains underrepresented in existing cyberbullying detection datasets. This study addresses the gap by creating a Hinglish cyberbullying dataset of 22,523 annotated tweets, designed specifically to support machine learning approaches. Several machine learning models were trained and evaluated on this dataset, with Random Forest achieving the highest accuracy among the tested algorithms. Our findings emphasize the importance of targeted language resources for cyberbullying detection in multilingual contexts and demonstrate the potential of ensemble methods such as Random Forest for classification tasks in code-mixed settings.
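To make the ensemble idea in the abstract concrete, the following is a minimal sketch of a Random Forest style classifier for code-mixed text, written with the Python standard library only. It is not the authors' pipeline: the abstract does not specify features, hyperparameters, or preprocessing, so everything here is an assumption made for illustration. The example sentences and labels are invented toy data (not drawn from the 22,523-tweet dataset), the features are simple word-presence tests, each tree is reduced to a depth-one stump, and the names `TRAIN`, `tokens`, `train_stump`, `train_forest`, and `predict` are hypothetical.

```python
import random
from collections import Counter

# Invented toy examples, NOT taken from the paper's dataset.
# Label 1 = bullying, label 0 = benign.
TRAIN = [
    ("tu bilkul bekar hai", 1),
    ("tu pagal hai kya", 1),
    ("tum sab bekar log ho", 1),
    ("chup kar pagal", 1),
    ("aaj ka match accha tha", 0),
    ("kal milte hai dost", 0),
    ("movie bahut acchi thi", 0),
    ("happy birthday dost", 0),
]


def tokens(text):
    """Lowercase whitespace tokenisation (an assumed, very simple feature set)."""
    return set(text.lower().split())


def train_stump(sample, vocab, rng, n_candidates):
    """Fit a depth-one tree: among a random subset of words, keep the word
    whose presence/absence classifies the bootstrap sample best."""
    best = None
    for word in rng.sample(vocab, min(n_candidates, len(vocab))):
        with_word = [y for toks, y in sample if word in toks]
        without = [y for toks, y in sample if word not in toks]
        if not with_word or not without:
            continue  # this word does not split the sample
        label_with = Counter(with_word).most_common(1)[0][0]
        label_without = Counter(without).most_common(1)[0][0]
        correct = with_word.count(label_with) + without.count(label_without)
        if best is None or correct > best[0]:
            best = (correct, word, label_with, label_without)
    if best is None:
        # No candidate word split the sample: fall back to the majority label.
        majority = Counter(y for _, y in sample).most_common(1)[0][0]
        return (None, majority, majority)
    return best[1:]


def train_forest(data, n_trees=25, seed=0):
    """Bagging plus random feature subsets: the two sources of randomness
    that define a random forest."""
    rng = random.Random(seed)
    tokenised = [(tokens(text), label) for text, label in data]
    vocab = sorted(set().union(*(toks for toks, _ in tokenised)))
    n_candidates = max(1, int(len(vocab) ** 0.5))  # sqrt(#features) heuristic
    forest = []
    for _ in range(n_trees):
        # Bootstrap: sample the training set with replacement.
        sample = [rng.choice(tokenised) for _ in tokenised]
        forest.append(train_stump(sample, vocab, rng, n_candidates))
    return forest


def predict(forest, text):
    """Majority vote over all stumps."""
    toks = tokens(text)
    votes = [
        label_with if (word is not None and word in toks) else label_without
        for word, label_with, label_without in forest
    ]
    return Counter(votes).most_common(1)[0][0]


if __name__ == "__main__":
    forest = train_forest(TRAIN)
    for sentence in ["tu pagal hai", "kal match hai dost"]:
        print(sentence, "->", predict(forest, sentence))
```

With only eight sentences the predictions are not meaningful; the sketch only shows how bootstrap sampling, random feature selection, and voting combine. A realistic experiment would likely use a library implementation with deeper trees and richer features (for example character n-grams, which tend to tolerate the spelling variation of romanised Hindi), evaluated on a held-out split.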