Cover Image for System.Linq.Enumerable+EnumerablePartition`1[System.Char]

Arabic Authorship Attribution Using Synthetic Minority Over-Sampling Technique and Principal Components Analysis for Imbalanced Documents

OAI: oai:igi-global.com:273159 • DOI: 10.4018/IJCINI.20211001.oa33

Published by: IGI Global

Abstract

Nowadays, dealing with imbalanced data represents a great challenge in data mining as well as in machine learning task. In this investigation, we are interested in the problem of class imbalance in Authorship Attribution (AA) task, with specific application on Arabic text data. This article proposes a new hybrid approach based on Principal Components Analysis (PCA) and Synthetic Minority Over-sampling Technique (SMOTE), which considerably improve the performances of authorship attribution on imbalanced data. The used dataset contains 7 Arabic books written by 7 different scholars, which are segmented into text segments of the same size, with an average length of 2900 words per text. The obtained results of our experiments show that the proposed approach using the SMO-SVM classifier, presents high performance in terms of authorship attribution accuracy (100%), especially with starting character-bigrams. In addition, the proposed method appears quite interesting by improving the AA performances in imbalanced datasets, mainly with function words.

Cognitive Informatics Computer Science and Information Technology Cognitive Informatics