Source Code Authorship Identification Using Tokenization and Boosting Algorithms (article)
Citation information for this article was obtained from
Scopus
The article was published in a journal indexed in Web of Science and/or Scopus
Date of the last search for this article in external sources: August 4, 2021
Abstract: Every programmer has a unique coding style. Source code authorship identification addresses the problem of determining the most likely author of a piece of source code, with applications in plagiarism detection, intellectual property disputes, and tracing the creators of malware. Extracting a programmer's unique style also helps maintain code uniformity in repositories, taking into account the differing influence of individual programmers. Existing methods include approaches based on random forests and abstract syntax trees, on short n-grams for structure preservation, and on Bayes classifiers, among others. We present a new model, called StyleIndex, based on tokenization and on tools for analyzing programming language semantics and the context of tokens in the program text; it extracts a unique style index for each author. The algorithm applies to various programming languages and shows very high classification accuracy. Moreover, the algorithm can not only match source code to an author whose programs are available for training, but also group programs by presumed author after having been trained on other authors, thereby extracting the components that define style as a global concept, independent of specific authors. The main factors that determine style are also identified.
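To make the idea of tokenization with token context concrete, the following is a minimal sketch and not the StyleIndex model itself, whose details the abstract does not give. It assumes Python source as input and uses the standard-library tokenize module; the function names token_kinds and style_features, and the choice of unigram and bigram counts as features, are hypothetical illustrations. The boosting classifier that would consume such features is omitted.

```python
import io
import keyword
import tokenize
from collections import Counter


def token_kinds(source: str) -> list[str]:
    """Map Python source text to a sequence of coarse token labels.

    Operators and keywords are kept verbatim; identifiers, numbers,
    strings, comments and layout tokens are reduced to their token
    type name, so the labels reflect style rather than content.
    """
    kinds = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.ENDMARKER:
            continue
        if tok.type == tokenize.OP:
            kinds.append(tok.string)
        elif tok.type == tokenize.NAME and keyword.iskeyword(tok.string):
            kinds.append(tok.string)
        else:
            kinds.append(tokenize.tok_name[tok.type])
    return kinds


def style_features(source: str) -> Counter:
    """Count single labels and adjacent label pairs (a crude 'context')."""
    kinds = token_kinds(source)
    features = Counter(kinds)
    features.update(zip(kinds, kinds[1:]))
    return features
```

For example, token_kinds("x = 1\n") yields ['NAME', '=', 'NUMBER', 'NEWLINE']. Counts of this kind could be turned into fixed-length vectors and passed to any classifier; which features and which boosting algorithm the authors actually use is not stated in the abstract.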