Abstract:
Code cloning is prevalent in software development. Although it can improve short-term development efficiency, it ultimately degrades code quality, raises maintenance costs, and increases security risks. In enterprise environments in particular, some developers may deliberately submit duplicated code to score higher in R&D efficiency evaluations, which compromises both project integrity and the fairness of the evaluation. Traditional code clone detection methods suffer from low efficiency, while deep learning models incur high computational costs and depend on large amounts of labeled data. To address these limitations, this paper proposes a lightweight machine learning detection framework based on feature engineering. The framework generates feature vectors with pre-trained code models and constructs a multidimensional feature space by combining multiple similarity algorithms with the number of changed code lines and commit-timing metadata. Several machine learning models are evaluated on this feature space. Experimental results show that models trained on the constructed feature set perform strongly on multiple manually annotated enterprise datasets, enabling efficient and accurate code clone detection. These results also offer new insights for establishing credible R&D efficiency evaluation systems.
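One way to picture the multidimensional feature space is as a feature vector computed for each pair of submitted changes. The sketch below is illustrative only and is not the paper's implementation. The `Commit` record, the choice of cosine, Jaccard, and `difflib` similarity, and the way change size and commit timing are turned into numbers are all assumptions. The embeddings are assumed to have been precomputed by some pre-trained code model.

```python
import math
import re
from dataclasses import dataclass
from datetime import datetime
from difflib import SequenceMatcher
from typing import List


@dataclass
class Commit:
    """Hypothetical record for one submitted change (names are assumptions)."""
    code: str               # source text added by the commit
    embedding: List[float]  # vector from a pre-trained code model, assumed precomputed
    changed_lines: int      # number of lines changed by the commit
    timestamp: datetime     # commit time


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity of two embedding vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over the sets of word-like tokens in two snippets."""
    ta, tb = set(re.findall(r"\w+", a)), set(re.findall(r"\w+", b))
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def pair_features(x: Commit, y: Commit) -> list:
    """Build one feature vector for a pair of commits.

    The five features are an assumed example set: three similarity scores,
    the difference in changed-line counts, and the commit time gap in hours.
    """
    return [
        cosine(x.embedding, y.embedding),
        jaccard(x.code, y.code),
        SequenceMatcher(None, x.code, y.code).ratio(),
        abs(x.changed_lines - y.changed_lines),
        abs((x.timestamp - y.timestamp).total_seconds()) / 3600.0,
    ]
```

Vectors of this kind, paired with manually annotated clone or non-clone labels, could then be passed to any standard classifier. Keeping the features few and cheap to compute is what would make such a pipeline lightweight compared with training a deep model end to end.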