A feature engineering-based approach for code clone detection

  • Abstract: Code cloning is prevalent in software development. It can improve development efficiency in the short term, but over time it degrades code quality, raises maintenance costs, and increases security risk. In enterprise environments, some developers may deliberately submit duplicated code to score higher in R&D efficiency evaluations, which harms project quality and undermines the fairness of the evaluation. Traditional code clone detection methods are inefficient, while deep learning models incur high computational costs and depend on large amounts of labeled data. To address these limitations, a lightweight machine learning detection framework based on feature engineering is proposed. The framework generates feature vectors with pre-trained code models and builds a multidimensional feature space by combining multiple similarity algorithms, code change line information, and commit time series. The performance of several machine learning models is then evaluated on this feature set. Experimental results show that models trained on the constructed feature set perform well on multiple manually annotated enterprise datasets, detecting code clones efficiently and accurately and offering a new approach to building trustworthy R&D efficiency evaluation systems.
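As a rough illustration of the kind of multidimensional feature space the abstract describes, the sketch below builds one feature row for a pair of commits. It is not the authors' implementation. The `Commit` record, its field names, the choice of cosine, Jaccard, and `difflib` similarity, and the exact line-count and time-gap features are all assumptions made for this example. The embedding is assumed to have been computed beforehand by some pre-trained code model, which the abstract does not name.

```python
import math
import re
from dataclasses import dataclass
from datetime import datetime
from difflib import SequenceMatcher


@dataclass
class Commit:
    """Hypothetical record for one commit; all field names are illustrative."""
    code: str            # code text introduced by the commit
    embedding: list      # vector from a pre-trained code model (assumed precomputed)
    changed_lines: int   # number of changed lines in the commit
    timestamp: datetime  # commit time


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def jaccard(a, b):
    """Jaccard similarity over the sets of word-like tokens in two code strings."""
    ta, tb = set(re.findall(r"\w+", a)), set(re.findall(r"\w+", b))
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def pair_features(x, y):
    """Return one feature row describing how clone-like a pair of commits looks."""
    return [
        cosine(x.embedding, y.embedding),                    # semantic similarity
        jaccard(x.code, y.code),                             # token overlap
        SequenceMatcher(None, x.code, y.code).ratio(),       # character-level similarity
        min(x.changed_lines, y.changed_lines)
        / max(x.changed_lines, y.changed_lines, 1),          # changed-line count ratio
        abs((x.timestamp - y.timestamp).total_seconds()) / 3600.0,  # time gap in hours
    ]
```

Rows produced this way, paired with manually assigned clone or non-clone labels, could then be passed to any tabular classifier for training and comparison. The abstract does not state which models were evaluated.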

     
