A transformer-based approach for source code classification for heterogeneous device mapping

Marco Siino
2025-01-01

Abstract

The optimization of code allocation for heterogeneous architectures, such as Central Processing Units (CPUs) and Graphics Processing Units (GPUs), remains challenging due to the limitations of traditional compiler heuristics and existing machine learning approaches. This paper presents a systematic evaluation of Large Language Models (LLMs) for classifying source code execution targets in heterogeneous device mapping. We fine-tune and compare six models: Distilled Bidirectional Encoder Representations from Transformers (DistilBERT), Code Bidirectional Encoder Representations from Transformers (CodeBERT), Code Bidirectional Encoder Representations from Transformers with RoBERTa (Robustly Optimized BERT Pretraining Approach) architecture (CodeBERTa), CodeT5, jTrans, and Deep Learning Low Level Virtual Machine (DeepLLVM), trained on Open Computing Language (OpenCL) kernels. Results show that general-purpose LLMs achieve up to 92.8% accuracy, matching or surpassing code-specific models, and outperform the previous state of the art (DeepLLVM) by up to 5%. Our findings indicate that LLMs pre-trained on general text are not necessarily inferior to code-specialized models, with tokenizer design and pre-training objectives impacting performance more than domain specialization. These results demonstrate the effectiveness of Transformer-based LLMs as a state-of-the-art approach for source code classification in heterogeneous computing contexts.
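To make the described setup concrete, the sketch below shows one plausible way to fine-tune a Transformer encoder as a CPU/GPU device-mapping classifier for OpenCL kernel source, using the Hugging Face Transformers library. This is not the paper's exact pipeline: the microsoft/codebert-base checkpoint, the two toy kernels, and all hyperparameters (learning rate, epoch count, maximum sequence length) are illustrative assumptions standing in for the study's actual corpus and configuration.

    # Minimal sketch (assumptions noted above), not the paper's exact pipeline:
    # fine-tune a CodeBERT-style encoder to map OpenCL kernel source to a
    # CPU (label 0) or GPU (label 1) execution target.
    import torch
    from torch.utils.data import DataLoader, Dataset
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    MODEL_NAME = "microsoft/codebert-base"  # assumed public checkpoint
    LABELS = {0: "CPU", 1: "GPU"}

    # Toy stand-in for the OpenCL device-mapping corpus: (kernel source, label).
    TRAIN_SAMPLES = [
        ("__kernel void add(__global float* a, __global float* b) {"
         " int i = get_global_id(0); a[i] += b[i]; }", 1),
        ("__kernel void scalar(__global float* a) { a[0] = a[0] * 2.0f; }", 0),
    ]

    class KernelDataset(Dataset):
        """Tokenizes OpenCL kernel strings for sequence classification."""
        def __init__(self, samples, tokenizer, max_length=512):
            self.samples = samples
            self.tokenizer = tokenizer
            self.max_length = max_length

        def __len__(self):
            return len(self.samples)

        def __getitem__(self, idx):
            source, label = self.samples[idx]
            enc = self.tokenizer(source, truncation=True, padding="max_length",
                                 max_length=self.max_length, return_tensors="pt")
            item = {k: v.squeeze(0) for k, v in enc.items()}
            item["labels"] = torch.tensor(label)
            return item

    def main():
        tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
        model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME,
                                                                   num_labels=2)
        loader = DataLoader(KernelDataset(TRAIN_SAMPLES, tokenizer),
                            batch_size=2, shuffle=True)
        optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

        model.train()
        for epoch in range(3):  # epoch count is illustrative, not the paper's setting
            for batch in loader:
                optimizer.zero_grad()
                outputs = model(**batch)  # loss is computed from the "labels" key
                outputs.loss.backward()
                optimizer.step()
            print(f"epoch {epoch}: loss={outputs.loss.item():.4f}")

        # Inference: predict the execution target for a kernel.
        model.eval()
        enc = tokenizer(TRAIN_SAMPLES[0][0], truncation=True, return_tensors="pt")
        with torch.no_grad():
            pred = model(**enc).logits.argmax(dim=-1).item()
        print("predicted target:", LABELS[pred])

    if __name__ == "__main__":
        main()

The same skeleton applies to the other encoder-style models compared in the paper by swapping the checkpoint name; sequence-to-sequence models such as CodeT5 would instead require a classification head or a generation-based labeling scheme.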

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.11769/689933