
Fine-tuning codellama with own dataset for code review #257

Open
DAP5555 opened this issue Dec 15, 2024 · 0 comments

Comments


DAP5555 commented Dec 15, 2024

Hi :)
I am planning to fine-tune Code LLaMA 7B on my own dataset to evaluate its effectiveness for code review tasks. I came across a paper where the authors attempted a similar approach using data from CodeReviewer (https://arxiv.org/abs/2203.09095).

My plan is to gather data from the pull requests in our repositories and create JSON records in the following format, where "patch" holds the diff from the repo:

{
  "oldf": "... contents of another old file ...",
  "patch": "@@ -25,13 +25,16 @@ ...",
  "msg": "we call cities + towns ...",
  "id": 12959,
  "y": 1
}
Will this data format work for fine-tuning Code LLaMA, or will I need to convert the dataset to a different format for it to be compatible?
Thanks in advance!
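For illustration, one common approach (an assumption on my part, not a Code LLaMA requirement) is to flatten each record into a prompt/completion pair before supervised fine-tuning; the field names below follow the JSON sketch above, and the prompt template is entirely hypothetical:

```python
import json

# Hypothetical record in the CodeReviewer-style format described above.
record = {
    "oldf": "... contents of another old file ...",
    "patch": "@@ -25,13 +25,16 @@ ...",
    "msg": "we call cities + towns ...",
    "id": 12959,
    "y": 1,
}

def to_prompt_completion(rec):
    """Flatten one record into a (prompt, completion) pair for
    causal-LM fine-tuning. The template here is an assumption;
    any consistent template should work, as long as the same one
    is used at inference time."""
    prompt = (
        "Review the following code change and write a review comment.\n"
        f"Diff:\n{rec['patch']}\n"
        "Review:"
    )
    # Leading space so the completion tokenizes cleanly after "Review:".
    completion = " " + rec["msg"]
    return {"prompt": prompt, "completion": completion}

pair = to_prompt_completion(record)
print(json.dumps(pair, indent=2))
```

With a conversion like this, each PR becomes one training example, and the "y" label could be used to filter which records are kept.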
