Description
To address the long-standing challenge that traditional automated writing evaluation (AWE) systems face in assessing higher-order thinking, this study built an AWE system for scoring argumentative essays by finetuning the GPT-3.5 large language model and compared its effectiveness with that of the non-finetuned GPT-3.5 and GPT-4 base ("vanilla") models under zero-shot prompting. The dataset was the TOEFL Public Writing Dataset provided by Educational Testing Service, which contains 480 argumentative essays with ground-truth scores written in response to two essay prompts. Three finetuned models were generated: one finetuned exclusively on essays from each of the two prompts, and one finetuned on essays from both. After finetuning, all finetuned and base models scored the remaining essays, and their scoring effectiveness was evaluated against the ground-truth scores as the benchmark. The impact of the variety of finetuning prompts and the robustness of the finetuned models were also examined. Results showed 100% consistency for all models across two scoring sessions. More importantly, the finetuned models significantly outperformed the base models in accuracy and reliability. The best-performing model, finetuned on prompt 1, achieved an RMSE of 0.57, a percentage agreement (score discrepancy $\leq$ 0.5) of 84.72%, and a QWK of 0.78. Furthermore, the model finetuned on both prompts did not exhibit enhanced performance, and the two models finetuned on a single prompt remained robust when scoring essays from the other prompt. These results suggest that 1) task-specific finetuning for AWE is beneficial; 2) finetuning does not require a large variety of essay prompts; and 3) finetuned models are robust to essays from unseen prompts.
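For readers who want to see how the reported evaluation metrics are defined, the following is a minimal Python sketch of how RMSE, percentage agreement (score discrepancy $\leq$ 0.5), and quadratic weighted kappa (QWK) can be computed from paired ground-truth and model scores. The score scale, the example values, and the NumPy-based implementation are illustrative assumptions, not the study's actual code or data.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between ground-truth and model scores."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def percentage_agreement(y_true, y_pred, tol=0.5):
    """Share of essays (in %) whose score discrepancy is within `tol` points."""
    diff = np.abs(np.asarray(y_true, float) - np.asarray(y_pred, float))
    return float(np.mean(diff <= tol) * 100)

def quadratic_weighted_kappa(y_true, y_pred, labels):
    """QWK over a discrete, ordered score scale given by `labels`."""
    idx = {lab: i for i, lab in enumerate(labels)}
    n = len(labels)
    # Observed joint distribution of (true score, predicted score).
    observed = np.zeros((n, n))
    for t, p in zip(y_true, y_pred):
        observed[idx[t], idx[p]] += 1
    observed /= observed.sum()
    # Expected agreement under independence: outer product of the marginals.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0))
    # Quadratic disagreement weights, normalised by the maximum squared distance.
    weights = np.square(np.subtract.outer(np.arange(n), np.arange(n))) / (n - 1) ** 2
    return float(1.0 - (weights * observed).sum() / (weights * expected).sum())

# Hypothetical usage with an assumed 1-5 scale in half-point steps.
scale = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
ground_truth = [3.0, 4.0, 2.5, 5.0, 3.5]
model_scores = [3.5, 4.0, 2.0, 4.0, 3.5]
print(rmse(ground_truth, model_scores))
print(percentage_agreement(ground_truth, model_scores))
print(quadratic_weighted_kappa(ground_truth, model_scores, scale))
```

In this sketch, percentage agreement treats a half-point discrepancy as agreement, mirroring the $\leq$ 0.5 threshold reported above, while QWK penalises disagreements quadratically with their distance on the score scale.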
Keywords: Automated Writing Evaluation, Large Language Models, GPT-3.5, Finetuning, TOEFL Public Writing Dataset