Finetuning
How to use turkish-lm-tuner to finetune an encoder-decoder language model on the summarization task
turkish-lm-tuner is a library for finetuning language models on specific datasets. It is built on transformers and is designed to be easy to use, working with any encoder or encoder-decoder language model that transformers supports.
turkish-lm-tuner supports both encoder and encoder-decoder models. It offers wrappers for various task datasets (such as paraphrasing, text classification, and summarization) and includes evaluation metrics for these tasks that are easy to import and use.
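As a standalone illustration of the kind of metric typically reported for summarization, the sketch below computes ROUGE with the Hugging Face evaluate package; note that this uses evaluate directly rather than the library's own metric interface.
# Sketch: ROUGE for summarization via the Hugging Face `evaluate` package.
# This is a generic illustration, not the turkish-lm-tuner metric API itself.
import evaluate

rouge = evaluate.load("rouge")
predictions = ["Hükümet yeni ekonomi paketini açıkladı."]
references = ["Hükümet bugün yeni ekonomi paketini kamuoyuna açıkladı."]
print(rouge.compute(predictions=predictions, references=references))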
In this tutorial, we show how to use turkish-lm-tuner to finetune an encoder-decoder language model on the summarization task. We will use the TR News dataset to fine-tune the TURNA model.
Importing and processing the dataset
The library includes wrappers for various datasets. These wrappers can be used to import a dataset and preprocess it for finetuning.
For the summarization task, we will use the TR News dataset, which contains news articles and their summaries. We will use the wrapper for this dataset to import and preprocess it according to the task configuration.
from turkish_lm_tuner import DatasetProcessor
dataset_name = "tr_news"
task = "summarization"
task_mode = ''  # optional task mode token; one of '', '[NLU]', '[NLG]', '[S2S]'
task_format = "conditional_generation"
model_name = "boun-tabi-LMG/TURNA"
max_input_length = 764   # maximum number of input (article) tokens
max_target_length = 128  # maximum number of target (summary) tokens
dataset_processor = DatasetProcessor(
dataset_name=dataset_name, task=task, task_format=task_format, task_mode=task_mode,
tokenizer_name=model_name, max_input_length=max_input_length, max_target_length=max_target_length
)
train_dataset = dataset_processor.load_and_preprocess_data(split='train')
eval_dataset = dataset_processor.load_and_preprocess_data(split='validation')
test_dataset = dataset_processor.load_and_preprocess_data(split="test")
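To sanity-check the preprocessing, you can inspect a single example. Assuming the processor returns a tokenized Hugging Face dataset with input_ids and labels columns (an assumption about its output format), something like the following works:
# Sketch: peek at one preprocessed training example (assumes tokenized
# 'input_ids' and 'labels' columns are present).
example = train_dataset[0]
print(example.keys())
print(len(example["input_ids"]), len(example["labels"]))
print(dataset_processor.tokenizer.decode(example["input_ids"][:20]))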
Preparing finetuning parameters
training_params = {
    'num_train_epochs': 10,
    'per_device_train_batch_size': 4,
    'per_device_eval_batch_size': 4,
    'output_dir': './',
    'evaluation_strategy': 'epoch',
    'save_strategy': 'epoch',
    'predict_with_generate': True
}
optimizer_params = {
    'optimizer_type': 'adafactor',
    'scheduler': False,
}
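The keys in training_params follow the naming of transformers' Seq2SeqTrainingArguments; assuming the trainer forwards them to that class (an assumption about its internals), they correspond roughly to the following:
# Sketch: the training_params above use the same names as fields of
# transformers' Seq2SeqTrainingArguments (assumption: the trainer forwards
# them to this class internally).
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir='./',
    num_train_epochs=10,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    evaluation_strategy='epoch',
    save_strategy='epoch',
    predict_with_generate=True,
)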
Finetuning the model
from turkish_lm_tuner import TrainerForConditionalGeneration

model_save_path = "turna_summarization_tr_news"
model_trainer = TrainerForConditionalGeneration(
    model_name=model_name, task=task,
    training_params=training_params,
    optimizer_params=optimizer_params,
    model_save_path=model_save_path,
    max_input_length=max_input_length,
    max_target_length=max_target_length,
    postprocess_fn=dataset_processor.dataset.postprocess_data
)
trainer, model = model_trainer.train_and_evaluate(train_dataset, eval_dataset, test_dataset)
model.save_pretrained(model_save_path)
dataset_processor.tokenizer.save_pretrained(model_save_path)
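After training, the saved checkpoint can be loaded back with plain transformers for inference. A minimal sketch, assuming the finetuned TURNA checkpoint loads as a standard seq2seq model:
# Sketch: summarize a new article with the finetuned checkpoint (assumes it
# loads as a standard transformers seq2seq model).
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained(model_save_path)
model = AutoModelForSeq2SeqLM.from_pretrained(model_save_path)

article = "..."  # a Turkish news article
inputs = tokenizer(article, return_tensors="pt", truncation=True, max_length=max_input_length)
summary_ids = model.generate(**inputs, max_new_tokens=max_target_length)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))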