Finetuning
How to use turkish-lm-tuner to finetune an encoder-decoder language model on the summarization task
turkish-lm-tuner is a library for finetuning language models on specific datasets. It is built on transformers and is designed to be easy to use, working with any encoder or encoder-decoder language model that transformers supports.
turkish-lm-tuner supports both encoder and encoder-decoder models. It offers wrappers for various task datasets (such as paraphrasing, text classification, and summarization) and includes evaluation metrics for these tasks that are easy to import and use.
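As a standalone illustration of the kind of metric typically reported for summarization, the sketch below computes ROUGE with the Hugging Face evaluate package; note that this uses evaluate directly rather than the library's own metric interface.
# Sketch: ROUGE for summarization via the Hugging Face `evaluate` package.
# This is a generic illustration, not the turkish-lm-tuner metric API itself.
import evaluate

rouge = evaluate.load("rouge")
predictions = ["Hükümet yeni ekonomi paketini açıkladı."]
references = ["Hükümet bugün yeni ekonomi paketini kamuoyuna açıkladı."]
print(rouge.compute(predictions=predictions, references=references))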
In this tutorial, we show how to use turkish-lm-tuner to finetune an encoder-decoder language model on the summarization task. We will use the TR News dataset to fine-tune the TURNA model.
Importing and processing the dataset
The library includes wrappers for various datasets. These wrappers can be used to import a dataset and preprocess it for finetuning.
For the summarization task, we will use the TR News dataset, which contains news articles and their summaries. We will use the wrapper for this dataset to import and preprocess it according to the task configuration.
from turkish_lm_tuner import DatasetProcessor
dataset_name = "tr_news"
task = "summarization"
task_mode = ''  # optional task mode token; one of '', '[NLU]', '[NLG]', '[S2S]'
task_format = "conditional_generation"
model_name = "boun-tabi-LMG/TURNA"
max_input_length = 764   # maximum number of input (article) tokens
max_target_length = 128  # maximum number of target (summary) tokens
dataset_processor = DatasetProcessor(
dataset_name=dataset_name, task=task, task_format=task_format, task_mode=task_mode,
tokenizer_name=model_name, max_input_length=max_input_length, max_target_length=max_target_length
)
train_dataset = dataset_processor.load_and_preprocess_data(split='train')
eval_dataset = dataset_processor.load_and_preprocess_data(split='validation')
test_dataset = dataset_processor.load_and_preprocess_data(split="test")
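To sanity-check the preprocessing, you can inspect a single example. Assuming the processor returns a tokenized Hugging Face dataset with input_ids and labels columns (an assumption about its output format), something like the following works:
# Sketch: peek at one preprocessed training example (assumes tokenized
# 'input_ids' and 'labels' columns are present).
example = train_dataset[0]
print(example.keys())
print(len(example["input_ids"]), len(example["labels"]))
print(dataset_processor.tokenizer.decode(example["input_ids"][:20]))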
Preparing finetuning parameters
training_params = {
    'num_train_epochs': 10,
    'per_device_train_batch_size': 4,
    'per_device_eval_batch_size': 4,
    'output_dir': './',
    'evaluation_strategy': 'epoch',
    'save_strategy': 'epoch',
    'predict_with_generate': True
}
optimizer_params = {
    'optimizer_type': 'adafactor',
    'scheduler': False,
}
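The keys in training_params follow the naming of transformers' Seq2SeqTrainingArguments; assuming the trainer forwards them to that class (an assumption about its internals), they correspond roughly to the following:
# Sketch: the training_params above use the same names as fields of
# transformers' Seq2SeqTrainingArguments (assumption: the trainer forwards
# them to this class internally).
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir='./',
    num_train_epochs=10,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    evaluation_strategy='epoch',
    save_strategy='epoch',
    predict_with_generate=True,
)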
Finetuning the model
from turkish_lm_tuner import TrainerForConditionalGeneration

model_save_path = "turna_summarization_tr_news"
model_trainer = TrainerForConditionalGeneration(
    model_name=model_name, task=task,
    training_params=training_params,
    optimizer_params=optimizer_params,
    model_save_path=model_save_path,
    max_input_length=max_input_length,
    max_target_length=max_target_length,
    postprocess_fn=dataset_processor.dataset.postprocess_data
)
trainer, model = model_trainer.train_and_evaluate(train_dataset, eval_dataset, test_dataset)
model.save_pretrained(model_save_path)
dataset_processor.tokenizer.save_pretrained(model_save_path)
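After training, the saved checkpoint can be loaded back with plain transformers for inference. A minimal sketch, assuming the finetuned TURNA checkpoint loads as a standard seq2seq model:
# Sketch: summarize a new article with the finetuned checkpoint (assumes it
# loads as a standard transformers seq2seq model).
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained(model_save_path)
model = AutoModelForSeq2SeqLM.from_pretrained(model_save_path)

article = "..."  # a Turkish news article
inputs = tokenizer(article, return_tensors="pt", truncation=True, max_length=max_input_length)
summary_ids = model.generate(**inputs, max_new_tokens=max_target_length)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))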