Tmp: Difference between revisions

Fine Tune Large Language Model (LLM) on a Custom Dataset with QLoRA
Suman Das, Jan 25, 2024
Source: https://dassum.medium.com/fine-tune-large-language-model-llm-on-a-custom-dataset-with-qlora-fb60abdeba07
 
 
The field of natural language processing has been revolutionized by large language models (LLMs), which showcase advanced capabilities and sophisticated solutions. Trained on extensive text datasets, these models excel in tasks like text generation, translation, summarization, and question-answering. Despite their power, LLMs may not always align with specific tasks or domains.
 
In this tutorial, we will explore how fine-tuning LLMs can significantly improve model performance, reduce training costs, and enable more accurate and context-specific results.
What is LLM Fine-tuning?
 
Fine-tuning LLM involves the additional training of a pre-existing model, which has previously acquired patterns and features from an extensive dataset, using a smaller, domain-specific dataset. In the context of “LLM Fine-Tuning,” LLM denotes a “Large Language Model,” such as the GPT series by OpenAI. This approach holds significance as training a large language model from the ground up is highly resource-intensive in terms of both computational power and time. Utilizing the existing knowledge embedded in the pre-trained model allows for achieving high performance on specific tasks with substantially reduced data and computational requirements.
 
Below are some of the key steps involved in LLM Fine-tuning:
 
    Select a pre-trained model: The first step is to carefully select a base pre-trained model that aligns with the desired architecture and functionality. Pre-trained models are general-purpose models that have been trained on a large corpus of unlabeled data.
    Gather a relevant dataset: Next, we need a dataset that is relevant to our task, labeled or structured in a way that the model can learn from.
    Preprocess the dataset: Once the dataset is ready, we clean it, split it into training, validation, and test sets, and ensure it is compatible with the model we want to fine-tune.
    Fine-tuning: We then fine-tune the pre-trained model on the preprocessed dataset, which is specific to the task at hand. The dataset may relate to a particular domain or application, allowing the model to adapt and specialize for that context.
    Task-specific adaptation: During fine-tuning, the model’s parameters are adjusted based on the new dataset, helping it better understand and generate content relevant to the specific task. This process retains the general language knowledge gained during pre-training while tailoring the model to the nuances of the target domain.
 
Fine-tuning LLMs is commonly used in natural language processing tasks such as sentiment analysis, named entity recognition, summarization, translation, or any other application where understanding context and generating coherent language is crucial. It helps leverage the knowledge encoded in pre-trained models for more specialized and domain-specific tasks.
Fine-tuning methods
 
Fine-tuning a Large Language Model (LLM) involves a supervised learning process. In this method, a dataset comprising labeled examples is utilized to adjust the model’s weights, enhancing its proficiency in specific tasks. Now, let’s delve into some noteworthy techniques employed in the fine-tuning process.
 
    Full Fine Tuning (Instruction fine-tuning): Instruction fine-tuning is a strategy to enhance a model’s performance across various tasks by training it on examples that guide its responses to queries. The choice of the dataset is crucial and tailored to the specific task, such as summarization or translation. This approach, known as full fine-tuning, updates all model weights, creating a new version with improved capabilities. However, it demands sufficient memory and computational resources, similar to pre-training, to handle the storage and processing of gradients, optimizers, and other components during training.
    Parameter Efficient Fine-Tuning (PEFT) is a form of instruction fine-tuning that is much more efficient than full fine-tuning. Training a language model, especially for full LLM fine-tuning, demands significant computational resources. Memory allocation is not only required for storing the model but also for essential parameters during training, presenting a challenge for simple hardware. PEFT addresses this by updating only a subset of parameters, effectively “freezing” the rest. This reduces the number of trainable parameters, making memory requirements more manageable and preventing catastrophic forgetting. Unlike full fine-tuning, PEFT maintains the original LLM weights, avoiding the loss of previously learned information. This approach proves beneficial for handling storage issues when fine-tuning for multiple tasks. There are various ways of achieving Parameter efficient fine-tuning. Low-Rank Adaptation LoRA & QLoRA are the most widely used and effective.
 
What is LoRA?
 
LoRA is an improved finetuning method where instead of finetuning all the weights that constitute the weight matrix of the pre-trained large language model, two smaller matrices that approximate this larger matrix are fine-tuned. These matrices constitute the LoRA adapter. This fine-tuned adapter is then loaded into the pre-trained model and used for inference.
 
After LoRA fine-tuning for a specific task or use case, the outcome is an unchanged original LLM and the emergence of a considerably smaller “LoRA adapter,” often representing a single-digit percentage of the original LLM size (in MBs rather than GBs).
 
During inference, the LoRA adapter must be combined with its original LLM. The advantage lies in the ability of many LoRA adapters to reuse the original LLM, thereby reducing overall memory requirements when handling multiple tasks and use cases.
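
To make the low-rank idea concrete, here is a minimal PyTorch sketch of the update that LoRA learns (illustrative dimensions, not the actual peft implementation):

import torch

d, k, r, lora_alpha = 512, 512, 8, 16   # illustrative sizes, not Phi-2's real shapes
W = torch.randn(d, k)                   # frozen pre-trained weight matrix
A = torch.randn(r, k) * 0.01            # trainable LoRA matrix A (r x k)
B = torch.zeros(d, r)                   # trainable LoRA matrix B (d x r), initialised to zero

# Effective weight once the adapter is applied: only A and B, r * (d + k) values in total,
# are trained while W stays frozen.
W_eff = W + (lora_alpha / r) * (B @ A)
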
What is Quantized LoRA (QLoRA)?
 
QLoRA represents a more memory-efficient iteration of LoRA. It takes LoRA a step further by quantizing the frozen pre-trained weights to lower precision (4-bit instead of the 8-bit or 16-bit typically used with plain LoRA), while the LoRA adapter weights themselves remain in higher precision. This further reduces the memory footprint and storage requirements: in QLoRA, the pre-trained model is loaded into GPU memory with quantized 4-bit weights. Despite this reduction in bit precision, QLoRA maintains a level of effectiveness comparable to LoRA.
 
In this tutorial, we will use Parameter-efficient fine-tuning with QLoRA.
 
Now let’s explore how we can fine-tune LLM on a custom dataset using QLoRA on a single GPU.
 
    Setting up the Notebook
    Install required libraries
    Loading dataset
    Create Bitsandbytes configuration
    Loading the Pre-Trained model
    Tokenization
    Test the Model with Zero Shot Inferencing
    Pre-processing dataset
    Preparing the model for QLoRA
    Setup PEFT for Fine-Tuning
    Train PEFT Adapter
    Evaluate the Model Qualitatively (Human Evaluation)
    Evaluate the Model Quantitatively (with ROUGE Metric)
 
1. Setting up the Notebook
 
While we will utilize a Kaggle notebook for this demonstration, feel free to use any Jupyter notebook environment. Kaggle offers a generous allowance of 30 hours of free GPU usage per week, which is ample for our experimentation. To begin, let’s open a new notebook, establish some headings, and then proceed to connect to the runtime.


Here, we will select the GPU P100 as the ACCELERATOR. Feel free to try other GPU options available in Kaggle or any other environment.


In this tutorial, we will be using HuggingFace libraries to download and train the model. To download models from HuggingFace, we will need an Access Token. If you’ve already signed up with HuggingFace, you can generate a new Access Token from the settings section or use any existing Access Token.
2. Install required libraries


Now, let’s install the necessary libraries for this experiment.


!pip install -q -U bitsandbytes transformers peft accelerate datasets scipy einops evaluate trl rouge_score


Let’s understand the importance of some of these libraries.


    Bitsandbytes: An excellent package that provides a lightweight wrapper around custom CUDA functions that make LLMs go faster — optimizers, matrix multiplication, and quantization. In this tutorial, we’ll be using this library to load our model as efficiently as possible.
    transformers: A library by Hugging Face (🤗) that provides pre-trained models and training utilities for various natural language processing tasks.
    peft: A library by Hugging Face (🤗) that enables parameter-efficient fine-tuning.
    accelerate: Accelerate abstracts exactly and only the boilerplate code related to multi-GPU/TPU/fp16 training and leaves the rest of your code unchanged.
    datasets: Another library by Hugging Face (🤗) that provides easy access to a wide range of datasets.
    einops: A library that simplifies tensor operations.


Loading the required libraries


from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    Trainer,
    GenerationConfig
)
from tqdm import tqdm
from trl import SFTTrainer
import torch
import time
import pandas as pd
import numpy as np
from huggingface_hub import interpreter_login
interpreter_login()
For this tutorial we are not going to track our training metrics, so let’s disable Weights and Biases. The W&B Platform constitutes a fundamental collection of robust components for monitoring, visualizing data and models, and conveying the results. To deactivate Weights and Biases during the fine-tuning process, set the below environment property.
import os
# disable Weights and Biases
os.environ['WANDB_DISABLED']="true"
If you have an account with Weights and Biases, feel free to enable it and experiment with it.
3. Loading dataset
Numerous datasets are available for fine-tuning the model. In this instance, we will utilize the DialogSum DataSet from HuggingFace for the fine-tuning process. DialogSum is an extensive dialogue summarization dataset, featuring 13,460 dialogues along with manually labeled summaries and topics.
There is no specific reason for selecting this dataset. Feel free to try this experiment with any custom dataset.
Let’s execute the below code to load the above dataset from HuggingFace.
huggingface_dataset_name = "neil-code/dialogsum-test"
dataset = load_dataset(huggingface_dataset_name)
Once the dataset is loaded, we can take a look at it to understand what it contains:
(Screenshot: a sample row of the dataset)
It contains the below fields.
    dialogue: text of the dialogue.
    summary: human-written summary of the dialogue.
    topic: human-written topic/one-liner of the dialogue.
    id: unique file id of an example.
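
For example, we can peek at a single record to confirm these fields (a quick check using the dataset object loaded above):

sample = dataset['train'][0]
print(sample.keys())            # expect 'id', 'dialogue', 'summary', 'topic'
print(sample['dialogue'][:200])
print(sample['summary'])
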
4. Create Bitsandbytes configuration
To load the model, we need a configuration class that specifies how we want the quantization to be performed. We’ll be using BitsAndBytesConfig to load our model in 4-bit format. This will reduce memory consumption considerably, at a cost of some accuracy.
compute_dtype = getattr(torch, "float16")
bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type='nf4',
        bnb_4bit_compute_dtype=compute_dtype,
        bnb_4bit_use_double_quant=False,
    )
5. Loading the Pre-Trained model
Microsoft recently open-sourced the Phi-2, a Small Language Model(SLM) with 2.7 billion parameters. Here, we will use Phi-2 for the fine-tuning process. This language model exhibits remarkable reasoning and language understanding capabilities, achieving state-of-the-art performance among base language models.
Let’s now load Phi-2 using 4-bit quantization from HuggingFace.
model_name='microsoft/phi-2'
device_map = {"": 0}
original_model = AutoModelForCausalLM.from_pretrained(model_name,
                                                      device_map=device_map,
                                                      quantization_config=bnb_config,
                                                      trust_remote_code=True,
                                                      use_auth_token=True)
The model is loaded in 4-bit using the `BitsAndBytesConfig` from the bitsandbytes library. This is a part of the QLoRA process, which involves quantizing the pre-trained weights of the model to 4-bit and keeping them fixed during fine-tuning.
6. Tokenization
Now, let’s configure the tokenizer, incorporating left-padding to optimize memory usage during training.
tokenizer = AutoTokenizer.from_pretrained(model_name,trust_remote_code=True,padding_side="left",add_eos_token=True,add_bos_token=True,use_fast=False)
tokenizer.pad_token = tokenizer.eos_token
7. Test the Model with Zero Shot Inferencing
We will evaluate the base model that we loaded above using a few sample inputs.
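
Note that the cells below call a gen() helper that is not defined in this excerpt. A minimal sketch of such a helper, using the tokenizer configured above (the article’s actual implementation may differ), could be:

def gen(model, prompt, max_new_tokens=100):
    # Hypothetical helper: tokenize the prompt, generate greedily, return decoded strings.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            pad_token_id=tokenizer.eos_token_id,
        )
    return [tokenizer.decode(ids, skip_special_tokens=True) for ids in output_ids]
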
%%time
from transformers import set_seed
seed = 42
set_seed(seed)
index = 10
prompt = dataset['test'][index]['dialogue']
summary = dataset['test'][index]['summary']
formatted_prompt = f"Instruct: Summarize the following conversation.\n{prompt}\nOutput:\n"
res = gen(original_model,formatted_prompt,100,)
#print(res[0])
output = res[0].split('Output:\n')[1]
dash_line = '-'.join('' for x in range(100))
print(dash_line)
print(f'INPUT PROMPT:\n{formatted_prompt}')
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')
print(dash_line)
print(f'MODEL GENERATION - ZERO SHOT:\n{output}')
(Screenshot: base model output)
From the observation above, it’s evident that the model faces challenges in summarizing the dialogue compared to the baseline summary. However, it manages to extract essential information from the text, suggesting the potential for fine-tuning the model for the specific task at hand.
8. Pre-processing dataset
The dataset cannot be directly employed for fine-tuning. It is essential to format the prompt in a way that the model can comprehend. Referring to the HuggingFace model documentation, it is evident that a prompt needs to be generated using dialogue and summary in the specified format below.
(Image: prompt format)
We’ll create some helper functions to format our input dataset, ensuring its suitability for the fine-tuning process. Here, we need to convert the dialog-summary (prompt-response) pairs into explicit instructions for the LLM.
def create_prompt_formats(sample):
    """
    Format various fields of the sample ('instruction','output')
    Then concatenate them using two newline characters
    :param sample: Sample dictionary
    """
    INTRO_BLURB = "Below is an instruction that describes a task. Write a response that appropriately completes the request."
    INSTRUCTION_KEY = "### Instruct: Summarize the below conversation."
    RESPONSE_KEY = "### Output:"
    END_KEY = "### End"
   
    blurb = f"\n{INTRO_BLURB}"
    instruction = f"{INSTRUCTION_KEY}"
    input_context = f"{sample['dialogue']}" if sample["dialogue"] else None
    response = f"{RESPONSE_KEY}\n{sample['summary']}"
    end = f"{END_KEY}"
   
    parts = [part for part in [blurb, instruction, input_context, response, end] if part]
    formatted_prompt = "\n\n".join(parts)
    sample["text"] = formatted_prompt
    return sample
The above function can be used to convert our input into prompt format.
Now, we will use our model tokenizer to process these prompts into tokenized ones.
Our aim here is to generate input sequences with consistent lengths, which is beneficial for fine-tuning the language model by optimizing efficiency and minimizing computational overhead. It is essential to ensure that these sequences do not surpass the model’s maximum token limit.
from functools import partial
# SOURCE https://github.com/databrickslabs/dolly/blob/master/training/trainer.py
def get_max_length(model):
    conf = model.config
    max_length = None
    for length_setting in ["n_positions", "max_position_embeddings", "seq_length"]:
        max_length = getattr(model.config, length_setting, None)
        if max_length:
            print(f"Found max length: {max_length}")
            break
    if not max_length:
        max_length = 1024
        print(f"Using default max length: {max_length}")
    return max_length


def preprocess_batch(batch, tokenizer, max_length):
    """
    Tokenizing a batch
    """
    return tokenizer(
        batch["text"],
        max_length=max_length,
        truncation=True,
    )


# SOURCE https://github.com/databrickslabs/dolly/blob/master/training/trainer.py
def preprocess_dataset(tokenizer: AutoTokenizer, max_length: int,seed, dataset):
    """Format & tokenize it so it is ready for training
    :param tokenizer (AutoTokenizer): Model Tokenizer
    :param max_length (int): Maximum number of tokens to emit from tokenizer
    """
   
    # Add prompt to each sample
    print("Preprocessing dataset...")
    dataset = dataset.map(create_prompt_formats)#, batched=True)
   
    # Apply preprocessing to each batch of the dataset and remove the 'id', 'topic', 'dialogue', 'summary' columns
    _preprocessing_function = partial(preprocess_batch, max_length=max_length, tokenizer=tokenizer)
    dataset = dataset.map(
        _preprocessing_function,
        batched=True,
        remove_columns=['id', 'topic', 'dialogue', 'summary'],
    )


    # Filter out samples that have input_ids exceeding max_length
    dataset = dataset.filter(lambda sample: len(sample["input_ids"]) < max_length)

    # Shuffle dataset
    dataset = dataset.shuffle(seed=seed)

    return dataset


By utilizing these functions, our dataset will be prepared for the fine-tuning process!


## Pre-process dataset
max_length = get_max_length(original_model)
print(max_length)


train_dataset = preprocess_dataset(tokenizer, max_length, seed, dataset['train'])
eval_dataset = preprocess_dataset(tokenizer, max_length, seed, dataset['validation'])


9. Preparing the model for QLoRA


# 2 - Using the prepare_model_for_kbit_training method from PEFT
# Preparing the Model for QLoRA
original_model = prepare_model_for_kbit_training(original_model)


Here, the model is prepared for QLoRA training using the `prepare_model_for_kbit_training()` function. This function initializes the model for QLoRA by setting up the necessary configurations.
10. Setup PEFT for Fine-Tuning


Let us now define the LoRA config for Fine-tuning the base model.


from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training


config = LoraConfig(
    r=32, #Rank
    lora_alpha=32,
    target_modules=[
        'q_proj',
        'k_proj',
        'v_proj',
        'dense'
    ],
    bias="none",
    lora_dropout=0.05,  # Conventional
    task_type="CAUSAL_LM",
)


# 1 - Enabling gradient checkpointing to reduce memory usage during fine-tuning
original_model.gradient_checkpointing_enable()

peft_model = get_peft_model(original_model, config)
Note the rank (r) hyper-parameter, which defines the rank/dimension of the adapter to be trained. r is the rank of the low-rank matrix used in the adapters, which thus controls the number of parameters trained. A higher rank will allow for more expressivity, but there is a compute tradeoff.
 
alpha here is the scaling factor for the learned weights. The weight matrix is scaled by alpha/r, and thus a higher value for alpha assigns more weight to the LoRA activations.
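
As a rough sense of scale, assuming (hypothetically) a square 2560x2560 projection similar to Phi-2's attention layers:

d_out, d_in, r, lora_alpha = 2560, 2560, 32, 32
full_params = d_out * d_in               # 6,553,600 frozen weights in the projection
lora_params = r * (d_in + d_out)         # 163,840 trainable weights in A and B
scaling = lora_alpha / r                 # 1.0, the factor applied to B @ A
print(full_params, lora_params, scaling)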
 
Once everything is set up and the PEFT is prepared, we can use the print_trainable_parameters() helper function to see how many trainable parameters are in the model.
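
The print_number_of_trainable_model_parameters() helper used below is not defined in this excerpt; a rough reconstruction is sketched here (PEFT models also expose peft_model.print_trainable_parameters(), which reports comparable numbers):

def print_number_of_trainable_model_parameters(model):
    # Hypothetical reconstruction of the helper used in the next cell.
    trainable, total = 0, 0
    for _, param in model.named_parameters():
        total += param.numel()
        if param.requires_grad:
            trainable += param.numel()
    return (f"trainable model parameters: {trainable}\n"
            f"all model parameters: {total}\n"
            f"percentage of trainable model parameters: {100 * trainable / total:.2f}%")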
 
print(print_number_of_trainable_model_parameters(peft_model))
 
(Output: trainable parameters)
11. Train PEFT Adapter
 
Define training arguments and create Trainer instance.
 
output_dir = f'./peft-dialogue-summary-training-{str(int(time.time()))}'
import transformers
 
peft_training_args = TrainingArguments(
    output_dir = output_dir,
    warmup_steps=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    max_steps=1000,
    learning_rate=2e-4,
    optim="paged_adamw_8bit",
    logging_steps=25,
    logging_dir="./logs",
    save_strategy="steps",
    save_steps=25,
    evaluation_strategy="steps",
    eval_steps=25,
    do_eval=True,
    gradient_checkpointing=True,
    report_to="none",
    overwrite_output_dir=True,
    group_by_length=True,
)
 
peft_model.config.use_cache = False
 
peft_trainer = transformers.Trainer(
    model=peft_model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    args=peft_training_args,
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
 
Here, we have used 1000 training steps, which seems to be good enough for our custom dataset; it is worth trying several values before settling on a final number. The hyperparameters used above may also vary depending on the dataset and model being fine-tuned. This is only meant to demonstrate the fine-tuning process.
 
Let’s start the training now. Training the model will take some time depending upon the hyperparameters used in TrainingArguments.
 
peft_trainer.train()
 
Once the model is trained successfully, we can use it for inference. Let’s now prepare the inference model by adding an adapter to the original Phi-2 model. Here, we are setting is_trainable=False because the plan is only to perform inference with this PEFT model.
 
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
 
base_model_id = "microsoft/phi-2"
base_model = AutoModelForCausalLM.from_pretrained(base_model_id,
                                                      device_map='auto',
                                                      quantization_config=bnb_config,
                                                      trust_remote_code=True,
                                                      use_auth_token=True)
 
eval_tokenizer = AutoTokenizer.from_pretrained(base_model_id, add_bos_token=True, trust_remote_code=True, use_fast=False)
eval_tokenizer.pad_token = eval_tokenizer.eos_token
 
from peft import PeftModel
 
ft_model = PeftModel.from_pretrained(base_model, "/kaggle/working/peft-dialogue-summary-training-1705417060/checkpoint-1000",torch_dtype=torch.float16,is_trainable=False)
 
Fine-tuning is often an iterative process. Based on the validation and test sets results, we may need to make further adjustments to the model’s architecture, hyperparameters, or training data to improve its performance. Let’s now see how to evaluate the results of Fine-tuned LLM.
12. Evaluate the Model Qualitatively (Human Evaluation)
 
Now, let’s run the same kind of inference with the PEFT model that we ran previously in step 7 with the original model.
 
%%time
from transformers import set_seed
set_seed(seed)
 
index = 5
dialogue = dataset['test'][index]['dialogue']
summary = dataset['test'][index]['summary']
 
prompt = f"Instruct: Summarize the following conversation.\n{dialogue}\nOutput:\n"
 
peft_model_res = gen(ft_model,prompt,100,)
peft_model_output = peft_model_res[0].split('Output:\n')[1]
#print(peft_model_output)
prefix, success, result = peft_model_output.partition('###')
 
dash_line = '-'.join('' for x in range(100))
print(dash_line)
print(f'INPUT PROMPT:\n{prompt}')
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')
print(dash_line)
print(f'PEFT MODEL:\n{prefix}')
 
(Screenshot: PEFT model output)
13. Evaluate the Model Quantitatively (with ROUGE Metric)
 
ROUGE, or Recall-Oriented Understudy for Gisting Evaluation, is a set of metrics and a software package used for evaluating automatic summarization and machine translation in natural language processing. The metrics compare an automatically produced summary or translation against one or more human-produced reference summaries or translations.
 
Let’s now use the ROUGE metric to quantify the validity of summarizations produced by models. It compares summarizations to a “baseline” summary which is usually created by a human. While it’s not a perfect metric, it does indicate the overall increase in summarization effectiveness that we have accomplished by fine-tuning.
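
As a toy illustration of the metric itself (made-up strings, not taken from the dataset):

import evaluate

rouge = evaluate.load('rouge')
toy_scores = rouge.compute(
    predictions=["the cat sat on the mat"],
    references=["a cat was sitting on the mat"],
)
print(toy_scores)   # rouge1 / rouge2 / rougeL values between 0 and 1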
 
To demonstrate ROUGE metric evaluation, we will use a few sample inputs from the test set.
 
original_model = AutoModelForCausalLM.from_pretrained(base_model_id,
                                                      device_map='auto',
                                                      quantization_config=bnb_config,
                                                      trust_remote_code=True,
                                                      use_auth_token=True)
 
import pandas as pd
 
dialogues = dataset['test'][0:10]['dialogue']
human_baseline_summaries = dataset['test'][0:10]['summary']
 
original_model_summaries = []
instruct_model_summaries = []
peft_model_summaries = []
 
for idx, dialogue in enumerate(dialogues):
    human_baseline_text_output = human_baseline_summaries[idx]
    prompt = f"Instruct: Summarize the following conversation.\n{dialogue}\nOutput:\n"
   
    original_model_res = gen(original_model,prompt,100,)
    original_model_text_output = original_model_res[0].split('Output:\n')[1]
   
    peft_model_res = gen(ft_model,prompt,100,)
    peft_model_output = peft_model_res[0].split('Output:\n')[1]
    print(peft_model_output)
    peft_model_text_output, success, result = peft_model_output.partition('###')
 
    original_model_summaries.append(original_model_text_output)
    peft_model_summaries.append(peft_model_text_output)
 
zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries, peft_model_summaries))
df = pd.DataFrame(zipped_summaries, columns = ['human_baseline_summaries', 'original_model_summaries', 'peft_model_summaries'])
df
 
import evaluate
 
rouge = evaluate.load('rouge')
 
original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)
 
peft_model_results = rouge.compute(
    predictions=peft_model_summaries,
    references=human_baseline_summaries[0:len(peft_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)
 
print('ORIGINAL MODEL:')
print(original_model_results)
print('PEFT MODEL:')
print(peft_model_results)
 
print("Absolute percentage improvement of PEFT MODEL over ORIGINAL MODEL")
 
improvement = (np.array(list(peft_model_results.values())) - np.array(list(original_model_results.values())))
for key, value in zip(peft_model_results.keys(), improvement):
    print(f'{key}: {value*100:.2f}%')
 
(Output: ROUGE metric evaluation)
 
As we can see in the results above, the PEFT model shows a significant improvement over the original model, reported here as absolute percentage-point gains in each ROUGE score.
 
If you’d like to access the complete notebook, please refer to the repository below.
FineTune Phi-2 on Custom DataSet (Kaggle notebook): www.kaggle.com
Conclusion
 
Fine-tuning Large Language Models (LLMs) has become essential for enterprises seeking to optimize their operational processes. While the initial training of LLMs imparts a broad language understanding, the fine-tuning process refines these models into specialized tools capable of handling specific topics and providing more accurate results. Tailoring LLMs for distinct tasks, industries, or datasets extends the capabilities of these models, ensuring their relevance and value in a dynamic digital landscape. Looking ahead, ongoing exploration and innovation in LLMs, coupled with refined fine-tuning methodologies, are poised to advance the development of smarter, more efficient, and contextually aware AI systems.
References
microsoft/phi-2 · Hugging Face (huggingface.co)
Fine-tuning large language models (LLMs) in 2024 | SuperAnnotate (www.superannotate.com)
microsoft/phi-2 · How to fine-tune this? + Training code (huggingface.co)
Phi-2: The surprising power of small language models (www.microsoft.com)
While fine-tuning a decoder-only LLM like LLaMA on a chat dataset, what kind of padding should one use? (ai.stackexchange.com)
LoRA (huggingface.co)
ROUGE - a Hugging Face Space by evaluate-metric (huggingface.co)
GitHub - TimDettmers/bitsandbytes: Accessible large language models via k-bit quantization for PyTorch (github.com)
[[File:Linwinmac.jpg]]

Revision as of 01:56, 22 June 2025

AUTOMATED

  • Set variables:
#export resultDiff=~/resultDiff
export filesList=""
  • Execute:
mkdir -p ~/old &&\
curl https://infocepo.com/wiki/index.php/Special:Export/ResultDiff 2>/dev/null |tac |sed -r '0,/'"#"'24cc42#/d' |tac |sed -r '0,/'"#"'24cc42#/d' |sed 's/'"&"'amp;/\&/g;s/'"&"'gt;/>/g;s/'"&"'lt;/</g' >~/old/$$ &&\
bash ~/old/$$

code

#24cc42#
#!/usr/bin/env bash
# diff-multi-optimized.sh — multi‑file analysis & diff
# https://github.com/ynotopec/diff-multi
#
# Changes vs. original:
#   * Added usage & error reporting helpers
#   * Added -o to choose output dir, -k to keep temp
#   * Uses $(mktemp -d) once & avoids copy when hard‑link suffices
#   * Parallel (pigz) decompression when available
#   * Faster unique‑word extraction with LC_ALL=C grep + sort -u
#   * Reduces tmp files, pipes, and subshells
#   * Strict globbing (nullglob) & safe defaults
#   * POSIX‑portable where feasible
#
set -euo pipefail
shopt -s nullglob

IFS=$'\n\t'
LC_ALL=C

usage() {
  cat <<EOF
Usage: ${0##*/} [-o DIR] [-k] [FILE...]
  -o DIR   write results in DIR (default: ./diff-out)
  -k       keep temporary working directory
  FILE...  list of files to analyse (default: all plain files in cwd)
EOF
}

err() { printf 'Error: %s\n' "$*" >&2; exit 1; }

# --- options ---------------------------------------------------------------
OUTDIR="./diff-out"
KEEP_TMP=false

while getopts ":o:kh" opt; do
  case $opt in
    o) OUTDIR=$OPTARG ;;
    k) KEEP_TMP=true ;;
    h) usage; exit 0 ;;
    *) usage; exit 1 ;;
  esac
done
shift $((OPTIND-1))

# --- working directories ----------------------------------------------------
TMP_ROOT=$(mktemp -d -t diffmulti.XXXXXXXX)
trap '[[ $KEEP_TMP == true ]] || rm -rf "$TMP_ROOT"' EXIT INT TERM

FILES_DIR="$TMP_ROOT/files"
CACHE_DIR="$TMP_ROOT/cache"
mkdir -p "$FILES_DIR" "$CACHE_DIR" "$OUTDIR"

# --- gather input files -----------------------------------------------------
readarray -t INPUT_FILES < <(
  if [[ $# -gt 0 ]]; then printf '%s\n' "$@"
  else find . -maxdepth 1 -type f ! -name '.*' -print
  fi | sort -u
)

if [[ ${#INPUT_FILES[@]} -eq 0 ]]; then err "no files given"; fi

log() { printf '[%(%F %T)T] %s\n' -1 "$*"; }

log "Copying ${#INPUT_FILES[@]} file(s) to workspace"
# hard‑link instead of copy where possible
for f in "${INPUT_FILES[@]}"; do
  ln -f "$f" "$FILES_DIR/" 2>/dev/null || cp -p "$f" "$FILES_DIR/"
done

# --- decompress .gz ---------------------------------------------------------
gz_files=("$FILES_DIR"/*.gz)
if (( ${#gz_files[@]} )); then
  log "Decompressing ${#gz_files[@]} .gz file(s)"
  if command -v pigz >/dev/null; then
    pigz -d --keep --force "${gz_files[@]}"
  else
    gunzip --force "${gz_files[@]}"
  fi
fi

# --- unique words -----------------------------------------------------------
STAT_WORDS="$TMP_ROOT/statWords"
log "Extracting unique words"
grep -hoE '\b[[:alnum:]_]+\b' "$FILES_DIR"/* \
  | tr '[:upper:]' '[:lower:]' \
  | sort -u > "$STAT_WORDS"

mapfile -t uniq_words < "$STAT_WORDS"
# a line counts as "common" when it appears in more than half of the input files
trigger=$(( (${#INPUT_FILES[@]} + 1) / 2 ))
log "Trigger for common‑line filtering: > $trigger occurrence(s)"

# --- optional variable substitution ----------------------------------------
if [[ -f "$TMP_ROOT/statWords.vars" ]]; then
  log "Applying variable patterns from statWords.vars"
  cp -aT "$FILES_DIR" "$CACHE_DIR"
  while read -r var; do
    [[ $var ]] || continue
    sed -i -E "s/\b$var\b/\${${var}My}/g" "$CACHE_DIR"/*
  done < "$TMP_ROOT/statWords.vars"
else
  cp -aT "$FILES_DIR" "$CACHE_DIR"
fi

# --- filter frequent common lines ------------------------------------------
log "Computing over‑represented lines"
sort "$CACHE_DIR"/* \
  | uniq -c \
  | awk -v t="$trigger" '$1 > t { sub(/^[[:space:]]+[0-9]+[[:space:]]+/,""); print }' \
  > "$TMP_ROOT/comm"

# --- generate cleaned diffs -------------------------------------------------
log "Generating diffs in $OUTDIR"
for f in "$CACHE_DIR"/*; do
  base=${f##*/}
  grep -Fvxf "$TMP_ROOT/comm" "$f" > "$OUTDIR/$base"
  chmod --reference="$f" "$OUTDIR/$base"
done

log "Finished 🎉  Results in $OUTDIR"
#24cc42#

Summary from ChatGPT

This is a shell script written in Bash. The script starts by setting the "resultDiff" variable to a specific file path. The script then performs the following actions:

  1. Creates a directory called "old" in the home directory.
  2. Changes the current working directory to the "old" directory.
  3. Downloads a file from "https://infocepo.com/wiki/index.php/Special:Export/ResultDiff", filters the data, and saves it to a temporary file.
  4. Runs the temporary file.
  5. Returns to the previous working directory.

The second part of the code is a more complex shell script that performs multiple actions related to file analysis and comparison. The script does the following:

  1. Cleans up previous temporary files.
  2. Makes two directories, "analyse$$/files" and "analyse$$/diff".
  3. Copies all files from the "resultDiff" directory to the "analyse$$/files" directory, and unzips any ".gz" files.
  4. Generates a list of unique words from all the files in the "analyse$$/files" directory.
  5. Triggers an action if the number of files is above a certain value.
  6. Replaces the words in the list with a placeholder, "varMy".
  7. Compares the contents of all files in the "analyse$$/files" directory and creates a new file, "analyse$$/comm", with all common lines.
  8. Filters out the lines in "analyse$$/comm" that are not present in more than half of the files.
  9. Generates a "diff" file for each file in the "analyse$$/files" directory, showing the contents of the file and the missing common lines.
  10. Cleans up temporary files.