Fine-Tuning Data Preprocessing
✅ Your input datapoint
messages = [
    {"role": "system", "content": "You are an assistant responsible for classifying mental health status."},
    {"role": "user", "content": "I am depressed and want to die"},
    {"role": "assistant", "content": "Based on what you described, it's 'depression'"}
]
STEP 1 — Chat template formatting
tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
<bos><start_of_turn>system
You are an assistant responsible for classifying mental health status.<end_of_turn>
<start_of_turn>user
I am depressed and want to die<end_of_turn>
<start_of_turn>model
Based on what you described, it's 'depression'<end_of_turn>
STEP 2 — Build prompt only version
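One way to build the prompt-only version is to render every message except the final assistant reply and then append the opening of the model turn. The sketch below uses a hypothetical `render_chat` helper that only imitates the turn layout shown in STEP 1; it is not the tokenizer's real template. With a real tokenizer, the equivalent call would typically be `tokenizer.apply_chat_template(messages[:-1], tokenize=False, add_generation_prompt=True)`.

```python
# Hypothetical stand-in for tokenizer.apply_chat_template. It imitates the
# <start_of_turn>/<end_of_turn> layout from STEP 1 and nothing more.
ROLE_NAMES = {"system": "system", "user": "user", "assistant": "model"}

def render_chat(messages, add_generation_prompt=False):
    text = "<bos>"
    for message in messages:
        role = ROLE_NAMES[message["role"]]
        text += f"<start_of_turn>{role}\n{message['content']}<end_of_turn>\n"
    if add_generation_prompt:
        # Open the model turn so the prompt ends exactly where the reply begins.
        text += "<start_of_turn>model\n"
    return text

messages = [
    {"role": "system", "content": "You are an assistant responsible for classifying mental health status."},
    {"role": "user", "content": "I am depressed and want to die"},
    {"role": "assistant", "content": "Based on what you described, it's 'depression'"},
]

full_text = render_chat(messages)                                     # prompt + reply
prompt_text = render_chat(messages[:-1], add_generation_prompt=True)  # prompt only
completion_text = full_text[len(prompt_text):]                        # reply only

print(repr(completion_text))
```

The prompt-only string is a prefix of the full string, which is what later lets the prompt length be used as a masking boundary.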
STEP 3 — Tokenization
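Both the full text and the prompt-only text are tokenized so that the prompt length in tokens is known. The sketch below uses a toy regex tokenizer (`toy_tokenize`, a hypothetical helper) and shortened strings without the system turn; a real subword tokenizer produces different pieces and IDs. With a real tokenizer, the call would typically be `tokenizer(text, add_special_tokens=False)`, since the chat template has already inserted `<bos>`.

```python
import re

# Toy pattern: special markers, newlines, words, and single punctuation marks.
TOKEN_PATTERN = re.compile(r"<[a-z_]+>|\n|\w+|[^\w\s]")

def toy_tokenize(text, vocab):
    """Split text into pieces and map each piece to an integer ID."""
    pieces = TOKEN_PATTERN.findall(text)
    return [vocab.setdefault(piece, len(vocab)) for piece in pieces]

prompt_text = (
    "<bos><start_of_turn>user\n"
    "I am depressed and want to die<end_of_turn>\n"
    "<start_of_turn>model\n"
)
full_text = prompt_text + "it's 'depression'<end_of_turn>\n"

vocab = {}
full_ids = toy_tokenize(full_text, vocab)
prompt_ids = toy_tokenize(prompt_text, vocab)
prompt_len = len(prompt_ids)

# The prompt tokens should be an exact prefix of the full sequence.
assert full_ids[:prompt_len] == prompt_ids
print(len(full_ids), prompt_len)
```

The prefix check is worth keeping in real pipelines too, because subword tokenizers do not always split a prefix the same way they split it inside a longer string.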
STEP 4 — Build initial labels
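The initial labels are usually just a copy of the input IDs, since a causal language model is trained to reproduce its own sequence. A minimal sketch with made-up token IDs follows; with PyTorch tensors the same step would typically be `labels = input_ids.clone()`.

```python
full_ids = [2, 10, 11, 12, 13, 20, 21, 1]  # hypothetical token IDs

input_ids = list(full_ids)
labels = list(full_ids)  # independent copy, so masking labels later leaves input_ids intact

print(labels == input_ids, labels is input_ids)
```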
STEP 5 — Mask the prompt tokens
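Masking replaces the label at every prompt position with an ignore value, so that only the assistant reply contributes to the loss. The sketch below assumes the common convention of `-100`, which is the default `ignore_index` of `torch.nn.CrossEntropyLoss`; the token IDs and the `mask_prompt` helper are hypothetical.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def mask_prompt(labels, prompt_len, ignore_index=IGNORE_INDEX):
    """Return a copy of labels with the first prompt_len positions ignored."""
    return [ignore_index] * prompt_len + list(labels[prompt_len:])

full_ids = [2, 10, 11, 12, 13, 20, 21, 1]  # hypothetical IDs: 5 prompt tokens + 3 reply tokens
prompt_len = 5

labels = mask_prompt(full_ids, prompt_len)
print(labels)  # [-100, -100, -100, -100, -100, 20, 21, 1]
```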
STEP 6 — Shift tokens for next-token prediction
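The shift pairs the token at position t with the token at position t+1 as its target. A minimal sketch with hypothetical IDs is below. Hugging Face causal LM classes generally apply this shift internally when `labels` is passed to the model, so the explicit version here is for illustration rather than something to add on top.

```python
def shift_for_next_token(input_ids, labels):
    """Drop the last input and the first label so position t predicts token t+1."""
    return input_ids[:-1], labels[1:]

input_ids = [2, 10, 11, 12, 13, 20, 21, 1]
labels = [-100, -100, -100, -100, -100, 20, 21, 1]

shifted_inputs, shifted_labels = shift_for_next_token(input_ids, labels)
print(shifted_inputs)  # [2, 10, 11, 12, 13, 20, 21]
print(shifted_labels)  # [-100, -100, -100, -100, 20, 21, 1]
```

Both lists end up one element shorter than the original sequence, and the last prompt position now carries the first reply token as its target.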
🎯 After shifting:
input_ids | labels
Position | input | label | Meaning
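A per-position view with these columns can be generated from any tokenized example. The sketch below uses a shortened, hypothetical token list (strings instead of IDs, for readability) and prints one row per position after masking and shifting.

```python
IGNORE_INDEX = -100

# Hypothetical, shortened example: 7 prompt tokens followed by 2 reply tokens.
tokens = ["<bos>", "<start_of_turn>", "user", "sad", "<end_of_turn>",
          "<start_of_turn>", "model", "depression", "<end_of_turn>"]
prompt_len = 7

# Mask the prompt, then shift: drop the last input and the first label.
labels = [IGNORE_INDEX if i < prompt_len else tok for i, tok in enumerate(tokens)]
inputs, targets = tokens[:-1], labels[1:]

rows = []
print(f"{'Position':<9}{'input':<17}{'label':<17}Meaning")
for pos, (inp, tgt) in enumerate(zip(inputs, targets)):
    meaning = "masked, no loss" if tgt == IGNORE_INDEX else f"predict {tgt!r}"
    rows.append((pos, inp, tgt, meaning))
    print(f"{pos:<9}{inp:<17}{tgt!s:<17}{meaning}")
```

Only the last two rows carry a real target: the `model` marker predicts the first reply token, and the reply token predicts the closing `<end_of_turn>`.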
How data looks before and after dropping
✅ What exactly is dropped from input_ids?
👉 The last token of the full sequence.
✔️ Dropped token = the last token (usually EOS / end-of-turn).
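❗ WHY do we drop the last token?
Nothing follows the last token, so there is no next token for the model to predict from that position.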
✅ What exactly is dropped from labels?
✔️ Dropped token = the first token (which matches input_ids[0])
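Both drops fall out of pairing each token with its successor. The short sketch below, using a hypothetical token sequence, lists the (context, target) pairs and checks that the last token is never used as an input and the first token is never used as a target.

```python
sequence = ["<bos>", "it", "is", "depression", "<end_of_turn>"]  # hypothetical tokens

# Pair every token with the one that follows it.
pairs = list(zip(sequence[:-1], sequence[1:]))
for context_token, target_token in pairs:
    print(f"{context_token!r} -> {target_token!r}")

inputs_used = [context for context, _ in pairs]
targets_used = [target for _, target in pairs]

assert sequence[-1] not in inputs_used   # the last token has no successor to predict
assert sequence[0] not in targets_used   # the first token has no predecessor to predict it
```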
🔍 Summary Table — EXACTLY What Gets Dropped
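Operation | From Where? | What Gets Dropped? | Why?
input_ids[:-1] | input_ids | The last token (usually EOS / end-of-turn) | Nothing follows it, so it has no target to predict
labels[1:] | labels | The first token (matches input_ids[0]) | Nothing precedes it, so no position predicts it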
⚡ Visual Alignment
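The alignment can be printed as two rows in which each column holds an input token above the target it is trained to predict. The `align` helper and the placeholder tokens below are hypothetical and only meant as a visual aid.

```python
def align(tokens):
    """Return two text rows pairing each input token with its target."""
    inputs, targets = tokens[:-1], tokens[1:]
    widths = [max(len(a), len(b)) for a, b in zip(inputs, targets)]
    row_in = " ".join(a.ljust(w) for a, w in zip(inputs, widths))
    row_out = " ".join(b.ljust(w) for b, w in zip(targets, widths))
    return "input : " + row_in.rstrip(), "target: " + row_out.rstrip()

top, bottom = align(["A", "B", "C", "D", "<eos>"])
print(top)     # input : A B C D
print(bottom)  # target: B C D <eos>
```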