Fine-Tuning Data Preprocessing
Below is a complete, explicit, token-level walkthrough of what happens to your data point as it moves through every step of your preprocessing pipeline.
This will show why shifting is required, what masking does, and how the final input_ids and labels look.
✅ Your input datapoint
messages = [
{"role": "system", "content": "You are a assistant responsible for classifying mental health status."},
{"role": "user", "content": "I am depressed and want to die"},
{"role": "assistant", "content": "based on what you described it's depression'"}
]Your training job’s goal:
Given the system + user message, predict the assistant message.
STEP 1 — Chat template formatting
Using:
tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
For Gemma-style templates, the output becomes roughly:
<bos><start_of_turn>system
You are an assistant responsible for classifying mental health status.<end_of_turn>
<start_of_turn>user
I am depressed and want to die<end_of_turn>
<start_of_turn>model
based on what you described it's 'depression'<end_of_turn>
Then you remove <bos>:
<start_of_turn>system
You are...<end_of_turn>
<start_of_turn>user
I am...<end_of_turn>
<start_of_turn>model
based on...<end_of_turn>
STEP 2 — Build prompt-only version
We take all messages except the last assistant message:
prompt_messages = messages[:-1]
The template becomes:
<start_of_turn>system
You are ...<end_of_turn>
<start_of_turn>user
I am depressed ...<end_of_turn>
<start_of_turn>model
Because:
add_generation_prompt=True
Gemma appends the assistant-turn opener (<start_of_turn>model) at the end, so the rendered prompt ends exactly where generation should begin.
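As a concrete (hedged) sketch of STEP 1 and STEP 2, assuming a Hugging Face tokenizer whose chat template emits the markers shown above; the model id is a placeholder, not a recommendation:

```python
from transformers import AutoTokenizer

# Placeholder model id; any chat model with a Gemma-style template behaves similarly.
tokenizer = AutoTokenizer.from_pretrained("your-gemma-style-model")

# STEP 1: render the full conversation (system + user + assistant) as one string.
full_text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=False
)

# STEP 2: render only system + user, asking for the generation prompt so the
# string ends with the "<start_of_turn>model" opener.
prompt_text = tokenizer.apply_chat_template(
    messages[:-1], tokenize=False, add_generation_prompt=True
)

# Remove the leading <bos> from both renderings, as described above.
if tokenizer.bos_token:
    full_text = full_text.removeprefix(tokenizer.bos_token)
    prompt_text = prompt_text.removeprefix(tokenizer.bos_token)
```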
STEP 3 — Tokenization
Assume the tokenizer produces something like:
full_tokenized["input_ids"]:
[101, 4001, 22, ... , 9002, 17, 55, 81, 8821, 902, 5] # ← includes the assistant answer
prompt_tokenized["input_ids"]:
[101, 4001, 22, ... , 9002, 17] # ← stops before assistant content
Let's illustrate with symbolic tokens for clarity:
Full input_ids:
[F_sys, "You", ..., F_user, "I am...", F_assist, "based", "on", "what", ..., "depression", EOS]
Prompt input_ids:
[F_sys, "You", ..., F_user, "I am...", F_assist]
Length examples:
full = 100 tokens
prompt = 60 tokens
STEP 4 — Build initial labels
Before masking:
labels = [
F_sys, "You", ..., F_assist, "based", "on", ..., "depression", EOS
]
STEP 5 — Mask the prompt tokens
You mask the first prompt_length tokens using IGNORE_INDEX (−100):
labels (before shift):
[ -100, -100, -100, ... -100, "based", "on", "what", ..., "depression", EOS]
↑ ↑
first 60 masked          assistant answer kept
So only the assistant message tokens remain as supervised labels.
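A minimal sketch of STEPs 3-5, continuing from full_text and prompt_text above (add_special_tokens=False and the IGNORE_INDEX name are assumptions of this sketch, not requirements of any particular library):

```python
IGNORE_INDEX = -100  # positions carrying this value are ignored by the loss

# STEP 3: tokenize both renderings; the chat template already contains the
# special turn markers, so we avoid adding another <bos> here.
full_tokenized = tokenizer(full_text, add_special_tokens=False)
prompt_tokenized = tokenizer(prompt_text, add_special_tokens=False)
prompt_len = len(prompt_tokenized["input_ids"])   # e.g. 60 in the example above

# STEP 4: labels start out as a copy of the full token sequence.
labels = list(full_tokenized["input_ids"])

# STEP 5: mask every prompt position so only the assistant answer is supervised.
labels[:prompt_len] = [IGNORE_INDEX] * prompt_len
```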
STEP 6 — Shift tokens for next-token prediction
This is the important part.
You perform:
input_ids = full_tokenized["input_ids"][:-1]
labels = labels[1:]
Let's inspect what this does.
🎯 After shifting:
input_ids
Drop the last token:
[F_sys, "You", ..., "depression"]labels
Drop the first token:
[-100, -100, ... -100, "based", "on", ..., EOS]
Now positions align like this:
input_ids[i] → model predicts → labels[i]
Example:
| Position i | input_ids[i] | labels[i] | What the model learns |
| --- | --- | --- | --- |
| 0 | <start_of_turn> | IGNORE | Don't learn |
| 1 | "You" | IGNORE | Don't learn |
| ... | ... | ... | ... |
| 59 | <assistant> | "based" | Learns to output the start of the assistant answer |
| 60 | "based" | "on" | Learns the next token |
| 61 | "on" | "what" | Next-token prediction |
| ... | ... | ... | ... |
| 90 | "depression" | EOS | Completes the answer |
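Continuing the sketch from the earlier steps (full_tokenized, labels, prompt_len and IGNORE_INDEX as defined above), the shift is just two slices; printing a few positions around the prompt/answer boundary reproduces the table:

```python
# STEP 6: shift for next-token prediction.
input_ids = full_tokenized["input_ids"][:-1]   # drop the last token (EOS / end-of-turn)
labels = labels[1:]                            # drop the first label
assert len(input_ids) == len(labels)

# Inspect the boundary between the masked prompt and the supervised answer.
for i in range(max(0, prompt_len - 2), min(len(input_ids), prompt_len + 3)):
    inp = tokenizer.decode([input_ids[i]])
    lab = "IGNORE" if labels[i] == IGNORE_INDEX else tokenizer.decode([labels[i]])
    print(f"pos {i:>3}: input={inp!r:<15} label={lab!r}")
```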
How data looks before and after dropping
Let's be very explicit about what exactly gets dropped when we do:
input_ids = full_tokenized["input_ids"][:-1]
labels = labels[1:]
This operation drops one token from input_ids and one token from labels, but from opposite ends.
Below is the clearest possible explanation.
✅ What exactly is dropped from input_ids?
👉 The last token of the full sequence.
Example full sequence (symbolic tokens):
full_tokenized["input_ids"] =
[
F_sys, "You", ..., "depression", EOS
]
EOS = end-of-sequence token (or <end_of_turn>)
When you do:
input_ids = full_tokenized["input_ids"][:-1]
You get:
input_ids =
[
F_sys, "You", ..., "depression"
]
✔️ Dropped token = the last token (usually EOS / end-of-turn).
❗ WHY do we drop the last token?
Because for next-token prediction:
At each position i, the model predicts the token at position i + 1 from the inputs up to and including position i.
But there is no next token after EOS. So we remove the last input position to keep input and labels aligned.
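A toy illustration of this, using symbolic strings in place of real token ids:

```python
# Five symbolic "tokens"; at position i the model is trained to predict position i + 1,
# so the final EOS position has nothing left to predict.
full = ["F_sys", "You", "F_assist", "based", "EOS"]

inputs = full[:-1]    # ["F_sys", "You", "F_assist", "based"]
targets = full[1:]    # ["You", "F_assist", "based", "EOS"]

for inp, tgt in zip(inputs, targets):
    print(f"{inp!r:>10} -> {tgt!r}")
```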
✅ What exactly is dropped from labels?
When you shift labels:
labels = labels[1:]
You remove the first token from the label sequence.
Before shifting:
labels_before_shift =
[
-100, -100, ... -100,             # masked prompt tokens
"based", "on", ..., "depression", EOS
]
After shifting:
labels_after_shift =
[
-100, ... -100, "based", "on", ..., EOS
]
✔️ Dropped token = the first label (the one that was aligned with input_ids[0] before the shift)
Before the shift, labels[0] sits at the same position as input_ids[0], but nothing in the sequence predicts the very first token; there is no earlier context to predict it from.
You remove it so that:
input_ids[i] aligns with labels[i]
🔍 Summary Table — EXACTLY What Gets Dropped
| Operation | Dropped from | Dropped token | Why |
| --- | --- | --- | --- |
| input_ids[:-1] | end of input_ids | the last token (usually EOS) | no next token to predict after EOS |
| labels[1:] | start of labels | the first label token | nothing predicts the first token; dropping it realigns labels[i] with input_ids[i] |
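To see why the masked −100 positions are harmless at training time, here is a small sketch assuming a PyTorch-style cross-entropy loss; ignore_index=-100 simply skips those positions when the loss is averaged (the logits below are random placeholders, not real model outputs):

```python
import torch
import torch.nn.functional as F

vocab_size, seq_len = 32, 6   # toy sizes for illustration only

logits = torch.randn(seq_len, vocab_size)             # stand-in for model outputs
labels = torch.tensor([-100, -100, -100, 7, 12, 3])   # 3 masked prompt positions, 3 answer tokens

# Only the last three positions contribute; the -100 positions are skipped entirely.
loss = F.cross_entropy(logits, labels, ignore_index=-100)
print(loss.item())
```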
⚡ Visual Alignment
Before:
input_ids: [t0, t1, t2, t3, EOS]
labels:    [-100, -100, "based", ..., EOS]
After shifting:
input_ids: [t0, t1, t2, t3]       ← last token dropped
labels:    [-100, "based", ...]   ← first label dropped
Now:
input_ids[i] → predicts → labels[i]
Perfect alignment.
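Putting everything together, one possible end-to-end preprocessing function under the same assumptions as the sketches above (Hugging Face chat-template tokenizer, IGNORE_INDEX = −100); treat it as a starting point, not a canonical implementation:

```python
IGNORE_INDEX = -100

def preprocess(messages, tokenizer):
    """Turn one conversation into shifted input_ids / labels for causal-LM fine-tuning."""
    # Render the full conversation and the prompt-only version (system + user).
    full_text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=False
    )
    prompt_text = tokenizer.apply_chat_template(
        messages[:-1], tokenize=False, add_generation_prompt=True
    )

    # Drop the leading <bos> if the template added one.
    if tokenizer.bos_token:
        full_text = full_text.removeprefix(tokenizer.bos_token)
        prompt_text = prompt_text.removeprefix(tokenizer.bos_token)

    # Tokenize; the rendered text already contains the turn markers.
    full_ids = tokenizer(full_text, add_special_tokens=False)["input_ids"]
    prompt_len = len(tokenizer(prompt_text, add_special_tokens=False)["input_ids"])

    # Mask the prompt, then shift for next-token prediction.
    labels = [IGNORE_INDEX] * prompt_len + full_ids[prompt_len:]
    return {
        "input_ids": full_ids[:-1],   # drop the last token (EOS / end-of-turn)
        "labels": labels[1:],         # drop the first label
    }
```

Note that some training stacks apply this one-position shift internally when computing the loss; if yours does, keep input_ids and labels unshifted and at full length, and let the framework handle the offset.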