Fine-tuning works best when the task is narrow.
You do not need a small open-weight model to beat Claude Opus 4.7 at everything. That is a losing fight, and it frames the open-source vs proprietary tradeoff the wrong way.
You need it to become strong at one job:
Translate English novel passages into clean, natural Swedish.
This guide shows a practical pipeline using:
| Part | Choice |
|---|---|
| Student model | google/gemma-4-E4B-it |
| Fine-tuning tool | Unsloth |
| Method | LoRA |
| Dataset | Helsinki-NLP/opus_books, en-sv |
| Metrics | BLEU, chrF, COMET |
| Judge model | Claude Opus 4.7 |
| Export format | LoRA adapter or GGUF |
The idea is simple:
- Train Gemma 4 E4B on English-Swedish book data you are legally allowed to use.
- Evaluate it with automatic metrics.
- Use Claude Opus 4.7 as a reviewer.
- Improve the dataset.
- Train again.
This is not direct Claude distillation.
Anthropic does not allow using Claude outputs to train or develop another AI model unless you have written permission.
So we will not train Gemma on Claude-generated Swedish translations.
We will use Claude Opus 4.7 as a judge, not as the source of training labels.
That is the safer setup.
What We Are Building
Input:
He looked at the old house and felt that something was waiting inside.
Expected Swedish output:
Han såg på det gamla huset och kände att något väntade där inne.
A good novel translation is not word-for-word.
The model must learn:
| Skill | Meaning |
|---|---|
| Accuracy | Keep the original meaning |
| Tone | Preserve the mood |
| Swedish fluency | Sound natural to Swedish readers |
| Dialogue | Keep speech readable |
| Names | Keep character names unchanged |
| Consistency | Translate repeated terms the same way |
This is why literary translation needs a clean dataset and real evaluation.
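Some of these skills can be checked mechanically. As a minimal sketch for name preservation, assume you keep a list of character names per book; `missing_names` is a hypothetical helper, not part of the scripts in this guide.

```python
import re


def missing_names(translation: str, names: list[str]) -> list[str]:
    """Return the names that do not appear as whole words in the translation."""
    missing = []
    for name in names:
        # \b keeps "Ann" from matching inside "Anna"; the optional "s"
        # allows the Swedish genitive ("Darcys").
        pattern = rf"\b{re.escape(name)}s?\b"
        if not re.search(pattern, translation):
            missing.append(name)
    return missing


good = "Elizabeth såg på Darcys gamla hus."
bad = "Elisabet såg på det gamla huset."
print(missing_names(good, ["Elizabeth", "Darcy"]))  # []
print(missing_names(bad, ["Elizabeth", "Darcy"]))   # ['Elizabeth', 'Darcy']
```

A check like this only catches dropped or respelled names. Tone and fluency still need the metrics and review steps later in the guide.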
Important Model Facts
Gemma 4 E4B is an open-weight instruction model from Google.
The official Gemma 4 model card says the small Gemma 4 models use a 128K context window, while the medium models support 256K.
E4B belongs to the small model group.
Unsloth supports Gemma 4 E2B, E4B, 26B-A4B, and 31B fine-tuning.
Unsloth’s E4B LoRA fine-tuning guide quotes around 17GB VRAM; real usage shifts with sequence length, batch size, LoRA rank, and precision, so treat it as a starting point.
Use a Linux machine with an NVIDIA GPU for this tutorial.
A Mac can run local models, but this Unsloth training flow is meant for CUDA.
Dataset
Use:
Helsinki-NLP/opus_books
Config:
en-sv
OPUS Books contains aligned book text.
It is useful for learning, research, and prototyping.
For a real product, use your own licensed English novels and licensed Swedish translations.
Do not build a commercial translation product on a dataset unless you have checked the data rights.
Project Structure
gemma-sv-translator/
  data/
    train.jsonl
    valid.jsonl
    test.jsonl
  scripts/
    prepare_data.py
    train_unsloth.py
    translate_test.py
    eval_metrics.py
    opus_judge.py
  outputs/
Install packages:
pip install --upgrade --force-reinstall --no-cache-dir unsloth unsloth_zoo
pip install -U datasets transformers trl peft accelerate bitsandbytes
pip install -U sacrebleu unbabel-comet anthropic pandas torch
For reproducibility, pin package versions after your first successful run. Unsloth and TRL move fast, and a working requirements.txt will save you a debugging session later.
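One way to capture the pins is to read the installed versions of just the packages listed above. This is a sketch using the standard library's importlib.metadata; the script name in the usage note is hypothetical, and `pip freeze` is the more common alternative if you want every transitive dependency as well.

```python
from importlib.metadata import PackageNotFoundError, version

# Package names taken from the install commands above.
PACKAGES = [
    "unsloth", "unsloth_zoo", "datasets", "transformers", "trl", "peft",
    "accelerate", "bitsandbytes", "sacrebleu", "unbabel-comet",
    "anthropic", "pandas", "torch",
]


def installed_versions(names: list[str]) -> dict[str, str]:
    """Look up the installed version of each package, skipping missing ones."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            print(f"# not installed, skipping: {name}")
    return found


def format_pins(versions: dict[str, str]) -> str:
    """Render 'name==version' lines in requirements.txt style."""
    return "".join(f"{name}=={ver}\n" for name, ver in sorted(versions.items()))


print(format_pins(installed_versions(PACKAGES)), end="")
```

Saved as, say, scripts/pin_versions.py, it can be run as `python scripts/pin_versions.py > requirements.txt` after a successful training run.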
Log in to Hugging Face:
huggingface-cli login
You may need to accept the Gemma license on Hugging Face before downloading the model.
Step 1: Prepare the Dataset
Create:
scripts/prepare_data.py
from datasets import load_dataset
from pathlib import Path
import json
import random
import re

OUT_DIR = Path("data")
OUT_DIR.mkdir(exist_ok=True)

SYSTEM_PROMPT = (
    "You are a professional literary translator. "
    "Translate English novel passages into natural Swedish. "
    "Preserve meaning, tone, dialogue, names, and formatting."
)

def clean_text(text: str) -> str:
    text = text.strip()
    text = re.sub(r"\s+", " ", text)
    return text

def is_good_pair(en: str, sv: str) -> bool:
    if not en or not sv:
        return False
    if len(en) < 20 or len(sv) < 20:
        return False
    if len(en) > 2500 or len(sv) > 2500:
        return False
    ratio = len(sv) / max(len(en), 1)
    if ratio < 0.4 or ratio > 2.2:
        return False
    return True

dataset = load_dataset(
    "Helsinki-NLP/opus_books",
    "en-sv",
    split="train",
)

rows = []
for item in dataset:
    pair = item["translation"]
    en = clean_text(pair.get("en", ""))
    sv = clean_text(pair.get("sv", ""))
    if not is_good_pair(en, sv):
        continue
    rows.append({
        "source_en": en,
        "target_sv": sv,
        "messages": [
            {
                "role": "system",
                "content": SYSTEM_PROMPT,
            },
            {
                "role": "user",
                "content": f"Translate this English novel passage to Swedish:\n\n{en}",
            },
            {
                "role": "assistant",
                "content": sv,
            },
        ],
    })

random.seed(3407)
random.shuffle(rows)

train_end = int(len(rows) * 0.85)
valid_end = int(len(rows) * 0.95)
splits = {
    "train": rows[:train_end],
    "valid": rows[train_end:valid_end],
    "test": rows[valid_end:],
}

for split_name, split_rows in splits.items():
    path = OUT_DIR / f"{split_name}.jsonl"
    with path.open("w", encoding="utf-8") as f:
        for row in split_rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
    print(split_name, len(split_rows))
Run:
python scripts/prepare_data.py
This creates:
data/train.jsonl
data/valid.jsonl
data/test.jsonl
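The shuffle is important.
Without it, the split is purely positional: the validation and test sets would come from the last rows of the corpus, which are likely to belong to the same few books and would not represent the training data. Keep in mind that a row-level shuffle still puts sentences from the same book in both train and test, so scores on this test set will look better than they would on a book the model has never seen.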
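A random split can also leak when the same sentence pair appears more than once in the corpus. A minimal sketch of a deduplication pass that could run on `rows` before the shuffle; `dedupe_rows` is a hypothetical helper, and keying on the lowercased English source is an assumption you may want to loosen or tighten.

```python
def dedupe_rows(rows: list[dict]) -> list[dict]:
    """Keep the first row for each distinct English source (case-insensitive)."""
    seen = set()
    unique = []
    for row in rows:
        # Normalize case and whitespace so trivial variants count as duplicates.
        key = " ".join(row["source_en"].lower().split())
        if key in seen:
            continue
        seen.add(key)
        unique.append(row)
    return unique


rows = [
    {"source_en": "She opened the door slowly.", "target_sv": "Hon öppnade dörren långsamt."},
    {"source_en": "she opened  the door slowly.", "target_sv": "Hon öppnade dörren sakta."},
    {"source_en": "He looked at the old house.", "target_sv": "Han såg på det gamla huset."},
]
print(len(dedupe_rows(rows)))  # 2
```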
Step 2: Fine-Tune Gemma 4 E4B with Unsloth
Create:
scripts/train_unsloth.py
# Import unsloth before trl so its patches are applied first.
from unsloth import FastLanguageModel
import torch
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

MODEL_ID = "google/gemma-4-E4B-it"
MAX_SEQ_LENGTH = 2048

dataset = load_dataset(
    "json",
    data_files={
        "train": "data/train.jsonl",
        "valid": "data/valid.jsonl",
    },
)

# Load the base model in 16-bit for LoRA (not full fine-tuning).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=MODEL_ID,
    max_seq_length=MAX_SEQ_LENGTH,
    load_in_4bit=False,
    load_in_16bit=True,
    full_finetuning=False,
)

# Attach LoRA adapters to the attention and MLP projections.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
    use_gradient_checkpointing="unsloth",
    random_state=3407,
    max_seq_length=MAX_SEQ_LENGTH,
)

def format_example(example):
    # Render the chat messages into one training string.
    text = tokenizer.apply_chat_template(
        example["messages"],
        tokenize=False,
        add_generation_prompt=False,
    )
    return {"text": text}

train_ds = dataset["train"].map(
    format_example,
    remove_columns=dataset["train"].column_names,
)
valid_ds = dataset["valid"].map(
    format_example,
    remove_columns=dataset["valid"].column_names,
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_ds,
    eval_dataset=valid_ds,
    args=SFTConfig(
        output_dir="outputs/gemma-sv-translator",
        dataset_text_field="text",
        max_seq_length=MAX_SEQ_LENGTH,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        learning_rate=2e-4,
        num_train_epochs=3,
        warmup_ratio=0.05,
        logging_steps=10,
        eval_steps=50,
        save_steps=100,
        eval_strategy="steps",
        save_strategy="steps",
        optim="adamw_8bit",
        bf16=torch.cuda.is_available() and torch.cuda.is_bf16_supported(),
        fp16=torch.cuda.is_available() and not torch.cuda.is_bf16_supported(),
        seed=3407,
        report_to="none",
    ),
)

trainer.train()

# Save only the LoRA adapter and tokenizer, not the merged model.
model.save_pretrained("outputs/gemma-sv-translator-lora")
tokenizer.save_pretrained("outputs/gemma-sv-translator-lora")
Run:
python scripts/train_unsloth.py
Start with:
MAX_SEQ_LENGTH = 2048
After the full pipeline works, try:
MAX_SEQ_LENGTH = 4096
Longer passages may help novel translation.
They also need more VRAM.
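Before launching a run, it helps to know how many optimizer steps the settings in train_unsloth.py imply, so you can tell whether eval_steps=50 and save_steps=100 will fire often enough. The sketch below is a rough estimate that assumes no sequence packing and a single GPU; the trainer's own rounding may differ slightly, and the row count is a made-up figure.

```python
import math


def training_steps(num_rows: int, per_device_batch: int, grad_accum: int,
                   epochs: int, num_gpus: int = 1) -> dict[str, int]:
    """Estimate optimizer steps for a run; assumes no packing or dropped rows."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    steps_per_epoch = math.ceil(num_rows / effective_batch)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": steps_per_epoch * epochs,
    }


# 2,000 rows is hypothetical; use the train count printed by prepare_data.py.
print(training_steps(num_rows=2000, per_device_batch=1, grad_accum=8, epochs=3))
# {'effective_batch': 8, 'steps_per_epoch': 250, 'total_steps': 750}
```

With these numbers, evaluation would run 15 times and checkpoints would be written 7 times over the whole run.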
Step 3: Generate Test Translations
Create:
scripts/translate_test.py
import json
from pathlib import Path
from unsloth import FastLanguageModel

MODEL_PATH = "outputs/gemma-sv-translator-lora"
MAX_SEQ_LENGTH = 2048

SYSTEM_PROMPT = (
    "You are a professional literary translator. "
    "Translate English novel passages into natural Swedish. "
    "Preserve meaning, tone, dialogue, names, and formatting."
)

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=MODEL_PATH,
    max_seq_length=MAX_SEQ_LENGTH,
    load_in_4bit=False,
    load_in_16bit=True,
)
FastLanguageModel.for_inference(model)

def translate(en: str) -> str:
    messages = [
        {
            "role": "system",
            "content": SYSTEM_PROMPT,
        },
        {
            "role": "user",
            "content": f"Translate this English novel passage to Swedish:\n\n{en}",
        },
    ]
    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(
        **inputs,
        max_new_tokens=900,
        do_sample=False,
    )
    generated_tokens = output[0][inputs["input_ids"].shape[-1]:]
    return tokenizer.decode(generated_tokens, skip_special_tokens=True).strip()

Path("outputs").mkdir(exist_ok=True)

with open("data/test.jsonl", "r", encoding="utf-8") as f_in, \
        open("outputs/predictions.jsonl", "w", encoding="utf-8") as f_out:
    for line in f_in:
        row = json.loads(line)
        prediction = translate(row["source_en"])
        f_out.write(json.dumps({
            "source_en": row["source_en"],
            "reference_sv": row["target_sv"],
            "prediction_sv": prediction,
        }, ensure_ascii=False) + "\n")
Run:
python scripts/translate_test.py
You now have:
outputs/predictions.jsonl
Step 4: Evaluate with BLEU, chrF, and COMET
BLEU and chrF compare the model output with the human Swedish reference.
COMET checks the source, prediction, and reference together.
For translation work, COMET is often more useful than BLEU alone. If you want a stronger pattern for compiling and optimizing eval pipelines around metrics like these, DSPy is a good next step.
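To make the chrF idea concrete, here is a simplified character n-gram F-score in plain Python. It is an illustration of the principle only, not sacreBLEU's implementation, and its numbers will not match the library exactly; `simple_chrf` and its defaults (n-grams up to 6, beta of 2) are choices made for this sketch.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams, ignoring whitespace."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(prediction: str, reference: str,
                max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF on a 0-100 scale, averaging precision/recall over n."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        pred = char_ngrams(prediction, n)
        ref = char_ngrams(reference, n)
        if not pred or not ref:
            continue  # one side is shorter than n characters
        overlap = sum((pred & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(pred.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # F-beta with beta=2 weights recall more heavily than precision.
    return 100 * (1 + beta**2) * p * r / (beta**2 * p + r)


reference = "Han såg på det gamla huset."
print(simple_chrf(reference, reference))            # identical text scores 100.0
print(simple_chrf("Han tittade på huset.", reference))
```

Because it works on characters rather than whole words, a score like this gives partial credit for inflected forms, which is one reason chrF tends to be kinder than BLEU to morphologically rich targets.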
Create:
scripts/eval_metrics.py
import json
import torch
import sacrebleu
from comet import download_model, load_from_checkpoint

sources = []
references = []
predictions = []

with open("outputs/predictions.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        row = json.loads(line)
        sources.append(row["source_en"])
        references.append(row["reference_sv"])
        predictions.append(row["prediction_sv"])

bleu = sacrebleu.corpus_bleu(predictions, [references])
chrf = sacrebleu.corpus_chrf(predictions, [references])
print("BLEU:", round(bleu.score, 2))
print("chrF:", round(chrf.score, 2))

model_path = download_model("Unbabel/wmt22-comet-da")
comet_model = load_from_checkpoint(model_path)

comet_data = [
    {
        "src": source,
        "mt": prediction,
        "ref": reference,
    }
    for source, prediction, reference in zip(sources, predictions, references)
]

scores = comet_model.predict(
    comet_data,
    batch_size=8,
    gpus=1 if torch.cuda.is_available() else 0,
)
print("COMET:", round(scores.system_score, 4))
Run:
python scripts/eval_metrics.py
Track your results like this:
| Model | BLEU | chrF | COMET | Notes |
|---|---|---|---|---|
| Base Gemma 4 E4B | Run first | Run first | Run first | Baseline |
| Fine-tuned Gemma 4 E4B | Compare | Compare | Compare | Your trained model |
| Claude Opus 4.7 | Judge only | Judge only | Judge only | Do not train on its outputs |
Do not publish fake scores.
Run the eval and show your real numbers.
Step 5: Use Claude Opus 4.7 as a Judge
Claude Opus 4.7 is useful for quality review. For the full pattern (golden datasets, regression back-testing, release gates), see my LLM-as-judge eval pipeline guide.
Use it to check:
- Meaning
- Swedish fluency
- Literary tone
- Dialogue quality
- Name preservation
- Formatting preservation
Create:
scripts/opus_judge.py
import json
import os
from anthropic import Anthropic

client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

def judge_translation(source_en: str, reference_sv: str, prediction_sv: str) -> str:
    prompt = f"""
You are evaluating an English to Swedish literary translation.
Score the model translation from 1 to 5.
Criteria:
1. Meaning preservation
2. Natural Swedish
3. Literary tone
4. Dialogue quality
5. Name and formatting preservation
Return only valid JSON with these fields:
score, meaning_errors, fluency_errors, tone_errors, final_comment
English source:
{source_en}
Human Swedish reference:
{reference_sv}
Model Swedish translation:
{prediction_sv}
""".strip()
    message = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=700,
        messages=[
            {
                "role": "user",
                "content": prompt,
            }
        ],
    )
    return message.content[0].text

with open("outputs/predictions.jsonl", "r", encoding="utf-8") as f_in, \
        open("outputs/opus_judged.jsonl", "w", encoding="utf-8") as f_out:
    for line in f_in:
        row = json.loads(line)
        row["opus_judge"] = judge_translation(
            row["source_en"],
            row["reference_sv"],
            row["prediction_sv"],
        )
        f_out.write(json.dumps(row, ensure_ascii=False) + "\n")
Run:
export ANTHROPIC_API_KEY="your_key_here"
python scripts/opus_judge.py
Do not set temperature, top_p, or top_k for Opus 4.7. Anthropic says non-default sampling parameters return a 400 error on Opus 4.7. Prompt the model clearly instead.
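The judge script stores the reply as raw text, so the score still has to be parsed before you can rank examples. A minimal sketch, assuming the reply contains one JSON object that may be surrounded by extra prose; `parse_judge` and `worst_examples` are hypothetical helpers, and rows whose reply cannot be parsed are skipped rather than guessed at.

```python
import json
import re
from statistics import mean
from typing import Optional


def parse_judge(raw: str) -> Optional[dict]:
    """Extract the JSON object from the judge's reply, or None on failure."""
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if not match:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None


def worst_examples(rows: list[dict], limit: int = 20) -> list[dict]:
    """Return rows with a numeric judge score, lowest scores first."""
    scored = []
    for row in rows:
        verdict = parse_judge(row["opus_judge"])
        if verdict and isinstance(verdict.get("score"), (int, float)):
            scored.append({**row, "score": verdict["score"]})
    return sorted(scored, key=lambda r: r["score"])[:limit]


rows = [
    {"source_en": "a", "opus_judge": '{"score": 4, "final_comment": "ok"}'},
    {"source_en": "b", "opus_judge": 'Here is the JSON:\n{"score": 2, "final_comment": "flat tone"}'},
    {"source_en": "c", "opus_judge": "I cannot score this passage."},
]
worst = worst_examples(rows)
print([r["source_en"] for r in worst])   # ['b', 'a']
print(mean(r["score"] for r in worst))   # 3
```

In practice you would load the rows from outputs/opus_judged.jsonl and read the lowest-scoring ones by hand before touching the dataset.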
Step 6: Improve the Dataset
Fine-tuning quality mostly comes from data quality.
Fix the data before changing hyperparameters.
| Problem | Fix |
|---|---|
| Bad alignment | Remove the row |
| Old Swedish style | Add modern licensed examples |
| Missing dialogue | Add more dialogue passages |
| Names get translated | Add examples with names |
| Long passages fail | Train with longer chunks |
| Tone feels flat | Add better literary references |
Good row:
{
  "source_en": "She opened the door slowly, afraid of what she might find.",
  "target_sv": "Hon öppnade dörren långsamt, rädd för vad hon skulle kunna hitta."
}
Bad row:
{
  "source_en": "She opened the door slowly.",
  "target_sv": "Kapitel tre."
}
Bad rows hurt the model.
Remove them.
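Some bad rows can be flagged automatically before a human looks at them. The sketch below uses two cheap heuristics that are assumptions of this example, not rules from the dataset: a bare chapter heading on only one side, and a question mark on only one side. Treat the flags as candidates for review, not as proof of misalignment.

```python
import re

# Matches bare headings such as "Chapter 3", "CHAPTER III", or "Kapitel tre."
HEADING_RE = re.compile(r"^\s*(chapter|kapitel)\s+\w+\.?\s*$", re.IGNORECASE)
TRAILING_QUOTES = "\"'”’»"


def alignment_flags(source_en: str, target_sv: str) -> list[str]:
    """Return reasons a pair looks misaligned; an empty list means no flag."""
    flags = []
    # A chapter heading on one side only, like the bad row above.
    if bool(HEADING_RE.match(source_en)) != bool(HEADING_RE.match(target_sv)):
        flags.append("heading_on_one_side")
    # A question should normally stay a question in translation.
    en_q = source_en.rstrip().rstrip(TRAILING_QUOTES).endswith("?")
    sv_q = target_sv.rstrip().rstrip(TRAILING_QUOTES).endswith("?")
    if en_q != sv_q:
        flags.append("question_mark_mismatch")
    return flags


print(alignment_flags("She opened the door slowly.", "Kapitel tre."))
# ['heading_on_one_side']
print(alignment_flags(
    "She opened the door slowly, afraid of what she might find.",
    "Hon öppnade dörren långsamt, rädd för vad hon skulle kunna hitta.",
))
# []
```

The length filter in prepare_data.py already drops very short targets, so checks like these matter most for pairs that pass the length and ratio tests but still do not match.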
Step 7: Build a Human Test Set
Automatic metrics are not enough for novels.
Create 100 hand-picked examples:
| Type | Count |
|---|---|
| Dialogue | 20 |
| Description | 20 |
| Action | 20 |
| Emotional scenes | 20 |
| Long paragraphs | 20 |
Score each output from 1 to 5:
| Score | Meaning |
|---|---|
| 5 | Publishable |
| 4 | Good, minor edits |
| 3 | Understandable, needs editing |
| 2 | Serious problems |
| 1 | Wrong translation |
The real question is simple:
Does the model reduce editing time?
If a human editor normally needs 30 minutes and your model reduces that to 10 minutes, the fine-tune is useful.
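The scoring sheet and the editing-time question both reduce to simple arithmetic. A minimal sketch, assuming each rating is stored as a dict with a passage type and a 1 to 5 score; the field names and helper functions are hypothetical.

```python
from collections import defaultdict


def summarize_scores(ratings: list[dict]) -> dict[str, float]:
    """Average the 1-5 human scores per passage type."""
    by_type = defaultdict(list)
    for rating in ratings:
        by_type[rating["type"]].append(rating["score"])
    return {kind: round(sum(s) / len(s), 2) for kind, s in by_type.items()}


def editing_time_saved(baseline_minutes: float, assisted_minutes: float) -> float:
    """Fraction of editing time saved relative to the unassisted baseline."""
    return (baseline_minutes - assisted_minutes) / baseline_minutes


ratings = [
    {"type": "Dialogue", "score": 4},
    {"type": "Dialogue", "score": 3},
    {"type": "Long paragraphs", "score": 2},
]
print(summarize_scores(ratings))           # {'Dialogue': 3.5, 'Long paragraphs': 2.0}
print(f"{editing_time_saved(30, 10):.0%}")  # 67%
```

Per-type averages show where to add data: a low score on long paragraphs points at longer training chunks, a low score on dialogue at more dialogue passages.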
Step 8: Export the Model
Unsloth supports GGUF export.
Example, run at the end of train_unsloth.py while model and tokenizer are still loaded:
model.save_pretrained_gguf(
    "outputs/gemma-sv-translator-gguf",
    tokenizer,
    quantization_method="q8_0",
)
You can also try a smaller file:
model.save_pretrained_gguf(
    "outputs/gemma-sv-translator-gguf-q4",
    tokenizer,
    quantization_method="q4_k_m",
)
Use this rule:
| Format | Best for |
|---|---|
| LoRA adapter | More training |
| Merged model | Server deployment |
| GGUF q8_0 | Better local quality |
| GGUF q4_k_m | Smaller local model |
For novel translation, test q8_0 first.
Then compare it with q4_k_m.
Quantization can reduce tone quality.
The Full Pipeline
1. Load OPUS Books English-Swedish.
2. Clean bad translation pairs.
3. Format rows as Gemma chat examples.
4. Fine-tune Gemma 4 E4B with Unsloth LoRA.
5. Translate unseen test passages.
6. Evaluate with BLEU, chrF, and COMET.
7. Ask Claude Opus 4.7 to judge quality.
8. Inspect the worst examples.
9. Improve the dataset.
10. Train again.
11. Export to GGUF.
Fine-tuning is not one training run. It is data, evals, mistakes, fixes, and retraining. If you want the broader picture, MLOps for LLMs covers the production retrain loop end to end.
Can Gemma 4 E4B Become as Good as Claude Opus 4.7?
Not in general.
Claude Opus 4.7 is a much larger frontier model.
Gemma 4 E4B is a small open-weight model.
But Gemma can become useful for one narrow task:
English novel passage in.
Clean Swedish draft out.
A practical setup is:
| Role | Model |
|---|---|
| Cheap daily translation draft | Fine-tuned Gemma 4 E4B |
| Hard passage review | Claude Opus 4.7 |
| Final quality | Human editor |
This gives you:
- Lower cost
- Local control
- Faster drafts
- Better privacy
- Measurable quality
You are not replacing the frontier model.
You are moving repeatable work to a smaller model and keeping the larger model for review.
Final Thoughts
For English to Swedish novel translation, the recipe is clear:
- Use parallel data you have the rights to.
- Clean the dataset carefully.
- Fine-tune Gemma 4 E4B with Unsloth.
- Evaluate with BLEU, chrF, COMET, and human review.
- Use Claude Opus 4.7 as a judge, not as training data.
- Improve the dataset and repeat.
That is how you build a small local translation model that can be useful in a real translation workflow.
Frequently Asked Questions
Why fine-tune Gemma 4 E4B instead of just calling Claude Opus 4.7?
Per-request cost and latency. A LoRA-tuned Gemma 4 E4B runs on a single GPU, drafts a Swedish passage in seconds, and stays local. Claude Opus 4.7 is the better reviewer; a small specialist is the better draft engine.
Can I fine-tune Gemma 4 E4B on a Mac?
No. The Unsloth flow in this guide is CUDA-only. A Mac can run the exported GGUF model for inference, but training needs an NVIDIA GPU with about 17GB of VRAM.
Why use Claude as a judge instead of as a teacher?
Anthropic’s terms do not allow training another AI model on Claude outputs without written permission. Scoring translations with Opus 4.7 is fine; generating training labels for Gemma with it is not.
How much does dataset quality matter compared to hyperparameters?
More than anything else. A clean 5K-pair dataset will usually beat a noisy 50K-pair dataset, because misaligned rows teach the model wrong translations. Fix the data first, then think about LoRA rank, learning rate, and epochs.
Is OPUS Books safe to ship a commercial product on?
For learning and benchmarking, yes. For a paid translation product, no. Check the license of each book pair and use your own licensed parallel corpus before charging customers.