
Fine-Tune Gemma 4 E4B for Swedish Translation

[Figure: Fine-tuning pipeline for Gemma 4 E4B on English-to-Swedish novel translation, with Claude Opus 4.7 as evaluator.]

Fine-tuning works best when the task is narrow.

You do not need a small open-weight model to beat Claude Opus 4.7 at everything. That is a losing fight, and it frames the open-weight versus proprietary tradeoff the wrong way.

You need it to become strong at one job:

Translate English novel passages into clean, natural Swedish.

This guide shows a practical pipeline using:

| Part | Choice |
| --- | --- |
| Student model | google/gemma-4-E4B-it |
| Fine-tuning tool | Unsloth |
| Method | LoRA |
| Dataset | Helsinki-NLP/opus_books, en-sv |
| Metrics | BLEU, chrF, COMET |
| Judge model | Claude Opus 4.7 |
| Export format | LoRA adapter or GGUF |

The idea is simple:

  • Train Gemma 4 E4B on English-Swedish book data you are allowed to use.
  • Evaluate it with automatic metrics.
  • Use Claude Opus 4.7 as a reviewer.
  • Improve the dataset.
  • Train again.

This is not direct Claude distillation.

Anthropic does not allow using Claude outputs to train or develop another AI model unless you have written permission.

So we will not train Gemma on Claude-generated Swedish translations.

We will use Claude Opus 4.7 as a judge, not as the source of training labels.

That is the safer setup.

What We Are Building

Input:

He looked at the old house and felt that something was waiting inside.

Expected Swedish output:

Han såg på det gamla huset och kände att något väntade där inne.

A good novel translation is not word-for-word.

The model must learn:

| Skill | Meaning |
| --- | --- |
| Accuracy | Keep the original meaning |
| Tone | Preserve the mood |
| Swedish fluency | Sound natural to Swedish readers |
| Dialogue | Keep speech readable |
| Names | Keep character names unchanged |
| Consistency | Translate repeated terms the same way |

This is why literary translation needs a clean dataset and real evaluation.
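Some of these skills can be spot-checked mechanically. Here is a minimal sketch for the "Names" row. It assumes you keep a list of character names per book. The helper name `missing_names` and the name list are hypothetical, not part of any library. It uses a prefix match so the Swedish genitive ("Marys") still counts as the name "Mary".

```python
import re


def missing_names(source_en, translation_sv, names):
    """Return names that occur in the source but not in the translation."""
    missing = []
    for name in names:
        pattern = rf"\b{re.escape(name)}"  # prefix match: "Mary" also matches "Marys"
        in_source = re.search(pattern, source_en)
        in_target = re.search(pattern, translation_sv)
        if in_source and not in_target:
            missing.append(name)
    return missing


# Hypothetical example: "John" was localized to "Johan", "Mary" was kept.
print(missing_names(
    "John looked at Mary's house.",
    "Johan såg på Marys hus.",
    ["John", "Mary"],
))
```

A check like this only catches changed or dropped names. It says nothing about tone or fluency, which still need the judge and human review.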

Important Model Facts

Gemma 4 E4B is an open-weight instruction model from Google.

The official Gemma 4 model card says the small Gemma 4 models use a 128K context window, while the medium models support 256K.

E4B belongs to the small model group.

Unsloth supports Gemma 4 E2B, E4B, 26B-A4B, and 31B fine-tuning.

Unsloth’s E4B LoRA fine-tuning guide quotes around 17GB VRAM; real usage shifts with sequence length, batch size, LoRA rank, and precision, so treat it as a starting point.

Use a Linux machine with an NVIDIA GPU for this tutorial.

A Mac can run local models, but this Unsloth training flow is meant for CUDA.

Dataset

Use:

Helsinki-NLP/opus_books

Config:

en-sv

OPUS Books contains aligned book text.

It is useful for learning, research, and prototyping.

For a real product, use your own licensed English novels and licensed Swedish translations.

Do not build a commercial translation product on a dataset unless you have checked the data rights.

Project Structure

gemma-sv-translator/
  data/
    train.jsonl
    valid.jsonl
    test.jsonl
  scripts/
    prepare_data.py
    train_unsloth.py
    translate_test.py
    eval_metrics.py
    opus_judge.py
  outputs/

Install packages:

pip install --upgrade --force-reinstall --no-cache-dir unsloth unsloth_zoo
pip install -U datasets transformers trl peft accelerate bitsandbytes
pip install -U sacrebleu unbabel-comet anthropic pandas torch

For reproducibility, pin package versions after your first successful run. Unsloth and TRL move fast, and a working requirements.txt will save you a debugging session later.

Log in to Hugging Face:

huggingface-cli login

You may need to accept the Gemma license on Hugging Face before downloading the model.

Step 1: Prepare the Dataset

Create:

scripts/prepare_data.py
from datasets import load_dataset
from pathlib import Path
import json
import random
import re

OUT_DIR = Path("data")
OUT_DIR.mkdir(exist_ok=True)

SYSTEM_PROMPT = (
    "You are a professional literary translator. "
    "Translate English novel passages into natural Swedish. "
    "Preserve meaning, tone, dialogue, names, and formatting."
)

def clean_text(text: str) -> str:
    text = text.strip()
    text = re.sub(r"\s+", " ", text)
    return text

def is_good_pair(en: str, sv: str) -> bool:
    if not en or not sv:
        return False

    if len(en) < 20 or len(sv) < 20:
        return False

    if len(en) > 2500 or len(sv) > 2500:
        return False

    ratio = len(sv) / max(len(en), 1)

    if ratio < 0.4 or ratio > 2.2:
        return False

    return True

dataset = load_dataset(
    "Helsinki-NLP/opus_books",
    "en-sv",
    split="train",
)

rows = []

for item in dataset:
    pair = item["translation"]

    en = clean_text(pair.get("en", ""))
    sv = clean_text(pair.get("sv", ""))

    if not is_good_pair(en, sv):
        continue

    rows.append({
        "source_en": en,
        "target_sv": sv,
        "messages": [
            {
                "role": "system",
                "content": SYSTEM_PROMPT,
            },
            {
                "role": "user",
                "content": f"Translate this English novel passage to Swedish:\n\n{en}",
            },
            {
                "role": "assistant",
                "content": sv,
            },
        ],
    })

random.seed(3407)
random.shuffle(rows)

train_end = int(len(rows) * 0.85)
valid_end = int(len(rows) * 0.95)

splits = {
    "train": rows[:train_end],
    "valid": rows[train_end:valid_end],
    "test": rows[valid_end:],
}

for split_name, split_rows in splits.items():
    path = OUT_DIR / f"{split_name}.jsonl"

    with path.open("w", encoding="utf-8") as f:
        for row in split_rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")

    print(split_name, len(split_rows))

Run:

python scripts/prepare_data.py

This creates:

data/train.jsonl
data/valid.jsonl
data/test.jsonl

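The shuffle matters.

Without it, the 85/10/5 split follows corpus order, so the validation and test sets would likely come from whichever book sits at the end of the file. Keep in mind that a sentence-level shuffle puts passages from the same book in both train and test, so test scores reflect in-domain quality rather than performance on unseen books.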
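One thing a shuffle does not fix is duplication. If the same sentence pair appears twice, a random split can put one copy in train and the other in test, which inflates test scores. Here is a minimal sketch of a de-duplication pass that could run before the shuffle. It assumes rows shaped like the ones built in prepare_data.py. The helper name `dedupe_pairs` is hypothetical.

```python
def dedupe_pairs(rows):
    """Keep the first occurrence of each (source, target) pair.

    Comparison ignores case and surrounding whitespace, so trivially
    different copies count as duplicates.
    """
    seen = set()
    unique = []
    for row in rows:
        key = (row["source_en"].strip().lower(), row["target_sv"].strip().lower())
        if key in seen:
            continue
        seen.add(key)
        unique.append(row)
    return unique


# Hypothetical rows: the second one is a near-copy of the first.
rows = [
    {"source_en": "She opened the door.", "target_sv": "Hon öppnade dörren."},
    {"source_en": "she opened the door. ", "target_sv": "Hon öppnade dörren."},
    {"source_en": "He left.", "target_sv": "Han gick."},
]
print(len(dedupe_pairs(rows)), "unique rows out of", len(rows))
```

If you want to be stricter, key on the English source alone, so that one source sentence never appears on both sides of the split.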

Step 2: Fine-Tune Gemma 4 E4B with Unsloth

Create:

scripts/train_unsloth.py
# Import unsloth first so it can patch transformers and trl before they load.
from unsloth import FastLanguageModel
import torch
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

MODEL_ID = "google/gemma-4-E4B-it"
MAX_SEQ_LENGTH = 2048

dataset = load_dataset(
    "json",
    data_files={
        "train": "data/train.jsonl",
        "valid": "data/valid.jsonl",
    },
)

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=MODEL_ID,
    max_seq_length=MAX_SEQ_LENGTH,
    load_in_4bit=False,
    load_in_16bit=True,
    full_finetuning=False,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
    use_gradient_checkpointing="unsloth",
    random_state=3407,
    max_seq_length=MAX_SEQ_LENGTH,
)

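def format_example(example):
    text = tokenizer.apply_chat_template(
        example["messages"],
        tokenize=False,
        add_generation_prompt=False,
    )

    # Gemma chat templates normally prepend <bos>, and the trainer's tokenizer
    # adds it again; strip it here to avoid a doubled token.
    return {"text": text.removeprefix("<bos>")}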

train_ds = dataset["train"].map(
    format_example,
    remove_columns=dataset["train"].column_names,
)

valid_ds = dataset["valid"].map(
    format_example,
    remove_columns=dataset["valid"].column_names,
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_ds,
    eval_dataset=valid_ds,
    args=SFTConfig(
        output_dir="outputs/gemma-sv-translator",
        dataset_text_field="text",
        max_seq_length=MAX_SEQ_LENGTH,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        learning_rate=2e-4,
        num_train_epochs=3,
        warmup_ratio=0.05,
        logging_steps=10,
        eval_steps=50,
        save_steps=100,
        eval_strategy="steps",
        save_strategy="steps",
        optim="adamw_8bit",
        bf16=torch.cuda.is_available() and torch.cuda.is_bf16_supported(),
        fp16=torch.cuda.is_available() and not torch.cuda.is_bf16_supported(),
        seed=3407,
        report_to="none",
    ),
)

trainer.train()

model.save_pretrained("outputs/gemma-sv-translator-lora")
tokenizer.save_pretrained("outputs/gemma-sv-translator-lora")

Run:

python scripts/train_unsloth.py

Start with:

MAX_SEQ_LENGTH = 2048

After the full pipeline works, try:

MAX_SEQ_LENGTH = 4096

Longer passages may help novel translation.

They also need more VRAM.
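It helps to know how long a run will be before you start it. The sketch below works out the schedule implied by the training config: batch size 1, gradient accumulation 8, 3 epochs, and 5% warmup. The 2,000-example dataset size is a made-up number for illustration. The helper `training_schedule` is hypothetical, and the real trainer may round slightly differently.

```python
import math


def training_schedule(num_examples, per_device_batch=1, grad_accum=8,
                      epochs=3, warmup_ratio=0.05, num_devices=1):
    """Approximate optimizer-step counts for a given dataset size."""
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


# Illustrative dataset size only; use the count printed by prepare_data.py.
plan = training_schedule(num_examples=2000)
print(plan)
```

With eval_steps=50 and save_steps=100, a run of this size would evaluate about 15 times and save about 7 checkpoints. Raise those intervals if your dataset is much larger.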

Step 3: Generate Test Translations

Create:

scripts/translate_test.py
import json
from pathlib import Path
from unsloth import FastLanguageModel

MODEL_PATH = "outputs/gemma-sv-translator-lora"
MAX_SEQ_LENGTH = 2048

SYSTEM_PROMPT = (
    "You are a professional literary translator. "
    "Translate English novel passages into natural Swedish. "
    "Preserve meaning, tone, dialogue, names, and formatting."
)

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=MODEL_PATH,
    max_seq_length=MAX_SEQ_LENGTH,
    load_in_4bit=False,
    load_in_16bit=True,
)

FastLanguageModel.for_inference(model)

def translate(en: str) -> str:
    messages = [
        {
            "role": "system",
            "content": SYSTEM_PROMPT,
        },
        {
            "role": "user",
            "content": f"Translate this English novel passage to Swedish:\n\n{en}",
        },
    ]

    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )

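    # The chat template normally emits <bos> already, so skip special tokens
    # here to avoid a duplicate. Passing text= also works if Unsloth returns
    # a processor instead of a plain tokenizer.
    inputs = tokenizer(text=prompt, add_special_tokens=False, return_tensors="pt").to(model.device)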

    output = model.generate(
        **inputs,
        max_new_tokens=900,
        do_sample=False,
    )

    generated_tokens = output[0][inputs["input_ids"].shape[-1]:]
    return tokenizer.decode(generated_tokens, skip_special_tokens=True).strip()

Path("outputs").mkdir(exist_ok=True)

with open("data/test.jsonl", "r", encoding="utf-8") as f_in, \
     open("outputs/predictions.jsonl", "w", encoding="utf-8") as f_out:

    for line in f_in:
        row = json.loads(line)
        prediction = translate(row["source_en"])

        f_out.write(json.dumps({
            "source_en": row["source_en"],
            "reference_sv": row["target_sv"],
            "prediction_sv": prediction,
        }, ensure_ascii=False) + "\n")

Run:

python scripts/translate_test.py

You now have:

outputs/predictions.jsonl
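Before scoring, it is worth scanning the predictions for obvious failures. This is a rough sketch, not a full validator. The helper name `check_prediction` and its labels are hypothetical. It reuses the 0.4 to 2.2 length-ratio bounds from prepare_data.py, and the "possibly_truncated" test is only a punctuation heuristic.

```python
def check_prediction(source_en, prediction_sv):
    """Return a list of warning labels for one prediction."""
    problems = []
    pred = prediction_sv.strip()

    if not pred:
        return ["empty"]

    # The model echoed the English text instead of translating it.
    if pred.lower() == source_en.strip().lower():
        problems.append("copied_source")

    # Same length-ratio bounds as the data filter.
    ratio = len(pred) / max(len(source_en), 1)
    if ratio < 0.4 or ratio > 2.2:
        problems.append("length_ratio")

    # Output that stops mid-sentence may have hit max_new_tokens.
    if pred[-1] not in '.!?"”’»…)':
        problems.append("possibly_truncated")

    return problems


source = "He looked at the old house and felt that something was waiting inside."
print(check_prediction(source, "Han såg på det gamla huset och kände att"))
print(check_prediction(source, source))
```

Run it over every row in outputs/predictions.jsonl and read the flagged rows by hand before trusting any metric.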

Step 4: Evaluate with BLEU, chrF, and COMET

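BLEU and chrF measure surface overlap between the model output and the human Swedish reference: BLEU counts matching word n-grams, chrF counts matching character n-grams.

COMET is a learned metric that scores each prediction using the source, the prediction, and the reference together.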

For translation work, COMET is often more useful than BLEU alone. If you want a stronger pattern for compiling and optimizing eval pipelines around metrics like these, DSPy is a good next step.
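To make the chrF idea concrete, here is a simplified illustration. It is not sacreBLEU's implementation. It uses a single character n-gram order (3), while sacreBLEU's chrF averages over orders 1 to 6 by default. The function names are hypothetical. Use sacreBLEU for real scores.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count every character n-gram in the text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def simple_chrf(prediction, reference, n=3, beta=2.0):
    """Single-order character n-gram F-score in the range 0..1.

    beta=2 weights recall twice as much as precision.
    """
    pred = char_ngrams(prediction, n)
    ref = char_ngrams(reference, n)
    overlap = sum((pred & ref).values())  # clipped matching n-grams
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)


reference = "Han tittade på det gamla huset."
print(simple_chrf("Han såg på det gamla huset.", reference))
print(simple_chrf("Huset var gammalt.", reference))
```

Because it works on characters, a score like this gives partial credit for inflected forms such as "huset" versus "hus", which is one reason chrF is popular for morphologically richer languages.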

Create:

scripts/eval_metrics.py
import json
import torch
import sacrebleu
from comet import download_model, load_from_checkpoint

sources = []
references = []
predictions = []

with open("outputs/predictions.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        row = json.loads(line)

        sources.append(row["source_en"])
        references.append(row["reference_sv"])
        predictions.append(row["prediction_sv"])

bleu = sacrebleu.corpus_bleu(predictions, [references])
chrf = sacrebleu.corpus_chrf(predictions, [references])

print("BLEU:", round(bleu.score, 2))
print("chrF:", round(chrf.score, 2))

model_path = download_model("Unbabel/wmt22-comet-da")
comet_model = load_from_checkpoint(model_path)

comet_data = [
    {
        "src": source,
        "mt": prediction,
        "ref": reference,
    }
    for source, prediction, reference in zip(sources, predictions, references)
]

scores = comet_model.predict(
    comet_data,
    batch_size=8,
    gpus=1 if torch.cuda.is_available() else 0,
)

print("COMET:", round(scores.system_score, 4))

Run:

python scripts/eval_metrics.py

Track your results like this:

| Model | BLEU | chrF | COMET | Notes |
| --- | --- | --- | --- | --- |
| Base Gemma 4 E4B | Run first | Run first | Run first | Baseline |
| Fine-tuned Gemma 4 E4B | Compare | Compare | Compare | Your trained model |
| Claude Opus 4.7 | Judge only | Judge only | Judge only | Do not train on its outputs |

Do not publish fake scores.

Run the eval and show your real numbers.

Step 5: Use Claude Opus 4.7 as a Judge

Claude Opus 4.7 is useful for quality review. For the full pattern (golden datasets, regression back-testing, release gates), see my LLM-as-judge eval pipeline guide.

Use it to check:

  • Meaning
  • Swedish fluency
  • Literary tone
  • Dialogue quality
  • Name preservation
  • Formatting preservation

Create:

scripts/opus_judge.py
import json
import os
from anthropic import Anthropic

client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

def judge_translation(source_en: str, reference_sv: str, prediction_sv: str) -> str:
    prompt = f"""
You are evaluating an English to Swedish literary translation.

Score the model translation from 1 to 5.

Criteria:
1. Meaning preservation
2. Natural Swedish
3. Literary tone
4. Dialogue quality
5. Name and formatting preservation

Return only valid JSON with these fields:
score, meaning_errors, fluency_errors, tone_errors, final_comment

English source:
{source_en}

Human Swedish reference:
{reference_sv}

Model Swedish translation:
{prediction_sv}
""".strip()

    message = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=700,
        messages=[
            {
                "role": "user",
                "content": prompt,
            }
        ],
    )

    return message.content[0].text

with open("outputs/predictions.jsonl", "r", encoding="utf-8") as f_in, \
     open("outputs/opus_judged.jsonl", "w", encoding="utf-8") as f_out:

    for line in f_in:
        row = json.loads(line)

        row["opus_judge"] = judge_translation(
            row["source_en"],
            row["reference_sv"],
            row["prediction_sv"],
        )

        f_out.write(json.dumps(row, ensure_ascii=False) + "\n")

Run:

export ANTHROPIC_API_KEY="your_key_here"
python scripts/opus_judge.py

Do not set temperature, top_p, or top_k for Opus 4.7. Anthropic says non-default sampling parameters return a 400 error on Opus 4.7. Prompt the model clearly instead.
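The judge script stores the reply as raw text. Models sometimes wrap JSON in extra prose, so parse defensively before you aggregate scores. Here is a minimal sketch. The helper name `parse_judge_reply` is hypothetical, and the field names come from the prompt in opus_judge.py.

```python
import json
import re


def parse_judge_reply(reply):
    """Extract the JSON object from a judge reply, or return None if unusable."""
    match = re.search(r"\{.*\}", reply, flags=re.DOTALL)
    if match is None:
        return None

    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None

    # Reject replies whose score is missing, non-numeric, or outside 1..5.
    score = data.get("score")
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        return None
    if not 1 <= score <= 5:
        return None

    return data


# Hypothetical reply with extra prose around the JSON.
reply = 'Here is my evaluation:\n{"score": 4, "final_comment": "Minor tone issue."}'
print(parse_judge_reply(reply))
print(parse_judge_reply("I cannot score this passage."))
```

Count how many replies come back as None. If the share is high, tighten the prompt rather than loosening the parser.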

Step 6: Improve the Dataset

Fine-tuning quality mostly comes from data quality.

Fix the data before changing hyperparameters.

| Problem | Fix |
| --- | --- |
| Bad alignment | Remove the row |
| Old Swedish style | Add modern licensed examples |
| Missing dialogue | Add more dialogue passages |
| Names get translated | Add examples with names |
| Long passages fail | Train with longer chunks |
| Tone feels flat | Add better literary references |

Good row:

{
  "source_en": "She opened the door slowly, afraid of what she might find.",
  "target_sv": "Hon öppnade dörren långsamt, rädd för vad hon skulle kunna hitta."
}

Bad row:

{
  "source_en": "She opened the door slowly.",
  "target_sv": "Kapitel tre."
}

Bad rows hurt the model.

Remove them.
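Rows like the bad example can often be found with cheap heuristics. The sketch below flags two patterns: a chapter heading on only one side, and digits that differ between source and target. These are assumptions about what misalignment looks like, not a complete detector. The helper name `row_problems` is hypothetical.

```python
import re

# Matches lines that start with an English or Swedish chapter heading.
HEADING_RE = re.compile(r"^(chapter|kapitel)\b", re.IGNORECASE)


def row_problems(source_en, target_sv):
    """Return warning labels for a likely misaligned pair."""
    problems = []

    source_is_heading = bool(HEADING_RE.match(source_en.strip()))
    target_is_heading = bool(HEADING_RE.match(target_sv.strip()))
    if source_is_heading != target_is_heading:
        problems.append("heading_mismatch")

    # Digits usually survive translation unchanged.
    if sorted(re.findall(r"\d+", source_en)) != sorted(re.findall(r"\d+", target_sv)):
        problems.append("number_mismatch")

    return problems


print(row_problems("She opened the door slowly.", "Kapitel tre."))
print(row_problems("He was 40 years old.", "Han var 14 år gammal."))
```

Treat flagged rows as candidates for review. Some will be fine, for example when a translator spells out a number in words.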

Step 7: Build a Human Test Set

Automatic metrics are not enough for novels.

Create 100 hand-picked examples:

| Type | Count |
| --- | --- |
| Dialogue | 20 |
| Description | 20 |
| Action | 20 |
| Emotional scenes | 20 |
| Long paragraphs | 20 |

Score each output from 1 to 5:

| Score | Meaning |
| --- | --- |
| 5 | Publishable |
| 4 | Good, minor edits |
| 3 | Understandable, needs editing |
| 2 | Serious problems |
| 1 | Wrong translation |

The real question is simple:

Does the model reduce editing time?

If a human editor normally needs 30 minutes and your model reduces that to 10 minutes, the fine-tune is useful.
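A small script keeps the human review honest. This sketch averages the 1 to 5 scores per passage type and works out the share of editing time saved. The row layout (`type`, `score`) and both helper names are hypothetical. The sample scores are made up for illustration.

```python
from collections import defaultdict


def summarize_scores(rows):
    """Average the 1-5 human score for each passage type."""
    by_type = defaultdict(list)
    for row in rows:
        by_type[row["type"]].append(row["score"])
    return {name: sum(scores) / len(scores) for name, scores in by_type.items()}


def editing_time_saved(minutes_before, minutes_after):
    """Fraction of editing time saved by starting from the model draft."""
    return (minutes_before - minutes_after) / minutes_before


# Made-up scores for illustration only.
rows = [
    {"type": "Dialogue", "score": 4},
    {"type": "Dialogue", "score": 5},
    {"type": "Long paragraphs", "score": 3},
]
print(summarize_scores(rows))
print(f"{editing_time_saved(30, 10):.0%} of editing time saved")
```

Per-type averages show where to add data next. A low "Long paragraphs" average, for example, points at the longer-chunk fix from Step 6.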

Step 8: Export the Model

Unsloth supports GGUF export.

Example:

model.save_pretrained_gguf(
    "outputs/gemma-sv-translator-gguf",
    tokenizer,
    quantization_method="q8_0",
)

You can also try a smaller file:

model.save_pretrained_gguf(
    "outputs/gemma-sv-translator-gguf-q4",
    tokenizer,
    quantization_method="q4_k_m",
)

Use this rule:

| Format | Best for |
| --- | --- |
| LoRA adapter | More training |
| Merged model | Server deployment |
| GGUF q8_0 | Better local quality |
| GGUF q4_k_m | Smaller local model |

For novel translation, test q8_0 first.

Then compare it with q4_k_m.

Quantization can reduce tone quality.

The Full Pipeline

1. Load OPUS Books English-Swedish.
2. Clean bad translation pairs.
3. Format rows as Gemma chat examples.
4. Fine-tune Gemma 4 E4B with Unsloth LoRA.
5. Translate unseen test passages.
6. Evaluate with BLEU, chrF, and COMET.
7. Ask Claude Opus 4.7 to judge quality.
8. Inspect the worst examples.
9. Improve the dataset.
10. Train again.
11. Export to GGUF.
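Step 8 of this list is where most of the learning happens. Here is a minimal sketch for pulling the lowest-scored rows. It assumes each row already carries a numeric `judge_score` field parsed from the judge reply. That field and the helper name `worst_examples` are hypothetical, since opus_judge.py stores the raw reply text.

```python
def worst_examples(rows, n=20):
    """Return the n lowest-scored rows, skipping rows without a numeric score."""
    scored = [
        row for row in rows
        if isinstance(row.get("judge_score"), (int, float))
    ]
    return sorted(scored, key=lambda row: row["judge_score"])[:n]


# Made-up rows for illustration only.
rows = [
    {"id": 1, "judge_score": 4},
    {"id": 2, "judge_score": 1},
    {"id": 3, "judge_score": None},  # judge reply could not be parsed
    {"id": 4, "judge_score": 3},
]
for row in worst_examples(rows, n=2):
    print(row["id"], row["judge_score"])
```

Read these rows next to the source and the reference. Often the problem is a bad reference, which is a dataset fix, not a model fix.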

Fine-tuning is not one training run. It is data, evals, mistakes, fixes, and retraining. If you want the broader picture, MLOps for LLMs covers the production retrain loop end to end.

Can Gemma 4 E4B Become as Good as Claude Opus 4.7?

Not in general.

Claude Opus 4.7 is a much larger frontier model.

Gemma 4 E4B is a small open-weight model.

But Gemma can become useful for one narrow task:

English novel passage in.

Clean Swedish draft out.

A practical setup is:

| Role | Model |
| --- | --- |
| Cheap daily translation draft | Fine-tuned Gemma 4 E4B |
| Hard passage review | Claude Opus 4.7 |
| Final quality | Human editor |

This gives you:

  • Lower cost
  • Local control
  • Faster drafts
  • Better privacy
  • Measurable quality

You are not replacing the frontier model.

You are moving repeatable work to a smaller model and keeping the larger model for review.

Final Thoughts

For English to Swedish novel translation, the recipe is clear:

  • Use legal parallel data.
  • Clean the dataset carefully.
  • Fine-tune Gemma 4 E4B with Unsloth.
  • Evaluate with BLEU, chrF, COMET, and human review.
  • Use Claude Opus 4.7 as a judge, not as training data.
  • Improve the dataset and repeat.

That is how you build a small local translation model that can be useful in a real translation workflow.

Frequently Asked Questions

Why fine-tune Gemma 4 E4B instead of just calling Claude Opus 4.7?

Per-request cost and latency. A LoRA-tuned Gemma 4 E4B runs on a single GPU, drafts a Swedish passage in seconds, and stays local. Claude Opus 4.7 is the better reviewer; a small specialist is the better draft engine.

Can I fine-tune Gemma 4 E4B on a Mac?

Not with this guide. The Unsloth flow here is meant for CUDA. A Mac can run the exported GGUF model for inference, but training needs an NVIDIA GPU. Unsloth's guide quotes about 17GB of VRAM as a starting point.

Why use Claude as a judge instead of as a teacher?

Anthropic’s terms do not allow training another AI model on Claude outputs without written permission. Scoring translations with Opus 4.7 is fine; generating training labels for Gemma with it is not.

How much does dataset quality matter compared to hyperparameters?

More than anything else. A clean 5K-pair dataset will often beat a noisy 50K-pair dataset, because misaligned rows teach the model wrong translations. Fix the data first, then think about LoRA rank, learning rate, and epochs.

Is OPUS Books safe to ship a commercial product on?

For learning and benchmarking, yes. For a paid translation product, no. Check the license of each book pair and use your own licensed parallel corpus before charging customers.

Sources