# Fine-Tune Gemma 4 E4B for Swedish Translation

By Amir Teymoori - May 24, 2026

---

Fine-tuning works best when the task is narrow.

You do not need a small open-weight model to beat Claude Opus 4.7 at everything. That is a losing fight, and it is the wrong [open-source vs proprietary tradeoff](/choosing-the-right-llm-proprietary-vs-open-source-in-2025/) to make.

You need it to become strong at one job:

Translate English novel passages into clean, natural Swedish.

This guide shows a practical pipeline using:

| Part | Choice |
|---|---|
| Student model | `google/gemma-4-E4B-it` |
| Fine-tuning tool | Unsloth |
| Method | LoRA |
| Dataset | `Helsinki-NLP/opus_books`, `en-sv` |
| Metrics | BLEU, chrF, COMET |
| Judge model | Claude Opus 4.7 |
| Export format | LoRA adapter or GGUF |

The idea is simple:

- Train Gemma 4 E4B on English-Swedish book data you are legally allowed to use.
- Evaluate it with automatic metrics.
- Use Claude Opus 4.7 as a reviewer.
- Improve the dataset.
- Train again.

This is not direct Claude distillation.

Anthropic does not allow using Claude outputs to train or develop another AI model unless you have written permission.

So we will not train Gemma on Claude-generated Swedish translations.

We will use Claude Opus 4.7 as a judge, not as the source of training labels.

That is the safer setup.

## What We Are Building

Input:

```text
He looked at the old house and felt that something was waiting inside.
```

Expected Swedish output:

```text
Han såg på det gamla huset och kände att något väntade där inne.
```

A good novel translation is not word-for-word.

The model must learn:

| Skill | Meaning |
|---|---|
| Accuracy | Keep the original meaning |
| Tone | Preserve the mood |
| Swedish fluency | Sound natural to Swedish readers |
| Dialogue | Keep speech readable |
| Names | Keep character names unchanged |
| Consistency | Translate repeated terms the same way |

This is why literary translation needs a clean dataset and real evaluation.

## Important Model Facts

Gemma 4 E4B is an open-weight instruction model from Google.

The official Gemma 4 model card says the small Gemma 4 models use a 128K context window, while the medium models support 256K.

E4B belongs to the small model group.

Unsloth supports Gemma 4 E2B, E4B, 26B-A4B, and 31B fine-tuning.

Unsloth's E4B LoRA fine-tuning guide quotes around 17GB VRAM; real usage shifts with sequence length, batch size, LoRA rank, and precision, so treat it as a starting point.

Use a Linux machine with an NVIDIA GPU for this tutorial.

A Mac can run local models, but this Unsloth training flow is meant for CUDA.

## Dataset

Use:

```text
Helsinki-NLP/opus_books
```

Config:

```text
en-sv
```

OPUS Books contains aligned book text.

It is useful for learning, research, and prototyping.

For a real product, use your own licensed English novels and licensed Swedish translations.

Do not build a commercial translation product on a dataset unless you have checked the data rights.

## Project Structure

```text
gemma-sv-translator/
  data/
    train.jsonl
    valid.jsonl
    test.jsonl
  scripts/
    prepare_data.py
    train_unsloth.py
    translate_test.py
    eval_metrics.py
    opus_judge.py
  outputs/
```

Install packages:

```bash
pip install --upgrade --force-reinstall --no-cache-dir unsloth unsloth_zoo
pip install -U datasets transformers trl peft accelerate bitsandbytes
pip install -U sacrebleu unbabel-comet anthropic pandas torch
```

For reproducibility, pin package versions after your first successful run. Unsloth and TRL move fast, and a working `requirements.txt` will save you a debugging session later.
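One way to capture those versions is a small helper built on the standard library's `importlib.metadata`. This is a minimal sketch: the `PACKAGES` list simply mirrors the install commands above, and `pinned_lines` is a hypothetical helper name, not part of any of these libraries.

```python
from importlib.metadata import PackageNotFoundError, version

# Packages from the install commands above (adjust to your environment).
PACKAGES = [
    "unsloth", "unsloth_zoo", "datasets", "transformers", "trl", "peft",
    "accelerate", "bitsandbytes", "sacrebleu", "unbabel-comet",
    "anthropic", "pandas", "torch",
]


def pinned_lines(packages):
    """Return 'name==version' lines for every package that is installed."""
    lines = []
    for name in packages:
        try:
            lines.append(f"{name}=={version(name)}")
        except PackageNotFoundError:
            # Skip anything missing from this environment instead of failing.
            continue
    return lines


if __name__ == "__main__":
    # Print the pins; redirect stdout to requirements.txt to save them.
    print("\n".join(pinned_lines(PACKAGES)))
```

Run it once after a successful training run and redirect the output into `requirements.txt`. `pip freeze` gives a fuller list, but this keeps the file limited to the packages the tutorial actually installs.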

Log in to Hugging Face:

```bash
huggingface-cli login
```

You may need to accept the Gemma license on Hugging Face before downloading the model.

## Step 1: Prepare the Dataset

Create:

```text
scripts/prepare_data.py
```

```python
from datasets import load_dataset
from pathlib import Path
import json
import random
import re

OUT_DIR = Path("data")
OUT_DIR.mkdir(exist_ok=True)

SYSTEM_PROMPT = (
    "You are a professional literary translator. "
    "Translate English novel passages into natural Swedish. "
    "Preserve meaning, tone, dialogue, names, and formatting."
)

def clean_text(text: str) -> str:
    text = text.strip()
    text = re.sub(r"\s+", " ", text)
    return text

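def is_good_pair(en: str, sv: str) -> bool:
    # Drop rows where either side is empty.
    if not en or not sv:
        return False

    # Drop very short fragments such as headings or stray words.
    if len(en) < 20 or len(sv) < 20:
        return False

    # Drop very long rows to keep examples at a manageable length.
    if len(en) > 2500 or len(sv) > 2500:
        return False

    # A Swedish/English length ratio far from 1 usually signals misalignment.
    ratio = len(sv) / max(len(en), 1)

    if ratio < 0.4 or ratio > 2.2:
        return False

    return True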

dataset = load_dataset(
    "Helsinki-NLP/opus_books",
    "en-sv",
    split="train",
)

rows = []

for item in dataset:
    pair = item["translation"]

    en = clean_text(pair.get("en", ""))
    sv = clean_text(pair.get("sv", ""))

    if not is_good_pair(en, sv):
        continue

    rows.append({
        "source_en": en,
        "target_sv": sv,
        "messages": [
            {
                "role": "system",
                "content": SYSTEM_PROMPT,
            },
            {
                "role": "user",
                "content": f"Translate this English novel passage to Swedish:\n\n{en}",
            },
            {
                "role": "assistant",
                "content": sv,
            },
        ],
    })

random.seed(3407)
random.shuffle(rows)

train_end = int(len(rows) * 0.85)
valid_end = int(len(rows) * 0.95)

splits = {
    "train": rows[:train_end],
    "valid": rows[train_end:valid_end],
    "test": rows[valid_end:],
}

for split_name, split_rows in splits.items():
    path = OUT_DIR / f"{split_name}.jsonl"

    with path.open("w", encoding="utf-8") as f:
        for row in split_rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")

    print(split_name, len(split_rows))
```

Run:

```bash
python scripts/prepare_data.py
```

This creates:

```text
data/train.jsonl
data/valid.jsonl
data/test.jsonl
```

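The shuffle matters.

Without it, the 85/10/5 cut follows corpus order, so the validation and test sets would likely come from whichever books sit at the end of the file.

The trade-off is that sentences from the same book now land in both train and test, so test scores say more about in-domain quality than about unseen books.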
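A random split also does nothing about repeated sentences. If the corpus contains the same English line more than once, one copy can land in train and another in test. The sketch below shows one way to check for that overlap. It assumes rows shaped like the ones `prepare_data.py` writes (a `source_en` key); `find_leaks` and `drop_leaks` are hypothetical helper names.

```python
def normalize(text):
    """Lowercase and collapse whitespace so near-identical rows compare equal."""
    return " ".join(text.lower().split())


def find_leaks(train_rows, held_out_rows):
    """Return held-out English sources that also occur in the training rows."""
    train_sources = {normalize(row["source_en"]) for row in train_rows}
    return [
        row["source_en"]
        for row in held_out_rows
        if normalize(row["source_en"]) in train_sources
    ]


def drop_leaks(train_rows, held_out_rows):
    """Keep only held-out rows whose source is absent from the training rows."""
    train_sources = {normalize(row["source_en"]) for row in train_rows}
    return [
        row
        for row in held_out_rows
        if normalize(row["source_en"]) not in train_sources
    ]


# Toy rows for illustration only.
train = [
    {"source_en": "He nodded slowly and said nothing."},
    {"source_en": "The rain had stopped by morning."},
]
test = [
    {"source_en": "He  nodded slowly and said nothing."},  # duplicate, extra space
    {"source_en": "She closed the book and sighed."},
]

leaks = find_leaks(train, test)
clean_test = drop_leaks(train, test)
print(len(leaks), "leaked rows;", len(clean_test), "rows kept")
```

In the real pipeline you would load `data/train.jsonl` and `data/test.jsonl` into lists of dicts and run the same two functions before training.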

## Step 2: Fine-Tune Gemma 4 E4B with Unsloth

Create:

```text
scripts/train_unsloth.py
```

```python
import torch
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer
from unsloth import FastLanguageModel

MODEL_ID = "google/gemma-4-E4B-it"
MAX_SEQ_LENGTH = 2048

dataset = load_dataset(
    "json",
    data_files={
        "train": "data/train.jsonl",
        "valid": "data/valid.jsonl",
    },
)

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=MODEL_ID,
    max_seq_length=MAX_SEQ_LENGTH,
    load_in_4bit=False,
    load_in_16bit=True,
    full_finetuning=False,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
    use_gradient_checkpointing="unsloth",
    random_state=3407,
    max_seq_length=MAX_SEQ_LENGTH,
)

def format_example(example):
    text = tokenizer.apply_chat_template(
        example["messages"],
        tokenize=False,
        add_generation_prompt=False,
    )

    return {"text": text}

train_ds = dataset["train"].map(
    format_example,
    remove_columns=dataset["train"].column_names,
)

valid_ds = dataset["valid"].map(
    format_example,
    remove_columns=dataset["valid"].column_names,
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_ds,
    eval_dataset=valid_ds,
    args=SFTConfig(
        output_dir="outputs/gemma-sv-translator",
        dataset_text_field="text",
        max_seq_length=MAX_SEQ_LENGTH,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        learning_rate=2e-4,
        num_train_epochs=3,
        warmup_ratio=0.05,
        logging_steps=10,
        eval_steps=50,
        save_steps=100,
        eval_strategy="steps",
        save_strategy="steps",
        optim="adamw_8bit",
        bf16=torch.cuda.is_available() and torch.cuda.is_bf16_supported(),
        fp16=torch.cuda.is_available() and not torch.cuda.is_bf16_supported(),
        seed=3407,
        report_to="none",
    ),
)

trainer.train()

model.save_pretrained("outputs/gemma-sv-translator-lora")
tokenizer.save_pretrained("outputs/gemma-sv-translator-lora")
```

Run:

```bash
python scripts/train_unsloth.py
```

Start with:

```python
MAX_SEQ_LENGTH = 2048
```

After the full pipeline works, try:

```python
MAX_SEQ_LENGTH = 4096
```

Longer passages may help novel translation.

They also need more VRAM.
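Before raising the limit, it helps to know how many rows actually exceed it. The sketch below reports the longest row, the 95th percentile, and the share of rows over each limit. `length_report` is a hypothetical helper, and `count_tokens` is a placeholder for whatever function wraps your tokenizer. The whitespace counter here is only a stand-in and will usually undercount subword tokens.

```python
import math


def length_report(texts, count_tokens, limits=(2048, 4096)):
    """Summarize token lengths and the share of texts above each limit."""
    lengths = sorted(count_tokens(text) for text in texts)
    if not lengths:
        raise ValueError("no texts given")

    # Nearest-rank 95th percentile.
    p95_index = max(math.ceil(0.95 * len(lengths)) - 1, 0)
    report = {"max": lengths[-1], "p95": lengths[p95_index]}

    for limit in limits:
        over = sum(1 for n in lengths if n > limit)
        report[f"share_over_{limit}"] = over / len(lengths)

    return report


def whitespace_count(text):
    """Stand-in tokenizer: counts whitespace-separated words."""
    return len(text.split())


# Toy data with small limits so the numbers are easy to follow.
sample = ["a b c", "a b c d e f", "a " * 10]
report = length_report(sample, whitespace_count, limits=(5, 8))
print(report)
```

If almost no rows exceed 2048 tokens, a longer `MAX_SEQ_LENGTH` costs VRAM without changing what the model sees.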

## Step 3: Generate Test Translations

Create:

```text
scripts/translate_test.py
```

```python
import json
from pathlib import Path
from unsloth import FastLanguageModel

MODEL_PATH = "outputs/gemma-sv-translator-lora"
MAX_SEQ_LENGTH = 2048

SYSTEM_PROMPT = (
    "You are a professional literary translator. "
    "Translate English novel passages into natural Swedish. "
    "Preserve meaning, tone, dialogue, names, and formatting."
)

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=MODEL_PATH,
    max_seq_length=MAX_SEQ_LENGTH,
    load_in_4bit=False,
    load_in_16bit=True,
)

FastLanguageModel.for_inference(model)

def translate(en: str) -> str:
    messages = [
        {
            "role": "system",
            "content": SYSTEM_PROMPT,
        },
        {
            "role": "user",
            "content": f"Translate this English novel passage to Swedish:\n\n{en}",
        },
    ]

    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    output = model.generate(
        **inputs,
        max_new_tokens=900,
        do_sample=False,
    )

    generated_tokens = output[0][inputs["input_ids"].shape[-1]:]
    return tokenizer.decode(generated_tokens, skip_special_tokens=True).strip()

Path("outputs").mkdir(exist_ok=True)

with open("data/test.jsonl", "r", encoding="utf-8") as f_in, \
     open("outputs/predictions.jsonl", "w", encoding="utf-8") as f_out:

    for line in f_in:
        row = json.loads(line)
        prediction = translate(row["source_en"])

        f_out.write(json.dumps({
            "source_en": row["source_en"],
            "reference_sv": row["target_sv"],
            "prediction_sv": prediction,
        }, ensure_ascii=False) + "\n")
```

Run:

```bash
python scripts/translate_test.py
```

You now have:

```text
outputs/predictions.jsonl
```

## Step 4: Evaluate with BLEU, chrF, and COMET

BLEU and chrF compare the model output with the human Swedish reference.

COMET checks the source, prediction, and reference together.

For translation work, COMET is often more useful than BLEU alone. If you want a stronger pattern for compiling and optimizing eval pipelines around metrics like these, [DSPy](/dspy-3-build-evaluate-optimize-llm-pipelines/) is a good next step.

Create:

```text
scripts/eval_metrics.py
```

```python
import json
import torch
import sacrebleu
from comet import download_model, load_from_checkpoint

sources = []
references = []
predictions = []

with open("outputs/predictions.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        row = json.loads(line)

        sources.append(row["source_en"])
        references.append(row["reference_sv"])
        predictions.append(row["prediction_sv"])

bleu = sacrebleu.corpus_bleu(predictions, [references])
chrf = sacrebleu.corpus_chrf(predictions, [references])

print("BLEU:", round(bleu.score, 2))
print("chrF:", round(chrf.score, 2))

model_path = download_model("Unbabel/wmt22-comet-da")
comet_model = load_from_checkpoint(model_path)

comet_data = [
    {
        "src": source,
        "mt": prediction,
        "ref": reference,
    }
    for source, prediction, reference in zip(sources, predictions, references)
]

scores = comet_model.predict(
    comet_data,
    batch_size=8,
    gpus=1 if torch.cuda.is_available() else 0,
)

print("COMET:", round(scores.system_score, 4))
```

Run:

```bash
python scripts/eval_metrics.py
```

Track your results like this:

| Model | BLEU | chrF | COMET | Notes |
|---|---:|---:|---:|---|
| Base Gemma 4 E4B | Run first | Run first | Run first | Baseline |
| Fine-tuned Gemma 4 E4B | Compare | Compare | Compare | Your trained model |
| Claude Opus 4.7 | Judge only | Judge only | Judge only | Do not train on its outputs |

Do not publish fake scores.

Run the eval and show your real numbers.
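Corpus-level scores do not tell you which rows are failing. A quick way to find candidates for manual inspection is to rank rows by a per-row similarity score. The sketch below uses a simplified character n-gram F-score in the spirit of chrF. It is not sacrebleu's implementation and its numbers will not match `corpus_chrf`, so use it only for sorting. `char_ngram_f` and `worst_rows` are hypothetical helper names.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams, ignoring spaces."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def char_ngram_f(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified chrF-style score in [0, 1]; recall is weighted by beta."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one side is too short for this n-gram order
        overlap = sum((hyp & ref).values())
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))

    if not precisions:
        return 0.0

    precision = sum(precisions) / len(precisions)
    recall = sum(recalls) / len(recalls)
    if precision + recall == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)


def worst_rows(rows, k=20):
    """Return the k rows with the lowest prediction/reference similarity."""
    return sorted(
        rows,
        key=lambda row: char_ngram_f(row["prediction_sv"], row["reference_sv"]),
    )[:k]


# Toy rows shaped like outputs/predictions.jsonl.
rows = [
    {"reference_sv": "Hon öppnade dörren långsamt.",
     "prediction_sv": "Hon öppnade dörren långsamt."},
    {"reference_sv": "Han såg på det gamla huset.",
     "prediction_sv": "Kapitel tre."},
]
print(worst_rows(rows, k=1)[0]["prediction_sv"])
```

Read the lowest-scoring rows by hand. Some will be bad translations and some will be bad references, and the second kind points at rows to remove from the dataset.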

## Step 5: Use Claude Opus 4.7 as a Judge

Claude Opus 4.7 is useful for quality review. For the full pattern (golden datasets, regression back-testing, release gates), see my [LLM-as-judge eval pipeline guide](/llm-as-judge-eval-pipelines-customer-support-ai/).

Use it to check:

- Meaning
- Swedish fluency
- Literary tone
- Dialogue quality
- Name preservation
- Formatting preservation

Create:

```text
scripts/opus_judge.py
```

```python
import json
import os
from anthropic import Anthropic

client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

def judge_translation(source_en: str, reference_sv: str, prediction_sv: str) -> str:
    prompt = f"""
You are evaluating an English to Swedish literary translation.

Score the model translation from 1 to 5.

Criteria:
1. Meaning preservation
2. Natural Swedish
3. Literary tone
4. Dialogue quality
5. Name and formatting preservation

Return only valid JSON with these fields:
score, meaning_errors, fluency_errors, tone_errors, final_comment

English source:
{source_en}

Human Swedish reference:
{reference_sv}

Model Swedish translation:
{prediction_sv}
""".strip()

    message = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=700,
        messages=[
            {
                "role": "user",
                "content": prompt,
            }
        ],
    )

    return message.content[0].text

with open("outputs/predictions.jsonl", "r", encoding="utf-8") as f_in, \
     open("outputs/opus_judged.jsonl", "w", encoding="utf-8") as f_out:

    for line in f_in:
        row = json.loads(line)

        row["opus_judge"] = judge_translation(
            row["source_en"],
            row["reference_sv"],
            row["prediction_sv"],
        )

        f_out.write(json.dumps(row, ensure_ascii=False) + "\n")
```

Run:

```bash
export ANTHROPIC_API_KEY="your_key_here"
python scripts/opus_judge.py
```

Do not set `temperature`, `top_p`, or `top_k` for Opus 4.7. Anthropic says non-default [sampling parameters](/llm-parameters-explained-temperature-top-p-top-k/) return a 400 error on Opus 4.7. Prompt the model clearly instead.
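The judge script stores Claude's reply as a raw string, and a reply that was asked to be "only valid JSON" can still arrive wrapped in a code fence or malformed. A small parser makes the scores usable. This is a sketch under the assumption that the reply follows the field names requested in the prompt above; `parse_judge` and `summarize` are hypothetical helper names.

```python
import json

FENCE = "`" * 3  # Markdown code fence marker


def parse_judge(raw):
    """Parse a judge reply into a dict, or return None if it is unusable."""
    text = raw.strip()

    # Unwrap a Markdown code fence if the model added one around the JSON.
    if text.startswith(FENCE):
        text = text.strip("`").strip()
        if text.lower().startswith("json"):
            text = text[4:]

    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return None

    if not isinstance(data, dict):
        return None

    score = data.get("score")
    is_number = isinstance(score, (int, float)) and not isinstance(score, bool)
    if not is_number or not 1 <= score <= 5:
        return None

    return data


def summarize(raw_replies):
    """Count parsed and failed replies and average the valid scores."""
    parsed = [p for p in (parse_judge(r) for r in raw_replies) if p is not None]
    scores = [p["score"] for p in parsed]
    return {
        "parsed": len(parsed),
        "failed": len(raw_replies) - len(parsed),
        "mean_score": sum(scores) / len(scores) if scores else None,
    }


# Toy replies for illustration only.
replies = [
    '{"score": 4, "final_comment": "Minor tone drift."}',
    FENCE + 'json\n{"score": 2}\n' + FENCE,
    "Sorry, I cannot score this.",
]
summary = summarize(replies)
print(summary)
```

In practice you would feed it the `opus_judge` field from each line of `outputs/opus_judged.jsonl`. Keep an eye on the `failed` count: a high number means the prompt needs tightening before the scores mean anything.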

## Step 6: Improve the Dataset

Fine-tuning quality mostly comes from data quality.

Fix the data before changing hyperparameters.

| Problem | Fix |
|---|---|
| Bad alignment | Remove the row |
| Old Swedish style | Add modern licensed examples |
| Missing dialogue | Add more dialogue passages |
| Names get translated | Add examples with names |
| Long passages fail | Train with longer chunks |
| Tone feels flat | Add better literary references |

Good row:

```json
{
  "source_en": "She opened the door slowly, afraid of what she might find.",
  "target_sv": "Hon öppnade dörren långsamt, rädd för vad hon skulle kunna hitta."
}
```

Bad row:

```json
{
  "source_en": "She opened the door slowly.",
  "target_sv": "Kapitel tre."
}
```

Bad rows hurt the model.

Remove them.
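Some problems in the table can be flagged automatically before a human looks at them. As one example, the sketch below checks whether capitalized words from the English source still appear in the Swedish target. It is a rough heuristic, not a named-entity recognizer. It skips words that start a sentence or a quote, and it will flag legitimate changes such as place names with Swedish forms, so treat the output as a review list. `candidate_names` and `missing_names` are hypothetical helper names.

```python
import re

QUOTES = "\"'“”‘’"


def candidate_names(en):
    """Collect capitalized words that do not start a sentence or a quote."""
    names = set()
    for match in re.finditer(r"[^\W\d_]+", en):
        word = match.group()
        before = en[:match.start()].rstrip()
        starts_sentence = (
            not before or before[-1] in ".!?:" or before[-1] in QUOTES
        )
        # len > 1 also skips the English pronoun "I".
        if word[0].isupper() and len(word) > 1 and not starts_sentence:
            names.add(word)
    return names


def missing_names(en, sv):
    """Return likely names from the source that are absent from the target."""
    return sorted(name for name in candidate_names(en) if name not in sv)


# Toy pairs for illustration only.
kept = missing_names("He looked at Anna and smiled.", "Han såg på Anna och log.")
lost = missing_names("He looked at Anna.", "Han såg på henne.")
print(kept, lost)
```

The same function works on model output: pass `prediction_sv` instead of `target_sv` to see where the fine-tuned model drops or translates a name.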

## Step 7: Build a Human Test Set

Automatic metrics are not enough for novels.

Create 100 hand-picked examples:

| Type | Count |
|---|---:|
| Dialogue | 20 |
| Description | 20 |
| Action | 20 |
| Emotional scenes | 20 |
| Long paragraphs | 20 |

Score each output from 1 to 5:

| Score | Meaning |
|---|---|
| 5 | Publishable |
| 4 | Good, minor edits |
| 3 | Understandable, needs editing |
| 2 | Serious problems |
| 1 | Wrong translation |

The real question is simple:

Does the model reduce editing time?

If a human editor normally needs 30 minutes and your model reduces that to 10 minutes, the fine-tune is useful.
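The scoring sheet and the editing-time question reduce to a little arithmetic. The sketch below averages human scores per passage type and computes the share of editing time saved. The `type` and `score` field names are assumptions about how you record the sheet, and both function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def summarize_scores(rows):
    """Average the 1-5 human scores for each passage type."""
    by_type = defaultdict(list)
    for row in rows:
        by_type[row["type"]].append(row["score"])
    return {kind: round(mean(scores), 2) for kind, scores in by_type.items()}


def editing_time_saved(before_minutes, after_minutes):
    """Return the fraction of editing time saved, e.g. 0.5 for half."""
    if before_minutes <= 0:
        raise ValueError("before_minutes must be positive")
    return (before_minutes - after_minutes) / before_minutes


# Toy scoring sheet for illustration only.
sheet = [
    {"type": "Dialogue", "score": 4},
    {"type": "Dialogue", "score": 5},
    {"type": "Long paragraphs", "score": 3},
]
averages = summarize_scores(sheet)
saved = editing_time_saved(30, 10)
print(averages, f"{saved:.0%} of editing time saved")
```

For the example above, going from 30 minutes to 10 minutes is a two-thirds reduction. Per-type averages show where that gain comes from and which passage types still need better data.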

## Step 8: Export the Model

Unsloth supports GGUF export.

Example, run at the end of `train_unsloth.py` while `model` and `tokenizer` are still in memory:

```python
model.save_pretrained_gguf(
    "outputs/gemma-sv-translator-gguf",
    tokenizer,
    quantization_method="q8_0",
)
```

You can also try a smaller file:

```python
model.save_pretrained_gguf(
    "outputs/gemma-sv-translator-gguf-q4",
    tokenizer,
    quantization_method="q4_k_m",
)
```

Use this rule:

| Format | Best for |
|---|---|
| LoRA adapter | More training |
| Merged model | Server deployment |
| GGUF q8_0 | Better local quality |
| GGUF q4_k_m | Smaller local model |

For novel translation, test `q8_0` first.

Then compare it with `q4_k_m`.

Quantization can reduce tone quality.

## The Full Pipeline

```text
1. Load OPUS Books English-Swedish.
2. Clean bad translation pairs.
3. Format rows as Gemma chat examples.
4. Fine-tune Gemma 4 E4B with Unsloth LoRA.
5. Translate unseen test passages.
6. Evaluate with BLEU, chrF, and COMET.
7. Ask Claude Opus 4.7 to judge quality.
8. Inspect the worst examples.
9. Improve the dataset.
10. Train again.
11. Export to GGUF.
```

Fine-tuning is not one training run. It is data, evals, mistakes, fixes, and retraining. If you want the broader picture, [MLOps for LLMs](/mlops-for-llms-how-to-ship-ai-features-without-breaking-production/) covers the production retrain loop end to end.

## Can Gemma 4 E4B Become as Good as Claude Opus 4.7?

Not in general.

Claude Opus 4.7 is a much larger frontier model.

Gemma 4 E4B is a small open-weight model.

But Gemma can become useful for one narrow task:

English novel passage in.

Clean Swedish draft out.

A practical setup is:

| Role | Model |
|---|---|
| Cheap daily translation draft | Fine-tuned Gemma 4 E4B |
| Hard passage review | Claude Opus 4.7 |
| Final quality | Human editor |

This gives you:

- Lower cost
- Local control
- Faster drafts
- Better privacy
- Measurable quality

You are not replacing the frontier model.

You are moving repeatable work to a smaller model and keeping the larger model for review.

## Final Thoughts

For English to Swedish novel translation, the recipe is clear:

- Use parallel data you are legally allowed to train on.
- Clean the dataset carefully.
- Fine-tune Gemma 4 E4B with Unsloth.
- Evaluate with BLEU, chrF, COMET, and human review.
- Use Claude Opus 4.7 as a judge, not as training data.
- Improve the dataset and repeat.

That is how you build a small local translation model that can be useful in a real translation workflow.

## Frequently Asked Questions

### Why fine-tune Gemma 4 E4B instead of just calling Claude Opus 4.7?

Per-request cost and latency. A LoRA-tuned Gemma 4 E4B runs on a single GPU, drafts a Swedish passage in seconds, and stays local. Claude Opus 4.7 is the better reviewer; a small specialist is the better draft engine.

### Can I fine-tune Gemma 4 E4B on a Mac?

No. The Unsloth flow in this guide is CUDA-only. A Mac can run the exported GGUF model for inference, but training needs an NVIDIA GPU with about 17GB of VRAM.

### Why use Claude as a judge instead of as a teacher?

Anthropic's usage policy does not allow training another AI model on Claude outputs without written permission. Scoring translations with Opus 4.7 is fine; generating training labels for Gemma with it is not.

### How much does dataset quality matter compared to hyperparameters?

More than anything else. A clean 5K-pair dataset will usually beat a noisy 50K-pair one. Fix the data first, then think about LoRA rank, learning rate, and epochs.

### Is OPUS Books safe to ship a commercial product on?

For learning and benchmarking, yes. For a paid translation product, no. Check the license of each book pair and use your own licensed parallel corpus before charging customers.

## Sources

- [Gemma 4 E4B on Hugging Face](https://huggingface.co/google/gemma-4-E4B-it)
- [Unsloth Gemma 4 fine-tuning guide](https://unsloth.ai/docs/models/gemma-4/train)
- [Claude models overview](https://docs.anthropic.com/en/docs/about-claude/models)
- [Claude Opus 4.7 migration guide](https://platform.claude.com/docs/en/about-claude/models/migration-guide)
- [Anthropic policy on using Claude outputs for training](https://support.claude.com/en/articles/12326764-can-i-use-my-outputs-to-train-an-ai-model)
- [OPUS Books dataset on Hugging Face](https://huggingface.co/datasets/Helsinki-NLP/opus_books)
- [COMET WMT22 model on Hugging Face](https://huggingface.co/Unbabel/wmt22-comet-da)