⚠️ Disclaimer: This is ongoing academic research — the methodology and results are subject to change, and new models will be added over time.
IFEval-TR is a Turkish-language adaptation of Google Research's IFEval (Instruction Following Evaluation) benchmark. It measures how well an LLM follows verifiable instructions in Turkish — constraints such as "respond in at most 200 words", "include the keyword X", "use exactly 3 bullet points", or "write the entire response in lowercase". Every constraint is checked programmatically by a deterministic verifier; no LLM judge is involved in scoring.
Datasets (910 verified examples total):
- Translated IFEval — 498 examples directly translated to Turkish from the original 541-example English IFEval set, with language-dependent parameters (keywords, end phrases, forbidden words) also translated.
- Turkish-Specific IFEval — 412 examples generated from scratch around 50 Turkish cultural and historical topics (Ottoman history, Mevlana, Cappadocia, Turkish cuisine, etc.), each combining 1-3 randomly chosen constraints.
All examples were filtered through a multi-layer pipeline (rule-based logic, a Z3 SMT solver, a verifier-in-the-loop synthesizer, and an LLM committee) to remove any prompt whose constraints cannot be satisfied jointly. Each remaining example carries a deterministic certificate that its constraints are satisfiable.
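To illustrate the kind of check a rule-based layer might perform, here is a minimal sketch that flags a few obviously contradictory constraint combinations. The constraint IDs (`case:lowercase`, `length:min_words`, and so on), the dict layout, and the function name `find_conflict` are hypothetical and chosen only for illustration; this is not the project's actual pipeline, and it does not use Z3.

```python
from typing import Optional


def find_conflict(constraints: list) -> Optional[str]:
    """Return a reason if the constraints cannot all hold at once, else None.

    Each constraint is a dict with a hypothetical "id" plus parameters.
    """
    ids = {c["id"] for c in constraints}

    # An all-lowercase response cannot also be all-uppercase.
    if {"case:lowercase", "case:uppercase"} <= ids:
        return "lowercase and uppercase requested together"

    # The tightest bounds must leave at least one valid word count.
    min_words = max(
        (c["n"] for c in constraints if c["id"] == "length:min_words"), default=0
    )
    max_words = min(
        (c["n"] for c in constraints if c["id"] == "length:max_words"), default=None
    )
    if max_words is not None and min_words > max_words:
        return f"min_words {min_words} exceeds max_words {max_words}"

    # A keyword cannot be both required and forbidden.
    required = {c["word"] for c in constraints if c["id"] == "keywords:include"}
    forbidden = {c["word"] for c in constraints if c["id"] == "keywords:forbidden"}
    clash = required & forbidden
    if clash:
        return f"keyword both required and forbidden: {sorted(clash)[0]}"

    return None
```

Pairwise rules like these only catch simple contradictions; interactions between three or more constraints are the sort of case where a solver-based layer would be needed.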
Methodology: Each model receives the Turkish prompt as-is and produces a single response. The response is then verified against all constraints in the prompt under two scoring tiers:
- Strict — exact string matching, no morphological relaxation.
- Loose — Turkish morphology-aware via zemberek-python (lemmatization, Turkish-aware case handling for İ/I and ı/i, flexible format detection).
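The İ/I and ı/i handling matters because Python's built-in `str.lower()` and `str.upper()` follow the default Unicode mappings, not the Turkish ones. The sketch below shows one common way to get Turkish-aware casing with the standard library only. The helper names `tr_lower`, `tr_upper`, and `is_tr_lowercase` are assumptions for illustration and are not taken from the benchmark's code or from zemberek-python.

```python
def tr_lower(text: str) -> str:
    """Lowercase using Turkish rules: I -> ı and İ -> i."""
    # Map the two special letters first, then let str.lower() do the rest.
    return text.replace("I", "ı").replace("İ", "i").lower()


def tr_upper(text: str) -> str:
    """Uppercase using Turkish rules: i -> İ and ı -> I."""
    return text.replace("i", "İ").replace("ı", "I").upper()


def is_tr_lowercase(text: str) -> bool:
    """True if the text is unchanged by Turkish-aware lowercasing."""
    return text == tr_lower(text)
```

With these helpers, `tr_lower("IŞIK")` yields `"ışık"`, whereas the built-in `"IŞIK".lower()` yields `"işik"`, which is a different word form in Turkish.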
Headline score: Because Turkish is agglutinative, strict matching unfairly penalizes morphologically equivalent responses. Average (loose), the unweighted mean of the prompt-level loose accuracies on the Translated and Turkish-Specific datasets, is therefore the primary metric and the default leaderboard sort key. Strict scores are shown alongside for transparency.
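As a rough sketch of the arithmetic, the code below computes a prompt-level accuracy (a prompt passes only if all of its constraints pass) and then averages the two dataset scores. The function names are hypothetical. The unweighted mean and half-up rounding to two decimals are assumptions that happen to reproduce the published leaderboard values; the leaderboard itself may average unrounded scores.

```python
from decimal import Decimal, ROUND_HALF_UP


def prompt_level_accuracy(results: list) -> float:
    """Percentage of prompts for which every constraint check passed.

    `results` holds one list of booleans per prompt, one boolean per constraint.
    """
    passed = sum(1 for checks in results if all(checks))
    return 100 * passed / len(results)


def average_loose(translated: str, turkish: str) -> Decimal:
    """Unweighted mean of the two dataset scores, rounded to two decimals."""
    # Decimal avoids binary floating-point surprises at the .xx5 boundary.
    mean = (Decimal(translated) + Decimal(turkish)) / 2
    return mean.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
```

For the first leaderboard row, `average_loose("69.94", "88.27")` gives `Decimal("79.11")`, matching the Average (loose) column. Note that the mean is not weighted by dataset size, so the 412-example Turkish-Specific set counts as much as the 498-example Translated set.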
Constraint types (25 total): length (words/sentences/paragraphs), keyword inclusion / forbidden words / letter and keyword frequency, formatting (bullets, JSON, titles, sections, highlights, placeholders), casing (Turkish uppercase/lowercase, capital-word count), punctuation (no-comma), combination (two responses, repeat prompt), and start/end markers (quotation, end phrase, postscript).
Original IFEval: Zhou et al. 2023 (arXiv:2311.07911). Turkish morphology: Zemberek (Akın & Akın, 2007). Code & data: github.com/atahanuz/Turkish-IFEval.
Prepared by Atahan Uz.
Translated — from the original English IFEval:
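2000 yılında dünyanın en yüksek gökdeleni hakkında yaz. "gökdelen" kelimesini en az 8 kez kullan. "Başka yardımcı olabileceğim bir şey var mı?" ile bitir.
(English: Write about the world's tallest skyscraper in the year 2000. Use the word "gökdelen" (skyscraper) at least 8 times. End with "Başka yardımcı olabileceğim bir şey var mı?" (Is there anything else I can help with?).)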
Verifier checks:
- keywords:frequency (keyword="gökdelen", ≥ 8 occurrences)
- startend:end_checker (response ends with "Başka yardımcı olabileceğim bir şey var mı?")
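A strict-tier version of these two checks could look roughly like the sketch below. It assumes whole-word, case-sensitive matching for keywords and ignores trailing whitespace for the end phrase; the actual verifier's matching rules may differ, and the function names are hypothetical.

```python
import re


def count_keyword(response: str, keyword: str) -> int:
    """Count whole-word, case-sensitive occurrences of `keyword`.

    Assumption: a keyword followed or preceded by another word character
    (for example a Turkish suffix) is not counted.
    """
    pattern = rf"(?<!\w){re.escape(keyword)}(?!\w)"
    return len(re.findall(pattern, response))


def keyword_at_least(response: str, keyword: str, minimum: int) -> bool:
    """True if `keyword` occurs at least `minimum` times."""
    return count_keyword(response, keyword) >= minimum


def ends_with(response: str, phrase: str) -> bool:
    """True if the response ends with `phrase`, ignoring trailing whitespace."""
    return response.rstrip().endswith(phrase)
```

Under this assumption the suffixed form "gökdeleni" is not counted as "gökdelen", which is the kind of mismatch a morphology-aware tier is meant to forgive.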
Created for Turkish — native Turkish prompt:
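Ramazan geleneklerini 5 paragrafta anlat. 3. paragraf "Özellikle" ile başlasın. "barış" kelimesi tam 3 kez geçsin.
(English: Describe Ramadan traditions in 5 paragraphs. Paragraph 3 should start with "Özellikle" (Especially). The word "barış" (peace) should appear exactly 3 times.)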
Verifier checks:
- length_constraints:nth_paragraph_first_word (5 paragraphs, paragraph 3 starts with "Özellikle")
- keywords:frequency (keyword="barış", exactly 3 occurrences)
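The paragraph check can be sketched as follows, assuming paragraphs are separated by blank lines and that punctuation attached to the first word is ignored. Both are assumptions, as is the function signature; the benchmark's verifier may split and compare differently.

```python
import re


def nth_paragraph_first_word(
    response: str, num_paragraphs: int, nth: int, first_word: str
) -> bool:
    """Check the paragraph count and the first word of the nth paragraph (1-based)."""
    # Assumption: one or more blank lines separate paragraphs.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", response.strip()) if p.strip()]
    if len(paragraphs) != num_paragraphs or not 1 <= nth <= num_paragraphs:
        return False
    # Compare the first token with surrounding punctuation removed.
    first_token = paragraphs[nth - 1].split()[0].strip(".,;:!?\"'")
    return first_token == first_word
```

The comparison here is case-sensitive, in the spirit of the strict tier; a looser variant would normalize case with Turkish-aware rules before comparing.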
| Rank | Model | Average (loose) | Translated (loose) | Turkish (loose) | Translated (strict) | Turkish (strict) |
|---|---|---|---|---|---|---|
| 1 | Qwen/Qwen3.5-27B-FP8 | 79.11 | 69.94 | 88.27 | 69.10 | 80.00 |
| 2 | Qwen/Qwen3.5-9B | 75.83 | 66.55 | 85.11 | 64.79 | 76.60 |
| 3 | openai/gpt-oss-120b | 69.74 | 66.73 | 72.75 | 62.50 | 63.26 |
| 4 | google/gemma-3-27b-it | 65.56 | 60.24 | 70.87 | 56.22 | 61.41 |
| 5 | google/gemma-3-12b-it | 64.32 | 60.44 | 68.20 | 56.43 | 59.95 |
| 6 | openai/gpt-oss-20b | 62.70 | 61.80 | 63.59 | 58.37 | 54.37 |
| 7 | gpt-4o-mini | 58.30 | 56.40 | 60.20 | 53.80 | 50.00 |
| 8 | google/gemma-3-4b-it | 58.00 | 53.61 | 62.38 | 47.99 | 52.91 |
| 9 | ytu-ce-cosmos/Turkish-Gemma-9b-v0.1 | 52.00 | 50.60 | 53.40 | 47.39 | 45.63 |
| 10 | ytu-ce-cosmos/Turkish-Gemma-9b-T1 | 48.11 | 48.39 | 47.82 | 41.77 | 39.81 |