About the Benchmark

⚠️ Disclaimer: This is ongoing academic research — the methodology and results are subject to change, and new models will be added over time.

IFEval-TR is a Turkish-language adaptation of Google Research's IFEval (Instruction Following Evaluation) benchmark. It measures how well an LLM follows verifiable instructions in Turkish — constraints such as "respond in at most 200 words", "include the keyword X", "use exactly 3 bullet points", or "write the entire response in lowercase". Every constraint is checked programmatically by a deterministic verifier; no LLM judge is involved in scoring.

Datasets (910 verified examples total):

  • Translated IFEval — 498 examples directly translated to Turkish from the original 541-example English IFEval set, with language-dependent parameters (keywords, end phrases, forbidden words) also translated.
  • Turkish-Specific IFEval — 412 examples generated from scratch around 50 Turkish cultural and historical topics (Ottoman history, Mevlana, Cappadocia, Turkish cuisine, etc.), each combining 1-3 randomly chosen constraints.

All examples were filtered through a multi-layer pipeline (rule-based logic + Z3 SMT solver + verifier-in-the-loop synthesizer + LLM committee) to remove any prompt whose constraints cannot jointly be satisfied. Each remaining example has a deterministic certificate of possibility.

Methodology: Each model receives the Turkish prompt as-is and produces a single response. The response is then verified against all constraints in the prompt under two scoring tiers:

  • Strict — exact string matching, no morphological relaxation.
  • Loose — Turkish morphology-aware via zemberek-python (lemmatization, Turkish-aware case handling for İ/I and ı/i, flexible format detection).

Headline score: Because Turkish is agglutinative, strict matching unfairly penalizes morphologically-equivalent responses. Average (loose) — the mean of prompt-level loose accuracy on the Translated and Turkish-Specific datasets — is the primary metric and the default leaderboard sort key. Strict scores are shown alongside for transparency.

Constraint types (25 total): length (words/sentences/paragraphs), keyword inclusion / forbidden words / letter and keyword frequency, formatting (bullets, JSON, titles, sections, highlights, placeholders), casing (Turkish uppercase/lowercase, capital-word count), punctuation (no-comma), combination (two responses, repeat prompt), and start/end markers (quotation, end phrase, postscript).

Original IFEval: Zhou et al. 2023 (arXiv:2311.07911). Turkish morphology: Zemberek (Akın & Akın, 2007). Code & data: github.com/atahanuz/Turkish-IFEval.

Prepared by Atahan Uz.

Examples

Translated — from the original English IFEval:

2000 yılında dünyanın en yüksek gökdeleni hakkında yaz. "gökdelen" kelimesini en az 8 kez kullan. "Başka yardımcı olabileceğim bir şey var mı?" ile bitir.

Verifier checks: keywords:frequency (keyword="gökdelen", ≥ 8 occurrences), startend:end_checker (response ends with "Başka yardımcı olabileceğim bir şey var mı?").

Created for Turkish — native Turkish prompt:

Ramazan geleneklerini 5 paragrafta anlat. 3. paragraf "Özellikle" ile başlasın. "barış" kelimesi tam 3 kez geçsin.

Verifier checks: length_constraints:nth_paragraph_first_word (5 paragraphs, paragraph 3 starts with "Özellikle"), keywords:frequency (keyword="barış", exactly 3 occurrences).

Leaderboard
Rank Model Average (loose) Translated (loose) Turkish (loose) Translated (strict) Turkish (strict)
1 Qwen/Qwen3.5-27B-FP8 79.11 69.94 88.27 69.1 80.0
2 Qwen/Qwen3.5-9B 75.83 66.55 85.11 64.79 76.6
3 openai/gpt-oss-120b 69.74 66.73 72.75 62.5 63.26
4 google/gemma-3-27b-it 65.56 60.24 70.87 56.22 61.41
5 google/gemma-3-12b-it 64.32 60.44 68.2 56.43 59.95
6 openai/gpt-oss-20b 62.70 61.8 63.59 58.37 54.37
7 gpt-4o-mini 58.30 56.4 60.2 53.8 50.0
8 google/gemma-3-4b-it 58.00 53.61 62.38 47.99 52.91
9 ytu-ce-cosmos/Turkish-Gemma-9b-v0.1 52.00 50.6 53.4 47.39 45.63
10 ytu-ce-cosmos/Turkish-Gemma-9b-T1 48.11 48.39 47.82 41.77 39.81