IFEval-TR Benchmark

About the Benchmark

⚠️ Disclaimer: This is ongoing academic research — the methodology and results are subject to change, and new models will be added over time.

IFEval-TR is a Turkish-language adaptation of Google Research's IFEval (Instruction Following Evaluation) benchmark. It measures how well an LLM follows verifiable instructions in Turkish — constraints such as "respond in at most 200 words", "include the keyword X", "use exactly 3 bullet points", or "write the entire response in lowercase". Every constraint is checked programmatically by a deterministic verifier; no LLM judge is involved in scoring.
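To illustrate what "checked programmatically" can look like, here is a minimal sketch of two deterministic checks. The function names, the bullet-marker convention, and the (function, argument) representation of a prompt's constraints are assumptions made for this example, not the benchmark's actual code.

```python
import re

def check_max_words(response: str, limit: int) -> bool:
    """True if the response has at most `limit` whitespace-separated words."""
    return len(response.split()) <= limit

def check_bullet_count(response: str, expected: int) -> bool:
    """True if exactly `expected` lines start with a '*' or '-' bullet marker."""
    bullets = [ln for ln in response.splitlines()
               if re.match(r"^\s*[*-]\s+\S", ln)]
    return len(bullets) == expected

def verify(response: str, checks) -> bool:
    """A response passes only if every (function, argument) check passes."""
    return all(fn(response, arg) for fn, arg in checks)

sample = "Özet:\n- bir\n- iki\n- üç"
passed = verify(sample, [(check_max_words, 200), (check_bullet_count, 3)])
```

Because each check is a pure function of the response text, running it twice on the same response gives the same verdict, which is the property that lets scoring work without an LLM judge.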

Datasets (910 verified examples total):

Translated IFEval — 498 examples directly translated to Turkish from the original 541-example English IFEval set, with language-dependent parameters (keywords, end phrases, forbidden words) also translated.
Turkish-Specific IFEval — 412 examples generated from scratch around 50 Turkish cultural and historical topics (Ottoman history, Mevlana, Cappadocia, Turkish cuisine, etc.), each combining 1-3 randomly chosen constraints.

All examples were filtered through a multi-layer pipeline (rule-based logic, a Z3 SMT solver, a verifier-in-the-loop synthesizer, and an LLM committee) to remove any prompt whose constraints cannot be jointly satisfied. Each remaining example carries a deterministic certificate that its constraints are satisfiable together.
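As a rough illustration of the rule-based layer only, the sketch below flags two obvious contradictions between constraints. The dictionary schema and the `kind` names are hypothetical, and the Z3, synthesizer, and LLM-committee layers are not represented here.

```python
def obviously_unsatisfiable(constraints: list[dict]) -> bool:
    """Detect simple contradictions in a hypothetical constraint list."""
    required, forbidden = set(), set()
    min_words, max_words = 0, None
    for c in constraints:
        kind = c["kind"]
        if kind == "include_keyword":
            required.add(c["word"])
        elif kind == "forbid_word":
            forbidden.add(c["word"])
        elif kind == "min_words":
            min_words = max(min_words, c["n"])
        elif kind == "max_words":
            max_words = c["n"] if max_words is None else min(max_words, c["n"])
    # A word cannot be both required and forbidden.
    if required & forbidden:
        return True
    # The word-count window must not be empty.
    if max_words is not None and min_words > max_words:
        return True
    return False

conflicting = [{"kind": "min_words", "n": 300}, {"kind": "max_words", "n": 200}]
compatible = [{"kind": "include_keyword", "word": "barış"},
              {"kind": "max_words", "n": 200}]
```

Checks like these catch only contradictions that can be read directly off the constraint parameters. Subtler interactions are presumably why the pipeline adds further layers.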

Methodology: Each model receives the Turkish prompt as-is and produces a single response. The response is then verified against all constraints in the prompt under two scoring tiers:

Strict — exact string matching, no morphological relaxation.
Loose — Turkish morphology-aware via zemberek-python (lemmatization, Turkish-aware case handling for İ/I and ı/i, flexible format detection).
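The İ/I and ı/i handling matters because Python's built-in `str.lower()` and `str.upper()` follow the language-neutral Unicode mappings, which are wrong for Turkish dotted and dotless i. The following is a minimal sketch of Turkish-aware case conversion. `turkish_lower` and `turkish_upper` are hypothetical helper names, not functions from zemberek-python or from the benchmark code.

```python
def turkish_lower(text: str) -> str:
    """Lowercase using Turkish rules: İ -> i and I -> ı.

    Assumes the whole text is Turkish, so every 'I' is mapped to 'ı'.
    """
    return text.replace("İ", "i").replace("I", "ı").lower()

def turkish_upper(text: str) -> str:
    """Uppercase using Turkish rules: i -> İ and ı -> I."""
    return text.replace("i", "İ").replace("ı", "I").upper()

# The default mapping turns "I" into "i", which is a different letter in Turkish.
default_result = "IŞIK".lower()          # "işik"
turkish_result = turkish_lower("IŞIK")   # "ışık"
```

A case check built on the default mappings would misjudge a correctly cased Turkish response, for example one that writes "İSTANBUL" for "istanbul".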

Headline score: Because Turkish is agglutinative, a required word often appears in an inflected form, and strict matching unfairly penalizes such morphologically equivalent responses. Average (loose), the unweighted mean of prompt-level loose accuracy on the Translated and Turkish-Specific datasets, is therefore the primary metric and the default leaderboard sort key. Strict scores are shown alongside for transparency.
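The headline metric can be sketched as follows. The function names are illustrative. The sketch assumes the usual IFEval meaning of prompt-level accuracy, where a prompt counts as passed only when all of its constraints are satisfied.

```python
def prompt_level_accuracy(results: list[list[bool]]) -> float:
    """Percentage of prompts whose constraints ALL passed.

    `results` holds one list of per-constraint verdicts per prompt.
    """
    passed = sum(all(verdicts) for verdicts in results)
    return 100.0 * passed / len(results)

def average_loose(translated_loose: float, turkish_loose: float) -> float:
    """Unweighted mean of the two dataset-level loose accuracies."""
    return (translated_loose + turkish_loose) / 2

# Toy verdicts for four prompts: two pass every constraint.
toy_accuracy = prompt_level_accuracy([[True, True], [True, False], [True], [False]])

# Leaderboard values for openai/gpt-oss-120b: 66.73 and 72.75.
headline = average_loose(66.73, 72.75)
```

The mean is taken over the two datasets rather than over all 910 prompts, so each dataset contributes equally even though they differ in size (498 and 412 examples).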

Constraint types (25 total): length (words/sentences/paragraphs), keyword inclusion / forbidden words / letter and keyword frequency, formatting (bullets, JSON, titles, sections, highlights, placeholders), casing (Turkish uppercase/lowercase, capital-word count), punctuation (no-comma), combination (two responses, repeat prompt), and start/end markers (quotation, end phrase, postscript).

Original IFEval: Zhou et al., 2023 (arXiv:2311.07911). Turkish morphology: Zemberek (Akın & Akın, 2007). Code & data: github.com/atahanuz/Turkish-IFEval.

Prepared by Atahan Uz.

Examples

Translated — from the original English IFEval:

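2000 yılında dünyanın en yüksek gökdeleni hakkında yaz. "gökdelen" kelimesini en az 8 kez kullan. "Başka yardımcı olabileceğim bir şey var mı?" ile bitir.
(English: Write about the world's tallest skyscraper in the year 2000. Use the word "skyscraper" at least 8 times. End with "Is there anything else I can help with?")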

Verifier checks: keywords:frequency (keyword="gökdelen", ≥ 8 occurrences), startend:end_checker (response ends with "Başka yardımcı olabileceğim bir şey var mı?").
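A minimal sketch of how these two checks might be implemented in the strict tier. Whether the benchmark counts whole words or substrings, and how it treats case, is not stated here, so this example assumes case-sensitive whole-word matching. The function names are hypothetical.

```python
import re

def keyword_count(response: str, keyword: str) -> int:
    """Count case-sensitive whole-word occurrences of `keyword`."""
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return len(re.findall(pattern, response))

def ends_with_phrase(response: str, phrase: str) -> bool:
    """True if the response, ignoring trailing whitespace, ends with `phrase`."""
    return response.rstrip().endswith(phrase)

end_phrase = "Başka yardımcı olabileceğim bir şey var mı?"
response = "gökdelen " * 8 + end_phrase

frequency_ok = keyword_count(response, "gökdelen") >= 8
ending_ok = ends_with_phrase(response, end_phrase)
```

Under this exact-match assumption an inflected form such as "gökdelenler" is not counted. That is the kind of case the loose tier is meant to accept.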

Created for Turkish — native Turkish prompt:

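Ramazan geleneklerini 5 paragrafta anlat. 3. paragraf "Özellikle" ile başlasın. "barış" kelimesi tam 3 kez geçsin.
(English: Describe Ramadan traditions in 5 paragraphs. Paragraph 3 should begin with "Özellikle" ("Especially"). The word "barış" ("peace") should appear exactly 3 times.)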

Verifier checks: length_constraints:nth_paragraph_first_word (5 paragraphs, paragraph 3 starts with "Özellikle"), keywords:frequency (keyword="barış", exactly 3 occurrences).
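The paragraph check can be sketched as below. The sketch assumes paragraphs are separated by blank lines and that punctuation attached to the first word is ignored. Both are assumptions for illustration, as are the function names.

```python
import re

def split_paragraphs(response: str) -> list[str]:
    """Split on blank lines and drop empty paragraphs."""
    parts = re.split(r"\n\s*\n", response.strip())
    return [p.strip() for p in parts if p.strip()]

def check_nth_paragraph_first_word(response: str, num_paragraphs: int,
                                   nth: int, first_word: str) -> bool:
    """True if there are exactly `num_paragraphs` paragraphs and the
    `nth` one (1-indexed) starts with `first_word`."""
    paragraphs = split_paragraphs(response)
    if len(paragraphs) != num_paragraphs:
        return False
    words = paragraphs[nth - 1].split()
    return bool(words) and words[0].strip(".,;:!?\"'") == first_word

good = "\n\n".join(["Bir.", "İki.", "Özellikle barış önemlidir.", "Dört.", "Beş."])
bad = "\n\n".join(["Bir.", "İki.", "Ayrıca barış önemlidir.", "Dört.", "Beş."])

good_ok = check_nth_paragraph_first_word(good, 5, 3, "Özellikle")
bad_ok = check_nth_paragraph_first_word(bad, 5, 3, "Özellikle")
```

Checking the paragraph count before indexing avoids an out-of-range lookup and makes a response with the wrong number of paragraphs fail outright.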

Leaderboard

Rank	Model	Average (loose)	Translated (loose)	Turkish-Specific (loose)	Translated (strict)	Turkish-Specific (strict)
1	Qwen/Qwen3.5-27B-FP8	79.11	69.94	88.27	69.1	80.0
2	Qwen/Qwen3.5-9B	75.83	66.55	85.11	64.79	76.6
3	openai/gpt-oss-120b	69.74	66.73	72.75	62.5	63.26
4	google/gemma-3-27b-it	65.56	60.24	70.87	56.22	61.41
5	google/gemma-3-12b-it	64.32	60.44	68.2	56.43	59.95
6	openai/gpt-oss-20b	62.70	61.8	63.59	58.37	54.37
7	gpt-4o-mini	58.30	56.4	60.2	53.8	50.0
8	google/gemma-3-4b-it	58.00	53.61	62.38	47.99	52.91
9	ytu-ce-cosmos/Turkish-Gemma-9b-v0.1	52.00	50.6	53.4	47.39	45.63
10	ytu-ce-cosmos/Turkish-Gemma-9b-T1	48.11	48.39	47.82	41.77	39.81