
AI Benchmark Engineer - Knowledge / Research (Haryana)

Haryana, India
Posted: a week ago

Description

Job Type: Full time

Role Overview
We are seeking a highly analytical, computationally proficient individual with a strong research background to join our team. You will contribute either by crafting challenging, insightful problems in your research domain or by devising elegant computational solutions.

Responsibilities:
- Build multi-agent benchmark tasks that require reading, analysing, and synthesising large document collections.
- Curate real-world research corpora (academic papers, case studies, technical reports) and design questions that require comprehensive analysis.
- Write structured ground-truth oracles (JSON) with specific, verifiable answers that prove the agent actually read the source material.
- Design LLM judge prompts that evaluate agent output field by field against the oracle.
- Create decomposition guides that split research across multiple parallel sub-agents (one per document, one per domain, then synthesis).

Offer Details:
- Duration: 12+ months
- Pay: INR 1.75-2.00 lakh per month (net/take-home)
- Number of positions: 12
- Mode of work: Fully remote
- Experience: 5+ years

Required Qualifications:
- 5+ years of research experience, academic or industry, in any scientific domain.
- Strong reading comprehension and the ability to extract structured information from unstructured text.
- Experience with JSON and data structures: designing schemas and validating output formats.
- Python scripting ability (for judge scripts and data processing).
- Experience with AI coding benchmarks (SWE-bench, Terminal-Bench).
- Comfortable with Docker: writing Dockerfiles, building images, and debugging container issues.
- Attention to detail: building oracles requires exact values, not approximations.

Strong plus:
- Experience with systematic reviews, meta-analyses, or large-scale literature surveys.
- Familiarity with medical, legal, or scientific document analysis.
- Experience with NLP or information extraction tasks.
- Knowledge of LLM evaluation and benchmarking (MMLU, GPQA, SimpleQA).
- Experience curating datasets for AI evaluation.

Additional Details:
- Commitment required: 8 hours per day, with a 4-hour overlap with PST.
- Employment type: Contractor position (note: this role does not include medical/paid leave).

Example of what you'll produce:
A task with 1,500 medical case records (500 cardiac, 500 vascular, 500 systemic). The agent must read all cases, identify the relevant ones, extract evidence, and produce a cross-domain diagnosis. The oracle requires exact first and last case IDs per file (proving the agent read each file from start to end), verbatim excerpts from specific cases (proving it read individual records), and a cross-domain evidence matrix. The decomposition uses 15 chunk-reader sub-agents, 3 domain synthesisers, and 1 final synthesiser. The oracle answer scores 1.0, a single agent scores 0.15, and the multi-agent setup scores 0.80.

Apply on Kit Job: kitjob.in/job/4lnk54
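
For candidates unfamiliar with oracle-based scoring, the field-by-field comparison described in the example task might look roughly like the Python sketch below. It is an illustration only, not the team's actual judge: the function name, the field names, and the case IDs are all hypothetical, and it uses plain exact matching where the listing describes an LLM judge.

```python
import json


def score_against_oracle(oracle: dict, answer: dict) -> float:
    """Return the fraction of oracle fields that the answer matches exactly.

    Hypothetical helper: a real judge may weight fields or use an LLM
    to compare free-text fields instead of strict equality.
    """
    if not oracle:
        return 0.0
    matched = sum(
        1
        for key, expected in oracle.items()
        if key in answer and answer[key] == expected
    )
    return matched / len(oracle)


# Hypothetical oracle: field names and case IDs are invented for illustration.
oracle = json.loads("""
{
  "cardiac_first_case_id": "C-0001",
  "cardiac_last_case_id": "C-0500",
  "vascular_first_case_id": "V-0001",
  "vascular_last_case_id": "V-0500"
}
""")

# An agent that only read the start of the cardiac file gets one field right.
partial_answer = {"cardiac_first_case_id": "C-0001"}

print(score_against_oracle(oracle, oracle))          # full match
print(score_against_oracle(oracle, partial_answer))  # one of four fields
```

Exact matching on first and last IDs is what makes such a check hard to satisfy without reading each file to the end, which is the property the example task relies on.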

Highlights

Company name

Millionlogics
Job position

AI Benchmark Engineer - Knowledge / Research (Haryana)

Ad ID:

8773440926
