AI Benchmark Engineer - Knowledge / Research (Noida)
-
Noida, India
-
Posted: a week ago
-
- Build multi-agent benchmark tasks that require reading, analysing, and synthesising large document collections
- Curate real-world research corpora — academic papers, case studies, technical reports — and design questions that require comprehensive analysis
- Write structured ground-truth oracles (JSON) with specific, verifiable answers that prove the agent actually read the source material
- Design LLM judge prompts that evaluate agent output field-by-field against the oracle
- Create decomposition guides that split research across multiple parallel sub-agents (one per document, one per domain, then synthesis)
Offer Details:
- Duration: 12 months+
- Pay: INR 1.75 to 2.00 Lakhs per month (net/take-home)
- Number of positions: 12
- Mode of work: Fully Remote
- Experience: 5+ Years
Required Qualifications:
- 5+ years of research experience — academic or industry research in any scientific domain.
- Strong reading comprehension and the ability to extract structured information from unstructured text.
- Experience with JSON/data structures: designing schemas, validating output formats, and Python scripting (for judge scripts and data processing).
- Experience with AI coding benchmarks (SWE-bench, Terminal-bench).
- Comfortable with Docker — writing Dockerfiles, building images, and debugging container issues.
- Attention to detail: building oracles requires exact values, not approximations.
Strong plus:
- Experience with systematic reviews, meta-analyses, or large-scale literature surveys.
- Familiarity with medical/legal/scientific document analysis.
- Experience with NLP or information extraction tasks.
- Knowledge of LLM evaluation and benchmarking (MMLU, GPQA, SimpleQA).
- Experience curating datasets for AI evaluation.
Additional Details:
- Commitments Required: 8 hours per day with a 4-hour overlap with PST.
- Employment Type: Contractor position (Note: this role does not include medical/paid leave).
Example of what you'll produce:
A task with 1,500 medical case records (500 cardiac, 500 vascular, 500 systemic). The agent must read all cases, identify relevant ones, extract evidence, and produce a cross-domain diagnosis. The oracle requires exact first/last case IDs per file (proves the agent read start to end), verbatim excerpts from specific cases (proves it read individual records), and a cross-domain evidence matrix.
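
To make the oracle idea concrete, here is a minimal Python sketch of a field-by-field check against a JSON oracle. Everything in it is an assumption for illustration: the case ID format, the field names (`first_case_id`, `last_case_id`) and the exact-match scoring rule are invented, and a plain equality check stands in for the LLM judge the role actually calls for. The scores it produces are not related to the figures quoted in this posting.

```python
import json

# Hypothetical oracle fragment; IDs and field names are invented for illustration.
ORACLE = json.loads("""
{
  "cardiac":  {"first_case_id": "CARD-0001", "last_case_id": "CARD-0500"},
  "vascular": {"first_case_id": "VASC-0001", "last_case_id": "VASC-0500"},
  "systemic": {"first_case_id": "SYS-0001",  "last_case_id": "SYS-0500"}
}
""")


def flatten(obj, prefix=""):
    """Turn nested dicts into {"a.b": value} pairs so fields compare one by one."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def score(oracle, answer):
    """Fraction of oracle fields the answer reproduces exactly (no partial credit)."""
    expected = flatten(oracle)
    actual = flatten(answer)
    hits = sum(1 for path, value in expected.items() if actual.get(path) == value)
    return hits / len(expected)


# An agent that only read the cardiac file gets 2 of the 6 fields right.
partial_answer = {
    "cardiac": {"first_case_id": "CARD-0001", "last_case_id": "CARD-0500"}
}
print(score(ORACLE, ORACLE))          # full marks for the oracle itself
print(score(ORACLE, partial_answer))  # 2 of 6 fields match
```

Exact matching is deliberately strict here: a near-miss on a case ID counts as a miss, which reflects the requirement for exact values rather than approximations.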
The decomposition uses 15 chunk-reader sub-agents, 3 domain synthesisers, and 1 final synthesiser. The oracle scores 1.0, a single agent scores 0.15, and the multi-agent setup scores 0.80.
Apply on Kit Job: kitjob.in/job/4lr7rk
-
Company name: Millionlogics
-
Job position: AI Benchmark Engineer - Knowledge / Research (Noida)
AI Benchmark Engineer - Knowledge / Research (Noida) has been posted in the Noida Engineering category on Locanto.