
AI Benchmark Engineer - Knowledge / Research (Baddi)

Baddi, India
Posted: a week ago

Description

Role Overview

We are seeking a highly analytical, computationally proficient individual with a strong research background to join our team. In this role you will either craft challenging, insightful problems in your research domain or devise elegant computational solutions.

Responsibilities:
- Build multi-agent benchmark tasks that require reading, analysing, and synthesising large document collections
- Curate real-world research corpora — academic papers, case studies, technical reports — and design questions that require comprehensive analysis
- Write structured ground-truth oracles (JSON) with specific, verifiable answers that prove the agent actually read the source material
- Design LLM judge prompts that evaluate agent output field-by-field against the oracle
- Create decomposition guides that split research across multiple parallel sub-agents (one per document, one per domain, then synthesis)

Offer Details:
- Duration: 12 months+
- Pay: INR 1.75 - 2.00 Lakhs per month (net/take-home)
- Number of positions: 12
- Mode of work: Fully Remote
- Experience: 5+ Years

Required Qualifications:
- 5+ years of research experience — academic or industry research in any scientific domain.
- Robust reading comprehension and ability to extract structured information from unstructured text.
- Experience with JSON/data structures — designing schemas, validating output formats, Python scripting ability (for judge scripts and data processing).
- Experience with AI coding benchmarks (SWE-bench, Terminal-bench).
- Comfortable with Docker — writing Dockerfiles, building images, and debugging container issues.
- Attention to detail: building oracles requires exact values, not approximations.

Strong plus:
- Experience with systematic reviews, meta-analyses, or large-scale literature surveys.
- Familiarity with medical/legal/scientific document analysis.
- Experience with NLP or information extraction tasks.
- Knowledge of LLM evaluation and benchmarking (MMLU, GPQA, SimpleQA).
- Experience curating datasets for AI evaluation.

Additional Details:
- Commitments Required: 8 hours per day with a 4-hour overlap with PST.
- Employment Type: Contractor position (Note: this role does not include medical/paid leave).

Example of what you'll produce:

A task with 1500 medical case records (500 cardiac, 500 vascular, 500 systemic). The agent must read all cases, identify relevant ones, extract evidence, and produce a cross-domain diagnosis. The oracle requires exact first/last case IDs per file (proves the agent read start to end), verbatim excerpts from specific cases (proves it read individual records), and a cross-domain evidence matrix. The decomposition uses 15 chunk-reader sub-agents, 3 domain synthesisers, and 1 final synthesiser. Oracle scores 1.0, single-agent scores 0.15, and multi-agent scores 0.80.

Apply on Kit Job: kitjob.in/job/4lawdq
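
As a rough illustration of the oracle idea in the example task, the sketch below scores an agent's JSON answer field by field against a ground-truth oracle. The field names, the case ID format, and the exact-match rule are assumptions made for illustration, not the team's actual schema or judge. The posting describes an LLM judge; this deterministic comparison is only a simplified stand-in.

```python
import json

# Hypothetical oracle layout; field names and values are illustrative only.
ORACLE = json.loads("""
{
  "cardiac":  {"first_case_id": "C-0001", "last_case_id": "C-0500"},
  "vascular": {"first_case_id": "V-0001", "last_case_id": "V-0500"},
  "systemic": {"first_case_id": "S-0001", "last_case_id": "S-0500"}
}
""")

_MISSING = object()  # sentinel so an absent field never matches a real value


def flatten(obj, prefix=""):
    """Turn nested dicts into {"a.b": value} pairs for one-by-one comparison."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def score(agent_output, oracle):
    """Fraction of oracle fields the agent reproduced exactly (no partial credit)."""
    expected = flatten(oracle)
    actual = flatten(agent_output)
    hits = sum(
        1 for path, value in expected.items()
        if actual.get(path, _MISSING) == value
    )
    return hits / len(expected)


# An agent that only read the cardiac file gets credit for 2 of the 6 fields.
partial_output = {"cardiac": {"first_case_id": "C-0001", "last_case_id": "C-0500"}}
partial_score = score(partial_output, ORACLE)
```

Exact matching is used here because the role calls for exact values rather than approximations; a real judge would likely also need rules for verbatim excerpts and the evidence matrix.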

Highlights

Company name

Millionlogics
Ad ID:

8764556297
