RemoteJobs.org

    G2i

    Senior Software Engineer - AI Interaction Evaluator (Codex / Claude Code, up to

    Contract · Verified Remote
    Remote · USD 208,000 – 416,000 · Programming · Posted today

    About this role

    Senior AI Interaction Evaluator (Codex / Claude Code)

    Contract | $100–$200/hour | 10–20 hrs/week | Start ASAP (through early May)

    Check out this Loom video for more details!

    We’re looking for a highly experienced software engineer (senior or above) to help evaluate the quality of interactions with modern coding agents such as OpenAI Codex and Claude Code.

    This is not a traditional engineering role.

    You won’t be writing production code.

    You’ll be evaluating something harder: whether the model thinks like a great engineer.

    What This Role Actually Is

    You will assess how AI coding agents behave in real-world scenarios — focusing on:

    • Whether the response makes sense

    • Whether the preamble and reasoning are useful

    • Whether the output reflects strong engineering judgment

    • Whether the interaction feels right to an experienced developer

    This role is about engineering taste — not syntax correctness.

    What You’ll Be Doing

    • Evaluate AI-generated coding interactions end-to-end

    • Judge whether outputs are:

        • Useful

        • Correct (at a high level)

        • Aligned with how a strong engineer would think

    • Assess the quality of explanations and reasoning, not just code

    • Distinguish between different levels of response quality (e.g. what makes a response a 2 versus a 4)

    • Provide clear, opinionated feedback on:

        • What worked

        • What didn’t

        • What felt “off” or misleading

    • Help define what great looks like when interacting with tools like Cursor

    What We Mean by “Taste”

    We’re specifically looking for engineers who can answer questions like:

    • Does this feel like something a strong engineer would actually say?

    • Is this explanation helpful, or just technically correct?

    • Is the model guiding the user well, or just dumping output?

    • Would this interaction build or erode trust?

    You should be comfortable making subjective but rigorous judgments.

    Who You Are

    • Staff / Principal-level engineer (or equivalent experience)

    • Strong background in at least one of the following:

        • TypeScript / JavaScript

        • Python

    • Hands-on experience using:

        • OpenAI Codex

        • Claude Code

        • Cursor

    • Deep familiarity with modern AI-assisted dev workflows

    • Able to evaluate code without needing to fully execute or deeply review every line

    • Comfortable giving direct, opinionated feedback

    • High bar for what “good engineering” looks like

    Nice to Have

    • Experience with tools like Cursor or similar AI-first IDEs

    • Prior exposure to prompt design or evaluation workflows

    • Experience mentoring senior engineers or defining engineering standards

    Engagement Details

    • Rate: $100–$200/hour

    • Hours: ~10–20 hours/week

    • Duration: Through early May (with possible extension)

    • Start: ASAP

    • Process:

        • Take-home evaluation exercise

        • One behavioral interview

    About G2i

