Inflection AI is a public benefit corporation leveraging our world-class large language model to build the first AI platform focused on the needs of the enterprise.

Who we are:

Inflection AI was re-founded in March 2024, and our leadership has assembled a team of kind, innovative, and collaborative individuals focused on building enterprise AI solutions. We are passionate about what we are building, enjoy working together, and strive to hire people with diverse backgrounds and experience.

Our first product, Pi, is an empathetic, conversational chatbot. Pi is a public instance built on our 350B+ frontier model using our sophisticated fine-tuning (10M+ examples), inference, and orchestration platform. We are now applying this same approach to build new systems that directly support the needs of enterprise customers.

Want to work with us? Have questions? Learn more below.

About the Role

As an Inference Engineer, you will own the real-time performance, scalability, and reliability of our LLM-powered systems. You’ll optimize every layer—from GPU kernels to orchestration frameworks—to deliver sub-second latency, high throughput, and enterprise-grade uptime. Your work will also enable advanced capabilities such as tool usage, agentic flows, retrieval-augmented generation (RAG), and long-term memory.

This is a good role for you if you:

  • Have direct experience deploying and optimizing large transformer models for real-time inference across multi-GPU or multi-node environments
  • Are skilled with tools like Triton, TensorRT, TVM, ONNX Runtime, or custom CUDA kernels, and know when to reach for C++/Rust for critical performance gains
  • Understand the balance between latency, throughput, accuracy, and cost, and make smart choices around quantization, speculative decoding, and caching (see the sketch after this list)
  • Have developed or integrated agent-based orchestration systems, RAG pipelines, or memory architectures in production environments
  • Automate at every layer: CI/CD for model artifacts, load testing, canary rollouts, and auto-scaling
  • Communicate clearly with both infrastructure teams and product stakeholders
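
As a small illustration of these trade-offs, here is a minimal serving sketch using vLLM's offline Python API. The model name and every parameter value below are placeholder assumptions rather than Inflection's actual configuration; in practice, quantization, speculative decoding, and KV-cache policy would also be weighed against a concrete latency budget.

    from vllm import LLM, SamplingParams

    # Placeholder model and knob values -- illustrative assumptions only.
    llm = LLM(
        model="meta-llama/Llama-3.1-8B-Instruct",  # assumed model name
        tensor_parallel_size=2,       # shard weights across 2 GPUs (latency knob)
        gpu_memory_utilization=0.90,  # leave headroom for the KV cache
        max_num_seqs=64,              # cap concurrent sequences (throughput knob)
    )

    params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=256)
    outputs = llm.generate(["Summarize the incident report."], params)
    print(outputs[0].outputs[0].text)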

Responsibilities include:

  • Design and optimize high-performance inference pipelines using PyTorch, vLLM, Triton, TensorRT, and FSDP/DeepSpeed
  • Integrate agentic runtimes—tool calling, function execution, and multi-step planning—while meeting strict latency requirements
  • Build robust retrieval-augmented generation (RAG) stacks combining vector search, caching, and real-time context packing (a simplified packing sketch follows this list)
  • Develop memory services to support conversation continuity and user personalization at scale
  • Monitor, instrument, and autotune GPU performance, kernel fusion, and batching strategies across clusters of NVIDIA H100 and Intel Gaudi accelerators
  • Partner with training, safety, and product teams to transform research into stable, production-grade systems
  • Contribute upstream to open-source performance libraries and share insights with the community
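
The context-packing responsibility lends itself to a small illustration. Below is a simplified, hypothetical sketch of greedy packing under a token budget; Passage and pack_context are invented names, and a production stack would also handle deduplication, recency ordering, and prompt caching.

    from dataclasses import dataclass

    @dataclass
    class Passage:
        text: str
        score: float      # retrieval relevance score
        token_count: int  # precomputed token length

    def pack_context(passages: list[Passage], budget: int) -> str:
        """Greedily pack the highest-scoring passages into the prompt
        until the token budget is spent."""
        chosen, used = [], 0
        for p in sorted(passages, key=lambda p: p.score, reverse=True):
            if used + p.token_count > budget:
                continue  # would overflow the budget; try a smaller passage
            chosen.append(p)
            used += p.token_count
        return "\n\n".join(p.text for p in chosen)

A production version would typically sit behind a cache keyed on the query, so repeated questions can skip retrieval entirely.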

Employee Pay Disclosures

At Inflection AI, we aim to attract and retain the best employees and compensate them in a way that appropriately and fairly values their individual contributions to the company. For this role, Inflection AI estimates that the starting annual base salary will fall in the range of approximately $175,000 - $350,000. Because the estimate depends on individual factors such as experience, the actual starting annual base salary may be above or below this range.

Interview Process

Apply: Please apply on LinkedIn or our website for a specific role.

After speaking with one of our recruiters, you’ll enter our structured interview process, which includes the following stages:

  1. Hiring Manager Conversation – An initial discussion with the hiring manager to assess fit and alignment.
  2. Technical Interview – A deep dive with an Inflection Engineer to evaluate your technical expertise.
  3. Onsite Interview – A comprehensive assessment, including:
    • A domain-specific interview
    • A system design interview
    • A final conversation with the hiring manager

Depending on the role, we may also ask you to complete a take-home exercise or deliver a presentation.

For non-technical roles, be prepared for a role-specific interview, such as a portfolio review.

Decision Timeline

We aim to provide feedback within one week of your final interview.

Location

Palo Alto, CA

Job Type
Full Time
