
Founding Data Engineer

  • Permanent
  • $250K + equity
  • Remote, United States
  • Data Engineering

Location: Remote (US time zones)
Compensation: $185K–$270K + above-market equity + strong benefits

About the Company

We’re partnering with a mission-driven AI startup building a research assistant that helps scientists, clinicians, and decision-makers reason more effectively with evidence. Their platform ingests and structures millions of academic papers, clinical trials, and other high-value documents, powering language models that improve how complex questions are answered.

This is a ground-floor opportunity to define the data platform at a company pushing the frontier of AI and safe reasoning systems.

The Role

As the Founding Data Engineer, you will:

  • Architect and own large-scale data pipelines ingesting heterogeneous document sources (academic papers, clinical trials, legal filings, SEC docs, and more).
  • Optimize distributed data systems using Python, Spark, and Flyte.
  • Implement robust deduplication, preprocessing, and embedding workflows to support real-time ML applications.
  • Collaborate with AI engineers, ML researchers, and product teams to ship features directly to end users.
  • Define best practices around data quality, storage, and architecture in a fast-growing environment.

What We’re Looking For

  • 5+ years of experience as a Data Engineer, with ownership over production data platforms.
  • Strong proficiency in Python and SQL (window functions, UDFs, partitioning, clustering).
  • Hands-on expertise with Spark optimization and large-scale distributed data processing.
  • Experience with columnar storage (e.g., Parquet) and data quality management.
  • Track record of building systems that support user-facing products rather than just BI.

Nice to have:

  • Experience with document parsing (PDF, XML, HTML) or deduplication at scale.
  • Familiarity with ML workflows and embedding pipelines.
  • Background in academic publishing, search, or scientific data.
  • Exposure to Ray, Dask, or other distributed frameworks.

 
