Founding Data Engineer

Location: Remote (US time zones)
Compensation: $185K–$270K + above-market equity + strong benefits

About the Company

We’re partnering with a mission-driven AI startup building a research assistant that helps scientists, clinicians, and decision-makers reason more effectively with evidence. Their platform ingests and structures millions of academic papers, clinical trials, and other high-value documents, powering language models that improve how complex questions are answered.

This is a ground-floor opportunity to define the data platform at a company pushing the frontier of AI and safe reasoning systems.

The Role

As the Founding Data Engineer, you will:

Architect and own large-scale data pipelines ingesting heterogeneous document sources (academic papers, clinical trials, legal filings, SEC docs, and more).
Optimize distributed data systems using Python, Spark, and Flyte.
Implement robust deduplication, preprocessing, and embedding workflows to support real-time ML applications.
Collaborate with AI engineers, ML researchers, and product teams to ship features directly to end users.
Define best practices around data quality, storage, and architecture in a fast-growing environment.

What We’re Looking For

5+ years of experience as a Data Engineer, with ownership over production data platforms.
Strong proficiency in Python and SQL (window functions, UDFs, partitioning, clustering).
Hands-on expertise with Spark optimization and large-scale distributed data processing.
Experience with columnar storage (e.g., Parquet) and data quality management.
Track record of building systems that support user-facing products, not only internal BI and reporting.

Nice to have:

Experience with document parsing (PDF, XML, HTML) or deduplication at scale.
Familiarity with ML workflows and embedding pipelines.
Background in academic publishing, search, or scientific data.
Exposure to Ray, Dask, or other distributed frameworks.
