Founding Data Engineer
Location: Remote (US time zones)
Compensation: $185K–$270K + above-market equity + strong benefits
About the Company
We’re partnering with a mission-driven AI startup building a research assistant that helps scientists, clinicians, and decision-makers reason more effectively with evidence. Their platform ingests and structures millions of academic papers, clinical trials, and other high-value documents, powering language models that improve how complex questions are answered.
This is a ground-floor opportunity to define the data platform at a company pushing the frontier of AI and safe reasoning systems.
The Role
As the Founding Data Engineer, you will:
- Architect and own large-scale data pipelines ingesting heterogeneous document sources (academic papers, clinical trials, legal filings, SEC docs, and more).
- Optimize distributed data systems using Python, Spark, and Flyte.
- Implement robust deduplication, preprocessing, and embedding workflows to support real-time ML applications.
- Collaborate with AI engineers, ML researchers, and product teams to ship features directly to end users.
- Define best practices around data quality, storage, and architecture in a fast-growing environment.
What We’re Looking For
- 5+ years of experience as a Data Engineer, with ownership over production data platforms.
- Strong proficiency in Python and SQL (window functions, UDFs, partitioning, clustering).
- Hands-on expertise with Spark optimization and large-scale distributed data processing.
- Experience with columnar storage (e.g., Parquet) and data quality management.
- Track record of building systems that support user-facing products, not just BI.
Nice to Have
- Experience with document parsing (PDF, XML, HTML) or deduplication at scale.
- Familiarity with ML workflows and embedding pipelines.
- Background in academic publishing, search, or scientific data.
- Exposure to Ray, Dask, or other distributed frameworks.