This position is listed on behalf of a partner company, which manages all applications and next steps. Our partner is looking for a Senior Software Engineer, Data Processing, based in the United States.
This is a high-impact engineering role focused on building the core data processing infrastructure that powers large-scale AI training data systems. You will be responsible for transforming raw, multimodal, high-volume datasets into structured, validated, and AI-ready outputs that can be reliably used downstream. The work sits at the heart of the platform, where ingestion quality directly determines the value of the entire ecosystem. You will design and operate distributed data pipelines handling complex formats such as imaging, audio, video, and text across diverse industries. This role requires strong systems thinking, deep backend and data engineering expertise, and comfort working with ambiguity at scale. You will collaborate closely with product and partner teams in a fast-paced, high-ownership environment where speed, precision, and reliability are critical.
Accountabilities:
- Design, build, and operate scalable data ingestion and processing systems that transform raw multimodal data into structured, validated, AI-ready datasets.
- Own end-to-end data pipelines, including ingestion, validation, transformation, tracking, and downstream delivery at scale.
- Develop modality-specific processing logic for complex data types such as medical imaging, audio, video, and unstructured text data.
- Build reusable parsers, validators, normalization workflows, and internal tooling to standardize and industrialize data processing.
- Optimize distributed data systems for performance, reliability, throughput, and cost efficiency across large-scale workloads.
- Diagnose bottlenecks and ensure system robustness as data volume, complexity, and modalities continue to grow.
- Implement strong data quality, security, and compliance mechanisms, including handling sensitive or regulated data (e.g., PHI) with appropriate safeguards.
- Collaborate cross-functionally with Product, Data, and Partner Engineering teams to support new data modalities and evolving requirements.
Requirements:
- 5+ years of experience building and operating production-grade backend or data processing systems at scale.
- Strong experience designing and maintaining large-scale data pipelines in high-volume, distributed environments.
- Proficiency in Python for backend and data engineering development.
- Hands-on experience with distributed data processing systems and cloud infrastructure, particularly AWS.
- Strong ability to work with messy, high-variance, multimodal datasets and extract structure from ambiguity.
- Experience with system design, performance optimization, and building reliable, production-critical infrastructure.
- Strong attention to detail combined with a bias for action and delivery in fast-moving environments.
- Excellent problem-solving skills with a proactive, ownership-driven mindset.
- Nice to have: experience with modalities such as medical imaging (e.g., DICOM), audio, video, or large-scale text systems.
- Nice to have: exposure to regulated or sensitive data environments (e.g., healthcare, HIPAA, PHI).
- Nice to have: familiarity with orchestration tools (Airflow, Dagster), streaming systems, or multi-cloud environments (GCP, Azure).
- Nice to have: experience with ML/AI systems such as embeddings, NLP, or LLM-based workflows.
Benefits:
- Competitive compensation aligned with experience and market benchmarks.
- Opportunity to work on cutting-edge AI infrastructure at massive scale.
- High ownership role with direct impact on core product systems and data quality.
- Fast-paced, high-trust engineering culture focused on autonomy and execution speed.
- Exposure to multimodal data challenges across diverse industries including healthcare and media.
- Strong technical growth opportunities in distributed systems and AI data infrastructure.
- Collaborative environment working closely with expert engineers and cross-functional partners.
- Opportunity to help define foundational systems for next-generation AI training data platforms.