Abaka AI
About Abaka AI
Abaka AI is built on one mission: to be the world’s most trusted data partner for AI companies. More than 1,000 industry leaders across Generative AI, Embodied AI, and Automotive AI rely on us to power their data pipelines. With our headquarters in Silicon Valley—and teams in Paris, Singapore, and Tokyo—we support global partners with fast, reliable, and scalable data solutions.
Our offerings include a diverse catalog of off-the-shelf datasets (image, video, multimodal, reasoning, 3D, and beyond) as well as comprehensive data collection and annotation services. Whether teams need raw data, curated datasets, or full-cycle data engineering, Abaka AI provides the foundation for building high-performance AI systems.
About The Role
We’re hiring a Data Engineer (Web Data) in the United States, focused on web crawling. This is a foundational role that will shape how Abaka AI acquires high-quality, web-scale data to power multimodal AI systems. You’ll design, build, and maintain robust crawling infrastructure that supports large-scale data collection across diverse domains and formats.
This role blends low-level system design with real-world operational problem-solving. You’ll work closely with data engineering and research teams to define crawling targets, build architectures resilient to anti-bot measures, manage proxies, and transform raw web content into structured datasets optimized for AI training and evaluation.
As an early technical hire, you’ll play a key role in setting standards for reliability, scalability, and data quality across our web data pipelines. If you're excited about building distributed systems, solving complex scraping challenges, and enabling the next generation of frontier AI models, this role offers the opportunity to make a lasting impact.
Responsibilities
Collaborate closely with clients to understand their data requirements, and coordinate internal teams on tailored delivery plans that ensure on-time, high-quality delivery and meet expectations for format, precision, and volume.
Lead the development of mid- to long-term plans for the data engineering function. Build scalable, end-to-end pipelines for multimodal data (text, image, audio, video, 3D point cloud, etc.), covering data sourcing, cleaning, annotation, QA, storage, and iterative optimization for training, fine-tuning, and evaluation.
Develop solutions to core technical challenges in multimodal data processing, such as cross-modal alignment (for example, image-text semantic matching), large-scale data cleaning (deduplication, denoising, format normalization), annotation efficiency, and data encryption and security.
Work cross-functionally with algorithm, product, and business teams by providing feedback to model teams on data bottlenecks, helping refine internal tools and services, and supporting client-facing teams with technical documentation and pre-sales materials.
Evaluate and optimize the cost structure of data processing operations, including headcount, infrastructure, and tooling, to balance quality, efficiency, and scalability.
Qualifications
Strong background in computer science, data engineering, artificial intelligence, or related fields, with hands‑on experience working with large-scale data systems.
3+ years of experience in data engineering or data operations. Leadership experience is highly valued, and prior involvement in LLM or multimodal dataset preparation is a strong plus.
Must‑have technical skills: Strong Python proficiency; HTML/DOM parsing (lxml, XPath); HTTP internals; advanced Scrapy; async crawling (aiohttp/asyncio); Playwright/Selenium; familiarity with browser internals.
Deep understanding of end-to-end multimodal data workflows, with practical experience in at least two modalities, such as text, images, audio, or video.
Proficiency in designing technical architectures for large-scale data pipelines, including distributed processing and automation frameworks. Familiarity with data privacy and security best practices such as access control and data anonymization.
Strong execution and team management skills, with the ability to translate high-level objectives into actionable plans and drive team results.
Excellent communication and cross-functional collaboration skills, with the ability to clearly communicate technical and operational requirements, resolve conflicts, and manage stakeholder expectations.
High sense of ownership and resilience, with comfort operating in a fast‑paced, evolving AI environment and the ability to navigate urgent delivery timelines.
Compensation & Benefits The base salary range for this position is $175,000 - $250,000 USD annually.
Compensation may vary outside of this range depending on a number of factors, including a candidate’s qualifications, skills, competencies, and experience. Base pay is one part of the total package provided to compensate and recognize employees for their work at Abaka AI. This role is eligible for equity, as well as a comprehensive benefits package (health, dental, vision, PTO, flexible work schedule).
Seniority level: Mid‑Senior level
Employment type: Full‑time
Job function: Information Technology
Industries: IT Services and IT Consulting