HireTalent
Role Overview
We are seeking a Senior Data Scientist to build and deploy LLM-based capabilities for working with large, diverse datasets and documents relevant to growth analytics & bid strategy. This role emphasizes ingestion, document processing, information extraction, and retrieval methods to support analytics use cases in production. Experience with modern LLM tooling and Databricks is required; hands‑on experience with advanced reasoning models & agentic/orchestration frameworks is a plus.
Key Responsibilities
Architect, build, and refine retrieval-grounded LLM systems, including basic and advanced RAG patterns, to deliver grounded, verifiable answers and insights.
Design robust pipelines for ingestion, transformation, and normalization of public and internal data, including ETL, incremental processing, and data quality checks.
Build and maintain document processing workflows across PDFs, HTML, and scanned content, including OCR, layout‑aware parsing, table extraction, metadata enrichment, and document versioning.
Develop information extraction pipelines using LLM methods and best practices, including schema design, structured outputs, validation, error handling, and accuracy evaluation.
Own the retrieval stack end‑to‑end, including chunking strategies, embeddings, indexing, hybrid retrieval, reranking, filtering, and relevance tuning across a vector database or search platform.
Implement web data acquisition where needed, including scraping, change detection, source quality checks, and operational safeguards like retries and rate limiting.
Establish evaluation and monitoring practices for retrieval and extraction quality, including golden datasets, regression testing, groundedness checks, and production observability.
Collaborate with subject matter experts to translate business needs into practical retrieval and extraction workflows and measurable success criteria.
Communicate complex findings, tradeoffs, and recommendations to technical and business stakeholders, supporting data‑driven forecasting and strategy.
Ensure compliance with data governance and security standards when handling sensitive data and deploying systems to production environments.
Qualifications
Advanced degree in Computer Science, Data Science, Statistics, Engineering, or a related quantitative field.
Minimum of 4 years of experience in data science or applied ML/NLP, with a focus on NLP and GenAI.
Proficiency in Python and SQL, with strong engineering practices for maintainable, testable pipelines.
Strong experience with Databricks for data processing and pipeline development, including Spark and common lakehouse patterns.
Demonstrated experience building retrieval-grounded LLM systems and/or LLM-based information extraction for real‑world use cases.
Experience with document ingestion and parsing, including OCR and handling messy, semi‑structured content such as PDFs, tables, forms, and web pages.
Familiarity with vector databases and retrieval concepts, including indexing, embeddings, hybrid retrieval, reranking, and performance and cost tuning.
Strong understanding of best practices for reasoning models and techniques that improve reliability and reduce hallucinations, including grounding and attribution.
Excellent communication skills, with a track record of partnering with stakeholders and turning ambiguous requests into adopted solutions.
Libraries and Tools
Proficiency with LLM and orchestration libraries such as: openai, google-genai, langgraph, langchain.
Experience with supporting tooling commonly used in production LLM systems, for example: pydantic for schema validation, tenacity for retries, beautifulsoup4 for HTML data extraction, and standard Python data tooling such as pandas and numpy.
Experience with retrieval and vector tooling, such as: FAISS, Elasticsearch or OpenSearch, and vector database platforms (for example Pinecone, Weaviate, Milvus, Chroma).
Preferred Qualifications
Exposure to agentic patterns and tool‑calling for workflow automation.
Experience working in regulated environments and implementing governance controls such as access control, auditability, and retention.
All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, or protected veteran status.