ByteDance
Student Researcher [Seed Vision – Multimodal Joint Modeling] – 2026 Start (PhD)
ByteDance, San Jose, California, United States, 95199
Overview
This is a Student Researcher [Seed Vision – Multimodal Joint Modeling] – 2026 Start (PhD) internship at ByteDance. The Seed Vision Team focuses on foundational models for visual generation, developing multimodal generative models, and conducting research and application development to solve fundamental computer vision challenges in GenAI. PhD internships give students the opportunity to contribute to products and research, and to the organization's future plans and emerging technologies.
Responsibilities
Conduct research on joint training of vision, language, and video models under a unified architecture.
Develop scalable and efficient methods for autoregressive-style multimodal pretraining, supporting both understanding and generation.
Explore cross-modal tokenization, alignment, and shared representation strategies.
Investigate instruction tuning, captioning, and open-ended generation capabilities across modalities.
Contribute to system-level improvements in data curation, model optimization, and evaluation pipelines.
Qualifications
Minimum Qualifications:
Currently pursuing a PhD in Computer Vision, Machine Learning, NLP, or a related field.
Research experience in multimodal learning, large-scale pretraining, or vision-language modeling.
Proficiency in deep learning frameworks such as PyTorch or JAX.
Demonstrated ability to conduct independent research, with publications in top-tier conferences such as CVPR, ICCV, ECCV, NeurIPS, ICML, or ICLR.
Preferred Qualifications:
Experience with autoregressive LLM training, especially in multimodal or unified modeling settings.
Familiarity with instruction tuning, vision-language generation, or unified token space design.
Background in model scaling, efficient training, or data mixture strategies.
Ability to work closely with infrastructure teams to deploy large-scale training workflows.
Job Information
For Pay Transparency: Compensation Description (Hourly) - Campus Intern. The hourly rate range for this position in the selected city is $65 - $65. Benefits may vary depending on the nature of employment and the country work location. Interns have day-one access to health insurance, life insurance, wellbeing benefits and more, and receive 10 paid holidays per year and paid sick time. Interns who are not working 100% remote may also be eligible for a housing allowance. The Company reserves the right to modify or change these benefits programs at any time, with or without notice.
Equal Employment Opportunity & Accessibility
For Los Angeles County (unincorporated) Candidates: Qualified applicants with arrest or conviction records will be considered for employment in accordance with all federal, state, and local laws including the Los Angeles County Fair Chance Ordinance for Employers and the California Fair Chance Act. Our company believes that criminal history may have a direct, adverse and negative relationship on the following job duties, potentially resulting in the withdrawal of the conditional offer of employment:
Interacting and occasionally having unsupervised contact with internal/external clients and/or colleagues;
Appropriately handling and managing confidential information including proprietary and trade secret information and access to information technology systems; and
Exercising sound judgment.
ByteDance is committed to building an inclusive space where employees are valued for their skills, experiences, and unique perspectives. We strive to celebrate diverse voices and to create an environment that reflects the communities we reach. If you need assistance or a reasonable accommodation, please reach out to us at https://tinyurl.com/RA-request