Software Engineer, Model Training

TikTok

TikTok Trust and Safety Engineering Team

At TikTok, our mission is to inspire creativity and bring joy. We believe that creation is the core of our purpose, and our platform is built to help imaginations thrive.

About Us

We are a global community with offices in Los Angeles, Singapore, New York, London, Dublin, Paris, Berlin, Dubai, Jakarta, Seoul, and Tokyo. Our team is responsible for protecting our users from harmful content and abusive behaviors.

Responsibilities

• Optimize Algorithm Integration: Work closely with business teams to improve efficiency in evaluating and using algorithm applications across various business scenarios.
• Architectural Design and Development: Own the architectural design, development, and performance tuning of algorithm applications, solving technical challenges in high concurrency, high reliability, and high scalability.
• Machine Learning Infrastructure: Design and develop machine learning infrastructure for LLM and AIGC workloads.
• Large-Scale Machine Learning System: Build a very large-scale machine learning system integrating GPUs, RDMA networking, and high-performance storage.

Qualifications

• Hands-on Experience: Hands-on experience in one or more of the following areas: Machine Learning, Deep Learning, Recommender Systems, Natural Language Processing, or Computer Vision.
• Programming Languages: Proficiency in one or two programming languages such as C++, Go, Python, or Shell in a Linux environment.
• Distributed Systems: Understanding of the principles of distributed systems, and experience in the design, development, and maintenance of large-scale machine learning systems.
• Kubernetes Architecture: Familiarity with Kubernetes architecture, and extensive experience in system-level development and tuning.
• ML Infrastructure: Familiarity with ML infrastructure for large model training and inference.
• Cutting-Edge LLM Research: Strong understanding of, and engineering experience with, cutting-edge LLM research (e.g., long context, multimodality, active learning, alignment research, agent ecosystem).

Preferred Qualifications

• Awards in ACM/ICPC, NOI/IOI, Top Coder, Kaggle: Excellent programming, data structure, and algorithm skills, with proficiency in C/C++ or Python. Candidates with awards in ACM/ICPC, NOI/IOI, Top Coder, Kaggle, and other competitions are preferred.
• Research or Industry Experience: Research or industry experience in the field of machine learning, especially in large language models (LLMs) and generative artificial intelligence.
• Distributed Training Framework Optimizations: Experience optimizing distributed training frameworks such as DeepSpeed, FSDP, Megatron, or GSPMD.
• CUDA Programming and Performance Tuning: In-depth experience with CUDA programming and performance tuning (CUTLASS, Triton).
• PhD/Master’s Degree: PhD/Master’s degree required, with papers at top artificial intelligence conferences (NeurIPS, ICML, ICLR, CVPR, ACL, EMNLP) in machine learning (ML), computer vision (CV), natural language processing (NLP), and other fields.