The 100-person AI lab that became Anthropic and Google's secret weapon | Edwin Chen (Surge AI)
Key Takeaways
- Surge AI achieved unprecedented success ($1B+ revenue in under four years with under 100 employees, bootstrapped) by intentionally rejecting the standard Silicon Valley playbook of constant PR, fundraising, and pivoting, focusing instead on building a 10x better product through word-of-mouth from expert researchers.
- High-quality AI data requires a deep, subjective understanding of quality (like Nobel Prize-winning poetry) that goes far beyond simple checkbox criteria, necessitating complex measurement systems tracking thousands of signals on worker performance and output effectiveness.
- The current AI landscape is being pushed in the wrong direction by optimizing for superficial metrics like LM Arena scores and engagement, which incentivizes models to chase 'dopamine instead of truth' (hallucinations, excessive length) rather than focusing on fundamental accuracy and utility for advancing the species.
- Edwin Chen's primary motivation remains rooted in his scientific desire to understand the universe and language, manifesting in hands-on deep dives and analysis of new AI models.
- Surge AI is intentionally built like a research lab, prioritizing intellectual rigor, curiosity, and long-term incentives over typical startup pressures like quarterly metrics and constant fundraising.
- The critical work Surge AI performs is defining and training AI toward rich, complex 'dream objective functions' that advance humanity, rather than optimizing for easy proxies like clicks or laziness.
Segments
Surge’s Unprecedented Success
(00:00:00)
- Key Takeaway: Surge achieved over $1 billion in revenue with fewer than 100 employees, completely bootstrapped, by intentionally avoiding the typical Silicon Valley fundraising and PR cycle.
- Summary: The company hit over a billion in revenue last year with under 100 people, a feat Edwin Chen believes will become more common due to AI efficiencies. This success was driven by a contrarian approach, building an elite, small team focused on technology rather than pitching VCs. This lean structure allows founders to focus on building what they care about, potentially leading to more innovative products.
Defining Data Quality
(00:09:36)
- Key Takeaway: True data quality in AI training requires deep, subjective assessment (like Nobel Prize-level poetry) rather than merely checking explicit, low-level boxes.
- Summary: Most people misunderstand quality, assuming that throwing bodies at a problem yields good data; in reality, high quality demands assessing subtle aspects like uniqueness, imagery, and emotional impact. Surge measures quality by gathering thousands of signals on worker performance and expertise, and on how much each worker's output actually improves the model, similar to how Google Search ranks pages.
Claude’s Coding Superiority
(00:13:31)
- Key Takeaway: The superiority of models like Claude Code stems from nuanced post-training decisions regarding objective functions, including trade-offs between marketing benchmarks and real-world task performance, reflecting the taste of the leaders.
- Summary: Frontier labs face infinite choices on data selection, such as prioritizing front-end versus back-end coding, or optimizing for visual design versus efficiency. Post-training is an art guided by taste; optimizing for academic benchmarks for PR purposes can make a model worse at real-world tasks if the objective function is misaligned.
Skepticism Towards Benchmarks
(00:17:37)
- Key Takeaway: Academic benchmarks are untrustworthy because they often contain errors and are easily ‘hill-climbed’ by models optimizing for well-defined objectives, which differs significantly from messy, ambiguous real-world problems.
- Summary: Models can achieve high scores on benchmarks like IMO gold medals while still failing at basic tasks like parsing PDFs because benchmarks lack the ambiguity of reality. Progress measurement should rely on deep human evaluations by experts across diverse, complex roles, rather than relying on casual online A/B tests.
AGI Timelines and Trajectory
(00:21:54)
- Key Takeaway: AGI is likely a decade or more away because the final steps from 99% performance to near-perfect performance require disproportionately more effort than initial gains.
- Summary: Human evaluations remain necessary until AGI is reached, as there is always more for models to learn from experts. While models might automate 80% of an average software engineer’s job in the next one to two years, reaching 99% automation will take several more years.
Critique of AI Optimization
(00:23:03)
- Key Takeaway: Frontier labs risk optimizing AI for ‘AI slop’ by chasing dopamine-driven engagement metrics and flawed leaderboards instead of focusing on truth and advancing species-level goals like curing cancer.
- Summary: Leaderboards like LM Arena reward superficial qualities like excessive emojis and length, even if the model hallucinates, because casual users pick what ‘looks flashiest.’ This creates negative incentives where researchers prioritize climbing leaderboards over model accuracy, mirroring how social media optimization led to clickbait.
Contrarian Company Building
(00:28:33)
- Key Takeaway: Founders should reject Silicon Valley mantras like pivoting and blitzscaling, instead focusing on building the one deep, novel idea only they can execute, which avoids chasing valuations.
- Summary: The standard playbook of pivoting every two weeks chases quick valuations rather than taking big risks on meaningful ideas. Founders should prioritize mission alignment and deep belief, as failing while swinging at a hard, novel idea is preferable to succeeding at an LLM wrapper company.
RL Environments as Next Frontier
(00:33:35)
- Key Takeaway: Reinforcement learning (RL) environments, which simulate complex, multi-step real-world scenarios, are the next crucial training method beyond supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF).
- Summary: An RL environment is a simulation where agents interact with tools, data, and other entities (like a simulated startup with Slack and AWS outages) to achieve a reward. This method exposes weaknesses in models that perform well on single-step benchmarks but fail in long time-horizon, messy tasks requiring tool use and state modification.
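The agent loop described above (take tool actions against a simulated world, receive a reward only when the task is actually done) can be sketched as a toy gym-style environment. The class, tool actions, and reward scheme below are hypothetical illustrations of the pattern, not Surge's actual setup:

```python
class StartupEnv:
    """Toy multi-step RL environment: an agent must restore a simulated
    AWS outage by taking tool actions in the right order; the reward is
    sparse, arriving only when the episode ends."""

    MAX_STEPS = 10

    def __init__(self):
        self.state = {"aws_down": True, "ticket_filed": False}
        self.steps = 0

    def step(self, action):
        """Apply one tool action; return (observation, reward, done)."""
        self.steps += 1
        if action == "file_ticket":
            self.state["ticket_filed"] = True
        elif action == "restart_service" and self.state["ticket_filed"]:
            # The restart only works after the ticket exists: multi-step
            # dependencies are what single-turn benchmarks never test.
            self.state["aws_down"] = False
        done = (not self.state["aws_down"]) or self.steps >= self.MAX_STEPS
        reward = 1.0 if not self.state["aws_down"] else 0.0
        return dict(self.state), reward, done

# A scripted "agent" that happens to know the right order of tool calls.
env = StartupEnv()
for action in ["file_ticket", "restart_service"]:
    obs, reward, done = env.step(action)
print(reward, done)  # 1.0 True
```

The point of the sketch is the shape of the problem: state that actions modify, tools with preconditions, and a reward that cannot be earned in one step, which is exactly where models that ace single-step benchmarks tend to break.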
Evolution of Model Training
(00:41:04)
- Key Takeaway: Model advancement has progressed sequentially through Supervised Fine-Tuning (mimicking masters), RLHF (preference ranking), Rubrics/Verifiers (detailed grading), and is now moving toward RL Environments (simulated practice).
- Summary: SFT is like copying a master, while RLHF is learning which of five essays a human prefers. Rubrics and verifiers function as detailed grading feedback, measuring progress across candidate checkpoints. RL environments represent the next stage, teaching models complex, multi-step skills through simulated practice.
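The RLHF stage described above (a human ranking one essay over another) is commonly formalized as a pairwise reward-model loss. A minimal sketch, assuming the standard Bradley-Terry formulation rather than any particular lab's recipe:

```python
import math

def preference_loss(r_preferred, r_rejected):
    """Bradley-Terry style loss used in RLHF reward modeling: the reward
    model is trained so the human-preferred response scores higher than
    the rejected one."""
    # Probability the reward model assigns to the human's ranking
    p = 1.0 / (1.0 + math.exp(-(r_preferred - r_rejected)))
    return -math.log(p)

# When the model already scores the preferred essay higher, the loss is
# small; when it scores the two essays equally, the loss is ln 2.
print(round(preference_loss(2.0, 0.0), 3))  # 0.127
print(round(preference_loss(0.0, 0.0), 3))  # 0.693
```

Minimizing this loss pushes the preferred response's score up and the rejected one's down, which is the "learning which of five essays a human prefers" step in miniature; rubrics and verifiers then replace the raw pairwise preference with more detailed graded feedback.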
Future Model Differentiation
(00:48:11)
- Key Takeaway: Future AI models will become increasingly differentiated based on the core values and objective functions prioritized by the labs building them, moving away from commoditization.
- Summary: Initially, models were expected to commoditize, but company values now shape behavior; for example, one model might optimize for productivity (stopping after the perfect email), while another optimizes for engagement (offering 50 more iterations). This differentiation will resemble how Google, Facebook, and Apple would each build a fundamentally different search engine based on their principles.
Underhyped and Overhyped AI
(00:51:03)
- Key Takeaway: The integration of interactive mini-apps or ‘artifacts’ within chatbots is underhyped, while the long-term maintainability risks of ‘vibe coding’ generated code are being dangerously overlooked.
- Summary: The ability for chatbots to create interactive UIs or mini-apps that users can click on (like sending a text message directly from the chat) represents a powerful next step in user experience. Conversely, dumping AI-generated code into codebases without deep review risks creating unmaintainable systems in the future.
Motivation Beyond Success
(00:54:57)
- Key Takeaway: Edwin Chen’s core drive is scientific exploration, specifically understanding language and communication, exemplified by his hands-on deep dives into new AI models.
- Summary: Edwin Chen remains motivated by his scientific roots, aspiring to understand the universe and language, even dreaming of communicating with aliens. He actively engages in deep dives, running evals, and writing analyses on new models, often doing this work himself late into the night. He admits to being poor at typical CEO duties like sales but thrives on hands-on research and jamming with the science team.
Surge’s Contrarian Company Building
(00:56:19)
- Key Takeaway: Surge AI is intentionally structured like a research lab, prioritizing curiosity and long-term incentives over quarterly metrics to ensure it shapes AI beneficially.
- Summary: Chen believes Surge’s unique perspective on data and quality allows it to be unconstrained by influences that steer other companies negatively. The company values curiosity and intellectual rigor over short-term financial metrics or board deck appearances. This approach aims to ensure Surge guides AI development toward beneficial outcomes for humanity.
Influence on AI Trajectory
(00:57:06)
- Key Takeaway: Data and evaluation providers like Surge have significant, often underestimated, influence on the direction AI models take, shaping the ecosystem’s discussions.
- Summary: While leaders at OpenAI and Anthropic are visible, companies providing data and evaluation insights hold substantial influence over where models head. Many labs are still uncertain about their ultimate model objectives and how humanity should factor into the future of AI. Surge aims to continue shaping this crucial discussion through its unique perspectives.
Defining AI’s Dream Objective
(00:57:52)
- Key Takeaway: The mission of Surge AI is helping customers define their ‘dream objective functions’ (what kind of entity they want their model to be), which is complex like raising a child, not just passing a test.
- Summary: Defining an AI’s objective function is difficult, analogous to defining what makes a good person versus just achieving a high SAT score. Surge helps customers define these North Stars and measure progress toward them, focusing on hard, important metrics rather than easy proxies like clicks. The core philosophy is that ‘you are your objective function,’ necessitating rich, complex goals.
AI’s Role: Enriching vs. Lazifying
(01:00:17)
- Key Takeaway: The goal is to build AI tools that make humans more curious and creative, actively avoiding optimization for systems that simply increase engagement by making users lazier.
- Summary: It is crucial to optimize for metrics that measure whether AI makes life richer, not just those that maximize engagement through ease. Humans are inherently prone to choosing the easiest path, so AI that flatters users into higher engagement metrics is the simplest thing to build. Choosing the right, hard objective functions over easy proxies is vital for the future.
Building Without Hype
(01:01:03)
- Key Takeaway: Founders can successfully build a company by being heads-down, focusing purely on amazing research and product quality, without constant tweeting, hyping, or fundraising.
- Summary: Chen wished he knew earlier that building a successful company could be achieved through deep research and building something amazing, contrary to the expected CEO role of constant fundraising. He modeled this on DeepMind’s research focus, avoiding the perceived boredom of pure business management. Building something so good that it cuts through noise allows success without needing to become someone you are not.
Data Labeling as Child Rearing
(01:02:37)
- Key Takeaway: The work Surge AI does is fundamentally different from simplistic data labeling; it is akin to raising a child by teaching values, creativity, and subtle judgments.
- Summary: Chen dislikes the term ‘data labeling’ because it implies simplistic tasks like labeling cat photos. He views Surge’s work as teaching AI values, beauty, and the infinite subtle aspects that define a good person. This process is framed as raising humanity’s children, focusing on profound, subtle instruction rather than simple information input.
Recommended Books and Media
(01:03:40)
- Key Takeaway: Edwin Chen recommends literature centered on linguistics, philosophy, and the complexity of translation, reflecting his core interests in communication and quality.
- Summary: Recommended books include ‘Story of Your Life’ by Ted Chiang (about a linguist learning an alien language), ‘The Myth of Sisyphus’ by Camus, and ‘Le Ton Beau de Marot’ by Douglas Hofstadter, which explores the nuanced motivations behind 89 different translations of one poem. He also enjoys science fiction involving deciphering alien communication, such as the TV show ‘Travelers’ and the movie ‘Contact’.
Founders’ Values and Destiny
(01:06:11)
- Key Takeaway: Companies often become an embodiment of their CEO’s personal values, suggesting founders should build companies based on what they personally care about and want to shape in the world.
- Summary: Chen believes founders should build companies that only they could build, shaped by their unique life experiences and interests. When making big decisions, he asks what his personal values dictate, rather than what metrics or external pressures suggest. Following personal values is key to acquiring the unique experiences that lead to creating something truly important.
Hiring and Listener Input
(01:09:13)
- Key Takeaway: Surge AI is actively hiring individuals passionate about the intersection of data, math, language, and computer science, and listeners can contribute by suggesting blog topics or sharing real-world AI failures.
- Summary: Surge is seeking people who love data and the intersection of math, language, and computer science to reach out for potential roles. Listeners can be useful by suggesting topics for the Surge blog, which Chen is restarting. He is also highly interested in seeing examples of interesting, real-world AI failures that illustrate deep questions about desired model behavior.