Latent Space: The AI Engineer Podcast

Latest episode

268 episodes

    AIE Europe Debrief + Agent Labs Thesis: Unsupervised Learning x Latent Space Crossover Special (2026)

    23/04/2026 | 54 min
    Today, we check in a year after the first Unsupervised Learning x Latent Space Crossover special to discuss everything that has changed (there is a lot) in the world of AI. This episode was recorded just after AIE Europe, but before the Cursor-xAI deal.
    Unsupervised Learning is a podcast that interviews the sharpest minds in AI about what’s real today, what will be real in the future and what it means for businesses and the world - helping builders, researchers and founders deconstruct and understand the biggest breakthroughs.
    Thanks to Jacob and the UL production team for hosting and editing this!
    Jacob Effron
    * LinkedIn: https://www.linkedin.com/in/jacobeffron/
    * X: https://x.com/jacobeffron
    Full Episode on Their YouTube
    We discuss:
    * swyx’s view from the center of the AI engineering zeitgeist: OpenClaw, harness engineering, context engineering, evals, observability, GPUs, multimodality, and why conference tracks now reveal what matters most in AI
    * Whether AI infrastructure has finally stabilized: why “skills” may be the minimal viable packaging format for agents, why infra companies have had to reinvent themselves every year, and why application companies have had an easier time surviving model volatility
    * The vertical vs. horizontal AI startup debate: why application companies can act as the outsourced AI team for enterprises, why some horizontal companies still matter, and why sandboxes may be the clearest reinvention of classic cloud infrastructure for the AI era
    * The “agent lab” playbook: starting with frontier models, specializing for your domain, then training your own models once you have enough data, workload, and user behavior to justify the cost and latency savings
    * Why domain-specific model training is real, not just marketing: how companies like Cursor and Cognition can get users to choose their in-house models, and why search, domain specialization, and distillation are becoming more important
    * Open models, custom chips, and alternative inference infrastructure: why swyx has turned more bullish on open source, why non-NVIDIA hardware is suddenly getting real attention, and why every 10x speedup can unlock new product experiences
    * What it means to sell to agents instead of humans: why agent experience may mostly just be good developer experience by another name, why APIs and docs matter more than ever, and how pretraining-data incumbents are compounding advantages in an agent-first world
    * Why memory and personalization may become the next big wedge: today’s models mostly reward frequency of mentions, but in the future, swyx expects product choice to be shaped much more by personalized memory systems
    * The state of the AI coding wars: why coding has become one of the largest and fastest-growing categories in AI, how Anthropic, OpenAI, Cursor, and Cognition have all ridden the wave, and why the category may still have more room to run
    * Capability exploration vs. efficiency: why the industry is still in a token-maxing, experiment-heavy phase where people are rewarded for spending more rather than less
    * Claude Code vs. Codex and the strange stickiness of coding products: why first magical product experiences may matter more than expected, and why the bigger mystery may be why only a few names have emerged as real winners so far
    * What the end state of the coding market might look like: two major players, a longer tail of niche products, and possible disruption if Microsoft, Mistral, xAI, or the Chinese labs push harder into coding
    * Where application companies still have room against the labs: why frontier labs are trying to expand into verticals like finance and healthcare, but still leave space for focused companies that own the workflow and the last mile
    * Why coding may be a preview of every other AI market: the first category to truly go parabolic, the clearest example of foundation model companies colliding with application companies, and a template for how future vertical AI markets may develop
    * Why AI valuations now feel unbounded: from billion-dollar ARR products built in a year to trillion-dollar market caps, swyx and Jacob unpack how the AI market has broken traditional startup intuitions about scale and durability
    * Consumer AI vs. coding AI: why ChatGPT’s consumer category may have plateaued on frequency and product design, while coding continues to feel like a daily-use category with real momentum
    * The next product frontier beyond coding: consumer agents, computer use, and “coding agents breaking containment,” with swyx’s thesis that 2025 was the year of coding agents and 2026 may be the year they begin to do everything else
    * Whether foundation models are really killing startup categories: why swyx is less worried for early founders, more worried for mid-size startups and traditional SaaS, and why building something ambitious may now be the best job interview for a frontier lab
    * AI vs. SaaS and the internal culture war around adoption: the tension between AI-native employees who want to rip out expensive software and skeptics who think quick AI-built replacements create fragile systems
    * Why traditional SaaS may be under real pressure: swyx’s own experience spending six figures on event and sponsor management software, the temptation to rebuild it cheaply with AI, and the broader question of whether teams will trust custom AI-native replacements
    * Biosafety, security, and frontier model access: why swyx raised biosafety at a dinner with Anthropic’s Mike Krieger, why Krieger argued security is the bigger issue, and what restricted model releases reveal about Anthropic vs. OpenAI
    * The era of giant models: why 10T+ parameter systems may only be a temporary rationing phase before bigger clusters arrive, why labs may increasingly keep their most powerful models private for distillation, and why scale alone no longer feels like a complete answer
    * Memory as the slowest scaling factor in AI: why context windows have improved far more slowly than people hoped, why million-token context still has not changed most real workflows, and why memory may be the key bottleneck for the next generation of systems
    * What swyx changed his mind on in the past year: becoming more bullish on open models, more convinced that the top tier of agent startups behaves very differently from the median AI company, and more optimistic about fine-tuning and specialized model adaptation
    * “Dark factories” and zero-human-review coding: the next frontier after zero human-written code, where models not only write the code but ship it without human review, forcing companies to rethink testing and verification from first principles
    * Why RL and post-training may matter more than people assumed: even if the resulting models get thrown out every few months, the data, workflows, and domain-specific improvements persist
    * Synthetic rubrics, Doctor GRPO, and multi-turn RL: why reinforcement learning is becoming much more domain-specific and multi-step than many people realize, opening the door to much deeper customization
    * The next frontier after coding: memory, personalization, and world models, including why swyx thinks world models matter not just for robotics or gaming, but for giving AI something closer to lived understanding
    * Fei-Fei Li, spatial intelligence, and the Good Will Hunting analogy: the idea that today’s LLMs may know everything by reading it all, but still lack the lived experience that turns knowledge into a deeper kind of intelligence
    Timestamps
    * 00:00:00 Intro preview: AI coding wars, startup pressure, and market structure
    * 00:00:28 Welcome to the Latent Space × Unsupervised Learning crossover
    * 00:01:17 What AI builders are focused on now: OpenClaw, harnesses, and infra
    * 00:04:33 Why AI infra is harder than apps, and where startups can still win
    * 00:06:39 Should companies train their own models?
    * 00:09:28 Open models, custom chips, and the new inference race
    * 00:11:25 Designing products for agents, not just humans
    * 00:16:49 The state of the AI coding wars in 2026
    * 00:19:27 Capability exploration, token-maxing, and why coding is going parabolic
    * 00:21:41 What the end state of the coding market could look like
    * 00:23:50 Where app companies still have room against the labs
    * 00:27:02 Why AI valuations and market swings feel unprecedented
    * 00:28:56 Consumer AI vs. coding AI, and why sticky products still matter
    * 00:32:28 What the next breakthrough product experience might be
    * 00:32:53 2026 thesis: coding agents break containment and eat the world
    * 00:35:27 Are foundation models wiping out startup categories?
    * 00:37:33 AI vs. SaaS, vibe coding, and internal team tensions
    * 00:40:01 Biosafety, security, and the politics of restricted model releases
    * 00:42:19 Giant models, compute constraints, and the limits of scale
    * 00:44:30 Memory as the real bottleneck in AI
    * 00:44:57 Why swyx changed his mind on open models
    * 00:47:44 Dark factories and the future of zero-human-review coding
    * 00:49:36 Why post-training and RL may matter more than people think
    * 00:51:50 Memory, world models, and the next frontier of intelligence
    * 00:53:54 The Good Will Hunting analogy for LLMs
    * 00:54:21 Outro
    Transcript
    [00:00:00] swyx: Isn’t that crazy? That number is just mind boggling.
    [00:00:03] Jacob Effron: What is the state of the AI coding wars today?
    [00:00:05] swyx: We’re in a phase of, sort of, capability exploration. The general thesis that I have been pursuing now is that the same way that 2025 was the year of coding agents, 2026 is coding agents breaking containment to do everything else.
    [00:00:16] Jacob Effron: Do you worry about the foundation models just getting into a bunch of these startup categories?
    [00:00:21] swyx: Mid-size startups. Yes.
    [00:00:23] Jacob Effron: What do you think the end state of this market is
    [00:00:25] swyx: for the market structure to, to significantly change? There would be
    [00:00:28] Jacob Effron: Today on Unsupervised Learning we had a fun episode, and what’s really become an annual tradition: a crossover episode with our friends at Latent Space.
    Swyx and I sat down and we talked about everything happening in the AI ecosystem today: what we thought of the various changes at the model layer, what’s happening in the infra world, the coding wars, and a bunch of other things. It’s a ton of fun to do this with someone I really respect and another great podcaster in the game.
    Without further ado, here’s our episode. Well, swyx, this is, uh, super fun to be back with another Unsupervised Learning × Latent Space crossover episode.
    [00:01:02] swyx: Yeah,
    [00:01:02] Jacob Effron: I feel like there are a lot of places we could start, but you know, one thing I always find fascinating, uh, about the way you spend your time is you obviously are, like, at the epicenter of this AI engineering movement and community, and you run these events and conferences and put on these awesome talks, and, and I think you just have a great pulse on the zeitgeist of what’s going on.
    [00:01:16] swyx: Yeah.
    [00:01:17] Jacob Effron: Maybe to, to start just what are the biggest topics people are thinking about right now?
    [00:01:21] swyx: Yeah, so I just came back from London, uh, where we did AIE Europe, and we’re doing roughly one per quarter now, which… Yeah, you’ve
    [00:01:27] Jacob Effron: really upped
    [00:01:27] swyx: the, hopefully
    [00:01:28] Jacob Effron: upped the, upped the pace.
    [00:01:29] swyx: We’re trying. We’re trying to match AI speed, you
    know?
    [00:01:30] Jacob Effron: Yeah, exactly. The topics would be completely different, I imagine. Uh,
    [00:01:33] swyx: yeah. You know, I definitely curate the tracks, like, you can see what I think when you see the track list and the speakers that I invite. Obviously OpenClaw is, like, the story of the last four or five months, and then just below that, I would consider harness engineering and context engineering to be two related topics in agents and RAG. And then there’s a long tail of evergreen stuff like evals, observability, GPUs, uh, and, uh, LLM infra in general. We also have other updates on, like, multimodality and, uh, generative media, let’s call it.
    Um, but definitely the first three that I mentioned are top of mind for people. Yeah.
    [00:02:13] Jacob Effron: I think harnesses in particular are, like, so interesting. Um, you know, there was this tweet from Harrison Chase, the, the LangChain CEO, that caught my eye recently, where he said, you know, it finally feels like we have stability, uh, around the infrastructure for, uh, you know, around AI.
    And I think what he was basically implying is, like, look, over the past two, three years, as a company at the epicenter of AI infrastructure, it was a bit like playing whack-a-mole, right? You were constantly moving around with however the building patterns were evolving.
    [00:02:36] swyx: For Harrison, for sure. Right? Like, he’s basically had to reinvent the company every year since he started LangChain, right? It was LangChain, LangGraph, and agents, and, like, uh, I think he’s, like, one of the most nimble, adept, sharp people about this. Yeah. Yeah.
    [00:02:49] Jacob Effron: Saying now, now is finally the time of stability.
    [00:02:51] swyx: Yes. Yeah.
    [00:02:52] Jacob Effron: Yeah. Um, do you buy that, or what do you make of that take?
    [00:02:56] swyx: I think that it’s very expensive to say “this time is different,” sometimes. But when you’re just writing code, like, it’s actually okay to just try to make a call, and I think it may not even matter if the call is right or not. Like, I just don’t care that much, because you can be right on a thesis, but if you don’t figure out how to monetize the thesis, then who cares if you said something first? That said, um, it does feel like, for example, uh, we went through a lot of different ways of packaging integrations up with, uh, with agents, and it feels like we’ve landed at skills, which is, like, the minimal viable format. Yeah. Which is just a markdown file, uh, with some scripts attached to it, and I don’t see how it can be more simple than that. And so there is some justification for the stability around harnesses. I feel like there may be more adaptation with regards to, maybe, like, the real-time elements, or subagents, or memory, or any of those, like, agent disciplines, let’s call it, in agent engineering.
    Uh, but if the thesis is that, okay, agents are just LLMs with tools in the loop, with a file system, where they can do retrieval with skills and all this, like, standard tooling that now seems to be relatively consensus, then probably that makes sense. Um, I just think, like, there’s no point trying to stake your reputation on the thesis that we’re there, because if it changes again, just change with it.
    It’s fine.
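    The “skills” packaging swyx describes really is that minimal. As a hedged sketch (the file layout, frontmatter fields, and names here are hypothetical, loosely modeled on the markdown-plus-scripts pattern he mentions, not any lab’s official spec), a skill can be little more than:

    ```markdown
    <!-- skills/deploy-preview/SKILL.md (hypothetical example) -->
    ---
    name: deploy-preview
    description: Deploy the current branch to a preview environment and report the URL.
    ---

    # Deploy Preview

    1. Run `scripts/deploy.sh` from the repo root.
    2. Wait for it to print a preview URL.
    3. Reply to the user with that URL and the commit SHA you deployed.
    ```

    The agent discovers the file, reads the instructions, and runs the attached script; there is no SDK, schema registry, or plugin protocol to version, which is why it is hard to imagine a simpler format.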
    [00:04:33] Jacob Effron: Yeah. You know, I’ve always been struck by how that is much more challenging for infrastructure companies than application companies. Obviously, I think, yeah, on the application side you’ve seen, you know, Brett Taylor from Sierra, Max from Legora, like, they’re like, look, we build what’s ahead of the models, and we’re willing to throw everything out every three months, you know, as the models get better and better.
    Exactly. Yeah. But the thing you at least have there is an end customer, right, that’s, like, decently sticky. Um, you know, they will mostly stick; they’ll give you a shot, at least, at building these things. What I’ve always found more challenging, uh, at the kind of, like, reinvent-yourself-every-three-months infrastructure layer is, like, you know, developers are definitely a pickier audience, maybe, than an accounting firm or, uh, you know, a bank.
    Yeah. And so it’s definitely a more challenging position to be in, to have to constantly reinvent yourself.
    [00:05:17] swyx: Yeah. Yeah. And, and, like, when they churn, it’s, like, very complete. Like, they’ll leave for, like, the hot new thing, uh, because there’s, like, no defensibility, I guess. Like, even if you are a database, like, uh, people can migrate workloads off databases. It’s a known thing. Uh, so I think, like, basically what we’re talking about is the vertical versus horizontal, uh, debate in AI startups. And, uh, the way I think about it also is just that, like, when you are, um, Legora, when you are Abridge, like, you are the outsourced AI team, right? Your job is to apply whatever state-of-the-art AI methods.
    [00:05:55] Jacob Effron: Yeah. Like this translation layer between model capabilities and your
    [00:05:57] swyx: own customers. Yeah. To the end customers. And, like, well, if they didn’t have you, they would have to hire in-house, and they’re not gonna hire in-house, so they have you. And, like, I think that’s, like, reasonable, like, very robust to whatever trends and discoveries people make in the engineering layer.
    I do think, like, there are, um, some useful horizontal companies being built, but they’re all very much, sort of, like, reinventions of classic cloud in the AI era, the primary one being sandboxes. Yeah. Um, which, like, it’s another form of compute, guys, like, let’s not get too excited about it.
    But I mean, like, the workloads are enormous.
    [00:06:38] Jacob Effron: Right.
    [00:06:38] swyx: Yeah.
    [00:06:39] Jacob Effron: It’s interesting, and I feel like, as part of this, you know, the questions that folks are asking around infrastructure, there’s a lot around, you know, the extent to which companies should have their own AI teams and what they should be doing in-house.
    And, you know, uh, I think there are questions around: should people be training their own models? Should people be doing, you know, RL, uh, in-house based on the data they have? I feel like, you know, one has to evolve their takes on this every three months with the pace of things. But where, where are you at on this today?
    [00:07:00] swyx: I think, well, I mean, actually, owning models has gone up. Um, and obviously I’m involved in Cognition, and Cursor is also doing, uh, a lot of its own model training. And I think that is some part of what I’ve been calling the agent lab playbook, where you start off with the state-of-the-art models from, uh, from the big labs and you, uh, specialize for your domain.
    But once you have enough workload and enough high-quality data from your users, then you can obviously train your own models and, like, save a lot on cost and latency and all that, all that good stuff. Um, you also get, like, a marketing bonus of, like, calling it some fancy name and putting out some research.
    [00:07:38] Jacob Effron: From my seat, I can’t tell how much of it is, like, actual, you know, value that’s provided to the end user, and how much of it is that marketing bonus. Right? It seems some combination of the two.
    [00:07:45] swyx: I think it’s both.
    [00:07:46] Jacob Effron: Yeah.
    [00:07:46] swyx: Um, no, no. There, there actually is real value. Um, and you know that for a number of reasons. Like, one: even when it’s not subsidized, people do choose it as, like, one of the top four or five. This is true of both Composer 2 and, uh, SWE-1.6: each is one of the top five models, like, in a fair market, in a free market, yeah, in a model switcher. People do choose it, and, like, it’s not subsidized. Like, so that’s as good as it gets. Uh, but beyond that, like, domain-specific models, for example, for search, which both companies have, absolutely make a ton of sense.
    Everyone says, like, yeah, we should always do this. And honestly, like, I think the infrastructure for that is becoming easier, with, um, like, Thinking Machines’ Tinker, as well as, like, the big labs’ stuff. Yeah. I mean, like, this is one of those, like, reversals of the bitter lesson, where you first bootstrap on the large models and the general-purpose models to get big.
    And as you get very well-defined workloads that are just high quantity but not high variance, um, then you just distill down to a smaller model and run that on your own. Right? Which, like, totally makes sense.
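    That distillation step has a standard shape: train the small model to match the big model’s output distribution on your own workload’s traffic. A minimal sketch of the classic soft-target objective (the logits, vocabulary size, and temperature below are illustrative, not anything Cursor or Cognition has published):

    ```python
    import math

    def softmax(logits, temperature=1.0):
        z = [x / temperature for x in logits]
        m = max(z)  # subtract max for numerical stability
        e = [math.exp(x - m) for x in z]
        s = sum(e)
        return [x / s for x in e]

    def distill_loss(teacher_logits, student_logits, temperature=2.0):
        """KL(teacher || student) on temperature-softened distributions.

        The student is trained (by gradient descent, not shown) to drive
        this toward zero over the traffic of a well-defined workload.
        """
        p = softmax(teacher_logits, temperature)
        q = softmax(student_logits, temperature)
        kl = sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q))
        return kl * temperature ** 2  # T^2 keeps the scale comparable across temperatures

    # Toy logits over a 3-token vocabulary.
    teacher = [2.0, 0.5, -1.0]
    matched = distill_loss(teacher, teacher)               # 0.0: distributions identical
    mismatched = distill_loss(teacher, [-2.0, -0.5, 1.0])  # > 0: divergence penalized
    ```

    The appeal for “high quantity, not high variance” workloads is exactly that the teacher’s distribution over real traffic is a narrow target a much smaller model can hit.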
    [00:08:50] Jacob Effron: What I’m less clear on is the kind of DIY RL use case, which I think is really mostly around, you know, improved, uh, quality for different things. Obviously there are probably, like, more efficient ways to, you know, get a smaller model that’s faster and cheaper. And it’ll be interesting to see whether, you know, obviously you had, two, three years ago, this whole class of companies that were, you know, pre-training and claiming better outcomes in their domains, then getting kind of cooked as each model iteration improved.
    You know, I wonder whether a similar story plays out in the, uh, in the RL space. Yeah, for the focus on pure outcomes and quality, not the cost side; clearly your own models for cost at scale make a ton of sense.
    [00:09:28] swyx: I think those are two sides of the same coin. Like, you basically always want to hold, uh, quality constant, or trade off a little bit of quality for a drastic decrease in cost. And that’s true for everyone. Uh, one element I wanted to bring out, which is very much in favor of open models, is custom chips. So this would be Cerebras, but also Talu. And then there’s a huge range of stuff in between.
    This has been a huge story this past year: just, like, everything non-NVIDIA is getting bid up, including, like, freaking MatX, which is very, which is very rewarding for me. But I think it’s one of those things where, like, oh, suddenly, because the number of alternative, uh, hardware options is increasing, the inference that you can get is insanely high. Like, um, we’re talking thousands of tokens per second instead of less than a hundred. So the trade-off for quality doesn’t hold as much anymore, because the speed is so high.
    [00:10:24] Jacob Effron: Have you seen a lot of companies go all in on the alternative chips?
    [00:10:26] swyx: So Cognition has, yeah, on Cerebras, uh, and, and so has OpenAI. Um, uh, and, no, I don’t think so beyond that. Uh, and is that mostly foreshadowing? Yeah. I used to be kind of a skeptic, in terms of, like, okay, so what if I get my inference at a hundred tokens per second sped up to 200 tokens per second? It’s only two x faster. It’s not that big a deal. Um, but I think every 10x does unlock a different usage pattern. Um, and we have proof in Talas and, and some of the others that you can actually, um, drastically improve inference speed. And what happens from there, I don’t even really know. Like, it’s so hard to predict when entire applications just appear at once.
    Yeah. Uh, and it also isn’t that expensive, right? So, like, um, this is one of those things where, like, I, I think the investment cycle is gonna be multi-year. Um, and I would caution people to not dismiss it too, too quickly.
    [00:11:25] Jacob Effron: Yeah. I mean, one other, like, infra question I was curious to get your thoughts on: obviously, it seems increasingly a lot of the cutting-edge infra companies are building for agents as the buyers of their product, or users of their product, right?
    [00:11:35] swyx: Ooh, another huge theme. Yeah. Yeah.
    [00:11:38] Jacob Effron: And I’m trying to figure out, like, what do you have to do differently about selling to agents? Um, are they just the ultimate rational developers? Uh, or is there, you know…
    [00:11:46] swyx: No, absolutely not. Um, I think they are easily prompt-injected and, uh, very tuned towards, like, basically compounding existing winners.
    [00:11:57] Jacob Effron: Yeah,
    [00:11:57] swyx: so, like, congrats if you won the lottery of getting into the training data right before 2023, because now you’re, like, installed in there for the foreseeable future. But yeah. Uh, you know, one stat that Vercel CTO Malte dropped at my conference was that now 60% of traffic to Vercel’s, um, like, admin app architecture, for, like, configuring Vercel applications, uh, is bot. It’s not, it’s not human. Uh, so, like, your primary customer is agents now. Um, and it’s mostly coding agents, mostly people using the CLI or MCP or whatever. But yeah, I mean, I think, step one: if it doesn’t exist as an API that agents can use, it doesn’t exist. Right, right. Which I think is, like, uh, a good hygiene thing anyway, to make everything API-available. But also, as, like, an extra, um, push on, like, product people to not only work on the UI; um, you should probably work on the CLI stuff. Beyond that, I think, honestly, I come from the sensibility that everything you are trying to do for agent experience now, which is the term that Matt Biilmann at Netlify is trying to coin, is the same thing that you should have been doing for developer experience.
    That you should have had good docs, you should have had a consistent API, uh, that is mostly stateless. Um, you should have, I guess, discoverability, or progressive disclosure, or, like, search, or, like, whatever. And so now that people have energy, in, like, finding these customers, to do that, that’s great. Um, do I believe in extending beyond that into something like AEO, um, for gaming the chatbots? Not necessarily. But obviously there are gonna be huge advantages for people who figure out the short-term wins. Yeah. And short-term wins can compound.
    [00:13:43] Jacob Effron: Do you think these compounding advantages go to, like, the pre-training-data-cutoff companies? Like, you know, obviously over some period of time, I imagine that doesn’t persist. And so as you think about, like, I dunno, three, four years from now, what the, you know, selection criteria end up being: do you think it still mirrors exactly what you were saying before? Like, it’s exactly what you should have been doing all along to sell a good product to developers?
    [00:14:01] swyx: It could be, except that I think in three, four years we’ll probably have much better memory and personalization.
    So then general AEO or GEO doesn’t really matter as much. So I think whatever memory or personalization system we end up with will probably determine what you end up choosing, much more than what is currently the case, which is just frequency of mentions, let’s call it. Yeah,
    [00:14:26] Jacob Effron: yeah.
    [00:14:26] swyx: Uh, so you just spam quantity, and I think that’s, I mean, that’s something I’m looking forward to.
    I do think, like, you know, I, I think that the fundamental exercise to work through for yourself is: if you start a new, um, sort of, uh, disruptor company now, and there’s a big incumbent that everyone knows, like, like Supabase (Supabase is, like, kind of the Postgres database incumbent), if you wanna start, like, a new Supabase, how would you compete with them? And I don’t necessarily have the answer, but I, I do think, like, people like Resend are relatively new. I think they started, like, 2023, and still, there was a recent survey where, like, people checked what Claude recommends by default.
    If you just don’t prompt it with anything, just say, “gimme an email provider,” it says Resend in, like, 70% of cases. Like, the fact that you can get in there with, like, such a relatively short existence, I think, is encouraging.
    [00:15:14] Jacob Effron: Yeah.
    [00:15:14] swyx: I do think, like, um, you do want to do whatever it is to get into that very short mentions list, because, um, it’s not gonna be 20 of them, it’s gonna be like three.
    [00:15:26] Jacob Effron: No, definitely. It feels like, uh, you know, probably more consolidation than ever, uh, or kind of, like, you know, uh, more of a winner-take-most market than maybe the physics of go-to-market in the past might have, uh, enabled.
    [00:15:38] swyx: The other thing also is, like, semantic association is gonna be very important, uh, in the sense that, like, you want to do, like, the combo articles where you’re like, “use my thing with Vercel,” with blah, blah. And, like, that all gets picked up in a corpus. And so that’s probably one thing that you wanna do well. I don’t know what else. Uh, it’s, it’s one of those things where, like, I feel I’m behind. Uh, I don’t know how you feel about this, but, like…
    [00:16:04] Jacob Effron: I think AI is just everyone constantly feeling like they’re behind, some, uh…
    [00:16:08] swyx: Yeah.
    [00:16:09] Jacob Effron: I wanna meet the person that doesn’t feel behind.
    [00:16:11] swyx: But, like, with AX, right? Like, so, so, like, my, my stance was exactly what I said before: everything that you should do for agents is something that you should have done for humans anyway. Yeah. And so, to the extent that it’s just getting you more energy to do things for agents, great. But, like, uh, it’s hard to articulate what new thing, apart from just, like, more spam, um, you should be doing. Anyway, that would be my take right now. Um, I, I do think, like, there, there will be more turns at this. I think the personalization turn that is coming, um, will be big. And I don’t know what that looks like, because, like, basically we feel kind of tapped out on the memory side of things.
    [00:16:49] Jacob Effron: Yeah. I, I guess since we last chatted, you know, you took this role over at Cognition, um, and you obviously have a front-row seat to the AI coding space today. You know, I feel like coding, in many ways, people view it as, I mean, besides being, like, the mother of all markets and this massive opportunity, I think it’s kind of a preview of what’s to come for many other spaces.
    Both, yeah. You know, I feel like agents are most advanced in coding. I also feel like the, you know, competition between foundation models and application companies, you know, uh, mirrors what we may see in other spaces. And so, maybe for our listeners, can you just lay out: what is the state of the AI coding wars today?
    [00:17:25] swyx: Um, it is massive, right? Like, uh, and I don’t think, necessarily, last time we talked about this, we appreciated the size of it.
    [00:17:32] Jacob Effron: No, I wish we did.
    [00:17:33] swyx: The state of the AI coding wars today: um, both OpenAI and Anthropic have made it their priority to compete in coding. Um, and Anthropic is, like, 2.5 billion in ARR just from Claude Code; the way they recognize ARR is up for debate. Uh, OpenAI, I don’t think a public number is known, but let’s call it 2 billion as well. And then Cursor is, like, rumored to be 2 billion, you know? And those, those are, like, the public numbers that are known. Yeah. Um, so, like, huge markets that have just been created in the past one year.
    Like, like, Anthropic, like, Claude Code just recently celebrated its one-year anniversary, which is, yeah, pretty nice. Um, and then I think, like, the other thing that I see is there’s, there’s some other people who are, like, oh, here’s, like, the sort of relative penetration of, uh, Claude use cases, right?
    Like, and it’s, like, coding 50%, and then legal, whatever, health, uh, it’s, like, the remaining ones. And there was a very popular tweet that was, like, okay, look at the empty space in all these other use cases: if you are a new founder today, you should be betting on the other stuff, on a sort of catch-up theory. Yeah.
    And my pushback is the same pushback that, uh, I had on apps over Google, which is, like, well, why is this time different? Like, if it went from, let’s say, 10 to 50% in the past year, why can’t it keep going? Uh, and, like, getting that wrong is actually a very painful one, because you could have just done the momentum bet instead of the mean-reversion bet. So I, I think that that is the state of things now: people are very, very much into psychosis. Um, they’re getting rewarded for spending more rather than spending less. And I think we’re not in that phase of efficiency. We’re in a phase of, sort of, like, capability exploration.
    So I think people who are more crazy, who are more, uh, creative, um, get rewarded, comparatively. Yeah.
    [00:19:27] Jacob Effron: Well, it's interesting. I mean, it feels like behind these token-maxing leaderboards and whatnot, the first phase of this transition, from a workforce perspective, is you just gotta show your employer: hey, I use these tools.
    [00:19:37] swyx: Here's my number of tokens, here's what I cost, and that's it. They don't care about the quality. Right. It is maybe distasteful to someone who cares about the craft and all that, but directionally, everyone just wants the number to go up regardless. So it's not very discerning, and it's probably very sloppy, but I think it's net fine, because we're still probably underusing AI in general.
    Yeah. And so I think that's very interesting. We had Ryan La Poplar from OBI on the podcast, who spends a billion tokens a day. Yeah. And for those counting at home, that's something like $10,000 worth of API tokens a day if they paid market rates, and most of us can't afford that.
    Yeah. And probably a lot of what he does is slop.
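For those counting at home, the back-of-the-envelope math can be sketched as below; the blended per-million-token rate is an illustrative assumption for this sketch, not a quoted price from any provider.

```python
# Rough cost of a billion-token-a-day agent workload.
# BLENDED_RATE_PER_MILLION is an assumed blended input+output price
# in dollars per million tokens, chosen to match the "$10,000 a day" quote.
TOKENS_PER_DAY = 1_000_000_000
BLENDED_RATE_PER_MILLION = 10.00

daily_cost = TOKENS_PER_DAY / 1_000_000 * BLENDED_RATE_PER_MILLION
annual_cost = daily_cost * 365

print(f"${daily_cost:,.0f}/day")     # -> $10,000/day
print(f"${annual_cost:,.0f}/year")
```

At that assumed rate, the annual run rate lands well into the millions, which is why this kind of usage is only visible at the extreme edge of capability explorers.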
    [00:20:25] Jacob Effron: Right.
    [00:20:25] swyx: But he's going to discover... like, if there were a new capability, he would discover it first, before you, because he was trying and you were not trying. Right. And if you only do things that work, well, good for you.
    But the people who are going to discover the next hot thing are living at the edge.
    [00:20:42] Jacob Effron: Right, and increasingly, living at the edge means just having the compute budget to run these experiments. Kind of similar to how living at the edge on the research side has always worked: you were constrained in many ways by the amount of compute you had to run your experiments.
    It feels similar now on the builder side, actually putting these tools to work.
    [00:20:56] swyx: Yeah. The other thing that's very obvious is that Anthropic is kind of the high-price premium player, where restricting limits, or even restricting model releases, is the name of the game.
    Whereas Codex is like: come on in, guys, use our SDK, use our login, we don't care, we're gonna reset limits, whatever. You do want to exploit the subsidies where you can get them, and Codex is definitely super subsidized right now; Gemini is also very subsidized. And comparatively, I guess, while that's going on, it's not that bad to be a capabilities explorer on just the $200-a-month plan from Claude Code or from OpenAI.
    And my sense is that people aren't even there yet.
    [00:21:41] Jacob Effron: How do you think this market ultimately plays out? It's obviously such a big market that any slice of it is interesting for anyone going after it. But I think what makes people so interested in the coding market particularly is that it feels like this
    foreshadowing of what will happen in any other application market that the foundation models eventually turn to, aim their models at, and gather data around. So how do you think this goes? Does there end up being room for lots of different kinds of players? What do you think the end state of this market is, and do you think it's applicable to other markets?
    [00:22:10] swyx: I feel like there will be... I mean, the status quo is probably the most likely outcome, which is that there are two big players and a small range of longer-tail players that fit the use cases the two big players don't. That feels right to me. For the market structure to significantly change, there would need to be a significant change in the economics, or the brand building, or the value propositions of the companies involved, and I
    haven't seen anything in the last six months that has really changed the stories materially. So I feel like they would just keep going until something else happens. Something else happens, meaning, like, Microsoft wakes up and goes: guys, we have GitHub; we'll do something much bigger here than just Copilot.
    That would be a big change. MSL has put out a model now, and I was at a breakfast with Alex Wang where they were like: yeah, we really, really want to go after the coding use case. We haven't done anything yet. But don't underestimate them, right? And similarly for the Chinese labs.
    I think they're trying to go after it. Like, Z.ai is doing stuff; GLM, Z.ai and GLM are the same thing. So everyone's trying to get a piece of that pie, but I feel like the status quo has been pretty stable for the past almost a year, I'll say.
    [00:23:39] Jacob Effron: Yeah. And is the room for the application companies more on the enterprise side? Like, what surface area do the model companies leave for application companies?
    [00:23:50] swyx: Yeah, that's a good one. It's very much evolving. I will say, because OpenAI did not have this level of attention on coding a year ago, we just don't have that much history, right? And it seems like, for example, the big push at OpenAI now is the super app. Is that a consumer thing? Is that a product-portfolio rationalization thing? How much is that gonna take attention away from coding at the time when they actually do want to put more into coding? I think it's very unclear. So I do think, in both big labs, meaning OpenAI and Anthropic (DeepMind and xAI are separate cases),
    they are trying to find other TAM-expansion areas. So: Claude Code for finance, yeah, Claude Cowork, all those things. Whereas I think Cursor and Cognition are comparatively just focused on coding. So I do think they leave space, and I do think for the other verticals that means the same thing, right?
    That they're not gonna be that intensely focused on that domain. Except I would mark out finance and healthcare as the next ones that they're clearly going after. I would say, comparatively, healthcare seems more thorny. There have been some announcements about it, but I would respect the finance work a lot more, just because the path to money is a lot clearer.
    [00:25:12] Jacob Effron: Yeah. No, I mean, maybe similar to the space that's being left in these other domains, there's obviously a lot that's required to actually implement these tools in enterprises, versus maybe just giving folks model access out of the box.
    [00:25:27] swyx: Yeah, yeah. So the agent-lab thing is: we'll do the last mile for you. Whereas I think the model labs tend to just trust the model and be minimalist about it. Both of them work.
    [00:25:38] Jacob Effron: Yeah.
    [00:25:38] swyx: I don't necessarily think one beats the other for every use case. All I do know is that it does seem like
    the large enterprises want a dedicated partner that isn't just the model labs, which is kind of interesting.
    [00:25:55] Jacob Effron: We've been in this phase of pure capability exploration, and I think nothing has been better for the large labs, right? They're always gonna be at the frontier of capability exploration.
    And so I think they have a very good relationship with a lot of these enterprises. But ultimately, over time, the incentive structure of these labs is always gonna be maximal token consumption from the end customers they work with. And there are just, I think, so few companies that have actually gotten to massive scale.
    Maybe coding, again, is the most interesting, since it's the first space that has really just gone completely parabolic, you know? Yeah. You must love it every day. Absolutely insane. And I think it
    [00:26:32] swyx: gets even... Okay. I mean, we say good things about Cursor and Cognition, but the sheer liftoff of both Anthropic and OpenAI,
    because they have independent valuations... I mean, let's throw xAI in there too, because it's now pegged at $1.2 trillion. That number is just mind-boggling. I feel like in normal investing, or normal startups, there's kind of a ceiling market cap or valuation, totally, that you reach, and you go: all right, it's gonna be chiller from now on.
    And these guys are not slowing down. No.
    [00:27:02] Jacob Effron: Well, I also think the dynamic that's fascinating about some of these later-stage companies is this: in the past, in the venture world, if you got to a certain level of scale, the question around you was really more a valuation question. And this is why there were different types of venture investors; the late-stage growth people were just incredible at assessing a bit of what the ultimate market opportunity of a company is, but also what the right way to value it is.
    We know it's in some band of outcomes. Sure, there's some variance to it, but it's relatively understood what that band is, and then maybe over time you get surprised to the upside. Whereas now, for any later-stage company, even the labs themselves, the band of what that company might be worth right now, or in a year, or in two years, is so massive, because of how fast the ecosystem changes.
    Even for later-stage companies, every three months could be an existential-level event, to the upside or to the downside. Yeah. And you're obviously seeing it in the positive with coding. If you think about a company like Anthropic: for a while, it was unclear whether they were going to have access to enough capital to really stay in the race, right?
    And then coding hit at the exact right time. They had the perfect model for it. They executed brilliantly. And now they're one of the most valuable companies in the world.
    [00:28:13] swyx: At the same time, I have zero sympathy for OpenAI, because they're crushing it and they're all rich.
    You know, this is a high-class champagne problem, to be number two at coding or whatever. Who cares? You're doing great.
    [00:28:27] Jacob Effron: Yeah. It's funny, though. You would be closer to this, given that you're in the AI coding space, but a lot of people I talk to think Codex is just as good as, if not better than, Claude Code.
    Right. One thing I've been really surprised by, and maybe Claude Code is a better product in some ways, I'm curious for your thoughts, is that in consumer AI with ChatGPT, you saw this big first-mover advantage. Admittedly, today, Claude and Gemini are great products; it's not abundantly clear ChatGPT is any better. But
    people stick with ChatGPT. It's the first thing that introduced them.
    [00:28:56] swyx: They stay, but they're not growing anymore. I don't know if you've seen...
    [00:28:59] Jacob Effron: Right. But that to me is more of a product problem. It's not like they've lost share to someone else.
    My understanding is the overall problem with consumer AI today is much more: how do you take this tool, which for folks like us, knowledge workers, is this incredible magic tool, and make it a daily-active-use tool for a lot of people around the world? What are the products for that?
    It's kind of a category-wide problem. In coding, for example, the entire space has gone parabolic. There may be some relative growth among other consumer AI players, but it's not like consumer AI as a category is going parabolic and they're failing to capture most of it. I think the larger problem is much more: hey, the category has hit a bit of a plateau, and people haven't figured out how to bring tons more users on board,
    yeah, or increase the frequency of those users. So it seems more like a category-wide problem than a massive market-share change. I was gonna draw the comparison to the coding space, where Claude Code was obviously the first product to introduce people to this magical experience.
    By all accounts, Codex is pretty damn close to as good, if not better. But still, you would've thought that first product would not be a super sticky surface area, and it turns out it has been. It feels like the first lab to introduce you to an experience really does keep a lot of the focus.
    [00:30:12] swyx: I think maybe it's still early days. You know, ChatGPT is three-plus years old, and yeah, Claude Code is only one; it just turned a year. Yeah. So give it time, you know? Yeah. I mean, definitely a lot of people have switched to Codex; maybe that will keep going. It's really hard to tell.
    I do think that because we are in this high-volatility, high-temperature phase, the loyalty and stickiness to first movers and category creators isn't as high as it might be in some other areas we've looked at in our careers.
    [00:30:47] Jacob Effron: Yeah. Though, I mean, I've been surprised by the Claude Code thing.
    In many ways, I always worried about the...
    [00:30:52] swyx: ...enterprise. You thought it would've been gone by now?
    [00:30:53] Jacob Effron: Not gone. But I always worried that the consumer business of these companies would be quite sticky, and that the enterprise API business was actually, in some ways, your least loyal buyers; they would move.
    [00:31:05] swyx: Right, right.
    But they worked out that it wasn't the enterprise API; it was the enterprise product.
    [00:31:09] Jacob Effron: Totally. And maybe that was the secret. But the amount of lock-in, or just default behavior, that has happened in that space is more than I might've imagined, with two products that by all accounts are pretty damn similar.
    Yeah.
    [00:31:22] swyx: No fight there. I will say, I do think Codex is still in catch-up mode. In terms of personal experience, the only thing I like out of Codex is Spark, and, yeah, I feel like the skills integration is a little bit better. I feel like the speed is a bit better,
    maybe because it's written in Rust or whatever. Very minor things that you're almost telling yourself, rather than objectively assessing between the two of them. Vibes-wise, I think that's what's going on. But I feel like the missing question in this whole debate is: why is this so concentrated in only two names, right?
    Yeah. Where is the Gemini presence? Where's the xAI presence? They are trying; they just haven't made that much progress yet.
    [00:32:12] Jacob Effron: But what the Claude Code moment does show, and in some ways it makes you a little more bullish on the potential for someone else to catch up, is that if you're the first to introduce some magical, net-new product experience, that actually might be stickier than one might have imagined.
    [00:32:27] swyx: Right, right, right. Okay. Yeah.
    [00:32:28] Jacob Effron: And so everyone can believe they have a shot at
    [00:32:29] swyx: that. What do you think that new product experience might be? And this is a failure of imagination on my part. I always wonder; people always say this: well, the thing that will save us is being first to the next new thing.
    Like, what is it?
    [00:32:41] Jacob Effron: Yeah.
    [00:32:42] swyx: It’s like,
    [00:32:45] Jacob Effron: I dunno, something around, like, a consumer-agent, computer-use hybrid. I think, obviously, we're scratching the surface on the consumer side.
    [00:32:53] swyx: So my current theory is that OpenClaw is a vision of things to come.
    [00:32:58] Jacob Effron: Totally.
    [00:32:58] swyx: And it's good that OpenAI has the association with OpenClaw, but by no means do they have the right to win it.
    The general thesis I have been pursuing now is that, the same way 2025 was the year of coding agents, 2026 is the year coding agents break containment to do everything else. So coding agents continue to win, but because they generate software, and software eats the world, it's kind of like the transitive property: software eats the world, coding agents eat software, therefore coding agents eat the world.
    Which is interesting.
    [00:33:30] Jacob Effron: Yeah, and breaking containment is always an easier phase in the consumer context than in the enterprise one. You've seen people run these really cool experiments in their own personal lives.
    I think, like...
    [00:33:37] swyx: yes.
    [00:33:38] Jacob Effron: Figuring out... obviously everyone's focused on the enterprise side now, around how you create these experiences. I feel like, with the vibes, people love these narratives that everything has completely shifted. Actually, OpenAI, organizational volatility aside, has great products, a great team, great models. Everyone else in the world is incentivized for there to be two or three more great model companies; everyone would love more of them. And so I feel like the natural forces of the world revolt when any one company is too much the star of the show, right?
    There are so many people in the ecosystem who are incentivized for that not to happen. So I think I'd be shocked if we don't have a reversion of vibes, maybe not completely the other way, but at least a little more equal, at some point over the next six to twelve months.
    [00:34:24] swyx: I think there are just different stages here. When you talk about the world wanting more model companies, I think about the neolabs.
    [00:34:30] Jacob Effron: Yeah.
    [00:34:31] swyx: And, I mean, I don't know: is it fair to say none of them have really broken through in the past year?
    [00:34:35] Jacob Effron: I think that’s totally fair,
    [00:34:37] swyx: Which is rough. Well, how are we gonna grow that diversity in choice? That's... this is it.
    [00:34:46] Jacob Effron: Yeah. It'll be really interesting to see what ends up happening with that.
    And you've seen folks like Nvidia very incentivized to make sure there's a broader platform of other model providers.
    [00:34:57] swyx: I don't know, people say this, but I don't think they try that hard. Nvidia tries harder to build neoclouds
    [00:35:05] Jacob Effron: Yeah.
    [00:35:06] swyx: than neolabs.
    [00:35:07] Jacob Effron: Well, they try pretty damn hard to build neoclouds, so
    [00:35:09] swyx: that’s,
    [00:35:09] Jacob Effron: yeah.
    [00:35:10] swyx: But, you know, the CoreWeaves of the world are in a much happier place than any neolab built on top of them.
    [00:35:18] Jacob Effron: Yeah. Though one might argue it's easier to enable a neocloud to be successful; you can't will a neolab into existence the same way. So
    Nvidia
    [00:35:25] swyx: has more direct control over it.
    For sure.
    [00:35:27] Jacob Effron: What else is catching your eye today on the startup side? I mean, there's obviously this whole narrative of the foundation models announcing a product and every stock going down 15%.
    [00:35:36] swyx: Yeah.
    [00:35:37] Jacob Effron: Do you worry about the foundation models just eating into a bunch of these startup categories?
    [00:35:43] swyx: Not really. Okay, there's the point of view of being an investor in startups, and there's the point of view of: do you wanna start something? And I think, honestly, the downside for all of these is so minimal, in the sense that the worst that happens is you just get hired into one of these labs anyway.
    So for people who just do things and try things and try to execute in a competent way, even if it doesn't work out commercially, even if it just wasn't that great, that's your job interview to get into one of these labs. So I don't feel that.
    From a very, very small startup perspective, that is. Mid-size startups, yes. I will say there's been a lot of dead LLM infra, a lot of LLM-infra consolidation, like the Langfuses of the world getting absorbed into ClickHouse. And I think people have maybe worked out the domain-specific playbook, and I think that's okay.
    And yeah, I'm not that worried about that. Okay, so, I would say I'd be more worried about traditional SaaS, like low-NPS SaaS. This is the whole AI-versus-SaaS debate that's been going on. And literally I'm going through that exact thing in my company, so I'm kind of
    thinking through this on a very visceral level, right? On one hand, you have the people who say: you vibe coders don't appreciate the amount of work that goes into a CRM, and yeah, you think you can rip out Salesforce? So did the thirty entrepreneurs before you, right? You classically underestimate the things that you don't deeply know. And the target audience is not you. At the same time, we have never been able to build software so easily and customize software so easily, and yeah, you're not gonna use 90% of the things in Salesforce.
    [00:37:33] Jacob Effron: So what have you done internally?
    [00:37:34] swyx: So we have the main SaaS that we use for event management and sponsor management, and we pay $200K a year for that. Not huge, but chunky for my scale. And yeah, I could probably spend $2,000 and build a custom version of it. The trick has been dealing with the rest of my team and getting them on board.
    Yeah. Because I'm the most AI-pilled person on my team, but I can't make that decision myself. And in the same way, I've been telling other CEOs and team leaders: you can be super Claude-pilled, you can be deep in LLM psychosis and think that's okay, but you have to bring your team with you.
    And I think the widening disparity in LLM psychosis within companies is causing real rifts. On one hand, the people who are less AI-native are not getting the picture; they're actually behind. They're not waking up to the fact that everything you think is necessary is not actually that necessary,
    and in fact you'd be better off if you just held your nose, went in, and came out the other side, yeah, only talking to agents in natural language. Your life would actually be better, and you're just being close-minded. There's that perspective. The other perspective is: oh, you vibe coder, you did this in a weekend, you got the 80% solution, and now the rest of your employees
    have to pick up the rest of your s**t, right? You thought you were so hot and amazing at this, but actually you didn't figure it out, and actually LLMs are still useless at this, and blah, blah, blah.
    So I think there's this huge debate going on in every company right now, and I have a small microcosm of it. Yeah, it's making me hesitate to pull the trigger. But I will at some point; maybe I've put it off for one year, but not five. Yeah. So SaaS is definitely getting squeezed.
    It does make me wonder, though. I do think there's an opportunity for a more AI-native system-of-record thing that is not just Postgres, or not just MongoDB, although both are very good. Maybe it's something like Convex; people bring up Convex a lot. I just feel like the quote-unquote Firebase of AI apps isn't really a thing yet,
    beyond what we have. Which is fine. It's just that we could probably start in a more rapid-iteration cycle first before scaling up to a Postgres or MongoDB, which are older tech. I was at a dinner with Mike Krieger, the CPO of Anthropic, and we were just going around the room asking: what are people most worried about?
    Yeah. And for me, instead of security, I brought up biosafety. Yeah,
    [00:40:21] Jacob Effron: classic.
    [00:40:22] swyx: Actually, like I said, it was cliché and classic, and the rest of the table was like: what do you mean? Someone sitting at home can manufacture a virus that wipes out half of humanity.
    [00:40:32] Jacob Effron: Almost like the OG Geoffrey Hinton:
    this is why you should be scared.
    [00:40:35] swyx: I'm like, yeah, read the risk reports. This is the thing. And Mike was just sitting there, knowing he was sitting on Mythos, going: actually, it's security. And I think part of it is
    very good marketing. Too good, even. Yeah, I would actually advise Anthropic to tone down the marketing, because it is just a very good model and you don't have to make so many marketing claims around it. At the same time, it is not really a private model if you give it to 40 companies,
    each of whom has 10,000 employees or whatever, right? It's not private; there are bad actors in there.
    [00:41:18] Jacob Effron: Yeah. Hopefully not as bad as releasing it widely. But it's an interesting case study for how many model releases might go. I mean, this might be the first model release that looks like the rest of them from now on, right?
    [00:41:31] swyx: So there's an overall product strategy for Anthropic of bundling: restrict access, bundle product with model, maybe.
    Whereas OpenAI has definitely been a lot more philosophically aligned on: we will just enable access everywhere, and we don't know what will come out of it. Right.
    [00:41:51] Jacob Effron: Right. Though, I mean, in this current moment, the cynical take is that it also just ties to the amount of compute that both companies have.
    [00:41:56] swyx: Yeah.
    Right, right, right. Yeah, I think that's true. I do think the dawn of larger-than-10-trillion-parameter models is very interesting. I think it's a temporary phenomenon, because we have much larger compute clusters coming online for everyone over the next three to five years.
    And this is already written in the cards.
    [00:42:18] Jacob Effron: Yeah.
    [00:42:19] swyx: So to the extent that like, you know, will we have rationing of models, uh, above 10 trillion, uh, in like two years? I don’t think so. I think everyone will have no, we’ll just
    [00:42:29] Jacob Effron: have rationing of the next phase.
    [00:42:30] swyx: Right. Right. But that's almost as it should be.
    My classic example, and this is just me theorizing, not anything confirmed by Google: when Google announced Gemini, they actually announced three sizes, Flash, Pro, and Ultra. They never released Ultra; they only have Pro and Flash. So my theory is they have Ultra sitting in a basement and they just keep distilling from it for Flash and Pro.
    Which, yeah, I actually think is as it should be for any lab.
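The distillation swyx is theorizing about, training smaller models against a big teacher's soft outputs, can be sketched in miniature. This is the generic soft-target recipe with toy logits, not anything confirmed about Google's pipeline; the temperature value and logits are illustrative.

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-softened softmax; higher T flattens the distribution."""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()                       # numerical stability
    e = np.exp(z)
    return e / e.sum()

def distill_loss(teacher_logits, student_logits, T=2.0):
    """KL(teacher || student) on softened distributions: the classic
    soft-target objective a student is trained to minimize."""
    p = softmax(teacher_logits, T)     # teacher's soft targets
    q = softmax(student_logits, T)     # student's current prediction
    return float(np.sum(p * (np.log(p) - np.log(q))))

# Toy example: a confident teacher vs. an uninformed (uniform) student.
# The loss is positive and shrinks as the student matches the teacher's
# full distribution, not just its top answer.
teacher = [4.0, 1.0, 0.5]
student = [2.0, 2.0, 2.0]
print(distill_loss(teacher, student))
```

The point of the "Ultra in the basement" move is exactly this: the teacher's soft distribution carries more signal per example than hard labels, so a never-released big model can still pay for itself by improving the small models you do ship.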
    [00:43:02] Jacob Effron: Yeah. Just because those are the models that people actually wanna end up using, and it's just cost-prohibitive.
    [00:43:06] swyx: Yeah, it's cost. It's not the want; it's just the cost.
    I do think it's interesting that for a while I was considering the theory that models capped out at 2 trillion parameters, and I think that's proving to be wrong. Well then, if I'm wrong, how wrong am I? Do we do 200 trillion? Do we do 2 quadrillion, whatever? I don't think we have a straight answer to that. But it's interesting that we are continuing to scale the number of parameters when everyone can kind of see that we're not going to get the next 1,000x or 1,000,000x from this paradigm.
    So other labs are working on other model-architecture improvements. We need a different scaling law, I guess, because I feel like people already feel we're tapped out on this. The end state of this is we turn most of the world into data centers, and I don't know,
    I don't know if we want that.
    [00:44:08] Jacob Effron: Yeah, I mean, if the returns on intelligence are there, maybe it's not so bad.
    [00:44:13] swyx: I think there's just a sheer amount of, like, unscalability that is wrangling people's sensibilities right now, especially in terms of context lengths.
    My classic quote is that context length is the slowest-scaling factor in LLMs.
    [00:44:30] Jacob Effron: Yeah.
    [00:44:30] swyx: We took maybe three years to go from, like, 4,000 tokens of context to a million, and that's about it. Yeah. Gemini has had a million-token context for two years now, and no one's using it.
    So, yeah, it's memory. Memory is probably gonna be the biggest limiting constraint on all these things.
    [00:44:50] Jacob Effron: Yeah. Certainly seems that way. I guess I’m curious over the last year since you recorded last, like what’s one thing you’ve changed your mind on?
    [00:44:57] swyx: I feel like I was kind of bearish on open models like last year.
    In the sense that, like, I had just done the podcast with Ankur
    [00:45:07] Jacob Effron: Yeah.
    [00:45:08] swyx: Of Braintrust, where, I mean, you know, he has a good cross-section of all the top AI companies, and he says market share of open source is 5% and going down. I think that's changed. I think it's going up. And even if,
    [00:45:22] Jacob Effron: even though the capability gap does seem to be increasing.
    Spending on the
    [00:45:26] swyx: time. It's hard to tell. Yeah, it's really hard to tell. 'Cause, okay, for listeners, "capability gap increasing" is on public benchmarks. And let's say you're comparing, like, I don't know, GPT-OSS or GLM 5.1 against the frontier. It's really hard to tell, 'cause even if they were closing, you would also not believe that they were closing that much, because it's very easy to game the benchmarks.
    Yeah. So you just don't really know. All you know is, there are somewhat objective OpenRouter stats on what people choose in a free market. And people do choose some of these open models in significant volume, except that a lot of them are heavily discounted. So you need to kind of price-adjust these things.
    So even if that were true, which I'm not sure of, I feel like the numbers point up now instead of down. I think the separation between what the top-tier agent labs are doing versus the average startup in AI, or the average GPT wrapper, is significant enough that you should not worry about the mean industry number.
    You should cohort things: here's the median, here's the bottom 80%, and here's the top 20%. And the top 20% acts very differently than the bottom 80%. And the top 20%, which is all I care about, is definitely going towards more open models. The Fireworks and the Togethers are crushing it.
    And so will all the fine-tuners, right? I think maybe last time we even said things like "fine-tuning as a service doesn't work." Well, now it's gonna work. It's a derivative of the open models market.
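    The price adjustment swyx describes is simple to sketch: weight each model's token volume by its blended price before computing share. The model names, token volumes, and prices below are entirely made up for illustration.

    ```python
    # Hypothetical per-model stats: raw token share vs. price-adjusted spend share.
    # All volumes and prices are invented for illustration.
    models = {
        # name: (monthly_tokens_in_billions, blended_price_per_million_tokens_usd)
        "closed-frontier": (500, 10.00),
        "open-model-a":    (300, 0.50),   # heavily discounted open model
        "open-model-b":    (200, 0.80),
    }

    total_tokens = sum(t for t, _ in models.values())
    total_revenue = sum(t * p for t, p in models.values())

    for name, (tokens, price) in models.items():
        token_share = tokens / total_tokens        # what raw usage stats show
        revenue_share = (tokens * price) / total_revenue  # what spend shows
        print(f"{name}: {token_share:.0%} of tokens, {revenue_share:.0%} of spend")
    ```

    On these invented numbers, the open models carry half the tokens but only a few percent of the spend, which is exactly the gap that makes raw usage stats misleading without a price adjustment.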
    [00:47:01] Jacob Effron: Well, and also the workloads are scaling to the point where people care about cost and speed more and more.
    [00:47:06] swyx: Yeah.
    [00:47:06] Jacob Effron: And moving from just pure use-case discovery, like, what can these models do, to, okay, we know what they're gonna do at scale, now let's do 'em cheaper and faster.
    [00:47:14] swyx: Yeah. Yeah. So that change, I think, is probably the most significant in my mind. And I always like to do the mental math here, which is like
    scheduling a learning rate: when you've been wrong once, yeah, what else were you wrong on? And I'm kind of working through it. To me, the other thing was the coding one, which obviously I have now come full 360 on. But I think people are not appreciating dark factories enough, which I don't know if you've discussed on the pod yet.
    [00:47:44] Jacob Effron: No.
    [00:47:45] swyx: And so this is kind of a strong DM slash Simon Willison term. The general idea is, okay, there are different levels of AI coding psychosis. The very first level, which by the way I first encountered at Cognition five months ago, was zero human-written code. Yeah.
    Right. Which seems like a reasonable thing now, but was less reasonable five months ago. The next frontier, which sounds as crazy today as zero human coding did in the past, is zero human review.
    [00:48:17] Jacob Effron: Yeah.
    [00:48:18] swyx: Like, just check it in without even reviewing it. Very few people are doing that, but OpenAI is exploring this, and I feel like it's definitely the only scalable way to do this.
    Which just means you have to kind of flip the SDLC, or change large amounts of what you normally do. Which is probably things you should have done anyway: more testing, more automated verification, or whatever. But that is a frontier where, when you have unlocked it in your company, you are just gonna produce much more software than you've ever had.
    And it's gonna be so disposable, so cheap, that you can probably innovate in quality a lot as well. That quantity helps you get to quality.
    [00:49:00] Jacob Effron: Yeah.
    [00:49:01] swyx: Which I think people are very uncomfortable with. ‘cause like people associate more quantity with slop.
    [00:49:07] Jacob Effron: Right. No, it's back to exactly the discussion we were having on the reaction to these token-maxing scoreboards, and the idea that today maybe that's not the best sign of productivity and efficiency, but going forward
    [00:49:18] swyx: yeah, but you still get rewarded for it.
    So they're like, f**k it, whatever. But I think the people who do best in 2026 are not the cynics who go, oh, that's just slop, I'm not gonna participate in that. They're like, okay, this is happening with or without me; bend it the right way.
    [00:49:36] Jacob Effron: Yeah, no, I love that. Um, for me, a kind of related thing on the open-source model side is: for so long, I really didn't think it made any sense to do any sort of RL, post-training, pre-training, anything you could do to improve overall quality. For latency and cost, it always made sense to me.
    But for overall quality, like, God, you just get that for free in the models three, six months later. I think what I'm starting to change my tune on a little bit is, you know, hearing all these app companies talk about, like, we build stuff and then we throw it out three months later as the models improve.
    You're like, okay, well then what you're doing for capability improvement is just another version of that, right? I still don't think that your RL or post-training is gonna make you have a better model for years and years to come. But I think you still have to be pretty rigorous on: is that the single best thing you can do to solve a customer problem?
    And oftentimes it's literally just, add more data, feed more data via connectors to these models, or do some clever engineering on the back end, or whatever it is. But if the single best thing you can do for that three-month period to improve your customers' outcomes is post-training in some way that really improves the output of the model, even if you throw it out three months later because the general models catch up,
    it still might have been worth doing. And so I think I'm more open to
    [00:50:45] swyx: you, you throw out the results, but you don’t throw out the raw data.
    [00:50:47] Jacob Effron: Totally.
    [00:50:48] swyx: And like, so like
    [00:50:48] Jacob Effron: Right. Then you just run it again. And so basically there’s some, obviously at the level of cost of like $10 million, maybe that’s too much, but there’s some level of cost where
    [00:50:55] swyx: No,
    [00:50:55] Jacob Effron: it’s the, it’s
    [00:50:56] swyx: not even 10 million,
    [00:50:56] Jacob Effron: right?
    No, of course it’s not. Uh, you know,
    [00:50:58] swyx: yeah.
    [00:50:58] Jacob Effron: There’s obviously some level of investment, uh, at which it’s the equivalent of just like staffing four engineers to go build something for three months.
    [00:51:04] swyx: Yeah. So the other thing, for listeners, I'm just gonna leave some droplets of info: look into the long-trajectory, synthetic-rubrics work that people are doing. It's very important, including something that's called Dr. GRPO.
    I'll just leave those key search terms in there. I think what it means is that RL is going much more multi-turn than people think, and that means you can customize the models on way more specific dimensions than traditional, let's call it, SFT, or a sort of shallow RL that was done a year ago.
    So, like, hundreds of turns.
    [00:51:44] Jacob Effron: Yeah.
    [00:51:45] swyx: Uh, and, and, and I think that that leads you down a path of like complete domain specificity.
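    For listeners chasing those search terms: GRPO scores each sampled rollout relative to its group, and, as I understand the Dr. GRPO work, its tweak drops GRPO's per-group standard-deviation division (and a per-token length normalization, not shown here), which removes a bias in the updates. A minimal sketch, with made-up reward values:

    ```python
    # Group-relative advantage computation: the core of GRPO, plus the
    # Dr. GRPO variant. Rewards here are made-up rubric scores.
    from statistics import mean, pstdev

    def grpo_advantages(rewards, eps=1e-8):
        """Original GRPO: center rewards within the group, divide by the std."""
        mu, sigma = mean(rewards), pstdev(rewards)
        return [(r - mu) / (sigma + eps) for r in rewards]

    def dr_grpo_advantages(rewards):
        """Dr. GRPO: keep the centering, drop the std division."""
        mu = mean(rewards)
        return [r - mu for r in rewards]

    group = [1.0, 0.0, 0.0, 1.0]  # e.g. rubric scores for 4 sampled rollouts
    print(grpo_advantages(group))
    print(dr_grpo_advantages(group))
    ```

    Either way, rollouts that beat their own group get positive advantage and reinforce the policy, which is what makes rubric-scored, multi-hundred-turn trajectories trainable without a learned value model.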
    [00:51:50] Jacob Effron: What else? Of these, like, unanswered questions in AI today, what are you looking at, you know, paying close attention to in the next year?
    [00:51:58] swyx: I have a few theses for what
    is the sort of next frontier. One is memory, which, memory and personalization, we talked about. The other is really world models, which we've done a small little series on, from Fei-Fei Li (yeah, of course) to even Moon Lake, and General Intuition. And there's a lot of debate as to the relative importance of this.
    I think a lot of it manifests as 3D static worlds that you kind of inhabit for a little bit and walk around, and they're like, cool, but how does this help me with my B2B SaaS? Right. And
    [00:52:29] Jacob Effron: it’s like all the hype now is robotics, right?
    [00:52:31] swyx: Yeah. And there's obviously a correlation between world models and embodied
    vision and experience, which leads to robotics. But I think world models are very interesting just for improving intelligence itself, beyond the next-token-prediction paradigm. And so I think people are kind of testing the edges around that. One of our top articles this year so far has been on adversarial reward models.
    I do think, if you don't do anything else, just read Fei-Fei's essay on spatial intelligence, on why LLMs don't have it. She may not have the solution yet, but she has the right problem statement. Yeah. And so everyone else is trying to solve that problem statement in their own way.
    And let's see who wins. But I don't think it does you any favors to equate world models to robotics, or world models to gaming, or the current manifestations, because what is at stake is a much more important conception of intelligence than just answering questions.
    It is: does the AI understand what a table is? What matter is, what physics is? For those who are movie fans, it's almost like Good Will Hunting, where Matt Damon knows everything because he read it in a book, but he's never lived. Great,
    [00:53:54] Jacob Effron: great scene with
    [00:53:55] swyx: Robin Williams.
    With Robin Williams. And I look at that scene and I go, that's exactly the difference between a very intelligent LLM who knows everything but hasn't experienced anything.
    [00:54:04] Jacob Effron: Wow. That’s an awesome note to end on. Uh, that’s a, have you used that before? That’s great.
    [00:54:08] swyx: Yeah. So one thing I've done with Latent Space is I moved to adding daily writeups.
    Yeah. And one of the times I was doing a daily writeup, I wrote that.
    [00:54:16] Jacob Effron: That’s a great
    [00:54:17] swyx: one. I love
    [00:54:17] Jacob Effron: that. Um, well, so it’s been a ton of fun. Thanks so much
    [00:54:19] swyx: for coming, man.
    [00:54:21] Jacob Effron: I’m Jacob Effron and this has been Unsupervised Learning. A podcast where I get to talk to the smartest people in AI and ask them tons of questions about what’s happening with models and what it means for businesses in the world.
    As I hope is clear, I have a ton of fun doing this. It’s a nights and weekends project in addition to my day job as an investor at RedPoint, but our ability to get these incredible guests on really comes from folks like you subscribing to the podcast, sharing it with friends. It’s really what ultimately makes this whole thing work.
    And so please consider doing that. And thank you so much for your support and listening. We’ll see you next episode.


    This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe
  • Latent Space: The AI Engineer Podcast

    Shopify’s AI Phase Transition: 2026 Usage Explosion, Unlimited Opus-4.6 Token Budget, Tangle, Tangent, SimGym — with Mikhail Parakhin, Shopify CTO

    22/04/2026 | 1 h 12 min
    Early bird discounts for the San Francisco World’s Fair, the biggest AIE gathering of the year, end today - prices will go up by ~$500 tonight so do please lock in ASAP!
    From near-universal AI tool adoption inside Shopify to internal systems for ML experimentation, auto-research, customer simulation, and ultra-low-latency search, Mikhail Parakhin joins us for a deep dive into what it actually looks like when a 20-year-old, $200B software company goes all-in on AI. We cover why Shopify has become much more vocal about its internal stack, what changed after the December model-quality inflection, and why the real bottleneck in AI coding is no longer generation, but review, CI/CD, and deployment stability.
    We also go inside Tangle, Tangent, SimGym, which are three major AI initiatives that Shopify is doing to make experimentation reproducible, optimization automatic, customer behavior simulatable, and search and catalog intelligence faster and cheaper at scale. Along the way, Mikhail explains UCP, Liquid AI, and why token budgets are directionally right but often measured badly, why AI-written code can still increase bugs in production, what makes Shopify’s customer simulation defensible, and what he learned from the Sydney era at Bing.
    We discuss:
    * Mikhail’s path from running a major Microsoft business unit spanning Windows, Edge, Bing, and ads to becoming CTO of Shopify
    * Why Shopify is talking more publicly about AI now, and why staying at the frontier has become necessary for the company
    * Shopify’s internal AI adoption curve, the December inflection, and why CLI-style tools are rising faster than traditional IDE-based tools
    * Why Jensen Huang is directionally right on token budgets, but raw token count is still the wrong way to evaluate engineering output
    * Why the real unlock is not more agents in parallel, but better critique loops, stronger models, and spending more on review than generation
    * Why AI coding can still lead to more bugs in production even if models write cleaner code on average than humans
    * Why Shopify built its own PR review flow, and why Mikhail thinks most off-the-shelf review tools miss the point
    * How PR volume, test failures, and deployment rollback are becoming the real bottlenecks in the agent era
    * Why Git, pull requests, and CI/CD may need a new metaphor once code is written at machine speed
    * What Tangle is, and how Shopify uses it to make ML and data workflows reproducible, collaborative, and production-ready from the start
    * Why Tangle is different from Airflow, and why content-addressed caching creates network effects across teams
    * What Tangent is, and how Shopify is using auto-research loops to optimize search, themes, prompt compression, storage, and more
    * Why Tangent is becoming a democratizing tool for PMs and domain experts, not just ML engineers
    * Why AutoML finally feels real in the LLM era, and where auto-research still falls short today
    * Why Tangle, Tangent, and SimGym become much more powerful when combined into one system
    * What SimGym is, why simulated customers only work if you have real historical behavior, and why Shopify’s data gives it a moat
    * How SimGym evolved from comparing A/B variants to telling merchants what to change on a single live storefront to raise conversions
    * Why customer simulation is so expensive, from multimodal models to browser farms to serving and distillation costs
    * How Shopify models merchant and buyer trajectories, runs counterfactuals, and thinks about interventions like discounts, campaigns, and notifications
    * Why category-level behavior is so different across commerce, and why ideas like Chinese Restaurant Processes are showing up again in practice
    * Shopify’s new UCP and catalog work, including runtime product search, bulk lookups, and identity linking
    * Why Shopify is using Liquid AI, and why Mikhail sees it as the first genuinely competitive non-transformer architecture he has used in practice
    * Where Liquid already works inside Shopify today, from low-latency query understanding to large-scale catalog and Sidekick Pulse workloads
    * Whether Liquid could become frontier-scale with enough compute, and why Shopify remains pragmatic and merit-based about model choice
    * Who Shopify is hiring right now across ML, data science, and distributed databases
    * The Sydney story at Bing, why its personality was not an accident, and what Mikhail learned from deliberately shaping AI character early on
    Mikhail Parakhin
    * LinkedIn: https://www.linkedin.com/in/mikhail-parakhin/
    * X: https://x.com/MParakhin
    Timestamps
    00:00:00 Introduction: Mikhail Parakhin, Microsoft, and Shopify
    00:01:16 Why Shopify Is Talking More About AI
    00:02:29 Internal AI Adoption at Shopify and the December Inflection
    00:06:54 Token Budgets, Jensen Huang, and Why Usage Metrics Can Mislead
    00:10:55 Why Shopify Built Its Own AI PR Review System
    00:12:38 AI Coding, More Bugs, and the Real Deployment Bottleneck
    00:14:11 Why Git, PRs, and CI/CD May Need to Change for Agents
    00:18:24 Tangle: Shopify’s Reproducible ML and Data Workflow Engine
    00:21:19 Why Tangle Is Different from Airflow
    00:26:14 Tangent: Auto Research for Optimization and Experimentation
    00:30:07 How Tangent Democratizes Experimentation Beyond ML Engineers
    00:33:06 The Limits of Auto Research
    00:36:36 Why Tangle, Tangent, and SimGym Compound Together
    00:37:20 SimGym: Simulating Customers with Shopify’s Historical Data
    00:42:47 The Infra Behind SimGym
    00:46:00 Why SimGym Gets Better with Real Customer History
    00:47:30 Counterfactuals, HSTU, and Modeling Merchant Trajectories
    00:51:55 CRPs, Clustering, and Category-Level Customer Behavior
    00:53:30 UCP, Shopify Catalog, and Identity Linking
    00:55:07 Liquid AI: Why Shopify Uses Non-Transformer Models
    00:59:13 Real Shopify Use Cases for Liquid
    01:03:00 Can Liquid Scale into a Frontier Model?
    01:09:49 Hiring at Shopify: ML, Data Science, and Databases
    01:10:43 Sydney at Bing: Personality Shaping and AI Character
    01:13:32 Closing Thoughts
    Transcript
    [00:00:00] swyx: Okay. We’re here in the studio, a remote studio, with Mikhail Parakhin, CTO of Shopify. Welcome.
    [00:00:08] Mikhail Parakhin: Thank you. Welcome.
    [00:00:10] swyx: I don't even know if I should introduce you as CTO of Shopify. I feel like you have many identities. You led, sort of, the Bing ML team, I guess, or ads team. I don't know. People variously refer to you as, like, CEO or... I don't know what that previous role at Microsoft was.
    [00:00:29] Mikhail Parakhin: Uh, that was... Yeah, my previous role at Microsoft: I actually was the CEO of one of Microsoft's business units, which included, as we discussed, all the things that people like to laugh about, including Windows and Edge and Bing and ads and everything.
    [00:00:47] swyx: Yeah, yeah. What a, what a, what a wild time.
    You’ve obviously, uh, done a lot since you landed at Shopify. Uh, one of the reasons I reached out was because you started promoting more sort of internal tooling, uh, primarily Tangle, but also a lot of people have seen and adopted Tobi’s QMD, uh, and obviously, I think, uh, Shopify has always been sort of leading in terms of, uh, engineering.
    I think more-- it’s just more recent that you guys have been more vocal about your sort of AI adoption. Is that, is that true?
    [00:01:16] Mikhail Parakhin: Well, I think AI tools in general are a fairly recent development. And Shopify, at this stage of its development, is on a sort of runaway trajectory: we're developing AI in-house, building tools that use AI, and interfacing with the wider AI community.
    So it's just a natural byproduct that we talk about it more, too. Just even yesterday, Andrej Karpathy famously tweeted about whether there are ways you can organize your agents to store data and then look it up, so that you don't have to re-research or lose context every time.
    And a little bit tongue-in-cheek, I tweeted, "Hey, we've done it much earlier, and we even have different approaches, Tobi and I." Tobi, of course, is a big fan of QMD, and I'm more of a SQL, SQLite fan. But yeah, very similar things that we've already done here. The point is, we're a very dynamic, explosively growing company, and we have to be at the forefront of AI adoption, obviously.
    [00:02:29] swyx: Yeah. Yeah. Your team kindly prepared some slides, actually, that we're gonna bring up on the screen. I think I can screen share, and then we can kind of go through some of the shocking stats that maybe put some numbers to what exactly is going on. So here we have an internal AI tool adoption chart.
    What are we looking at here?
    [00:02:54] Mikhail Parakhin: Yeah, these are very interesting statistics. This is the number of daily active workers, you know, think of it as DAU, basically the active users of-
    [00:03:05] swyx: Yeah ...
    [00:03:05] Mikhail Parakhin: an AI tool as a percentage of all the people in the company, right? And then different AI tools. And you can see two things here. One is that the green is the total.
    So you can see that it approaches nearly 100% by now. It's hard to do your job now without interacting deeply with at least one tool. You can see another interesting thing: as many people commented, December was the phase transition, when suddenly models got good enough that everything took off and started growing.
    Many people noticed that small improvements accumulated into this big change in, roughly, the December timeframe.
    [00:03:52] swyx: Yeah.
    [00:03:52] Mikhail Parakhin: The other thing I would claim you can see is that CLI-based tools, and tools that don't require you to look at the code, are becoming more popular. You can see various versions of Claude Code and Codex and Pi and internal development tools taking off.
    Exactly, yeah. And blue is our River, just our internal agent for coding. Whereas tools that require IDEs, such as GitHub Copilot or Cursor, are not exactly shrinking, but they're not growing as fast. The red line is the IDE kind of tools. So you can see that they're not experiencing as fast a growth.
    [00:04:37] swyx: As I understand it, basically every employee has their choice, right? Choose whatever tool you use, and then you're just kind of doing a daily survey or something.
    [00:04:47] Mikhail Parakhin: Exactly. And the push is: to get your job done, you can use any tool, and we effectively fund unlimited tokens for everybody.
    We do try to control the models that people use, but from the bottom, not from the top. We basically say, "Hey, please don't use anything less than Opus 4.6."
    [00:05:09] swyx: Oh .
    [00:05:10] Mikhail Parakhin: Some people end up using GPT-5.4 extra high. Some people use Opus 4.6. You know, there are pluses and minuses in going for the full one-million-token context window versus not.
    But we try to discourage people from using anything less than that.
    [00:05:28] swyx: Yeah, yeah. Got it, got it. I mean, the next chart here really kind of shows the expansion and the sort of December 2025 inflection, right? People are using a lot of tokens. I think it's also really interesting that no one was kind of abusing it in 2025.
    Like, compared to this year, there was almost no growth. I mean, it still probably grew fifty percent.
    [00:05:56] Mikhail Parakhin: Yeah. This is just a different scale. It's still exponential growth, just at a different rate of expansion. There was an inflection point. And Sean, I would claim the super interesting part here is that you can see the distribution becoming more and more skewed.
    Yes, the top percentiles grow faster. That means the people in the top ten percentile, their consumption grows faster than the seventy-fifth percentile, and so forth. So the distribution skews more and more towards the heaviest users, which... I don't know what it tells me. It feels not ideal, to be honest.
    Or maybe it's okay. We'll see.
    [00:06:36] swyx: Why does it feel not ideal? Is, is it because of, um, quantity over quality, or what’s the concern?
    [00:06:42] Mikhail Parakhin: Because take it to the limit: if this rate of separation continued for a year, there would be one person consuming all the tokens. So it's just kind of strange.
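    The skew Mikhail is describing can be put in one number: the share of all tokens consumed by the top decile of users. A toy sketch, with invented usage figures:

    ```python
    # Toy illustration of token-consumption skew: what fraction of total usage
    # the top 10% of users account for. All numbers are invented.
    def top_decile_share(usage):
        usage = sorted(usage, reverse=True)
        k = max(1, len(usage) // 10)          # size of the top decile
        return sum(usage[:k]) / sum(usage)

    # A long-tailed distribution: a few heavy users, many light ones.
    usage = [1000, 400, 200] + [50] * 7 + [10] * 20
    print(f"top 10% of users consume {top_decile_share(usage):.0%} of tokens")
    ```

    Tracking this share over time is one way to see whether the distribution keeps skewing toward the heaviest users, as the chart suggests, or flattens out as the rest of the company catches up.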
    [00:06:54] swyx: Yeah, I mean, I think internal teaching and all that will help distribute things more widely. But in the early days, of course, the people who are more AI-pilled will obviously find more ways to use it than the people who are less AI-pilled, let's call it that.
    I'll just kinda quickly pause here; we'll go back to the rest of the slides, but I just wanna review: there are a lot of CTOs of large companies like yourself who are all considering some kind of token budget, right? It's something Jensen Huang has been talking about, where if your $200K engineer is not using $100K worth of tokens every year, they're underutilizing coding agents.
    Of course Jensen Huang would say that, but it seems a very quantity-over-quality approach, and some people are basically saying, well, is this comparable to judging engineer quality by lines of code, right? Which we also know is kind of flawed, but better than nothing. So I don't know if you have a sort of management take here on how to view these kinds of metrics.
    [00:08:02] Mikhail Parakhin: Well, I mean, you're baiting me. This is my favorite topic. If you let me, I'll probably talk for two hours on just this. I have a lot of things to say. I do think Jensen has gotten a lot of bad press, people saying, "Oh, of course the cake seller says you don't eat enough cake."
    You know? Like, of course. But I actually think that's undeserved. I think he's actually right. I do think-
    [00:08:33] swyx: he’s directionally correct.
    [00:08:35] Mikhail Parakhin: Yeah. Yeah. He’s directionally correct for sure. Uh-
    [00:08:37] swyx: Who knows what the right number is? Yeah.
    [00:08:39] Mikhail Parakhin: The thing that I do want to say, and this is something that we learned through trial and error, and it's very important, is like two things.
    One is that it's not about just consuming tokens. In fact, the anti-pattern is running too many agents in parallel that don't communicate with each other. That's almost useless compared to just fewer agents, and it burns tokens very inefficiently. What works is setting up the right critique loop, especially with the high-quality models, where one agent does something, the other one, ideally with a different model, critiques it and suggests ways to improve it, and the first agent redoes it with this critique. And so it takes much longer.
    So people don't like it, because latency goes up. They have to wait until this debate has happened. But the quality of the code is much higher. And another thing, since you mentioned it: the overall token budget is just like lines of code. Lines of code are exploding for everybody right now, partially just because AI can write a lot more code, you know, doesn't get tired.
    And so you have to have a very strong narrow waist during PR review; otherwise the number of bugs will go through the roof. It's this unexpected consequence of volume trumping everything. I would claim by now a good model writes code with, on average, fewer bugs than the average human.
    But since they write so much more of it, like more of it will make it into production. So you have to- You still
    [00:10:26] swyx: have
    [00:10:26] Mikhail Parakhin: more bugs. Yeah. You have to have very rigorous PR reviews, also automated, of course. But yeah, you have to spend a lot of budget there. For me, actually, the important metric is the ratio of budget spent during code generation versus budget spent on expensive tokens, like GPT-5.4 Pro or Deep Think from Gemini, checking during PR reviews.
    [00:10:55] swyx: Yeah, totally. I noticed in your chart you didn't have any review tools. Do you just use, say, Claude Code to review? Or do you have another set of review tools, like the Greptiles, the CodeRabbits; Devin has a review tool. I don't know if you've tried those specialist review tools.
    [00:11:13] Mikhail Parakhin: You’re jumping a little bit ahead of my story right now, because in the graphs I was only showing public tools. I haven’t found a good PR review tool that does what I think should be done. And partially, my thinking is, that’s because it goes against both what people emotionally prefer and, frankly, even the business models that the companies run.
    At PR review time, you want to run the largest models. That means, I don’t know, Codex or Claude Code is not gonna cut it. You need pro-level models if you really want to stem the tide of bugs going into production. And you need to spend a lot of time with the models taking turns, but you don’t want a big swarm of agents.
    So in fact, you end up in a different, almost dual world where you generate not that many tokens. You in fact generate few tokens, but it takes a long time, because these are expensive models taking turns rather than many, many agents trying to do many things in parallel. So that’s why I feel like I haven’t found good tools, and we’re using our own for PR review for now.
    [00:12:33] swyx: Yeah. Yeah. I mean, I think a lot of companies are building their own, especially tailored to their needs, right?
    [00:12:38] Mikhail Parakhin: Mm-hmm.
    [00:12:38] swyx: Um, you also have a chart here, going back to the slides, on PR merge growth, where you’re now at thirty percent month on month rather than ten percent. And the estimated complexity is going up too.
    You know, this is productivity, right? Because presumably there’s more stuff going into the code base and more features getting worked on. I’m curious about the backlog. I actually don’t mind a pro-level model taking an hour or two to review my PR, because I’ve dealt with humans who take a week to review my PR, right?
    And I keep pinging them on Slack, “Hey, hey, review my PR.” So, you know, I think there’s a trade-off here.
    [00:13:18] Mikhail Parakhin: Exactly. That’s exactly my point. On one hand, you can tolerate longer latencies at PR time. On the other hand, right now the real problem is not the time spent waiting for PR review.
    The real problem is that since there’s so much more code, the probability of at least some tests failing goes up, and then the build keeps failing, and you have to find the offending PR, evict it, retest without that PR, and so the deployment cycle becomes much longer. So in terms of overall time to deploy, it’s a total time savings if you spend more time on a larger model, thinking for an hour, because then you don’t have to spend all that time during testing and rolling back the deployment.
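The “find the offending PR and evict it” step can be sketched as a bisect over the batch of merged PRs, the way a merge queue might do it; `passes` is a hypothetical stand-in for running CI on a list of PRs applied on top of main, and this assumes main is green and exactly one PR in the batch breaks the build:

```python
# Sketch of isolating the offending PR in a failing batch by bisection.
# `passes(prs)` stands in for running CI with that list of PRs applied to main.
# Assumes main is green and a single culprit with monotone failures (any
# prefix containing the culprit fails).

def find_offending_pr(prs, passes):
    if passes(prs):                    # whole batch is green: nothing to evict
        return None
    lo, hi = 0, len(prs)               # invariant: prs[:lo] passes, prs[:hi] fails
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if passes(prs[:mid]):
            lo = mid
        else:
            hi = mid
    return prs[lo]                     # first PR whose inclusion makes CI fail
```

With five PRs where only one breaks the build, this takes O(log n) CI runs instead of retesting every eviction candidate one at a time, which is where the deployment-cycle time goes.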
    [00:14:03] swyx: Yeah, totally. That’s still worth it. You don’t look at the individual step; you look at the aggregate, and at the change in the aggregate system.
    [00:14:11] Mikhail Parakhin: Exactly.
    [00:14:11] swyx: I’m kind of curious whether this PR mentality, the whole CICD paradigm, will be changed eventually. Obviously a lot of people want a new GitHub, but I even wonder if Git itself is the problem, right?
    Like, is that the bottleneck? Is the concept of a PR a bottleneck? Do you guys use stacked diffs, like a merge-queue, stacked-diff type of thing?
    [00:14:34] Mikhail Parakhin: We use stacked PRs; we use Graphite. We’ve worked with Graphite a lot. I think the overall CICD, and the interaction with the code repository, is clearly the main issue and bottleneck for us right now, and highest top of mind.
    I would say we probably need a different metaphor, a whole different design of how to process this in the new agentic world. I haven’t seen anything dramatically better yet. I think everybody right now is just trying to keep their head above water, because there are so many PRs, and everybody’s CICD pipelines start creaking: the times are increasing, the number of bugs slipping by is increasing, and you have to clamp down.
    So we are a little bit in this situation where we need to first stabilize that story and then start thinking about what a completely different, new world could be. I know some people are working on it, but I haven’t seen anything super compelling yet. Clearly, though, the old things designed for humans will need to be morphed into something new.
    [00:15:53] swyx: One of the things I think about is that the merge conflict is basically a global mutex on the whole system, right? And in human organizations, we do have something like that: the company standup. But other than that, it’s actually fitting for us to be somewhat decentralized, somewhat plugged into one stream of information, but somewhat lossy.
    Like, it’s okay that not every delivery has atomic consistency. We’re not dealing with a database most of the time.
    [00:16:27] Mikhail Parakhin: This is a very good point, because since humans don’t write code too fast, that global mutex is not too bad. Once you-
    [00:16:36] swyx: Yes ...
    [00:16:37] Mikhail Parakhin: start writing code at the speed of the machine, it becomes the bottleneck.
    Then what do you do? Maybe, and I can’t believe I’m saying this, because I’m a lifelong opponent of microservices and I always thought they were a really bad idea; but now that you’re saying it, maybe in this new world microservices will make a comeback. Because then you can ship things independently, in tiny pieces, and managing all that complexity automatically will be much easier.
    I don’t know. We’ll have to see.
    [00:17:10] swyx: Yeah. I mean, I don’t know what the Microsoft or Shopify equivalent is, but I read this paper from Google where they have a monorepo that deploys into microservices, right? And then the other concept I think about a lot is the Chaos Monkey concept from Netflix.
    Being able to create this robust system where you have service discovery, you have independent microservices, and probably a fair amount of duplication. That’s how an organic system scales: you have that...
    I don’t know what you call it. Slack? Robustness? Redundancy? Those are not exactly the terms I’m looking for, but I can’t think of the words. Okay. I was gonna go into Tangent and Tangle. So we sort of discussed the overall stats that Shopify has.
    But I think some pretty cool stuff you guys are working on is your ML experimentation and your auto-research training pipeline. Presumably you’re much closer to this one, because it’s a personal hobby of yours. How would you explain them together?
    I thought we had a slide that has the system diagram.
    [00:18:24] Mikhail Parakhin: Yeah. Tangle first and then Tangent as a-
    [00:18:27] swyx: Yeah ...
    [00:18:28] Mikhail Parakhin: as a thing on top of Tangle. And Tangle, I claim, is the third generation of systems for running any data processing, with a bit of a skew toward ML experiments, but not necessarily: any sort of data processing task where you need to iterate, share, and you have enough scale that you want maximum efficiency.
    You know how you would normally work. Imagine you’re a data scientist or an ML practitioner: you get Jupyter notebooks, or maybe your Python scripts, and you manage the data, and you produce those TSV files and put them in some JFS or something.
    Then you notice that, oh, it has these weird missing values. You go and write another script that replaces them with dashes. And then, “Oh, I need to filter bots,” so you run some LightGBM model that removes the bots. And then you finally get things into shape, and you start experimenting, and you run multiple experiments, and then you’re like, “Oh my God, this experiment is worse.”
    You undo, and you cannot get back to the previous result. “Ah, what did I do?” Then you finally get everything working. Then you start throwing it over the fence to production. You replicate it, those things don’t work, and sometimes you don’t even notice that you forgot some feature naming and the features don’t match.
    But then imagine you did everything, and six months later you have to repeat it, because now there’s more data, or you want to do another pass, and you’re like, “What did I do?” Or, “This script crashes now,” or, “The path has changed.” And then you spend another month just doing digital archeology on your own history, right?
    Now multiply that by many, many teams. Now imagine you got an intern you want to ramp up. You have to show that intern, “Oh, look, here’s the folder, there are the scripts, ask your Claude agent to figure it out.” And then the Claude agent does something, and you’re like, “Ah, right, it was the wrong folder. I forgot to tell you, I actually have this other thing I forgot about myself.” And that’s the daily life we all know, if you’re a data scientist, a machine learning practitioner, or even any data-managing person.
    [00:21:00] swyx: Yeah. So I used to do this on the quant finance side, in my hedge fund.
    We did this before Airflow, and then obviously Airflow came along, and more recently Dagster, which in my mind is what I would use for that shape of problem, where you have to materialize assets and create a pipeline.
    [00:21:19] Mikhail Parakhin: And that’s a very good segue, because... So Airflow is great, but Airflow is more about: you have something and you want to repeatedly run it in production, on a schedule.
    It’s less about you as a team developing things and being able to share, grabbing the standard pipeline and saying, “Hey, I want to change this tiny little component in the huge sea of data processing, and I want to run ten experiments on this, and I want to do hyperparameter optimization.”
    All of that is very hard to do with Airflow. It’s very easy to do with Tangle. Tangle is more about a group of people, and it might be agents too nowadays, running experiments cheaply, collaborating, sharing results. You don’t need to understand everything fully. You clone somebody else’s experiment or somebody else’s pipeline, change a small piece, run it, get it to production state, and then ship in one click.
    So you don’t have to port it into any other system to run in production. You can just run the same experiment; it’s fully production-ready. And, as I said, it’s a third-generation system. The original one, I would claim, was Aether; at least in my career, Aether was the first that pioneered this type of approach.
    And then there was Nirvana at Yandex, which was kind of a second take on this. And now this one aggregates the learnings from all of those, and from Airflow as well, to get to a state where, when you try it, it feels kind of magical. Because now everything is based on content hashes.
    So even if a version changed, if the output didn’t change, nothing is rerun. It’s very efficient. If multiple people start experiments that need the same data preprocessing, it’s not repeated multiple times; it’s automatically done only once. If you start ten experiments that all require some data preparation as the first step, you don’t have to coordinate that.
    You don’t have to know that other people are starting it. There’s very easy composability, in any language you want to use, and it’s very visual. You can see everything immediately, you can edit it easily, you can assemble small things with just mouse clicks if you want, and share and clone.
    And it’s fully reproducible, in the sense that if you rerun it a second time, it will have exactly the same results. You will never have to do digital archeology. Full versioning and everything is there too.
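The content-hash idea (nothing reruns if the code and inputs haven’t changed, even across different users) can be sketched like this; the in-memory dict, names, and `execute` callback are illustrative assumptions, since a real system like the one described would use shared persistent storage and also hash outputs:

```python
# Sketch of content-hash caching as described: a step's cache key is a hash of
# its code plus its inputs' hashes, so an unchanged step is never recomputed,
# even when a different user runs it. The in-memory dict and names here are
# illustrative; a shared system would persist the cache centrally.
import hashlib
import json

CACHE = {}
RUNS = []   # records which steps actually executed, for illustration

def run_step(name, code, input_hashes, execute):
    key = hashlib.sha256(json.dumps([code, sorted(input_hashes)]).encode()).hexdigest()
    if key not in CACHE:               # compute only on a cache miss
        RUNS.append(name)
        CACHE[key] = execute()
    return key                          # the output is itself content-addressed

# Two "users" run identical preprocessing: only the first call executes.
h1 = run_step("prep", "clean(df)", [], lambda: "cleaned-data")
h2 = run_step("prep", "clean(df)", [], lambda: "cleaned-data")
```

Because a step’s key feeds into its downstream steps’ keys, a change upstream invalidates exactly the affected suffix of the pipeline and nothing else, which is where the deduplication savings come from.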
    [00:24:06] swyx: So people can check it out; it’s open source. Go to the GitHub repo, and there’s also a really good blog post about it. I think all of this is really appealing. The thing that sells me most is that development-to-production transition, right? I think a lot of people haven’t really solved that strictly.
    Like, we develop really, really well in Python notebooks, but that’s obviously not a production-ready process. Any way in which that is solved is very appealing. The other thing you mentioned, which raised my eyebrows, was content-based caching, which is very much an efficiency measure: recomputation happens only when the content address changes, which makes sense.
    It surprised me that the savings could be this much, but maybe I just haven’t worked at your scale, where there’s so much duplication that people rerun everything because they changed a single ID upstream.
    [00:25:10] Mikhail Parakhin: It does, yeah. But it’s not only that you rerun. The main savings come from the fact that you ran it, you got your job done, and you moved on.
    Then somebody else, in some department you didn’t know existed, runs the same task, but on a newer version.
    [00:25:27] swyx: Yeah.
    [00:25:27] Mikhail Parakhin: Right now, in most organizations, you can’t even find out about it, so you can’t even measure that you’re spending that time twice, right? Here, if everybody’s on Tangle, that’s detected automatically, and it’s detected that the output is the same.
    And then, for that person, it just looks like the experiment suddenly jumped forward, right? So that’s because there’s a network effect of multiple people helping each other.
    [00:25:51] swyx: Yeah. This is one of those things where it’s designed to be a platform from the beginning, rather than an individual developer’s tool, right?
    And everything streams down from there. That’s the Tangle orchestrator, and it manages jobs. We’ve seen a few versions of this, and this is obviously the unique approach you guys have figured out. And then there’s Tangent.
    [00:26:14] Mikhail Parakhin: Yeah. And Tangent is basically an automatic auto-research loop that can help you and kind of do your work for you.
    Effectively, Andrej Karpathy recently popularized this with auto research. Remember, he said he was speedrunning it... Yeah, you know the story. Here we’re basically bringing the same capability into Tangle, so Tangent can analyze things. It’s just an agent that can run multiple experiments, figure out what can be changed, and keep rerunning and modifying until it maximizes some goal, some loss function, whatever you need to achieve.
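The loop just described, an agent proposing changes, running the experiment, and keeping only improvements against a goal metric, can be sketched as a simple hill climb; the `propose` and `evaluate` callables here are hypothetical stand-ins for the LLM agent and the experiment runner, not Tangent’s actual interface:

```python
# Sketch of an auto-research loop in the spirit described: an agent proposes a
# modification, the experiment is run, and the change is kept only if the goal
# metric improves. `propose` and `evaluate` are hypothetical stand-ins for the
# LLM agent and the experiment runner.
import random

def auto_research(config, evaluate, propose, budget=400):
    best, best_score = config, evaluate(config)
    wins = 0
    for _ in range(budget):
        candidate = propose(best)      # agent suggests a change to the best config
        score = evaluate(candidate)    # run the experiment
        if score > best_score:         # hill-climb: keep only improvements
            best, best_score, wins = candidate, score, wins + 1
    return best, best_score, wins

# Toy goal: maximize -(x - 3)^2 via random perturbations.
random.seed(0)
best, score, wins = auto_research(
    {"x": 0.0},
    evaluate=lambda c: -(c["x"] - 3.0) ** 2,
    propose=lambda c: {"x": c["x"] + random.uniform(-1, 1)},
)
```

Note that most proposals lose and are discarded; only the handful of winners move the config forward, which is why the batting average can be low while the loop still pays off.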
    And in general, I would say that if you’re not using an auto-research-like approach in whatever you do, literally whatever you do, then you’re missing out. We saw it take off at Shopify like wildfire; anything where you can put measurements in can be done dramatically better. Our-
    [00:27:19] swyx: Mm-hmm ...
    [00:27:20] Mikhail Parakhin: our speed of HTML templatization, completely new UX templatization, reducing latency for Liquid themes.
    Our search: recently we moved, it’s hard to even quote it, from eight hundred QPS to forty-two hundred QPS with the same quality, purely from optimizations by an auto-research loop that kept running and changing code in our index server, on the same number of machines, just increasing the throughput.
    We managed to improve the quality of gisting in our machine learning process. You know, gisting is the prompt compression technique that allows for lower latency and actually slightly higher quality.
    We also had a reduction in storage, because the agents would go and find datasets that are clearly derivative, and then you don’t need to store things twice. We found, somewhat embarrassingly, that one of the largest tables was hashing random IDs into another random ID; it was literally just translating two random IDs hashed into each other, so we kept only one.
    [00:28:37] swyx: So it has access to the code as well, so it can check, like, what the hell is it doing?
    [00:28:42] Mikhail Parakhin: It can be run at two levels. At the superficial level, it can just use existing components and reshuffle them.
    You know, you can grab XGBoost, you can grab some PyTorch module, and then grab other tools and combine them. At a deeper level, since Tangle is all CLI-based underneath (every component is really a wrapped CLI call and a YAML file), it can analyze code and create new components, and keep iterating as well.
    So you can both make quick modifications of existing pipelines with components that are already there, pre-baked, or you can create new components and-
    [00:29:29] swyx: Yeah ...
    [00:29:29] Mikhail Parakhin: keep iterating on those. So auto research, again, is probably the thing I’ve been most excited about in the last two months, and we see it taking off like wildfire.
    Every day, every minute, I have somebody Slack-message me saying, “Oh, look how much better I made it.” And it’s all through auto research.
    [00:29:53] swyx: Is this democratized in some way, in the sense that, is it your ML engineers and researchers doing this, or do your regular PMs and software engineers also have the ability to use Tangent?
    [00:30:07] Mikhail Parakhin: This is an awesome question. Tangle in general, and Tangent in particular, are extremely democratizing. They are the main tools for-
    [00:30:15] swyx: ’Cause I don’t need the details.
    [00:30:16] Mikhail Parakhin: Yeah, exactly. Initially they were used by ML and AI engineers, but then, literally as you said, PMs: the highest user right now is one of the PMs in our org, Sartak. He was number one by usage, because they’re just energetic and knowledgeable, and now it unlocks a lot of capability where you don’t have to change code manually.
    [00:30:39] swyx: I mean, it kind of cuts the ML engineer out of the process, because the PMs have the domain knowledge and the ability to think from first principles about what results they want. And they even have access to the data that needs to go in.
    So in some ways this is the magic black box we’ve always wanted for training and, I guess, hill climbing.
    [00:31:04] Mikhail Parakhin: It’s basically Claude Code for your AI development situation, right? Now you don’t have to know exactly how the algorithms work. You can just bring your domain knowledge, expertise, and product knowledge, and iterate within Tangent until you get the results you need.
    [00:31:21] swyx: In my previous roles, every time someone pitched AutoML, I was always like, “This is not gonna work; it’s always gonna be a flop.” Somehow it’s working now. Presumably the answer is that now we have LLMs and they’re good enough, right? It’s an emergent property that we can do auto research, but it doesn’t feel that satisfying. How come we didn’t do this before, right?
    We just did parameter search, and... I don’t know. Maybe that’s it.
    [00:31:48] Mikhail Parakhin: Yeah. Bayesian optimization and hyperparameter optimization was the one facet of AutoML that was used very actively, and incidentally it’s also built into Tangle. But, you know, I know Patrice Simard very well, and he was such a proponent of AutoML; he literally spent his career trying to democratize it.
    Without LLMs, it just turned out to be very hard. You would have flexibility within a certain narrow domain, but it was hard to scale wider. Now, with LLMs, suddenly it’s like a magic wand, and suddenly everybody is an AutoML expert.
    [00:32:28] swyx: Yeah, I think it’s multiple things, right? I’m just gonna bring up the chart again.
    LLMs can do the monitoring very well, which is potentially unbounded and super unstructured. They can do the analysis very well. Basically, it’s much more intelligence poured into every single step. Maybe nothing has structurally changed about AutoML; it’s just more intelligent and handles more unstructured work.
    [00:32:53] Mikhail Parakhin: Exactly.
    [00:32:54] swyx: Any flaws that you’ve run into? Everyone is drinking the Kool-Aid: oh my God, time savings, performance improvements. What issues have come up?
    [00:33:06] Mikhail Parakhin: This is really cool, but it’s not a solution to all the world’s problems, for sure. The limitations are usually... and this is where we get into a bit of subjective territory.
    I can only share what I’ve seen so far, and I’m sure the situation is changing; maybe after I say it, many people will reach out and say, “Hey, what about this? You don’t know that,” and then they’ll probably be right. But what I’ve seen is that auto research is very good at doing kind of obvious things that you don’t have the bandwidth to do, or that you didn’t notice, or maybe standard practices you’re not aware of.
    It is not good at doing something completely out of distribution, something you have to think about for multiple days. So I set up an experiment once, on my sort of hobby thing, and I let it run; it ended up being a several-week run at full production scale, so slow runs. In the end it performed over four hundred experiments, and only one was successful.
    I’m like, “Okay, that’s good.” But-
    [00:34:18] swyx: But it saved time.
    [00:34:19] Mikhail Parakhin: Yeah, it saved time. If I were doing four hundred experiments myself, my batting average, as I said, would have been much higher, I’m sure. But first of all, it would have taken me like three years to do four hundred experiments.
    And I didn’t have to do them; the machines, the price of electricity, did that. And I got one improvement. Honestly, when I started that experiment, my thinking was to go and show that, “Hey, Andrej, maybe you just don’t know how to optimize,” and that I was super smart, because my problem had been optimized for many years; it was fully wrung out.
    And I didn’t expect auto research to find anything at all. Yet it did. So instead of making fun of Andrej, I ended up a big, big supporter. Yeah, that’s exactly the tweet. Yes.
    [00:35:10] swyx: You and Toby really go back and forth online a lot, which is really funny. Think of it as an eval for the optimality of the code it’s running on.
    It almost reminds me of a Kolmogorov complexity thing; there’s some optimal thing you’re trying to reduce down to, I guess. So you should congratulate yourself that you had ninety-nine percent optimality.
    [00:35:36] Mikhail Parakhin: Exactly, yeah. I think Andrej really deserves a lot of credit for popularizing this approach. This is incredibly powerful and cool, and even him just mentioning it led to a lot of gains in a lot of places in the industry, so we should be thankful.
    [00:35:56] swyx: Yeah. I think he also has a just...
    I don’t know what it is. It’s a simple, self-contained project that people can take and apply to other things, which is one thing, but also just the name. Somehow no one else managed to call their thing “auto research.” Naming things is very important. I think that’s mostly our coverage of Tangle and Tangent.
    Obviously there’s a lot of ML infra at Shopify that people can dive into. We’re about to go into SimGym, but before I do that, any other broader comments around this whole effort? Where is it leading?
    [00:36:36] Mikhail Parakhin: As a segue to SimGym: all those things start composing strongly.
    And you see a huge unlock when you look at each one of the tools; each is extremely useful. Tangle is useful by itself. Auto research is useful by itself. SimGym is useful by itself. Combine all three, and you create a synergistic effect. I think that’s why we wanted to cover them today: this is something that, even five years ago, would have been unthinkable.
    Replicating it would have been either incredibly costly or impossible; probably thousands of people would have been required.
    [00:37:20] swyx: Well, we have serverless intelligence now, right? So yes, you do have thousands of intelligences, just not human ones. And that’s close enough, right?
    Even if they’re not AGI, they’re close enough to do the tasks you need them to do, and that covers plenty of routine knowledge work. Okay, let’s get into SimGym. This is one of those things I was surprised to see; apparently it’s one of your most popular launches. And I think Sim AI, and Joon Sung Park, who did the Smallville thing: there’s a whole small cottage industry of people trying to do the simulated-customer thing.
    I think a lot of people maybe don’t super trust this yet, because they figure the agents would just do what you prompt them to do, right? But tell us about the inspiration, the origin story.
    [00:38:10] Mikhail Parakhin: That’s exactly the thing I wanted to cover, because if you don’t have the historical data, all you can do is prompt agents in a vacuum, and they will do exactly what you prompt them to do.
    In fact, when I first proposed it (and this is a bit of my brainchild initially, if I can boast), even Toby said, “But wouldn’t they just repeat what you tell them?” And I’m like, “Yes, except Shopify has decades of history of how people made changes and what that resulted in, in terms of sales.”
    So now what we can do is... It’s noisy data. Things are almost never in isolation; it’s almost never an A/B experiment. It’s really an A/A experiment, in the sense that at different times you ran two different things.
    But if you aggregate everything together and apply denoising and a collaborative-filtering-like approach, you can extract a very clear signal. And then you can optimize your agents. That’s why it took so long: almost a year of that optimization, of just us sitting and fiddling. And we had this internal goal of hitting a zero point seven correlation with add-to-cart events, for example.
    So that if we ran a real A/B test, the simulation should replicate the same sort of success that humans had, or lack thereof. It took forever, and I don’t think that’s easily replicable, because who else would have that data? You have to have those decades’ worth of history.
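The validation described, correlating simulated lift against measured A/B lift per experiment, comes down to a Pearson correlation; the lift numbers below are made up for illustration, with the stated internal target being roughly 0.7 against add-to-cart events:

```python
# Sketch of the validation metric described: Pearson correlation between the
# simulator's predicted lift and the measured A/B-test lift across experiments.
# The lift values are made up for illustration; the stated internal target was
# roughly 0.7 correlation with add-to-cart events.
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

simulated_lift = [0.02, -0.01, 0.05, 0.00, -0.03]   # hypothetical predictions
measured_lift  = [0.03, -0.02, 0.04, 0.01, -0.02]   # hypothetical A/B results
r = pearson(simulated_lift, measured_lift)
```

A correlation target like this is a sensible bar: it checks that the simulator ranks changes the way real customers do, without requiring it to predict the exact magnitude of every lift.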
    And the other thing you need is infrastructure and scale, right? Because, again, to get stat-sig results, we found you need to run a lot of simulations, a lot of agents, and those are expensive. You’re taking actions in the browser because you want real friction.
    You want to be able to get the image of what humans will see, because you want to detect effects like, “Hey, if I make my images larger, will I have more sales or fewer sales?” And usually people’s intuition here, by the way, is: if I enlarge my images, I’ll sell more, because they look nicer.
    Designers all love sparse looks and big images. Usually your sales tank, right? But from the HTML, all the characters look the same; only the size tag looks different. So it’s very hard. You have to take the visual information, you have to run this in a simulated browser environment on a big farm, and of course you have to have a very expensive, good multimodal model.
    That’s all why it’s taken so long. And to share a personal fail there, Sean: we always had this large-company bias. Whenever we do something, we’re like, “Hey, we’ll run an experiment, right?” We make a change, we run an experiment, and we see which one’s better, or “No, this is worse,” and most of them are worse, so you discard it and keep iterating, hill climbing.
    And we’re like, “Oh, smaller merchants cannot get stat-sig results. They cannot really run experiments, simply because in a week there wouldn’t be enough data for them.” So we thought from this perspective. What we didn’t realize is that most people don’t have an A and a B; they just have one thing, and they need suggestions of what A and B should be.
    So we first built this: hey, we run simulations on two separate themes and say, “Hey, which one is better?” Then we morphed it, and very recently released it: when you have just your site, your theme, we run over it and say, “Here are the predicted values of your conversions, and here’s how we think you should modify it to increase your conversions.”
    And then, circling back to what you started with: the proof is in the pudding. If we’re not correlating with reality, people will not use it. And thankfully, we see literally every day more users than the previous day. So right now my problem is how to pay for it all; our major thing now is how to optimize the LLMs, do distillation, and run the headless browsers cheaper, so that we can accommodate the increase in traffic.
    [00:42:47] swyx: Yeah. I understand you published a lot of technical detail at GTC, so I was going to bring it up. Was this in conjunction with some kind of GTC presentation?
    [00:42:59] Mikhail Parakhin: Well, yeah, we did it in several places, but we had the engineering blog as well. Yeah.
    [00:43:05] swyx: Yeah. So you're running GPT-OSS.
    [00:43:08] Mikhail Parakhin: That's an older version. Now we run a multimodal model. But yeah, we still run GPT-OSS as well.
    [00:43:15] swyx: And then you have the VMs, and you also have Browserbase. I really liked the line where you said, "It violates almost every assumption that standard LLM serving is designed for."
    And then you had basically orders-of-magnitude differences between everything.
    [00:43:29] Mikhail Parakhin: Exactly. Which was a bit of a challenge to implement. Since it violates all the assumptions, even simple things break; for example, multi-instance GPUs, MIGs, don't work as well.
    But we needed to get MIG to work, because otherwise it's way too expensive. So we had to deal with a lot of infrastructure, and we worked with Fireworks and CentML to help with optimizations, and with Browserbase, as you mentioned. It takes a village.
    [00:44:04] swyx: Okay. So there's been a lot of experimentation in the infrastructure so far, and you've published more or less what you have here. I'm less familiar with CentML; I don't do that much work in this part of the stack. But why was it the preferred inference platform?
    [00:44:22] Mikhail Parakhin: There were really three top companies, at least that I was aware of, doing LLM optimization: Together, Fireworks, and CentML, not necessarily in that order. CentML recently got acquired by NVIDIA. What they do is, if you have a model and you want to optimize it for a specific usage profile, they go and do it.
    And we worked with those companies; this work in particular was with CentML and NVIDIA, to get the best possible results out of it. And sometimes you have to retune: sometimes you want maximum throughput, sometimes minimal latency, sometimes the cheapest option, or some combination. So yeah, these are the people who come and help you with that.
    [00:45:14] swyx: I see. Yeah, I'm familiar with these people on the LLM, autoregressive side of the stack. But the other interesting category of these optimizers is the diffusion people: fal and, recently, Pruna has come up a lot as well. I think that's really underappreciated, at least by me, because I assumed all the workload would be LLMs, but actually there's a lot of diffusion as well.
    [00:45:38] Mikhail Parakhin: Exactly.
    [00:45:38] swyx: There's a lot here, so it's hard to cover it all. But I do think people underappreciate the importance of customer simulation. It's something I'm candidly still coming to terms with. Your team also prepared this really nice diagram.
    I assume this is AI-generated.
    [00:46:00] Mikhail Parakhin: Yeah, it looks-
    [00:46:01] swyx: Maybe it’s not.
    [00:46:01] Mikhail Parakhin: Yeah, it looks Gemini-ish. Honestly, I don't know where they generated it; it looks like Google. But the interesting part, Sean, that we haven't covered, and I wanted to mention, is that if your store has had previous customers, rather than being a new store from a merchant just launching, it helps tremendously with correlation and forecasting.
    We take your previous customers' behavior and create agents that replicate the specific distribution of customers you get, and then we apply those agents to your changes. That raises the raw correlation with add-to-cart events, or with conversion, or whatever the metric may be, quite dramatically.
    So replicating humans in general seems like an interesting, cool challenge.
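    The conditioning step he describes can be pictured as weighted resampling: draw simulated-agent personas in proportion to the store's observed customer mix, so the agent population matches the real one. Segment names and weights below are entirely hypothetical, not Shopify's taxonomy:

```python
import random

random.seed(0)

# Hypothetical segment mix observed in a store's historical orders.
historical_mix = {"bargain_hunter": 0.5, "brand_loyal": 0.3, "impulse": 0.2}

def sample_agent_personas(n, mix):
    """Draw n agent personas so the simulated population matches the
    store's observed customer distribution (weighted sampling)."""
    segments = list(mix)
    weights = [mix[s] for s in segments]
    return [random.choices(segments, weights=weights)[0] for _ in range(n)]

personas = sample_agent_personas(1000, historical_mix)
share = personas.count("bargain_hunter") / len(personas)
print(f"bargain_hunter share in simulated traffic: {share:.2f}")
```

    Each persona would then seed an agent's system prompt, which is the part that actually requires the historical data he keeps emphasizing.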
    [00:46:58] swyx: As a shareholder... if people are Shopify shareholders, they should really deeply understand this, because this is basically the moat. The more you use Shopify, the more it will just automatically improve, right?
    You're doing the job for them.
    [00:47:13] Mikhail Parakhin: Yeah, that's what we started with. Otherwise, if it were my startup, I wouldn't do it. Without the data, it's exactly as you said: whatever you say in the prompt is what the agents will do.
    [00:47:30] swyx: The statistician in me wants to really satisfy the statistical intuition, I guess. The word that comes to mind is ergodicity. So let's say one customer takes this path, another takes that path, another takes a third, right? In my mind, the way I explain it is: here's the ninety-fifth percentile, here's the fifth percentile, and here's the median.
    But to me, what SimGym is potentially doing is modeling the in-between journeys as well, the ones that depend on previous states. This may be a very RL-type conclusion: with naive A/B testing you only have statistics at a single point, and you judge based on overall summary statistics.
    But here you can actually model trajectories. Does that make sense?
    [00:48:31] Mikhail Parakhin: That makes total sense. Actually, it makes even more sense than maybe you realize, because...
    [00:48:38] swyx: Okay. Please.
    [00:48:38] Mikhail Parakhin: Yes, we do. So internally we have this system; we talked about it briefly once at NeurIPS.
    We have a huge HSTU-based system that models whole companies and their possible paths. And what you're showing: at any point in time, you can model the user's behavior, or you can also think about the whole merchant, as a company, as the entity that acts in the world.
    You can model that as well. And then you can do counterfactuals. In your graph, the blue one, imagine that somewhere in the middle you have an intervention: I give that person a coupon, or I send a personal thank-you card, or I give a discount somewhere.
    Then you can do forward rollouts from that counterfactual: what would have happened with the intervention versus without it? And you can even change where in time the intervention happens, where in the journey. We do this at Shopify scale for our merchants, and if we notice something they could fix, a strong counterfactual, then, within Shopify policy, they get a notification like, "Hey, we think something is wrong with your, I don't know, Canadian sales. It looks misconfigured. Here's what you need to do." Or, "We think you should set up this campaign with these parameters." And we do the same at the buyer level, to literally offer discounts or cashback or other things to buyers.
    I'm getting very excited. This is my area of interest, I guess, and hobby. Being able to model something as complex as human beings or companies, run counterfactuals on it, have interventions in the future, and optimize when to make an intervention and what kind of intervention to make.
    It's such an unlock that was previously completely impossible. It was always dreamed of, but how would you even simulate it without LLMs or HSTUs? I think very, very exciting times.
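    A toy version of the counterfactual-rollout idea: simulate many customer journeys with and without an intervention at a chosen step, then compare conversion rates. Every probability below is invented, and the real system rolls out LLM agents and HSTU models rather than a coin-flip walk, but the estimator has the same shape:

```python
import random

random.seed(42)

def rollout(journey_len=5, coupon_at=None, base_p=0.05, coupon_boost=0.03):
    """One simulated customer journey: at each step the customer either
    converts, drops off, or keeps browsing. A coupon intervention at step
    `coupon_at` raises the conversion probability from that step onward.
    (Toy stand-in for an agent/HSTU rollout; all numbers are made up.)"""
    p = base_p
    for step in range(journey_len):
        if coupon_at is not None and step >= coupon_at:
            p = base_p + coupon_boost
        if random.random() < p:
            return 1  # converted
        if random.random() < 0.2:
            return 0  # dropped off
    return 0          # journey ended without converting

def conversion_rate(n, **kw):
    return sum(rollout(**kw) for _ in range(n)) / n

baseline = conversion_rate(20_000)
with_coupon = conversion_rate(20_000, coupon_at=2)
print(f"estimated uplift from coupon at step 2: {with_coupon - baseline:+.3f}")
```

    Sweeping `coupon_at` over the journey is the "optimize when to make the intervention" part; only the rollouts change, not the estimator.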
    [00:50:59] swyx: I just wanted to illustrate this. I'm not the best illustrator, but I am a conceptual statistics guy.
    And you cannot just do this with an A/B test; this is a dimensionality an A/B test doesn't have, right? It doesn't capture the change over time, the stochastic nature, or all the context accumulated up to a given point. Okay, cool. That's SimGym.
    You're going to burn a lot of tokens on this thing, but you're one of the only platforms at scale in the world that can do this across a huge variety of workloads. I'm even curious at a human-research level: does retail behave differently from clothing sales?
    Does that behave differently from electronics sales? I don't know. The Kardashian shoppers, do they differ from people who buy, I don't know, cars and whatever?
    [00:51:55] Mikhail Parakhin: Well, very different: different sensitivities, different modes of shopping, different levels of what's important.
    Totally, you can do aggregations at the store level, or at a different category level. And for the statisticians among us: I couldn't believe it, but recently we were looking at this, and we had to bring back CRPs, the Chinese restaurant process.
    It's a way of aggregating, of naturally growing a clustering, specifically to answer the questions you were just posing about whether buyers behave differently across categories. And I'm like, "I haven't seen a CRP since 2001."
    [00:52:37] swyx: What? No, I haven't seen this.
    This is not in my training.
    [00:52:44] Mikhail Parakhin: But yeah, it was a very popular theory in NeurIPS ML circles in the early 2000s, kind of nice. And now it has practical applications that we're resurrecting.
    [00:53:03] swyx: Yeah, amazing. I can see how this is a fun job for you, where you get to apply all these things.
    Super cool. So anyone who knows what CRPs are and has always wanted to use them at work should definitely join Shopify. Okay, we have a lot left, but I'm being mindful of the time, and I do want to cover some other things.
    I'll give you a choice: UCP or Liquid?
    [00:53:30] Mikhail Parakhin: Liquid. On UCP, you know, UCP is very important for us, and we have structured discussions you can read about, and blog posts, and in fact we have a big release this week with our catalog.
    [00:53:46] swyx: Oh, okay.
    [00:53:46] Mikhail Parakhin: Yeah.
    [00:53:46] swyx: I mean, we can discuss the release briefly, because we'll publish this after it's already announced, so whatever. There's a catalog that you guys are doing?
    [00:53:55] Mikhail Parakhin: Yeah. We're bringing in the capabilities of the whole Shopify catalog.
    Basically, now you can search for products, do lookups by specific ID, and do bulk lookups when you need to bring in multiple products. You don't need to know in advance what you're trying to show or sell or check out; you can have that decided at runtime. This is a big area of investment for us, for both non-personalized and personalized search, trying to provide basically a window into the whole universe of products being sold everywhere in the world.
    Shopify is, not exactly, but almost a superset of anything being sold. Now we're bringing that into UCP. And identity linking is another big thing for us, so you can use Google or whatever identity you have, all to minimize friction.
    [00:54:56] swyx: Yeah. So
    [00:54:57] Mikhail Parakhin: yeah, big release for us.
    But Liquid AI, of course, we never talk about, and the problem might be more aligned with what we discussed previously in this chat.
    [00:55:07] swyx: Sure. The main thing everyone understands about Liquid is that it's inspired by the worm, C. elegans, and I still don't know why. I'm curious about your explanation; I think you can make things very approachable.
    And also, what is the potential level of efficiency you can get out of Liquid?
    [00:55:23] Mikhail Parakhin: We're all familiar with transformer architectures. For the longest time, there was a competing architecture called state space models, SSMs. Chris Ré is one of the pioneers, and lots of startups have tried to make them a reality.
    They have significant benefits, the main one being that they're much faster, with a lower footprint, and not quadratic in context length but linear. But state space models never quite made it. They have certain niches where they thrive, and their hybrid architectures are useful, but they never quite made it.
    Liquid neural networks you can think of as the next step, sort of state space models squared. It's a non-transformer architecture that's more complicated than state space and, if I'm being honest, really difficult to code. But it's very efficient: sub-quadratic in the length of your context.
    It's a very compact way to represent things, and that's the company Liquid AI; their goal is to productize it. Very often you need long context, a small model, and low latency. In general, it's basically on par with transformers, and if you do hybrids with transformers, it's even better.
    That's why we at Shopify, when we tried multiple models (and we constantly try multiple models from multiple companies), found that for small, particularly low-latency applications, and/or when you need longer context lengths, Liquid was the best. We still use the whole zoo, and we obviously test and use every open-source model and, it feels like, sometimes even every private model.
    But Liquid has been taking quite a bit of at least internal Shopify share. And the reason I'm excited is that it's the only non-transformer architecture I've found to be genuinely competitive. We use it for search and for long-context Pulse distilling, among other things.
    That's the overview. I don't know how approachable that was, Sean, sorry. Maybe still too obtuse.
    [00:57:51] swyx: I mean, I think they haven't been that open about their implementation details. If there's a lot of technical detail published, I haven't read a formal paper on the implementation.
    But I did get the relationship between the SSMs and the others. This is one of the charts showing the relationship between full attention and something more RNN-like in terms of efficiency. And the other chart was this old one comparing against some of the other models.
    It doesn't exactly have the correct Y-axis, but it's close enough that you can see it's basically a step-change difference in efficiency. The surprise to me was that you're already actively using it internally at Shopify. I'm curious what constraints you're optimizing for.
    When you say smaller, is it the 1B size? What kind of latency constraint are you optimizing for? What context-length considerations? For example, in audio use cases, SSMs effectively have unbounded context length because they just operate on a sliding window of the most recent input.
    I'm just curious what you see the potential being here.
    [00:59:13] Mikhail Parakhin: Yeah. Because the state embeds all the previous information needed, or that's the assumption, SSMs effectively have infinite context length. The problem with them is that the expressiveness is not there.
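    That "state embeds all previous information" property is the heart of any SSM. A scalar toy of the linear recurrence (nothing like Liquid AI's actual architecture) shows why the per-token memory cost is constant:

```python
# Minimal linear state space recurrence: h_t = A*h_{t-1} + B*x_t, y_t = C*h_t.
# The fixed-size state h is the only memory carried forward, which is why
# SSM-style models have effectively unbounded context at O(1) cost per token.
# Scalar toy with made-up coefficients, not any production model.

def ssm_scan(xs, A=0.9, B=1.0, C=0.5):
    h = 0.0                 # constant-size state, regardless of sequence length
    ys = []
    for x in xs:
        h = A * h + B * x   # state update: old information decays via A
        ys.append(C * h)    # readout
    return ys

ys = ssm_scan([1.0, 0.0, 0.0, 0.0])
print(ys)  # the first input's influence decays geometrically: 0.5, 0.45, 0.405, ...
```

    Unrolled over a whole sequence, the same recurrence is a convolution with kernel C * A^k * B, which is the "differential equation rolled out and computed as a convolution" view he mentions, and what lets these models train in parallel.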
    Liquids are effectively souped-up SSMs: much more expressive, and again more complicated to code. There is a paper on it; you can see it. A differential equation rolled out and really computed as a convolution. It's a bit involved. Where we use it is specifically where we need super low latency; there was a very fun project with CentML and Liquid AI themselves.
    We run a tiny model, like three hundred million parameters, at thirty milliseconds end-to-end for search. When you type a query, we produce all the possible things you could mean by that query: not only synonyms but a kind of full query understanding, the whole tree of what you might need, including personalization, because you might have made previous queries, and we push it all down into the search server. The latency requirements are obviously very strict.
    We're able to run it under thirty milliseconds because of Liquid; Qwen doesn't run at that speed. And even with Liquid, we had to work a lot with NVIDIA, because almost nothing in CUDA, or in the current stack, is designed for low latency. Small things that don't matter with large models start mattering a lot, and we had to optimize them.
    Then there's the other end of the spectrum, maximum throughput, for things like offline categorization: when a new product appears, we need to analyze it, assign where it sits in the taxonomy, extract and normalize attributes, and do clustering, like, "Oh, this is the same thing that other merchant is selling," right?
    That's an almost unbounded amount of energy to spend, because it's a quadratic kind of problem, and we have billions and billions of products. You don't care about latency as much; it's kind of an overnight batch job, but you want maximum throughput.
    And usually in those cases, like for Sidekick Pulse, you also need long context. We're talking models in maybe the seven-to-eight-billion-parameter range, where we take a large model, the largest we can find, and distill it into Liquid for a specific task, such as our catalog formulation or Pulse.
    Then we run it at very large scale in batch jobs, and in that situation it very often beats Qwen; Kimi is more on the reasoning side, so Qwen is probably the major alternative. That's when we use it. It's not a panacea; I wouldn't say it's a frontier model, in the sense that it's not going to suddenly compete with GPT 5.4.
    But it is a phenomenal target for distillation, which is becoming more and more important with the explosion of token usage.
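    The distillation recipe he describes, taking the largest teacher available and distilling into a small task-specific model, typically minimizes a soft-label objective like this. A framework-free sketch with invented logits; a real pipeline would compute this inside a training framework over batches of production traffic:

```python
import math

def softmax(logits, T=1.0):
    # Temperature-softened softmax; higher T exposes more of the
    # teacher's "dark knowledge" in the non-argmax classes.
    exps = [math.exp(l / T) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def distill_loss(teacher_logits, student_logits, T=2.0):
    """Soft-label distillation: KL(teacher || student) over temperature-
    softened distributions, scaled by T^2 as in standard knowledge
    distillation. Minimizing this trains the small model to mimic the
    big one's full output distribution, not just its top answer."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q)) * T * T

teacher = [3.0, 1.0, 0.2]   # illustrative logits from a large teacher
student = [2.5, 1.2, 0.4]   # illustrative logits from the small student
print(f"distillation loss: {distill_loss(teacher, student):.4f}")
```

    The economics he alludes to follow directly: you pay for the teacher once per training example, then serve the cheap student on every request.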
    [01:03:00] swyx: Is that a for-now thing, or if you gave Liquid a hundred billion dollars, would they... Is it just more scale, or what is limiting it?
    You know, what prevents it from running into the same issues that SSMs had?
    [01:03:14] Mikhail Parakhin: Their scale is already much larger than the largest SSM I'm aware of. So yeah, SSMs were just not expressive enough, in my opinion.
    Again, I'm sure I'll get a lot of pushback, and probably rightly so. But in my opinion, SSMs are not expressive enough, and liquid models are. Especially in their hybrid form, combined with a transformer in Mamba fashion, they're probably the best architecture I'm aware of, period. But of course, Liquid AI is not at the scale of Anthropic or Google or OpenAI in terms of compute.
    So I think if they had a similar level of compute, they would be very competitive and maybe even beat the largest models, at least from what I've seen. They don't have that level of investment, but they still have decent investment, and for this scenario of smaller models and distilling into them, they're very often second to none.
    We're very omnivorous and purely merit-based; the moment they stop being competitive, we'll switch to something else, and we constantly test. But so far, if I draw a graph of our workloads on Liquid versus our workloads on, say, Qwen, which is another awesome model and probably another kind of standard within Shopify, Liquid has definitely been taking share.
    [01:04:48] swyx: I think that's very promising, and probably the best explanation I've heard directly from someone involved with Liquid.
    I do have Maxime Labonne coming to my conference in London this week, so we'll hear more from him. There was this Liquid investor day or something a year or a year and a half ago, and I think there just wasn't that much technical detail speaking to my crowd of potential customers and users, right?
    Which is fine; maybe we still need to wait for more results to come out. But I think it would be news to a lot of people that you're already actively using it for high-frequency use cases. I also wanted to highlight Sidekick Pulse, which we didn't cover and probably don't have time to cover, but it's something you also launched recently.
    It's basically RecSys, but it's also part of the other RecSys trend I've been covering a lot: even from the YouTube side, and xAI's recommender, it's been LLM-based RecSys, right? I think you're effectively using liquid models for it, while they're just throwing transformers at the problem.
    And maybe this is the hybrid-architecture shift that will happen to accommodate the long context and high efficiency you need. I don't really have a strong opinion there, apart from highlighting to anyone that the work the LLM-based RecSys community is doing is also very interesting.
    [01:06:22] Mikhail Parakhin: Yeah. Again, the thing to get excited about is that it's not just LLMs looking at things; it's also the HSTU model doing that counterfactual analysis, where we model the whole enterprise as an entity, with its actions, and then see what will happen.
    [01:06:39] swyx: Overall, I think this all presents something enormous.
    There was not that deep an AI story to Shopify when it started; it was just a WordPress plugin, right? But now you're the storefront, the e-commerce guardian, for so many people, and you're really applying all the AI methods, the state-of-the-art stuff.
    So our conversation today has really opened my eyes to a lot. Thank you for doing this. This is a really amazing overview of what you're doing.
    [01:07:15] Mikhail Parakhin: Okay. Thank you for saying that, Shawn, and thank you for having me. It's always a pleasure to talk to people who are deeply technical and know what they're talking about.
    [01:07:25] swyx: Yeah. I mean, very few people are as technical as you, but at least I can vaguely follow along. So, okay, there's a hiring call: any particular roles you're looking for, where if someone knows how to solve a particular problem, they should reach out?
    [01:07:45] Mikhail Parakhin: Yeah. The things I would definitely call out: if you're an ML person or a data science person, we have a huge need for more people munching data, so to speak. Or, surprisingly, if you're a distributed-database person: we think there's a way to use LLMs to reimagine how we do distributed databases, and we're working a lot with Yugabyte there.
    So if you have an interest in those areas, Shopify might be the best place in the world for you. It's a pretty good place for other disciplines as well.
    [01:08:24] swyx: Cool. I think that was all the questions I had, but I have one bonus topic, if you want to indulge in some Bing history.
    What are your takeaways, or any fun anecdotes, about Sydney?
    [01:08:38] Mikhail Parakhin: Any fun anecdotes about Sydney? Well-
    [01:08:41] swyx: Yeah, it was very interesting. I think it woke people up to this personality that emerged.
    [01:08:48] Mikhail Parakhin: The funny thing, the most interesting anecdote, is that Sydney was first shipped in India, and it went unnoticed for a long time.
    And the first implementation of Sydney didn't even have an OpenAI model under it. It was Turing-Megatron, a Microsoft and NVIDIA collaboration model. And yes, exactly, that's the one people thought was a prank, because not many people were familiar with LLMs at that point. They thought, "That cannot be automatic; you must have people typing."
    And then they were complaining, "This chatbot is gaslighting me." What almost everybody doesn't fully realize is that it wasn't by accident that Sydney was Sydney. We spent a lot of effort on personality shaping.
    It was a bit of my Yandex legacy: previously we did Alice, the digital assistant, and we learned the importance of personality shaping. So here we did a lot of personality shaping. It was not a fully emergent scenario.
    It was also a little bit edgy. What we learned in those experiments is that you want to be polite but a little bit on edge, and that draws people in. Ever since those days, I haven't seen anybody trying exactly that mode. I think we'll see more of it at some point.
    Lots of good memories, you know. And by the way, the very first Sydney dev lead, Andrew McNamara, is working at Shopify; he's the head of Sidekick and Pulse, so lots of this is actually in his purview.
    [01:10:53] swyx: Oh, okay. That's another fun fact: you're assembling the team again. It's cool. I think a lot of people woke up to the idea of AI personality for the first time there. And now with, say, OpenClaw explicitly prompting a fun personality, I think that's a real selling point for people, right? I guess the only other time it really broke into public consciousness was Golden Gate Claude.
    But yeah, hopefully someday we'll get Shopify Sydney.
    [01:11:23] Mikhail Parakhin: Well, we have Sidekick. It's a little bit of a different thing. Yeah.
    [01:11:28] swyx: Yeah, Sidekick was your original big launch for AI stuff. Cool, amazing. Thank you so much. You guys do amazing work.
    Honestly, if I were a Shopify customer or a Shopify investor, hearing all the work you're doing on the technical side would make me feel more confident: okay, just choose Shopify, right? You're never going to do this in-house, which is obviously what you want. That's what an ideal platform is: doing all the things no individual could do at their scale, but you can at yours.
    Very exciting problems.
    [01:12:01] Mikhail Parakhin: Exactly. Exactly. Yeah. And creating network effect and hard to disagree. If you’re not using Shopify, you should.
    [01:12:09] swyx: Yeah, amazing. Okay, well, that’s it. Thank you so much.


    This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe
  • Latent Space: The AI Engineer Podcast

    🔬 Training Transformers to solve 95% failure rate of Cancer Trials — Ron Alfa & Daniel Bear, Noetik

    20/04/2026 | 1 h 25 min
    Today, we explain this piece of “clickbait” from our guest!
    TL;DR: 95% of cancer treatments fail to pass clinical trials, but it may be a matching problem — if we better understood what patients have which tumors which will respond to which treatments, success rates improve dramatically and millions of lives can be saved — with the treatments we ALREADY have.
    See our full episode dropping today:

    Why Big Pharma is licensing AI Models
    Tolstoy famously wrote, ‘All healthy cells are alike; each cancer cell is unhappy in its own way.’ Or something like that. Cancer might be the most misunderstood disease out there. It’s not one disease, it’s a family of diseases. Hundreds, maybe thousands, of unique diseases each with its own underlying biology. With this lens, saying you’ll “cure cancer” is like saying you’ll solve legos.
    We keep hearing AI will cure cancer, but sadly it may not be so easy. Today’s guests — Ron Alfa and Daniel Bear from Noetik — think they can use AI to break through a core bottleneck in the treatment development process.
    GSK recently signed a $50M deal for their technology that also includes an (undisclosed) long-term licensing deal for Noetik’s models like the recently announced TARIO-2, an autoregressive transformer trained on one of the largest sets of tumor spatial transcriptomics datasets in the world. Whole-plex spatial transcriptomics is the richest way to read a tumor, and approximately 0% of cancer patients going through standard care ever get one — and TARIO-2 can now predict an ~19,000-gene spatial map from the H&E assay every patient already has.
    Most big AI plays in BioTech have focused on discovery, and usually result in an in-house development effort (meaning tools companies usually become drug companies). This deal stands out in that it is a software licensing deal, and represents a commitment to a platform rather than a drug.
    With attention on other software tools for drug development (see the Boltz episode and Isomorphic for example), it is starting to look like the appetite of Pharma for biotech tools has finally started to grow. Why the sudden interest?
    Cancer is hard
    Biology is hard, cancer is harder. But despite this, we’ve made incredible progress. So many cancers that would have been death sentences twenty years ago are routinely survivable. It used to be our main strategy was just chemotherapy — poison you and hope the tumor dies before you do. Now, there are many treatments that actually kill a tumor and leave the rest of you intact! Immune checkpoint inhibitors like Keytruda and Opdivo target the defenses of dozens of tumor types. CAR-T therapy adds modified T-cells to your blood that can target B-cell malignancies very accurately. Antibody Drug Conjugates such as Trastuzumab combine a drug with an antibody, allowing it to target very specific (cancer) cells. We truly live in marvelous times.
    With that said, we still have a long way to go. For every type of cancer with a miracle treatment, we have many more that are still death sentences. The world spends $20-30 billion a year trying to cure cancers, with hundreds of clinical trials yearly. Yet progress is slow, with a 95% failure rate in clinical trials.
    The lab doesn’t translate to the clinic
    Are we leaving something on the table? Enter Noetik and Ron Alfa. Ron’s core thesis is that many of these “failed” treatments actually work! But we’re not looking at the right patients with the right tumors. If only we had a way to really understand the unique types of cancer biologies and which patients will respond to which treatments, we might be able to show a much higher success rate. Millions of lives (and billions of dollars) may ride on this.

    The Hard part: Blind Faith in Data Collection
    Ron and Noetik had the conviction to spend almost two years just collecting data. Lots, and lots, and lots, of data. Noetik has acquired thousands of actual human tumors, and collects a large multimodal dataset of hundreds of millions of images that allows them to create a detailed map of the cell makeup in the local environment. These are real human tumors, not frankenstein mouse models or immortal cell lines.
    This data is then fed into a massive self-supervised model, creating a “virtual cell”. This model has a deep understanding of cancer biology — Noetik has worked carefully to show it can distinguish different types of tumors. Maybe even tumors we didn’t identify as distinct previously! More recently they figured out how to scale up their model and data, and see no limit in their scaling laws!
    Noetik’s models can simulate how a patient will respond to experimental treatments. They are working with partners to test promising drugs that were demonstrated to be safe, but not effective. If these models work as hoped, Noetik will bring new cancer treatments to patients without developing a new drug! Their models will also guide the discovery process towards drugs that are more likely to make it through clinical trials. You can imagine why this is so attractive to GSK.
    We’ll see…
    Ron and Dan make pretty persuasive arguments that their models will truly assist in cohort selection in useful ways and this seems valuable. And we think it’s pretty clear that
    * Translation from lab to clinic is the biggest bottleneck for drug development.
    * Better cohort selection using biomarkers is likely to improve translation from lab to clinic.
    Noetik has already had some success here. We’ll see if they’re able to translate that into a reliable advantage.
    Stepping back a bit from the technology, curing cancer is a pretty unambiguously positive application of AI. It is also a very hard problem to solve. Our guess is that most people have been impacted by cancer or will be at some point soon. And we hope that learning about the amazing work that companies like Noetik are doing will inspire a generation of AI engineers to work on the hardest and most exciting problems that society faces.

    Full Video Pod:



  • Latent Space: The AI Engineer Podcast

    Notion’s Token Town: 5 Rebuilds, 100+ Tools, MCP vs CLIs and the Software Factory Future — Simon Last & Sarah Sachs of Notion

    15/04/2026 | 1 h 17 min
    For all those who missed out on London, see you in Miami next week!
    Notion, the knowledge work decacorn, has been building AI tooling since before ChatGPT, with many hits, from Q&A in 2023 to unified AI in 2024 to Meeting Notes in 2025. At the end of their last Make user conference, Ryan Nystrom teased Notion 3.0’s Custom Agents — and they are finally embracing the Agent Lab playbook!
    Sarah Sachs and Simon Last of Notion join us for a deep dive into how Notion built Custom Agents, why it took years and multiple rebuilds to get right, and what it means to turn a productivity tool into an agent-native system of record for enterprise work.
    We go inside the product, engineering, evals, pricing, and org design decisions behind one of the most ambitious AI product efforts in software today — from early failed tool-calling experiments in 2022 to agent harnesses, progressive tool disclosure, meeting notes as data capture, and the long-term vision for software factories and agentic work.
    We discuss:
    * Sarah and Simon’s path to launching Notion Custom Agents, and why the feature was rebuilt four or five times before it was ready for production
    * Why early agent attempts failed: no tool-calling standard, short context windows, unreliable models, and too much complexity exposed to the model
    * The “Agent Lab” thesis: not just wrapping a model, but understanding how people collaborate and building the right product system around frontier capabilities
    * How Notion thinks about roadmap timing: not swimming upstream against model limitations, but also building early enough that the product is ready when the models are
    * Why coding agents feel like the kernel of AGI, and how Notion is thinking about “software factories” made up of agents that spec, code, test, debug, review, and maintain codebases together
    * How Sarah runs AI engineering at Notion (“notes from Token Town”): objective-setting over idea ownership, low-ego teams comfortable deleting their own work, and a culture designed to swarm around fast-changing opportunities
    * The “Simon Vortex,” company hackathons, and why security gets pulled in early rather than late
    * How Notion organizes AI: core AI capabilities and infrastructure, product packaging teams, and a broader company mandate that every product surface must increasingly work for both humans and agents
    * Why prototypes have become much easier to build internally, and how “demos over memos” changes product development inside a tool the whole company already uses every day
    * Notion’s eval philosophy: regression tests, launch-quality evals, and “frontier/headroom” evals that intentionally only pass ~30% of the time so the company can see where model capabilities are going
    * What a “Model Behavior Engineer” is, and why Notion treats eval writing, failure analysis, and model understanding as a distinct function rather than just software engineering
    * The changing role of software engineers in the age of coding agents, and why the new job looks less like typing code and more like supervising a rigorous outer system of agents, PRs, and verification loops
    * How the “software factory” should work: specs, self-verification, bug flows, subagents, and minimizing human intervention while preserving the invariants that matter
    * A live walkthrough of a Notion Custom Agent handling coworking space tenant applications by triaging email, enriching applicants with web search, and writing structured data into a Notion database
    * How agents compose inside Notion: shared databases as primitives, agents invoking other agents, “manager agents” supervising dozens of specialized agents, and memory implemented simply as pages and databases
    * Notion’s take on MCP vs CLI: why Simon is bullish on CLI’s self-debugging nature, where MCP still makes sense, and how Sarah thinks about capability, determinism, permissioning, and pricing alignment
    * The evolution of Notion’s internal agent harness: from early JavaScript coding agents, to custom XML, to Markdown and SQL-like abstractions, to tool definitions, progressive disclosure, and a much shorter system prompt
    * Why Notion cares about teaching “the top of the class,” building for sophisticated operators rather than abstracting away too much capability for everyone
    * How agent setup works today: agents that can configure themselves, inspect their own failures, and edit their own instructions — with guardrails around permissions
    * How Notion prices Custom Agents: credits as an abstraction over tokens, model type, serving tier, web search, and future sandbox costs; why usage-based pricing was necessary; and how “auto” tries to match the right model to the right task
    * Why Notion is not eager to train a foundation model, where they do fine-tune and optimize today, and why retrieval/ranking is one of the most important investment areas as more searches come from agents rather than humans
    * Why Meeting Notes became one of Notion’s strongest growth loops: not just as transcription, but as high-signal data capture that powers search, custom agents, follow-up workflows, and the broader system of record for company collaboration
    * Why Notion is more interested in being the place where collaboration data lives than in building hardware themselves — and how wearables or other capture devices may eventually feed into that system
    Sarah Sachs
    * LinkedIn: https://www.linkedin.com/in/sarahmsachs
    * X: https://x.com/sarahmsachs
    Simon Last
    * LinkedIn: https://www.linkedin.com/in/simon-last-41404140
    * X: https://x.com/simonlast

    Full Video Episode
    Timestamps
    * 00:00:00 Introduction and launching Notion Custom Agents
    * 00:01:17 Why Notion rebuilt agents four or five times
    * 00:03:35 Building for where models are going, not just where they are
    * 00:05:32 The Agent Lab thesis, wrappers, and product intuition
    * 00:08:07 User journeys, leadership, and low-ego AI teams
    * 00:13:16 The Simon Vortex, hackathons, and bringing security in early
    * 00:16:39 Team structure, demos over memos, and building for agents
    * 00:20:25 Evals, Notion’s Last Exam, and the Model Behavior Engineer role
    * 00:27:37 Evals as an agent harness and the changing role of software engineers
    * 00:30:42 The software factory: specs, verification, and agent workflows
    * 00:32:18 Live demo: a custom agent for coworking space applications
    * 00:35:08 Composing agents, manager agents, and memory as pages
    * 00:38:15 Notion Mail, Gmail, native integrations, and tools
    * 00:39:43 MCP vs CLI and the cost of capability
    * 00:44:13 When Notion uses MCP vs building its own integrations
    * 00:47:43 The history of Notion’s agent harness rebuilds
    * 00:55:35 Power users, public tools, and the setup agent
    * 00:58:01 Self-fixing agents, permissions, and “flippy”
    * 01:01:13 Pricing, credits, and choosing the right model automatically
    * 01:09:01 Why Notion isn’t training its own frontier model
    * 01:14:07 Retrieval, ranking, and search built for agents
    * 01:17:27 Meeting Notes as data capture and workflow automation
    * 01:21:18 Wearables, hardware, and Notion as the system of record
    * 01:23:45 Outro
    Transcript
    [00:00:00] Alessio: Hey everyone. Welcome to the Latent Space podcast. This is Alessio, founder of Kernel Labs, and I’m joined by swyx, editor of Latent Space.
    [00:00:11] swyx: Hello. Hello. We’re back in the beautiful studio that, uh, Alessio has set up for us with Simon and Sarah from Notion. Welcome.
    [00:00:18] Sarah Sachs: Thanks for having us.
    [00:00:19] Alessio: Thanks for having us. Yeah.
    [00:00:20] swyx: Congrats on the recent launch of the custom agents, finally it’s here. How’s it feel?
    [00:00:26] Sarah Sachs: We ship things slowly. So it had been in alpha for a little bit, and at the point at which it’s an alpha, um, there’s a group of people that are making sure it’s ready for prod, and then there’s a group of people working on the next thing.
    So sometimes some of these launches are a bit of delayed satisfaction, so it’s quite nice to remind yourself of all the work you did, because we do have a habit of, like, being two or three milestones ahead. Uh, just ‘cause you have to be, you know, you can’t get complacent. Um, but it’s been great that people understood how this is helpful.
    And I think that’s just easier in general building AI tools today than it was two, three years ago. People kind of get it and so that user education, um, there’s just, it was our most successful launch in terms of free trials and converting people and things like that. It was really successful, so yeah.
    But there’s a lot to build.
    [00:01:12] swyx: Making it free for three months helps.
    [00:01:16] Sarah Sachs: Yep.
    [00:01:17] Simon Last: It was definitely super exciting for me because it’s probably the fourth or fifth time that we rebuilt that.
    [00:01:22] swyx: Yes.
    [00:01:23] Simon Last: And I mean,
    [00:01:24] swyx: you’ve been building this since, like, 2022.
    [00:01:26] Simon Last: Yeah, I mean, like, it was even right when we got access to, like, GPT-4 in late 2022. One of the first ideas we had is like, oh, okay, let’s make an agent, I, we used the word assistant at the time, there wasn’t really the word agent yet, but, oh, we’ll give it access to all the tools that Notion can do, and then it’ll run in the background and, like, do work for us.
    And then we just tried that many times, and it just was too early. Um,
    [00:01:48] swyx: I need to force you to like double click on that. What is too early? What didn’t work?
    [00:01:52] Sarah Sachs: We were fine-tuning, like, before function calling came out. We were trying to fine-tune, with the frontier labs and with Fireworks, like, a function-calling model on Notion functions.
    This is right when I joined. I joined because, um, we needed a manager, as Simon needed to be able to go on vacation. So, uh, that’s, that’s around when I joined, so you can speak much more to it.
    [00:02:11] Simon Last: Yeah, we did partnerships with both Anthropic and OpenAI at different times, uh, to try to, at the time, I mean, when we first tried, there wasn’t even a concept of, like, tools yet.
    We, we sort of designed our own, like, tool-calling framework, and then we tried to fine-tune the models to, uh, to use it over multiple turns. Um, and because it, it didn’t work well out of the box, I think. Yeah. The models were just too dumb and the context window was also way too short.
    [00:02:37] Alessio: Yeah.
    [00:02:37] Simon Last: Um, and yeah, we just kind of banged our head against it for a long time.
    Uh, unfortunately it was always like, there were always, like, sort of glimmers that it was working, but um, it never felt quite robust enough to be like a useful, delightful thing. Um, until, I would say, uh, the big unlock was probably, like, Sonnet 3.6 or 3.7, uh, early last year. And that’s when we started working on our agent, which we shipped last year.
    Um, and then, and then, uh, uh, custom agents, kind of a similar capability, and that, that one just took longer because we, we just wanted to get the reliability up a lot higher, ‘cause it’s actually running in the background.
    [00:03:14] Sarah Sachs: And the product interface of like permissions and understanding, you know, this custom agent is shared in a Slack channel with X group of people and has access to documents that are surfaced to Y group of people.
    And the intersection of X and Y might not be whole. And so how do you build the product around making sure administrators understand that permissioning took multiple swings.
    [00:03:35] Alessio: Everything is hard at the end of the day. Yeah. I’m curious, like when the models are not working, how do you inform the product roadmap of like, okay, we should probably build expecting the models to be better at some reasonable pace, but at the same time we need to, you know, you had a lot of customers in 2022.
    It’s not like you were a new company or like no user base.
    [00:03:54] Simon Last: Yeah, I mean I think there’s always the balance of, you know, like you want to be AGI-pilled and thinking ahead and building for where things are going. Uh, but also you wanna be, like, shipping useful things. And so we always try to, like, keep a balance there.
    You know, we, we try to take, like, a portfolio approach. You know, we’re always working on multiple projects and, and we’re always trying to work on, you know, maintaining things that have already shipped, like, like shipping new things that are, like, eminently working well and making them really good.
    And, and then we wanna always have a few projects that are a little bit crazy. Um,
    [00:04:23] Alessio: And what are the AGI-pilled projects that you have today? I’m curious about, uh, you don’t have to share exactly what you’re working on, but I’m curious what are things today that maybe in 18 months people will be like, oh, obviously this was gonna work
    [00:04:35] Sarah Sachs: 18 months.
    [00:04:37] Alessio: Yeah, 18 months is, you know,
    [00:04:37] Sarah Sachs: it’s a long time and Yeah. Yeah.
    [00:04:39] Simon Last: I mean, there’s a number of things happening. I think one thing that’s becoming more clear is, I think, like, uh, coding agents are the kernel of AGI, sort of, everything is a coding agent. Mm-hmm. I think that’s, that’s sort of one, one direction.
    Um, and then, yeah, the exciting thing about that is sort of your agent can sort of bootstrap its own software and capabilities and actually debug and maintain them. And so yeah, we’re, we’re, we’re thinking a lot about that. And then, yeah, like, like another category of things that I’m, I’m really excited about is, like, uh, what we call the software factory. Also,
    people are using this, uh, this, this sort of word. Um, basically it just means, can you create, sort of, like, an as-automated-as-possible workflow for developing, debugging, mm-hmm, merging, reviewing, and maintaining a code base and a service, where there’s a bunch of agents working together inside, and like, like, how does that work?
    [00:05:28] Sarah Sachs: If you think back to your initial question, like, why did this take so long? I think something,
    [00:05:32] swyx: I didn’t say that, but Yes. Okay. Go ahead.
    [00:05:34] Sarah Sachs: Why, what, what changed over the three and a half years of trying
    [00:05:37] swyx: it? Exactly. Right. Because most people always say like, it didn’t work yet. Then reasoning models came, then it worked.
    I was like, okay, let’s go a little
    [00:05:43] Sarah Sachs: bit. That’s, I mean, that’s part of it, but I think the other part of it, that I actually think is really what will set Notion apart for every new capability, is we have, like, two skills that are crucial when it comes to frontier capabilities. One is not letting yourself swim upstream.
    So like quickly realizing if you’re just pressing against model capabilities versus not exposing the model to the right information, not having the right infrastructure set up. That in and of itself is the skill of intuition. And the second is to see, okay, you’re not swimming upstream: which direction is the river flowing, and how do we think ahead about the product and start building it, even if it’s not great yet, so that when it is there, we’re ready for it.
    Right? And like those can sometimes feel like counterintuitive things. Like we can be trying to fine tune a tool calling model when they don’t exist yet. And that the trick is to not do that for too long, but realize that there was something there. And we’ve had a lot of things which like, um, we’re just like not swimming in the right direction with the streams.
    I think we had multiple versions of transcription before we got meeting notes, right?
    [00:06:39] swyx: Oh, I gotta talk about that. Yeah.
    [00:06:40] Sarah Sachs: Yeah. Um, and so. I, I, I think that like we, we really closely partner with the Frontier Labs on capabilities and we also have to have strong conviction on, as those capabilities move.
    Notion is about being the best place for you to collaborate and do your work. And how does that narrative change if the way that we work changes?
    Yeah.
    [00:06:58] swyx: Yeah. You told me you were a fan of the Agent Lab thesis, and this is, this is kind of it, right?
    [00:07:02] Sarah Sachs: Right. I show that thesis to so many candidates. Like, I have it as, like, my Chrome autofill.
    Um, at this point, like, it’s one of my most-visited links.
    [00:07:10] swyx: because like, is this the, here’s why you should work at Notion and not Open, OpenAI. I, it’s like,
    [00:07:14] Sarah Sachs: here’s, here’s what’s different about it.
    [00:07:16] swyx: Yeah.
    [00:07:16] Sarah Sachs: And here’s why it’s not just a wrapper. I actually think more and more people understand it’s not just a wrapper.
    [00:07:21] swyx: Yeah.
    [00:07:22] Sarah Sachs: Um, and by the way, like in the beginning, parts of what we build are wrappers on functionality. That works well, of course, but that’s not really the most, um. I would say that’s not the product that, that drives revenue. And that’s not necessarily always what users need.
    [00:07:35] swyx: I mean, you know, notion is the AWS wrapper, but like the, the wrapper is very beautiful and like very, very well polished.
    So
    [00:07:40] Sarah Sachs: like the analogy,
    [00:07:41] swyx: like
    [00:07:42] Sarah Sachs: the analogy that I’ve been coming back to is Datadog and AWS
    [00:07:45] swyx: Yeah.
    [00:07:46] Sarah Sachs: So, uh, Datadog could not exist with, without cloud storage. Right. That it’s kind of fundamental that that works. Um, and AWS has like a CloudWatch product, but Datadog is an expert on understanding how people want observability on the products they launch.
    And we’re experts in understanding how people wanna collaborate, and that’s really where our expertise lies.
    [00:08:04] swyx: Totally.
    [00:08:04] Sarah Sachs: Um, regardless of the tools that we use,
    [00:08:07] Alessio: I’m kind of curious how you think about implicit versus explicit expertise. I feel like Datadog is half and half implicit and explicit. It’s like they understand across markets and industries what engineering teams usually look for.
    With Notion, it’s almost like more of the expertise is at the edge, because you as a platform, you’re like so horizontal that the end user is not really the same. Mm-hmm. Like with Datadog, the end user is always like, yeah, an engineering lead, a kind of SRE-related person. With Notion, it can be anything.
    So I’m curious how you put that expertise into a product versus, you know, obviously AWS cannot build Notion. It’s, that doesn’t quite work in this case, but
    [00:08:44] Simon Last: It’s, it’s a little bit differently shaped. I think, you know, a classic vertical SaaS, like Datadog, is kind of like that. They understand their individual customer very deeply.
    It’s kind of a narrow slice. Um, Notion has always been super horizontal. And our, our task has always been to sort of balance these two somewhat opposing forces of, like, we’re listening to our customers and what they want us to build. It’s a broad slice. And then also we’re thinking about, like, okay, how do we decompose what they want into, uh, nice primitives that are, that are really nice to use and will, will get us, like, as much bang for the buck as possible.
    And then, you know. Maintain the whole system, make it all like, like super clean and nice to use.
    [00:09:22] Sarah Sachs: We still have user journeys. I mean, we still focus on, like, core. I actually think the failure of our team is when we focus too much on, what are tools that are
    [00:09:31] Simon Last: mm-hmm.
    [00:09:31] Sarah Sachs: Cool tools. I actually think that’s when we make have the least velocity because you still need some sort of focus on a user journey.
    So like, for instance, we’ll all sit down every Friday and look at the P99 of, like, the most token-exhaustive custom agent transcripts and just look at why it didn’t do well and cut a bunch of tasks. Like, we still focus on, like, this should work. Email triaging should work. Mm-hmm. Right. And similarly, like when we were chatting before we started filming about, okay, how can I do PDF export?
    Well, that’s functionality that then merits, maybe we should build a tool that has access to a computer sandbox and a file system and the ability to write code. Right? Right. Um, but it’s because we’re thinking about the fact that our users, to do their, to do their daily work, need to export PDFs, not because we’re like, hmm, I think a computer tool could be cool.
    Like, let’s just see what happens. Mm-hmm. Like we, we have to focus on some user journeys, otherwise we just don’t have, like, enough strategy to, to prioritize.
    [00:10:29] swyx: I think there’s a lot of, like, really strong opinions that you’ve had. Do you have, like, sort of like a Tao of Sarah Sachs? Like, you know, like what, how do you run your team?
    Like, I feel like you’ve just accumulated all these strong opinions. Obviously part, part of this is your, your Token Town thing.
    [00:10:43] Sarah Sachs: I think the Tao of working with Sarah Sachs is, um, you’d have to, it depends who you ask. Um, I think it depends if you’re on my team or a partner. Right. Or a vendor.
    [00:10:54] swyx: Yeah. There are other people who want to run their teams the way that you’re, yeah,
    you’re, like, bringing these things. And then also similarly, uh, Simon, when you did the custom agents demo, you had like, well, we’ve been using custom agents and here’s the super long list of everything that we do. No humans ever read it. Right? That’s what you said. I was like,
    [00:11:07] Sarah Sachs: yeah. So I think for, for me, um, something that I learned very quickly and became very comfortable with was that my job was not to be the ideas person or the technical expert.
    My job was to make it so that everybody understood the objective, had a resource to help prioritize what they should work on, and had an avenue to prioritize what they thought was important. And I think that’s true with all, all leadership, but I think especially on the AI team. Almost all of our best ideas come from prototypes, from people that have a cool idea because they saw a user problem, and it’s a huge disservice if all of those ideas have to pass, like the sniff test of what me and a product partner or Simon and Ivan decided were the direction, right?
    Because a lot of what we’re doing is leaning into capabilities, so. I think that’s the first thing is like, I don’t really view like the role of engineering leadership as like, uh, hierarchical, nor has it ever been, but especially now, like very willing to change direction based on, um, like proof is in the pudding.
    Yeah. And like, and I think we have rebuilt our harness three or four times. And when you do that, then the second rule of engineering leadership is, like, you need to build a team that’s comfortable deleting their own code and is very low-ego and is driven by what’s best for the company. And, um, doesn’t write design docs because they think it’s their promotion packet.
    Right. And that’s a culture that Notion had long before I joined, but like our willingness to just swarm on different problems and, um, redo things that we’ve built before because something has changed. Like, there’s a lot of friction that can happen at companies when you do that. And it doesn’t happen at Notion.
    And because it doesn’t happen, when new people join, like, they don’t wanna be the ones that are saying, we shouldn’t do this, I wrote that code. So then it’s, you know, you, you create a culture that everyone adopts, and that culture comes directly, I think, from Simon and Ivan though, um, because they’re very open-minded.
    [00:12:50] swyx: Anything that you,
    [00:12:50] Simon Last: you’d add? I’m not a manager, like, like, like Sarah is. Um, a lot of my role is really to try to think a little bit ahead, make sure that we’re, we’re building on the right capabilities, and then, like, the prototyping stuff. And yeah, it’s really, really critical to always just be starting again.
    It’s like, okay, this is a new thing. What does this mean? What if we just rethought everything or rewrote everything? And so I, I’m, I’m basically just doing that in a loop every six months.
    [00:13:16] swyx: Yeah. Do you believe in internal hackathons for this stuff?
    [00:13:19] Sarah Sachs: I think there’s, like, two different versions. So one is, we just have a, a, a solid bench of senior engineers that come and go on what we call the Simon Vortex and productionizing what we built, right?
    Because when you’re in the Simon Vortex, the velocity is super high. The direction changes daily, and it’s meant to be, like, the equivalent of a Skunk Works lab. We don’t need to do hackathons for that. We need to have senior engineers that we trust to come in and out of those projects. For instance, like, management boundaries are really loose.
    Like, you report to him, but you work for her right now. Yeah. That’s something that, when we hire managers, it’s important they don’t care about, because we tend to form more structure. Yeah. Don’t be too
    [00:13:54] swyx: territorial.
[00:13:55] Sarah Sachs: We form structure after we ship things, not before, just historically. Um, the second thing is we do have companywide hackathons.
Actually, we just had our demo day this morning for the hackathon we had last week. That’s more for people that aren’t directly working on the project, feeling like they have the time to pause and learn how to make themselves more productive, or how they would use Notion custom agents to build something.
And part of the hackathon was actually encouraging everyone across the company to build their own agentic tool-calling loop from scratch, following, like, an Every blog post on how to do it, because we want
[00:14:26] swyx: just with the compound engineering one. Yeah.
[00:14:28] Sarah Sachs: We want everyone to use Claude Code, or whatever coding agent they please, and understand that fundamental.
So we set aside a day and a half, and all of us in leadership encouraged everyone on their teams across the company to do it. So we have hackathons like that. I would say, kind of facetiously, everything we build is a little bit like a hackathon until it graduates, puts on big-boy pants, gets a product ops rollout leader, and has assigned data scientists and stuff like that,
    [00:14:54] swyx: security review enterprise stuff,
[00:14:56] Sarah Sachs: actually, security review is one of the things that we bring in first, because otherwise it just slows us down way more and causes a lot of tension, and they build a better product if they’re involved early.
So that is probably the first function to get involved in something. That’s the
    [00:15:09] swyx: right PR approved answer.
[00:15:10] Sarah Sachs: No, but it’s not just the PR-approved answer. It’s
[00:15:13] swyx: actually real. It’s actually real. I’m just saying, scar
    [00:15:15] Sarah Sachs: tissue.
    [00:15:15] swyx: Yeah,
[00:15:16] Sarah Sachs: because, you know, my background’s also... I worked at Robinhood for a number of years.
Yes. So compliance and things like that are a little bit more... you learn the hard way when it doesn’t come naturally.
[00:15:26] Simon Last: Yeah. I think the hackathon is really important for uplifting the general population, but if that’s the only way you can build new things, you’re kind of toast. I mean, it has to be the daily process, building these new things.
And I think in the AI era a lot more leverage accrues to the most curious and excited people. So we’re all about just activating that energy. You know, if someone’s prototyping something on the weekend that they’re excited about and it’s important, that should be the main thing that we’re doing.
Yeah. It’s not a hackathon that we schedule once a quarter, it’s just the daily process. Part of the culture.
[00:16:02] Sarah Sachs: I mean, that’s how we shipped image generation in Notion. It was always this thing that would be kind of nice to have, but it wasn’t really clear where it was aligned in product priorities.
It’d be a lot of work. And we had someone on the database collections team, Jimmy, who was like, I really wanna do image generation for cover photos and inside Notion. And we’re like, if you wanna build it, do it, please. We encourage you. We gave him all the resources: working directly with Gemini, being able to track the token usage, working through the endpoints.
We gave them eval support, everything, and it became a full project.
[00:16:34] Alessio: Yeah.
[00:16:35] Sarah Sachs: That’s why you can’t have ego as a leader. That’s how we work.
[00:16:39] Alessio: What’s the size of the team today, both engineering and overall?
[00:16:43] Sarah Sachs: I manage the team that we’ll call core AI capabilities and infrastructure.
That’s about 50 people. But then we have partner teams that do packaging, so how it shows up in the corner chat versus custom agents versus meeting notes; that’s another 30, 40 people. And then every team that has a product surface at Notion that a user can interface with owns the tool that the agent interfaces with.
The editor team, the team that did CRDT for offline mode, is the same team that handles how two agents edit competing blocks. Mm-hmm. Right? It’s the same problem. The team that built the underlying SQL engine is the same team that owns how the agent asks it to run a SQL query, and does it performantly. And so, in that regard, anyone working on product engineering is tasked with making things work for customers that are humans and agents, because over time the majority of our traffic will be coming from agents using our interface, not humans.
And so our objective is to make it so that the whole product org is building for agents.
[00:17:40] Alessio: Yeah. How has it changed internally? The activation bar has been lowered a lot. Anybody can create a prototype somewhat easily, especially in an existing code base. Have you raised the bar on what type of prototype people need to bring forward to be taken...
Not, like, seriously, but, you know what I
[00:17:58] Simon Last: mean? Yeah. I think the bar is lowered in many ways. Like, one thing our team built that is really cool: our design team made a whole separate GitHub repo called the Design Playground. And it’s basically just a bunch of helper components for quickly throwing together UIs.
And it’s become actually quite sophisticated. It has an agent in there, and that’s pretty fun. So they pretty much don’t do mocks; they just make full prototypes.
    [00:18:27] swyx: Here it is. It works.
[00:18:28] Simon Last: They give you a URL. They’re like, okay, all right. So we have to make the real production version of that.
And then for engineers, a prototype looks like a feature flag that actually works. That’s sort of the bar.
[00:18:39] Sarah Sachs: Something to understand that’s really unique about Notion, one of the reasons I joined, is we’re super lucky: no one uses Notion in their job as much as people that work at Notion.
    [00:18:46] Simon Last: Of course.
[00:18:47] Sarah Sachs: So I think there’s very few companies like that, maybe if you worked on Chrome, I guess. But everything that we ship, we ship internally first, and we get a lot of really quick feedback. And sometimes our dev instance is totally borked and you have to change a bunch of flags to get things done. But everyone, people that do IT ticketing, people that do supply chain procurement, recruiting, everyone is using the same instance of Notion with a lot of flags on for these prototypes people build.
Um, and so Brian Levin, one of the designers on our team, evangelized this concept of demos over memos.
    [00:19:18] swyx: Ooh, too
[00:19:20] Sarah Sachs: good. Um, which has been very good for building demos, and I think it’s put a big pressure point on us to have really strong product conviction. Because if anything can be demoed, you really need a strong filter to make sure that if you’re doing X amount of work, you’re focusing on one tower, not just building a really flat hill.
Right. That’s actually where I think there has to be more conviction from our PMs and our designers, and really the whole company, about what journey we’re going on.
[00:19:52] Simon Last: But overall, I feel like it works pretty well. Almost all the engineers have good enough taste to realize whether a prototype actually makes sense in the product or not.
So it’s not that common that I’d see a prototype and go, oh, this makes no sense. Mm-hmm. People are doing reasonable things, and then it’s just a matter of which things we build first, and then often just figuring out how to turn it on and off. In our experimental chat UI, there are probably like a hundred checkboxes in there,
    [00:20:22] Sarah Sachs: Kills me
    [00:20:23] Simon Last: the things you could turn on and off.
[00:20:25] Sarah Sachs: Uh, but I think, okay, so that is kind of true, Simon. But being the person that manages the evals team, there is a level of intensity that it adds to the platform team. So, you know, if we’re gonna do image generation in Notion, all of a sudden the way that we do attachments, and the way that our LLM completion layer, Cortex, talks and expects tokens back, and now it’s getting images back...
There’s a lot of platform work that we do need to solidify a little bit. So sometimes it’ll be in dev for a couple weeks before it makes it to prod, just because we still have to make it robust, make it HIPAA compliant, ZDR compliant, figure out the right contracting with the vendor, whatever it is.
And we need to eval it, because we want the team to still maintain what they build. That’s the one thing: if we have a bunch of prototypes, it can’t just be a small group of people that then maintain whatever prototypes ship. So we have invested a lot of people in eval and model behavior understanding teams. We call it agent dev velocity.
So your dev velocity building agents can be faster if we invest in that platform. And so we have a whole org dedicated to agent platform velocity, so that you can build your own eval and then maintain it once you ship it. So if a new model release comes out and we, every
    [00:21:38] swyx: team maintains their own eval,
    [00:21:40] Sarah Sachs: we maintain the eval framework.
Every team owns their own evals, and a lot of them we’ve integrated, opt-in, to CI, or we run them nightly, and we have a custom agent that triggers a team to look at the major failures. That’s really critical, because we have all these different surfaces now. A lot of it’s on the same agent harness, so it’s easier to maintain.
It’s just packaging of different agent harnesses, but new functionality of the agent. Let’s say that we wanna auto-update because they deprecated Sonnet 4 or whatever it is. Are
[00:22:11] swyx: they already? Okay, yeah. Actually, that wasn’t that long ago.
[00:22:14] Alessio: They
were
[00:22:14] Alessio: just 3.5.
[00:22:15] Sarah Sachs: 3.5 and 3.7 just got deprecated.
[00:22:18] swyx: 3.7, 5.2 or... yeah. No,
[00:22:20] Sarah Sachs: it’s not. 5.2 is... no. Yeah, 5.4 is 40% more expensive than 5.2. So if they deprecated 5.2, you would hear from me about that one. But that’s another conversation to have.
    [00:22:35] swyx: I have a cheeky evals question for you.
    Have you noticed any secret degradation from any of the major model providers?
    [00:22:40] Sarah Sachs: Secret degradation,
[00:22:42] swyx: Like, during the workday, when it’s high traffic, it suddenly gets dumber.
    [00:22:47] Sarah Sachs: Yeah. I mean, not just between the, I mean, we definitely notice flakiness, we’ve definitely noticed, particularly for some providers, that things are slower during working hours and
    [00:22:57] swyx: there’s a latency argument.
    Yes. Not a quality argument.
[00:22:59] Sarah Sachs: No. I think the quality difference that’s interesting is, it really comes down to quantization. Companies that say they’re selling the same model through different vendors, whether first party or Bedrock, Azure, et cetera.
We do see different quality sometimes, and that’s not necessarily what’s advertised.
[00:23:21] swyx: Yeah. Kimi went to the point of, like, they shipped this eval across all the providers, and it was very obvious who was secretly quantizing, and it was very,
    [00:23:28] Sarah Sachs: yeah. But
    [00:23:29] swyx: that’s very embarrassing.
    [00:23:30] Sarah Sachs: You know, um, we hire Subprocess to figure that out for us.
So we just wanna understand where it’s regressing or where it’s optimized. And sometimes we’re okay with regressions that optimize latency, if they’re the appropriate regressions. Our job is to make sure we have the evals to understand the changes that are important to us. And even when we’re partnering with labs on pre-releases of models, they’ll send us multiple snapshots.
And this is less about quantization and more just regressions. They have shipped models that were not the snapshots that we wanted, and they have changed the snapshots they shipped based on the feedback that we give, because our feedback tends to be more enterprise-work focused and not coding-agent focused.
And definitely those can be bummers. Like, you know, we know this wasn’t the version you wanted, but we’ll help you make it work. I mean, we always make it work, but that definitely happens.
[00:24:16] Alessio: Yeah. Do you have, um, failing evals where you’re just hoping, oh, that will have success eventually when a good model comes out?
[00:24:23] Sarah Sachs: Uh, I mean, yeah. I could talk about this for 60 minutes, so I will limit myself. I think it’s a real issue when people say “evals” and treat that as just quality. It’s like saying “testing”: it’s not just unit tests, right? So we have the equivalent of unit tests.
Regression tests: those live in CI, those have to pass a certain percentage, you know, within some stochastic error rate. Then we have, as you’re building a product, evals of “these aren’t passing right now, and this is launch quality.” So we have a report card, and we need to be at, you know, 80 or 90% on these categories of user journeys to launch. And then we have what we call frontier or headroom evals, where we actively wanna be at a 30% pass rate.
And that’s actually been an effort that we took on in partnership with Anthropic and OpenAI in the past maybe two or three months, because we actually hit a point where our evals were saturated, and we weren’t able to really give insightful feedback other than “it wasn’t worse.” And not only is that not helpful for our partners, it’s not helpful for us to understand where the stream is going.
You know, going back to that analogy. And so we spent a lot of time thinking about what Notion’s Last Exam looks like, right? Mm-hmm. Not just Humanity’s Last Exam. Ooh, Notion’s Last Exam. Mm-hmm. And there’s a lot of, you know, dreams about what that would look like. I know we’ve talked a lot about benchmarking, swyx, but, uh, yeah.
Notion’s Last Exam is a big thing inside the company, and we have people full-time staffed to it exclusively. Mm. We have a data scientist, a model behavior engineer, and a full-time evals engineer dedicated exclusively to the evals that we pass 30% of the time.
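Sarah’s three eval tiers (CI regression tests, launch report cards, and frontier or headroom evals) are easy to picture as pass-rate thresholds over scored runs. This is an illustrative sketch only; the thresholds and names are made up, not Notion’s actual framework:

```python
# Illustrative only: the three eval tiers as pass-rate thresholds.

from dataclasses import dataclass

@dataclass
class EvalSuite:
    name: str
    tier: str            # "regression" | "launch" | "frontier"
    results: list[bool]  # one pass/fail judgment per test case

    @property
    def pass_rate(self) -> float:
        return sum(self.results) / len(self.results)

# Hypothetical thresholds: regression evals must stay high (within some
# stochastic error rate), launch evals gate a release around 80-90%,
# and frontier evals are *meant* to sit near a 30% pass rate. If a
# frontier suite passes too often, it has saturated and stops telling
# you where the headroom is.
def evaluate(suite: EvalSuite) -> str:
    r = suite.pass_rate
    if suite.tier == "regression":
        return "pass" if r >= 0.95 else "block_ci"
    if suite.tier == "launch":
        return "launch_ready" if r >= 0.85 else "not_launch_quality"
    # frontier: flag saturation rather than failure
    return "saturated_write_harder_evals" if r > 0.6 else "healthy_headroom"

print(evaluate(EvalSuite("editing", "frontier", [True] * 3 + [False] * 7)))
# → healthy_headroom
```

The interesting design point is the inverted goal of the frontier tier: a low pass rate is healthy, which is what motivated the “Notion’s Last Exam” effort once the older suites saturated.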
    [00:25:56] swyx: What you’re hiring for
    [00:25:57] Sarah Sachs: MBEs? I am hiring
    [00:25:58] swyx: What is an MBEA
    [00:25:59] Sarah Sachs: model?
    Behavior Engineer Model. Behavior engineers started with a title data specialist before I joined when they were working with Simon on like, uh, Google Sheets and like Simon just needed someone to look through Google Sheets and say, yes, no, this looks bad. This looks good. Right? And so we hired people with kind of diverse linguistics background.
We had a linguistics PhD dropout, mm-hmm, and a Stanford new grad. And they’re amazing. And they formed a new function, basically. And over time we’ve built a whole team, with a manager who’s now kind of reinventing what that role is with coding agents. So they used to be kind of manually inspecting code.
Now they’re primarily building agents that can write evals for themselves, or LLM judges. There’s a really funny day, I can send you the picture, where Simon, about a year and a half ago, was teaching them how to use GitHub. They’re at the whiteboard, and it was like, okay, I think it would be so much faster if our data specialists learned how to use GitHub and learned how to commit these things in the code.
And that was then, and now, I think, coding has become a lot more accessible. But moving forward, it’s this mix of data scientist, PM, and prompt engineer, because there’s craft in understanding even what models can and can’t do. How do we define that headroom? How do we define what a good journey is?
Is this model better or not? Why is this failing? There’s some qualitative work, but then there’s also a lot of instinct and taste to it, and that’s not necessarily software engineering. And so we have very firm conviction, and have had for a number of years now, that that is its own career path, and we have always welcomed the misfits, so to speak.
So we really firmly believe that you don’t need an engineering background to be the best at this job. And that’s what’s quite unique about this particular role.
[00:27:37] Simon Last: Yeah, this is something that I’ve been pretty excited about recently: we made an effort to treat the eval system as an agent harness.
So if you think about it, you should be able to have an agent, end to end, download a dataset, run an eval, iterate on a failure, debug, and then implement a fix. And ultimately you should be able to drive the full process with a human sort of observing the outer system.
So yeah, we went pretty hard on that, and that’s worked extremely well so far. It’s basically just turning it into a coding agent problem.
    [00:28:11] swyx: Your coding agent or just whatever
[00:28:13] Simon Last: harness? Any coding agent. Yeah, Claude Code. It should be totally general. I think it would be a mistake to fix it to any particular coding agent.
At the end of the day, it’s just CLI tools.
[00:28:21] Sarah Sachs: It’s the same way that you would have a coding agent write the unit tests: you should have a coding agent write the eval.
    [00:28:26] swyx: Yeah.
[00:28:26] Sarah Sachs: But there’s a lot of supervision in that still. We just don’t believe that supervision has to come from software engineers, because a lot of it is kind of UX-y and whatever, and these are the people that also triage failures and tell us where we should be investing next.
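The “coding agent writes and runs the eval, humans supervise” idea above can be sketched as a tiny loop. Everything here is a hypothetical stand-in, not Notion’s system: `call_llm` fakes an LLM judge, and `propose_fix` fakes the agent iterating on a failure:

```python
# Hedged sketch: run evals, triage a failure, fix, re-verify.

def call_llm(prompt: str) -> str:
    # stand-in for a real model call; this pretend judge passes
    # answers that cite a source page
    return "PASS" if "source:" in prompt else "FAIL"

RUBRIC = "Does the answer cite the source page it drew from?"

def judge(answer: str) -> bool:
    return call_llm(f"{RUBRIC}\n---\n{answer}") == "PASS"

def propose_fix(answer: str) -> str:
    # stand-in for the coding agent iterating on the failure
    return answer + " source: /roadmap"

answers = ["Plan A. source: /roadmap", "Plan B."]
for i, a in enumerate(answers):
    if not judge(a):                    # failure triaged...
        answers[i] = propose_fix(a)     # ...fixed by the agent...
assert all(judge(a) for a in answers)   # ...and re-verified
print("eval loop green")
```

The human’s role in this sketch is exactly what Sarah describes: supervising the rubric and the triage decisions, not writing the loop itself.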
[00:28:40] swyx: Yeah. I’m gonna go ahead and ask a spicy question. Is there a date when there are no software engineers at Notion?
    [00:28:46] Simon Last: Um,
    [00:28:46] Sarah Sachs: what does it mean to be a software engineer?
    [00:28:47] swyx: Exactly.
[00:28:48] Simon Last: I mean, I think the way things are going is we’re on some continuum where, if you look back three years ago, humans were typing all the code. Then we had autocomplete: you’re typing less of the code.
Then we had agents filling in lines, and now we’re getting into agents doing longer-range tasks, where it can debug and implement a fix and then verify it works, and, you know, get your PR even merged and deployed. I think we’re sort of just moving up the abstraction ladder, and then the human role becomes more about observing and maintaining the outer system.
There’s a stream of agents flowing through, like a bunch of PRs: what’s going off the rails? What do I need to approve? Is there a learning or memory mechanism that works? So it’s kind of a hard engineering problem. There’s a lot to do there. I think we’re just sort of moving up the stack.
    [00:29:34] Sarah Sachs: the same transition machine learning engineers have made, right?
    Like I haven’t looked at a PR curve in a while.
    [00:29:39] swyx: Yeah. You used to do this stuff and now, um, auto research can do it,
    [00:29:42] Sarah Sachs: right? Like I think it depends on what you define as a software engineer.
    [00:29:46] swyx: Yes. It’s, that’s changing for sure.
[00:29:49] Sarah Sachs: I think every software engineer at Notion this summer went through this... one of the engineering leads of the company called it: every software engineer is going through the identity crisis that every manager goes through, where all of a sudden they realize their ability to write code is less important than their ability to delegate and context-switch.
And I think that is a transition out of being a software engineer. But
[00:30:12] Simon Last: yeah. Yeah, there’s a critical difference from being a manager, which is that it is actually very deeply technical. The problem is, you know, humans are very fuzzy, and you can’t treat a team of humans like a rigorous system where PRs flow through and can be in a blocked status, and then something defined happens when they’re blocked, right?
With a set of agents, you actually can do that. And I think there’s a lot of interesting technical rigor that goes into that. It’s a technical design problem, ultimately.
[00:30:42] Alessio: What is the design of the software factory that you’re building?
[00:30:46] Simon Last: Yeah, I mean, I think we’re trying a lot of different things.
I mean, ultimately you want to design a system that requires as little human intervention as possible, while still maintaining the invariants that you care about. So yeah, we’re exploring a lot of different ideas there. I could talk about a few things I think are important.
Like, one thing I think is really important is having some kind of specification layer. You can just commit markdown files. Mm-hmm. That works pretty well, but
[00:31:15] swyx: it’s nice to be Notion, man. I’m just saying, like, the spec... yeah, the natural home for specs is Notion.
    [00:31:21] Simon Last: Yeah. Right. It can be a database of pages.
Yeah. I mean, it needs to be something that is, you know, human readable and viewable, and I think that’s pretty key. Another really key component is the self-verification loop. Yes. You need really, really good testing layers, basically. And that’s a really deep problem. But you have to get that right, and then it’s kind of like the workflow of:
what happens when there’s a bug? How does it flow into the system? Is it a subagent working on it? How does it make a PR, and how does that get reviewed and merged? And then, you know, there’s the flow, or process.
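The pieces Simon names (a committed-markdown spec layer, a self-verification loop, and a gated flow for changes) can be sketched roughly like this; all names and the gating logic are illustrative assumptions, not Notion’s implementation:

```python
# Hedged sketch of a "software factory" loop with a spec layer,
# self-verification, and a merge gate. Names are made up.

SPEC = """\
# Feature: inbox triage
- must label every incoming application
- must never delete user data
"""  # the spec layer: just committed markdown a human can read and review

def verify(change: dict) -> bool:
    # the self-verification loop: the invariants that must hold no
    # matter how little a human intervenes
    return change["tests_pass"] and not change["violates_spec"]

def factory(changes: list[dict]) -> list[str]:
    merged = []
    for c in changes:
        if verify(c):
            merged.append(c["pr"])  # flows through unattended
        # anything that fails flows back in as a bug for a subagent
    return merged

print(factory([
    {"pr": "PR-1", "tests_pass": True,  "violates_spec": False},
    {"pr": "PR-2", "tests_pass": False, "violates_spec": False},
]))
# → ['PR-1']
```

The point of the sketch is the division of labor: the spec and the verification layer are the parts a human maintains, while the changes themselves flow through the gate.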
    [00:31:56] swyx: Yeah. Cool. Uh, you know, one thing we did work out before you guys came in was this demo or this
    [00:32:01] Simon Last: agents
    [00:32:02] swyx: agent demo.
    Uh,
    [00:32:03] Simon Last: so every,
[00:32:04] Alessio: every time we do an episode, we try the product, right? I don’t think there’s ever been an episode where I haven’t tried it. Yeah. Um,
[00:32:11] swyx: and we... “try” is a big word. Since day one, Latent Space has been on Notion, but this is the net new thing. Yes.
[00:32:18] Alessio: So this is for Kernel Labs, which is the space we’re in.
So next week we’re opening applications for tenants. So there’s a web form; we got this form done here. Before, the workflow would be: I get an email, then I look at the person, like, should I spend time talking to this person? Then I respond, they respond back. So I built this. And the name, it came up with on its own.
Can you maybe... how does it come up with its own name?
[00:32:43] Simon Last: Yeah, that’s a pretty apt name. It is just a random name generator.
[00:32:47] Alessio: Oh, that’s funny. It just came,
[00:32:49] Simon Last: the fact that it picked that is kind of hilarious. I’m pretty sure it’s just deterministic,
[00:32:54] Sarah Sachs: Resilient Collector. I think I’ve never looked at the code for that.
I’ve never second-guessed it. I think it’s kind of like a Mad Libs situation.
[00:33:00] Simon Last: Yeah, I think you’re right. It’s totally deterministic. Oh, I thought it was great. Yes. Although, if you use the AI to set itself up, it can update its own name. Okay. Um,
    [00:33:11] Sarah Sachs: how did you create it? It, did you just do
    [00:33:12] Alsesio: classroom?
    I,
    [00:33:13] Sarah Sachs: okay.
[00:33:13] Alessio: I did, yeah. I said: just check my inbox for applications for a coworking space, keep a database of people. So it created the database for me, which I have here. And I guess a database is like a Notion table, because everything is Notion. Um, and then whenever an email comes in, like here, it just creates a new row for the person.
Mm-hmm. And then it uses web search to enrich the, mm-hmm, the profile. So it kind of searches the web: this is who this person is, this is when they say they wanna move in, and it updates everything else. I mean, it’s not AGI, but I don’t wanna do this work. It took me maybe 15 minutes to set up the whole thing.
Um, and I really like that most of the information should live here. You know, it’s not some other tool asking me
    [00:34:01] Sarah Sachs: Yeah.
[00:34:01] Alessio: to, like, bring my stuff there. It’s like, I would’ve probably already created a Notion thing.
    [00:34:06] Sarah Sachs: Mm-hmm.
[00:34:06] Alessio: So
[00:34:07] Sarah Sachs: most of our biggest use cases and gains are from removing that extra layer of human involvement in the process, right?
And so one of our biggest use cases is bug triaging. If someone posts something in Slack, you can just have a custom agent that lives there, that has its own routing constitution of what team this belongs to, creates a task in your task database, and then posts back in that Slack channel, right? That’s one of the first things that we built internally, I think.
And it’s completely changed the way that Notion functions as a company. Nothing falls through... well, most things don’t fall through the cracks. We don’t know what we don’t know. But it’s not replacing people, it’s replacing processes.
[00:34:44] Alessio: Yeah.
    [00:34:44] Sarah Sachs: Right.
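The bug-triage pattern Sarah describes (a Slack report comes in, a routing “constitution” picks the owning team, a row lands in a task database, and the channel gets a reply) might look roughly like this. The routing table, team names, and default bucket are all made up for illustration:

```python
# Not Notion's implementation, just a sketch of the triage pattern.

ROUTING = {                 # hypothetical routing constitution
    "editor": "editor-team",
    "sql": "query-engine-team",
    "agent": "ai-platform-team",
}

TASKS: list[dict] = []      # stand-in for the task database

def triage(slack_message: str) -> str:
    text = slack_message.lower()
    team = next((t for kw, t in ROUTING.items() if kw in text),
                "triage-inbox")  # default bucket so nothing falls through
    TASKS.append({"team": team, "text": slack_message})
    return f"Filed for {team} (task #{len(TASKS)})"  # posted back to Slack

print(triage("Two agents are clobbering each other's edits in the editor"))
# → Filed for editor-team (task #1)
```

A real version would use an LLM for routing rather than keyword matching, but the shape is the same: the agent replaces the process step, not the person who owns the bug.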
[00:34:45] Alessio: And I’m curious how you think about composability of these things.
So the other one I was working on is like a lease filler. So whenever somebody signs up as a tenant, it’ll fill out the lease for them. There should probably be some office manager agent that can handle the request, make the lease, and then give them access to the office and all of that.
    How do you think about that feature?
[00:35:08] Simon Last: Yeah, so I mean, there’s two ways you can compose. One way is by using the data primitives. So you can have one agent writing to the database and another agent that’s watching the database. So that’s one way they can coordinate; that’s a little bit more decoupled and, mm-hmm,
works really well. Or you can couple them. So, I think it’s actually not released yet, we’re releasing it next week: in the settings for an agent, you can give it access to invoke any other agent.
    [00:35:34] swyx: Hmm.
[00:35:34] Simon Last: So you can have them just talk directly. So
    [00:35:37] swyx: you, was there a limit on like, number of recursions or just,
    [00:35:40] Simon Last: um, probably,
    [00:35:42] swyx: you know what I mean?
    Like, you can just get an infinite loop that way there’s
    [00:35:45] Simon Last: some kind of Yeah,
    [00:35:46] Sarah Sachs: I think it’s, there is actually a number somewhere.
[00:35:49] swyx: I believe you. I’m just saying, you know, someone’s gonna screw up. You
[00:35:51] Simon Last: should try it and see.
    [00:35:53] swyx: Yeah. I mean, everything’s gonna be paperclips.
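The infinite-loop worry swyx raises is commonly handled with a depth budget on the agent-to-agent invocation chain. Notion only says “there is actually a number somewhere,” so the cap below is a made-up illustration of the general technique:

```python
# Generic sketch of a recursion budget for agents invoking agents.

MAX_DEPTH = 5  # hypothetical cap on the invocation chain

def invoke(agent: str, graph: dict[str, list[str]], depth: int = 0) -> int:
    """Return how many invocations actually ran under the depth budget."""
    if depth >= MAX_DEPTH:
        return 0                # refuse instead of recursing forever
    calls = 1
    for child in graph.get(agent, []):
        calls += invoke(child, graph, depth + 1)
    return calls

# two agents that invoke each other would otherwise never terminate
loop = {"a": ["b"], "b": ["a"]}
print(invoke("a", loop))  # → 5, bounded by MAX_DEPTH
```

Production systems often combine this with a token or cost budget, but a simple depth cap is already enough to turn the paperclip scenario into a bounded failure.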
    [00:35:55] Simon Last: Oh, yeah. Yeah. But, uh, but, but that’s really useful.
Yeah. So, you know, I helped someone internally the other day. They had built over 30 custom agents for our go-to-market team, doing all kinds of different things. For example, researching and filling in information about a customer, or triaging customer feedback, or something like that.
Literally over 30 of them. And then he even made a database of all the agents, and then he’s like, okay, and now I’m getting over 70 notifications per day of just agents blocked on various things. And I was like, oh, okay, cool, the obvious thing to do there is to make a manager agent,
    [00:36:32] Sarah Sachs: right?
[00:36:33] Simon Last: that’s gonna sort of be another abstraction layer in between you and your 30 agents. So yeah, we set it up with a manager agent that has access to invoke all the other agents, and it’s sort of watching and observing them, and it just creates a layer of abstraction.
So instead of 70 notifications per day, it’s like five. And then the manager agent can help debug and fix any problems with the...
[00:36:54] swyx: Is there a concept of, like, an inbox or something? Like, you’re basically saying that they can message each other?
    [00:37:00] Simon Last: Yeah.
    [00:37:01] Sarah Sachs: Well
    [00:37:01] swyx: they use the system of record, which, which is
    [00:37:02] Sarah Sachs: notion, so we
    [00:37:03] Simon Last: actually, yeah, we didn’t make any special concepts at all.
[00:37:06] swyx: They’re subscribed to the Notion notifications that I would’ve gotten,
[00:37:09] Sarah Sachs: they can just write a task to a database that the other agent is listening to, or they can actually call a webhook to the agent. Like, they can just @ the agent. Okay.
    [00:37:17] Simon Last: Yeah, I mean, this is something that, that we’re still working on.
I think generally the way we do these things is you first make it possible, maybe in a sort of janky way. So the way I set them up is we created a new database that was sort of like issues that the custom agents were experiencing, and then gave them all access to file an issue, and the manager has access to read the issues.
And that works pretty well. Essentially, give it its own internal issue tracker just for the agents. And then, if that becomes a concept that seems useful, maybe we’ll think about how to package it in. But generally we try to just keep it to composing the primitives if we can.
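The issues-database pattern is simple enough to sketch. This is a hypothetical mock, not Notion’s API: the shared database is just a list of rows that any agent can append to and the manager can read.

```python
# Minimal sketch of the pattern: instead of a special messaging concept,
# agents share a plain "issues" database. Names are illustrative.
issues = []  # the shared database: each row is a dict, like a database row

def file_issue(agent: str, description: str):
    """Any agent with access can append a row."""
    issues.append({"by": agent, "description": description, "status": "open"})

def manager_read_open_issues():
    """The manager agent has read access and triages open rows."""
    return [i for i in issues if i["status"] == "open"]

file_issue("pr-review-agent", "GitHub token expired")
file_issue("docs-agent", "page edit conflict")
print(manager_read_open_issues())
```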
You know, another example of this is we have no built-in memory concept. Memory is just pages and databases. And so if you wanna give an agent memory, just give it a page with edit access to that page, and the
    [00:38:03] swyx: human can edit it. Agent can edit
[00:38:04] Simon Last: it. Yeah. And that pattern works extremely well.
And depending on the use case, you can have it be just a page, or it could be an entire database, or it can have sub-pages. It’s pretty open-ended what you can do with that.
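Memory-as-a-page can be sketched the same way. A hypothetical `Page` class stands in for a real Notion page; the point is that human and agent edit the same object, with no separate memory concept.

```python
class Page:
    """Memory as a plain page: both the human and the agent edit the
    same text. Illustrative only; real Notion pages are richer."""
    def __init__(self, content: str = ""):
        self.content = content

    def append(self, note: str):
        # Both parties use the same edit access; there is no
        # agent-only memory store.
        self.content += ("\n" if self.content else "") + note

memory = Page()
memory.append("User prefers weekly summaries on Mondays")  # agent writes
memory.append("Ignore the #random channel")                # human edits the same page
print(memory.content)
```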
[00:38:15] Alessio: So when I was setting this up, I connected my inbox and it was like, do you wanna use Gmail or Notion Mail? And I’m like, I don’t wanna choose either, I just want you to do it.
I’m curious how you think about Notion Mail, Notion Calendar, all of these kind of UI/UX interfaces, full stack
    [00:38:29] Simon Last: notion.
[00:38:30] Alessio: Yeah. When, at the same time, you have the agents abstracting them away from you, in a way. How do you spend the product calories, so to speak?
[00:38:37] Simon Last: Yeah, I mean, I think it’s pretty important that you don’t have to use Notion Mail to get the mail capability.
So we can just connect to Gmail, or whatever you want to use. And we’re thinking of the mail service as being really great to the extent that it’s really agent-built, right? So maybe the mail app is just sort of a prepackaged agent that helps you automate your inbox.
[00:39:00] Alessio: Yeah, the auto labeling is great.
[00:39:03] Sarah Sachs: I think when we integrate with Gmail, for instance, we have a series of tools available via MCP or API to Gmail. When we integrate with Notion Mail, we have the Notion Mail engineering team build us the exact right tools that optimize latency, performance, and quality.
They own that quality. There are product leads there. They’re directly thinking about the user problems that happen in mail. So it tends to be, when we build integrations and connections, we build natively first, and then think about extending them, generally just because it’s also easier to build natively first.
So that tends to be how we phase things out.
[00:39:43] swyx: Talking about integrations, you prompted me, so I gotta ask. MCP, CLI. What’s going on? What’s the
[00:39:48] Simon Last: Yeah. Opinion? I’m definitely bullish and excited about CLI. I think there’s a few really cool things about CLI. One really cool thing is that it’s in the terminal environment, so it gets a bunch of extra power.
So, for example, it can paginate and cursor through long outputs. And it has progressive disclosure inherently: you don’t see all the tools at once, you just see the CLI wrapper, and you can use the help commands and read files. And then I think the most important thing that’s super cool is that it’s also inherently bootstrappable.
So if there’s an issue, the agent can debug and fix itself within the same environment in which it uses the tool.
    [00:40:30] swyx: Mm.
[00:40:30] Simon Last: Right. Like, I think I saw a tweet this morning. Someone said, my agent didn’t have a browser, so I asked it to make a browser tool, and within a hundred lines of code it gave itself a little browser, wrapping the Chromium API.
That’s pretty incredible. And then if there was a bug, it would just immediately try to fix it. Mm-hmm. Right. On the other hand, if you use the Chrome DevTools MCP, I’ve had this issue where sometimes the transport gets messed up. If it gets messed up, the agent has no way to fix itself.
It no longer has a browser; it’s just broken. Right. I think that’s pretty fundamental. But I would say a lot of the bad things about MCP can be fixed. Progressive disclosure, that can be fixed with the right harness. It obviously doesn’t make sense to show it all the tools all the time.
That’s not really inherent to the MCP protocol. It’s just how you wrap it and use it.
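Simon’s progressive-disclosure point about CLIs can be illustrated with a toy command wrapper (the commands and names are invented): the agent first sees only the top-level help, then drills into a subcommand’s help on demand, rather than loading every tool schema upfront.

```python
# Toy CLI wrapper showing help-first progressive disclosure.
# TOOLS is an invented registry, not a real product's command set.
TOOLS = {
    "search": {"help": "search <query>  -- full-text search",
               "run": lambda q: f"results for {q}"},
    "mail":   {"help": "mail <to> <msg> -- send an email",
               "run": lambda *a: "sent"},
}

def cli(argv):
    # Top level: only command names, no schemas.
    if not argv or argv[0] == "--help":
        return "usage: notion <command>. commands: " + ", ".join(sorted(TOOLS))
    cmd, *args = argv
    # Second level: per-command help, fetched only when needed.
    if args and args[0] == "--help":
        return TOOLS[cmd]["help"]
    return TOOLS[cmd]["run"](*args)

print(cli(["--help"]))             # top level: just command names
print(cli(["search", "--help"]))   # drill down only when needed
print(cli(["search", "roadmap"]))
```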
[00:41:16] swyx: There are many poorly built MCPs because we didn’t know better.
[00:41:19] Simon Last: Yeah, yeah. I mean, it was just early. The obvious thing to start with is to just show it all the tools, and it’s like, okay, now we have a hundred tools.
Yeah. And the tool calling actually works. So let’s
[00:41:28] swyx: Victim of your success.
[00:41:29] Simon Last: give it a way to filter or search the tools. So yeah, I would say, broadly speaking, I’m really bullish on CLI. I’m still bullish on MCPs in certain environments. In particular, MCP is really great for when you want a narrow, lightweight agent.
I think there’s definitely a lot of use cases where you don’t want a full coding agent with a compute runtime, and also you want it to be more tightly permissioned. MCP inherently has a really strong permission model: all you can do is call the tools. A CLI is a little bit murkier.
It’s like, can it access the API token? Are you properly encrypting the token so it can’t exfiltrate it? It introduces a lot of new issues, which are real and hard to solve. And MCP is just the dumb simple thing that works, and it’s pretty good.
[00:42:12] Sarah Sachs: I’ll add two more perspectives, not on whether it works well for Notion, but on how Notion commits to both platforms.
Notion is dedicated to being the best system of record for where people do their enterprise work. So we will always support our MCP, insofar as other people are using MCPs, right? So regardless of our perspective, we’ve put a lot of effort into our MCP and we have a fantastic team that we’re building to do more there.
And the second thing I’ll say: we all think a lot, but lately I’ve been thinking a lot about making sure there’s value alignment between pricing and capability.
    [00:42:43] swyx: Literally our next question
[00:42:44] Sarah Sachs: Needing a language model to execute deterministic tasks feels wasteful, and relying on a language model to interface with third-party providers seems wasteful for tasks that don’t require it.
And particularly because our custom agents use usage-based pricing. We think of pricing as the barrier to entry for use of our product, and we’re quite committed to making sure that it’s not wasteful. Not just because it’s a bad deal for our customers, but it’s also bad business. We wanna have as many buyers... there’s an elasticity of demand. And so if we can have our agents properly write code that calls a CLI deterministically, it’s a one-time cost, right?
Versus constantly having a language model integrate with an MCP over and over and paying those repeated token fees. And if it’s happening outside the cache window, then you’re paying for it over and over and over, and it’s just kind of unnecessary, and less deterministic when it doesn’t have to be.
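Sarah’s cost argument can be made concrete with back-of-envelope arithmetic. All numbers here (tokens per mediated call, one-time codegen cost, price per million tokens) are made up for illustration:

```python
# Illustrative comparison: an LLM mediating every deterministic call
# vs. paying once to generate code that runs for free afterwards.
def cost_via_llm(calls: int, tokens_per_call: int = 2000,
                 price_per_mtok: float = 3.0) -> float:
    # Every call re-pays the token fee (worse if outside the cache window).
    return calls * tokens_per_call / 1_000_000 * price_per_mtok

def cost_via_generated_code(calls: int, codegen_tokens: int = 20_000,
                            price_per_mtok: float = 3.0) -> float:
    # One-time generation cost; subsequent runs are ~free and deterministic.
    return codegen_tokens / 1_000_000 * price_per_mtok

print(cost_via_llm(10_000))             # dollars, grows with call count
print(cost_via_generated_code(10_000))  # dollars, flat one-time cost
```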
[00:43:36] Alessio: Yeah, the open-endedness, I think, is the main thing. Like, if I go write code to just call an API, I would never use an MCP. But then you need an MCP sometimes, when you know what to call but you don’t want it to restart. Versus, the “it built a browser from scratch” thing is great when you’re doing it on your own, but if your customers were having your AI write a browser from scratch every time and you had to pay the token cost of that, yeah,
you’d be like, no, no, the Chrome DevTools MCP is actually pretty great, just use that. I’m curious, how do you make that decision? Should it be just a straight API call, very narrow? Should it be an MCP? Should it be super open-ended?
[00:44:10] Sarah Sachs: Do you mean for when we ship Notion capabilities, or when we add capabilities to
[00:44:13] Alessio: Notion
[00:44:14] Sarah Sachs: AI, or...
[00:44:14] Alessio: I mean, you might have a capability where the only way to do it is an open-ended agent, like an agent with a coding sandbox.
[00:44:21] Sarah Sachs: Yeah. In Notion AI they’re not exclusive. We also ship an MCP.
[00:44:24] Alessio: Yeah. Yeah. In B,
    [00:44:25] Sarah Sachs: yeah.
[00:44:26] Alessio: Internally. Okay. Is there ever a discussion of, we’re not gonna ship it because we’re not able to tie it down? Or are you happy to just, like,
[00:44:33] Sarah Sachs: Um, no. I mean, there are a lot of things where we choose not to use MCP because we wanna be more high-touch about quality.
I think search, an agent’s ability to find things, is the largest instance of that, where we have Slack and Linear and Jira search in Notion that is not necessarily using the search MCP functionality provided by those companies. And that’s because we think it’s quite critical to how our agent trajectories work for us to have a little bit more control over the functionality of the search journey.
And so it usually comes from quality. And there’s a long tail of things, and that’s why we built an MCP client, or an MCP server, excuse me, so that people can connect whatever they want. There’s that long tail, right? But for search particularly, I would say that’s the primary entry point. There are other connections as well; it’s a little bit of secret sauce, when we’re okay with MCP functionality and user-driven auth,
and when we actually want to carry a lot more ourselves.
[00:45:31] Simon Last: I think there’s not really a conflict here. There are just different layers of the stack and different abstractions. I mean, if we were to map it out: MCPs give you a way to... it’s a protocol for gaining access to tools.
It’s an open protocol, so you can easily get a long tail of many things. So if you open up, like, the tool settings... oh, you saw the trigger. Actually, that’s something that MCP can’t do. So if you scroll down... yeah, the tools and access, so you’ve got a connection.
Yeah. MCP is a really great way to gain access to tools, but you just looked at the trigger. For example, there’s no trigger protocol, so those are things we had to build ourselves. And then there are some integrations where we use MCP, like, for example, I think the Linear and the GitHub ones
mm-hmm.
[00:46:20] Simon Last: use MCP. But the Slack and mail ones, those are actually ones we built in-house, and we spent a lot of time really fine-tuning all the tools to make them really good, and also building out the triggers. So it’s just different layers of the stack. Some things make sense sometimes, and then we just have to harness the right tool at the right time.
I don’t think there’s an inherent strong conflict between these things.
[00:46:40] Alessio: Do you have a canonical representation of these tools internally, where you wrap these things together, the MCP ones plus the custom-built ones?
    [00:46:46] Simon Last: Yeah. Yeah. We have like internal abstractions for like what is a tool, what is an agent, what is a completion call?
    Yeah.
[00:46:55] Sarah Sachs: We even have internal abstractions for, like, what is a chat archetype, whether it be from Teams or Slack.
    [00:47:02] swyx: Yeah.
    [00:47:02] Sarah Sachs: Right.
[00:47:02] swyx: It’s like the only
[00:47:03] Sarah Sachs: way to
[00:47:03] swyx: build with AI, ’cause everything’s moving so quickly. You would have to abstract it so that you can swap things out.
[00:47:09] Simon Last: Yeah. I mean, there’s always a dance.
We probably rebuilt our framework, like I said, like five different times. It’s always a dance of, okay, how does this new thing work? What should the abstraction be? What is OpenAI giving us? What is Anthropic giving us? You know, we’re trying to wrap over it. I think we’ve been pretty successful with that. It’s just a matter of staying nimble. Yeah. And making sure that you always have the simplest, dumbest abstraction you can that maps over the different things. Yeah. So we have a tool integration abstraction, for example.
    And then MCP is like a, a type of integration.
    [00:47:41] swyx: Yeah.
    [00:47:42] Simon Last: That’s, that’s one of the,
[00:47:43] swyx: This might be a big ask, but I’m gonna try. You’ve said multiple times you rebuilt a few times, like five times, or whatever the right number is. Is there a brief history of what each rebuild was doing?
[00:47:56] Simon Last: I can try to do that. I mean, yeah.
[00:47:58] swyx: It’s interesting. You’d need to RAG over the
[00:48:00] Sarah Sachs: archeology.
[00:48:00] Simon Last: I mean, the first version that we started building in like late 2022... oh my gosh. Well, there’ve been many versions, actually. Okay. Well, the writer, the...
[00:48:08] swyx: I like the highlights, the, like...
[00:48:10] Simon Last: Oh wow.
[00:48:10] Simon Last: I mean, the first version we built was actually a coding agent.
Yeah. So we’re like, oh, instead of building tools, let’s make everything be JavaScript, and then we’ll just give it JavaScript APIs and it’ll just write code, and that’s how it speaks to the tools. But at the time, it just sucked at writing code. It wasn’t that good. So then we moved to more of a tool calling abstraction.
Tool calling didn’t exist yet, so we created this whole XML representation. And a big learning in that version is we were catering way too much to what made sense for Notion and Notion’s data model versus what the model wants. So as an example, we created this whole XML format that can losslessly map to Notion blocks.
And the transformation between them is super easy to do. And then we created these sort of mutation operations to add to pages. But it sucked, because the model didn’t know the XML format, and also you have to prompt it
[00:49:04] swyx: in, and
[00:49:04] Simon Last: Yeah, to prompt it in, and markdown is just more convenient.
And so yeah, we’re like, okay, well, it has to be markdown. The models know markdown, you know. So we did a whole project around basically creating a Notion-flavored markdown, where the whole goal was: it has to be just simple markdown at the core, and then we can add some enhancements.
And it doesn’t have to be a full lossless conversion. That was a big one we did. And then we had a whole similar learning at the database layer. To query a database, I mean, in the Notion API, the way you query a database is there’s a crazy JSON format, and it’s kind of limiting, but it maps nicely to how we represent things internally.
We scrapped all that and we’re like, okay, let’s just make it SQLite. Everything is a SQLite database. You can query it just like a SQLite query. And the models are super good at that. So
    [00:49:51] swyx: give the models what they want.
[00:49:52] Simon Last: That was another one. Yeah. Give the models what they want. I would say that was a big learning: really be savvy and really careful thinking about what the model wants in terms of its environment, and cater around that.
And really try hard not to expose it to any complexity about your system that’s unnecessary.
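The SQLite idea above is easy to demo: mirror a Notion-style database into an in-memory SQLite table and let the model emit plain SQL instead of a bespoke JSON filter format. The schema and rows below are illustrative, not Notion’s actual data model.

```python
# Sketch of "everything is a SQLite database": the model writes plain
# SQL, which it already knows well, instead of a custom query format.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (title TEXT, status TEXT, assignee TEXT)")
conn.executemany(
    "INSERT INTO tasks VALUES (?, ?, ?)",
    [("Ship custom agents", "done", "simon"),
     ("Tool search", "in progress", "sarah"),
     ("Self-healing agents", "todo", "simon")],
)

# The model emits ordinary SQL it was trained on:
rows = conn.execute(
    "SELECT title FROM tasks WHERE assignee = 'simon' AND status != 'done'"
).fetchall()
print(rows)  # [('Self-healing agents',)]
```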
[00:50:12] swyx: Notion’s underlying database is Postgres, right? Not SQLite, right? Yeah. So I don’t know if there’s any mismatch there.
[00:50:18] Simon Last: That one was kind of a fortuitous thing, because we actually already had a big project going where... so we have this thing where, when you query a Notion database, it’s actually querying this cluster of SQLite databases.
    [00:50:34] swyx: Mm-hmm.
    [00:50:35] Simon Last: That’s something that we’d already been working on even before the agents came around.
    [00:50:38] swyx: Yeah. You know, you guys had a fantastic blog post about it and like it’s, it is actually a really good database engineering knowledge to have that from you guys because where else would we get it?
    [00:50:47] Simon Last: Yeah, yeah.
It’s a crazy engineering problem when you want to have millions and billions of tiny databases, where some of them are tiny but some of them are very large, and you want everything to be very fast.
[00:50:57] swyx: Yeah. And also not that hierarchical sometimes, you know, so somewhat of a graph.
    [00:51:02] Simon Last: Mm-hmm.
    [00:51:03] swyx: I do like that history because I think that shows the evolution that you guys went through and the work that went into it,
[00:51:09] Sarah Sachs: That only gets you to like a year and a half ago.
[00:51:11] swyx: Oh, okay. Okay.
[00:51:13] Simon Last: I need to hit continue.
    [00:51:14] Sarah Sachs: If you’re curious. I mean, we can keep going. Just saying like, that’s really,
    [00:51:18] Simon Last: that’s another one.
    Yeah.
[00:51:19] Sarah Sachs: Let me think. Well, no, ’cause there was tool calling, and then there was research mode, which wasn’t fully agentic tool calling. Then we moved away from few-shot prompting entirely to tool definitions. And now we’re thinking about Agent 2.0.
[00:51:34] swyx: So no few-shot prompts ever, right?
    [00:51:35] Sarah Sachs: Uh,
    [00:51:36] swyx: okay. No, maybe not.
[00:51:37] Sarah Sachs: Not never, but
    [00:51:38] Simon Last: yeah, that kind of went away. It’s an interesting thing,
    [00:51:40] swyx: right?
    [00:51:41] Simon Last: Yeah. I mean, so
[00:51:41] swyx: These just instruction-follow really well.
[00:51:44] Simon Last: I would say there’s been a general arc where you gradually strip away everything, and it looks more AGI-like. So it started out as a one-shot, one prompt,
with few-shot examples. And it became, okay, actually let’s give it tools, but it’s still few-shot examples. And then it became, no, no, no, let’s just give it a whole bunch of tools. One big shift that I’ve been working on recently, that’s about to ship, is: what happens when you have a lot of tools?
    [00:52:13] swyx: Yeah.
[00:52:13] Simon Last: So then tool search. Yeah. So then progressive disclosure becomes really important. We sort of hit a bottleneck where our agent worked really well, but it became pretty hard to add new tools. Mm-hmm. And we became sort of worried about it breaking the model.
It’s like, okay, someone...
[00:52:32] Sarah Sachs: No, I just heard it was like, saying hello was thousands and thousands and thousands of
[00:52:35] Simon Last: Yeah.
[00:52:35] Sarah Sachs: tokens. It was really slow.
[00:52:37] Simon Last: I can see you’re the efficiency person here. Yeah, it was too many tokens. But it’s also a quality issue, because it meant that any engineer could introduce a new tool for some niche feature,
and it would kind of nerf the overall model by causing it to call the tool too much and stuff like that. And so, yeah, we had an effort basically to make our harness implement progressive disclosure in a nice way. That’s a big shift.
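Tool search as progressive disclosure can be sketched as a single meta-tool over a registry (the registry and tool names here are invented): only matching tool definitions ever enter the context, instead of all hundred-plus schemas.

```python
# Sketch of a tool-search meta-tool: the model is shown only this one
# function, and it fetches matching tool definitions on demand.
REGISTRY = {
    "create_page": "Create a new page under a parent page",
    "query_database": "Run a SQL query against a database",
    "send_mail": "Send an email from the connected mailbox",
    "search_slack": "Search connected Slack workspaces",
}

def search_tools(query: str, limit: int = 3):
    """Return (name, description) pairs matching the query."""
    q = query.lower()
    hits = [(name, desc) for name, desc in REGISTRY.items()
            if q in name.lower() or q in desc.lower()]
    return hits[:limit]

print(search_tools("mail"))   # only mail tools enter the context
print(search_tools("slack"))
```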
[00:53:00] Sarah Sachs: You said earlier, everyone says reasoning models were the big shift. But what’s more, when we went away from few-shots to describing the goal of the tool, goal-driven, basically moving from a DAG to a true system with feedback, that’s when we could distribute tool ownership to the teams
much better. Because when it was all few-shots, it was everyone truly editing one string, and things would compete. And the order... there were all these papers about, you know, not all context is created equal: the higher up it is in your examples, the more the model listens. And we were trying really hard to fight against the order and the selection of the few-shots. And that really had to be a center of excellence, and it didn’t scale with the number of people for the need the company had.
It was really just five or six people that were even allowed to touch that, or had to approve it rather, in our code base. And now we can actually, with the right eval setup, distribute, so that everyone owns their tool and their tool definition. And sometimes we have crazy things, where we write two tools that have the same title and the agent crashes and stuff like that.
So there are issues. Actually, believe it or not, Anthropic couldn’t take it. Sonnet couldn’t handle two tools with the same name, and OpenAI’s GPT-5.2 was like, I can figure this out. So that was an interesting one that we learned by accident through a sev.
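The same-name incident suggests a guardrail worth sketching: validate tool names at registration time so a duplicate fails fast instead of crashing the agent, or confusing the model, at runtime. This is our sketch, not Notion’s code:

```python
# Registration-time validation: reject duplicate tool names up front.
def register_tools(tools):
    """Build a name -> tool map, failing fast on duplicate names."""
    seen = set()
    for tool in tools:
        name = tool["name"]
        if name in seen:
            raise ValueError(f"duplicate tool name: {name}")
        seen.add(name)
    return {t["name"]: t for t in tools}

try:
    register_tools([{"name": "search"}, {"name": "search"}])
except ValueError as e:
    print(e)  # duplicate tool name: search
```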
[00:54:17] swyx: But I mean, the underlying representation is a dict, right?
Clearly. Like, that’s for safety. Yeah,
[00:54:23] Sarah Sachs: Exactly. Exactly. But so that was a big shift for the company, and the velocity wasn’t immediate, because the AI team, the center-of-excellence team that owned that one file of few-shot prompts, had to become a platform team overnight, and that wasn’t natural.
Yeah. Yeah. But I would say that, in terms of the velocity of how we contribute to the agent, beyond coding tools obviously being a big velocity lever, being able to distribute tools and not have to all collaborate on one very select string of system prompt is truly, I would say, the biggest lever on how we’ve scaled.
[00:54:57] Simon Last: We’re fighting to keep the prompt as short as possible now. And yeah, in the latest version of the agent... it’s not in custom agents yet, but it will be like next week or the week after... there are now over a hundred tools, just for all the crazy Notion stuff. So we’re able to really go deep and, like,
    [00:55:11] swyx: would you list those tools publicly?
    Is this like IP or, uh,
[00:55:15] Simon Last: No, no, no. It’s totally public. You can ask.
We
[00:55:17] Sarah Sachs: can find
[00:55:19] Simon Last: Just ask. You can just ask the agent, and we’ll tell you.
[00:55:21] swyx: I find...
[00:55:21] Sarah Sachs: And we’re gonna post a benchmark. I mean, like, you’re
[00:55:23] swyx: Post bench.
    [00:55:24] Sarah Sachs: We don’t think our system prompt is our secret sauce.
    [00:55:26] swyx: Yeah. Mm-hmm.
[00:55:27] Simon Last: Great. We don’t try to hide the tools at all.
I think it’s kind of important, actually, as an operator, you know?
    [00:55:32] swyx: Yeah. As a power user, I wanna be like, oh, I can do this, this, this. Great.
[00:55:35] Simon Last: Yeah. Yeah. I mean, one phrase we say internally a lot is to teach at the top of the class. You know, we wanna build... the customization is kind of like a power tool.
I mean, we try to make it as easy as possible to set up, but we want it to be pretty deep and sophisticated. And I think a huge part of that is the operator needs to be able to interrogate the way the system works. And a big part of that is: what are the tools? How do they work? How should I prompt it to use the tools in the right way?
[00:56:00] Sarah Sachs: I’d actually say we don’t try and make it as easy as possible to use, ’cause the more we do that, the more we abstract away that interpretability that Simon’s talking about, and that basically nerfs the model, or nerfs the agent, from being super capable. So a huge turning point, I can think of the week and a half that we all came together on this as we were building custom agents, was that alignment that we’re not trying to build for everyone here.
We’re not trying to build the user experience that anyone can figure out how to use, ’cause the more we do that, the more we just diminish its capabilities. And that was a big... you know, everyone in a couple of Slack messages aligned on that, and that actually made us all work faster again,
right? ’Cause we were all more centralized on who we were building for.
[00:56:40] Alessio: What does the meta prompt generator look like? So I looked at the system prompt that it generates; for example, it uses emojis. That’s not an obvious thing to be doing.
    [00:56:50] swyx: Wait, did you just
[00:56:51] Alessio: ask it? What’s your system prompt? Oh no. This is how it generates prompts.
[00:56:54] swyx: The
[00:56:54] Alessio: prompts that generate prompts.
[00:56:55] Sarah Sachs: We call it set. Then it’s
[00:56:56] Alessio: a set.
[00:56:56] Simon Last: Well, so this is actually just the agent. So one thing we did that I really like with the custom agents is it can set itself up. So we not only give it access to use the tools it has access to, like send your emails or whatever, but it has more tools to set itself up and to debug itself.
And so when you ask it to write a system prompt, it’s just your agent itself doing that.
[00:57:16] Alessio: So this is just the model preference. You’re not really injecting anything into the model too much.
[00:57:21] Sarah Sachs: No, no. We have a guide, like, what makes a good custom agent, and yeah.
    [00:57:23] Alsesio: Yeah.
[00:57:24] Sarah Sachs: And things like that. And it’s really nice too, because if it fails, you can ask it, why did it fail?
And then say, okay, update your instructions so it doesn’t fail again. Obviously we should build self-healing into the product; that’s next on our roadmap. But it actually creates a nice system.
[00:57:40] Simon Last: Yeah. We do essentially give it a development guide: here’s how to make a custom agent,
here’s how to help the user test it end to end, to help them gain confidence that it works. Stuff like that.
[00:57:49] Alessio: Mm-hmm. Yeah. The fixing thing worked. I mean, it wasn’t automatic, but I mis-set something up, and there was like a fix button, and then, yeah,
    [00:57:58] Simon Last: yeah, yeah. One thing where
    [00:57:59] Alsesio: fix agent makes more,
[00:58:01] Simon Last: It’s actually an interesting sort of permission problem.
The thing about custom agents is that, by default, they have no permission to do anything, and then you have to explicitly grant all the permissions, and that’s what lets you trust them to work in the background. Right? You can know, oh, it can read my email but not send email. Okay, I can trust that.
If you let it fix itself, you’re breaking that. So in the current version, it is not allowed to edit its own permissions. In the current product you can sort of click a button to fix, but now you’re entering an admin mode, where you’re in a synchronous chat and you can see what it’s doing.
    [00:58:35] Sarah Sachs: Yeah. And it, and it confirms before it
[00:58:37] Alessio: changes.
    [00:58:37] Sarah Sachs: Yeah.
[00:58:37] Alessio: The thing that I really like, that most people don’t do, is the editing chat is the same thing as the using chat. Like, you can message the agent to both edit it and use it, versus a lot of other products are like... I think
[00:58:49] Simon Last: that’s really key. I think, I
[00:58:50] Sarah Sachs: I think a lot of designers will feel so happy you said that.
Yeah, ’cause we spent... we call this flippy, um,
[00:58:55] Simon Last: Yeah. What is
[00:58:56] Alessio: this?
[00:58:56] Sarah Sachs: What do you mean, this?
[00:58:57] Simon Last: This view of... well, yeah, so if you close that and, like, open settings, you can see... yeah. We call it flippy, because we started with the settings as sort of the main page, and then you could test the agent.
The AGI-pilled way to think about it is: oh, it’s just the agent. Everything’s the agent, right? It can set itself up, it can test itself, and it can run the workflow that you want it to run. So we flipped it. So the main view you were looking at is the chat, and then the settings is more just a side panel, sort of previewing the changes that it’s making.
So you can introspect on them, or you can also make changes manually if you’d like. But we wanted to design the experience from the get-go so you don’t ever have to touch any of the settings manually; you can just talk to it.
[00:59:39] Sarah Sachs: And the inside baseball is that how this works was probably the launch-blocking part of this build.
Right. Especially ’cause we had a lot of early adopters that were used to the old way, and that’s the benefit of adopting in public. But then changing how people think about setting up custom agents, when they already had this flow, in and of itself was difficult.
[00:59:57] Simon Last: I mean, that’s really funny, ’cause we ended up sort of painfully delaying the launch.
Mm-hmm. By...
[01:00:04] Sarah Sachs: A month?
[01:00:04] Simon Last: A few weeks. Yeah, definitely like a month or so. But the whole team was super enthusiastic about it, though, ’cause it was just so much better. It was like, oh yeah, obviously you have to chat with it, right? Yeah, to set itself up. And everyone was super bullish on that. So it was painful for a second,
but then everyone’s like,
[01:00:19] Sarah Sachs: Right. And back to, you know, organization design, which I probably care about more than Simon: the people that built this are three engineers from three different teams, because we were like, we need to launch this and we need to fix this. And we’ve just built a company where we just put people on it and no one complains, the manager doesn’t complain,
and we were able to unblock and just ship it.
[01:00:37] Alessio: Yeah, yeah. But being in a failure chat and asking it to just fix itself is amazing, versus I gotta copy this and put it in the settings chat. Mm-hmm. Mm-hmm. To do
    [01:00:49] Simon Last: it. So yeah, interesting. There’s a trade off in there that we’re trying to explore, which is, you know, we wanna be a business-enterprise-safe agent where you can delegate something and trust that it’s gonna work.
    But also we want to get some of that sort of bootstrapping power that you feel when you’re coding, like it’s making a browser for itself, right? There’s something there. I think that’s really important. So we’re trying to navigate that trade off and try to get you both.
    [01:01:12] Alessio: Now it’s free, it’s amazing. Uh, I’m worried about when I have to start paying. How do you think about this? So you have Notion credits as a payment for this, which is separate from the usual tokens, uh, that the model generates. How do you design pricing? Value-based pricing based on the task, and things like that?
    [01:01:30] Sarah Sachs: So they are. Um, the credits and payment structure are associated with token usage. The reason that we had to make it not just throughput of tokens is that it’s not always priced that way. Like, our fine-tuned and open source models are served on GPUs, right? Web search is priced differently. You know, if we were to host sandboxes, those are priced differently.
    So we had to think of an abstraction above tokens. And it’s also not just tokens, it’s the token, model, and serving-tier trade off, right? Mm-hmm. Because we can have priority-tier processing, we can have asynchronous processing, the cache rate could be different, um, depending on who uses it when, right?
    And so we wanted to, um, from the get-go, commit to making sure that customers were getting a fair deal. Not necessarily that we were making a ton of money off of it, but that customers were paying for what was reasonable. That’s the fundamental of where we started. And also, you know, we’re selling enterprise SaaS, so we sell credit packs, and you get discounts if you’re an enterprise and you buy a certain amount of credit packs, things like that.
    So it also just helped the sales motion, um, work a little bit easier. So that’s the answer on the abstraction of credits to dollars. Now, was the question how we decide how to price it, or?
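The credit abstraction Sarah describes, priced above raw tokens because model family, serving tier, cache hit rate, and non-token resources like web search all cost differently, can be sketched roughly like this. All names, rates, and multipliers below are hypothetical and invented purely for illustration; Notion's real pricing formula is not public in this form.

```python
# Hypothetical sketch of a credit abstraction layered above raw token counts.
# Every rate and multiplier here is invented; Notion's actual pricing differs.

BASE_RATE = {            # credits per 1K tokens, per model family (hypothetical)
    "frontier": 1.0,
    "open_source": 0.2,  # fine-tuned / open models served on own GPUs
}
TIER_MULTIPLIER = {      # the serving-tier trade-off: priority vs asynchronous
    "priority": 1.5,
    "async": 0.6,
}
SURCHARGE = {            # non-token resources priced differently
    "web_search": 2.0,   # flat credits per call (invented)
    "sandbox": 5.0,      # flat credits per sandbox session (invented)
}

def credits_for_run(model, tier, input_tokens, output_tokens,
                    cached_input_tokens=0, extras=()):
    """Convert one agent run into credits.

    Cached input tokens are discounted (here: 90% off), reflecting how the
    cache hit rate changes the real serving cost.
    """
    uncached = input_tokens - cached_input_tokens
    token_units = (uncached + output_tokens + 0.1 * cached_input_tokens) / 1000
    cost = token_units * BASE_RATE[model] * TIER_MULTIPLIER[tier]
    cost += sum(SURCHARGE[e] for e in extras)
    return round(cost, 2)
```

For example, an asynchronous frontier-model run with 10K input tokens (8K cached), 2K output tokens, and one web search comes out to 4.88 credits under these invented rates.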
    [01:02:34] Alessio: Yeah. I mean, I think all tokens are not made equal, but we obviously get charged mostly equal. Like, you can ask, uh, Codex to create you a dumb tool. Like, I created one for our StarCraft II clan for people to find a game.
    Uh, but then people use it to build features and, like, billion-dollar companies. But the token price is the same.
    [01:02:53] Sarah Sachs: Yeah.
    [01:02:54] Alessio: Like, for you, I can ask this to update my favorite recipes doc, and it’ll do it. But I could ask it to respond to an email from an investor, and the value is very different, you know? And you could charge more, but you’re not necessarily doing it.
    So I’m curious if there was any discussion.
    [01:03:11] Sarah Sachs: I think, I think that, um, that’s not where the market is right now, number one. The second reason that we’re not doing that is it ended up being kind of complicated to figure out what was complicated or not. So at first we were like, let’s just charge on agent runs.
    And you know what, we went through all the different versions, and they ultimately just brought us back to a lot of complexity that mapped directly to token throughput. And so it’s also just simpler. Um, it’s quite difficult, um, to build those pricing systems. And, um, I actually think that one of the biggest reasons we wanted usage-based pricing for this capability is:
    We’ve had our core agent for a while with a model picker, and there were certain models, um, or certain functionality, where we had margins to maintain. And if we wanted to ship this functionality, we couldn’t afford it, it would bankrupt the company. For instance, autofill, the database autofill feature, will soon be agentic, and that will be associated with usage-based pricing.
    Because if every single autofill action was an agent running on Opus on every single database cell, it would be billions of dollars, right? And so we had to find an outlet for the customers that wanted to do more, and wanted to give us their money and pay more, without having to apply it to the lower end of the curve.
    And also, not all knowledge work is equal. Like, there’s different points. A lot of the agent workflows here really saturate model capabilities. Like, you don’t need a complicated model for it. And so charging based on token usage, um, we couldn’t just decide for you that you wanted your email client to be dumb or not, right?
    Like, we want you to decide. If you want to have Opus auto-triage all of your emails, we will actually give you nudges in the product to rethink if that’s the right choice. Right. Um, because also not every user, um,
    [01:04:52] Alessio: understands.
    [01:04:53] Sarah Sachs: You’d be surprised in user interviews. People would be like, oh, I didn’t know that.
    So now we actually have a little hover that tells you if it’s expensive or not. Yeah. I mean, it’s also slower. So the thing that’s interesting is, like, people don’t care about speed in custom agents. And so the incentive of, uh, Haiku being faster, people don’t care when it’s asynchronous. Um, and so we only want to provide the extra benefit that people actually want.
    And the best way to do that is to incentivize them, because it’s their own money.
    [01:05:21] Alessio: It must be confusing for people that are not familiar. It’s like, why is there no 5.3? You know, you open this thing and it’s like, is there something missing? Manual. It’s not their fault. Not their fault.
    [01:05:30] Simon Last: Yeah. That’s just the world we live in now.
    [01:05:32] Alessio: Yeah. There’s a radical jump point too. It’s like, Claude had that.
    [01:05:35] Sarah Sachs: I mean, but auto is heavily, I think what’s actually been hard for us is to convince people that auto is not just our cheapest, dumbest model, but actually the model that’s best for the task that you wanna do. Um, alright. Steve.
    [01:05:46] swyx: I mean,
    [01:05:48] Sarah Sachs: exactly.
    Nice. Um, and a lot of our job is actually figuring out auto because it’s like,
    [01:05:54] swyx: this is the agent lab. Every agent lab has an auto. Mm-hmm.
    [01:05:57] Sarah Sachs: Yeah. And
    [01:05:58] swyx: that’s the job.
    [01:05:58] Sarah Sachs: Exactly. Because if you think about, like I said, I come from Robinhood. Like, you could spend a lot of time keeping up with the markets, or you could have auto-investing, right?
    And you can have an index fund or you can have
    [01:06:12] swyx: roboadvisors
    [01:06:12] Sarah Sachs: of the robo advisor. And so like at a certain point we also can be roboadvisors and like we have a lot of people figuring out what model is best for the right task. And we now, we’re not using auto as a, as a margin maker, we’re just using it to kind of reduce stress.
    It’s not opus, that’s for sure. Yeah. Because a majority of the tasks people are doing aren’t opus level, um, intelligence.
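The "auto" routing Sarah describes, picking the model that's best for the task rather than defaulting to the most capable one, might look something like this in miniature. The model names, capability tiers, and the difficulty heuristic below are all invented; a real router would use a learned classifier, not keyword matching.

```python
# Hypothetical "auto" router: pick the cheapest model whose capability
# covers the task, rather than defaulting to the frontier model.
MODELS = [
    # (name, capability score, relative cost) -- invented values
    ("small-fast", 1, 1),
    ("mid", 2, 5),
    ("frontier", 3, 30),
]

def estimate_difficulty(task: str) -> int:
    """Crude difficulty heuristic standing in for a learned classifier."""
    hard_markers = ("plan", "multi-step", "analyze", "refactor")
    if any(m in task.lower() for m in hard_markers):
        return 3
    if len(task.split()) > 30:
        return 2
    return 1

def auto_route(task: str) -> str:
    """Cheapest model that meets the estimated capability requirement."""
    need = estimate_difficulty(task)
    for name, capability, _cost in sorted(MODELS, key=lambda m: m[2]):
        if capability >= need:
            return name
    return MODELS[-1][0]  # fall back to the most capable model
```

Under this sketch, "update my favorite recipes doc" routes to the small model, while a task mentioning "analyze" or "plan" routes to the frontier tier, echoing the point that most tasks aren't Opus-level.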
    [01:06:34] Simon Last: The other thing I would say is, um, you know, unlike a lab, we aren’t incentivized just for you to use as many tokens as possible. We’re actually really interested in giving you the right tool for the job.
    A lot of the time, the right tool for the job is actually just writing code and not even using an agent at all. So that’s something that we’re investing in a lot. Like, you know, imagine your agent can actually automate itself out of a job. Right. We would love if that were true.
    [01:06:58] Sarah Sachs: I feel very strongly about this, because I don’t necessarily feel like those are the SKUs that frontier labs give you.
    I feel like they are just getting more and more capable and more and more expensive, which is fantastic for the use cases where people wanna do really complicated things in Notion. Um, what’s difficult is that the market right now is the no-man’s-land of where reasoning models were six months ago, which the Nanos, the Haikus, et cetera, haven’t caught up to, because now we’re just paying more for extra capability that we didn’t necessarily need, and so are our customers.
    Mm-hmm. And, um, labs aren’t necessarily incentivized right now, with how few players there are, to be meeting the market everywhere. They just need to be the cheapest. They don’t need to be at the value point that the customer wants.
    [01:07:41] swyx: Hmm.
    [01:07:42] Sarah Sachs: If no one’s cheaper than them, then they’re the cheapest and that’s good enough.
    Right. And so we’re doing a lot to make sure that we have the right optionality, um, to switch between models, and also invest in open source, because the open source models actually are getting to be where reasoning models were three, four months ago. And, um, that’s what’s filling that gap right now.
    So you’ll see we offer MiniMax, and, um, we are collaborating a lot with different open source labs to think about Notion’s Last Exam and how they can do better on these types of tasks. Mm-hmm. So that we can offer them for that intelligence-to-price-to-latency trade off. Because, you know, in that triangle of intelligence, price, and latency, users get to choose where they are, but right now the whole triangle isn’t filled with models, right?
    Yeah. And models really cluster in capability. I mean, Haiku’s not that much cheaper. No one’s really in the middle. Like, people really tend to cluster around two points. Mm-hmm. Like, this is really capable and it’s really expensive, or it’s really fast, or whatever.
    Right. And so we just wanna make sure that that triangle’s filled, um, and we wanna offer the models that fill it, and we wanna guide users to understand when they need it. Yeah. Um, which one,
    [01:08:54] swyx: I mean, all I’m hearing is that someday you’re gonna train your model. You have lots of tokens.
    [01:09:01] Sarah Sachs: I don’t know if, what do you mean by train your model?
    You train
    [01:09:03] swyx: your
    [01:09:03] Sarah Sachs: own, train your own model? I don’t know if we have money to train a foundation model. I mean,
    [01:09:06] Alessio: you go raise
    [01:09:07] swyx: it. Yeah. You, you can raise it.
    [01:09:09] Sarah Sachs: That’s your job, Simon. No, I, I don’t think that that needs to be our core competency.
    [01:09:14] swyx: This is usually the, the thought process that leads to like, well, no one else is doing it.
    We, we will take a crack. You know,
    [01:09:19] Simon Last: I think, yeah. I mean, I feel like, to the extent that we do anything like training, the area I’m actually most excited about is less of, like, one big model for all the users, but, as it becomes more possible to do, making a specific fine-tune that really knows your context: you know, your company, the people that work at your company, what’s going on.
    I think that’s pretty interesting, because if you had a model that really knows your company, I think that would be a huge quality uplift.
    [01:09:47] Sarah Sachs: We actually have some enterprise vendors that kind of ask about this, um, along with bring our own key. Like if I have a model that really understands like my enterprise that we’re training for all these reasons, these tend to be like quite large institutions thinking about how to let people bring their own models.
    But those models have to function with like
    [01:10:04] swyx: right
    [01:10:04] Sarah Sachs: understanding how to call our tools. And that’s where, again, having a more public system prompt is beneficial to Notion, right? Um, we want all models to plug into Notion as well as they can. Um, that being said, of course there are certain aspects of Notion where we do fine-tuning and reinforcement fine-tuning on our own capabilities.
    Um, but that’s not necessarily trained on user data. Um, you don’t need that much data, um, in the first place. And that’s where, when we have a data scientist and a model behavior engineer really understand where the capability gap is, that’s when we invest there.
    [01:10:38] Simon Last: I personally burned a lot of time trying to train models.
    Uh, and it’s tempting, right? It’s so tempting, retraining
    [01:10:46] Sarah Sachs: every day.
    [01:10:47] Simon Last: I was doing crazy amount. Yeah, I was doing a lot of different things. Um, and it, I
    [01:10:50] Sarah Sachs: was the budget person that came and found out. I showed up when I heard that that was happening and called a time-
    [01:10:55] Simon Last: out. You know, like a, a funny thing, ‘cause it’s sort of an arc that looped on itself, is, uh, you know, back when I was doing tons of training stuff, it takes a long time to do any kind of training run.
    And so you end up operating, like, 24/7, around the clock. Like, it becomes very important that before you go to sleep, everything is watched in TensorBoard and all the experiments are started. And then as I stopped training, that kind of went away. But now the coding agents have totally brought this back.
    Mm-hmm. So now every night before I go to bed, I’m like, okay, did I start enough agents, you know, to get everything done? So it’s an interesting art,
    [01:11:26] swyx: this balance of, like, you have to try polyphasic sleep so you can wake up every two hours.
    [01:11:29] Simon Last: Absolutely. Yeah. Yeah. We, uh, yeah, I have not gone there yet, but, but my goal these days is just to, before I go to bed.
    The agents are running, and I’m confident that they won’t be done by the time I wake up. Really
    [01:11:41] swyx: Eight
    [01:11:42] Simon Last: hours.
    [01:11:42] Sarah Sachs: There’s a, I won’t say which coding frontier lab, but there was a point where he had, like, outlived the thread length and context length that that coding agent provided. And he DMed them being like, hey, I need, I need more.
    And our account rep DMed me directly and they’re like, is Simon trying to prove string theory? Like, what is he doing?
    [01:12:00] Simon Last: Yeah. I, I had a single coding agent thread going for, I think it was, like, 17 days. Uh, pretty much continuously.
    [01:12:06] swyx: Don’t, don’t they just compress? I mean, yeah.
    [01:12:08] Simon Last: Yeah. It was actually just a bug.
    It was a harness bug. Yeah. It, it had done compaction like a hundred times probably.
    [01:12:13] swyx: Yeah. The
    [01:12:14] Sarah Sachs: other thing that, um, reminded me about fine tuning, that I think you and I have aligned on, is that our tools change really frequently, and right now we spend a lot of time rethinking and building tools for capability. And fine-tuning a model, um, to understand your tools, like, we don’t have legal expertise or coding expertise. So if we were to fine tune a model, it would either be expertise about the enterprise, and, you know, we have ZDR, zero data retention, offerings for those enterprises, so we’d have to really rethink how we structure it if an enterprise wanted to opt into that; or it would be fine tuning better capability on navigating our tools, which doesn’t match the velocity with which we create new tools.
    And so it would actually really slow us down, um, to have a model that was fine tuned on our tools, because we’d have to retrain it and cut a new model every time we did that. And that’s not how we’re set up right now. Um, particularly with the way that we’re changing our tools. I, I guess we could fine tune a model to, like, search for tools.
    It’s just, the amount of time it takes to do that, ship it, have the right system, you’re basically making a bet against a frontier capability not serving that in the time it takes you to build it. Mm-hmm. And that, that time lag hasn’t happened for us yet. It hasn’t
    [01:13:17] Simon Last: been, yeah. It’s just the wrong trade off.
    I think it’s just, like, yeah. We literally change our tools every single day, and if we notice an issue, we’ll fix the problem. A good way to think about it, that I think is pretty fruitful, is: don’t focus too much on training. I would think of that as an implementation detail.
    What’s the outer loop, right? Like, the outer loop is you have a model, and then some harness or system where it’s interacting with the system that needs to work. And, you know, if there’s a problem, the way to solve the problem isn’t necessarily to train a model. It’s like, oh, maybe there’s just a bug in one of the tools.
    Right? And actually, 99% of the time it’s a bug in one of the tools, right? And so just fix the bug. And then the outer loop thing that’s really fruitful to think about is, like, how can you improve your velocity and robustness? Making really good tools, making a good harness, you know, like verifying it works.
    Hmm.
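Simon's "outer loop", checking whether a tool is broken before reaching for better prompting or retraining, can be sketched as a tiny diagnostic routine. Everything here is a stub; the function names and failure routing are hypothetical, not Notion's actual harness.

```python
# Sketch of the "outer loop": when an agent run fails, first check whether
# a tool is broken before blaming the model. All components are stubs.

def run_agent(task, tools):
    """Stand-in for a real agent loop; here it just exercises each tool once
    and records which ones raised."""
    errors = []
    for name, tool in tools.items():
        try:
            tool(task)
        except Exception as exc:
            errors.append((name, str(exc)))
    return {"ok": not errors, "tool_errors": errors}

def diagnose(result):
    """Route the failure: tool bug first, everything else second."""
    if result["ok"]:
        return "success"
    if result["tool_errors"]:
        # "99% of the time it's a bug in one of the tools" -- fix the tool.
        return f"fix tool: {result['tool_errors'][0][0]}"
    return "inspect harness/model"
```

The point of the sketch is the ordering in `diagnose`: tool bugs are checked before anything model-related, matching the "fix the bug, then improve the harness" loop described above.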
    [01:14:07] Sarah Sachs: The one place that we do invest more in model training now, though, is actually in retrieval. Because, um, we’re at a point right now in our business and enterprise, our AI-enabled plans, where the majority of the search load and the search traffic is coming from agents, not humans. So the queries hitting our Elasticsearch or our vector indices aren’t coming from humans.
    And the queries are structured differently, and what’s returned has a different requirement. Positional ranking matters less, but top-K retrieval matters more. Right.
    [01:14:34] swyx: Isn’t top K a form of position?
    [01:14:36] Sarah Sachs: Of course it is. But um, when you’re training on like click through rate, it’s really, you know,
    [01:14:41] swyx: yeah.
    [01:14:41] Sarah Sachs: It matters much less.
    Number one through number six is very different
    [01:14:44] swyx: Yeah.
    [01:14:44] Sarah Sachs: Than it needs to be in the top 100.
    [01:14:45] swyx: Like the slope is just,
    [01:14:46] Sarah Sachs: yeah.
    [01:14:46] swyx: Higher.
    [01:14:47] Sarah Sachs: It’s a different optimization function for retrieval, um, model. Similarly, uh, what snippet you include matters more or less. Right. So we are rethinking a lot of that functionality, um, to work with how the agents like to write queries and how, um, they wanna, uh, receive information.
    Yeah. So we are doing another kind of reinvestment into rethinking not only search, um, how agents do searches versus how humans do searches, um, but we’re also investing in indexing different things now. Because, uh, how do you index the setup generator for Notion agent? It kind of breaks our block model entirely, um, where all blocks are nested in each other. Same with meeting notes. Um, and so we, I mean, we’re hiring ranking engineers and model training engineers, but it’s primarily on ranking.
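The shift Sarah describes, positional ranking mattering less and top-K retrieval mattering more once agents are the main search consumers, amounts to optimizing a different metric. A small illustration contrasting a position-discounted score with set-based recall (the formulas are standard IR definitions; the example rankings are invented):

```python
# Contrast the two retrieval objectives mentioned: human search cares about
# position (NDCG-style discounting), agent search mostly cares that relevant
# docs land anywhere in the top K (recall@K).
import math

def ndcg_at_k(relevances, k):
    """Position-discounted gain: moving a hit from rank 1 to rank 6
    noticeably lowers the score."""
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))
    ideal = sorted(relevances, reverse=True)
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg else 0.0

def recall_at_k(relevances, k, total_relevant):
    """Set-based: only whether relevant items made the top K at all."""
    return sum(1 for rel in relevances[:k] if rel > 0) / total_relevant

# Same single relevant document, at rank 1 vs rank 6:
front = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
back  = [0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
```

Both rankings score identical recall@10 (the agent's view), while NDCG@10 drops sharply for the rank-6 case (the human click-through view), which is why "number one through number six" matters much less for agent traffic.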
    [01:15:32] swyx: Yeah. Does ranking map to recsys for you? It does, right? Recommendation systems.
    [01:15:36] Sarah Sachs: Yeah. Um, yes.
    [01:15:38] swyx: Right. Okay. I say this because I’m trying to promote recsys more in general, ‘cause it is weirdly unpopular.
    [01:15:45] Sarah Sachs: I don’t know why. Um, but the other thing is that, like, I I was just talking about this with a peer, like how much is ranking important versus like, uh, being able to do parallel exhaustive queries. Right. Um, so we’re also, they’re both important. They’re both important, but like they’re both two tools to the same user outcome or the same agent outcome.
    Uhhuh. Right. And so, um, that’s something that we’re also rethinking a lot. We just did an experiment on, um, Notion ranking at this point. Um, for Notion retrieval, vector embeddings matter less and less.
    [01:16:15] swyx: Did you see that? Yeah. Notion just, uh, went to night mode.
    [01:16:19] Alessio: We’ve gone so long it became dark mode.
    [01:16:21] Sarah Sachs: We’re working the night shift for you.
    Right? Looks
    [01:16:23] Simon Last: pretty good. I’m not seeing any bug.
    [01:16:24] swyx: You know, I worked on this, like, parallel search thing where you fan out to eight different queries, right? Yes. And so you actually need to use the model to work on query diversity so that you cover the right search space.
    [01:16:35] Sarah Sachs: And so like the people that are working on, um, ranking and retrieval are the same people working on what query generation is.
    It’s all one, uh, journey. Yeah. We call it agentic find. And we’re actually realizing, for instance, that it’s less about selection. Like, we don’t spend a lot of time trying to optimize what vector embedding we use anymore. That was a period of time, but that’s just not the right lever of optimization.
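The fan-out-with-diversity idea swyx mentions can be sketched as: generate several query rewrites, drop near-duplicates so the fan-out actually covers different parts of the search space, then run them in parallel. The rewrite list, the Jaccard-overlap check, and the 0.6 threshold are all invented stand-ins; a real system would use a model for rewrites and embeddings for similarity.

```python
# Sketch of parallel query fan-out with a diversity filter. The query
# generator and search backend are stubs standing in for model/index calls.
from concurrent.futures import ThreadPoolExecutor

def generate_rewrites(question):
    # Stand-in for an LLM prompted to produce diverse reformulations.
    return [
        "notion credit pricing",
        "credit pricing notion",      # same words reordered: a near-duplicate
        "how are agent runs billed",
        "token cost per model tier",
    ]

def too_similar(a, b, threshold=0.6):
    """Crude Jaccard word overlap; a real system might embed and compare."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb) >= threshold

def diverse_queries(question):
    kept = []
    for q in generate_rewrites(question):
        if not any(too_similar(q, k) for k in kept):
            kept.append(q)
    return kept

def search(query):
    return f"results for: {query}"    # stand-in for the retrieval backend

def parallel_search(question):
    queries = diverse_queries(question)
    with ThreadPoolExecutor() as pool:
        return list(pool.map(search, queries))
```

In this sketch the reordered duplicate is filtered out, so the fan-out spends its parallel searches on three genuinely different queries instead of four overlapping ones.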
    [01:16:55] swyx: Yeah. Right. Yeah. Okay. Uh, we’ve gone long. I have to talk about Notion meeting notes, and then we can call it there. Uh, you just have a lot of comments. Uh, I don’t know where you wanna start. Um, is it the audio side? Is it the meeting notes summarization? Yeah.
    [01:17:12] Simon Last: Sort of like what makes it work or
    [01:17:13] swyx: No, just anything sort of interesting technically, right? Like, I think you had some, uh, bookmarked points. I always call these check marks: along the way, when a guest says something that they wanna return to later, I just check mark it. Yeah.
    I’m like, okay, we’ll come back to it. Um,
    [01:17:26] Sarah Sachs: Meeting notes was one of those things where at first we were nervous that we’d have to teach people a different way to work, and that that was a lot of user friction. But they’re one of our biggest growth levers; in terms of virality of adoption and retention, they’re quite strong. Um, and so we’ve invested more and more as we did that. I think what’s really powerful about it is, again, Notion is the system of record of where and how you work. The way that I use meeting notes is: every one-on-one and meeting I have is a meeting note.
    When I do my performance review, my self-review, I say: primarily look at all my conversations with my manager and, like, write up what I did this year, right? Because if I didn’t talk about it in my one-on-one with my manager, it probably wasn’t relevant for my performance review. So it also just adds a ton of signal on prioritization that’s really helpful for a good system of record.
    That’s really helpful for our agent. It’s also caused a lot of scaling work for search and for the agent. Um, and you know, it’s just an explosion of content when you have transcripts like that. Um, how we do compaction, a lot of that was triggered by meeting notes passed into context, things like that.
    Um, so it’s been a good impetus for us to think about longer-form content, when you think of it as a primary primitive. But it’s been one of the most powerful signals for our agent. Um, because it’s
    [01:18:44] swyx: unsurprising. Right? Right. And
    [01:18:45] Sarah Sachs: you’re
    [01:18:45] swyx: capturing a whole new thing.
    [01:18:46] Sarah Sachs: So it’s like our own data. Like we want users like, or they’re creating their own data flywheel, right?
    [01:18:51] swyx: Like it serves me to prefer notion, uh, to put all my stuff because it has my other stuff.
    [01:18:57] Sarah Sachs: Totally. I mean, the way that our teams run right now is, you know, there’s a custom agent that does a pre-read before standup. It looks through all of Slack and GitHub, and it creates a summary, and it creates a meeting note, and it says, everyone do this pre-read.
    Then we just press play. We have the meeting, we talk through the pre-read, we talk about what needs to happen next. And then we have a custom agent, integrated with our calendar and triggers, that then files tasks for tomorrow or today based on what we spoke about, and, um, sends off the Slack messages that we decided in the meeting needed to be follow-ups.
    Like our meetings are hands off keyboard and we’re focused on, um, the root of the problem, not the bookkeeping around the problem.
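The standup workflow Sarah walks through, a pre-read agent that gathers Slack/GitHub context, then a post-meeting agent that turns decisions into tasks and follow-up messages, could be skeletonized like this. Every integration is a stub, and all function names, markers, and data shapes are hypothetical.

```python
# Sketch of the meeting workflow described above, with every integration
# stubbed out: a pre-read agent gathers context before standup, and a
# post-meeting agent files tasks and queues follow-up messages.

def gather_context():
    # Stand-ins for Slack and GitHub connectors.
    return {"slack": ["deploy blocked on review"], "github": ["PR #12 merged"]}

def summarize(context):
    # Stand-in for a model call producing the pre-read summary.
    lines = [item for items in context.values() for item in items]
    return "Pre-read: " + "; ".join(lines)

def pre_read_agent():
    """Runs before standup: gather context, produce the pre-read note."""
    return summarize(gather_context())

def post_meeting_agent(transcript):
    """Runs after the meeting: crude extraction where lines flagged as
    decisions become tasks and flagged follow-ups become Slack messages."""
    tasks = [line for line in transcript if line.startswith("DECIDE:")]
    followups = [line for line in transcript if line.startswith("FOLLOWUP:")]
    return {"tasks_filed": tasks, "messages_sent": followups}
```

A real version would replace the `DECIDE:`/`FOLLOWUP:` string markers with model-driven extraction from the transcript, but the pipeline shape, context in, summary out, then decisions routed into tasks and messages, is the same.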
    [01:19:32] Simon Last: One thing that, uh, the meeting notes team shipped recently that’s been blowing my mind is they made it so that when it makes the summary, it’ll actually at-mention the people that were referenced in it.
    So I now get notifications whenever someone talks about me in a meeting. Yeah. I
    [01:19:46] Sarah Sachs: feel like that one
    [01:19:47] Simon Last: was, it’s like, it’s like, oh, you know. Simon is working on this. Okay, I’m gonna, it’s actually amazing how, because then I’m like, oh, okay, cool. I’m gonna go talk to them about that.
    [01:19:55] swyx: Right? What if there are two Simons?
    [01:19:56] Simon Last: Um,
    [01:19:57] Sarah Sachs: No wait, so wait. It’s powered by the agent, so it’s doing it agentically. I don’t know if this is shipped yet; it will be. When you look at it thinking, when it’s doing the summarization, it’s saying: figuring out who Simon
    [01:20:07] swyx: is most probable Simon
    [01:20:08] Sarah Sachs: is. Yeah. Um, and we also have, like, a people-to-people similarity cache and stuff like that.
    Yeah, yeah. On the here’s we sort of like,
    [01:20:15] Simon Last: we also, like, generate a profile for each person, and use that. Um, yeah. I mean, of course it can get it wrong, but the goal is for it not to get it
    [01:20:22] Sarah Sachs: wrong. Meeting notes is just the agent primitive packaged on top of a transcription primitive. Yeah. Yeah. And then a vertical team.
    It’s probably one of the only teams at Notion that’s completely a vertical team around quality and product like UX design. ‘cause it’s still a Tiger team. Um, with a fantastic manager, Zach, that joined recently, um, from Embr, but, um,
    [01:20:40] swyx: Zachar.
    [01:20:41] Sarah Sachs: Yeah.
    [01:20:42] swyx: Yeah. I, uh, chatted with him when he was talking about what he was working on.
    [01:20:45] Sarah Sachs: Yeah. So he’s, he’s managing that team now and thinking about it as data capture. That’s what meeting notes is: data capture. It gets
    [01:20:50] swyx: all
    [01:20:51] Sarah Sachs: the kinds of, kind of, reframing, um, where meeting notes are valuable as a data capture problem, and then working inside, um, like, the summarization used to not be agentic.
    Yeah. Now it is because it does all the things like figure out who the right Simon is. And one day you can have a custom agent directly integrated in it that knows like what task database the meeting is referring to. And as you’re having the meeting perhaps update the tasks and things like that. Like there’s a, there’s a lot of that experience of where we do our work in meetings that we wanna invest in.
    Making more seamless.
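The "which Simon?" step described above, resolving an at-mention against per-person profiles plus a people-similarity cache, might look like this in miniature. The profile keywords, person IDs, and overlap scoring are all invented for illustration, not Notion's actual implementation.

```python
# Sketch of at-mention disambiguation: score each candidate "Simon" against
# the meeting context using a per-person profile. All data here is invented.

PROFILES = {
    # person id -> profile keywords (stand-in for a generated profile
    # backed by a people-to-people similarity cache)
    "simon.last":  {"agents", "models", "training", "harness"},
    "simon.other": {"billing", "finance", "invoices"},
}

def most_probable_mention(name_fragment, context_words):
    """Pick the candidate whose profile best overlaps the meeting context."""
    candidates = {pid: prof for pid, prof in PROFILES.items()
                  if pid.startswith(name_fragment.lower())}
    return max(candidates,
               key=lambda pid: len(candidates[pid] & set(context_words)))
```

A meeting about training and harnesses resolves "Simon" to one profile, a billing discussion to the other, which is the profile-overlap idea in its simplest form; a production system would use embeddings rather than keyword sets.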
    [01:21:18] swyx: Yeah. Uh, OpenAI is doing hardware. Uh, would you ever ship one of these?
    [01:21:22] Simon Last: Yeah, probably not,
    [01:21:23] Sarah Sachs: but one of those.
    [01:21:23] swyx: But you know, this, this is meeting notes in person.
    [01:21:25] Simon Last: Yeah. Yeah. I, I’d be excited about, I mean, I’m excited about that, that product category in general for sure. Yeah.
    [01:21:31] Sarah Sachs: I think it’s, it’s a mechanism, and one of those needs to work really well with Notion. We would partner with whoever’s building one of those, I think. Yeah. This is
    [01:21:40] swyx: Bee. They, they were bought by Amazon. I don’t know. I, I can refer you.
    [01:21:43] Sarah Sachs: And there’s some wild companies doing really cool things that come to our partnerships team. I always like to sit in on the demos of wearables, ‘cause I think they’re, oh, okay, pretty cool. And all of them want to make sure, not just Notion, but you can imagine the ones that talk to you. Yeah, yeah. Um, being able to do search and build context. So, like, if you’re entering a conference, being able to look at your CRM and do things like that.
    Um, and you can utilize the Notion agent to do that. So we are in like the very beginnings of those partnerships. I think what’s unique about that particular technology is it goes against what I talked about with custom agents right now, which is the more simple it is, the harder it is to have like advanced controls over its capabilities.
    Right? And so that would be a great investment for data capture, but not necessarily like our agent is workflows.
    [01:22:26] Simon Last: It’s solving a different slice of the problem, I would say. Yeah. Like, that’s gonna be deeply personal. Like, your company’s not gonna force you to wear a wristband. Right. I, I think
    [01:22:35] Sarah Sachs: it’s good to hear that from me.
    From you. Yeah.
    [01:22:38] Simon Last: Yeah. The CEO’s gonna force everyone to wear a wristband? Look, I mean, the slice of the problem that we care about is, like, you know, can the company have all the context of what everyone said at every single meeting, and then use that to derive value for themselves.
    [01:22:52] Sarah Sachs: It kinda reminds me, I remember once you very strongly reminded me: our job is not to make the best harness for agentic work, our job is to be the best place where people collaborate. It’s like, our job isn’t to build the best wearable to capture meeting notes. Our job is to build the best place where meeting notes live. Right?
    [01:23:11] swyx: Yeah. So it basically, you’re saying everyone else can just pipe to you and it’s fine, right?
    Yeah, yeah, yeah. That’s, that’s a reasonable thing. All I’ll say is that people, there’s people walking around with notion tattoos on them. They, they’ll wear notion anything. So just, I don’t know, do a limited run.
    [01:23:24] Simon Last: Yeah, yeah. No, I mean,
    [01:23:27] Sarah Sachs: we have such understated swag. Like, our swag has so few Notion logos on it.
    The idea that people have Notion tattoos is pretty antithetical to our design principles, so that’s pretty funny.
    [01:23:38] Simon Last: Yeah.
    [01:23:39] Sarah Sachs: Do you have one?
    [01:23:40] Simon Last: No, I do not have a Notion tattoo. I’ve, I’ve seen them. Yeah.
    [01:23:44] swyx: Cool. Uh, well, thank you so much. This was such a great deep dive, actually. The chemistry between you two is amazing. Like, I, I can’t believe, like
    Like, I, I can’t believe, like
    [01:23:51] Sarah Sachs: we work together a lot. Yeah. Different jobs. Work closely.
    [01:23:55] swyx: Yeah.
    [01:23:55] Alessio: That’s it. Yeah. Thank you. Thank you.
    [01:23:57] Sarah Sachs: Thanks. Thank you.


    This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe
  • Latent Space: The AI Engineer Podcast

    Extreme Harness Engineering for Token Billionaires: 1M LOC, 1B toks/day, 0% human code, 0% human review — Ryan Lopopolo, OpenAI Frontier & Symphony

    07/04/2026 | 1 h 12 min
    We’re proud to release this ahead of Ryan’s keynote at AIE Europe. Hit the bell, get notified when it is live! Attendees: come prepped for Ryan’s AMA with Vibhu after.
    Move over, context engineering. Now it’s time for Harness engineering and the age of the token billionaires.
    Ryan Lopopolo of OpenAI is leading that charge, recently publishing a lengthy essay on Harness Eng that has become the talk of the town:
    In it, Ryan peeled back the curtains on how the recently announced OpenAI Frontier team have become OpenAI’s top Codex users, running a >1m LOC codebase with 0 human written code and, crucially for the Dark Factory fans, no human REVIEWED code before merge. Ryan is admirably evangelical about this, calling it borderline “negligent” if you aren’t using >1B tokens a day (roughly $2-3k/day in token spend based on market rates and caching assumptions):
    Over the past five months, they ran an extreme experiment: building and shipping an internal beta product with zero manually written code. Through the experiment, they adopted a different model of engineering work: when the agent failed, instead of prompting it better or to “try harder,” the team would look at “what capability, context, or structure is missing?”
    The result was Symphony, “a ghost library” and reference Elixir implementation (by Alex Kotliarskyi) that sets up a massive system of Codex agents all extensively prompted with the specificity of a proper PRD spec, but without full implementation:
    The future starts taking shape as one where coding agents stop being copilots and start becoming real teammates anyone can use, and Codex is doubling down on that mission with its Super Bowl messaging of “you can just build things”.
    Across Codex, internal observability stacks, and the multi-agent orchestration system his team calls Symphony, Ryan has been pushing what happens when you optimize an entire codebase, workflow, and organization around agent legibility instead of human habit.
    We sat down with Ryan to dig into how OpenAI’s internal teams actually use Codex, why the real bottleneck in AI-native software development is now human attention rather than tokens, how fast build loops, observability, specs, and skills let agents operate autonomously, why software increasingly needs to be written for the model as much as for the engineer, and how Frontier points toward a future where agents can safely do economically valuable work across the enterprise.
    We discuss:
    * Ryan’s background from Snowflake, Brex, Stripe, and Citadel to OpenAI Frontier Product Exploration, where he works on new product development for deploying agents safely at enterprise scale
    * The origin of “harness engineering” and the constraint that kicked off the whole experiment: Ryan deliberately refused to write code himself so the agent had to do the job end to end
    * Building an internal product over five months with zero lines of human-written code, more than a million lines in the repo, and thousands of PRs across multiple Codex model generations
    * Why early Codex was painfully slow at first, and how the team learned to decompose tasks, build better primitives, and gradually turn the agent into a much faster engineer than any individual human
    * The obsession with fast build times: why one minute became the upper bound for the inner loop, and how the team repeatedly retooled the build system to keep agents productive
    * Why humans became the bottleneck, and how Ryan’s team shifted from reviewing code directly to building systems, observability, and context that let agents review, fix, and merge work autonomously
    * Skills, docs, tests, markdown trackers, and quality scores as ways of encoding engineering taste and non-functional requirements directly into context the agent can use
    * The shift from predefined scaffolds to reasoning-model-led workflows, where the harness becomes the box and the model chooses how to proceed
    * Symphony, OpenAI’s internal Elixir-based orchestration layer for spinning up, supervising, reworking, and coordinating large numbers of coding agents across tickets and repos
    * Why code is increasingly disposable, why worktrees and merge conflicts matter less when agents can resolve them, and what it really means to fully delegate the PR lifecycle
    * “Ghost libraries”, spec-driven software, and the idea that a coding agent can reproduce complex systems from a high-fidelity specification rather than shared source code
    * The broader future of Frontier: safely deploying observable, governable agents into enterprises, and building the collaboration, security, and control layers needed for real-world agentic work
    Ryan Lopopolo
    * X: https://x.com/_lopopolo
    * Linkedin: https://www.linkedin.com/in/ryanlopopolo/
    * Website: https://hyperbo.la/contact/

    Timestamps
    00:00:00 Introduction: Harness Engineering and OpenAI Frontier
    00:02:20 Ryan’s background and the “no human-written code” experiment
    00:08:48 Humans as the bottleneck: systems thinking, observability, and agent workflows
    00:12:24 Skills, scaffolds, and encoding engineering taste into context
    00:17:17 What humans still do, what agents already own, and why software must be agent-legible
    00:24:27 Delegating the PR lifecycle: worktrees, merge conflicts, and non-functional requirements
    00:31:57 Spec-driven software, “ghost libraries,” and the path to Symphony
    00:35:20 Symphony: orchestrating large numbers of coding agents
    00:43:42 Skill distillation, self-improving workflows, and team-wide learning
    00:50:04 CLI design, policy layers, and building token-efficient tools for agents
    00:59:43 What current models still struggle with: zero-to-one products and gnarly refactors
    01:02:05 Frontier’s vision for enterprise AI deployment
    01:08:15 Culture, humor, and teaching agents how the company works
    01:12:29 Harness vs. training, Codex model progress, and “you can just do things”
    01:15:09 Bellevue, hiring, and OpenAI’s expansion beyond San Francisco
    Transcript
    Ryan Lopopolo: I do think that there is an interesting space to explore here with Codex, the harness, as part of building AI products, right? There’s a ton of momentum around getting the models to be good at coding. We’ve seen big leaps in, like, the task complexity with each incremental model release, where if you can figure out how to collapse a product that you’re trying to build, a user journey that you’re trying to solve, into code, it’s pretty natural to use the Codex harness to solve that problem for you. It’s done all the wiring and lets you just communicate in prompts. To let the model cook, you have to step back, right? Like, you need to take a systems-thinking mindset to things and constantly be asking: where is the agent making mistakes?
    Where am I spending my time? How can I not spend that time going forward? And then build confidence in the automation that I’m putting in place, so I have solved this part of the SDLC.
    swyx: [00:01:00] All right.
    [00:01:03] Meet Ryan
    swyx: We’re in the studio with Ryan from OpenAI. Welcome.
    Ryan Lopopolo: Hi,
    swyx: Thanks for visiting San Francisco and thanks for spending some time with us.
    Ryan Lopopolo: Yeah, thank you. I’m super excited to be here.
    swyx: You wrote a blockbuster article on harness engineering. It’s probably going to be the defining piece of this emerging discipline, huh?
    Ryan Lopopolo: Thank you. It’s been fun to feel like we’ve defined the discourse in some sense.
    swyx: Let’s contextualize a little bit. This is the first podcast you’ve ever done. Yes. And thank you for spending time with us. Where is this coming from? What team are you on, all that jazz?
    Ryan Lopopolo: Sure, sure.
    Ryan Lopopolo: I work on Frontier Product Exploration, new product development in the space of OpenAI Frontier, which is our enterprise platform for deploying agents safely at scale, with good governance, in any business. And the role of my team has been to figure out novel ways to deploy our models into packaged products that we can sell as solutions to enterprises.
    swyx: And you have a background, I’ll just squeeze it in there: Snowflake, Brex, [00:02:00] Stripe, Citadel.
    Ryan Lopopolo: Yes. Yes. Same. Any kind of customer
    swyx: entire life. Yes. The exact kind of customer that you want to,
    Vibhu: So I’ll say, I actually didn’t expect the background. When I looked at your Twitter, I’m seeing the opposite, stuff like this. So you’ve got the mindset of, like, full-send AI coding, stuff about slop, like buckling in your laptop in your Waymo. Yes. And then I look at your profile, I’m like, oh, you’re on the other end too. Oh, perfect. Makes perfect sense.
    Ryan Lopopolo: It’s quite fun to be an AI maximalist. If you’re gonna live that persona, OpenAI is the place to do it.
    swyx: Token billionaire is what you say.
    Ryan Lopopolo: Yeah. Certainly helps that we have no rate limits internally, and I can go, like you said, full send at this stage.
    swyx: Yeah. Yeah. So the Frontier, and you’re a special team within OpenAI Frontier.
    Ryan Lopopolo: We had been given some space to cook, which has been super, super exciting.
    [00:02:47] Zero Code Experiment
    Ryan Lopopolo: And this is why I started with kind of an out-there constraint: to not write any of the code myself. I was figuring if we’re trying to make agents that can be deployed into enterprises, they should be [00:03:00] able to do all the things that I do. And having worked with these coding models, these coding harnesses over 6, 7, 8 months, I do feel like the models are there enough, the harnesses are there enough, where they’re isomorphic to me in capability and the ability to do the job.
    So starting with this constraint of I can’t write the code meant that the only way I could do my job was to get the agent to do my job.
    Vibhu: And just a bit of background before that. This is basically the article. So what you guys did is five months of working on an internal tool, zero lines of human-written code, over a million lines of code in the total code base.
    You say it was 10x, more like it was 10x faster than you would’ve been if you had done it by hand. So
    Ryan Lopopolo: yeah, that
    Vibhu: was the mindset going into this, right?
    Ryan Lopopolo: That’s right.
    [00:03:46] Model Upgrades Lessons
    Ryan Lopopolo: Started with some of the very first versions of Codex CLI, with the Codex Mini model, which was obviously much less capable than the ones we have today.
    Which was also a very good constraint, right? Quite a visceral feeling to ask the [00:04:00] model to build you a product feature. And it just not being able to assemble the pieces together.
    Which kind of defined one of the mindsets we had going into this, which is: whenever the model just cannot, you pop open the task, double click into it, and build smaller building blocks that you can then reassemble into the broader objective.
    And it was quite painful to do this. Honestly, the first month and a half was. 10 times slower than I would be. But because we paid that cost, we ended up getting to something much more productive than any one engineer could be because we built the tools, the assembly station for the agent to do the whole thing.
    [00:04:43] Model Generations, Build Systems & Background Shells
    Ryan Lopopolo: But yeah, so onward to GPT-5, 5.1, 5.2, 5.3, 5.4. To go through all these model generations and see their quirks and different working styles also meant we had to adapt the code base to change things up when the model was revved. [00:05:00] One interesting thing here is 5.2: the Codex harness at the time did not have background shells in it, which means we were able to rely on blocking scripts to perform long-horizon work.
    But with 5.3 and background shells, it became less patient, less willing to block. So we had to retool the entire build system to complete in under a minute. And this is not a thing I would expect to be able to do in a code base where people have opinions. But because the only goal was to make the agent productive, over the course of a week we went from a bespoke Makefile build, to Bazel, to Turborepo, to Nx, and just left it there because builds were fast at that point.
    swyx: Interesting. Talk more about Turbo and Nx. That’s interesting ‘cause that’s the other direction that other people have been going.
    Ryan Lopopolo: Ultimately I have not a lot of experience with actual frontend repo architecture.
    swyx: You’re talking that Jessica built the sky. So I’m like, I know the NX team. I know Turbo from Jared [00:06:00] Palmer.
    And I’m like, yeah, that’s an interesting comparison.
    [00:06:02] One Minute Build Loop
    Ryan Lopopolo: The hill we were climbing right, was make it fast.
    swyx: Is there a micro-frontend involved? How complex? React?
    Ryan Lopopolo: Electron-based single app sort of thing.
    swyx: And must be under a minute. That’s an interesting limitation. I’m actually not super familiar with the background shell stuff.
    Probably was talked about in the 5.3 release.
    Ryan Lopopolo: Basically means that Codex is able to spawn commands in the background and then continue to work while it waits for them to finish. So it can spawn an expensive build and then continue reviewing the code, for example.
    swyx: Yeah.
    Ryan Lopopolo: And this helps it be more time efficient for the user invoking the harness.
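    The background-shell behavior Ryan describes can be sketched in ordinary process terms. This is an illustration of the pattern, not the Codex implementation:

    ```python
    import subprocess
    import sys

    # Sketch of the background-shell pattern: spawn an expensive command
    # (here a stand-in for a build), keep doing other work, and only
    # block when the result is actually needed.
    proc = subprocess.Popen(
        [sys.executable, "-c", "print('build done')"],
        stdout=subprocess.PIPE,
        text=True,
    )
    # ... the agent would continue reviewing code here while the build runs ...
    out, _ = proc.communicate()  # rejoin once the background command finishes
    ```

    The key design point is the same one Ryan makes: the caller stays productive between spawn and rejoin, so slow commands no longer serialize the agent’s loop.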
    swyx: And I guess, just to really nail this: why does one minute matter? Why not five? Okay, good.
    Ryan Lopopolo: We want the inner loop to be as fast as possible. Okay. One minute was just a nice round number and we were able to hit it.
    swyx: And if it doesn’t complete, it kills it or some something,
    Ryan Lopopolo: No.
    We just take that as a signal that we need to stop what we’re doing, double click, decompose the build graph a bit to get us [00:07:00] back under, so that the agent can continue to operate.
    swyx: It’s almost like a ratchet: you’re forcing build-time discipline, because if you don’t, it’ll just grow and grow.
    That’s right. And you mentioned, like, the software I work on currently is at 12 minutes. It sucks.
    Ryan Lopopolo: This has been my experience with platform teams in the past, where you have an envelope of acceptable build times, you let it go up to breach, and then you spend two, three weeks to bring it back down to the lower end of the envelope.
    But because tokens are so cheap, yeah, and we’re so insanely parallel with the model, we can just constantly be gardening this thing to make sure that we maintain these invariants, which means there’s way less dispersion in the code and the SDLC, which means we can simplify in a way and rely on a lot more invariants as we write the software.
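    The one-minute inner-loop budget could be enforced with a small guard like the one below. The function names and threshold handling are illustrative assumptions, not the team’s actual tooling:

    ```python
    import time

    # Hypothetical guard for the one-minute inner-loop budget. Per the
    # discussion above, a breach is a signal to decompose the build graph,
    # not a reason to kill the build.
    BUILD_BUDGET_SECONDS = 60  # the "nice round number" the team settled on

    def time_build(build_fn) -> float:
        """Run a build callable and return its wall-clock duration."""
        start = time.monotonic()
        build_fn()
        return time.monotonic() - start

    def within_budget(elapsed_seconds: float,
                      budget: float = BUILD_BUDGET_SECONDS) -> bool:
        """True if the inner loop stayed inside the budget."""
        return elapsed_seconds <= budget
    ```

    Running this on every build turns the budget into the ratchet swyx describes: any breach surfaces immediately as follow-up work rather than accumulating silently.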
    [00:07:45] Observability, Traces & Local Dev Stack
    Vibhu: Lovely.
    [00:07:46] Humans Are Bottleneck
    Vibhu: You mentioned in your article, like, humans became the bottleneck, right? You kicked off as a team of three people. You’re putting out a million lines of code, like 1500 PRs, basically. What’s the mindset there? So as much as code is disposable, you’re doing a lot of review. A lot [00:08:00] of the article talks about how you wanna rephrase everything as prompting; everything the agent can’t see is kind of garbage, right? You shouldn’t have it in there.
    So what’s, like, the high level of how you went about building it, and then how you addressed: okay, humans are just PR review. Like, how is the human in the loop for this?
    Ryan Lopopolo: We’ve moved beyond even the humans reviewing the code as well.
    [00:08:19] Human Review, PR Automation & Agent Code Review
    Ryan Lopopolo: Most of the human review is post merge at this point.
    swyx: But post-merge, that’s not even review. That’s just, oh, let’s just make ourselves happy.
    Ryan Lopopolo: Fundamentally, the model is trivially parallelizable, right? With as many GPUs and tokens as I am willing to spend, I can have capacity to work with my code base.
    The only fundamentally scarce thing is the synchronous human attention of my team. There’s only so many hours in the day we have to eat lunch. I would like to sleep, although it’s quite difficult to, stop poking the machine because it makes me want to feed it. You have to step back, right?
    Like, you need to take a systems-thinking mindset to things and [00:09:00] constantly be asking: where is the agent making mistakes? Where am I spending my time? How can I not spend that time going forward? And then build confidence in the automation that I’m putting in place, so I have solved this part of the SDLC. And usually what that has looked like is: we started needing to pay very close attention to the code, because the agent did not have the right building blocks to produce modular software that decomposed appropriately, that was reliable and observable, and actually accrued a working front end, these things, right?
    [00:09:35] Observability First Setup
    Ryan Lopopolo: So in order to not spend all of our time sitting in front of a terminal, at most doing one or two things at a time, we invested in giving the model that observability, which is that graph in the post here.
    swyx: Yeah. Let’s walk through this traces and which existed first
    Ryan Lopopolo: We started with just the app, and the whole rest of it, from Vector through to all these logging, metrics APIs, was, I dunno, half an [00:10:00] afternoon of my time. We have intentionally chosen very high-level, fast developer tools. There’s a ton of great stuff out there now.
    We use mise a bunch, which makes it trivial to pull down all these Go-written VictoriaMetrics stack binaries in our local development. A tiny little bit of Python glue to spin all these up, and off you go. One neat thing here is we have tried to invert things as much as possible: instead of setting up an environment to spawn the coding agent into, we spawn the coding agent; that’s the entry point.
    It’s just Codex. And then we give Codex, via skills and scripts, the ability to boot the stack if it chooses to, and then tell it how to set some env variables, so the app in local dev points at this stack that it has chosen to spin up. And this, I think, is the fundamental difference between reasoning models and the 4.1s and 4os of the past, where those models could not think, so you had to put them in [00:11:00] boxes with a predefined set of state transitions.
    Whereas here we have the model, the harness, be the whole box, and give it a bunch of options for how to proceed with enough context for it to make intelligent choices.
    Vibhu: So a lot of that is around scaffolding, right? Yes. Previous agents, you would define a scaffold; it would operate in that loop, try again.
    That’s pivoted since we’ve had reasoning models. They seem to perform better when you don’t have a scaffold, right? That’s right.
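    The inversion Ryan describes, where the agent is the entry point and chooses whether to boot the stack, can be sketched as a skill the agent may call. Everything here (function names, endpoints, ports) is hypothetical, standing in for the real scripts that launch the observability binaries:

    ```python
    import os

    # Hypothetical skill: instead of pre-building an environment around the
    # agent, the agent calls this when it decides it needs the local stack,
    # and learns which env variables to export so the app points at it.
    def boot_local_stack() -> dict[str, str]:
        # A real skill would launch the stack binaries here; this sketch
        # just returns the endpoints the agent should export.
        return {
            "APP_LOG_ENDPOINT": "http://localhost:9428",      # illustrative
            "APP_METRICS_ENDPOINT": "http://localhost:8428",  # illustrative
        }

    def apply_env(env_vars: dict[str, str]) -> None:
        """Point the local dev app at the stack the agent chose to spin up."""
        os.environ.update(env_vars)
    ```

    The design choice is the one in the transcript: the harness is the box, and booting infrastructure is just one option the model can take when its context says it needs one.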
    [00:11:28] Docs Skills Guardrails
    Vibhu: And you go into, like, niches here too, like your spec.md and, like, having a very short AGENTS.md.
    swyx: Yes. Yes.
    Vibhu: Yeah. So you even lay out what it is here, but I like
    swyx: the table of contents.
    Vibhu: Yeah.
    swyx: Like stuff like this, it really helps guide people because everyone’s trying to do this.
    Ryan Lopopolo: This structure also makes it super cheap to put new content into the repository to steer both the humans and the agents.
    swyx: You, you reinvented skills, right?
    Vibhu: One big AGENTS.md and
    swyx: skills from first principles.
    Ryan Lopopolo: Skills did not exist when we started doing this.
    Vibhu: You have a short, [00:12:00] 100-line overall table of contents, and then you have little skills, right? Core beliefs MD, tech tracker. Yeah. Yeah. The skills are over here.
    Ryan Lopopolo: The tech debt tracker and the quality score are pretty interesting, because this is basically a tiny little scaffold, like a markdown table, which is a hook for Codex to review all the business logic that we have defined in the app, assess how it matches all these documented guardrails, and propose follow-up work for itself.
    Before beads and all these ticketing systems, we were just tracking follow-up work as notes in a markdown file, which we could spawn an agent on a cron to burn down. There’s this really neat thing that, like, the models fundamentally crave text. So a lot of what we have done here is figure out ways to inject text
    swyx: into
    Ryan Lopopolo: the system right when we get a page, because we’re missing a timeout, for example.
    I can just add Codex in Slack on that page and say, I’m gonna fix this by adding a timeout. Please update our reliability documentation. To require that all network calls have [00:13:00] timeouts. So I have not only made a point in time fix, but also like durably encoded this process knowledge around what good looks like.
    swyx: Yeah.
    Ryan Lopopolo: And we give that to the root coding agent as it goes and does the thing. But you can also use that to distill tests out of, or a code review agent, which is pointed at the same things to narrow the acceptable universe of the code that’s produced.
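    That distillation step, turning a documented guardrail into a mechanical check, can be sketched as below. The guardrail ("all network calls must set a timeout") comes from Ryan’s example; the regex-based scanner is a hypothetical stand-in for however a review agent would actually enforce it:

    ```python
    import re

    # Hypothetical check distilled from the documented guardrail above:
    # flag requests.get/post calls that do not pass a timeout= argument.
    MISSING_TIMEOUT = re.compile(r"requests\.(?:get|post)\((?![^)]*timeout=)")

    def violations(source: str) -> list[str]:
        """Return the call sites in `source` that omit a timeout."""
        return [m.group(0) for m in MISSING_TIMEOUT.finditer(source)]
    ```

    A rule encoded this way does double duty, exactly as described: it steers the authoring agent at generation time and gives the review agent something deterministic to assert after the fact.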
    swyx: I think one of the concerns I have with that kind of stuff is you think you’re making the right call by making, it’s persisted for all time across everything.
    Yes. But then you didn’t think about the exceptions that you need to make, right? And that you have to roll it back.
    Vibhu: Part of it is
    swyx: also sometimes it can fail to follow your instructions too.
    Vibhu: It’s somewhat a skill, right? So it determines when it uses the tools, right? Like, it’s not like it’ll run on every call.
    It’ll determine when it wants to check quality score, right?
    Ryan Lopopolo: Yeah. And we do in the prompts we give these agents, allow them to push back,
    [00:13:51] Agent Code Review Rules
    Ryan Lopopolo: When we first started adding code review agents to the PR, it would be: Codex CLI locally writes the change, pushes up a PR; on those PR synchronize events, a review agent fires.
    It posts a comment. We instruct Codex that it has to at least acknowledge and respond to that feedback. And initially the Codex driving the code author was willing to be bullied by the PR reviewer, which meant you could end up in a situation where things were not converging. So yeah, we had to,
    swyx: It just thrashes.
    Ryan Lopopolo: We had to add more optionality to the prompts on both of these things, right? The reviewer agents were instructed to bias toward merging the thing to not surface anything greater than a P two in priority. We didn’t really define P two, but we gave it, you
    swyx: did define P two.
    Ryan Lopopolo: We gave it a framework within which to score its output
    swyx: and then greater than P zero is worse, right?
    Yes. P two is very good.
    Ryan Lopopolo: P zero is: you will nuke the code base if
    swyx: you merge this
    Ryan Lopopolo: thing, right?
    swyx: Yeah.
    Ryan Lopopolo: But also on the code-authoring agent side, we also gave it the flexibility to either defer or push back against review feedback, right? This happens all the time, right? Like, I happen to notice something and leave a code review [00:15:00] which could blow up the scope by a factor of two. I usually don’t mean for that to be addressed exactly in the moment. It’s more of an FYI: file it to the backlog, pick it up in the next fix-it week sort of thing. And without the context that this is permissible, the coding agents are gonna bias toward what they do, which is following instructions.
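    The convergence rules described here, reviewer biases toward merging and surfaces only serious issues, author may defer rather than obey, can be encoded as a small policy. Names and thresholds below are illustrative assumptions, not OpenAI’s actual prompts:

    ```python
    # Illustrative encoding of the review rules above. P0 is "you will nuke
    # the code base if you merge this"; larger numbers are less severe.
    def reviewer_surfaces(priority: int) -> bool:
        """Reviewer agent only surfaces issues at P2 or more severe."""
        return priority <= 2

    def author_action(priority: int) -> str:
        """Authoring agent blocks only on P0; everything else may be deferred."""
        if priority == 0:
            return "block merge"
        return "address, push back, or file to backlog"
    ```

    Giving both sides explicit optionality like this is what broke the non-converging bully/bullied loop the team initially hit.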
    swyx: Yeah.
    [00:15:19] Autonomous Merging Flow
    swyx: I did want to check in on a couple things, right? Sure. The code review agent can merge autonomously. I think that’s something that a lot of people aren’t comfortable with. And you have a list here of how much agents do: they do product code and tests, CI configuration and release tooling, internal dev tools, documentation, eval harness, review comments, scripts that manage the repository itself, production dashboard definition files. Like, everything.
    Yes. And so they’re just all churning at the same time. Is there, like, a rip cord that any human on the team pulls to stop everything?
    Ryan Lopopolo: Because we are building a native application here. We’re not doing continuous deploy. So there’s still a human in the loop for cutting the release branch.
    I see. We require a blessed [00:16:00] human approved smoke test of the app before we promote it to distribution, these sort of things.
    swyx: So you’re working on the app, you’re not building like infrastructure where you have like nines of reliability, that kinda stuff?
    Ryan Lopopolo: That’s correct. That’s correct. Okay. And also, like, full recognition here that all of this activity took place in a completely greenfield repository.
    There should be no expectation that this applies generally.
    swyx: this is a production thing, you’re gonna ship
    Ryan Lopopolo: to
    swyx: customers. Of course. Yeah, of course. So this is real
    Vibhu: And like, one of the things there is, you mentioned you started this as a repo from scratch. The first month or so of onboarding was pretty, it was like working backwards, right?
    Yeah. And then you had to work with the system, and now you’re at that point where, you know, you’re very autonomous. I’m curious, like, okay, how human-in-the-loop is it? What are the bottlenecks that you wish you could still automate? And part of that is also, like, where do you see the model trajectory improving and offloading more of the human in the loop?
    We just got 5.4. It’s a really good,
    Ryan Lopopolo: fantastic model, by the way.
    Vibhu: Yeah. Yeah. It’s the first one that’s merged top-tier coding, so it’s Codex-level coding, and reasoning, so general reasoning, both in one model. So
    Ryan Lopopolo: and
    Vibhu: computer [00:17:00] use, vision.
    Ryan Lopopolo: Now, with 5.4, I can just have Codex write the blog post, whereas for this one I had to bounce between chat.
    swyx: Oh, I need to, I might be out of a job. Oh my God.
    Ryan Lopopolo: Oh,
    swyx: I know. You just gave me an idea for a completely AI newsletter that 5.4 could do. Yeah, I get it now.
    Ryan Lopopolo: This sort of thing is just one example of closing the loop, right? Like the dashboard thing you mentioned. We have Codex authoring the JSON for the Grafana dashboards and publishing them, and also responding to the pages, which means when it gets the page, it knows exactly which dashboards are defined and which alert was triggered by which exact log in the code base, ‘cause all of this stuff is collated together.
    swyx: It has to own everything.
    Yes. Yeah. Yeah.
    Ryan Lopopolo: And it means that if we have an outage that did not result in a page. It has the existing set of dashboards available to it. It has the existing set of metrics and logs and can figure out where the gaps in the dashboard are or [00:18:00] in the underlying metrics and fix them in one go.
    In the same way, you would have a full stack engineer be able to drive a feature from the backend all the way to the front end.
    Vibhu: So it seems like a lot of the work you guys had to do was: you, as a small team, are fully optimizing for the way that the model wants the software to be written. It’s less human-legible, for better code legibility, agent legibility. How do you think that affects broader teams? So one, at OpenAI, do you evangelize, like, this is how software should be written? Like, I can imagine, say you join a new team with this methodology, this mindset. There are ways that teams do code review, teams write code, like, teams are structured, and a lot of it is for human legibility.
    So should we all swap? Like, how does this play back, one, broader into OpenAI, and then, like, broader into software engineering, right? For teams that pick this up, it’s pretty drastic, right? You have to make a pretty big switch. Should they just full send? Yeah.
    Ryan Lopopolo: The mindset is very much that I’m removed from the process, right? I can’t really have deep code-level opinions about [00:19:00] things. It’s as if I’m group tech leading a 500-person organization.
    Vibhu: Yeah.
    Ryan Lopopolo: Like it’s not appropriate for me to be in the weeds on every pr. This is why that post merge code review thing is like a good analog here, right?
    Like I have some representative sample of the code as it is written, and I have to use that to infer what the teams are struggling with, where they could use help, where they’re already moving quickly and I can pivot my focus elsewhere.
    Vibhu: Yeah.
    Ryan Lopopolo: So I don’t really have too many opinions around the code as it is written.
    I do, however, have a command base class, which is used to have repeatable chunks of business logic that come with tracing and metrics and observability for free. And the thing to focus on is not how that business logic is structured, but that it uses this primitive, ‘cause I know that’s gonna give leverage by default.
    Vibhu: Yeah.
    Ryan Lopopolo: Yeah, back to that sort of systems thinking.
    Vibhu: And you have part of that in your blog post, enforcing architecture and taste, how you set boundaries for what’s used. There’s also a section on redefining [00:20:00] engineering and stuff, but yeah, it’s just, it’s interesting to hear.
    Ryan Lopopolo: and as the models have gotten better, they have gotten better at proposing these abstractions to unblock themselves, which again, lets me move higher and higher up the stack to look deeper into the future on what ultimately blocked the team from shipping.
    swyx: Yeah. You mentioned, so this is primarily, it’s like a one-million-line-of-code Electron app. But it manages its own services as well, so it’s like a backend-for-frontend type thing.
    Ryan Lopopolo: We do have a backend in there, but that’s hosted in the cloud.
    Yeah. This sort of structure is actually within the separate main and render processes, within the Electron app.
    swyx: That’s just how Electron works.
    Ryan Lopopolo: Yeah, of course. So we have also treated, like, MVC-style decomposition with the same level of rigor, which has been very fun.
    swyx: I have a fun pun, this is a tangent: MVC is model-view-controller. Any sort of full-stack web dev knows that.
    But my AI-native version of this is Model-View-Claw: the Claw is the harness.
    Ryan Lopopolo: That’s right. That’s right. I do think that there is an interesting space to [00:21:00] explore here with Codex, the harness, as part of building AI products, right? There’s a ton of momentum around getting the models to be good at coding.
    We’ve seen big leaps in, like, the task complexity with each incremental model release, where if you can figure out how to collapse a product that you’re trying to build, a user journey that you’re trying to solve, into code, it’s pretty natural to use the Codex harness to solve that problem for you. It’s done all the wiring and lets you just communicate in prompts to let the model cook.
    Yeah. It’s been very fun. And it’s also a very engineering-legible way of increasing capability. It’s fantastic, right? Yeah. You just give the model scripts, the same scripts you would already build for yourself.
    swyx: Yeah.
    Yeah. So for listeners, this is Ryan saying that software engineering, or coding agents, will eat knowledge work, like the non-coding parts where you would normally think, oh, you have to build a separate agent for it. No: start a coding agent and go out from there. Which OpenClaw has, like, under the hood.
    Ryan Lopopolo: [00:22:00] Yes.
    Vibhu: Basically define your task in code. Everything is a coding agent.
    swyx: By the way, since I brought it up, it's probably the only place we bring it up: any OpenClaw usage from you? Any?
    Ryan Lopopolo: No. No. Not for me. I don’t have any spare Mac Minis rattling around my house.
    swyx: You can afford it! No, I'm just curious if it's changed anything at OpenAI yet, but it's probably early days. And then the other thing I wanna pull on here is, you mentioned ticketing systems and you mentioned PRs, and I'm wondering if both those things have to go away or be reinvented for this kind of coding.
    So Git itself is, like, very hostile to multi-agent.
    Ryan Lopopolo: Yeah. We make very heavy use of worktrees.
    swyx: But even then, I just dropped a podcast yesterday with Cursor, and they said they're getting rid of worktrees 'cause it still has too many merge conflicts.
    It's still too unintuitive. But go ahead.
    Ryan Lopopolo: The models are really great at resolving merge conflicts. Yeah. And to get to a state where I'm not synchronously in the loop in my terminal, I almost don't care that there are merge conflicts.
    swyx: With disposable code.
    [00:23:00] Yeah.
    Ryan Lopopolo: We invoke a "land" skill, and that coaches Codex to: push the PR, wait for human and agent reviewers, wait for CI to be green,
    fix the flakes if there are any, merge upstream if the PR comes into conflict, wait for everything to pass, put it in the merge queue, deal with flakes until it's in main. Done. This is what it means to delegate fully, right? In a very large monorepo this is probably a significant tax on humans to get PRs merged, but the agent is more than capable of doing this, and I really don't have to think about it other than keeping my laptop open.
    swyx: Yeah. I used to be much more of a control freak, but now I’m like, yeah, actually you could do a better job of this than me. Yeah. With the right context. Yes.
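A minimal sketch of the kind of "land" loop Ryan describes. All names here (`PullRequest`, `land`, the event strings) are hypothetical, not OpenAI's actual skill; the real thing drives Codex and the `gh` CLI, while this just models the states the loop walks through, handling flaky CI and upstream conflicts until the PR is in main.

```python
from dataclasses import dataclass, field

@dataclass
class PullRequest:
    approved: bool = False
    ci_green: bool = False
    merged: bool = False
    log: list = field(default_factory=list)

def land(pr: PullRequest, events: list) -> PullRequest:
    """Drive a PR to main, reacting to review/CI events as they arrive."""
    for event in events:
        if event == "review_approved":
            pr.approved = True
        elif event == "ci_flake":
            pr.log.append("rerun flaky job")      # fix the flakes
        elif event == "ci_green":
            pr.ci_green = True
        elif event == "upstream_conflict":
            pr.log.append("merge upstream")       # resolve, then wait for CI again
            pr.ci_green = False
        if pr.approved and pr.ci_green and not pr.merged:
            pr.log.append("enter merge queue")    # it's in main; done
            pr.merged = True
    return pr
```

The point of the sketch is that every branch of the loop is mechanical; the only human touchpoint left is the review approval.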
    [00:23:47] Encoding Requirements
    swyx: Anything else in harness in general? Just this piece, I just wanna make sure we,
    Ryan Lopopolo: I think one thing that I maybe didn't make super clear in the article, that I heard on Twitter as interesting... [00:24:00]
    swyx: Respond to them. What's the chatter, and then what's your response?
    Ryan Lopopolo: Ultimately, all the things that we have encoded in docs and tests and review agents and all these things are ways to put all the non-functional requirements of building high scale, high quality, reliable software into a space that prompt injects the agent.
    We either write it down as docs, or we add links where the error messages tell you how to do the right thing. So the whole meta of the thing is to basically tease out of the heads of all the engineers on my team what they think good looks like, what they would do by default, or what they would coach a new hire on the team to do to get things to merge.
    And that's why we pay attention to all the mistakes that the agent makes, right? This is code being written that is misaligned with some as-yet-not-written-down non-functional requirement.
    swyx: Sorry, what did the online people misunderstand, or what did you respond to?
    Ryan Lopopolo: Somebody just literally said that. I was like, oh yeah, okay, this is the [00:25:00] thing. This is what I've been doing.
    swyx: Oh, you agree? Yeah. I see. Interesting.
    Ryan Lopopolo: One other neat thing, which I totally did not expect, is folks were just taking the link to the article and giving it to Pi or Codex and saying, make my repo like this.
    Vibhu: You achieve a whole recursion.
    Ryan Lopopolo: And it was wildly effective. Really, it was wildly effective.
    Vibhu: No way. It's actually something I tried with five-four yesterday. I didn't have time last time, I was out speaking somewhere, and this is one of my things. I was like, okay, I have this article. Can we just scaffold out what it would be like to run this?
    And I did it first as that, and then I was like, okay, let me take another little side repo and say, okay, what if I was to fully automate this, like this? Because I haven't written a line of code.
    Ryan Lopopolo: It's like going full...
    Vibhu: Set it, right. The side thing I'm doing with voice TTS, I'm just slopping out whatever. It's nothing production.
    It’s nothing production. I’m like, how would I make this like this? And it’s actually like a really good way. It’s like a good way to learn what could be changed, what could be like, it’s just a good analyzing, right? You give it all the codes, you give it all the context, you give it the article and it walks you through it very well.
    That’s right. That’s right.
    [00:25:57] Inlining Dependencies & Bret Taylor's Response
    swyx: I guess one more thing before we go to Symphony: I wanted to cover [00:26:00] Bret Taylor's response. We had him on the show. He is your chairman, which is wild. Yeah. That he's reading your articles as well and getting engaged with them. He says software dependencies are going away.
    Basically they can just be, like, vendored. Your response?
    Ryan Lopopolo: A hundred percent. A hundred percent agree.
    swyx: You still procure: you still pay Datadog, you still pay Temporal. Thank you.
    Ryan Lopopolo: Yep. The level of complexity of the dependencies that we can internalize is, I would say, low to medium right now, just based on model capability.
    swyx: What is medium?
    Ryan Lopopolo: I would say, like, a couple-thousand-line dependency is a thing that we could in-house no problem, call it an afternoon of time. One neat thing about it is that probably most of that code you don't even need. By in-housing an abstraction, you can strip away all the generic parts of it and only focus on what you need to enable the specific thing you're building.
    swyx: Yes. I've been calling this the end of b******t plugins.
    Ryan Lopopolo: Yeah.
    swyx: Because there's so much... when I publish an open source thing, I want to accept everything, be liberal. I want to accept... this is Postel's law. But that means there's so much bloat. Yes. There's so much overhead.
    Ryan Lopopolo: One other neat thing about [00:27:00] this too is, when we deploy Codex Security on the repo, it is able to deeply review and change the internalized dependencies in a much lower-friction way than it would be to push patches upstream, wait for them to be released, pull them down, and make sure that's compatible with all the transitive dependencies I have in my repo, and things like that.
    So it's also much lower friction to internalize some of these things, if code is free 'cause the tokens are cheap, sort of thing.
    swyx: Yeah. Yeah. I think the only argument I have against this is basically scale testing, which obviously the larger pieces of software, like Linux, MySQL, he calls out even the Datadogs and Temporals, and then maybe security testing, where, yes,
    classically, I think it's Linus Torvalds, it's said security in open source, many eyes is the best disinfectant.
    Ryan Lopopolo: Many eyes.
    swyx: Many eyes. And if you inline your dependencies and code them up, you're gonna have to relearn mistakes from other people. Yep.
    Ryan Lopopolo: Yep. And to internalize that dependency, you're back to zero, and you have to start
    reassembling all those bits and pieces to, yeah, have [00:28:00] high confidence in the code as it is written. Yeah.
    Vibhu: Even part of the first intro of this, you basically mentioned like everything was written by codex, including internal tooling, right? So internal tooling, like when you’re visualizing what’s going on it’s writing it for itself.
    swyx: Yeah. I build internal tools that way now, and I just show them off, and they're like, how long did you spend? And I didn't spend any time, I just prompted it.
    Ryan Lopopolo: very funny story here.
    swyx: Yeah, go ahead.
    Ryan Lopopolo: We had deployed our app to the first dozen users internally, had some performance issues, so we asked them to export a trace for us, get a tarball, gave it to our on-call engineer. And he did a fantastic job of working with Codex to build this beautiful local dev tool, a Next.js app where you drag and drop the tarball in and it visualizes the entire trace.
    It's fantastic. Took an afternoon. But none of this was necessary, because you could just spin up Codex, give it the tarball, ask the same thing, and get the response immediately. So in a way, optimizing for human [00:29:00] legibility of that debugging process was wrong. It kept him in the loop unnecessarily, when instead he could have just let Codex cook for five minutes and gotten the same answer.
    swyx: Yeah, you have to fight your instincts here of, this is how we used to do it, or this is how I would have solved it.
    Ryan Lopopolo: Yeah. In this local observability stack, sure, you can deploy Jaeger to visualize the traces, but I wouldn't expect to be looking at the traces in the first place, because I'm not gonna be the one writing the code to fix them.
    swyx: Yeah. So basically there needs to be this kind of in-house stack and owning the whole loop. I think that is very well established. And it sounds like you might be sharing more about that in the future, right?
    Ryan Lopopolo: Yeah. I think we're excited to.
    [00:29:36] Ghost Libraries & Distributing Software as Specs
    Ryan Lopopolo: We're gonna talk about Symphony in a little bit, but the way we distribute it is as a spec, which I think folks are calling Ghost Libraries on Twitter.
    This is such a cool name. It does mean it becomes much cheaper to share software with the world, right? You define a spec for how you could build your own, specifying as much as is required for a coding agent to reassemble it [00:30:00] locally. The flow here is very cool. We have taken all the scaffolding that has existed in our proprietary repo, spun up a new one,
    and asked Codex, with our repo as a reference, to write the spec. We tell it: spin up a tmux, spawn a disconnected Codex to implement the spec, wait for it to be done, spawn another Codex in another tmux to review the implementation compared to upstream, and update the spec so it diverges less.
    And then you just loop over and over, Ralph-style, until you get a spec that is, with high fidelity, able to reproduce the system as it is. It's fantastic.
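The Ralph-style loop described here can be sketched as a fixed-point iteration. This is a toy under stated assumptions: `implement` and `divergence` are stand-ins for the disconnected Codex runs, and representing a spec as a dict is an illustrative simplification, not their actual format.

```python
def refine_spec(spec, reference, implement, divergence, max_rounds=10):
    """Loop: build from the spec, diff against the reference system,
    fold the gaps back into the spec, until the build matches."""
    for round_num in range(max_rounds):
        built = implement(spec)                 # disconnected agent builds from spec
        gaps = divergence(built, reference)     # review agent diffs vs upstream
        if not gaps:
            return spec, round_num              # spec now reproduces the system
        # update the spec so it diverges less next round
        spec = {**spec, **{g: reference[g] for g in gaps}}
    return spec, max_rounds
```

The interesting property is that the stopping condition is behavioral (the build matches), not stylistic, which is why no human bias needs to enter the spec.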
    Vibhu: And you're basically not adding any of your human bias in there, right?
    Ryan Lopopolo: That's correct.
    Vibhu: A lot of times people write a spec and are like, okay, I think it should be done this way, and you'll riff on something. And no, the agent could have just handled it. You're still scaffolding in a sense, right? I want it done this way. It can determine its spec better.
    swyx: That's right. That's right. I've been working a lot on evals recently, and part of me is wondering if [00:31:00] an agent can produce a spec that it cannot solve.
    Is it always capable of the things it can imagine, or can it imagine things that are impossible for it to do?
    Ryan Lopopolo: I think with Symphony, there's this axis where you have things that are easy or hard, and established or new, right? And I think things that are hard and new are still something the models need humans to drive.
    swyx: Yeah. Yeah.
    Ryan Lopopolo: But I think those other quadrants are largely solved, given the right scaffold and the right thing that's gonna drive the agent to completion.
    swyx: It's crazy that it's solved.
    Ryan Lopopolo: But it means that the humans, the ones with limited time and attention, get to work on the hardest stuff: the problems where it's pure white space out in front, or the deepest refactorings where you don't know what the proper shape of the interfaces is. And this is where I wanna spend my time, 'cause it lets me set up for the next level of scale.
    swyx: Yeah. Yeah. Amazing. Let’s introduce Symphony.
    I think we’ve been mentioning it every now and then. Elixir. Interesting option.
    Ryan Lopopolo: Yeah.
    swyx: Yeah. I'm not...
    Ryan Lopopolo: Again, the [00:32:00] Elixir manifestation here is just a derivative.
    swyx: Is it model-chosen? Yeah.
    Ryan Lopopolo: Yeah. Yeah. And it chose that because the process supervision and the GenServers are super amenable to the type of process orchestration that we're doing here.
    You are essentially spinning up little daemons for every task that is in execution and driving it to completion, which means the model gets a ton of stuff for free by using Elixir and the BEAM.
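What process supervision buys here can be sketched in Python, under the assumption that a "task" is a callable that may crash. This restart-on-failure behavior is roughly what Elixir's Supervisor/GenServer machinery provides natively on the BEAM; the sketch is only to make the concept concrete.

```python
def supervise(task, max_restarts=3):
    """Run a per-task daemon; restart it when it crashes, up to a limit."""
    for attempt in range(max_restarts + 1):
        try:
            return task(attempt)      # task completed; return its result
        except RuntimeError:
            continue                   # crashed: supervisor restarts the daemon
    return "gave up"                   # exceeded restart budget
```

In Elixir this loop disappears entirely: the supervision tree restarts crashed processes for you, which is why the model "gets a ton of stuff for free" by targeting the BEAM.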
    swyx: I had to go do a crash course in BEAM and Elixir, and I think most people are not operating at that scale of concurrency where you need that.
    But it is a good mental model for resumability and all those things. And these are things I care about. But tell me the origin story of Symphony. What do you use it for? How did it form? Maybe any abandoned paths that you didn't take?
    [00:32:46] Symphony: Removing Humans from the Loop
    Ryan Lopopolo: At the end of December we were at about three and a half PRs per engineer per day.
    This was before 5.2 came out in the beginning of January. Everyone gets back from holiday with 5.2 and no other work [00:33:00] on the repository, and we were up in the 5 to 10 PRs per day per engineer range. And I don't know about y'all, but it's very taxing to constantly be switching like that. I was pretty tapped out at the end of the day. Again, where are the humans spending their time? They're spending their time context-switching between all these active tmux panes to drive the agent forward.
    swyx: Yeah. No way. Yeah.
    Ryan Lopopolo: So let's, again, build something to remove ourselves from the loop. And this is what we frantically sprinted at here: to find a way to remove the need for the human to sit in front of their terminal.
    So a lot of experimentation with dev boxes and automatically spinning up agents. It seems like a fantastic end state here, where my life is a beach: I open Linear twice a day and say yes or no to these things. Yeah. And this is, again, a super interesting framing for how the work is done.
    Because I become way less latency-sensitive. I have [00:34:00] way less attachment to the code as it is written. I've had close to zero investment in the actual authorship experience, so if it's garbage, I can just throw it away and not care too much about it. In Symphony, there's this rework state where, once the PR is proposed and it's escalated to the human for review, it should be a cheap review.
    It is either mergeable or it is not. And if it's not, you move it to rework. The Elixir service will completely trash the entire worktree and PR and start it again from scratch. Okay. And this is that opportunity, again, to say, why was it trashed, right? What did the agent do that was
    swyx: bad. Yeah.
    Ryan Lopopolo: Fix that before moving the ticket to in-progress again.
    swyx: Yeah. Why is this not in the Codex app? I guess you guys are ahead of the Codex app.
    Ryan Lopopolo: Yeah, so the way the team has been working is basically to be as AI-pilled as possible and sprint ahead. And a lot of the things we have worked on have fallen out [00:35:00] into a lot of the products that we have.
    Like, we were in deep consultation with the Codex team to have the Codex app be a thing that exists, right? To have skills be a thing that Codex is able to use, so we didn't have to roll our own. To put automations into the product, so all of our automatic refactoring agents didn't have to be these hand-rolled control loops.
    It has been really fantastic to be, in a way, unanchored to the product development of Frontier and Codex, and just very quickly try to figure out what works, and then later find the scalable thing that can be deployed widely. It's been a very fun way to operate. It's certainly chaotic. I have lost track very often of what the actual state of the code looks like,
    'cause I'm not in the loop. There was one point where we had wired Playwright directly up to the Electron app with MCP. MCPs I'm pretty bearish on, because the harness forcibly injects all those tokens into the [00:36:00] context and I don't really get a say over it. They mess with auto-compaction. The agent can forget how to use the tool.
    There are probably only, what, three calls in Playwright that I actually ever want to use, so I pay the cost for a ton of things. Somebody vibed a local daemon that boots Playwright and exposes a tiny little shim CLI to drive it. And I had zero idea that this had occurred, because to me, I run Codex and it's able to... it's, oh, it's better.
    Yeah. Like no knowledge of this at all. Uhhuh.
    [00:36:30] Multi Human Chaos
    Ryan Lopopolo: So we have had, in human space, to spend a lot of time doing synchronous knowledge sharing. We have a daily standup that's 45 minutes long, because we almost have to fan out the understanding of the current state.
    swyx: Yeah, I was gonna say, this is good for single-human multi-agent, but multi-human multi-agent is a whole, like, explosion of stuff.
    Ryan Lopopolo: Yeah. And this is fundamentally why we have such a rigid, like, 10,000-[00:37:00]engineer-level architecture in the app: because we have to find ways to carve up the space so people are not trampling on each other.
    swyx: Sorry, I don’t get the 10,000 thing. Did I miss that?
    Ryan Lopopolo: The structure of the repository is like 500 npm packages.
    It's architected to excess for what you would consider, I think, normal for a seven-person team. But if every person is actually like 10 to 50, then the numbers on being super, super deep into decomposition and sharding and proper interface boundaries make a lot more sense.
    swyx: Yeah. To me, that's why I talked about micro-frontends, and Nx is from that world. But cool. Just coming back to this: I dunno if you have other thoughts on orchestrating so much work going through this. Is this enough? Any aha moments?
    Vibhu: It’ll be interesting to see like where, okay, so right now you pick linear as your issue tracker, right?
    swyx: Oh, it's like a... is it actually Linear? This is actually Linear.
    [00:37:55] Linear vs Slack Workflow
    Vibhu: Oh, that's Linear. It's Linear.
    swyx: Oh, I never looked at the video.
    Vibhu: The demo video, I had to download it to [00:38:00] run.
    swyx: So I... because I'm a Slack maxie. But yeah, Linear. Linear is also really good. Yes,
    Ryan Lopopolo: we do make good use of Slack. We fire off Codex to do all these Notion fix-ups, the things that sync that knowledge into the repository.
    It’s super cheap. Yeah.
    swyx: Yeah.
    Ryan Lopopolo: Just do it in Codex.
    swyx: My biggest plug is: OpenAI needs to build Slack. You need to own Slack. Build your own. Turn this into Slack.
    Ryan Lopopolo: I did read about it. You
    swyx: did?
    Ryan Lopopolo: Yeah.
    [00:38:25] Collaboration Tools for Agents
    Ryan Lopopolo: I would say that if we think that we want these agents to do economically valuable work, which is like this is the mission, right?
    We want AI to be deployed widely, to do economically valuable work, then we need to find ways for them to naturally collaborate with humans, which means collaboration tooling, I think, is an interesting space to explore.
    swyx: Yeah, totally. Yeah. GitHub, slack, linear.
    Vibhu: Yeah, that was my thing. Okay, where do we see... right now Codex started as the Codex model, then a CLI, now there's an app. The app can let me shoot off multiple Codexes in parallel, but there's no great team collaboration for Codex.
    And it [00:39:00] seems like your team had some say in what comes out, right? So you talked to them, and the Codex app kind of became a thing. From there, if you guys are out ahead, what stuff, that you might not focus on, do you expect other people to be building? So people that are 5x-ing, 50x-ing.
    Should you build stuff that's very niche for your workflow, for your team? Should it be more general so other people can adopt it? Is there a niche there? 'Cause part of it is just, okay, is everything just internal tooling? Do we have everything our own way? The way our team operates has its own ways we like to communicate. Or is there a broader way to do it?
    Is it something like an issue tracker? Just thoughts, if you wanna riff on that.
    [00:39:35] Standardizing Skills and Code
    Ryan Lopopolo: I think TBD; we have not figured this out in a general way. I do think that there is leverage to be had in making the code and the processes as much the same as possible. If you think that code is context, code is prompts, it's better from the agent-behavior perspective to be able to look in a package in directory XYZ and not have to page so [00:40:00] deeply into another directory, because they have the same structure, use the same language, have the same patterns internally.
    And that same leverage comes from aligning on a single set of skills that you're pouring every engineer's taste into, to make sure that the agent is effective. So in our code base we have, I think, six skills. That's it. And if some part of the software development loop is not being covered, our first attempt is to encode it in one of the existing set of skills, which means that we can change the agent behavior
    more cheaply than changing the human driver behavior.
    swyx: Yeah.
    [00:40:39] Self Improvement via Logs
    swyx: Have you experimented with agents changing their own behavior?
    Ryan Lopopolo: We do.
    swyx: Yeah. Or parent agent changing a subagents, behavior or something like that.
    Ryan Lopopolo: We have some bits for skill distillation. For example, there's one neat thing you can do with Codex, which is just point it at its own session logs and ask it to tell you how you can use [00:41:00] the tool better.
    swyx: It’s like introspection
    Ryan Lopopolo: Or ask it to do things.
    Vibhu: How could I use this session better? What skills should I...
    swyx: I like the modification of "you can just do things" to "you can just ask the agent to do things."
    Ryan Lopopolo: Yeah. You can just Codex things. This is like a silly emoji that we have, right? You can just Codex things, you can just prompt things.
    It's really a glorious future we live in. But okay, you can do that one-on-one. But we're actually slurping these up for the entire team into blob storage and [00:42:00] running agent loops over them every day to figure out where, as a team, we can do better, and how do we reflect that back into the repositories?
    Thus everybody benefits from everybody else's behavior for free. Same for PR comments, right? These are all feedback that means the code as written deviated from what was good. A PR comment, a failed build: these are all signals that mean at some point the agent was missing context. We gotta figure out how to
    swyx: Yeah.
    Ryan Lopopolo: Slurp it up and put it back in the repo.
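A toy version of the feedback-mining idea just described: treat PR comments, failed builds, and session logs as missing-context signals, aggregate them, and surface the most common themes to fold back into docs and skills. The function name and signal format are hypothetical, not OpenAI tooling.

```python
from collections import Counter

def missing_context_themes(signals):
    """signals: (kind, theme) tuples harvested from session logs,
    PR comments, and CI results; returns themes by frequency."""
    counts = Counter(theme for kind, theme in signals
                     if kind in {"pr_comment", "failed_build", "session_log"})
    # most common themes first: these are the docs/skills to write next
    return [theme for theme, _ in counts.most_common()]
```

The design point is the same as in the conversation: every rejection signal is treated as evidence of an unwritten non-functional requirement, and the aggregate tells you which one to write down first.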
    swyx: By the way, I do exactly this. When I use Claude Code for knowledge work... Claude Cowork is like a nice product, right? Yes. I think you would agree. I always have it tell me, what do I do better next time? And that's the meta-programming reflection thing.
    So I almost think you have six reflection-extraction levels in Symphony, and almost like a zeroth layer. So the six levels are policy, configuration, coordination, execution, integration, observability. We've talked about a couple of these, but the zeroth layer is the: okay, are we working well?
    Can we improve how we work? Yes. Can I modify my own workflow MD or something? I don't know.
    Ryan Lopopolo: Yeah, of course. Yeah, of course you can. This thing is also able to cut its own tickets, 'cause we give it full access.
    Yeah, make a ticket to have it cut tickets, you can.
    You can put in the ticket the follow-up work that you expect it to file.
    swyx: like Yeah. Self-modifying. Yeah.
    Ryan Lopopolo: Yeah.
    [00:42:44] Tool Access and CLI First
    Ryan Lopopolo: Don't put the agent in a box. Give the agent full access over its domain.
    swyx: I had a mental reaction when you said don’t put the agent in a box. So I think you should put it in a box. Like it’s just that you’re giving the box everything it needs.
    Ryan Lopopolo: Yeah. Context and tools.
    swyx: But as developers, we're used to calling [00:43:00] out to different systems. But here you use the open source things, like Prometheus, whatever, and you run it locally so that you can have the full loop, I assume.
    Ryan Lopopolo: Yep.
    Vibhu: I think, like...
    Ryan Lopopolo: Another thing: you wanna minimize cloud dependencies.
    Vibhu: You also want to make sure that you think about what the agent has access to. What does it see? Does it go back into the loop? At the most basic level: you let it see its own calls and traces, so it can determine where it went wrong. But are you feeding that back in? You wanna see exactly what's input and output. Does the agent have access to
    what is being output, right? It can self-improve a lot of these things.
    Ryan Lopopolo: It's all text, right? My job is to figure out ways to funnel text from one agent to the other.
    swyx: It's so strange. Way back at the start of this whole AI wave, Andrej was like, English is the hottest new programming language.
    It's here. It's just, yeah, the future as well.
    Vibhu: A lot of software, a lot of stuff: there's a GUI, it's made for the human. We're seeing the evolution of CLIs for everything, right? All tools have CLIs, and your agents can use [00:44:00] them well. Do we get good vision? Do we get good little sandboxes?
    Right now it's a really effective way, right? Models love to use tools. They love to read through text. So slap a CLI on it and let it go loose. That works for everything.
    Ryan Lopopolo: It does. Yeah. Yeah.
    [00:44:14] UI Perception and Rasterizing
    Ryan Lopopolo: We've also been adapting non-textual things to that shape in order to improve model behavior in some ways, right?
    We want the agent to be able to see the UI. Agents do not perceive visually in the same way that we do. They don't see a red box; they see "red box button," right? They see these things in latent space. So if we want...
    swyx: Hey, we have a ding that goes off every time. Latent space!
    Ryan Lopopolo: Ding.
    Anyway, if we wanna actually make it see the layout, it's almost easier to rasterize that image to ASCII art and feed it in to the agent. Ha. And there's no reason you can't do both, right? To further refine how the model perceives the object it's [00:45:00] manipulating.
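A trivial illustration of the rasterize-to-text idea: render a labeled UI element as an ASCII box the model can "see" as spatial layout inside plain text. This is purely a sketch of the concept, not their actual rendering pipeline; the function name and fixed width are assumptions.

```python
def ascii_box(label: str, width: int = 16) -> str:
    """Render one UI element as a bordered ASCII rectangle with its label."""
    inner = width - 2
    border = "+" + "-" * inner + "+"
    # the centered label gives the model both the text and its position
    return "\n".join([border, "|" + label.center(inner) + "|", border])
```

Composing boxes side by side or stacked would give the agent the rough geometry of a screen, while the raw accessibility tree still carries the semantic "red box button" information, which is the "do both" point above.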
    swyx: Cool. Could we, you wanna talk about a couple more of these layers that might bear more introspection or that you have personal passion for?
    [00:45:07] Coordination Layer with Elixir
    Ryan Lopopolo: I will say that the coordination layer here was a really tricky piece to get right.
    swyx: Let’s do it. Yep. I’m all about that. And this is Temporal core.
    Ryan Lopopolo: This is where, when we turn the spec into Elixir, the model takes a shortcut, right? It's like, oh, I have all these primitives that I can make use of in this lovely runtime that has native process supervision.
    Which is, I think, a neat way to have taken the spec and made it more achievable, by making choices that naturally map
    swyx: Yeah.
    Ryan Lopopolo: To the domain, right? In the same way that you would prefer to have a TypeScript monorepo if you are doing full stack web development, right? Because the ability to share types across the frontend and backend reduces a lot of complexity.
    And because
    swyx: That's what GraphQL used to be.
    Ryan Lopopolo: That’s right. And
    swyx: I don’t know if it’s still alive, but
    Ryan Lopopolo: [00:46:00] No humans in the loop here. So my own personal ability to write or not write Elixir doesn't really have to bias us away from using the right tool for the job. It is just wild.
    swyx: Love it. I love it.
    Yeah. I wonder if any languages struggle more than others because of this? I feel like everyone has their own abstractions that would make sense. But maybe it might be slower, it might be more faulty, where you'd have to just kick the server every now and then. I don't know. I think the observability layer is really well understood.
    The integration layer, MCP is dead. I think all of these are just a really interesting hierarchy to travel up and down. It's common language for people working on the system to understand.
    Ryan Lopopolo: The policy stuff is really cool, right? Yeah. You don't really have to build a bunch of code to make sure the system waits for CI to pass.
    swyx: it’s institutional knowledge.
    Ryan Lopopolo: Yeah. You just give it the gh CLI with some text that says CI has to pass. It makes the maintenance of these systems a lot easier.
    [00:46:57] Agent Friendly CLI Output
    swyx: Do you think that CLI maintainers need to [00:47:00] do anything special for agents, or is it good as is? Because I don't think when people made the GitHub CLI, they anticipated this happening.
    Ryan Lopopolo: That's correct. The gh CLI is fantastic. It's great, industry-standard.
    swyx: Everyone go try gh repo create, gh pr checkout and then the pull request number, right? gh pr checkout 153, whatever. And then it, like, pulls...
    Ryan Lopopolo: Basically my only interaction with the GitHub web UI at this point is gh pr view --web.
    swyx: Exactly. Glance at the diff.
    Ryan Lopopolo: And be like, sure thing, send it. Yeah. But the CLIs are nice 'cause they're super token-efficient, and they can be made more token-efficient really easily. Like, I'm sure you all have seen: I go to Buildkite or Jenkins and I could just get this massive wall of build output.
    And in order to unblock the humans, your developer productivity team is almost certainly gonna write some code that parses the actual exception out of the build logs and sticks it in a sticky note at the top of the page. And you basically [00:48:00] want CLIs to be structured in a similar way, right? You're gonna want to pass a silent flag to Prettier, because the agent doesn't care that every file was already formatted.
    It just wants to know it's either formatted or not, so it can then go run a write command. Similarly, in our pnpm distributed script runner, when we had one, when you do --recursive it produces an absolute mountain of text. But all of that is for passing test suites. So we ended up wrapping all of this in another script
    swyx: To suppress the...
    Ryan Lopopolo: Which you can vibe to only output the failing parts of the tests.
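The token-efficiency trick just described, as a minimal sketch: wrap a noisy test runner and keep only the failure lines, since the agent doesn't care about passing suites. The substring matching here is an assumption for illustration; a real wrapper would parse the runner's actual output format.

```python
def failures_only(runner_output: str) -> str:
    """Filter a wall of test output down to just the failing parts."""
    keep = [line for line in runner_output.splitlines()
            if "FAIL" in line or "Error" in line]
    # one short sentinel when everything passed beats thousands of PASS lines
    return "\n".join(keep) if keep else "all green"
```

Piping the runner through a filter like this is the CLI equivalent of the dev-productivity sticky note: the agent spends its context on the exception, not on the boilerplate around it.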
    swyx: You can pipe errors versus standard out. I don't know, okay, whatever, too much thinking to have to do that. The CLI... I used to maintain a CLI for my company, and yeah, this is very core to my heart. But you're vibing away my job.
    Ryan Lopopolo: That’s right.
    swyx: Cool. Any other things?
    This is a long spec. [00:49:00] I appreciate that. It’s got a lot of strong opinions in here. Any other things that we should highlight? Obviously you could spend the whole day going through some of these, but I do think some of these have a lot of care in them, or some of this you might wanna tell people: hey, take this, but make it your own.
    [00:49:15] Blueprint Spec and Guardrails
    Ryan Lopopolo: Fundamentally, software is made more flexible when it’s able to adapt to the environment in which it is deployed, which means that things like Linear or even GitHub are specified within the spec, but are not required pieces of it. There’s a more platonic ideal of the thing, where you could swap in Jira or Bitbucket, for example.
    But being able to tightly specify things like the ID formats or how the Ralph Loop works for the individual agents. Basically means you can get up and running with a fully specified system quickly that you then evolve later on. I think we never intended for this to be a static spec that you can [00:50:00] never change.
    It’s more like a blueprint to get something, a worthwhile starting point, up and running.
    swyx: Yeah.
    Ryan Lopopolo: For you then to vibe later to your heart’s content,
    swyx: you have like code and scripts in here where it’s oh, I think this is a really good prompt. It’s just a very long prompt.
    Ryan Lopopolo: Fundamentally, the agents are good at following instructions, so give them instructions.
    And it will improve the reliability of the result. Much like the way we use Symphony, we don’t want folks to have to monitor the agent as it is vibing the system into existence. So being very opinionated and very strict around what these success criteria are means that our deployment success rate goes up. Yeah. It means we don’t have to get tickets on this thing.
    Vibhu: I think it all goes back to that “code is disposable” idea, right? Like early on, when you had the CLI or you’d kick off a Codex run, it would take two hours. You’d wanna monitor it: okay, I’m in the workflow of just using one.
    I don’t want it to go down the wrong path, I’ll cut it off. Versus just shoot off four; like that was my favorite thing about the Codex app, right? Yeah. Just 4x it, [00:51:00] it’s okay. One of them will probably be right, one of them might be better. Stop overthinking it. My first example was probably deep research.
    When deep research came out, I asked it something about LLMs, it thought it was something legal, spent an hour, and came back with a report completely off the rails. And I was like, okay, I gotta monitor this thing a bit. But no, don’t monitor it. You want to build it so that it goes the right way.
    And you don’t wanna sit there and babysit, right? You don’t want to babysit your agents
    Ryan Lopopolo: With that deep research query that you made, looking at the bad result, you probably figured out you needed to tweak your prompt a bit, right? That’s the guardrail that you fed back into the codebase for the task, your prompt, to further align the agent’s execution.
    Same sort of concept supply there too.
    swyx: When you talk to customers, how are they feeling
    Ryan Lopopolo: about Symphony? I think we have none, right? This is a thing we have put out into the
    swyx: world. Symphony’s internal, right? As long as you are happy, you are the customer. That’s right. Just, what’s the external view?
    [00:51:53] Trust Building with PR Videos
    Ryan Lopopolo: I’d say folks are very excited about this way of distributing software and ideas in [00:52:00] cheap ways. For us as users, it has, again, pushed productivity 5x, which means I think there’s something here that’s a durable pattern around removing the human from the loop and figuring out ways to trust the output.
    The video that is shared here
    swyx: Yeah.
    Ryan Lopopolo: Is the same sort of video we would expect the coding agent to attach to the PR.
    swyx: Yeah.
    Ryan Lopopolo: That is created. Yeah. That’s part of building trust in this system and that’s, to me, like fundamentally what has been cool about building this is it more closely pushes that persona of the agent working with you to be like a teammate.
    I don’t shoulder surf you like for the tickets that you work on during the week. I would never think that I would want to do that.
    swyx: Yeah.
    Ryan Lopopolo: I wouldn’t want a screen recording of your entire session in Cursor or Claude code. I would expect you to do what you think you need to do to convince me that the code is good and [00:53:00] mergeable
    swyx: Yeah.
    Ryan Lopopolo: And compress that full trajectory in a way that is legible to me, the reviewer.
    swyx: Yeah.
    Ryan Lopopolo: It’s true. And you can just do that, because Codex will absolutely sling some FFmpeg around. It’s great.
    swyx: Oh, FFmpeg is the OG, like, god CLI.
    Ryan Lopopolo: Yeah.
    swyx: Swiss Army chainsaw, I used to say. There’s a micro-SaaS in every flag of FFmpeg.
    Ryan Lopopolo: Oh, for sure.
    swyx: You know what I mean? For sure. Just host it as a service, put a UI on it. People who don’t know FFmpeg will pay for it.
    Ryan Lopopolo: When we were first experimenting with this, it was a wild feeling to be at the computer with just like windows just popping up all over the place and getting captured and files appearing on my desktop, like very much felt like the future to have a thing controlling my computer for like actual productive use.
    Like I’m just there
    swyx: keeping it awake, jiggling the mouse every once in a while. That’s what some office workers do, so they buy a mouse jiggler. That’s right.
    [00:53:59] Spark vs Reasoning Models
    Vibhu: One thing I [00:54:00] wanted to ask. So okay, code is disposable, shoot off a budget of agents. One question is, okay, are you always an extra-high-thinking guy?
    And where do you see Spark? So, 5.3 Spark: there’s a lot of me wanting to make quick changes. I’m not gonna open up an IDE, I’m not gonna do anything. But I will say, okay, fix this little thing, change a line, change a color. Spark is great for that, but am I still a bottleneck? Like, why don’t I just let that go?
    I’m like, just riff on that. Is there,
    Ryan Lopopolo: Spark is such a different model compared to the extra-high reasoning that you get in these 5.x models. Yeah. To be clear for people,
    swyx: It is a different model, different architecture, different, like it doesn’t support
    Ryan Lopopolo: it, it just, it’s an incredibly fast, smaller model.
    I have not quite figured out how to use it yet, to be honest. I was adapting it to the same sorts of tasks I would use xhigh reasoning for. Yeah. And it would blow through three compactions before writing a line of code.
    Vibhu: And that’s another big thing with 5.4, right?
    Million-token context.
    Ryan Lopopolo: Yes, it’s
    Vibhu: fantastic. Which is huge [00:55:00] for agents, right? You can just run for longer before you have to compact. The more tokens you can spend on a task before compacting, the better you’ll do.
    Ryan Lopopolo: That’s right. That’s right. I’m not sure how to deploy Spark. I think your intuition is right that it’s great for spiking out prototypes, exploring ideas quickly, doing those documentation updates.
    It is fantastic for us at taking that feedback and transforming it into a lint, where we already have good infrastructure for ESLint in the codebase. These sorts of things it’s great at, and it allows us to quickly unblock, doing those antifragile, healing tasks in the codebase.
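    The feedback-into-a-lint move can be sketched as a tiny custom check. The team above uses ESLint for this; the banned pattern, advice string, and function name below are made-up illustrations of the same shape:

    ```python
    # A recurring review comment becomes a permanent, automated check instead of
    # a prompt reminder. The banned pattern and advice here are invented examples.
    BANNED = {
        "time.sleep(": "use the scheduler instead of blocking sleeps",
    }

    def lint_source(filename: str, source: str) -> list[str]:
        """Return one compiler-style finding per offending line."""
        findings = []
        for lineno, line in enumerate(source.splitlines(), start=1):
            for pattern, advice in BANNED.items():
                if pattern in line:
                    findings.append(f"{filename}:{lineno}: found {pattern!r}; {advice}")
        return findings
    ```

    The point of the pattern is that the guardrail now fires on every run, so neither the human nor the agent has to remember the rule.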
    swyx: Yeah, that makes sense.
    [00:55:38] What Models Can’t Do Yet
    swyx: So you guys are pushing models to the freaking limit.
    [00:55:41] Current Model Limitations
    swyx: What can current models not do well yet?
    Ryan Lopopolo: They’re definitely not there on being able to go from new product idea to prototype in a single
    swyx: one shot.
    Ryan Lopopolo: This is where I find I spend a lot of time steering: translating the end state of a mock for a net-new [00:56:00] thing, right?
    Think no existing screens, into a product that can be played with. Similarly, while this has gotten better with each model release, the gnarliest refactorings are the ones that I spend the most time with, right? The ones where I’m interrupting the most, the ones where I am now double-clicking to build tooling to help decompose monoliths and things like that.
    This is a thing I only expect to get better, right? Over the course of a month, we went from low-complexity tasks to low-complexity and big tasks, in both these directions. So this is what it means to not bet against the model, right? You should expect that it is going to push itself out into these higher and higher complexity spaces.
    Yeah. So the things we do are robust to that. It just basically means I’ll be able to spend my time elsewhere and figure out what the next bottleneck is.
    Vibhu: I do think it’s also a bit of a different type of task, right? Codex is really good at codebase understanding, working with codebases. But companies like Lovable, Bolt, Replit, they solve a very different [00:57:00] problem.
    Scaffolding zero to one, right? Idea to a product. And there are people working on that, and models are also pushing step-function changes there. It’s just different from the software engineering agents today, right?
    Ryan Lopopolo: Like I said, the model is isomorphic to myself.
    The only thing that’s different is figuring out how to get what’s in here into context for the model, and for these whitespace sorts of projects, I myself am just not good at it. Which means that often, over the agent trajectory, I realize the bits that we’re missing, which is why I find I need to have this synchronous interaction.
    And I expect, with the right harness, with the right scaffold, that’s able to tease that out of me or refine the possible space, right? To be super opinionated around the frameworks that are deployed, or to put a template in place, right? These are ways to give the model all those non-functional requirements, that extra context to anchor on, and avoid that wide dispersion of possible outcomes.
    swyx: Thank [00:58:00] you for that.
    [00:58:00] Frontier Enterprise Platform
    swyx: I wanted to talk a little bit about Frontier.
    Ryan Lopopolo: Yeah, sure.
    swyx: Overall, you guys announced it maybe like a month ago, and there’s a few charts in here. It’s basically your enterprise offering, is how I view it. Is there one product or are there many?
    Ryan Lopopolo: I can’t speak to the full product roadmap here, but what I can say is that Frontier is the platform by which we want to do AI transformation of every enterprise and from big to small.
    And the way we want to do that is by making it easy to deploy highly observable, safe, controlled, identifiable agents into the workplace. We want it to work with your company-native IAM stack. We want it to plug into the security tooling that you have. We want it to be able to plug into the workspace tools that you use,
    swyx: so you’re just gonna be shipping specs, right?
    Ryan Lopopolo: We expect that there will be some harness things there. The Agents SDK is a core [00:59:00] part of this, to enable both startup builders as well as enterprise builders to have a works-by-default harness that is able to use all the best features of our models, from the shell tool down to the Codex harness with file attachments and containers and all these other things that we know go into building highly reliable, complex agents.
    We wanna make that great, and we wanna make it easy to compose these things together in ways that are safe, for example, right? Like the gpt-oss-safeguard model, for example. One thing that’s really cool about it is it ships the ability to interface with a safety spec. Safety specs are things that are bespoke to enterprises.
    We owe it to these folks to figure out ways for them to instrument the agents in their enterprise to avoid exfiltration in the ways they specifically care about, to know about their internal company code names, these sorts of things. So providing the right hooks to make the [01:00:00] platform customizable, but also mostly working by default for folks, is the space we are trying to explore here.
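    As a rough illustration of the bespoke safety-spec idea (the code names, function name, and policy shape below are invented for this sketch, not OpenAI's actual API), an exfiltration hook over an agent's outbound message might look like:

    ```python
    # Enterprise-specific denylist: internal code names that must never leave the
    # company boundary in an agent's outbound message. Names are hypothetical.
    INTERNAL_CODE_NAMES = {"Project Nimbus", "Orion-DB"}

    def check_outbound(text: str) -> tuple[bool, list[str]]:
        """Return (allowed, code names found) for an agent's outbound message."""
        leaked = sorted(name for name in INTERNAL_CODE_NAMES if name in text)
        return (not leaked, leaked)
    ```

    A real deployment would sit this hook between the agent and any external connector, and back it with a learned classifier like gpt-oss-safeguard rather than substring matching.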
    swyx: Yeah. And the Snowflakes of the world just need this, right? Yes. Your Brexes of the world, Stripes. Yeah, it makes sense.
    [01:00:11] Dashboards and Data Agents
    swyx: I was gonna go back to, I think, the demo videos that you guys had, which were pretty illustrative. It’s also, to me, an example of very large-scale agent management.
    Yes. Like you give people a control dashboard where, if you play any one of these multiple-agent things, you can dig down to the individual instance and see what’s going on.
    Ryan Lopopolo: Yes, of course.
    swyx: But who’s the user? Is it like the CEO, the CTO, the CIO, something like that?
    Ryan Lopopolo: At least in my personal opinion here, the buyer that we’re trying to build product for here is, one, the employees who are making productive use of these agents, right?
    That’s gonna be whatever surfaces they appear in, the connectors they have access to, things like that. Something like this dashboard is for IT, your GRC and governance folks, your AI innovation office, your security [01:01:00] team, right? The stakeholders in your company that are responsible for successfully deploying into
    the spaces where your employees work, as well as doing so in a safe way that is consistent with all the regulatory requirements that you have and customer attestations and things like that. So it is an iceberg beneath the actual end user. It’s,
    swyx: yeah, I guess every layer in the UI is like going down a layer of abstraction in terms of the agent, right?
    Yep. Yeah. Yeah. I think it’s good.
    Ryan Lopopolo: Yeah. The ability to dive deep into the individual agent trajectory level is gonna be super powerful.
    Not only from like a security perspective, but also from someone who is accountable for developing skills. One thing that was interesting, that we also blogged about shipping, was an internal data agent, which uses a lot of the Frontier technology to make our data ontology accessible to the agent, and things like that, to understand
    what’s actually in the data [01:02:00] warehouse.
    swyx: Yeah, semantic layer type things. Yes. I was briefly part of that world. It’s actually really hard for humans to agree on what revenue is. Yes.
    Ryan Lopopolo: Yes.
    swyx: What is an active user?
    Ryan Lopopolo: There’s what, five data scientists in the company that have defined this Golden.
    swyx: Yeah. And there’s also internal politics, yes, as to attribution: I’m marketing, I’m responsible for this much, and sales is responsible for this much, and they all add up to more than a hundred. And I’m like, you guys have different definitions.
    Vibhu: Yeah. And if you’re a startup, everything is ARR.
    swyx: So I think that’s cool.
    Oh, you guys blog about this. Okay. I didn’t see this. Yeah. Is this the same thing? I don’t know. This is what you’re referring to? Yes. Okay. We’ll send people to read this. This is our data.
    Vibhu: Hmm, this one.
    swyx: Yeah. I don’t know, do you have any highlights? I
    Vibhu: No. In general from the playlist.
    Yeah. A lot of good things to read.
    swyx: Yeah. Yeah. Lots of homework for people. No, but data as the feedback layer: you need to solve this first in order to have the product’s feedback loop closed. That’s right. So for the agents to understand. And this is something that humans have not solved.
    Ryan Lopopolo: this is [01:03:00] how you build agents that do more than coding, right? Yeah.
    swyx: Yeah.
    Ryan Lopopolo: To actually understand how you operate the business.
    swyx: Yeah.
    Ryan Lopopolo: You have to understand what revenue is, what your customer segments are. Yeah. What your product lines are.
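    A minimal sketch of what such a semantic layer looks like in code: every consumer, human or agent, resolves a metric through one blessed definition instead of re-deriving it. The metric names, SQL fragments, and function name here are invented examples, not from OpenAI's data agent:

    ```python
    # One canonical definition per business metric, so "revenue" means the same
    # thing to the five data scientists and to every agent. Definitions are made up.
    METRICS = {
        "revenue": "SUM(amount) FILTER (WHERE status = 'settled')",
        "active_user": "COUNT(DISTINCT user_id) FILTER (WHERE events_7d > 0)",
    }

    def metric_sql(name: str) -> str:
        """Resolve a metric name to its single blessed SQL expression."""
        if name not in METRICS:
            raise KeyError(f"undefined metric {name!r}; add it to the semantic layer first")
        return METRICS[name]
    ```

    Forcing lookups through one registry is what prevents marketing and sales from quietly computing different "revenues" that add up to more than a hundred percent.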
    [01:03:13] Company Context and Memes
    Ryan Lopopolo: Looping back to the codebase that we described here for harnessing, one thing that’s in core beliefs.md is who’s on the team, what product we’re building, who our end customers are,
    who our pilot customers are, what the full vision of what we want to achieve over the next 12 months is. These are all bits of context that inform how we would go about building the software. Oh my God. So we have to give it to the agent too.
    Vibhu: I’m guessing that stuff is like pretty dynamic and it changes over time too, right?
    Like part of it was, it’s not just a big spec. You have it as one of the things and it will iterate.
    Ryan Lopopolo: One thing that I think is gonna break your mind even more is we have skills for how to properly generate deep-fried memes and emoji culture [01:04:00] in Slack. Because with the Slack ChatGPT app that you’re able to use with Codex, I can get the agent to s**tpost on my behalf.
    It’s just part of humor.
    swyx: Meme humor. Humor is part of AGI. Is it funny? It is pretty good, yeah. Okay. Yeah,
    Ryan Lopopolo: it’s pretty good at making
    swyx: Deep, it’s a lot of I think humor is like a really hard intelligence test, right? It’s like you have to get a lot of context into like very few words.
    This is why make references
    Ryan Lopopolo: is why five four is such a big uplift for our it’s the me. Yeah, for sure. Yeah. Yeah.
    swyx: It’s very cool.
    Vibhu: So 5.4 can s**tpost. So that’s the takeaway here.
    Ryan Lopopolo: Yeah. Maybe when y’all are done here today, ask Codex to go over your coding agent sessions and to roast you.
    swyx: Love it. I’ll give it a shot. Coming back to the final point I wanted to make: yeah, I think there are multiple others. Like you guys are working on this, but this is a pattern that every other company out there should adopt. Yes. Regardless of whether or not they work with you.
    To me, this is I saw this, I was like, f**k, [01:05:00] every company needs this. This
    is
    swyx: multiple billions.
    Ryan Lopopolo: This is what it takes to get
    swyx: Yeah.
    Ryan Lopopolo: People to Yes. Yeah. Actually realize the benefits. Yes. And distribute.
    swyx: And I think it sounds boring to people, like, oh, it’s for safeguards and whatever. But to handle agents at scale like you are envisioning here, and I don’t know if it’s a real screenshot or a demo, this is what you need.
    This is my original sort of view of what Temporal was supposed to be: that you build this dashboard and you basically have every long-running process in the company Yes. in one dashboard, and that’s it. That’s right.
    Vibhu: Yeah. I think it’s pretty customized towards every enterprise, right?
    Like you care about different things.
    swyx: There’s a lot of customization, but there’ll be multiple unicorns just doing this as a service. I don’t know, I’m very Frontier-pilled, if you can tell. Amazing. But it only clicked, ‘cause obviously this came out first, then harness engineering, then Symphony. It only clicked for me that this is actually the thing you shipped to do that.
    Ryan Lopopolo: Yeah. Yeah. There’s a set of building blocks here that we assembled into these agents, [01:06:00] and the building blocks themselves are part of the product, right? Yeah. The ability to steer, revoke authorization if a model becomes misaligned, all of this is accessible through Frontier. And there’s gonna be a bunch of stakeholders in the company that have the things they need to see in the platform. Yeah.
    To get to. Yes. So we’ll build all of those into Frontier so that we can actually do the widespread deployment. Yeah. That’s the fun part.
    swyx: Yeah. I’m also calling back to the levels of AGI. I don’t know if OpenAI is still talking about this, but they used to talk about five levels of AGI, and one of them was like, oh, it’s like an intern coding software.
    At some point it was AI organization, and this is it. That’s right. This is level four or five, I can’t remember which level, but it’s somewhere along that path.
    Ryan Lopopolo: You know how I mentioned that my team is having fun sprinting ahead here. We do this thing where we’re collecting all the agent trajectories from Codex, to slurp them up and distill them.
    This is what it means to build our team [01:07:00] level knowledge base; we happen to reflect it back into the codebase. But it doesn’t have to be that way, and it doesn’t have to be bound to just Codex. I want ChatGPT to also learn our meme culture and the product we are building, so that when I go ask it, it also has the full context of the way I do my work. I’m super excited for Frontier to enable this.
    swyx: Yeah. Amazing.
    [01:07:21] Harness vs Training Tension
    swyx: What do the model people say when they see you do this? You have a lot of feedback, obviously, you have a lot of usage, a lot of trajectories. I don’t imagine a lot of it’s useful to them, but some of it is.
    Vibhu: You have this too: you deploy a billion tokens of intelligence a day, and this was at the beginning of 2026.
    You’re, yeah, cooking.
    Ryan Lopopolo: Yeah, there’s this fundamental tension, which I think you have talked about, between whether we invest deeper into the harness or we invest deeper into the training process to get the model to do more of this by default. Yeah, and I think success for the way we are [01:08:00] operating here means the model gets better taste, because we can point the way there, and none of the things we have built actively degrade agent performance.
    ‘cause really all they’re doing is running tests, and running tests is a good part of what it means to write reliable software. If we were building an entire separate Rust scaffold around Codex to restrict its output, that I think would be additional harness that would be prone to being scrapped.
    But yeah, if instead we can build all the guardrails in a way that’s just native to the output that Codex is already producing, which is code, I think there’s no friction with how the model continues to advance, but it’s also just good engineering, and that’s the whole point.
    swyx: Yeah. So I’ve had similar discussions with research scientists where the RL equivalent is on policy versus off policy.
    Yeah. And you’re basically saying that you should build an on policy harness, which is already within distribution and you [01:09:00] modify from there. But if you build it off policy, it’s not that useful.
    Ryan Lopopolo: That’s right.
    swyx: Super cool. Any, anybody thoughts, any things that we haven’t covered that we should get it, get out there?
    [01:09:08] Closing Thoughts & OpenAI Hiring
    Ryan Lopopolo: Just, I’ve been super excited to benefit from all the cooking that the Codex team has been doing. Yes. They absolutely ship relentlessly. This is one of our core engineering values, ship relentlessly, and the team there embodies it to an extreme degree. Yeah, to have 5.3, and then Spark and 5.4, come out within what feels like a month is just phenomenally fast.
    swyx: It’s exactly a month: a month ago it was 5.3, and yesterday was 5.4. Do we have one every month now? Is 5.5 next? Exactly.
    Ryan Lopopolo: I can’t say. The prediction markets would be very upset.
    swyx: I think it’s interesting that it’s also correlated with the growth. They announced that it’s 2 million users, but I almost don’t care about Codex anymore.
    This is it, this is the endgame. It’s like, coding, cool; software, knowledge work.
    Ryan Lopopolo: That’s right. That’s right. This is the thing to chase after. Yeah. And this is one of things that my team is excited to support,
    swyx: get the whole [01:10:00] self-hosted harness thing working, which you have done, and the rest of us are trying to figure out how to catch up, but then do things.
    You That’s right. With you
    Vibhu: do things.
    swyx: That’s right. You can just do things. That’s the line for the episode.
    Vibhu: That’s it. Any other calls to action? You’re based in Seattle, your team, I’m guessing? New Bellevue office.
    Ryan Lopopolo: New Bellevue office. We just had the grand opening yesterday, as of the recording date, which was fantastic.
    Beautiful buildings. Super excited to be part of the Bellevue community, building the future in Washington. And I would say that there is lots of work to be done in order to successfully serve enterprise customers here in Frontier. We are certainly hiring, and if you haven’t tried the Codex app yet, please give it a download.
    We just passed 2 million weekly active users, growing at a phenomenally fast rate, 25% week over week. Come join us.
    swyx: Yes. And I think that’s interesting. My final observation: OpenAI is a very San Francisco-centric company. I know people who [01:11:00] turned down the job or didn’t get the job ‘cause they didn’t want to move to SF, and now they just don’t have a choice.
    You have to open London, you have to open Seattle. And I wonder if that’s gonna be a shift in the culture. Obviously you can’t say, but
    Ryan Lopopolo: I was one of the first engineering hires out of our Seattle office, so Yeah.
    swyx: See, it was very natural.
    Ryan Lopopolo: Its success has been part of what I have been building toward, and it has grown quite well, right?
    Yeah. We have durable products in the lines of business that are built out of there, a ton of zero-to-one work happening as well, which is the core essence of the way we do applied AI work at the company: to sprint after the new, to figure out where we can actually successfully deploy the model.
    Yeah. Yes. A hundred percent. We also have a New York office that has a ton of engineering presence.
    swyx: Yeah. Exactly. These are my roadmaps: wherever people are hiring engineers, I will go. That’s right.
    Vibhu: It’s a cool office too. New York is an old REI building, I believe, the REI office.
    swyx: No, it’ll never be as big. In New York you can’t get [01:12:00] the size of office that they need.
    Ryan Lopopolo: The New York office has a very Mad Men vibe. It’s beautiful. The Bellevue one is very green, gold fixtures, very Pacific Northwest. It’s a very cool place, the vibe.
    Vibhu: Yeah. A lot of people are there because they like New York. They wanna be in New York, right?
    Ryan Lopopolo: Yeah. Yeah. We have a fantastic workplace team that has been building out these offices. It really is a privilege to work here. Yeah. Excellent.
    swyx: Okay, thank you for your time. You’ve been very generous, and you’ve been cooking, so I’m gonna let you get back to cooking.
    It’s been amazing to be with you folks. Happy Friday. Happy Friday.


    This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe
