Notes on Optimizing in Non-Differentiable Worlds (I)
Modern machine learning is built on gradient-based optimization methods, particularly backpropagation. All arguments that neural networks alone will bring about superintelligence rest on one of two assumptions: 1. the world is learnable by continuous functions, or 2. a simplified, continuous representation of the world is enough to create intelligence more powerful than humans. If either of these is true, then neural networks alone will bring about superintelligence. In this essay, I will argue that both conditions are likely false, and reason from first principles about what developing superior intelligence looks like when they fail. Interestingly, the approach I arrive at can already be found in nature.
1. Assumption One: The World Can Be Learned by Continuous Functions
The first condition underpinning arguments that neural networks alone can bring about superintelligence is the assumption that the world can be learned by continuous functions, that is, that the underlying structure of the world can be captured through smooth function approximators, like neural networks, which rely on small changes in input producing small changes in output. This assumption enables training via gradient descent: a model takes in observations, computes a loss based on how well its output matches a target, and adjusts its internal parameters through infinitesimal updates. Over time, with enough data and compute, it is expected that such a system can approximate optimal behavior even in complex environments. This logic drives end-to-end learning, model-based reinforcement learning, and world models: all assume that the world’s dynamics can be represented and improved upon through continuous transformations learned from data.
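To make the premise concrete, here is a minimal sketch of the kind of training loop this assumption licenses: a toy linear model fit by gradient descent on a smooth mean-squared-error loss. The data and model are placeholders, not any particular system.

```python
# Minimal sketch of gradient-based learning on a smooth objective (illustrative only).
# Fits y = w * x with mean-squared-error loss, where small parameter updates
# reliably produce small changes in the loss.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x + rng.normal(scale=0.1, size=100)  # smooth, learnable relationship

w = 0.0
lr = 0.05
for step in range(200):
    pred = w * x
    grad = np.mean(2 * (pred - y) * x)  # d(MSE)/dw, well-defined everywhere
    w -= lr * grad                      # infinitesimal nudge toward lower loss

print(round(w, 3))  # converges near 3.0 because the loss surface is smooth
```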
But this assumption glosses over a critical truth: much of the real world is not continuous in the ways that matter. Feedback is often sparse, delayed, or discrete. In robotics, for instance, a small motor command can cause a catastrophic failure, such as a slip, a collision, or a break, where the consequences are abrupt and irreversible. In language, changing a single word can invert meaning; in reasoning, a proof either holds or it doesn’t. These are not gradual shifts but structural breaks. In such domains, the relationship between action and outcome is governed by thresholds, binary outcomes, and discrete symbolic rules, not by smooth transitions. This undermines the foundational premise of continuous function learning: there is no reliable gradient to follow when the system jumps unpredictably from one regime to another.
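A toy example makes the problem visible. Suppose the outcome of a motor command is thresholded (the threshold and values below are invented for illustration): the finite-difference gradient is zero almost everywhere and explodes at the cliff, so there is nothing useful to follow.

```python
# Toy illustration of a discontinuous outcome: the feedback is flat almost everywhere,
# so a finite-difference "gradient" carries no signal until the cliff is crossed.
def outcome(force: float) -> float:
    # Hypothetical thresholded dynamics: below 1.0 the grasp holds, above it the object breaks.
    return 1.0 if force < 1.0 else -10.0

eps = 1e-4
for force in [0.2, 0.6, 0.99995]:
    grad_estimate = (outcome(force + eps) - outcome(force - eps)) / (2 * eps)
    print(force, grad_estimate)
# Prints 0.0, 0.0, then a huge negative spike: no smooth slope to follow, only a cliff.
```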
Even world models, which attempt to simulate future states in latent space, inherit this flaw. They rely on internal representations that assume continuous transitions and learnable gradients between states. Yet the real world includes hidden variables, chaotic events, and discontinuities that are not captured in training data. A world model may interpolate beautifully within the training distribution, but fail catastrophically when faced with states outside of it: not because the model lacks scale, but because the world defies continuous approximation where it matters most.
In short, neural networks are powerful tools for learning smooth approximations of the world. But when the world’s structure is discontinuous, symbolic, or only partially observable, the assumptions underlying gradient-based learning collapse. Intelligence in such environments cannot be built solely on infinitesimal nudges through continuous landscapes. It must reason, search, and adapt in ways that transcend smooth function fitting.
2. Assumption Two: A Simplified, Continuous Representation of the World Is Enough
If the real world cannot be learned directly through continuous functions, as argued in Assumption One, the fallback hope in modern machine learning is that this doesn't matter: perhaps we can instead construct a simplified, continuous approximation of the world that is smooth enough to train on and close enough to reality that the resulting model generalizes effectively. This approximation acts as a surrogate for the world, and the assumption is that optimizing performance within this proxy space will transfer to intelligent behavior in the real world.
This logic underlies many of today’s most advanced machine learning systems. Large language models are trained to minimize differentiable losses like next-token prediction. Model-based reinforcement learning systems optimize smooth objectives such as expected future reward. Even when the real-world environment is discrete, delayed, or brittle, the training setup is restructured to look continuous by using simulators, embedding spaces, or differentiable reward functions. In this setup, backpropagation can proceed as if the world were well-behaved and smooth.
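As a concrete example of such a proxy, here is a minimal sketch of a single next-token cross-entropy step, with a made-up four-token vocabulary. The point is only that the quantity is smooth and differentiable regardless of whether the surrounding task succeeds or fails discretely.

```python
# Sketch of the smooth proxy objective behind language-model training:
# next-token cross-entropy is differentiable everywhere, even when the
# surrounding task (a proof, a plan) succeeds or fails discretely.
import numpy as np

def next_token_loss(logits: np.ndarray, target_id: int) -> float:
    # Standard softmax cross-entropy for one prediction step.
    logits = logits - logits.max()                  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[target_id]

logits = np.array([2.0, 0.5, -1.0, 0.0])            # toy vocabulary of 4 tokens
print(next_token_loss(logits, target_id=0))          # small loss: the favored token is correct
print(next_token_loss(logits, target_id=2))          # larger loss, but still a smooth quantity
```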
But this assumption inherits the same blind spots as Assumption One, just one level removed. A simplified, continuous representation of the world is always a lossy filter. It imposes structure where none exists, smooths over sharp transitions, and often discards the very dynamics that are most important for generalization. The more you tune the proxy for smooth learnability, the more it risks drifting from the actual causal and symbolic structure of the task.
Large language models highlight this failure mode vividly. They are trained on next-token prediction, a continuous, gradient-friendly objective, yet we expect them to reason, plan, and maintain coherence across long time horizons. These are not next-token problems: they require discrete commitments, causal consistency, and symbolic reasoning. When pushed into these domains, LLMs often hallucinate, contradict themselves, or fail to recover from logical errors. This is not due to lack of scale, but due to a structural mismatch: the smooth proxy objective does not preserve the brittle, compositional structure of the real task.
The same principle applies beyond language. In robotics, policies trained in clean simulators often fail catastrophically in the real world: a phenomenon known as the Sim2Real gap. In reinforcement learning, agents trained on shaped, differentiable reward functions frequently exploit the proxy rather than solving the intended task: a form of reward hacking. In safety-critical systems, approximated models miss rare but high-impact edge cases. These are not implementation bugs. They are the predictable consequence of training systems on a smooth abstraction of a discontinuous world.
In short, even if we accept that the world is not learnable through continuous functions and instead work with differentiable approximations, we are still operating under the same flawed premise: that smoothness is sufficient. But real-world intelligence depends on navigating sharp transitions, preserving symbolic constraints, and handling irreversibility: qualities that continuous proxies do not naturally encode. As the complexity of the world increases, so too does the likelihood that something essential gets lost in the smoothing process.
3. Evolutionary Search for Optimization in Non-Differentiable Worlds
This leaves us with a problem: if the world is discontinuous and therefore non-differentiable, and simplified representations are insufficient, how do we build superintelligent systems that can operate effectively in such environments? We need to create systems that can optimize in non-differentiable worlds, but what does that entail?
To start, here are the characteristics that make the world non-differentiable (a toy environment sketch after the list illustrates them):
- Partial Observability: The optimizer does not have access to the full state of the environment. It must act based on limited, noisy, or indirect information. This makes it impossible to compute a true error signal at each step.
- Delayed and Sparse Feedback: Rewards or outcomes do not arrive immediately after an action, and they may occur infrequently. This decouples cause from effect, making it difficult to assign credit or blame to any particular step in the process.
- Irreversibility and Sharp Transitions: Many real-world environments contain cliffs: single decisions that irreversibly destroy future potential. These aren’t smooth valleys you can climb out of; they’re edges you can fall off. In such cases, the cost of a bad update can be catastrophic.
- Combinatorial and Discrete Structure: Often, the space of strategies is fundamentally discrete or symbolic. There is no smooth interpolation between “write a valid proof” and “write gibberish”, only distinct categories of success and failure.
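Here is the toy environment sketch referenced above, packing all four properties into a few lines. The hidden sequence, noise level, and cliff penalty are arbitrary choices for illustration, not a model of any real task.

```python
# Toy environment exhibiting the four properties above (purely illustrative).
import random

class CliffSequenceEnv:
    """Pick a sequence of discrete symbols; feedback arrives only at the end."""

    def __init__(self, secret=(2, 0, 3, 1), noise=0.3):
        self.secret = secret        # hidden target sequence: never observed directly
        self.noise = noise

    def observe(self, step: int) -> int:
        # Partial observability: only a noisy hint about the current target symbol.
        hint = self.secret[step]
        return hint if random.random() > self.noise else random.randrange(4)

    def evaluate(self, actions) -> float:
        # Sparse, delayed feedback: a single number for the whole trajectory.
        if actions[0] != self.secret[0]:
            return -10.0            # irreversible cliff: an early mistake ends everything
        # Combinatorial structure: credit comes from exact discrete matches, no smooth slope.
        return float(sum(a == t for a, t in zip(actions, self.secret)))

env = CliffSequenceEnv()
print(env.evaluate([2, 0, 3, 1]))   # 4.0: full reward only for the exact sequence
print(env.evaluate([1, 0, 3, 1]))   # -10.0: a first-step mistake falls off the cliff
```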
Our goal is to build systems that can optimize in environments that exhibit these characteristics. Given enough resources, these systems should be able to outperform human intelligence in the non-differentiable environments that we care about. From first principles, optimization in such environments requires two things: 1. exploration that does not rely on gradients, and 2. evaluation that works even when feedback is sparse, delayed, or discrete.
Let's reason through what this implies. In a world that is partially observable, you can’t trust any single viewpoint. So rather than relying on a single agent updating its internal model from limited observations, it becomes useful to deploy many agents, each with slightly different policies, each exploring the world from a different perspective.
In a world with sparse or delayed feedback, learning must happen through whole-trajectory evaluation. You can’t improve behavior by nudging a parameter based on immediate error. Instead, you need to evaluate complete candidate strategies based on their long-term outcomes and compare them to one another.
In a world with sharp transitions and irreversibility, risk becomes existential. A single misstep can destroy the agent’s ability to learn. This means you must preserve redundancy and diversity in the population. If one candidate fails, others must continue. Failure must be distributed, not centralized.
In a world that is combinatorial and discrete, the search space is not smooth. There are no gradients, only possibilities. This requires structured, combinatorial search over configurations. Mutation, not interpolation, is the way to traverse the space. What matters is not a small tweak in the right direction, but the ability to generate viable, testable alternatives.
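Put together, these requirements yield something like the following population loop: whole candidates are generated, evaluated end-to-end on a sparse fitness signal, selected, and mutated, with no gradients anywhere. This is a minimal sketch; the target sequence, population size, and mutation rate are placeholders.

```python
# Minimal population-based search over discrete candidates (illustrative sketch).
import random

SECRET = (2, 0, 3, 1)

def fitness(candidate) -> int:
    # Sparse, trajectory-level evaluation: exact discrete matches only.
    return sum(a == t for a, t in zip(candidate, SECRET))

def mutate(candidate, rate=0.25):
    # Mutation, not interpolation: flip symbols to generate testable alternatives.
    return tuple(random.randrange(4) if random.random() < rate else a for a in candidate)

population = [tuple(random.randrange(4) for _ in range(4)) for _ in range(20)]
for generation in range(30):
    # Redundancy and diversity: many candidates fail, the population survives.
    ranked = sorted(population, key=fitness, reverse=True)
    parents = ranked[:5]                                    # selection prunes low-fitness candidates
    population = parents + [mutate(random.choice(parents)) for _ in range(15)]

best = max(population, key=fitness)
print(best, fitness(best))
```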
We've seen this approach succeed at searching non-differentiable spaces most notably in nature, through evolution. Evolution explores the world with a population of diverse organisms, each with a different genetic makeup. They act in the environment, are evaluated on survival and reproduction, and adapt over generations through mutation and selection. This process does not rely on gradients or smooth feedback; it relies on the ability to generate diverse candidates, evaluate them on long-term outcomes, and preserve those that succeed.
This process can be viewed not just as a random walk through solution space, but as a form of tree search. Each generation branches into a new set of candidates, each with its own trajectory through the environment. Selection acts as a pruning mechanism, collapsing the tree toward regions of high fitness. Mutation introduces new branches. Over time, the search tree grows in structured directions, guided by evaluation, not gradients. In this light, evolutionary algorithms are population-based tree search algorithms, with stochastic branching and delayed feedback.
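That framing can be written down directly: mutation becomes stochastic branching and selection becomes pruning of the frontier. This is only a structural sketch, with the fitness function left abstract.

```python
# The population loop recast as a tree: branching = mutation, pruning = selection (sketch).
import random

class Node:
    def __init__(self, candidate, parent=None):
        self.candidate = candidate
        self.parent = parent        # lineage back to the root, i.e. the search path

def expand(node, n_children=4, rate=0.25):
    # Stochastic branching: each child is a mutated copy of its parent.
    return [Node(tuple(random.randrange(4) if random.random() < rate else a
                       for a in node.candidate), parent=node)
            for _ in range(n_children)]

def prune(frontier, fitness, keep=5):
    # Evaluation-driven pruning collapses the tree toward high-fitness regions.
    return sorted(frontier, key=lambda n: fitness(n.candidate), reverse=True)[:keep]

# Usage: start from a root frontier, then alternate expand and prune over generations.
```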
We also see this in AlphaEvolve and other evolutionary algorithms, which use population-based search to optimize complex problems. They generate a diverse set of candidate solutions, evaluate them based on their performance in the environment, and iteratively refine the population through mutation and selection. This approach does not require differentiable feedback; it relies on the ability to explore the space of possible solutions and evaluate them based on their long-term outcomes.
I've also seen this personally in my work done in preparation for DeepMind's Gemini Diffusion, where I developed a structured metacognition framework that performs GRPO-embedded Monte Carlo tree search at test time to explore the space of possible discrete decoding policies. This approach is a hybrid between evolutionary search and RL that evolves decoding policies to full maturity through tree search, exploring multiple candidates and evaluating their long-term outcomes. My specific algorithm is an informed-mutation form of evolution, where mutation is guided by a learned policy that predicts the most promising candidates from rollouts of the current population, rather than mutating at random and letting the environment decide which policies make it to the next generation. I also see parallels in my work on trajectory-space search scaling for diffusion models in robotics and vision.
4. Moving Forward
The key takeaway is that, in non-differentiable worlds, intelligence cannot rely on gradient-based optimization. While neural networks and differentiable models are powerful tools, they are not sufficient for building superintelligent systems that operate effectively in non-differentiable environments. They will, however, play a large part in the agents used within evolutionary search, as we already see in works like AlphaEvolve and my structured metacognition framework: they are excellent tools for autonomously generating diverse candidates, evaluating them on their long-term outcomes, and converting environmental feedback into information that can guide the search process. Without the advances we have seen in neural networks, the evolutionary search algorithms envisioned here would not be possible, but it is important to remember that neural networks are not the end goal. They are a tool for building more powerful and effective evolutionary search algorithms. A final takeaway for implementation: evolutionary search algorithms look a lot like the tree search algorithms we already use.