Setup
Work in 2D first (the usual “city blocks” picture).
Take two points:
A = (x_1, y_1), \quad B = (x_2, y_2).
The Manhattan distance between them is defined as
d(A,B) = |x_2 - x_1| + |y_2 - y_1|.
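In code, the definition is a one-liner. A minimal Python sketch (the function name `manhattan` is just illustrative) for 2D points given as tuples:

```python
def manhattan(a, b):
    """L1 (Manhattan) distance between two points given as (x, y) tuples."""
    return abs(b[0] - a[0]) + abs(b[1] - a[1])

# Example: A = (1, 2), B = (4, 6) -> |4 - 1| + |6 - 2| = 7
assert manhattan((1, 2), (4, 6)) == 7
```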
Now consider any path from A to B made of horizontal and vertical line segments, each of arbitrary length, but always moving monotonically toward B (never backtracking in x or y).
We want to show:
No matter how you break that path into big steps or small steps, its total length in the Manhattan sense is always
|x_2 - x_1| + |y_2 - y_1|.
The key idea: instead of thinking about steps, think about how many grid lines you cross.
Step 1: Draw the enclosing rectangle
Draw the axis-aligned rectangle with opposite corners at A and B.
It has:
width |x_2 - x_1| (difference in x) and height |y_2 - y_1| (difference in y).
Inside this rectangle lies any monotone “Manhattan path” from A to B: a staircase of horizontal and vertical segments that only moves right/left and up/down without backtracking.
Now overlay the standard unit grid in the plane: vertical lines x = k and horizontal lines y = \ell for integers k,\ell.
Step 2: Count grid-line crossings, not steps
Focus on the vertical grid lines between the x-coordinates of A and B.
Assume for clarity that x_2 > x_1. Then the vertical grid lines you might cross between A and B are:
x = \lfloor x_1 \rfloor + 1, \ \lfloor x_1 \rfloor + 2, \ \dots,\ \lceil x_2 \rceil - 1,
but really we only care about the total horizontal displacement x_2 - x_1, so think of crossing from x_1 to x_2 as crossing |x_2 - x_1| “unit layers” in the x-direction.
Now here is the crucial observation:
Along any monotone path from A to B, each time you move horizontally to the right, you cross those vertical “layers” from left to right. Whether you do it in one giant horizontal step or many tiny horizontal steps, the path must cross each “layer” between x_1 and x_2 exactly once.
So the number of horizontal unit crossings is fixed:
\#(\text{horizontal unit crossings}) = |x_2 - x_1|.
Exactly the same logic applies to the y-direction:
Any monotone path from y_1 to y_2 must cross exactly |y_2 - y_1| “unit layers” vertically. Again, breaking vertical motion into big or small steps doesn’t change how many of those layers you cross.
So we also have:
\#(\text{vertical unit crossings}) = |y_2 - y_1|.
Step 3: Relate crossings to path length
Now define the length of a path in “Manhattan units” as:
1 unit for each unit crossing in the x-direction
plus
1 unit for each unit crossing in the y-direction.
But we just argued:
Every monotone axis-aligned path from A to B has exactly |x_2 - x_1| horizontal unit crossings, and exactly |y_2 - y_1| vertical unit crossings.
Therefore, the total Manhattan length of any such path is
\text{length} = |x_2 - x_1| + |y_2 - y_1|.
Crucially, we never used the size of each step — just the fact that the path connects A to B without backtracking. A long step that crosses 5 “layers” counts for 5 units; the same is true if you take 5 separate unit steps.
Step 4: Why step size doesn’t matter (explicitly)
Take one horizontal segment in your path from x = a to x = b with b > a.
Its contribution to the Manhattan length is |b-a|. If you subdivide it into two steps at some point c with a < c < b, you now have segments a \to c and c \to b. Their contribution is |c-a| + |b-c| = (c-a) + (b-c) = b - a = |b-a|.
So splitting or merging horizontal segments never changes the sum of absolute displacements. The same argument holds for vertical segments.
Since any change in “step size” is just repeated splitting/merging of segments, and each such operation preserves the total, step size cannot change the Manhattan length.
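The splitting/merging argument is easy to verify numerically. The sketch below (points as (x, y) tuples; `path_length` is an illustrative helper) compares a coarse two-segment route with the same route chopped into unit steps:

```python
def path_length(points):
    """Total L1 length of a polyline through the given (x, y) waypoints."""
    return sum(abs(q[0] - p[0]) + abs(q[1] - p[1])
               for p, q in zip(points, points[1:]))

A, B = (0, 0), (3, 2)
coarse = [A, (3, 0), B]                          # one big right step, one big up step
fine = [A, (1, 0), (2, 0), (3, 0), (3, 1), B]    # same route in unit steps

# Both equal |3 - 0| + |2 - 0| = 5: step size doesn't matter.
assert path_length(coarse) == path_length(fine) == 5
```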
Step 5: Extension to higher dimensions (briefly)
In \mathbb{R}^n:
d(A,B) = \sum_{i=1}^n |x_i^{(B)} - x_i^{(A)}|.
The same argument applies coordinate-by-coordinate:
For each coordinate i, any monotone path from x_i^{(A)} to x_i^{(B)} crosses exactly |x_i^{(B)} - x_i^{(A)}| unit layers in that direction. Summing over coordinates gives a total Manhattan distance independent of how the path is chopped into steps.
Summary (the invariant viewpoint)
The invariant is: “number of unit layers crossed in each coordinate direction.” Any monotone, axis-aligned path from A to B crosses exactly |x_2 - x_1| horizontal layers and exactly |y_2 - y_1| vertical layers. Therefore the Manhattan length is always |x_2 - x_1| + |y_2 - y_1|, no matter whether you walk in many small steps or a few giant steps.
Let’s extend the proof to paths with backtracking, and then I’ll connect that to why Manhattan distance is such a big deal in route optimization.
1. Allowing backtracking: length ≥ Manhattan distance
We’re still in 2D with points
A = (x_1, y_1), \quad B = (x_2, y_2).
Consider any path from A to B made of horizontal and vertical segments, but now we allow backtracking: you can go right, left, up, down in any order.
Write the path as a sequence of segments:
Horizontal displacements: h_1, h_2, \dots, h_m (each h_i can be positive = right, or negative = left); vertical displacements: v_1, v_2, \dots, v_k (each v_j can be positive = up, or negative = down).
The total path length in Manhattan sense is
L = \sum_{i=1}^m |h_i| + \sum_{j=1}^k |v_j|.
But the net movement from A to B in each coordinate is:
x_2 - x_1 = \sum_{i=1}^m h_i, \quad y_2 - y_1 = \sum_{j=1}^k v_j.
Now apply the basic inequality |a_1 + \dots + a_n| \le |a_1| + \dots + |a_n| (triangle inequality) to each coordinate:
Horizontal: |x_2 - x_1| = \left|\sum_{i=1}^m h_i\right| \le \sum_{i=1}^m |h_i|. Vertical: |y_2 - y_1| = \left|\sum_{j=1}^k v_j\right| \le \sum_{j=1}^k |v_j|.
Add these inequalities:
|x_2 - x_1| + |y_2 - y_1| \le \sum_{i=1}^m |h_i| + \sum_{j=1}^k |v_j| = L.
But the left-hand side is exactly the Manhattan distance between A and B:
d_{\text{Man}}(A,B) = |x_2 - x_1| + |y_2 - y_1|.
So we have:
L \;\ge\; d_{\text{Man}}(A,B).
When do we get equality?
The triangle inequality becomes equality, |a_1 + \dots + a_n| = |a_1| + \dots + |a_n|, if and only if all the nonzero a_i’s have the same sign (no canceling).
So:
For the x-coordinate, we get equality |x_2 - x_1| = \sum |h_i| iff all horizontal moves are in the same direction (all right OR all left, no back-and-forth). For the y-coordinate, we get equality |y_2 - y_1| = \sum |v_j| iff all vertical moves are in the same direction (all up OR all down).
Thus:
A path from A to B has length exactly equal to the Manhattan distance if and only if it has no backtracking in any coordinate.
The moment you backtrack in x or y, you add extra |h_i| or |v_j| that cancel in net displacement but still count toward length, making the path longer than the Manhattan distance.
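A small sketch makes the inequality concrete: given the signed horizontal and vertical displacements of a path, we can compare its total length against the Manhattan distance implied by the net displacement (`path_cost` and `manhattan_from_net` are illustrative names):

```python
def path_cost(h, v):
    """Total L1 length of a rectilinear path from its signed segment displacements."""
    return sum(abs(x) for x in h) + sum(abs(y) for y in v)

def manhattan_from_net(h, v):
    """Manhattan distance implied by the net displacement of the path."""
    return abs(sum(h)) + abs(sum(v))

monotone = ([2, 1], [3])         # all right, all up: no backtracking
backtrack = ([2, -1, 2], [3])    # right, left, right: same endpoints

assert path_cost(*monotone) == manhattan_from_net(*monotone)   # equality: 6 == 6
assert path_cost(*backtrack) > manhattan_from_net(*backtrack)  # strict: 8 > 6
```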
This generalizes cleanly to \mathbb{R}^n:
For a path broken into pieces with displacement vectors \Delta \mathbf{x}_1,\dots,\Delta \mathbf{x}_N,
Path length in L1: \sum_{k=1}^N \|\Delta \mathbf{x}_k\|_1. Net displacement: \sum_{k=1}^N \Delta \mathbf{x}_k = B - A. Triangle inequality: \|B-A\|_1 \le \sum_{k=1}^N \|\Delta \mathbf{x}_k\|_1. Equality iff there’s no “sign conflict” in any coordinate (no backtracking along any axis).
2. Why Manhattan distance matters so much in route optimization
Now: why is this metric so central in routing problems?
(a) It’s the true cost on grid-like networks
In many real routing problems, you’re basically moving on a grid:
City streets laid out in blocks. Warehouse aisles. Chip design / VLSI wiring channels. Robots moving on a 4-connected grid map.
If you can only move horizontally/vertically and each unit of movement costs the same, the physical cost between two points is exactly their Manhattan distance.
So:
Shortest path length between two “intersections” on a perfect grid is exactly the L1 distance. Any optimal route from A to B in such a network has cost = Manhattan distance. Any route with detours (backtracking, going around obstacles more than necessary) will have cost > Manhattan distance.
So Manhattan distance is:
The natural metric for orthogonal, grid-like movement where diagonal moves aren’t allowed or cost more.
(b) It’s a guaranteed lower bound (admissible heuristic)
In more complex settings—obstacles, one-way streets, congestion, forbidden zones—you often can’t just “walk the rectangle.” But the Manhattan distance is still:
A lower bound on any feasible path cost (you can’t get there with less horizontal+vertical displacement than |\Delta x| + |\Delta y|). Easy to compute: just add absolute coordinate differences.
This makes it a fantastic heuristic in algorithms like A*:
Heuristic h(n) = Manhattan distance from node n to the goal. It’s admissible (never overestimates the true cost). And often consistent (triangle inequality holds nicely).
Result: A* explores far fewer nodes than something like Dijkstra’s, but still returns an optimal path.
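To make this concrete, here is a minimal A* on a 4-connected grid with the Manhattan heuristic. This is a sketch, not a production implementation; the grid encoding and function name are assumptions:

```python
import heapq

def astar(start, goal, walls, width, height):
    """A* on a 4-connected grid; h = Manhattan distance (admissible here)."""
    def h(p):
        return abs(p[0] - goal[0]) + abs(p[1] - goal[1])

    frontier = [(h(start), 0, start)]   # (f = g + h, g, node)
    best_g = {start: 0}
    while frontier:
        f, g, node = heapq.heappop(frontier)
        if node == goal:
            return g                     # optimal cost, since h never overestimates
        if g > best_g.get(node, float("inf")):
            continue                     # stale queue entry
        x, y = node
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if 0 <= nxt[0] < width and 0 <= nxt[1] < height and nxt not in walls:
                ng = g + 1
                if ng < best_g.get(nxt, float("inf")):
                    best_g[nxt] = ng
                    heapq.heappush(frontier, (ng + h(nxt), ng, nxt))
    return None                          # goal unreachable

# Empty 5x5 grid: optimal cost equals the Manhattan distance, 4 + 4 = 8.
assert astar((0, 0), (4, 4), set(), 5, 5) == 8
# A wall forces a detour: cost strictly exceeds the Manhattan lower bound.
assert astar((0, 0), (4, 0), {(2, 0), (2, 1)}, 5, 5) > 4
```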
So in grid-based route optimization, Manhattan distance is the standard heuristic because:
It’s cheap to compute. It’s tight enough to really help pruning. It’s guaranteed safe: never rules out the true optimal path.
(c) It matches actual constraints better than Euclidean distance
In many practical systems, you can’t go straight-line even if you want to:
Streets are orthogonal; you can’t cut diagonally through buildings. Aisles in warehouses constrain movement. Printed circuit layouts constrain wires to horizontal/vertical tracks.
Euclidean distance \sqrt{\Delta x^2 + \Delta y^2} is often too optimistic; it assumes you can go straight from A to B.
Manhattan distance:
Respects the fact that you must break motion into horizontal + vertical segments. Often correlates much better with true travel time or cost.
So, route optimization systems that respect such constraints often rely on Manhattan distance for planning and cost estimation.
(d) It induces a simple structure of optimal paths
In a pure grid with uniform cost and no obstacles:
Every optimal route from A to B is a path that uses exactly |\Delta x| horizontal moves and |\Delta y| vertical moves, in any order. That structure is combinatorially simple: number of shortest paths is \binom{|\Delta x| + |\Delta y|}{|\Delta x|}.
This simplicity helps:
Counting routes (for reliability / redundancy planning). Designing routing policies in traffic networks (e.g., many shortest paths). Analytical work in queueing networks or random walks.
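The binomial count above is easy to cross-check by brute force for small displacements:

```python
from math import comb
from itertools import permutations

dx, dy = 3, 2
# Closed form: choose which of the dx + dy moves are horizontal.
closed_form = comb(dx + dy, dx)

# Brute force: distinct orderings of the move string 'R'*dx + 'U'*dy.
brute = len(set(permutations("R" * dx + "U" * dy)))

assert closed_form == brute == 10
```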
(e) It generalizes well to higher dimensions and other domains
In higher dimensions, the same ideas power:
Routing for multi-layer chip layout (x, y, sometimes time or layer as extra dimensions). Warehouse problems with vertical lifts (x, y, z). Some clustering and assignment problems where \ell_1 cost is more robust.
In all these, Manhattan distance:
Encodes “total effort along constrained axes.” Works cleanly with linear programming and combinatorial optimization.
3. Putting it together
Mathematical side: for any axis-aligned path, path length in the L1 metric is at least the Manhattan distance, with equality exactly when there’s no backtracking in any coordinate. Optimization side: that makes Manhattan distance the exact optimal cost on ideal grid networks; a sharp lower bound on route cost in constrained, grid-like environments; a natural, admissible heuristic in search algorithms like A* for pathfinding on grids; and a metric that respects real-world movement constraints better than Euclidean distance in many “corridor/road/aisle” settings.
Let’s generalize the Manhattan-distance logic to abstract solution spaces, where the “coordinates” are no longer literal x/y positions, but dimensions of transformation between partial solutions or states of knowledge.
This takes us from city blocks to state-space geometry — the core of AI search theory.
1. FIRST: What Manhattan distance really measures
Manhattan distance is not fundamentally about streets or grids.
It is about something deeper:
Manhattan distance measures the minimum total change needed along independent dimensions.
Those dimensions might be:
spatial axes (x, y, z), features in a configuration space, independent constraints in a problem, bits in a binary string, variables in an assignment, character edits in a sequence, steps in a logical derivation.
In any space where movement occurs by independent incremental adjustments, the Manhattan metric naturally appears.
This is why it easily escapes physical geometry:
it is the distance metric for “problems decomposable into atomic, axis-separated edits.”
2. Manhattan distance in abstract state spaces
2.1 States as vectors of decisions or attributes
In many AI search problems, each “state” has attributes:
s = (a_1, a_2, \dots, a_n)
A successor state changes one or more attributes by a small amount:
flip a bit, change a variable, assign a value, move a symbol, rewrite a token, alter a constraint.
If each step modifies only one dimension at a time, then:
The minimum number of steps to transform state A into state B
is ≥ Manhattan distance in attribute-space.
Because:
Each difference in each attribute must be corrected, and changes cannot be done “diagonally” (two dims at once) unless the operator allows it.
Thus Manhattan distance still measures minimum editing effort, not spatial distance.
2.2 Classical example: search on binary strings
A binary string like 101101 is just a point in \{0,1\}^6 \subset \mathbb{R}^6, with each coordinate 0 or 1.
Distance between two strings:
d_{\text{Man}}(x,y) = \sum |x_i - y_i| = \text{number of bits that differ}.
This is the Hamming distance, a Manhattan metric on the hypercube.
Why does it matter?
If you can flip only one bit per move, then Manhattan/Hamming distance is the minimum number of moves required.
Any backtracking (flip → unflip) adds more steps but cannot reduce the minimum.
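The correspondence is immediate in code; treating bit strings as 0/1 vectors, the Manhattan sum is exactly the number of differing bits:

```python
def hamming(x, y):
    """Hamming distance = Manhattan distance between 0/1 vectors."""
    return sum(abs(a - b) for a, b in zip(x, y))

x = [1, 0, 1, 1, 0, 1]
y = [1, 1, 1, 0, 0, 1]
assert hamming(x, y) == 2   # positions 2 and 4 differ (1-indexed)
```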
2.3 Constraint satisfaction problems (CSPs)
A CSP state might consist of:
values assigned to some variables, domains of remaining variables, consistency flags, partial constraint propagation.
Two states differ in certain variable-values.
Minimum steps needed to go from one to the other (e.g., via local search or hill climbing) depends on how many independent changes you must make.
Thus the heuristic:
h(s) = \sum_{i=1}^n |v_i^{(s)} - v_i^{(goal)}|
gives a Manhattan-style lower bound on how many assignments must change (for categorical domains, replace each |·| term with an indicator of whether the variable differs, i.e. a Hamming count).
This becomes an admissible heuristic for A*, IDA*, or heuristic CSP solvers.
3. Manhattan distance in AI planning and problem solving
3.1 STRIPS-style planning
A STRIPS state is a set of propositions (facts).
An action changes some facts (adds and deletes).
Each fact-difference between the current state and goal state is like a different “coordinate.”
If actions modify one proposition at a time, then:
Each needed fact-change contributes at least 1 to the cost. So Manhattan distance over “fact difference count” is a lower bound.
Planner heuristics like h_add, h_max, and delete-relaxation heuristics often implicitly compute a Manhattan-like sum over unsatisfied subgoals.
3.2 Rule-based inference and theorem proving
In a proof search problem:
Each logical requirement not yet derived is like a missing “coordinate.” Each inference rule can “move” one dimension toward satisfaction.
The number of unsatisfied conditions is a natural Manhattan-distance heuristic:
h(s) = \text{# of subgoals not yet derived}
This is the basis of many heuristics in:
AND/OR graph search, resolution theorem proving, and backward-chaining rule systems.
4. Manhattan distance in machine learning optimization landscapes
This is subtler but crucial.
4.1 Parameter spaces with L1-regularization
In LASSO, compressed sensing, or sparse coding:
\|\theta\|_1 = \sum_i |\theta_i|
is literally a Manhattan distance in parameter space.
Why useful?
Encourages sparse, axis-aligned solutions. Penalizes the total “amount of coordinate movement.” Makes the solution space geometrically polyhedral.
Thus the optimization prefers solutions with fewer nonzero coordinates—because the Manhattan ball is a diamond shape with sharp corners, forcing sparsity.
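One concrete place this geometry shows up is the proximal operator of the L1 norm, soft thresholding, which drives small coordinates exactly to zero. A minimal sketch with plain Python lists (illustrative names):

```python
def soft_threshold(theta, lam):
    """Proximal operator of lam * ||theta||_1: shrink each coordinate toward 0.

    Coordinates with |t| <= lam land exactly at zero -> sparsity.
    """
    return [max(abs(t) - lam, 0.0) * (1 if t > 0 else -1) for t in theta]

# Small coordinates are zeroed out; large ones are shrunk by lam.
assert soft_threshold([3.0, -0.2, 0.5, -2.0], 0.5) == [2.5, 0.0, 0.0, -1.5]
```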
4.2 Neural network weight updates (approximate)
If we treat SGD steps as small axis-parallel perturbations:
updates adjust each weight independently, directions are composed of many coordinate-wise changes,
then the “effort” to move from parameter vector \theta_1 to \theta_2 is bounded below by their Manhattan distance.
This is not used as a direct heuristic, but it explains:
why sparse gradients matter, why coordinate descent methods behave like Manhattan routing, why L1 metrics often approximate difficulty of optimization jumps.
5. Manhattan distance in search heuristics (A*, IDA*, hill-climbing)
In pathfinding on abstract graphs (not 2D maps), Manhattan distance reappears when:
states have vector-like descriptors, moving to successor states modifies coordinates independently, each coordinate difference must be handled separately.
Examples:
(a) Sliding tile puzzles (8-puzzle, 15-puzzle)
Classic heuristic:
h = \sum_i \left( |x_i - x_i^*| + |y_i - y_i^*| \right)
It is exactly Manhattan distance between each tile’s current and goal positions.
Why? Each tile must move independently along orthogonal axes.
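This heuristic is only a few lines of code. The sketch below (assuming 3x3 states encoded as flat tuples read row by row, with 0 for the blank) computes the sum of per-tile Manhattan distances:

```python
def puzzle_manhattan(state, goal):
    """Sum of per-tile Manhattan distances for a 3x3 sliding puzzle.

    States are 9-tuples read row by row; 0 marks the blank, which is
    excluded from the sum (the blank has no goal cost of its own).
    """
    pos = {tile: divmod(i, 3) for i, tile in enumerate(goal)}
    total = 0
    for i, tile in enumerate(state):
        if tile == 0:
            continue
        r, c = divmod(i, 3)
        gr, gc = pos[tile]
        total += abs(r - gr) + abs(c - gc)
    return total

goal = (1, 2, 3, 4, 5, 6, 7, 8, 0)
state = (1, 2, 3, 4, 5, 6, 7, 0, 8)   # tile 8 is one slide from home
assert puzzle_manhattan(state, goal) == 1
```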
(b) Rubik’s Cube (with abstractions)
Pattern-database heuristics measure “how many facelet changes remain” — essentially Manhattan distance in a high-dimensional symbolic space.
(c) String edit distance (Levenshtein) and L1
If substitutions, insertions, deletions are decomposed into atomic steps:
Minimum edit cost ≥ Manhattan distance of differences, because each difference must be repaired independently.
6. THE KEY REASON: Manhattan distance expresses irreducible, coordinate-wise minimum effort
Physical space is only one example.
In any search space where:
progress occurs along separable dimensions, each operator adjusts only one dimension at a time (or a small fixed set), you cannot “do two improvements at once” unless the operator explicitly allows it,
then:
Manhattan distance gives the admissible lower bound on the effort to reach the goal.
This is why it is used across domains:
puzzle solving, planning, inference, symbolic search, optimization, coding theory, evolutionary algorithms, heuristic-guided AI, parameter tuning and metaheuristics.
7. Why this matters so much in AI search
(1) It is an admissible heuristic
It never underestimates the minimum cost to fix coordinate differences.
(2) It decomposes problems into independent subproblems
Each coordinate difference is a subgoal cost.
(3) It gives a tight lower bound
Better than Euclidean in discrete/orthogonal domains.
(4) It aligns with real operator semantics
Most search operators modify one feature at a time.
(5) It shapes efficient pruning
A*, IDA*, hill-climbing, beam search, and heuristic BFS all benefit.
(6) It generalizes naturally
Physical routing → abstract attribute differences → symbolic transformation spaces.
Summary Explanation
Manhattan distance, when interpreted abstractly, is the metric of minimum coordinate-wise change.
Whenever a search problem decomposes into independent “dimensions of correction” — bits, facts, tiles, variables, constraints, parameters — the Manhattan metric gives the real lower bound on the minimum number of needed operations.
Thus:
The geometry of AI search is often Manhattan geometry,
because solution spaces are built from independent local moves,
not continuous straight-line motion.
It is the same mathematics whether you’re routing a taxi through Manhattan or routing your way to a solved CSP, to a goal state in a planning domain, or through layers of a proof search.
Below is a clear, systematic explanation of Manhattan (L1), Euclidean (L2), and Chebyshev (L∞) distance as geometries of abstract search spaces rather than physical geometry.
This will show:
What each metric really measures in terms of problem structure. Which types of search operators each metric corresponds to. Why different search heuristics implicitly assume one metric or another. Examples from AI domains (puzzles, CSPs, planning, graphs, optimization, ML).
1. The Three Metrics: The Deep Interpretation
Any state in an abstract search can be represented as a vector of decisions, features, or variables:
s = (s_1, s_2, \dots, s_n).
Let the goal state be g = (g_1, \dots, g_n).
The distance between them depends on how your search operators work.
1.1 Manhattan distance (L1)
d_1(s,g) = \sum_{i=1}^n |s_i - g_i|
This measures:
How many independent coordinate-wise edits must be performed to reach the goal,
assuming each edit modifies exactly one dimension and costs 1.
This is the geometry of:
bit-flip searches, local optimization with coordinate descent, sliding tile puzzles, edit operations where only one symbol changes per move, incremental assignment problems, CSP solvers where each variable change is atomic.
1.2 Euclidean distance (L2)
d_2(s,g) = \sqrt{\sum_{i=1}^n (s_i - g_i)^2}
This measures:
How far your state is “as the crow flies” if you could change all dimensions simultaneously and proportionally.
This only makes sense if your operators allow smooth, simultaneous multi-dimensional change, such as:
continuous optimization (gradient descent), control problems where movement is vector-valued, function minimization in differentiable landscapes, reinforcement-learning state embeddings in continuous spaces.
Euclidean distance assumes you can “move diagonally” in state space.
This is not true in most discrete combinatorial problems.
1.3 Chebyshev distance (L∞)
d_\infty(s,g) = \max_i |s_i - g_i|
This measures:
The minimum number of steps needed when each step can fix any number of dimensions at once, but by only 1 unit per step.
This is the geometry where:
each operator can simultaneously modify all coordinates by up to ±1, moves are “king moves” (like a chess king: can move diagonally for free), improvement is limited by the slowest-changing coordinate.
This is common in:
distributed updates, synchronous transitions, neural network batch updates (approximate), game AI with multi-action moves, high-level planning abstraction where many subgoals advance in parallel.
2. The Core View:
The metric that applies is the one consistent with your successor function.
The successor function defines what moves are legal in your search.
If your system allows:
| Type of operator | Geometry | Metric |
|---|---|---|
| One dimension changes per step | orthogonal grid | Manhattan (L1) |
| All dimensions change proportionally in one move | diagonals allowed | Euclidean (L2) |
| All dimensions can change (small amounts) in parallel | “max step” geometry | Chebyshev (L∞) |
This is the key to understanding search heuristics.
3. Visualizing in Abstract Search Space (2D examples)
Let s = (x,y) and goal g = (0,0).
3.1 Manhattan (L1)
Operators: move horizontally or vertically; one axis at a time.
Contours of equal distance are diamonds: /\ < > \/
Distance = |x| + |y|.
3.2 Euclidean (L2)
Operators: directly move toward goal proportionally in all axes.
Contours are circles.
Distance = sqrt(x² + y²).
3.3 Chebyshev (L∞)
Operators: can change x and y simultaneously, up to ±1 per step.
Contours are axis-aligned squares.
Distance = max(|x|, |y|).
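The three metrics side by side, as plain functions over coordinate vectors (illustrative names), together with the general ordering d_∞ ≤ d_2 ≤ d_1:

```python
def d1(s, g):    # Manhattan: one coordinate edit per unit of cost
    return sum(abs(a - b) for a, b in zip(s, g))

def d2(s, g):    # Euclidean: simultaneous proportional movement
    return sum((a - b) ** 2 for a, b in zip(s, g)) ** 0.5

def dinf(s, g):  # Chebyshev: all coordinates may advance one unit per step
    return max(abs(a - b) for a, b in zip(s, g))

s, g = (3, 4), (0, 0)
assert d1(s, g) == 7      # 3 + 4
assert d2(s, g) == 5.0    # sqrt(9 + 16)
assert dinf(s, g) == 4    # max(3, 4)
# For any pair of points: d_inf <= d_2 <= d_1.
assert dinf(s, g) <= d2(s, g) <= d1(s, g)
```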
4. Consequences for Abstract Search
4.1 Choosing the wrong metric breaks heuristics
For A*, hill-climbing, or beam search:
Example 1: Sliding tile puzzles (8-puzzle, 15-puzzle)
Operators: each move shifts one tile’s position by one step horizontally or vertically.
Thus:
L1 is correct → sum of Manhattan distances of all tiles is admissible. L2 is too optimistic → thinks tiles can move diagonally (illegal). L∞ is even worse → assumes moving tile horizontally & vertically in one step (illegal).
Using Euclidean or Chebyshev distances here yields inadmissible heuristics.
Example 2: Continuous optimization (N-dimensional spaces)
Operators: gradient descent updates all coordinates simultaneously in real values.
Thus:
L2 matches the operator behavior. L1 is overly pessimistic (thinks updates occur 1 axis at a time). L∞ may approximate parallel updates but loses geometric meaning.
This is why Euclidean norms dominate in ML optimization.
Example 3: Constraint satisfaction with incremental assignment
Operators: one variable assigned at a time.
Thus:
L1 correctly counts minimum necessary assignments. L2/L∞ are not meaningful because variables don’t vary continuously or in parallel.
Example 4: Chess or gridworlds with diagonal moves allowed
Operators: can move up to ±1 on both x and y in one step.
Thus:
L∞ matches the movement semantics. L1 exaggerates cost (thinks diagonals are two moves). L2 is a close approximation but still not exact in discrete spaces.
5. The Metrics as Models of What Type of Work the Search Must Do
Reinterpretation in terms of work:
Manhattan (L1)
Work is additive: tasks must be completed one dimension at a time. No synergy: improving one dimension doesn’t help another. Most combinatorial search falls here.
Euclidean (L2)
Work is vectorial: improvement in multiple dimensions at once is possible. Tasks may benefit from forming “diagonal” moves. Continuous control, ML optimization, and robotics fall here.
Chebyshev (L∞)
Work is parallel-limited: all dimensions can improve simultaneously, but time is bounded by the slowest. Multi-agent and batch-update systems often approximate this.
6. Abstract Interpretation Cheat Sheet
| Metric | Meaning in abstract search | When it applies | Heuristic use |
|---|---|---|---|
| Manhattan (L1) | independent atomic differences that must each be corrected | discrete, sequential edits | CSPs, 8-puzzle, symbolic reasoning |
| Euclidean (L2) | simultaneous proportional adjustments | continuous, differentiable optimization | gradient descent, control, embeddings |
| Chebyshev (L∞) | parallel small adjustments; progress limited by the “slowest” dimension | synchronous updates, some game AI | multi-effect operators, gridworlds w/ diagonals |
7. Final Summary
Abstract search inherits its geometry from its operator structure.
If each move adjusts one component → Manhattan geometry. If each move adjusts all components proportionally → Euclidean geometry. If each move adjusts all components by ≤1 in parallel → Chebyshev geometry.
Thus these metrics:
define lower bounds in heuristic search, shape admissible heuristics for A* and IDA*, guide design of representations, determine which ML models and optimizers work effectively, reflect the “physics” of state-space transitions even when no physical space exists.
