> such that the distance between two points that are linearly the same distance apart in a sentence would result in two very different positional encodings relative to one another, just arbitrarily dependent on where they were in the curve. I'm sure this is somewhat counteracted by the fact that you have lots of these sine waves offset and at different frequencies,

Relative distances in a sentence are actually maintained well by the sine encoding - better than they would be by a triangle wave.

Think of encoding position as two waves, sin & cos. The pair as a vector has constant magnitude (Pythagoras), so its size is independent of position. The difference between the sin & cos vectors of two positions has a magnitude that depends only on how far apart the positions are, so relative position's size is independent of absolute position too.
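A quick way to check this numerically, as a minimal sketch: `encode`, the single frequency `0.1` and the offset of 5 are illustrative assumptions, not anything from a real model.

```python
import math

def encode(pos, freq=0.1):
    """Hypothetical 2-D positional encoding: one (sin, cos) pair at one frequency."""
    return (math.sin(freq * pos), math.cos(freq * pos))

def diff_norm(p, q, freq=0.1):
    """Length of the difference vector between the encodings of positions p and q."""
    a, b = encode(p, freq), encode(q, freq)
    return math.hypot(a[0] - b[0], a[1] - b[1])

# The encoding has magnitude 1 at every position (sin^2 + cos^2 = 1).
mags = [math.hypot(*encode(p)) for p in range(0, 200, 7)]

# Two positions 5 apart give the same difference length wherever the pair sits.
# Analytically that length is 2 * |sin(freq * offset / 2)|.
gaps = [diff_norm(p, p + 5) for p in (0, 13, 57, 140)]
expected_gap = 2 * abs(math.sin(0.1 * 5 / 2))

print(mags[:3])
print(gaps, expected_gap)
```

Every entry of `gaps` should agree with `expected_gap` up to floating-point error, regardless of the starting position.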

Position is encoded in the rotation of that pair around a circle. When embedded into high-dimensional model space by a linear map, that circle becomes an ellipse in some 2D plane whose orientation depends on the map. Other features, i.e. the token values, translate the centre of that ellipse but not its orientation. So the model can control how much weight to give position, independent of position but dependent on token: it translates the embedding vector to move the centre of the position-encoding ellipse close to the origin in model space (by a translation determined by token), then gives greater or lesser weight to the subspace which spans that ellipse, i.e. the 2D plane in which it lies. At the same time the model can map the ellipse to a circle in a standard orientation. Particular patterns of relative positions of tokens within a sentence are then rotationally invariant in that subspace of the mapped model space, and slight variations in position map to nearby points there. As with token-dependent model-space translation, token- or position-dependent model-space rotation allows relative position patterns to be treated approximately independently of absolute position.
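Here is a toy illustration of the "token translates the ellipse, the map orients it" picture. Everything in it is made up for the sketch: the 3-D "model space", the matrix `A`, the `TOKENS` table, and the assumption that token and position contributions simply add.

```python
import math

# Hypothetical linear map from the 2-D (sin, cos) pair into a 3-D "model space".
A = [[2.0, 0.5],
     [0.0, 1.0],
     [1.0, -1.0]]

# Hypothetical token embeddings; each acts as a translation in model space.
TOKENS = {"cat": [3.0, -1.0, 0.5], "sat": [-2.0, 4.0, 1.0]}

def position_part(pos, freq=0.1):
    """Image of the (sin, cos) pair under A: a point on an ellipse through the origin's plane."""
    s, c = math.sin(freq * pos), math.cos(freq * pos)
    return [A[i][0] * s + A[i][1] * c for i in range(3)]

def embed(token, pos, freq=0.1):
    """Token translation plus position ellipse."""
    p = position_part(pos, freq)
    return [p[i] + TOKENS[token][i] for i in range(3)]

def centre(token, n=360):
    """Average the embedding over one full turn of the circle."""
    freq = 2 * math.pi / n
    pts = [embed(token, p, freq) for p in range(n)]
    return [sum(pt[i] for pt in pts) / n for i in range(3)]

# The centre of each token's ellipse is just that token's embedding.
print(centre("cat"), TOKENS["cat"])

# Subtracting the centre leaves the same ellipse point for every token.
cat_offset = [embed("cat", 7)[i] - TOKENS["cat"][i] for i in range(3)]
sat_offset = [embed("sat", 7)[i] - TOKENS["sat"][i] for i in range(3)]
print(cat_offset, sat_offset)
```

Changing the token moves the ellipse around without changing its shape or the plane it lies in; only `A` determines those.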

Add more sine waves to the mix and you get higher-dimensional ellipsoids and subspaces, but the same principles apply. Combining information from more tokens, though, lets the model produce values that depend on patterns of relative positions among more than two tokens.

Crucially, all the maps just mentioned are affine maps: linear maps using a matrix plus a bias. So the methods of linear algebra learned at high school (i.e. matrix and vector operations) apply, and maps easily combine multiple operations into one by matrix multiplication, just like in computer graphics. However, token-dependent and position-dependent (absolute or relative) selection of which maps to apply, or actually weighted combinations of selections, is not linear. That's where the neural network non-linearities come in, and why multiple model layers are needed: maps have to be selected in one layer and then applied in the next.
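The computer graphics analogy can be made concrete with homogeneous coordinates, where a translation, a scaling and a rotation fold into a single matrix. This is only a 2-D sketch with made-up numbers; the helper names are mine and the real model's maps are much higher-dimensional.

```python
import math

def matmul(a, b):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def apply(m, point):
    """Apply a 3x3 homogeneous matrix to a 2-D point."""
    v = (point[0], point[1], 1.0)
    return tuple(sum(m[i][k] * v[k] for k in range(3)) for i in range(2))

def translation(dx, dy):
    return [[1.0, 0.0, dx], [0.0, 1.0, dy], [0.0, 0.0, 1.0]]

def scaling(sx, sy):
    return [[sx, 0.0, 0.0], [0.0, sy, 0.0], [0.0, 0.0, 1.0]]

def rotation(theta):
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

# Translate (recentre), then scale (ellipse to circle), then rotate
# (standard orientation): three affine maps folded into one matrix.
T, S, R = translation(-3.0, 1.0), scaling(0.5, 2.0), rotation(0.3)
combined = matmul(R, matmul(S, T))

point = (4.0, 2.5)
step_by_step = apply(R, apply(S, apply(T, point)))
all_at_once = apply(combined, point)
print(step_by_step, all_at_once)
```

Both routes land on the same point, which is why a chain of affine maps costs no more than one. What this sketch cannot do is choose `T`, `S` or `R` based on the input; that selection is the non-linear part.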

Triangle waves don't have the same constant vector magnitude and rotational invariance properties under these maps as combinations of sine waves do. The model network could learn to accommodate the shapes of triangle-wave-induced subspaces, but it would place greater load on the model network, because they are a less natural fit to combinations of affine maps. The greater load would probably result in more position-dependent artifacts and lower quality for a given model size. It's similar to the difference between convolutional networks for image recognition and old-school networks not using convolution.
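To see the difference in numbers, here is a rough comparison. The triangle "sin/cos" pair is my own construction for the sketch (a triangle wave plus a quarter-period-shifted copy), and the frequency `pi / 8` and offset of 2 are arbitrary choices.

```python
import math

def tri(x):
    """Triangle wave with amplitude 1 and period 2*pi, in phase with sin(x)."""
    return (2.0 / math.pi) * math.asin(math.sin(x))

def encode_tri(pos, freq=math.pi / 8):
    """Hypothetical triangle-wave analogue of the (sin, cos) pair."""
    phase = freq * pos
    return (tri(phase), tri(phase + math.pi / 2))

def encode_sin(pos, freq=math.pi / 8):
    phase = freq * pos
    return (math.sin(phase), math.cos(phase))

def gap(encode, p, q):
    """Euclidean distance between the encodings of positions p and q."""
    a, b = encode(p), encode(q)
    return math.hypot(a[0] - b[0], a[1] - b[1])

# Magnitude of the triangle pair changes with position (it traces a diamond).
tri_mags = [math.hypot(*encode_tri(p)) for p in range(16)]

# Distance between positions 2 apart, at every starting position in one period.
tri_gaps = [gap(encode_tri, p, p + 2) for p in range(16)]
sin_gaps = [gap(encode_sin, p, p + 2) for p in range(16)]

print(min(tri_mags), max(tri_mags))
print(min(tri_gaps), max(tri_gaps))
print(min(sin_gaps), max(sin_gaps))
```

The sine pair gives one gap value everywhere, while the triangle pair's gap shrinks whenever the two positions straddle a corner of the diamond, so the same relative offset looks different depending on absolute position.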