But that's... exactly what I described and asked how to avoid...

saltcured · on Feb 2, 2023

I had trouble parsing your earliest comment, so I only tried to address the incremental backup concern. I may not have understood the conversation, but it seemed like you claimed that a filesystem-level backup of a clone would not produce incremental backup IO in practice.

A periodic fetch into a persistent cloned repo will be incremental unless the upstream is doing something crazy with frequent branch deletions and repacks. In practice, most upstream repos I encounter behave relatively monotonically. They accumulate new commits and branch/tag heads but do not often create garbage or need repacking.

A periodic backup of the cloned repo will also be incremental if using an appropriate tool like restic or rclone-copy. Also, since the clone only changes during the fetch, you can serialize these in one periodic job and be confident that you are making a consistent snapshot of the repo.
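
As a rough sketch of that serialized job (not taken from any real deployment): the Python below fetches into the persistent clone, then runs restic over the same directory. The function names are made up for illustration, and it assumes `git` and `restic` are on PATH with restic credentials supplied through the environment.

```python
import subprocess
from typing import List


def job_commands(clone_dir: str, restic_repo: str) -> List[List[str]]:
    """Return the commands for one backup cycle, in the order they must run."""
    return [
        # Step 1: incremental fetch into the existing clone.
        ["git", "-C", clone_dir, "fetch", "--all", "--tags"],
        # Step 2: snapshot the clone; restic only uploads data it has not stored yet.
        ["restic", "-r", restic_repo, "backup", clone_dir],
    ]


def run_job(clone_dir: str, restic_repo: str) -> None:
    """Run fetch, then backup, stopping at the first failure."""
    for cmd in job_commands(clone_dir, restic_repo):
        # check=True raises CalledProcessError, so a failed fetch
        # is never followed by a backup of a half-updated clone.
        subprocess.run(cmd, check=True)
```

Something like `run_job("/srv/mirrors/project.git", "s3:s3.amazonaws.com/example-bucket/project")` (both values hypothetical) could then be called from cron or a systemd timer. Because the two steps run back to back in one process, the clone cannot change while the snapshot is taken.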

The advantage of this approach is its simplicity. It is easy to reason about and easy to work with the backups to restore a repo without having to learn about other tools. It's the kind of thing I could feel comfortable setting up and running for years on end with little supervision.

A more sophisticated approach that integrates with git hooks, e.g. to do event-driven rather than periodic backup, is plausible, but I think it could quickly get in its own way. And if working with a hosted upstream, you would need to integrate with their proprietary hooks, e.g. GitHub Actions, and deal with other restrictions of the hosting environment. Such a solution likely brings new failure modes and may not be a worthwhile tradeoff...

remram · on Feb 3, 2023

Again, this requires you to have a persistent clone on a filesystem. I specifically wonder if we can do (and I quote) "direct incremental git-to-S3 backups", and you keep replying "it's easy, do it indirectly with a persistent cloned repo".

I don't understand where you are stuck, tbh.

yencabulator has provided a good tip, I think: you could store the previous set of refs and use that to build an incremental Git bundle (one with only the objects that were not in the previous bundle). I don't know if you can do that with the existing Git client though.
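
For what it's worth, `git bundle create` takes the same revision arguments as `git rev-list`, so the stock client can exclude everything reachable from previously saved tips. A minimal sketch, assuming the saved state is the text output of `git show-ref` from the last run and that those old commits are still present in whatever repo the command runs in; `parse_refs` and `bundle_args` are hypothetical helper names, not part of Git.

```python
from typing import Dict, List


def parse_refs(show_ref_output: str) -> Dict[str, str]:
    """Parse `git show-ref` text ("<sha> <refname>" per line) into {refname: sha}."""
    refs: Dict[str, str] = {}
    for line in show_ref_output.splitlines():
        if not line.strip():
            continue  # skip blank lines
        sha, name = line.split(maxsplit=1)
        refs[name] = sha
    return refs


def bundle_args(bundle_path: str, previous: Dict[str, str]) -> List[str]:
    """Build a `git bundle create` command that skips objects reachable from old tips."""
    args = ["git", "bundle", "create", bundle_path, "--all"]
    old_tips = sorted(set(previous.values()))
    if old_tips:
        # --not negates the revisions that follow, like writing ^<sha> for each.
        args.append("--not")
        args.extend(old_tips)
    return args
```

With an empty `previous` this is just a full bundle. Note that Git refuses to create an empty bundle, so a run with no new commits will exit with an error that the caller has to handle, and as far as I know refs whose tips have not moved are left out of the incremental bundle. This still needs the objects locally, so it doesn't by itself answer the "no persistent clone" part.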