My immediate thought was to generate training data from made-up solved puzzles: start from a solution and remove numbers one at a time, checking after each removal that the solution is still unique, and branch a few different ways at each step.
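To make the idea concrete, here is a rough sketch of that removal loop. It assumes the puzzle is a standard 9x9 Sudoku, which the comment does not actually state, and the names `random_solution`, `count_solutions` and `dig` are made up for illustration. The solved grid comes from a fixed valid pattern with the digits relabelled, which is much less varied than a real generator would want.

```python
import random

def random_solution(rng):
    """A valid solved grid: fixed shifted-row pattern with digits relabelled (assumption)."""
    digits = list(range(1, 10))
    rng.shuffle(digits)
    return [digits[(3 * (r % 3) + r // 3 + c) % 9]
            for r in range(9) for c in range(9)]

def candidates(grid, i):
    """Digits that can go in empty cell i without clashing in its row, column or box."""
    r, c = divmod(i, 9)
    used = set()
    for k in range(9):
        used.add(grid[r * 9 + k])
        used.add(grid[k * 9 + c])
    br, bc = 3 * (r // 3), 3 * (c // 3)
    for dr in range(3):
        for dc in range(3):
            used.add(grid[(br + dr) * 9 + bc + dc])
    return [d for d in range(1, 10) if d not in used]

def count_solutions(grid, limit=2):
    """Count completions of grid (0 = empty) by backtracking, stopping at `limit`."""
    best = None
    for i in range(81):
        if grid[i] == 0:
            cands = candidates(grid, i)
            if not cands:
                return 0  # dead end: some empty cell has no legal digit
            if best is None or len(cands) < len(best[1]):
                best = (i, cands)
    if best is None:
        return 1  # no empty cells left, so this is one complete solution
    i, cands = best
    total = 0
    for d in cands:
        grid[i] = d
        total += count_solutions(grid, limit - total)
        grid[i] = 0
        if total >= limit:
            break
    return total

def dig(solution, target_clues, rng):
    """Remove numbers one at a time, keeping a removal only if the solution stays unique."""
    puzzle = list(solution)
    order = list(range(81))
    rng.shuffle(order)  # a different shuffle gives a different branch
    clues = 81
    for i in order:
        if clues <= target_clues:
            break
        saved, puzzle[i] = puzzle[i], 0
        if count_solutions(puzzle) == 1:
            clues -= 1
        else:
            puzzle[i] = saved  # removal made the puzzle ambiguous, so undo it
    return puzzle

if __name__ == "__main__":
    rng = random.Random(0)
    solution = random_solution(rng)
    # Several branches from the same solution, each a (puzzle, solution) training pair.
    pairs = [(dig(solution, 30, rng), solution) for _ in range(3)]
    for puzzle, _ in pairs:
        print("".join(str(d) for d in puzzle))
```

Calling `dig` again on the same solution with a different shuffle is the cheap way to get the "few different branches", and recording the grid after every accepted removal would give the intermediate puzzles as well.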
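You could generate massive datasets that way. It would be pretty easy to generate a few billion pre-images of a single solution: a solution has roughly 80 numbers and a puzzle about 20, and 80 choose 20 is about 3.5 × 10^18, or a few billion billion. Only a fraction of those subsets will still have a unique solution, but the pool is huge either way.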
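The count is easy to check. This snippet just takes the round figures from the comment (80 numbers in a solution, 20 left in a puzzle) and evaluates the binomial coefficient with Python's `math.comb`. It counts every way of choosing which 20 numbers to keep and does not check uniqueness, so it is an upper bound on usable puzzles.

```python
import math

# Round figures from the comment: ~80 numbers in a solution, ~20 kept as clues.
cells, clues = 80, 20

subsets = math.comb(cells, clues)  # ways to choose which numbers stay
print(f"{cells} choose {clues} = {subsets:.3e}")
```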
I imagine removing numbers from solved puzzles strategically, so that the examples you keep reinforce the neural connections needed for solving and filter out possible 'noise'.