Obviously concerning that a single (perfectly valid) flight plan can take down both the primary and the backup. Why not reject the flight plan the system can't understand? You've got 4 hours for someone on front-line support to work out the correct path and enter it manually, and meanwhile it'd be good if the system continued to operate.
Further concerns about first- and second-line support being unable to find in the logs the cause, or even which flight plan was being processed, when the systems failed. They had to bring in the 3rd party developers to look at "lower-level" logs to find out what happened. If your monitoring/logging is so poor that the first responder can't work out at least what the system was doing when it failed, that's a significant problem.
Most of the system did continue to operate - but it couldn't accept new flight plans automatically; the flight plans were given 4 hours in advance, so they only put in the restrictions after a couple of hours.
Still, I agree it would have been better if it had continued - but yes, the most important bit is how long it took to find the bad plan.
Yep. From the report: At 0832 both systems failed and the controllers started to empty the four-hour buffer. At 1100, systems still weren't back and so to avoid the hard cutover that looked to be coming at 1230, they began the switch to manual mode. It took them until 1336 to restore the automatic systems, and until 1803 to fully switch out of manual mode.
Given those operational constraints, it sounds like the support teams basically have two hours to resolve a critical system failure.
I've literally chased this kind of bug down in our ATC simulation tool. People don't think Adaptation be like it is, but it do. Just wait till they hear of context-dependent fixes with the same name, i.e. airport, airway, fix, all named 'BUD'. This stuff was written when airplanes were bicycles with handmade motors and canvas-covered wings.
> Given that the system could not reconcile the error, the fail-safe software logic intervened to prevent the incorrect data being passed to air traffic controllers, and the FPRSA-R primary system – as designed – suspended its functioning and handed its tasks to a back-up system. But the back-up system applied the same logic to the flightplan, with the same result, and similarly suspended itself.
This seems to be the root cause. My reading of this is that the system initially validated the flight plan, but while processing the plan it hit something it didn’t understand. It should have treated this as a late validation error, rejected the plan and continued its work, but instead it treated it as an internal error and crashed. I guess it probably hit some kind of assertion.
It's explained in the article. Two waypoints had identical designators, which is an edge case not implemented in the software - not sure how this is a "late validation error" if the data is in fact valid.
The root cause to me seems to be that whatever they are using as designators (possibly strings, shudder) is not guaranteed to be unique.
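To illustrate why name-only identity bites here, a minimal sketch (Python; the designator `BUD` and all coordinates are hypothetical examples, not the waypoints from the incident) of what happens when a lookup table is keyed by designator alone versus by designator plus location:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Waypoint:
    name: str   # ICAO-style designator - NOT globally unique
    lat: float
    lon: float

# Two duplicates in the style the report describes: same designator,
# roughly 4000 nm apart.
wp_a = Waypoint("BUD", 47.4, 19.3)
wp_b = Waypoint("BUD", -33.9, 18.4)

# Keying a lookup table by name alone silently drops one of them:
by_name = {wp.name: wp for wp in (wp_a, wp_b)}
assert len(by_name) == 1  # collision - the first entry was overwritten

# Keying by the full (name, lat, lon) identity keeps both distinct:
by_identity = {(wp.name, wp.lat, wp.lon): wp for wp in (wp_a, wp_b)}
assert len(by_identity) == 2
```

The same collision happens with any structure - set, index, cache - whose key is the bare string designator.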
I believe when OP said 'treated this as a late validation error', they were acknowledging the inevitability of bugs and root cause-ing "why did a single failure to validate cause the entire system to fall over rather than reject the single flight plan in question." The words 'treated this as' are load bearing here.
But this suggests not (as one of those ALVINs is inside the UK):
> Although there has been work by ICAO and other bodies to eradicate non-unique waypoint names there are duplicates around the world. In order to avoid confusion latest standards state that such identical designators should be geographically widely spaced. In this specific event, both of the waypoints were located outside of the UK, one towards the beginning of the route and one towards the end; approximately 4000 nautical miles apart
> such identical designators should be geographically widely spaced
Yes, though I wouldn't normally regard "4,000nm" (mentioned in the article) as widely spaced in geographical terms. Unless they meant nautical miles (which should be abbreviated as M, NM, Nm or nmi according to en.wiktionary.org) rather than nanometres.
OK, perhaps that comment wasn't really worth making ...
On the other hand, perhaps next year we'll read about some other system (aviation or silicon design) collapsing because the software mixed up nanometres and nautical miles.
I think in "approximately 4000 nautical miles apart", as I quoted above, the "nautical miles" means ... nautical miles? IDK what else to say to make it clearer.
I don't know anything about ATC systems. Why should it be possible to continue when a single flight plan is wrong? Naïvely it sounds like having even a single plane in an unknown position would make safe automated control impossible. Is that wrong?
First, you don't understand the purpose of a flight plan. The purpose of the flight plan is safety.
You, the captain, are saying "My intention is to go from A to Z via K, J and X". You then "activate" your flight plan over the radio upon departure.
The purpose of the flightplan is so people on the ground broadly have an idea of where you are and when you'll arrive. The former is to assist with co-ordination between control zones. The latter is so that if the destination airport doesn't see you X hours after your ETA, then they can start searching along the filed route until they find the debris of your crashed aircraft.
Second, you need to take "automation" out of your thinking. Flight crew don't just punch the plan into the nav and sit there twiddling their thumbs.
Shit happens and the crew might need to deviate from the plan. It happens all day every day. There might be some horrible weather ahead that you want to route around. There might be something happening in the airspace around you that ground might want to route you around.
Therefore the flight plan merely describes your INTENTION.
> Why should it be possible to continue when a single flight plan is wrong?
To make the system resilient.
Note: when the error happened the flight was not yet in the air. The system in question received the flight plan 4 hours before departure time. If the system had flagged the flight plan as bad, they could have called the crew and told them they can't fly. Failing that, they could have refused them entry when they reached the airspace. That can happen at any time for any reason.
> Naïvely it sounds like having even a single plane in an unknown position would make safe automated control impossible.
Flight plans are not for knowing the position of the plane.
OK, that makes sense. And understood on route vs position. But wouldn't your design still require additional complexity in the system? It would need to keep track of all bad flight plans, and also keep a database of the status of those plans. The controllers would need to update the status to tell the system that they had phoned and cancelled the bad plan. That (small) additional complexity would cost money and add risk.
It sounds to me like the engineers made a design decision between "add handling mechanisms for valid-but-unexpected flight plans" and "ensure we can handle absolutely every valid flight plan"? If so, this is a rare case where I sympathise with the engineering team behind a major IT failure.
Yes, absolutely. And that is the tricky issue. Because the plan was filed with Eurocontrol's IFPS system, which handed it over to the UK's system.
Flight crew and dispatchers report that flight plans are regularly rejected, and then they need to file a new one. But it sounds like this rejection happens in a layer before the one which failed in this case.
So this is basically a system which is not meant to be able to reject a flight plan, since the plans it receives were already checked and validated.
"We won't need to reject a flight plan as the data we receive will always be perfect!" sounds like an approach to error handling that will lead to these sort of situations.
The issue should have been caught in one of the higher systems, and then the error should have also been handled in a more appropriate way.
> The system in question received the flight plan 4 hours before departure time.
This is not true. The plan is transferred 4 hours before it is due to enter UK airspace. Big difference. Flights can already be in the air when the plan is transferred.
Airspace Adaptation, or the catalog of different items in the old ATC system, is nightmarishly complex. It allows for insane FPs that are just not suited for straightforward computation. Canada and Mexico also share fix names with the US. From a math and physics perspective, the language of flight plans is degenerate and, I think, doesn't even hold group structure. It's the reason that humans are still required even though MIT's JEDI did a bang up job of conflict resolution, or keeping planes from giving each other hugs. on the nose. At 300 miles per hour.
Depends.. yes, why was the flight plan not rejected individually or earlier (and where is at least the proper logging which would have allowed finding the root cause and restarting quicker)? But the system is likely more complex and distributed, and I can also see that defensively programmed safety-critical systems can and should do that at a certain point, where certain input or output is just unexpected and wrong. Recovering in some way, or silently carrying on with possibly unexpected (even if not undefined) behaviour, is then not an option.
A defensively programmed safety critical system should reject (loudly) the offending flight plan (and its permission to fly in the airspace) rather than shut itself off and leave all the planes flying (currently and in the foreseeable future) to figure things out for themselves.
That said, yes, it is always an easy and apparent statement to make as an armchair outsider.. read the full report? ;)
(Hint again: something certainly could have been done much better in this system, but the limited component that caused all this maybe just had no other option at that point. Ever been in one of those many-component systems talking to each other across too many interfaces, with too many people involved, part of it safety-critical and also subject to endless requirements and certifications? Can recommend..)
All the more reason that a single flight plan that fails path validation shouldn't (by design) shut down the entire airspace of a nation for 24-48 hours.
> The real issue is that a single erroneous flight plan should not stop the full system.
But the flight plan wasn't erroneous - it was perfectly valid. It just happened to have a weird coincidence of two identically tagged airports spanning the UK. If someone had put an extra waypoint in between the two, nothing would have broken (AIUI).
> But the flight plan wasn't erroneous - it was perfectly valid.
Well. Then the real issue is that a single flight plan, erroneous or not, should not stop the full system. It should flag that flight plan for problem resolution and keep on working with the rest of the flight plans.
Would result in a slightly inconvenienced flight dispatcher, or worst case an inconvenienced flight, instead of a nationwide shutdown.
> Then the real issue is that a single flight plan, erroneous or not, should not stop the full system.
Sure but "flight data is safety critical information that is passed to ATCOs the system must be sure it is correct and could not do so in this case" is a reasonably valid stance in a super-critical system that must absolutely defer to safe operations in the case of error.
It's not "one item being out of order" though - it's "the system cannot verify this flight plan as being correct".
Stopping in the face of being unable to verify that safety critical information is correct is a perfectly good choice, especially when the good consequences are fairly light like "some people are a bit late with their flights" and the bad consequences may be 10s or 100s of people dying.
> consequences are fairly light like "some people are a bit late with their flights" and the bad consequences may be 10s or 100s of people dying
I think this might be the root of the misunderstanding between us. The consequences of an airplane not having a flight plan is not “10s or 100s of people dying”. The consequences of rejecting a flight plan are:
- if caught on the ground: the one flight is delayed. They don't take off until they manage to file a flight plan the system accepts.
- if caught at the airspace boundary when they arrive without a flight plan in the system, there are two options: if ATC want to be hard-ass they will be refused entry and have to land at an alternate airport. If ATC is accommodating they will be asked by every new controller "<flightnumber> state your intentions".
Either way the system is fine and dandy. This is not what keeps airplanes from colliding with each other or the ground.
Obviously the first option, catching the problem on the ground, is better because it is not increasing the workload of the ATC personnel.
Here is the thing, you are talking about “the system cannot verify this flight plan as being correct". And you are right! It cannot. So it shouldn’t try. It should give up. On that flight plan. I totally agree with that. But there were lots of other flight plans it could and should have verified.
Have you done web development? What would you think of a server where a single weird edge case request would stop the whole service for hours? Would you say it is well engineered?
> The consequences of an airplane not having a flight plan is not “10s or 100s of people dying”
The consequence of not being able to route a flight through reasonably congested airspace safely may well be "10s or 100s of people dying".
> when they arrive without a flight plan in the system
They had a flight plan but the system could not verify that it had correct data. i.e. the system believed it had internal corruption and should shut down.
> What would you think of a server where a single weird edge case request would stop the whole service for hours?
What would you think of a service where internal checks suggested corruption and it kept on serving potential corrupt data to customers and writing it to databases? I would consider that a failure in extremis.
And I would agree. So stop serving the potentially corrupted data to customers. Mark it as potentially corrupt and let an operator handle that data manually while the system keeps processing the rest of the non-corrupt data.
Is this such a hard concept to understand? Because it feels like it is not getting through.
We know. That's clear to everyone here. But clearly in their design of the system they hadn't considered it.
No doubt they will now fix the issue, but you can only aim to handle faults that you predict during development, plan for unexpected faults, and hope that something super unexpected doesn't cause cascading faults.
Are you sure? Have you also consulted with zimpenfish? From their comment it seems they disagree.
> hope that something super unexpected doesn't cause cascading faults
But that is the thing. Processing failing isn’t and shouldn’t be “unexpected”. Here the individual processing items are flight plans, but the concept is more general. They could be “requests” to a web server, they could be “tasks” in an async worker node, they could be “transactions” in a database, if you do any compute with a work item it can fail. You need to put a big “try-catch” around it and add a field in your database or queue or remote procedure call system to mark them as failed.
One work item failing shouldn't be "super unexpected", even if you can't name the exact conditions under which it might fail.
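The "big try-catch around one work item" pattern being described is a few lines in any language. A minimal sketch in Python (the plan format, the `validate` function, and the duplicate-designator check are all made up for illustration, not the real FPRSA-R logic):

```python
def process_plans(plans, validate):
    """Process each plan independently; one bad item never stops the batch."""
    accepted, rejected = [], []
    for plan in plans:
        try:
            validate(plan)            # may raise on any unforeseen condition
            accepted.append(plan)
        except Exception as exc:      # the "big try-catch" around one work item
            # Mark the single item as failed for manual resolution...
            rejected.append((plan, str(exc)))
            # ...and keep processing the rest instead of halting the system.
    return accepted, rejected

def validate(plan):
    # Hypothetical check standing in for whatever condition the real
    # system couldn't handle.
    if plan["entry"] == plan["exit"]:
        raise ValueError("duplicate entry/exit designator")

good = {"entry": "DIKAS", "exit": "BUZAD"}
bad = {"entry": "BUD", "exit": "BUD"}
accepted, rejected = process_plans([good, bad, good], validate)
assert len(accepted) == 2 and len(rejected) == 1
```

The rejected list plays the role of the "field in your database or queue" above: the failed item is parked for a human, and the loop carries on.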
> But clearly in their design of the system they hadn't considered it.
They might well have considered the possibility of duplicate tags in the same flight plan, but since identical tags are supposed to be geographically distinct, they probably didn't further consider a flight plan where the entry and exit points from the UK (which is a relatively small airspace, I think) are the same tag (since it sounds from the report that this was an extremely weird flight plan).
> If someone had put an extra waypoint in between the two, nothing would have broken (AIUI).
I don't see it said anywhere that the two same-named waypoints were consecutive, with no other points between them. Just that they were "approximately 4000 nautical miles apart". Or that being consecutive duplicates rather than just duplicates was the cause.
Since they don't name the waypoints, it is very vague.
As best I understand page 9 of [0], it sounds like the search for UK entry and exit points led to the A and B being found with no specified entry/exit/waypoints between them, and therefore "the software could not extract a valid UK portion of flight plan between these two points".
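A hedged sketch of how such a search could misfire if it matches by designator name alone (Python; the route, the waypoint names, and the first-occurrence search strategy are guesses for illustration, not the actual FPRSA-R algorithm):

```python
def uk_portion(route, entry_name, exit_name):
    """Extract the UK segment by scanning the route for entry/exit designators.

    Naive assumption: names are unique, so the first occurrence is the right one.
    """
    entry = route.index(entry_name)
    exit_ = route.index(exit_name)   # first occurrence - wrong one if duplicated
    if exit_ <= entry:
        # Exit apparently precedes entry, so no UK segment can be extracted.
        # In the real incident this condition was escalated as a critical
        # internal error rather than a per-plan rejection.
        raise RuntimeError("could not extract a valid UK portion")
    return route[entry:exit_ + 1]

# Duplicate designator scenario: "DUP" appears both early in the route
# (thousands of miles away) and late, after the UK exit.
route = ["AAA", "DUP", "ENTRY", "MID", "EXITX", "DUP", "ZZZ"]

uk_portion(route, "ENTRY", "EXITX")   # unique names: segment found fine
try:
    uk_portion(route, "ENTRY", "DUP")  # matches the EARLY duplicate instead
except RuntimeError as e:
    print(e)  # could not extract a valid UK portion
```

Under this (assumed) strategy, any plan whose exit designator also appears before the UK entry point produces the "exit before entry" condition, regardless of how many waypoints lie between the duplicates.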
> So there could be any number from zero up, of international waters waypoints between them?
For me, it read like "there was no route between Aa and Ab because there were no waypoints available between them". But yeah, without a more detailed breakdown, information, and/or a better understanding of the precise format of the files and procedures, it is going to be a bit of assumption and guesswork.
Entirely possible there was a unit test that confirmed the system would error out in this particular condition. This sounds more like a requirements issue.
From reading these comments I can see a huge number of people have never had to conduct risk analysis in developing a product.
Sure, the unexpected identical naming should not cause an issue, but when you're developing software like this you conduct multiple risk analyses and FMEAs to attempt to design a system that can handle faults that you can predict
You can have failovers, safe states, recovery procedures up to the eyeballs, but if a situation occurs that you had not predicted then you're in unknown territory.
Your system may handle the fault and fail safely, or it may cause an unexpected chain of events.
TLDR: You can plan for faults all you want, but you can never guarantee it's 100% bombproof
https://publicapps.caa.co.uk/docs/33/NERL%20Major%20Incident...