Four international flights not yet airborne were held on the ground in New Zealand.
Airways said a latent software defect during the initiation of a sectorisation process caused a display anomaly.
Sectorisation involves splitting the airspace Airways manages from one overall sector into two sectors.
“Sectorisation occurs several times a day,” Airways said in a statement accompanying its report into the air traffic system disruption.
But an incoming message from a pilot during this “computationally complex” process prevented sectorisation from completing, the report said.
“The particular timing led to an invalid memory access and prevented the sectorisation process completing correctly,” the state-owned enterprise added.
“This issue then affected the state that aircraft appeared in the reserve software and required a restart to restore the correct data state.”
New Zealand’s oceanic airspace covers an area almost four times the size of Australia.
Much of that is over the South Pacific and Southern Ocean with minimal traffic, or just a few long-haul flights flying at constant altitude across the ocean.
The Tasman Sea can be busier, and ahead of the daily increase in transtasman traffic, controllers initiated a routine sectorisation process.
“In simple terms, when a sectorisation process occurs, some elements are removed and then recreated to align with the new sectors,” Airways said.
The agency said that during testing and operational use in the days before the outage, routine sectorisation had been performed many times without incident.
Screens went blank
One controller asked a colleague to return from their break early to open the Tasman sector.
“The process should take five to 10 seconds,” the Airways report said.
But on starting sectorisation, a controller’s data display went blank, as did their colleague’s.
That was regarded as “a complete workstation failure”, the report added.
“Without the data display, the controller has limited ability to interact with the system.”
The controllers had not experienced this before, so one sought expert advice from two technicians who were on site.
At roughly the same moment, the principal desk in Christchurch started flagging multiple unusual indicators.
Attempts were made to consolidate work on a standby workstation.
But that third workstation was not accepting sectorisation.
“The atmosphere was stressful, time-critical and noisy,” the report added.
“External attention from outside parties arose quickly.”
Each of the controllers on shift at the start of the event was towards the end of their shift, but all completed the shift in full, the report added.
Aircraft held
About 11 minutes after the initial display anomaly and following the move to the reserve platform, a decision was made to hold international departures in domestic airspace.
And 26 minutes after the initial display anomaly, operations resumed on a reserve controller workstation for aircraft operating within the Pacific sector.
Airways said 50 minutes into the disruption, the platform was restored.
And at 57 minutes, the Oceanic Control System (OCS) resumed full normal operations, including for the Tasman.
The agency said aircraft separation and safety were maintained throughout the incident.
“By the next morning, the cause hadn’t yet been identified,” the report added.
“As a mitigation, a rollback to the previous software version was undertaken while priority work to identify and rectify the cause continued.”
For the investigation, the agency said, it interviewed four controllers, all of whom had more than 20 years’ experience.
A duty manager, operations manager and five technicians were also interviewed.
The report said one controller at the time was unaware that several aircraft would shortly be required to divert.
Air NZ call
The report said Air New Zealand contacted the duty manager, asking if the OCS had an issue.
The duty manager confirmed it did, but said the OCS was moving to a reserve platform and would soon be operational.
Air NZ said it had aircraft airborne that could hold for only about 30 minutes.
The duty manager thought that would be sufficient because moves to the reserve platform were normally complete in about 10 minutes.
Control system
Airways said it bought the OCS from Canadian manufacturer CAE Inc in 2000.
After that, Airways assumed responsibility for the software maintenance and support.
The agency said the system had demonstrated 99.99% service availability in the 12 months before the incident.
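For a sense of scale, the availability figure can be converted into a downtime allowance. The sketch below is illustrative arithmetic only: it assumes a 365-day year and that availability is measured against total elapsed time, neither of which the report states, and the function name is made up for this example.

```python
# Assumption: a 365-day measurement period, counted in minutes.
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_budget_minutes(availability_pct, period_minutes=MINUTES_PER_YEAR):
    """Return the downtime, in minutes, implied by an availability percentage."""
    return period_minutes * (1 - availability_pct / 100)


budget = downtime_budget_minutes(99.99)
print(f"Downtime implied by 99.99% over a year: about {budget:.1f} minutes")
```

Under those assumptions, 99.99% corresponds to a little under 53 minutes of downtime a year, slightly less than the 57 minutes Airways said it took for full normal operations to resume.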
The software development cycle is tied to a global schedule of data changes known as Aeronautical Information Regulation and Control (AIRAC), with which Airways must comply.
Airways said the OCS had a dual-platform architecture with two physically independent systems, available at all times and capable of providing most system functions.
One of those is a reserve platform, intended as a temporary backup system and without as many functions as the main platform.
Tests
The Herald has been contacted by an industry insider who queried whether the software was tested properly.
Airways, in today’s report, said testing happened in layers.
“First, developers test in their own branches of code. Next, test builds are provided to the Auckland-based Oceanic System Specialists team for further checking.”
It said once a release baseline was built, it was deployed into a laboratory environment for integration testing.
“This environment mirrors operational platforms but cannot reproduce every dimension of the operational Oceanic system,” the report added.
‘Dangling pointer’
Airways said a technician found a long-latent timing defect known as a “dangling pointer” in the display process.
“The defect had been in the code base for around 20 years.”
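A dangling pointer is a reference to something that has since been removed. The sketch below is a simplified illustration of that class of defect, not Airways' code: every class, function and label in it is hypothetical. Python has no raw pointers, so a KeyError stands in for the invalid memory access. It models elements being removed and recreated during sectorisation while a handle created beforehand is used by an incoming message.

```python
class ElementStore:
    """Toy store of display elements (hypothetical, for illustration only)."""

    def __init__(self):
        self.slots = {}       # slot number -> element label
        self.generation = 0   # incremented each time elements are recreated

    def remove_all(self):
        # First half of the toy "sectorisation": old elements are removed.
        self.slots = {}

    def recreate(self, labels):
        # Second half: elements are recreated to match the new sectors.
        self.generation += 1
        self.slots = dict(enumerate(labels))


class Handle:
    """A remembered reference to one element, like a stored pointer."""

    def __init__(self, store, slot):
        self.store = store
        self.slot = slot
        self.generation = store.generation


def unsafe_lookup(handle):
    # No validity check: fails if the element was removed (the "dangling" case).
    return handle.store.slots[handle.slot]


def safe_lookup(handle):
    # Treat the handle as stale if elements were recreated or the slot is gone.
    store = handle.store
    if handle.generation != store.generation:
        return None
    return store.slots.get(handle.slot)


store = ElementStore()
store.recreate(["FLIGHT-A, single sector"])
handle = Handle(store, 0)

store.remove_all()  # sectorisation has started but not finished
try:
    unsafe_lookup(handle)  # a message arrives at exactly this moment
except KeyError:
    print("unsafe lookup failed: element no longer exists")
print("safe lookup returned:", safe_lookup(handle))
```

The failure depends on a message arriving in the short window between removal and recreation, which is one way a defect of this kind could stay hidden through years of routine use.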
Findings
Airways said it developed and applied a software patch by August 22 to address the data display application issue when sectorisation failed to complete.
“The issue occurred due to an incoming CPDLC [controller pilot data link communications] message from an aircraft, during a system sectorisation process.”
CPDLC is a means of communication between controllers and pilots.
Airways said it also applied a patch in August after the sectorisation process failed to complete because of a data display issue on the main platform.
It also pledged to identify areas for improvement in terms of information flow between controllers and technical staff.
“Important contextual information was not made available to the Christchurch-based duty manager from Auckland in a timely manner to enable them to effectively co-ordinate a response,” the report added.
It said that issue was reviewed.
Airways also said customers were not informed of the service disruption in a timely manner.
The agency said it would set up another external phone line to use for operational matters.
“We acknowledge our communication with airlines during the incident did not meet our expected standards,” Airways chief executive James Young said.
“We apologise for this and are addressing this to ensure we have a more robust process for communicating changes to our air traffic management with our customers directly,” he added.
“We also apologise to the passengers whose flights were delayed or cancelled as a result of the unforeseen technical issue.”
John Weekes is a business journalist covering aviation. He has previously covered consumer affairs, crime, politics and courts.