An introduction to supply chain attacks

Published on 2 May 2024 by Andrew Owen (5 minutes)

On March 28, 2024, Andres Freund discovered malicious code in the XZ Utils package that could have compromised the security of around half the servers on the internet. The attack was audacious in its scope, planning and timescale, leading many to speculate that it was conducted by a state agency. What’s really terrifying is that it was discovered by accident by a database developer; security researchers had failed to spot it.

“This might be the best executed supply chain attack we’ve seen described in the open, and it’s a nightmare scenario: malicious, competent, authorized upstream in a widely used library.”—Filippo Valsorda

So what is a supply chain attack? It’s when a trusted outside provider or partner is compromised for the purpose of accessing your systems and data. Attackers target both commercial and open-source software. There is also a concern that software originating from certain countries may contain government-mandated malicious code.

The XZ Utils attack targeted x86-64 Linux distributions derived from Red Hat and Debian (including Ubuntu). The malicious code was injected through the release tarballs rather than the public source repository, and the build script only planted the payload in x86-64 Linux builds, so 32-bit and non-Intel architectures were unaffected. Crucially, distributions that did not patch OpenSSH to link (indirectly, via libsystemd) against XZ Utils’ liblzma were also unaffected. In practice, only a handful of rolling-release and pre-release distributions shipped the compromised versions before it was spotted. The backdoor lay dormant until a connection presented material signed with the attacker’s private key, at which point it allowed remote code execution on the compromised system. For the curious, Ars Technica has a more detailed description.
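For a quick local check, the following Python sketch (an illustration rather than a detection tool; your distribution’s security advisories remain the authoritative source) asks the installed xz binary for its version and compares it against the two releases known to contain the backdoor, 5.6.0 and 5.6.1.

```python
import re
import subprocess

# XZ Utils releases known to contain the backdoor (CVE-2024-3094).
COMPROMISED_VERSIONS = {"5.6.0", "5.6.1"}

def installed_xz_version() -> str | None:
    """Return the version reported by `xz --version`, or None if xz is absent."""
    try:
        output = subprocess.run(
            ["xz", "--version"], capture_output=True, text=True, check=True
        ).stdout
    except (FileNotFoundError, subprocess.CalledProcessError):
        return None
    # Output looks like: "xz (XZ Utils) 5.6.1" followed by the liblzma version.
    match = re.search(r"\d+\.\d+\.\d+", output)
    return match.group(0) if match else None

if __name__ == "__main__":
    version = installed_xz_version()
    if version is None:
        print("xz not found; nothing to check.")
    elif version in COMPROMISED_VERSIONS:
        print(f"xz {version} is a known-compromised release: update immediately.")
    else:
        print(f"xz {version} is not one of the known-compromised releases.")
```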

But it got as far as it did because XZ had a single unpaid maintainer who was losing interest in the project. A coordinated social engineering campaign spanning years gained the maintainer’s trust, got a fake account added as a second maintainer, and eventually let that account effectively take over the project. After the exploit was added, more fake accounts were used to push for the compromised code to be included in the next release of various Linux distributions.

This should underline what security researchers have always known: people are the weakest link. As Thom Holwerda wrote: “While we have quite a few ways to prevent, discover, mitigate, and fix unintentional security holes, we seem to have pretty much nothing in place to prevent intentional backdoors placed by trusted maintainers. And this is a real problem.” His proposed solution is a foundation dedicated to giving maintainers help and advice with non-code issues, and possibly even financial and health assistance.

It’s now over a decade since the Heartbleed security bug in OpenSSL was discovered. The industry reaction was to set up the Core Infrastructure Initiative as part of the Linux Foundation to fund at-risk projects. In August 2020 it was replaced by the Open Source Security Foundation. But the focus is on security components like OpenSSH. Since 2018, we’ve seen supply chain attacks such as Event-stream, ASUS, SolarWinds and Mimecast. Had it been successful, XZ would have dwarfed them all.

During my time in the security industry, I saw a pivot from intrusion prevention to intrusion detection. If the XZ backdoor had gone undetected, monitoring systems would have been crucial in mitigating it. Fortinet, one provider of such software, recommends these preventative measures:

  1. Audit unapproved shadow IT infrastructure.
  2. Have an updated and effective software asset inventory in place.
  3. Assess vendors’ security posture.
  4. Treat validation of supplier risk as an ongoing process.
  5. Use client-side protection tools.
  6. Use endpoint detection and response (EDR) solutions.
  7. Deploy strong code integrity policies to allow only authorized apps to run.
  8. Maintain a highly secure build and update infrastructure.
  9. Build secure software updaters as part of the software development life cycle.
  10. Develop an incident response process.

To that, I would add: adopt a zero trust security model and set up redundant secure backups (and do regular test restores).
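To make items 2 and 7 above slightly more concrete, here is a minimal sketch of automated integrity checking against a pinned inventory. The file name and JSON layout are assumptions for illustration, not a standard tool; in practice, signed package repositories and hash-pinned lock files (for example pip’s --require-hashes mode) provide the same guarantee.

```python
import hashlib
import json
import sys
from pathlib import Path

# Hypothetical inventory file: maps artifact path -> expected SHA-256 digest.
INVENTORY_FILE = Path("artifact-inventory.json")

def sha256_of(path: Path) -> str:
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_inventory(inventory_file: Path) -> bool:
    """Return True only if every listed artifact exists and matches its pinned hash."""
    inventory = json.loads(inventory_file.read_text())
    ok = True
    for artifact, expected in inventory.items():
        path = Path(artifact)
        if not path.exists():
            print(f"MISSING  {artifact}")
            ok = False
        elif sha256_of(path) != expected:
            print(f"MISMATCH {artifact}")
            ok = False
        else:
            print(f"OK       {artifact}")
    return ok

if __name__ == "__main__":
    sys.exit(0 if verify_inventory(INVENTORY_FILE) else 1)
```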

Afterword

On July 19, 2024, CrowdStrike demonstrated the flaw in this approach. If a trusted third-party software security company pushes an update that takes down IT infrastructure and requires manual recovery, the result is the biggest IT outage in history. A 41 kilobyte file pushed to systems running Falcon Sensor on Windows affected airlines, airports, banks, broadcasters, government services, hospitals, hotels, manufacturing, software companies (including Microsoft) and stock markets.

When the very thing designed to protect a system ends up taking it down, the breach of trust can result in the demise of the company responsible, as was the case with airbag manufacturer Takata Corporation. If CrowdStrike survives, it will be because no-one died in this incident. Criminals and state agencies have taken note of this potential attack vector. It’s also worth remembering that China and Russia were largely unaffected, being far less reliant on CrowdStrike and Windows than most other countries. And, of course, phishing attacks attempting to exploit the situation appeared almost immediately in the aftermath.

In computer security, the acronym CIA stands for Confidentiality, Integrity and Availability. It’s the third part that is usually overlooked: if you can’t access your system, the other two don’t matter. This event will raise questions about the wisdom of running major software infrastructure on operating systems that are so vulnerable to being rendered inoperable. It was Windows this time, but Linux and macOS are not immune. There’s a reason air traffic control systems run on real-time operating systems instead.

Some observers have questioned the wisdom of deploying on a Friday, but when you’re dealing with zero-day exploits, you need to be able to deploy safely at any time, on any day of the week. Some commentators have sought to blame individuals, but ultimately this was a failure of management on the part of CrowdStrike and its customers. In CrowdStrike’s case, it appears to have been a QA process failure. In the customers’ case, it was a failure to add redundancy to mission-critical systems. If I’m remembering the news reports correctly, Brussels Airport was relatively unaffected (until the planes started to back up in the wrong place) because it had a redundant system in place that it could switch to when the main system failed.
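One way to reconcile “deploy at any time” with safety is a staged (canary) rollout, in which an update reaches a small, expendable ring of machines first and only spreads if that ring stays healthy. The sketch below is a toy illustration of that gate; the ring names, telemetry and timings are invented for the example and do not describe any vendor’s actual pipeline.

```python
import random
import time

# Hypothetical deployment rings, smallest and most expendable first.
RINGS = ["canary", "early-adopters", "broad", "everyone"]
SOAK_SECONDS = 5          # toy value; real soak times are hours or days
MAX_ERROR_RATE = 0.01     # halt if more than 1% of hosts in a ring report errors

def deploy_to(ring: str) -> None:
    """Placeholder for pushing the update to one ring of hosts."""
    print(f"Deploying update to ring: {ring}")

def observed_error_rate(ring: str) -> float:
    """Placeholder health signal; in reality this would query telemetry."""
    return random.uniform(0.0, 0.02)

def staged_rollout() -> bool:
    """Deploy ring by ring, halting (and leaving later rings untouched) on bad telemetry."""
    for ring in RINGS:
        deploy_to(ring)
        time.sleep(SOAK_SECONDS)  # let the ring soak before judging it
        rate = observed_error_rate(ring)
        if rate > MAX_ERROR_RATE:
            print(f"Error rate {rate:.2%} in {ring} exceeds threshold: halting rollout.")
            return False
    print("Rollout completed across all rings.")
    return True

if __name__ == "__main__":
    staged_rollout()
```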

The lesson is simple: don’t blindly trust your providers, and add one more item to the list above:

  11. Have a validated recovery or fail-over plan for your mission-critical systems (a minimal sketch follows).
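As a small sketch of what “validated” might mean in practice, the following watchdog polls a primary service and hands over to a standby after repeated failed health checks. The endpoints and thresholds are hypothetical; the point is that fail-over should be an automated, regularly exercised path rather than a document in a drawer.

```python
import time
import urllib.error
import urllib.request

# Hypothetical endpoints: a primary service and a warm standby.
PRIMARY = "https://primary.example.internal/healthz"
STANDBY = "https://standby.example.internal/healthz"
FAILURES_BEFORE_FAILOVER = 3

def healthy(url: str, timeout: float = 2.0) -> bool:
    """Treat any HTTP 200 response within the timeout as healthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, TimeoutError):
        return False

def failover() -> None:
    """Placeholder: repoint DNS / the load balancer to the standby and alert operations."""
    print(f"Primary unhealthy: switching traffic to {STANDBY}")

def watchdog(interval: float = 10.0) -> None:
    """Poll the primary and trigger failover after consecutive failed checks."""
    failures = 0
    while True:
        if healthy(PRIMARY):
            failures = 0
        else:
            failures += 1
            if failures >= FAILURES_BEFORE_FAILOVER:
                failover()
                break
        time.sleep(interval)

if __name__ == "__main__":
    watchdog()
```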

Image: Original by Claudio Schwarz.