In Part 1 of this blog series, we highlighted the benefits of CrowdStrike’s investigative approach and the CrowdStrike Falcon® Real Time Response capabilities for avoiding a significant incident in the first place, and minimizing the damage should an attacker gain entry into your environment. We also explored a range of governance and process-oriented steps that are often left out of technology-centric discussions on incident response preparedness.
In this post, we cover containment and recovery considerations in the context of large-scale enterprise remediation efforts. MOXFIVE’s lessons learned in this area are based on responding to over 700 incidents across a range of industry sectors and company sizes.
Contain
Define a Strategy
Before beginning containment, the incident response team should intentionally define a strategy based on factors such as the known impact to the environment, the threat actor’s likely motivation, and experts’ experience with this or similar threat actors in other environments.
If it’s likely that the attacker desires to maintain a long-term presence in the environment, and the scope of the attacker’s access is not well understood, responders should avoid piecemeal containment actions. To avoid a lengthy “whack-a-mole” game in those situations, it is crucial to first understand all of the mechanisms by which the attacker maintains persistence and execute a package of containment actions simultaneously.
Containment Techniques
For many of today’s attacks, particularly those with obvious destructive impacts, it’s wise to immediately begin containment and recovery activities in parallel. Simply stated, the goal of containment is to deny the attacker access to the environment. While there is a variety of means to accomplish this, some of the most important activities to consider are listed below.
Network
- Block IP addresses known to be used by the attacker at perimeter firewalls. Block resolution of domain names known to be used by the attacker.
- Scan the internet perimeter and ensure that no unexpected systems or services are accessible from the internet (a minimal verification sketch appears after this list).
- Disable inbound Remote Desktop Protocol (RDP) traffic at perimeter firewalls.
- Limit traffic traversing the WAN between sites. This can be a powerful containment step in many cases, but the benefit must be balanced with the negative effect on immediate investigation and recovery activities. With WAN connectivity limited, it may not be possible to deploy tooling such as the CrowdStrike Falcon® platform without time-consuming workarounds.
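As a concrete illustration of the perimeter check mentioned above, the sketch below assumes you maintain a short list of your organization's public IP addresses and the services you expect on each, and simply flags any common ports that unexpectedly answer. It is not a substitute for a full external scan, and the addresses and expected ports shown are placeholders.

```python
# Minimal perimeter exposure check (illustrative only). Assumes you maintain
# a list of your public IPs and the ports you expect to be open on each.
import socket

EXPECTED = {
    "203.0.113.10": {443},        # placeholder: public web front end
    "203.0.113.20": {25, 443},    # placeholder: mail gateway
}
COMMON_PORTS = [21, 22, 23, 25, 80, 443, 445, 3389]  # includes RDP (3389)

def open_ports(ip, ports, timeout=2.0):
    found = []
    for port in ports:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.settimeout(timeout)
            if s.connect_ex((ip, port)) == 0:
                found.append(port)
    return found

for ip, expected in EXPECTED.items():
    unexpected = set(open_ports(ip, COMMON_PORTS)) - expected
    if unexpected:
        print(f"{ip}: unexpected open ports {sorted(unexpected)}")
```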
Endpoint
- Block known attacker tools from executing on servers and end-user systems.
- Increase endpoint detection aggressiveness to block unknown, but potentially malicious, tools from executing.
- Remove malware from systems. The ideal way to accomplish this is to leverage real-time response capabilities that can remotely perform actions directly on the compromised endpoints. If tooling isn’t already deployed, removal can be accomplished manually.
Identity and Access Management
- Reset passwords for accounts being used by the attacker. This can be done quickly.
- Implement multifactor authentication (MFA) for email and VPN access.
- Implement Microsoft’s Local Administrator Password Solution (LAPS) to randomize each computer’s (servers and end-user systems) built-in local administrator account passwords. While not a full-fledged privileged access management (PAM) technology, it can be implemented relatively quickly as an interim measure.
- Reset passwords for all Active Directory accounts, including privileged users, regular users, service accounts and the Kerberos-related accounts. This requires planning; expedite that planning, but do not skip it, or the resets themselves can exacerbate operational disruptions.
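Sequencing those resets is largely a planning exercise. Below is a minimal sketch, assuming you can export your Active Directory account list to a CSV, of grouping accounts into reset waves; the file name and column names are placeholders for whatever your own export produces, and the krbtgt handling reflects the common guidance to reset that account twice with replication in between.

```python
# Sketch: stage Active Directory password resets into waves from an exported
# account list. The file name and columns (sam_account_name, is_privileged,
# is_service_account) are assumptions about your own export format.
import csv
from collections import defaultdict

waves = defaultdict(list)
with open("ad_accounts.csv", newline="") as f:
    for row in csv.DictReader(f):
        name = row["sam_account_name"]
        if name.lower() == "krbtgt":
            # krbtgt is typically reset twice, allowing replication in between
            waves["wave_0_krbtgt"].append(name)
        elif row["is_privileged"].lower() == "true":
            waves["wave_1_privileged"].append(name)
        elif row["is_service_account"].lower() == "true":
            # service accounts need coordination with application owners
            waves["wave_2_service"].append(name)
        else:
            waves["wave_3_users"].append(name)

for wave, accounts in sorted(waves.items()):
    print(f"{wave}: {len(accounts)} accounts")
```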
It’s important to strike the right balance between the sometimes competing goals of preserving evidence and containing the threat. The most effective way to do this is to establish close coordination between the investigation and containment teams. By working from a single source of truth related to system status, the investigation team can flag systems for immediate triage data collection and be included in decisions that may affect evidence. Keep in mind that the evidence referred to here is primarily going to be used by the investigation team to fully understand the intrusion and its impact — it’s rare that such evidence is used in criminal legal proceedings. If you’re preserving forensic images of every system that may have been accessed by an attacker, you’re likely going above and beyond what is truly needed.
Keep It Lean
When planning containment, focus on measures that are immediately disruptive to the attacker and are relatively low effort. Beyond those immediate containment measures, resist the urge to implement additional enhancements until business operations have substantially been restored.
Think of this as triaging an injured accident victim. One of the most immediate concerns is to remove them from the overturned vehicle and stabilize them for treatment. While they may benefit from additional treatment en route to the hospital, the key is getting them safely on their way.
Recover
The objective of recovery is to restore the business to a pre-incident level of function. Success directly correlates with how closely the IT teams align with their business colleagues and the degree to which business priority is infused into all recovery activities. The remainder of this section is written from the perspective of a typical ransomware scenario, where the victim organization’s systems are widely disrupted after being encrypted with ransomware.
A Bird's-Eye View of the Course
During lengthy, high-pressure projects, you may hear the adage “it’s a marathon, not a sprint” used to emphasize setting a pace that can be sustained over the long haul. While generally apt, that analogy does not precisely fit enterprise incident response, which does indeed require periods of “sprint” among the steady pace of the “marathon.” If you’ve never been through such an incident, the unknown factor is unsettling. Speaking of running analogies, if we are about to run a race, what does the course look like?
Recovery efforts follow a predictable pattern of activity, as illustrated in Figure 1, which is based on MOXFIVE’s experience assisting large enterprises to recover from ransomware attacks. This chart depicts days across the X axis — a typical large enterprise remediation effort lasts around 80 calendar days. The Y axis depicts the relative level of effort for the incident management and engineering functions, with 10 being maximum. (Note that the raw levels of effort will vary between environments and incidents, with engineering requiring many more resources in terms of person-hours than incident management.)
Note the following when reviewing the chart:
- The typical enterprise remediation has a heavy incident management component upon commencement, tapering off after around the first week. Accelerate this phase by having a defined and tested remediation playbook. Many organizations test other aspects of incident response, but few perform detailed tests of recovery for business-critical applications.
- It takes several days for engineering resources to ramp up to their maximum level of effort. Shifting this peak, which on average occurs around Day 10, to the left decreases the incident’s business impact. This can be done through having an up-to-date, business-prioritized asset inventory, defined recovery plans for the most critical functions and their supporting systems, and plans for surge engineering support.
- After approximately three weeks, the overall effort falls into more of a “marathon” and less of a “sprint,” which gradually tails off over approximately one and a half months.
Initial Recovery Plan Considerations
Now onward to the nuts and bolts of recovery. In a ransomware scenario, three key considerations when developing the initial recovery plan are:
- Availability of viable server backups. If backups are viable, the recovery effort can shift rapidly into restoring servers. This factor can also influence the organization’s approach to negotiating with the threat actor.
- Availability of a decryption tool. If a decryption tool is available, decryption efforts can begin. A combination of decryption and restoration can also be applied to speed recovery in some cases.
- Determining which servers and end-user systems are encrypted or corrupted. Understanding this enables planning, forecasting of when business functions will be restored, and prioritization of the most important restores or decryptions.
With the rubber about to meet the road, ensure that engineering teams operate from a shared understanding of the environment. As discussed in the planning section (Part One), a system tracker serves as the single source of truth and is critical to running an effective recovery. In the absence of a robust asset management system, a shared spreadsheet hosted in a collaboration environment will suffice. Give every server and end-user system its own row, and populate the initial version with your best asset management snapshot from before the incident. If there isn't an asset list, an Active Directory computers list can provide a rough substitute for Windows systems. The tracker should have at least the following fields:
- Hostname
- Location (for servers, name the virtual cluster or data center in which the server resides; for end-user systems, note if the system is in an office or remote)
- IP address (for servers)
- Operating system (can be derived from the EDR host list)
- Function (for servers, the application the server supports; for end-user systems, whether it is a workstation or a laptop)
- System status (with respect to recovery): not encrypted, encrypted, backup available, no backup, restored
- Backup location
- EDR installed status (based on a frequently updated pull of the host listing from the tool)
For optimal usage with a large team, put together a tutorial on how engineers and project managers should update it. (If using a spreadsheet, use picklists to normalize available choices for as many fields as possible to keep the data clean and usable.)
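One lightweight way to seed that tracker, sketched below under the assumption that you can export an Active Directory computers list and an EDR host listing to CSV, is to merge the two sources and flag where they disagree. The file names and column names are placeholders for whatever your own tools produce.

```python
# Sketch: seed the system tracker by merging an Active Directory computers
# export with an EDR host listing. File and column names are assumptions;
# adjust them to match what your tools actually export.
import pandas as pd

ad = pd.read_csv("ad_computers.csv")    # expects a "hostname" column
edr = pd.read_csv("edr_hosts.csv")      # expects "hostname" and "os" columns

tracker = ad.merge(edr, on="hostname", how="outer", indicator=True)
tracker["edr_installed"] = tracker["_merge"].isin(["both", "right_only"])
tracker["system_status"] = "unknown"    # picklist: not encrypted, encrypted,
                                        # backup available, no backup, restored
tracker = tracker.drop(columns=["_merge"])
tracker.to_csv("system_tracker.csv", index=False)

# Hosts seen by the EDR tool but missing from AD (or vice versa) are worth
# investigating: they often reveal gaps in the asset inventory.
```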
These major sub-workstreams typically comprise the core of the recovery effort:
- Stabilize core infrastructure such as Active Directory
- Restore core business applications such as email
- Restore encrypted servers from backup
- Decrypt encrypted servers
- Clean up or reimage affected end-user systems
Backup Restoration
When starting restoration from backups, have the most seasoned engineers develop detailed work instructions for the process, including updating the tracking document at key points so others will know each system's status.
Identify tasks that can be done in parallel so more engineers can be added to speed up progress where possible. This is not always possible, however. For example, if available bandwidth only supports pulling down two virtual system images from cloud backup simultaneously, adding 10 more engineers to this process will only add confusion.
Include phases in which a restored server is handed off to application teams for configuration and validation, since server engineers may not have application-specific knowledge. The final validation phase should include signoff by the business users. Consider breaking down the planning by functional group, tied to priority, and then further broken down by operating system, which may determine which teams are involved.
Server Decryption
For server decryption, create detailed work instructions as done for restorations. Wherever possible, plan to run the decryption tool on a copy of the image rather than on the encrypted system or files themselves. In some cases, the tool may encounter errors, so it's important to have the original available to make another copy and try again. Most organizations need additional storage capacity at this stage.
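A small sketch of that copy-and-verify step is below; the image paths are placeholders, and the hashing simply confirms the working copy matches the original before the decryption tool touches it.

```python
# Sketch: copy an encrypted disk image and verify the copy before running a
# decryption tool against it, so the original stays untouched if the tool
# fails. Paths are placeholders for your own evidence and scratch storage.
import hashlib
import shutil
from pathlib import Path

def sha256(path, chunk_size=1024 * 1024):
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

original = Path("/mnt/evidence/server01.vmdk")
working = Path("/mnt/scratch/server01.vmdk")

shutil.copy2(original, working)
assert sha256(original) == sha256(working), "copy does not match original"
print(f"Verified working copy at {working}; run the decryption tool against it.")
```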
End-User Systems
The approach for end-user systems will differ from servers because it is typically more efficient to reimage or replace encrypted workstations and laptops than to restore or decrypt them individually.
Systems that are functional but contain some remnants of the attack should have a next-generation cybersecurity platform like CrowdStrike Falcon installed and should be cleaned up; there's no need to reimage or replace them. Recover these systems remotely, using real-time response capabilities that can perform cleanup actions directly on the compromised systems. Systems that are rendered unusable by encryption will need to be reimaged. This process differs for users in offices versus those who are remote. For offices, a common approach is to set up an imaging server and then mass-reimage systems site by site. Remote users are more complicated, typically requiring IT-managed system swaps.
When considering purchasing replacement end-user system hardware, which can sometimes be attractive from the perspective of minimizing delays, be prepared that your cyber insurance policy may not cover these purchases.
Recognize and Avoid Road Bumps
Common pitfalls related to recovery include those below. We strongly recommend solving these challenges prior to the stressful days of an incident.
- Not having viable server backups. Without backups, organizations are incentivized to pay the ransom quickly. Many businesses choose to pay because it is the lesser of a set of bad options. Having viable backups gives you options. Protect at least one backup mechanism from being corrupted deliberately by an attacker that has administrative-level privileges in the environment.
- Poor understanding of IT assets. Effective asset management is a substantial topic in its own right. From an incident response perspective, it is crucially important to understand which servers and end-user systems comprise your environment. The primary reason is that you cannot protect assets you are not aware of and cannot manage. A secondary but still important reason is that reporting meaningful progress on foundational incident response activities, such as EDR deployment and systems recovered from backup, requires knowing the total counts. Your CEO and other stakeholders will want answers to the questions:
- “Are we protected?”
- “What percentage of the environment have we restored?”
Without an accurate system inventory, your responses to these questions will be a moving target. (A sketch of computing these metrics from the system tracker appears after this list.)
- Not enough temporary storage. Two main drivers of temporary storage needs are preserving forensic images and facilitating the decryption process. Investigative teams will want to preserve certain systems from an evidence perspective. In the case of virtualized systems, this approach can enable their investigation while also allowing for quickly getting the rebuilt or restored system into production.
- Untested restore process. Many organizations use cloud-based backup technologies, but often the restore processes have not been tested beyond one-off restorations. How many systems can simultaneously be restored? Are there bandwidth constraints, or constraints based on the user interface into the cloud backup solution (for example, can only two users access the interface at one time, meaning that adding ten engineers to this process would be pointless)? Conduct testing to know the answer, and include this information in your business continuity plans.
- Restoring unnecessary systems. One of the important things that a solid asset management process, and an updated system tracker, can tell you is which systems support which applications and business processes. Every environment has obsolete systems. Flag those so you don’t spend precious time and resources restoring them.
- Reimaging end-user systems unnecessarily. If users’ workstations and laptops are encrypted such that they’re unusable, they will need to be reimaged or replaced. If they are usable, however, they likely do not need to be reimaged. A combination of techniques using Falcon Real Time Response with the protection provided by the Falcon platform should be sufficient in most cases to have confidence. Unnecessary reimaging increases business interruption, increases response costs and can result in users unnecessarily losing data that was not backed up.
- Replacing hardware. There are cases in which purchasing new hardware can be a reasonable approach. Many organizations, however, can have a knee-jerk reaction to replacing hardware en masse, e.g., replacing all laptops. Unnecessarily replacing hardware increases response costs, and those costs may not be covered by insurance.
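As a rough illustration of the progress reporting mentioned above, the sketch below computes EDR coverage and recovery percentages from the system tracker. The column names match the earlier tracker sketch and are assumptions about your own file.

```python
# Sketch: compute headline progress metrics from the system tracker. Column
# names follow the tracker sketch above and are placeholders for your own file.
import pandas as pd

tracker = pd.read_csv("system_tracker.csv")
total = len(tracker)

edr_pct = 100 * tracker["edr_installed"].sum() / total
recovered = tracker["system_status"].isin(["not encrypted", "restored"])
recovered_pct = 100 * recovered.sum() / total

print(f"EDR coverage: {edr_pct:.1f}% of {total} systems")
print(f"Recovered or unaffected: {recovered_pct:.1f}% of {total} systems")
```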
Prepare for a Rainy Day
Many organizations execute a range of preparation activities that focus on the response components of incident response — for example, red/purple team tests of security controls, incident tabletop simulations, and after-action reviews. When it comes to the recovery part of incident response, however, few organizations proactively prepare. Consider adding the below activities to your IT and security plan.
- Implement critical technology controls today:
- Ensure all servers and end-user systems have endpoint detection and response (EDR) agents with next-generation antivirus (NGAV) blocking capabilities activated.
- Implement multifactor authentication for internet-facing systems, including email and VPN.
- Ensure every system has a unique local administrator password that is different from all others.
- Isolate at least one backup mechanism from being compromised by an attacker that obtains privileged internal access.
- Ensure every system can be managed — for example, installing software, assessing operating system and application vulnerabilities, and installing patches.
- Understand the internet-facing footprint and ensure available services are expected and protected.
- Conduct a tabletop exercise aimed at testing recovery processes at a high level to identify gaps specific to your organization.
- Shore up your asset management processes. Have a current IT asset inventory that ties to business priorities and build in a process to ensure that the information stays updated.
- Add a simulation of an attacker attempting to corrupt backups, with internal privileged access, to the scope of regular penetration tests.
- Test recovery from cloud backup systems to understand the maximum rate per hour that can be restored (a rough estimation sketch follows this list).
- Create recovery plans for the most critical business processes down to the system level. For example, to restore a critical manufacturing application, define which specific servers and workstations would be required, how specifically they would be restored, and how long that would take. Then test those plans.
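For the cloud restore testing mentioned above, even a back-of-the-envelope estimate is useful for setting expectations. The sketch below uses illustrative placeholder numbers; substitute the bandwidth, concurrency and per-stream figures observed in your own testing.

```python
# Sketch: rough restore-time estimate from cloud backup. All numbers are
# illustrative placeholders; replace them with figures from your own testing.
total_data_tb = 40            # data that must be pulled down from cloud backup
bandwidth_gbps = 1.0          # usable internet bandwidth for restores
concurrency_cap = 2           # simultaneous restore streams the platform allows
per_stream_gbps = 0.3         # observed throughput of a single restore stream

effective_gbps = min(bandwidth_gbps, concurrency_cap * per_stream_gbps)
total_gigabits = total_data_tb * 1000 * 8
hours = total_gigabits / (effective_gbps * 3600)
print(f"Effective throughput: {effective_gbps:.2f} Gbps")
print(f"Estimated transfer time: {hours:.0f} hours ({hours / 24:.1f} days)")
```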
Additional Resources
- Learn more about how CrowdStrike Breach Services can help you respond to an attack with speed and recover from an incident with surgical precision.
- Explore the speed and efficiency of MOXFIVE’s platform for incident management and business resilience.
- Download the complete CrowdStrike Incident Response eBook to learn more about CrowdStrike’s modern approach to rapid response and recovery from today’s widespread security incidents.
- Get on-demand access to CrowdStrike incident responders, forensic investigators, threat hunters and endpoint recovery specialists with a CrowdStrike Services Retainer.