Where do you start with Critical Security Controls?

Verizon’s 2015 Data Breach Investigations Report (DBIR) was published last month, building on the findings of the 2014 DBIR.  We’ll take a look at the major findings of the 2015 report in our next post, but in the spirit of heading for the dessert bar first, we’ll start with the goodies.  The conclusion of the report includes a table of the top Critical Security Controls (CSCs) from the SANS Institute which cover the majority of the observed threats:

CSC ID# Description
CSC 13-7 Two-factor authentication for remote logins.
CSC 6-1 Make sure applications are up to date, patched and supported versions.
CSC 11-5 Verify that internal devices available on the internet are actually required to be available on the internet.
CSC 13-6 Use a proxy on all outgoing traffic to provide authentication, logging, and the ability to whitelist or blacklist external sites.
CSC 6-4 Test web applications both periodically and when changes are made.  Test applications under heavy loads for both DOS and legitimate high use cases.
CSC 16-9 Use an account lockout policy for too many failed login attempts.
CSC 17-13 Block known file transfer sites.
CSC 5-5 Scan email attachments, content, and web content before it reaches the user’s inbox.
CSC 11-1 Remove unneeded services, protocols and open ports from systems.
CSC 13-10 Segment the internal network and limit traffic between segments to specific approved services.
CSC 16-8 Use strong passwords – suggested:

  • contain letters, numbers and special characters
  • changed at least every 90 days
  • minimal password age 1 day before reset
  • cannot use 15 previous passwords as new password
CSC 3-3 Restrict admin privileges to prevent installation of unauthorized software and admin privilege abuse.
CSC 5-1 Use antivirus software on all endpoints and log all detection events.
CSC 6-8 Review and monitor the vulnerability history, customer notification process, and patching/remediation for all 3rd party software.
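
The password suggestions listed under CSC 16-8 can be written down as a small policy check. This is only a sketch: the function names and the way dates and password history are passed in are assumptions made for illustration, and a real system would compare password hashes rather than plaintext values.

```python
import string
from datetime import date

MAX_AGE_DAYS = 90    # changed at least every 90 days
MIN_AGE_DAYS = 1     # minimal password age before another reset
HISTORY_DEPTH = 15   # previous passwords that cannot be reused


def has_required_characters(password):
    """True if the password mixes letters, numbers and special characters."""
    has_letter = any(c.isalpha() for c in password)
    has_digit = any(c.isdigit() for c in password)
    has_special = any(c in string.punctuation for c in password)
    return has_letter and has_digit and has_special


def is_expired(last_changed, today):
    """True once the password is more than MAX_AGE_DAYS old."""
    return (today - last_changed).days > MAX_AGE_DAYS


def may_change(new_password, previous_passwords, last_changed, today):
    """Check a proposed change against the age, history and character rules.

    previous_passwords is ordered oldest to newest (a hypothetical layout;
    plaintext is used here only to keep the sketch short).
    """
    if (today - last_changed).days < MIN_AGE_DAYS:
        return False
    if new_password in previous_passwords[-HISTORY_DEPTH:]:
        return False
    return has_required_characters(new_password)
```

The three constants mirror the numbers in the table, so they can be adjusted in one place if your own policy differs.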

 

This is not a comprehensive list – it’s a starting point.  Some CSCs listed above may not apply to your company, and other CSCs critical to your environment may not be in the list.  The technology used to implement CSCs is updated over time to keep up with threats, but more often than not it is playing catch-up.  The bottom line is that even with a well-controlled environment you can still be vulnerable to unknown threats.

The key to detecting unknown threats is to know the baseline behavior for your network and look for deviations from that baseline.  If you know what normal login traffic looks like, you can see when there are attempts to log in from unusual locations or at unusual times.  You can detect attempts to upload files to unknown FTP sites.  You can detect an unusual spike in web traffic.  More importantly – you can isolate and investigate the behavior.
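
To make the idea of a baseline concrete, here is a minimal sketch that compares new counts (for example, logins per hour) against a period believed to be normal. The three-standard-deviation threshold and the sample numbers are assumptions for illustration, not recommendations from the DBIR.

```python
from statistics import mean, stdev


def find_deviations(baseline, observed, threshold=3.0):
    """Return (index, value) pairs in `observed` that deviate from the baseline.

    baseline:  counts from a period believed to be normal
    observed:  new counts to check
    threshold: number of standard deviations treated as unusual (an assumption)
    """
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: anything different counts as a deviation.
        return [(i, v) for i, v in enumerate(observed) if v != mu]
    return [(i, v) for i, v in enumerate(observed)
            if abs(v - mu) / sigma > threshold]


# Hypothetical hourly login counts.
baseline_logins = [40, 42, 38, 41, 39, 43, 40, 37]
todays_logins = [41, 39, 120, 40]
print(find_deviations(baseline_logins, todays_logins))  # [(2, 120)]
```

A real baseline would also account for time of day and day of week, since normal traffic at 3 a.m. on a Sunday looks different from normal traffic on a Monday morning.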

Logs from domain controllers and web servers, and syslogs from network devices, exist by default, but there is no easy way to correlate across the logs in their raw form to analyze problems.  We recommend Splunk® for its ability to correlate fields across different log formats, and its dashboard and alerting capabilities to highlight activity that deviates from baselines.

 

 

How Does an Advanced Threat Work?

In our last post we looked at the Verizon 2014 DBIR’s recommendations for security controls from the SANS Institute to protect against advanced threats.  Mandiant’s Threat Report – M-Trends 2015: A View From the Front Lines – provides additional context for why these controls are needed through a case study of an advanced threat.

The first step in the attack outlined in Mandiant’s study was gaining initial access to a corporate domain, and this was done by authenticating with valid credentials through a virtualized application server.  Mandiant was not able to determine how the credentials were obtained in this case.  While spear phishing and other user exploits are possible methods for harvesting credentials, the attackers may instead have used software vulnerabilities on the system to intercept login credentials.  For example, the year-old Heartbleed vulnerability was exploited to gain valid credentials used for attacks on Community Health Systems and the Canada Revenue Agency.

Once the attacker had access, the next step was to use a misconfiguration of the virtual appliance to gain elevated privileges, which allowed for the download of a password dumping utility.  That led to the local administrator password, which was the same for all systems, and thus gave access to every system in the domain.  Mandiant’s report then outlines how Metasploit was used to reconnoiter the environment, obtain corporate domain admin credentials, configure additional entry points, and install back doors to communicate with command and control over the internet.

Once the attacker was entrenched in the systems, they moved on to their target in the retail environment.  The retail environment was a child domain of the corporate domain, and the hacked domain admin credentials allowed full access.  The attacker copied malware to the retail sales registers, and that malware harvested payment card data from the POS systems, copying the data to a workstation with internet access so that it could be exfiltrated to the attacker’s servers via FTP.

Some important points about this attack are:

  1. Valid credentials were used.
    Patching vulnerable servers and educating users may make it harder but not impossible for attackers to obtain valid credentials.  Multifactor authentication could have made access more difficult and monitoring user login locations could have flagged unauthorized access.

  2. Server and application configuration is critical.
    An application misconfiguration led to local admin privileges, and common local admin accounts made it easy to spread throughout the corporate domain and then to the target retail environment.  Using different admin credentials and restricting access to the retail environment would have slowed down and possibly prevented data exfiltration.

  3. Network traffic anomalies were not noticed.
    FTP traffic was used to download tools and upload data.  The attack used external command and control servers.  Watching traffic over the corporate and retail networks would have picked up recurring patterns in data sent to and received from unknown external addresses.
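
The third point, recurring traffic to unknown external addresses, can be sketched in a few lines. The record layout, the allowlist of known hosts, and the jitter threshold below are all assumptions for illustration; real command and control traffic may deliberately vary its timing to avoid this kind of check.

```python
from collections import defaultdict
from statistics import mean, pstdev


def find_recurring_unknown(connections, known_hosts, min_events=4, max_jitter=0.1):
    """Flag unknown destinations that are contacted at near-regular intervals.

    connections: iterable of (timestamp_seconds, destination) tuples
    known_hosts: destinations already approved for the environment
    max_jitter:  allowed spread of the intervals relative to their mean
    """
    times = defaultdict(list)
    for ts, dest in connections:
        if dest not in known_hosts:
            times[dest].append(ts)

    flagged = []
    for dest, stamps in times.items():
        if len(stamps) < min_events:
            continue
        stamps.sort()
        gaps = [b - a for a, b in zip(stamps, stamps[1:])]
        avg = mean(gaps)
        # Regular check-ins have gaps that barely vary around their average.
        if avg > 0 and pstdev(gaps) / avg <= max_jitter:
            flagged.append(dest)
    return sorted(flagged)


# Hypothetical traffic: one destination checks in roughly every 5 minutes.
sample = [
    (0, "203.0.113.9"), (300, "203.0.113.9"), (600, "203.0.113.9"), (905, "203.0.113.9"),
    (10, "198.51.100.7"), (50, "198.51.100.7"), (700, "198.51.100.7"), (720, "198.51.100.7"),
    (0, "10.1.1.20"), (60, "10.1.1.20"), (120, "10.1.1.20"), (180, "10.1.1.20"),
]
print(find_recurring_unknown(sample, known_hosts={"10.1.1.20"}))  # ['203.0.113.9']
```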

Advanced threats are designed to be persistent and difficult to detect.  Mandiant reported:

[A]ttackers still had a free rein in breached environments far too long before being detected—a median of 205 days in 2014 vs. 229 days in 2013. At the same time, the number of organizations discovering these intrusions on their own remained largely unchanged. Sixty-nine percent learned of the breach from an outside entity such as law enforcement. That’s up from 67 percent in 2013 and 63 percent in 2012.

 

It is not possible to completely eliminate the possibility of a data breach.  But you can limit the scope of a breach and detect it sooner by using the best practices outlined by SANS.org, and by monitoring your network and server activity with SIEM tools such as Splunk®.

 

Advanced Threats and Cyber Espionage

Cyber Espionage was one of the most complex attack patterns described in the 2014 Data Breach Investigations Report (DBIR) from Verizon.  Cyber Espionage is one form of an Advanced Threat in which an attack is specifically designed to infiltrate your network, dig in once it’s there, and then exfiltrate data back to the attacker’s servers.

Advanced threat attackers will typically enter the network through user activity and will exploit unpatched software security bugs in order to entrench themselves in your network.  In the best of all possible worlds users would know better than to open email attachments or click on suspicious links, and your software would be patched as soon as the patches are available.  In reality users can be fooled by sophisticated spear phishing emails that are indistinguishable from legitimate emails, or by a website that they’ve used previously without issue which has been compromised in a watering hole attack.  And while patching systems quickly is the goal, the reality is that there are a large number of systems running older, unpatchable software that are easy targets.

The DBIR assembled a list of security controls from the SANS Institute to combat Cyber Espionage, all of which are fundamental best practices.

Even with best practices firmly in place, due diligence is no guarantee against compromise.  Best practices can lower the chances that an advanced threat will make it into your network, but complete security coverage means more than just locking down your network.  You also need to check to see if you’ve been compromised and determine the scope of the problem if it’s there.

One of the difficulties in determining if you’ve been compromised by an Advanced Threat is that the signs are subtle.  Once your systems have been compromised there will be no ransomware demands or programs that suddenly stop working.  The point of an Advanced Threat is to infiltrate a computer without ever being detected and to then spread to other computers in the network so that the attack can maintain a foothold even if it’s removed from the originally infected computer.

Detecting an Advanced Threat means understanding what normal activity looks like on your network and monitoring it so that you notice activity outside normal operating parameters.   It means monitoring your outbound network activity to look for signs that data is being exfiltrated.  It means looking for internal network traffic on computers that have no reason to contact each other.  It may also mean monitoring servers or workstations for unusual processes or unexpected installations.  It means looking for any activity you can’t explain.
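
One of those checks, internal traffic between computers that have no reason to contact each other, can be sketched as a comparison against the host pairs seen during a baseline period. The flow format and host names are hypothetical; in practice the flows would come from your own network logs.

```python
def unexpected_pairs(baseline_flows, current_flows):
    """Return internal host pairs seen now but never during the baseline.

    Each flow is a (source, destination) tuple; direction is ignored.
    """
    known = {frozenset(flow) for flow in baseline_flows}
    unexpected = set()
    for src, dst in current_flows:
        if frozenset((src, dst)) not in known:
            unexpected.add(tuple(sorted((src, dst))))
    return sorted(unexpected)


# Hypothetical flows: a workstation suddenly talks to a POS terminal.
baseline = [("pos-01", "pos-server"), ("ws-12", "file-01")]
current = [("pos-server", "pos-01"), ("ws-12", "pos-01")]
print(unexpected_pairs(baseline, current))  # [('pos-01', 'ws-12')]
```

A new pair is not proof of compromise, only activity you can’t yet explain, which is the point at which to start investigating.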

Finding the information to set up a security baseline isn’t the problem.  A vast amount of data is available in existing logs and from collectors set up for system or network activity.  The problem is filtering out the vast majority of normal activity in order to pinpoint the exceptions that indicate a compromise.  That’s where Splunk® can be used to pinpoint anomalies and correlate incidents across multiple log files, not only to find a problem but also to trace the scope of the attacker’s activity.

Prioritizing Your Cyberdefenses

Verizon’s 2014 Data Breach Investigations Report (DBIR) compiled a decade’s worth of security incident data both from breaches and security incidents that did not result in data loss.  They were able to group the incidents into 9 patterns based on the method of attack, the attack target, and attacker motivations. They were Point of Sale Intrusions (e.g. Home Depot), Web App Attacks, Insider and Privilege Misuse (e.g. Edward Snowden), Physical Theft, Miscellaneous Errors (e.g. document misdelivery or incorrect document disposal), Crimeware, Payment Card Skimmers, Cyberespionage, and DOS attacks.  These 9 patterns described 92% of the attacks; the remaining 8% didn’t fit into the existing patterns but used similar underlying methods.

This breakdown of attack patterns is relevant because Verizon also found different industries were targeted by different types of patterns.  Obviously retailers with physical stores were targets for Point of Sale Intrusions but they were also primary targets of DOS attacks.  While one might expect Crimeware to be a significant problem in Healthcare it turns out that Physical Theft, Insider Misuse, and Miscellaneous Errors were far more serious issues.

Knowing the most likely avenue of attack for your environment enables you to prioritize your defenses.  Each pattern has a recommended set of control measures.  For example, securing POS systems can mean isolating those systems, restricting access, using and updating antivirus, and enforcing strong password policies.  If Crimeware or Cyberespionage are potential problems, keeping a software inventory and scanning for unauthorized software may be more beneficial than restricting access.  The conclusion of the report includes charts displaying which security control measures are most significant vs. threat pattern and industry, including links to the applicable sections of SANS Critical Security Controls.

Addressing the most significant threats to your network is a good place to start, but obviously you want 100% protection, and prioritizing your defenses for 92% of possible threats isn’t a complete defense.  As we discussed in an earlier post looking at Cisco’s 2015 Annual Security Report, malware and spam are evolving and becoming more difficult to detect, so there is also the possibility that a new type of attack could make it through your defenses before you have the tools to defend against it.

How can you defend against unknown attacks?  The answer is by knowing what your environment looks like when there is no problem and monitoring your environment for unusual activity that can indicate problems.  Network traffic to atypical outside sites could indicate someone trying to exfiltrate data from your environment.  Failed login attempts could be someone trying a brute force attack to log in to your network.  A large number of connections to your company’s web portal could be someone trying to hijack it.
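
As one example of this kind of monitoring, the sketch below flags sources that produce a burst of failed logins inside a short sliding window. The limits of 5 failures in 60 seconds and the event format are assumptions for illustration; appropriate values depend on what normal looks like in your environment.

```python
from collections import defaultdict, deque


def brute_force_sources(failed_logins, max_failures=5, window_seconds=60):
    """Return sources with more than `max_failures` failures in any window.

    failed_logins: iterable of (timestamp_seconds, source) sorted by time
    """
    recent = defaultdict(deque)
    flagged = set()
    for ts, source in failed_logins:
        window = recent[source]
        window.append(ts)
        # Drop failures that have aged out of the sliding window.
        while window and ts - window[0] > window_seconds:
            window.popleft()
        if len(window) > max_failures:
            flagged.add(source)
    return sorted(flagged)


# Hypothetical events: one source fails six times in ten seconds.
attacker = [(t, "10.0.0.5") for t in range(0, 12, 2)]
normal = [(0, "10.0.0.9"), (100, "10.0.0.9"), (200, "10.0.0.9")]
print(brute_force_sources(sorted(attacker + normal)))  # ['10.0.0.5']
```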

Whether or not you’re looking at the activity in your environment, most applications have logs recording the activity.  The difficulty is in differentiating normal behavior from threats and tracing threats across different log formats.  Heroix has begun to work with Splunk® to help analyze logs and identify attacks.

Security and Unsupported Linux Releases

Last September a serious vulnerability was found in the widely used Unix/Linux bash shell.  This vulnerability was dubbed “Shell Shock” and had been in existence since the beginnings of the bash shell in 1989.  The vulnerability was widely publicized and patches were developed and distributed quickly.  While the initial round of Shell Shock patches was incomplete, comprehensive patches have been available since October 2014.  In theory Shell Shock should no longer be a threat.

In practice, as per the Cisco 2015 Annual Security Report, there are still a large number of systems that have not yet been patched for Shell Shock.  Part of the reason systems haven’t been patched is that they are running older unsupported Linux releases for which patches are not available.  Unless a server is actively targeted for attack or displaying performance issues, upgrading to an actively supported OS and patching possible threats is often a lower priority for already overworked IT departments.

However, as older code is examined further, more bugs are found and the possible attack vectors increase.  Qualys’ recently discovered GHOST vulnerability is yet another bug found in older code.  GHOST uses a buffer overrun in the gethostbyname functions in the GNU glibc library to provide a means for attackers to execute arbitrary code.  The security bulletin includes C code for a program that will test for GHOST and outlines how GHOST can be exploited for the Exim mail server.

GHOST dates back to glibc 2.2 (released November 2000) and was fixed in glibc 2.18 (released August 2013), so the fix is already available.  All you have to do is upgrade to glibc version 2.18 or newer.
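
The version range above can be turned into a quick first-pass check. This sketch only compares upstream version numbers; distributions often backport security fixes without changing the upstream version, so a version inside the range does not prove a system is vulnerable. The function names are made up for this example.

```python
import re

FIRST_VULNERABLE = (2, 2)   # GHOST dates back to glibc 2.2
FIRST_FIXED = (2, 18)       # fixed in glibc 2.18


def parse_version(text):
    """Turn a string such as '2.17' or 'glibc 2.12.1' into a tuple of ints."""
    match = re.search(r"(\d+(?:\.\d+)+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def in_ghost_range(version_text):
    """True if the upstream major.minor version falls in the affected range."""
    version = parse_version(version_text)
    return FIRST_VULNERABLE <= version[:2] < FIRST_FIXED


print(in_ghost_range("2.17"))  # True
print(in_ghost_range("2.18"))  # False
```

On a Linux host, `platform.libc_ver()` from the Python standard library reports the C library name and version string, which can be passed to `in_ghost_range`. The test program in the Qualys bulletin remains the more reliable check, because it exercises the bug itself rather than a version number.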

Unfortunately the repaired versions of glibc may not be an option for older OS versions.  One problem is that the glibc library is referenced by applications, and a change in glibc versions can cause those applications to break.  As a result patching glibc involves both upgrading to an OS that supports the repaired glibc library and testing applications to make sure they can function with the newer system and library versions.  If the application doesn’t work on the newer system then the application needs to be upgraded as well.  If the application is no longer supported and doesn’t function on the new system version, then the decision comes down to weighing how critical the application is to your organization against how serious a security threat you consider unpatched vulnerabilities.

Finding and patching bugs will not eliminate all security threats.  Users may select insecure passwords, fall prey to phishing scams, or lose a laptop with cached passwords.  If an attacker wants it badly enough they will find a way into your network.  Patching security bugs will make it harder for them to succeed and may give you enough notice to block them completely.

Best Practices to Combat Computer Security Issues in 2015

Cisco’s recently published 2015 Annual Security Report summarized the security trends it found in 2014 and advised on best practices to address predicted threats in 2015.  Some of the key security findings were:

  • Malware is getting better at evading detection
    With increasing attention to Java security, the number of security exploits using Java decreased over the course of 2014.  However exploits are becoming more difficult to detect because they change quickly. They are expanding into technologies not often exploited before (e.g. Microsoft’s Silverlight) and they can involve multiple technologies (e.g. using both Flash Player and JavaScript).

  • Spam is growing in volume and sophistication
    Cisco found that the volume of spam increased 250% from January to November in 2014 and spammers are finding more ways of evading spam filters.  Spam content, sender addresses, and originating IP addresses can all be difficult to differentiate from legitimate emails.  To hide the sender’s IP address, spammers have been using a “snowshoe” method in which the emails are sent from a large number of (often infected) computers, so it isn’t possible to track the email back to a specific blacklisted address.

  • Known threats are still problems on outdated software
    2014’s headline security threats – Heartbleed and Shellshock – are still issues.  Cisco’s surveys found that 56% of the OpenSSL implementations were using versions more than 4 years old and only 10% of the IE browsers accessing sites were using the current version.  The problem is not that patches aren’t being applied; it’s that versions aren’t being updated to ones where patches are available for vulnerabilities. Cisco provided the following recommendation:

    To overcome the guaranteed eventual compromise that results from manual update processes, it may be time for organizations to accept the occasional failure and incompatibility that automatic updates represent.

  • Users are a significant vulnerability
    Malware may be inadvertently downloaded from seemingly safe websites – Cisco found that many high-ranking, short-lived websites contained malware.  Malware was also installed through browser add-ons and software downloads – often with misleading and confusing install options that trick the user into agreeing to install the malware.  In Cisco’s survey of 70 companies there were 711 users affected by malware at the beginning of the year, rising to a peak of 1751 users affected during the month of September.

Cisco provided several recommendations on how to deal with the current security climate:

  • Adopt a more sophisticated endpoint visibility, access, and security (EVAS) control strategy.
    Even if you are able to secure your network you still have to plan for what to do if an attack occurs.  Determining the scope of an attack that makes it past a network’s front door means monitoring the potential target endpoints within the network.  EVAS monitors the endpoint activity within a network before, during, and after attacks, allowing you to formulate a plan to deal with mitigating the threat and providing the tools to conduct a forensic analysis to prevent the problem from occurring again.

  • Security must be integrated into the business
    Business planners and security staff must work together to ensure that security is an integral part of all IT plans.  However, security that makes it difficult for users to access resources can result in users finding ways to circumvent security measures.  Security planning must consider both protection from threats and accessibility for users.

  • Users must be included in the security plans
    No security plan will be able to address all problems.  There are too many possible threats and they change too quickly.  Users must be trained on what activities are potentially dangerous and how to recognize when there is a problem as well as how to report problems.

VMware NUMA Performance

In our previous post we outlined how NUMA works with Virtual Machines (VMs) that either fit entirely into one NUMA home node or VMs that are divided into multiple NUMA clients with each client assigned its own home node. The following points should be considered when working with VMs that use NUMA hardware:

  • The hypervisor may migrate VMs to a new NUMA node on the host.
    Hypervisors adjust physical resources as needed to balance VM performance and fairness of resource allocation over all VMs. If the home NUMA node for a VM is maxed out on CPU, the hypervisor may allocate CPU from a different NUMA node even if that means incurring latency from accessing remote memory. When CPU is available on the NUMA node with VM memory the hypervisor will migrate the CPU resources back to that NUMA node in order to improve memory access. The hypervisor will factor in CPU usage, memory locality, and the performance cost of moving memory from one NUMA node to another in its determination of whether to migrate a VM to a different NUMA node.


    VMs may also be migrated in an attempt to achieve long term fairness. The CPU Scheduler in VMware vSphere 5.1 gives an example of 3 VMs on 2 NUMA nodes. The NUMA node with 2 VMs will split resources between the 2 VMs while the NUMA node with only one VM will have all resources allocated to that VM. In the long term migrating VMs between the two nodes may average out performance but in the short term it just transfers the resource contention from one VM to another with the additional performance cost of moving memory from one NUMA node to the other. This type of migration can be disabled by setting the advanced host attribute /Numa/LTermFairnessInterval=0.

  • VMs that frequently communicate with each other may be placed on the same NUMA node.
    VMware’s ESX hypervisor may place VMs together on the same NUMA node if there is frequent communication between the VMs. This “action-affinity” can end up causing an imbalance in VM placement on NUMA nodes by assigning multiple VMs to the same home NUMA node and underpopulating other NUMA nodes. In theory the gain from the VMs being able to access common memory resources will offset the increase in CPU ready levels. In practice this may not be the case and this feature may be disabled by setting /Numa/LocalityWeightActionAffinity=0 in advanced host attributes.

  • Hyperthreading doesn’t count when VMs are assigned to NUMA nodes.
    The hypervisor looks at available cores when determining which NUMA node to assign as the home node for a VM. If a NUMA node has 4 physical cores and a VM is allocated 8 processors, then the VM would be divided into 2 NUMA clients spread over 2 nodes. This ensures CPU resources but may increase memory latency. If a VM is running a memory intensive workload it may be more efficient to restrict the VM to one NUMA node by configuring the hypervisor to take hyperthreading into account. This is done by setting the numa.vcpu.preferHT advanced VM property to True.

  • VMs migrating between hosts with different NUMA configurations may experience degraded performance.
    For VMs moving to a host with smaller NUMA nodes, it is possible that they will need to be split into multiple NUMA clients, while hosts with larger NUMA nodes may be able to merge wide VMs into a single node. Performance will be degraded until the hypervisor on the new host can configure the VMs for the new host NUMA configuration.

  • VMs spread over multiple NUMA nodes may benefit from vNUMA.
    Some applications and operating systems can take advantage of the NUMA architecture to improve performance. A VM running these applications or operating systems that is spread across multiple NUMA nodes can be configured to use virtualized NUMA (vNUMA) to take advantage of the underlying architecture just as if it were on a physical host, which can lead to large performance gains. However, if the VM migrates to a new host and the NUMA configuration on the new host is different, this could end up degrading performance until the VM can be restarted using the new vNUMA configuration.

While adjustments to the hypervisor’s NUMA algorithms may provide some performance improvements, the last two items are the most important takeaways. It is a best practice to ensure that hosts in a cluster have the same NUMA configuration to avoid performance issues when VMs move from one host to another.
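
A simple way to act on that best practice is to compare the NUMA layout of every host in a cluster. The sketch below assumes you have already collected the node count, cores per node, and memory per node for each host by some other means; the dictionary keys and host names are hypothetical.

```python
from collections import defaultdict


def group_hosts_by_numa(hosts):
    """Group host names by (NUMA nodes, cores per node, GB RAM per node).

    hosts: mapping of host name -> dict with the hypothetical keys
           'nodes', 'cores_per_node' and 'gb_per_node'
    """
    groups = defaultdict(list)
    for name, cfg in hosts.items():
        key = (cfg["nodes"], cfg["cores_per_node"], cfg["gb_per_node"])
        groups[key].append(name)
    return {key: sorted(names) for key, names in groups.items()}


def cluster_is_uniform(hosts):
    """True if every host in the cluster shares one NUMA configuration."""
    return len(group_hosts_by_numa(hosts)) <= 1


# Hypothetical cluster with one mismatched host.
cluster = {
    "esx-01": {"nodes": 2, "cores_per_node": 8, "gb_per_node": 64},
    "esx-02": {"nodes": 2, "cores_per_node": 8, "gb_per_node": 64},
    "esx-03": {"nodes": 4, "cores_per_node": 4, "gb_per_node": 32},
}
print(cluster_is_uniform(cluster))  # False
print(group_hosts_by_numa(cluster))
```

Grouping rather than returning a plain yes or no makes it easier to see which hosts are the odd ones out.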

How to Minimize CPU Latency in VMware with NUMA

On the most basic level CPUs do one thing: process data based on instructions. The faster the CPU the faster it processes data. But before a CPU can process data it has to read both the data and the instructions from slower system RAM and that latency can slow the CPU processing. In order to minimize the time the CPU is waiting on reading data, CPU architectures include on-chip memory caches that are much faster than RAM. However even though the on-chip caches have hit rates that are better than 95% there are still times when the CPU has to wait for data from RAM.

When the CPU reads from RAM the data is transferred along a bus shared by all the CPUs on a system. As the number of CPUs in a system increases, the traffic along that bus increases as well, and CPUs can end up contending with each other to access RAM. This is where NUMA comes in – NUMA is designed to minimize the problem of system bus contention by increasing the number of paths between CPU and RAM.

NUMA (Non-Uniform Memory Access) breaks up a system into nodes of associated CPUs and local RAM. NUMA nodes are optimized so that the CPUs in a node preferentially use the local RAM within that node. The result is that CPUs typically contend only with other CPUs within their NUMA node for access to RAM rather than with all the CPUs on a system.

As an example consider a system with 4 processor sockets of 4 cores each and a total of 128 GB RAM. Without NUMA that comes to 16 physical processors that could potentially be queued up on the same system bus to access 128 GB RAM. If this same system were broken up into 4 NUMA nodes, each node would have 4 CPUs with local access to 32 GB RAM.
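
The arithmetic in this example is easy to reproduce for your own hardware. The helper below is a sketch that assumes cores and memory are divided evenly across the NUMA nodes, which is the case in the example but not guaranteed on every system.

```python
def numa_layout(sockets, cores_per_socket, total_ram_gb, numa_nodes):
    """Split a host's cores and RAM evenly across its NUMA nodes."""
    total_cores = sockets * cores_per_socket
    return {
        "total_cores": total_cores,
        "cores_per_node": total_cores // numa_nodes,
        "ram_per_node_gb": total_ram_gb / numa_nodes,
    }


# 4 sockets x 4 cores, 128 GB in total, 4 NUMA nodes.
print(numa_layout(4, 4, 128, 4))
# {'total_cores': 16, 'cores_per_node': 4, 'ram_per_node_gb': 32.0}
```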

The ESXi hypervisor can manage virtual machines (VMs) so that they take advantage of the NUMA system architecture. The VMware 5.1 Best Practices Guide divides VMs running on NUMA into 2 groups:

  1. The number of virtual CPUs for a VM is less than or equal to physical CPUs in the NUMA node.
    The ESXi hypervisor assigns the VM to a home NUMA node where memory and physical CPU are preferentially used. Best practices in this case are that the allocated VM memory be less than the NUMA node memory. As far as the VM is concerned it is effectively on a non-NUMA system where all CPU and memory resources are local.

  2. The number of virtual CPUs for a VM is greater than the number of physical CPUs in the NUMA node (“Wide VMs”).
    Wide VMs are split into multiple NUMA clients with each client assigned a different home NUMA node.  For example, if a system had multiple NUMA nodes of 1 socket with 4 cores each (4 physical CPUs/node) and a wide VM had 8 virtual CPUs, then ESXi can divide the VM into two NUMA clients of 4 virtual CPUs each, assigned to 2 different home NUMA nodes. The problem with dividing a wide VM into multiple NUMA clients is that it introduces the possibility that one of the client nodes may need to access memory from a different NUMA client node.
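
The two cases can be told apart with a quick calculation. This is a sketch under two assumptions: only physical cores are counted (no hyperthreads), and vCPUs are spread as evenly as possible across clients. How ESXi actually distributes vCPUs is up to the hypervisor, so treat the split as illustrative.

```python
import math


def numa_clients(vcpus, cores_per_node):
    """Number of NUMA clients needed when only physical cores are counted."""
    return math.ceil(vcpus / cores_per_node)


def split_vcpus(vcpus, cores_per_node):
    """Spread vCPUs as evenly as possible across the NUMA clients (assumption)."""
    clients = numa_clients(vcpus, cores_per_node)
    base, extra = divmod(vcpus, clients)
    return [base + 1 if i < extra else base for i in range(clients)]


print(numa_clients(4, 4), split_vcpus(4, 4))  # 1 [4]     fits in one home node
print(numa_clients(8, 4), split_vcpus(8, 4))  # 2 [4, 4]  wide VM, two clients
```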

Note: In a previous post we discussed hyperthreading – as per The CPU Scheduler in VMware vSphere 5.1, hyperthreading isn’t taken into account when you’re calculating the number of processors available on a NUMA node.

In our next post we’ll take a look at using virtual NUMA (vNUMA) to minimize wide VMs accessing remote NUMA memory and what happens when VMs configured to use NUMA are migrated to a host with a different NUMA system configuration.

Just When You Thought Your Devices Were Secure…

Last month’s Shell Shock bug had Unix, Linux and Network admins patching their systems against a bash shell vulnerability. This month everyone gets to play along as October adds patches for Microsoft, Adobe, Oracle and a new SSL bug named POODLE.

  • Microsoft October 2014 patches
    Microsoft has issued 3 Critical and 5 Important patches. One of the Critical patches addresses 14 vulnerabilities in Internet Explorer versions 6 through 11, although the bugs are only rated as Moderate in IE 6. As discussed in a previous post Microsoft can’t test every possible configuration. I suggest installing patches on test systems in your own environment before deploying throughout your Windows environment.

  • Adobe Flash Player patches
    Adobe has issued patches for both ColdFusion (Important) and Flash Player (Critical). The Critical Flash Player patches cover Windows, Mac, Linux, Android and iOS and include patches for both Flash Player and Adobe AIR. Adobe also recommends upgrading to the latest versions in addition to patching, and you’re better off patching and upgrading Flash sooner rather than later. You may also want to consider using a Flash Block/Flash Control plugin or configuring IE to require you to approve sites before you allow them to run Flash Player content.

  • Oracle Critical Patch Update
    The National Vulnerability Database lists 131 CVE vulnerabilities for Oracle in October 2014. Oracle patches also cover their Java, Solaris and MySQL acquisitions and the patches for Java SE on Windows rate up to 10 out of 10 for severity level. The Oracle update page provides an extensive risk matrix for each of the patched applications – use this to evaluate the severity of the vulnerability for your specific applications and then test and patch accordingly.

  • POODLE bug in SSL3.0
    POODLE stands for Padding Oracle On Downgraded Legacy Encryption (CVE-2014-3566) and works by listening in on and decrypting less secure SSL 3.0 traffic. Most web servers and clients use the more secure TLS protocols for HTTPS connections and will fall back to SSL 3.0 only for legacy applications. However it is possible for hackers to interfere with an HTTPS session negotiation so that TLS fails and the session falls back to SSL 3.0, allowing this bug to be exploited. The fix for POODLE is to remove the SSL 3.0 protocol from web servers and clients or to disable fallback to SSL 3.0 if you need to maintain legacy applications.  This vulnerability should be addressed for both web servers and web clients as soon as possible but is rated as 4.3 (Medium) and is nowhere near the threat level of either Shell Shock or Heartbleed.

    Microsoft provides instructions on a registry edit to disable SSL 3.0 for IIS web servers and askubuntu.com has information on how to remove SSL 3.0 support for Apache, Nginx and other web servers. Qualys SSL Labs provides an SSL Server test that will evaluate the security of your site for SSL 3.0 and other potential vulnerabilities.

    Qualys also provides a browser test for SSL 3.0 support. Eventually newer browsers will stop supporting SSL 3.0 but until then it can be disabled:

    • Firefox
      Set “security.tls.version.min” to 1 in “about:config” – or use the Disable SSL 3.0 plugin to do it for you.

    • Google Chrome
      You can use the startup flag “--ssl-version-min=tls1” to start Chrome without SSL 3.0 support. Recent versions of Chrome also support the TLS_FALLBACK_SCSV mechanism that prevents falling back to SSL 3.0.

    • Internet Explorer
      In Tools – Internet Options – Advanced – Security, uncheck the boxes for SSL 2.0 and SSL 3.0.
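
The same idea applies to scripts and applications that make their own HTTPS connections. As a minimal sketch, Python’s standard `ssl` module lets a client set a minimum protocol version so that SSL 3.0 can never be negotiated. Using TLS 1.2 as the floor is an assumption made here; any minimum of TLS 1.0 or higher excludes SSL 3.0, and the function name is made up for the example.

```python
import ssl


def make_client_context():
    """Build a client-side TLS context that will not negotiate SSL 3.0."""
    context = ssl.create_default_context()
    # Assumption: TLS 1.2 as the floor. Any floor of TLS 1.0 or above rules out SSL 3.0.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


context = make_client_context()
print(context.minimum_version)
```

The resulting context can be passed to anything in the standard library that accepts one, for example the `context` argument of `urllib.request.urlopen`.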

Maximizing VM Performance and CPU Utilization

In a previous post we discussed memory management in VMware and the allocation of memory. Memory over allocation is when you provision your virtual machines with more memory than actually exists on the host machines. Memory over allocation works because the hypervisor assigns memory to virtual machines as needed rather than as provisioned. Do you have a server that needs 2 GB memory for 10 minutes each night and functions at 0.5 GB for the rest of the day? The hypervisor will run the VM with 0.5 GB of memory, increase it to 2 GB as needed for 10 minutes, and then reclaim the memory when it hasn’t been used for a while and is needed elsewhere.

The safest scenario is to plan for the case where all VMs are using their maximum memory allocation and only assign existing resources. However this leaves a lot of idle memory on the table that could be used for additional VMs. If you use that idle memory to provision additional VMs the (unlikely) worst case scenario would be if all the VMs spiked memory to 100% at the same time causing the hypervisor to start swapping and leading to severe performance degradation. The additional VMs you could create from memory over allocation aren’t worth the risk for a mission critical VM. However if you need to squeeze in a couple more web servers or virtual desktops then memory over provisioning is useful.
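
To see how far a host is over allocated, and how much memory would be missing in the worst case where every VM uses its full allocation at once, a short calculation is enough. The host size and VM allocations below are made-up numbers for illustration.

```python
def memory_overcommit(host_memory_gb, vm_allocations_gb):
    """Summarize memory over allocation for one host."""
    provisioned = sum(vm_allocations_gb)
    return {
        "provisioned_gb": provisioned,
        "ratio": provisioned / host_memory_gb,
        # Memory the hypervisor would have to swap if every VM peaked together.
        "worst_case_shortfall_gb": max(0, provisioned - host_memory_gb),
    }


# Hypothetical host with 64 GB RAM running five VMs provisioned with 16 GB each.
print(memory_overcommit(64, [16, 16, 16, 16, 16]))
# {'provisioned_gb': 80, 'ratio': 1.25, 'worst_case_shortfall_gb': 16}
```

A ratio of 1.0 or less corresponds to the safest scenario described above; anything higher relies on the VMs not all peaking at the same time.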

Just as with VM memory, CPU is usually highly underutilized and can be over allocated without compromising performance. As per the Performance Best Practices for VMware vSphere 5.5:

In most environments ESXi allows significant levels of CPU overcommitment (that is, running more vCPUs on a host than the total number of physical processor cores in that host) without impacting virtual machine performance.
(p. 20)

 

Without over allocation the total number of vCPUs is limited to the number of physical CPU cores (pCPU) on a host:

(# Processor Sockets) X (# Cores / Processor)  = # Physical Processors (pCPU)

If the physical processors use hyperthreading:

(# pCPU) X (2 logical processors / physical processor) = # Logical Processors

If you’ve got 2 processors with 6 cores each that would provide 12 pCPUs, or 24 logical processors with hyperthreading enabled. However hyperthreading works by providing a second execution thread to an existing core. When one thread is idle or waiting the other thread can execute instructions. This can increase efficiency if there is enough CPU idle time to provide for scheduling two threads. However in practice performance increases are up to 30% rather than the 2x CPU suggested by the logical processor count formula.
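
The two formulas and the hyperthreading caveat can be combined in a few lines. This is a sketch: the 30% figure is the upper bound quoted above, not a measured value, so the estimate is a best case rather than a prediction.

```python
def processor_counts(sockets, cores_per_socket, hyperthreading=False):
    """Return (pCPUs, logical processors) for a host."""
    pcpus = sockets * cores_per_socket
    logical = pcpus * 2 if hyperthreading else pcpus
    return pcpus, logical


def best_case_capacity(pcpus, ht_gain=0.30):
    """Best-case capacity in pCPU equivalents, using the 'up to 30%' figure."""
    return pcpus * (1 + ht_gain)


pcpus, logical = processor_counts(2, 6, hyperthreading=True)
print(pcpus, logical)                       # 12 24
print(round(best_case_capacity(pcpus), 1))  # 15.6
```

In other words, the 24 logical processors on this host behave at best like about 15.6 physical cores, not 24.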

In addition to considering the effect of hyperthreading you will also need to consider the type of workloads being run by processors and whether you are using NUMA (Non-Uniform Memory Access) hardware. We’ll delve into the intricacies of tuning vCPUs, workloads, and host hardware in a later post. For now Best Practices for Oversubscription of CPU, Memory and Storage suggests starting with one vCPU per VM and increasing as needed and quotes recommendations for the maximum ratio of vCPUs to pCPU varying from 1.5 to 15.

The Best Practices paper lists several metrics to monitor in order to determine the best vCPU to pCPU ratio for your environment:

VM CPU Utilization: To determine if a VM requires additional vCPU resources.
Host CPU Utilization: To determine overall pCPU utilization.
CPU Ready: Measures the amount of time a VM has to wait for pCPU resources. VMware recommends this should be less than 5%.

Maximum CPU for both Host and VM is typically set at 80% but this value should be adjusted depending on your workload and hardware.
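
These thresholds are straightforward to check once the metrics have been collected. The sketch below assumes CPU Ready is reported as a summation in milliseconds over a sampling interval, as in vSphere performance charts, with 20 seconds used as the default interval for real-time samples; adjust the interval to match your data source. The function names are made up for the example.

```python
def vcpu_to_pcpu_ratio(total_vcpus, pcpus):
    """Ratio of allocated vCPUs to physical cores on a host."""
    return total_vcpus / pcpus


def cpu_ready_percent(ready_ms, interval_seconds=20):
    """Convert a CPU Ready summation (ms) into a percentage of the interval.

    interval_seconds=20 is an assumption matching real-time samples.
    """
    return ready_ms * 100 / (interval_seconds * 1000)


def cpu_warnings(ready_ms, host_cpu_percent, ready_limit=5.0, cpu_limit=80.0):
    """Return the names of the metrics that exceed the suggested limits."""
    warnings = []
    if cpu_ready_percent(ready_ms) >= ready_limit:
        warnings.append("cpu_ready")
    if host_cpu_percent > cpu_limit:
        warnings.append("host_cpu")
    return warnings


print(vcpu_to_pcpu_ratio(36, 12))   # 3.0
print(cpu_ready_percent(500))       # 2.5
print(cpu_warnings(1200, 85.0))     # ['cpu_ready', 'host_cpu']
```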