Scaling Virtual Machines in Kubernetes Clusters: Insights for an application deployed on Kubernetes
Kubernetes has emerged as a cornerstone for building scalable applications, offering dynamic deployment, management, and scaling of workloads. Yet, balancing infrastructure costs with performance requirements remains a complex challenge. Our research set out to address a critical question: how do different node configurations in Kubernetes affect application performance, cost, environmental impact and scalability?
This blog is based on our research, which is available on our GitHub Repository.
The Challenge of Efficient Scaling
Selecting the right node configuration is not straightforward. Each workload has unique demands that influence its performance. Overprovisioning wastes resources and increases costs, while underprovisioning risks application instability and poor user experience. Additionally, inefficient node configurations lead to higher energy consumption, contributing to unnecessary carbon emissions.
Our research dives deep into understanding how different node types and configurations behave under varying workloads, providing actionable insights to optimize the horizontal scaling of Kubernetes clusters.
Methodology
The research was conducted using a structured prototype that facilitated data collection and analysis. The prototype comprised a Kubernetes cluster deployed on Vultr, consisting of multiple node pools with predefined configurations. Each node pool was associated with specific node types and plans, ensuring that only nodes with matching specifications were included. This setup enabled controlled testing of the application under varying server conditions.
Key Features of the Prototype
- CMS Application: A custom-built Content Management System (CMS) was split into a frontend (Next.js) and an API (Nest.js).
- Simulated Delays: To ensure consistent benchmarking, we introduced a 30 ms delay for simulated database queries and relied on static data responses.
- Controlled Environment: External dependencies, such as Cloudflare or databases, were simulated to avoid skewing results.
Generating Server Load
The primary tool for generating load was K6, an open-source load testing tool from Grafana Labs. It allowed precise simulation of concurrent users and server load conditions by specifying:
- Virtual Users (VUs): Representing concurrent client sessions.
- HTTP Endpoints: Defining the application layers (client or API) to be tested.
- Ramp-Up Period: A gradual increase in load before conducting a test at peak load.
- Thresholds: Requirements that must be met for a test to be considered successful.
Thresholds Used
During our testing, the following thresholds were used. They were determined in an interview with our product owner.
- A 0% request failure rate.
- A response time of under 1000 ms for 95% of the HTTP requests.
Ramp-Up Period
After an interview with our product owner, a ramp-up period of five seconds was chosen. Both the thresholds and this ramp-up period are reflected in the K6 sketch below.
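As an illustration, a minimal K6 script using these settings could look like the following sketch. The thresholds and five-second ramp-up match the values described above; the endpoint URL, peak VU count, and steady-state duration are placeholders and not taken from the research.

```typescript
import http from 'k6/http';
import { sleep } from 'k6';

export const options = {
  stages: [
    { duration: '5s', target: 100 },  // five-second ramp-up period
    { duration: '1m', target: 100 },  // hold the peak load (placeholder duration)
  ],
  thresholds: {
    http_req_failed: ['rate==0'],       // 0% request failure rate
    http_req_duration: ['p(95)<1000'],  // 95% of requests under 1000 ms
  },
};

export default function () {
  http.get('https://cms.example.com/api/articles'); // placeholder endpoint
  sleep(1);
}
```

When either threshold is violated, K6 reports the run as failed, which matches the definition of an unsuccessful test given above.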
Testing Workflow
Load Test Execution
For each node configuration, the number of Virtual Users (VUs) was incrementally increased by 100 until performance thresholds were exceeded. In cases where a test failed, it was repeated to rule out anomalies caused by network or system instability. If failures persisted after retesting, the VU count was reduced by 50, and testing resumed. If the reduced count also failed, it was lowered by another 50. Once a test passed, four consecutive validation runs were required to confirm stability. If any of these validation runs failed, the VU count was reduced further until all four validation runs were successful.
The test that was used is called MaxRPS, a custom-built K6 script that dynamically increases the load on the server. A simplified sketch of the stepping logic described above follows.
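The sketch below is our own simplified illustration of that stepping procedure, not the actual MaxRPS implementation; `runTest` is a hypothetical helper that would start a K6 run with the given VU count and report whether all thresholds passed.

```typescript
// Simplified sketch of the VU stepping procedure (not the actual MaxRPS script).
// runTest(vus) is a hypothetical helper: it runs one load test at the given VU
// count and resolves to true when all thresholds were met.
async function findMaxVUs(runTest: (vus: number) => Promise<boolean>): Promise<number> {
  let vus = 100;

  // Step up in increments of 100 until a run exceeds the thresholds.
  while (await runTest(vus)) vus += 100;

  // Repeat the failed run to rule out anomalies; if it keeps failing,
  // back off in steps of 50 until a run passes.
  while (!(await runTest(vus))) vus -= 50;

  // Require four consecutive successful validation runs; any failure
  // lowers the VU count and restarts the validation.
  for (let passes = 0; passes < 4; ) {
    if (await runTest(vus)) passes++;
    else { vus -= 50; passes = 0; }
  }
  return vus; // stable maximum VU count for this node configuration
}
```

In practice, a large part of this workflow was automated by the K7 tool described later.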
Key Metrics Collected
After each test run, the following data points were recorded (a sketch of the combined record follows this list):
Test Information
· TestID: Unique identifier for the test.
· Test Type: Type of test script executed (e.g., MaxRPS).
Node and Application Details
· Node Name: Formatted as [node type]-test[number]-[application]
· Node-Spec: Specifications of the node tested.
· Instances: Number of pods used during the test.
Load Configuration
· Set Virtual Users: Number of VUs configured.
· Ramp-Up Time: Duration for gradually increasing VUs.
· Max Test Time: Maximum allowed test duration.
Performance Metrics
· Median: Median response time for requests.
· p(95): 95th percentile response time (95% of requests are faster than this).
· Total Requests Sent: Number of requests sent.
· Total Requests Received: Successfully processed requests.
· Lost Requests: Failed requests.
· Died: Whether the test exceeded the thresholds (i.e., failed).
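Taken together, a single test record can be thought of as a structure like the following sketch. The fields mirror the list above; the property names and types are our own assumption and do not come from the research.

```typescript
// Hypothetical shape of one recorded test result (field names follow the list above).
interface TestResult {
  // Test information
  testId: string;
  testType: string;              // e.g. "MaxRPS"
  // Node and application details
  nodeName: string;              // [node type]-test[number]-[application]
  nodeSpec: string;              // specifications of the node tested
  instances: number;             // number of pods used during the test
  // Load configuration
  setVirtualUsers: number;
  rampUpTime: string;            // e.g. "5s"
  maxTestTime: string;
  // Performance metrics
  medianResponseMs: number;
  p95ResponseMs: number;
  totalRequestsSent: number;
  totalRequestsReceived: number;
  lostRequests: number;
  died: boolean;                 // true when the thresholds were exceeded
}
```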
Cluster Configuration
Nodes were grouped into pools based on their specifications. Three primary node types were assessed: Regular Performance, High Performance (AMD EPYC), and High Performance (Intel Xeon).
Node pools were provisioned in a controlled Vultr environment, with external dependencies such as databases and Cloudflare simulated to avoid bottlenecks unrelated to node performance.
The following node configurations were assessed during our research:
- Regular Performance
- High Performance (Intel)
- High Performance (AMD)
Before conducting in-depth testing of the Intel and AMD high-performance node types, we initially tested only the first Node-Spec of each. This preliminary round allowed us to compare the two node types and determine which would better suit our use case: if one significantly outperformed the other, it would not be practical to choose the lower-performing option, especially since both are similarly priced.
Test Setup: A Blueprint for Replicability
Replicability was a key goal of our research. We ensured the test setup was well-documented and systematic:
CMS Deployment
The CMS was decoupled into frontend (Next.js) and API (Nest.js) layers, allowing for independent analysis.
Simulated delays (e.g., 30 ms for API calls) replicated real-world latency, as determined from database query analyses.
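As an illustration, a fixed delay of this kind can be added to a Nest.js API with an interceptor. The sketch below is one possible way to do it, not necessarily how the CMS implements it; the class name is ours.

```typescript
import { CallHandler, ExecutionContext, Injectable, NestInterceptor } from '@nestjs/common';
import { Observable } from 'rxjs';
import { delay } from 'rxjs/operators';

// Delays every response by 30 ms to mimic the measured database query latency.
@Injectable()
export class SimulatedDbDelayInterceptor implements NestInterceptor {
  intercept(context: ExecutionContext, next: CallHandler): Observable<unknown> {
    return next.handle().pipe(delay(30));
  }
}
```

Registering it globally (for example with `app.useGlobalInterceptors(new SimulatedDbDelayInterceptor())`) applies the delay to every endpoint.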
Load Testing Tools
K6: Used to simulate virtual users (VUs), configure HTTP endpoints, and generate ramp-up load scenarios.
K7: Used to automate a large part of the testing.
K10: A complementary tool developed for monitoring server and cluster metrics in real time during tests.
An additional tool, K17, which combines K7 and K10, is currently in development.
Results
Node Type Performance
Regular Performance Nodes
- Cost-effective but limited in scalability, with lower breakpoints for VUs.
High Performance Intel Nodes
- Achieved the highest VU count across both client and API layers.
- Stable under high VU counts with minimal failed requests.
*For the last tests, two devices had to be used, since a single device could not generate the required amount of network traffic. These results may therefore differ slightly from real-world results.
High Performance AMD EPYC Nodes
- Performed well, but lagged slightly behind Intel Xeon nodes in maximum RPS.
When comparing the AMD and Intel High Performance node types, we found a difference of 100 VUs for the frontend and 75 VUs for the API. Based on these findings, we decided not to include the High Performance (AMD) nodes in further testing. It is, however, possible that other workloads perform better on AMD than on Intel; that was simply not the case in our research.
Impact of CPU Spikes
As detailed in Appendix 6: Instance count vs. vCPU & memory usage of the research paper, CPU spikes were a significant factor in performance degradation. Nodes with multiple instances per vCPU struggled to manage these spikes efficiently, leading to higher response times and a reduced VU count. Figure 10 illustrates these effects, highlighting the inefficiencies in resource utilization.
Figure 11 shows a screenshot illustrating how the load was distributed across the different vCPUs during testing. For this test, a single instance was deployed on a High Performance (Intel) 4 vCPU node. From the figure, it is evident that not all vCPU cores were utilized simultaneously. For a significant portion of the test, three out of the four vCPU cores remained idle, which had a noticeable impact on overall performance. Additionally, it is interesting to observe frequent shifts in which vCPU core was actively used.
Impact on the Climate
Beyond performance and cost considerations, node configuration choices significantly influence energy consumption and carbon emissions. For instance, the CPU used in the High Performance (Intel) nodes, an Intel® Xeon® Platinum 8268, demonstrates varying levels of energy usage depending on the number of vCPU cores:
- 2-core Node: ~79 kWh annually *
- 4-core Node: ~158 kWh annually *
- 8-core Node: ~316 kWh annually *
At first glance, these values may seem small. However, when scaled to hundreds or thousands of nodes operating 24/7, the cumulative energy consumption becomes substantial.
By optimizing configurations and scaling only as necessary, organizations can reduce energy waste, minimize operating costs, and align with global sustainability goals. For example, maintaining a 1:1 ratio of pods to vCPUs not only ensures peak performance but also avoids inefficiencies that lead to unnecessary power consumption.
Key Takeaway: By carefully selecting node configurations that match workload demands, organizations can significantly reduce their carbon footprint while optimizing costs. This highlights the dual benefit of environmentally conscious infrastructure planning: savings for the business and a step toward sustainability.
*It is important to note that these energy calculations are theoretical and based on the CPU’s thermal design power (TDP). Real-world scenarios — where CPUs rarely operate at maximum load continuously — will likely yield lower energy usage. Additionally, other system components like memory, storage, and cooling contribute to the overall power footprint.
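As a rough illustration of how figures in this range can be derived from a TDP, the back-of-the-envelope sketch below spreads the CPU's TDP evenly across its hardware threads and scales to a full year of continuous operation. This is our own simplification, not the exact calculation used in the research, which is why it lands close to, but not exactly on, the values quoted above.

```typescript
// Back-of-the-envelope energy estimate (our assumption, not the research's exact method).
const TDP_WATTS = 205;            // Intel Xeon Platinum 8268 thermal design power
const THREADS = 48;               // 24 cores / 48 hardware threads
const HOURS_PER_YEAR = 24 * 365;  // 8760 hours of continuous operation

const annualKWh = (vCpus: number): number =>
  (TDP_WATTS / THREADS) * vCpus * HOURS_PER_YEAR / 1000;

console.log(annualKWh(2).toFixed(0)); // ≈ 75 kWh, same order as the ~79 kWh above
console.log(annualKWh(4).toFixed(0)); // ≈ 150 kWh
console.log(annualKWh(8).toFixed(0)); // ≈ 299 kWh
```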
Conclusion
Based on the tests conducted, it can be concluded that the selection of the most suitable nodepool type (and node plan) at Vultr for scaling depends on the specific performance requirements and load demands of the application.
The tests performed for the developed Content Management System demonstrate that the High Performance (Intel) node type delivers the best performance for both the frontend and API. When faced with high numbers of VUs and a high number of requests, this node type consistently outperforms others in the tests.
The third node plan of this type, which delivered the best results, is most suitable for scenarios where the application is under heavy load. However, based on an interview (see Appendix 3: Expected Number of Users), it is not expected that the application will experience such an extreme load.
To remain cost-efficient, it is important to account for the expected load and then determine which node plan is best suited for scaling under typical usage scenarios. This approach ensures resources are used effectively without incurring unnecessary costs.
During scaling, it is also important to take into account how many vCPU cores the specific node has and how many pods are active. As shown, a node performs best when there is a 1:1 ratio of instances to vCPU cores. Running more pods than vCPU cores has an impact on performance, but the effect is less pronounced than running fewer pods than vCPU cores.
This research has also made it easier to determine which node type is best suited for different workloads. Additionally, we provided insights into the load-handling capacity and environmental impact of each node type, offering valuable guidance for making informed decisions when scaling.
Reflection
While the chosen testing methodology can be considered robust, there are several limitations and areas for improvement that should be addressed in future tests.
Test Environment Limitations
A key challenge was the lack of a fully controlled environment, which introduced external factors that may have influenced the results:
- Network Limits: The shared internet connection at the test location caused bandwidth fluctuations, which likely impacted intensive load testing of the client application.
- Download Speed: The Eduroam network’s limited download speed of 22 MB/s posed challenges for testing at higher loads. For instance, when testing the High Performance (Intel) 8 vCPU node, multiple devices had to be used to handle the required network traffic. This setup likely resulted in an underestimation of the performance potential for high-capacity nodes.
- Wi-Fi Connection: Reliance on Wi-Fi instead of a wired Ethernet connection likely impacted both speed and stability. Additionally, the use of Wi-Fi may have influenced response times, which were a critical factor in meeting performance thresholds.
- Hardware Constraints: Distributing tests across multiple devices introduced infrastructure limitations, complicating the test setup and reducing reliability.
- Physical Environment: Shared networks and active devices in the test location contributed to variability, with environmental distractions further skewing results.
Methodological Considerations
While the methodology ensured a systematic approach, there is room for refinement:
- Ramp-Up Time: During the first round of testing, we observed that the CPU needed a short period to stabilize before it could handle the full load; during this time, the initial requests frequently failed or experienced significant delays. A ramp-up time of five seconds was therefore chosen. However, this decision was made without thorough supporting analysis: we assumed five seconds would be sufficient for the test to succeed, but did not investigate whether this duration was realistic, too long, or too short.
- Consecutive Testing: Tests were run consecutively without repeating them at different times of the day, potentially overlooking variations caused by system load, network conditions, or other temporal factors.
Recommendations for Future Tests
To overcome these limitations and improve the reliability of future studies, the following steps are recommended:
- Isolated Test Environment: Establishing a dedicated local or cloud-based test environment would minimize external interference and improve data accuracy. At the very least, it is recommended to monitor and control the number of devices and applications that are open during testing.
- Wired Network: Using Ethernet connections will ensure stable and consistent bandwidth, bypassing fluctuations common with Wi-Fi.
- Higher Bandwidth: Testing with higher bandwidth connections will better support large data transfers and high traffic scenarios, particularly for frontend testing.
- Multiple Testing Time Points: Repeating tests at different times and days will account for temporal variations, ensuring more consistent results.
- Advanced Monitoring Tools: Tools like Wireshark can provide deeper insights into network fluctuations or anomalies, enhancing result analysis.
Additional Recommendations
To further improve the reliability of the analysis, all node types and plans could be tested in future research. This would provide a more comprehensive view of alternative options not covered in this study.
Additionally, as other node types were not tested in this research, it may be worth considering them in the future, especially if visitor numbers grow significantly, as they may offer viable alternatives.
Confidence in the Results
Despite the limitations, we stand by the findings of this research. The chosen test methodology, which included thresholds and a step-by-step approach, was carefully designed to produce reproducible and reliable results. Repeating tests to identify edge cases and organizing result documentation in an Excel file further contributed to the reliability of the test method.
We are confident that the results provide an accurate representation of the performance of the tested node types and configurations. For future work, it would be valuable to create a fully isolated test environment and use more advanced monitoring tools to further minimize external influences.
Finishing Up
This post is the first in a three-part blog series. The second blog delves into how node and nodepool scaling can be automated proactively using a prediction model, including the critical timings that need to be considered during scaling. The third blog explores how the findings from each research paper are combined in a single system which can be deployed on a Kubernetes Cluster.
More details about our research can be found in our GitHub organization.
This blog was written by Martijn Schuman. The research was conducted by Jasper van Willigen, Jorrick Stempher, Stefan Lont and Martijn Schuman. Additionally, the research team was supported by Teun van der Kleij, Ties Greve, Jeroen Terpstra and Bas Plat. The project was carried out during the Quality in Software Development semester at Hogeschool Windesheim in Zwolle, The Netherlands, under the guidance of Ernst Bolt.
