Google Cloud’s Big Promise: Performance Stability

Cloud services changed the IT operations game, but performance has been less consistent than many people realized. Google's Cloud Platform has been able to learn from its predecessors.

In December, Google announced general availability of its infrastructure-as-a-service (IaaS) offering, Google Compute Engine. Compute Engine is one of the three pillars of the Google Cloud Platform. The other two are App Engine (Google's platform-as-a-service) and cloud storage (Google's SQL, NoSQL, and object storage).

Together, these pillars form a very large-scale compute and storage infrastructure supporting a variety of scalable services that Google developers use heavily internally and that Google makes available for public consumption via software-as-a-service (SaaS) offerings and well-defined application programming interfaces (APIs).

Many in the blogosphere have discussed the price-performance and particular features of Compute Engine that set it apart from its competitors. Last March, GigaOm published a test drive of Compute Engine by Sebastian Stadil and his team at Scalr, a cloud management firm, comparing Google Compute Engine to Amazon Web Services. Google's writes to disk were almost twice as fast as AWS's, the report said. Janakiram MSV, head of cloud infrastructure services at Aditi Technologies, published his take, "Ten Features That Make Google Compute Engine Better Than AWS," including 5x faster virtual machine boot times.

[Want to learn more about how Google positions Compute Engine? See Google Compute Cloud Challenges Amazon.]

However, I find the promise of "performance stability" most attractive. When public cloud computing first emerged, we were simply happy to get our hands on compute and storage resources on demand, at reasonable cost. Many of us also delighted in our ability to circumvent the burdens of traditional system procurement and IT processes. Over the years, however, as cloud use has matured, the consistency, or stability, of resource performance has become increasingly important.

Cost of performance instability
The performance we experience from a given cloud resource (CPU, I/O, latency, bandwidth) can vary over time for many reasons, including:

  • Resource sharing with best-effort isolation,
  • Performance of one resource (e.g., remote storage) being affected by another (e.g., networking),
  • Hardware heterogeneity (similarly provisioned devices can behave differently) and failures,
  • Bugs and inefficiencies in complex cloud software stacks and virtualization technologies, and
  • Placement, migration, fault tolerance, and adding and removing resources and virtual instances.
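One way to see this variability firsthand is to time the same fixed workload repeatedly on an instance and look at the spread. Below is a minimal, self-contained sketch (not from the article; the task and sample count are arbitrary choices) that reports the coefficient of variation, a simple measure of how stable observed performance is:

```python
import statistics
import time

def time_cpu_task(n=200_000):
    """Time one run of a fixed CPU-bound task (summing squares), in seconds."""
    start = time.perf_counter()
    sum(i * i for i in range(n))
    return time.perf_counter() - start

def coefficient_of_variation(samples):
    """Relative dispersion: stdev / mean. Higher means less stable."""
    return statistics.stdev(samples) / statistics.mean(samples)

# Run the same task repeatedly; on a stable instance the spread stays small,
# while noisy neighbors, throttling, or heterogeneous hardware show up as
# a larger coefficient of variation.
samples = [time_cpu_task() for _ in range(20)]
print(f"mean={statistics.mean(samples):.4f}s  CV={coefficient_of_variation(samples):.2%}")
```

The same idea extends to disk and network: substitute a fixed read/write or round-trip for the CPU loop and compare the spread across instances and over time.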

Inconsistent resource performance impacts cloud users negatively in multiple ways. First, fluctuations in performance can be significant (as much as 5x) and limit users' ability to accurately measure, reproduce, and predict execution time and cost. This inability to predict cost, performance, and load can result in ineffective scaling decisions, manual or automated, and can preclude public cloud use for certain application domains.
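To see how a 5x fluctuation corrupts a scaling decision, consider a naive capacity estimate. The numbers below are illustrative assumptions, not figures from the article:

```python
import math

def instances_needed(request_rate, per_instance_throughput):
    """Naive autoscaling estimate: VMs needed to serve request_rate (req/s)."""
    return math.ceil(request_rate / per_instance_throughput)

# With a stable 100 req/s per VM, serving 1,000 req/s needs 10 VMs.
print(instances_needed(1000, 100))  # 10
# If measured per-VM throughput swings 5x (100 down to 20 req/s), the very
# same load appears to need 50 VMs: the scaling decision is only as good
# as the stability of the measurement behind it.
print(instances_needed(1000, 20))   # 50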

Given that public clouds are opaque, it is infeasible for ordinary users to measure this variance in order to account for it. Moreover, instability experienced by parallel workloads, such as MapReduce jobs, can have a compounding effect on performance. One example of this is the "straggler problem," in which some tasks take much longer than their identical counterparts, for no apparent reason.

The cost of instability that strikes me most, however, is the human effort expended by developers, dev-ops engineers, and system administrators. The impact of performance fluctuation in IaaS has given users an incentive to modify their workloads to compensate for, mask, or otherwise avoid instability. Examples of such modifications include striping data across multiple network-attached block storage devices; introducing complexity into distributed programming systems for straggler avoidance, such as killing off long-running jobs or executing multiple instances of the same job to see which one returns first; and gaming a cloud service's virtual instance placement.
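The "run multiple copies and take the first to return" workaround can be sketched in a few lines. This is a hypothetical illustration (the task, copy count, and names are mine, not from any particular MapReduce system) of the speculative-execution pattern described above:

```python
import concurrent.futures
import random
import time

def flaky_task(task_id):
    """Simulate a job that occasionally straggles for no apparent reason."""
    delay = random.choice([0.01, 0.01, 0.01, 0.5])  # occasional 50x straggler
    time.sleep(delay)
    return task_id

def run_speculatively(task, n_copies=3):
    """Launch duplicate copies of the same task and return whichever
    finishes first, abandoning the stragglers."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=n_copies) as pool:
        futures = [pool.submit(task, i) for i in range(n_copies)]
        done, pending = concurrent.futures.wait(
            futures, return_when=concurrent.futures.FIRST_COMPLETED)
        for f in pending:
            f.cancel()  # best effort; already-running copies still burn cycles
        return next(iter(done)).result()

result = run_speculatively(flaky_task)
print(f"first copy to finish: {result}")
```

Note the cost buried in the comment: the abandoned copies still consume (and bill for) resources, which is exactly the kind of waste a performance-stable IaaS would make unnecessary.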

This additional effort required of cloud users can be extremely costly, both in terms of cloud resource use and human capital. Such work is needless if the public cloud provider addresses even a subset of the causes of performance instability itself. Enter Google Compute Engine.

Compute Engine’s promise of stability
Although there is much debate in the blogosphere about whether Google has come to the IaaS table too late, I believe that by waiting it has gained a distinct advantage over the competition: hindsight. Google was able to observe the challenges and pitfalls of other IaaS offerings (not to mention gain tremendous experience with warehouse-scale service computing) and to identify an increasingly common pain point of IaaS users. Moreover, Google has used this experience to design from scratch an IaaS that other vendors can only retroactively bandage with best-effort improvements.

As a result, Google Compute Engine is a next-generation IaaS system that offers resource performance stability via a suite of novel engineering advances. These include customized virtualization under KVM; advanced resource isolation technologies, such as specialized Linux control groups that shield one process from others; clever data replication and redundancy strategies; novel datacenter design and geographic placement; and dedicated high-speed fiber networks connecting well-designed and proven software services such as App Engine, Cloud Storage, BigQuery, and YouTube.

By focusing on developing a scalable, performance-stable IaaS system, Google has the opportunity to provide virtual machine instances with consistent resource performance at very low cost. By doing so, users will save not only on VM instance use, but also the time and effort they spend today re-architecting their virtual instances and cloud applications to overcome the performance instabilities of other IaaS systems. If this promise of performance stability from Compute Engine comes to fruition, I believe we will see the needle of IaaS market share move quickly in Google's direction.


Chandra Krintz is a professor of computer science at the University of California, Santa Barbara, and chief scientist of AppScale Systems Inc. AppScale is an open source cloud platform that is API-compatible with Google App Engine. She holds M.S. and Ph.D. degrees from …
