Introduction
• Cloud computing is a technology that enables us to create, configure, and customize applications through an Internet connection.
• A software environment is a collection of programs, libraries, and utilities that allows users to perform specific tasks.
• Software environments are often used by programmers to develop applications or run existing ones.
• A software environment for a particular application could include the operating system, the database system, specific development tools, or compilers.

Features of Cloud and Grid Platforms
• Important features in real cloud and grid platforms.
• In four tables, we cover the capabilities, traditional features, data features, and features for programmers and runtime systems to use.
• The entries in these tables are source references for anyone who wants to program the cloud efficiently.

Grid Computing
• Grid computing (sometimes referred to as virtual supercomputing) is a group of networked computers that work together as a virtual supercomputer to perform large tasks, such as analyzing huge data sets or weather modeling.
• Grid computing is extensively used in scientific research and high-performance computing to solve complex scientific problems.
• For example, grid computing can be used to simulate the behavior of a nuclear explosion, model the human genome, or analyze massive amounts of data generated by particle accelerators.
• Advantages:
  • Can solve larger, more complex problems in a shorter time
  • Easier to collaborate with other organizations
  • Makes better use of existing hardware

Cloud vs. Grid Computing
• https://www.skysilk.com/blog/2017/cloud-vs-grid-computing/

Cloud Capabilities and Platform Features
• Commercial clouds need broad capabilities, as summarized in Table 6.1.
• Table 6.2 lists some low-level infrastructure features.
• Table 6.3 lists traditional programming environments for parallel and distributed systems that need to be supported in cloud environments; they can be supplied as part of the system (cloud platform) or the user environment.
• Table 6.4 presents features emphasized by clouds and by some grids.

Traditional Features Common to Grids and Clouds
• Features related to workflow, data transport, security, and availability concerns that are common to today's computing grids and clouds.
• Workflow
  • Workflow links multiple cloud and non-cloud services in real applications on demand.
• Data Transport
  • A difficulty in using clouds is the cost (in time and money) of data transport.
  • The special structure of cloud data, with blocks (in Azure blobs) and tables, could allow high-performance parallel algorithms, but initially simple HTTP mechanisms are used to transport data (a minimal transfer sketch appears at the end of this section).
• Security, Privacy, and Availability
  • The following techniques relate to the security, privacy, and availability requirements for developing a healthy and dependable cloud programming environment:
  • Use virtual clustering to achieve dynamic resource provisioning with minimum overhead cost.
  • Use stable and persistent data storage with fast queries for information retrieval.
  • Use special APIs for authenticating users and sending e-mail using commercial accounts.
  • Cloud resources are accessed with security protocols such as HTTPS and SSL.
  • Fine-grained access control is desired to protect data integrity and deter intruders or hackers.
  • Shared data sets are protected from malicious alteration, deletion, or copyright violations.
  • Features are included for availability enhancement and disaster recovery with live migration of VMs.
  • Use a reputation system to protect data centers; such a system only authorizes trusted clients and stops pirates.
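The "simple HTTP mechanisms" mentioned under Data Transport can be made concrete with a minimal sketch, assuming a hypothetical blob endpoint and token rather than a real Azure or S3 API call: a data block is pushed to cloud storage with a plain HTTPS PUT, so TLS provides the transport security noted above.

```python
# Minimal sketch: moving one data block to cloud storage with a plain HTTPS PUT.
# BLOB_URL and AUTH_TOKEN are placeholders, not a real provider API.
import requests

BLOB_URL = "https://storage.example.com/container/block-0001"  # hypothetical endpoint
AUTH_TOKEN = "..."  # credential obtained out of band

def upload_block(path: str) -> None:
    """Stream one local data block over HTTPS to the blob endpoint."""
    with open(path, "rb") as f:
        resp = requests.put(
            BLOB_URL,
            data=f,
            headers={
                "Authorization": f"Bearer {AUTH_TOKEN}",
                "Content-Type": "application/octet-stream",
            },
            timeout=60,
        )
    resp.raise_for_status()  # surface transport or authorization failures

if __name__ == "__main__":
    upload_block("block-0001.bin")
```

Real deployments would use the provider's SDK and block/part sizes tuned to the network, which is exactly the transport cost the bullet above warns about.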
Data Features and Databases
• Program Library
  • Many efforts have been made to design a VM image library to manage the images used in academic and commercial clouds.
• Blobs and Drives
  • The basic storage concept in clouds is the blob: Azure blobs for Azure and S3 for Amazon.
  • In addition to a service interface for blobs and S3, storage can be attached "directly" to compute instances, as Azure drives and the Elastic Block Store for Amazon.
• DPFS
  • Covers support of file systems such as the Google File System (MapReduce), HDFS (Hadoop), and Cosmos (Dryad), with compute-data affinity optimized for data processing.
  • It could be possible to link DPFS to the basic blob- and drive-based architecture, but it is simpler to use DPFS as the application-centric storage model with compute-data affinity, and blobs and drives as the repository-centric view.
• SQL and Relational Databases
  • Both the Amazon and Azure clouds offer relational databases.
• Table and NoSQL Nonrelational Databases
  • Present in the three major clouds: BigTable in Google, SimpleDB in Amazon, and Azure Table [13] in Azure.
• Queuing Services
  • Both Amazon and Azure offer similar scalable, robust queuing services that are used to communicate between the components of an application.

Programming and Runtime Support
• Support is desired to facilitate parallel programming and to provide runtime support of important functions in today's grids and clouds.
• Worker and Web Roles
  • The roles introduced by Azure provide nontrivial functionality, while preserving the better affinity support that is possible in a nonvirtualized environment.
  • Worker roles are basic schedulable processes and are launched automatically.
  • Note that explicit scheduling is unnecessary in clouds, both for individual worker roles and for the "gang scheduling" supported transparently in MapReduce.
  • Queues are a critical concept here, as they provide a natural way to manage task assignment in a fault-tolerant, distributed fashion (a local sketch of this producer/worker queue pattern follows this section).
  • Web roles provide an interesting approach to portals. GAE is largely aimed at web applications, whereas science gateways are successful in TeraGrid.
• MapReduce
  • There has been substantial interest in "data parallel" languages largely aimed at loosely coupled computations which execute over different data samples.
  • The language and runtime generate and provide efficient execution of "many task" problems that are well known as successful grid applications.
  • However, MapReduce, summarized in Table 6.5, has several advantages over traditional implementations for many-task problems, as it supports dynamic execution, strong fault tolerance, and an easy-to-use high-level interface.
  • The major open source/commercial MapReduce implementations are Hadoop [23] and Dryad [24–27], with execution possible with or without VMs.
• Cloud Programming Models
  • Both the GAE and Manjrasoft Aneka environments represent programming models; both are applied to clouds, but are really not specific to this architecture.
  • Iterative MapReduce is an interesting programming model that offers portability between cloud, HPC, and cluster environments.
• SaaS
  • Services are used in a similar fashion in commercial clouds and most modern distributed systems.
  • We expect users to package their programs as services wherever possible, so no special support is needed to enable SaaS.
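The queuing-service and worker-role ideas above can be illustrated locally. The sketch below is a toy, single-machine stand-in: a front end (playing the web role) enqueues tasks and several workers (playing worker roles) pull and process them. It uses only Python's standard `queue.Queue` and threads; a real cloud application would replace the queue with a managed service such as Amazon SQS or an Azure queue, which this code does not call.

```python
# Toy illustration of the worker-role / queue pattern, standard library only.
import queue
import threading

task_queue = queue.Queue()

def worker(worker_id: int) -> None:
    """Worker role: repeatedly pull a task from the queue and process it."""
    while True:
        task = task_queue.get()
        if task is None:           # sentinel: no more work for this worker
            task_queue.task_done()
            break
        print(f"worker {worker_id} processed {task}")
        task_queue.task_done()     # acknowledge completion of the task

def main() -> None:
    workers = [threading.Thread(target=worker, args=(i,)) for i in range(3)]
    for w in workers:
        w.start()
    for n in range(10):            # front end enqueues tasks on demand
        task_queue.put(f"task-{n}")
    for _ in workers:              # one sentinel per worker for a clean shutdown
        task_queue.put(None)
    for w in workers:
        w.join()

if __name__ == "__main__":
    main()
```

The fault-tolerance claim in the notes comes from the queue itself: if a worker dies, unacknowledged tasks remain available for other workers, which a managed cloud queue enforces with visibility timeouts.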
Parallel and Distributed Programming Paradigms
• We define a parallel and distributed program as a parallel program running on a set of computing engines or a distributed computing system.
• The term carries the notion of two fundamental terms in computer science: distributed computing system and parallel computing.
• A distributed computing system is a set of computational engines connected by a network to achieve a common goal of running a job or an application.
• A computer cluster or network of workstations is an example of a distributed computing system.
• Parallel computing is the simultaneous use of more than one computational engine (not necessarily connected via a network) to run a job or an application.
• For instance, parallel computing may use either a distributed or a nondistributed computing system, such as a multiprocessor platform.
• Running a parallel program on a distributed computing system (parallel and distributed programming) has several advantages for both users and distributed computing systems.
• From the users' perspective, it decreases application response time; from the distributed computing systems' standpoint, it increases throughput and resource utilization.
• Running a parallel program on a distributed computing system, however, can be a very complicated process.
• Therefore, to place the complexity in perspective, the data flow of running a typical parallel program on a distributed system is explained further in this chapter.

Parallel Computing and Programming Paradigms
• Consider a distributed computing system consisting of a set of networked nodes or workers.
• The system issues for running a typical parallel program in either a parallel or a distributed manner include the following (a minimal partitioning-and-mapping sketch follows this list):
  • Partitioning: This is applicable to both computation and data, as follows:
    • Computation partitioning: This splits a given job or program into smaller tasks. Partitioning greatly depends on correctly identifying portions of the job or program that can be performed concurrently. In other words, upon identifying parallelism in the structure of the program, it can be divided into parts to be run on different workers. Different parts may process different data or a copy of the same data.
    • Data partitioning: This splits the input or intermediate data into smaller pieces. Similarly, upon identification of parallelism in the input data, it can also be divided into pieces to be processed on different workers. Data pieces may be processed by different parts of a program or a copy of the same program.
  • Mapping: This assigns either the smaller parts of a program or the smaller pieces of data to underlying resources. This process aims to appropriately assign such parts or pieces to be run simultaneously on different workers and is usually handled by resource allocators in the system.
  • Synchronization: Because different workers may perform different tasks, synchronization and coordination among workers is necessary so that race conditions are prevented and data dependency among different workers is properly managed. Multiple accesses to a shared resource by different workers may raise race conditions, whereas data dependency arises when a worker needs the processed data of other workers.
  • Communication: Because data dependency is one of the main reasons for communication among workers, communication is always triggered when the intermediate data is ready to be sent among workers.
  • Scheduling: For a single job or program, when the number of computation parts (tasks) or data pieces exceeds the number of available workers, a scheduler selects a sequence of tasks or data pieces to be assigned to the workers. It is worth noting that the resource allocator performs the actual mapping of the computation or data pieces to workers, while the scheduler only picks the next part from the queue of unassigned tasks based on a set of rules called the scheduling policy. For multiple jobs or programs, a scheduler selects a sequence of jobs or programs to be run on the distributed computing system. In this case, scheduling is also necessary when system resources are not sufficient to simultaneously run multiple jobs or programs.
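As a minimal sketch of the partitioning, mapping, and scheduling steps above, assume a single multiprocessor machine stands in for the distributed workers: the input data is partitioned into chunks, each chunk is mapped to a worker process, and the pool's internal queue schedules chunks when there are more chunks than workers.

```python
# Sketch of data partitioning + mapping on a local process pool.
# The process pool stands in for the distributed workers described above.
from concurrent.futures import ProcessPoolExecutor

def partition(data, num_chunks):
    """Data partitioning: split the input into roughly equal pieces."""
    size = max(1, len(data) // num_chunks)
    return [data[i:i + size] for i in range(0, len(data), size)]

def process_chunk(chunk):
    """The task each worker runs on its piece of the data (here: a partial sum)."""
    return sum(chunk)

def main():
    data = list(range(1_000))
    chunks = partition(data, num_chunks=8)
    # Mapping/scheduling: the executor assigns chunks to 4 worker processes;
    # with 8 chunks and 4 workers, the remaining chunks wait in its queue.
    with ProcessPoolExecutor(max_workers=4) as pool:
        partial_sums = list(pool.map(process_chunk, chunks))
    print(sum(partial_sums))  # combine the partial results

if __name__ == "__main__":
    main()
```

Synchronization and communication are hidden here because the executor collects results for us; on a real distributed system those steps become explicit message exchanges between workers.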
Motivation for Programming Paradigms
• Because handling the whole data flow of parallel and distributed programming is very time-consuming and requires specialized knowledge of programming, dealing with these issues may reduce the productivity of the programmer and may even affect the program's time to market. Furthermore, it may distract the programmer from concentrating on the logic of the program itself.
• Therefore, parallel and distributed programming paradigms or models are offered to abstract many parts of the data flow from users. In other words, these models aim to provide users with an abstraction layer that hides the implementation details of the data flow which users formerly had to write code for. The simplicity of writing parallel programs is therefore an important metric for parallel and distributed programming paradigms. Other motivations behind parallel and distributed programming models are (1) to improve the productivity of programmers, (2) to decrease programs' time to market, (3) to leverage underlying resources more efficiently, (4) to increase system throughput, and (5) to support higher levels of abstraction.
• MapReduce, Hadoop, and Dryad are three of the most recently proposed parallel and distributed programming models. They were developed for information retrieval applications but have been shown to be applicable to a variety of important applications [41]. Further, the loose coupling of components in these paradigms makes them suitable for VM implementation and leads to much better fault tolerance and scalability for some applications than traditional parallel computing models such as MPI.

MapReduce, Twister, and Iterative MapReduce
• MapReduce, as introduced in Section 6.1.4, is a software framework which supports parallel and distributed computing on large data sets [27,37,45,46].
• This software framework abstracts the data flow of running a parallel program on a distributed computing system by providing users with two interfaces in the form of two functions: Map and Reduce.
• Users can override these two functions to interact with and manipulate the data flow of running their programs.
• Figure 6.1 illustrates the logical data flow from the Map to the Reduce function in MapReduce frameworks. In a (key, value) pair, the "value" part is the actual data, and the "key" part is used only by the MapReduce controller to control the data flow (a toy word-count sketch of this flow follows).
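To make the Map-to-Reduce data flow of Figure 6.1 concrete, the toy sketch below runs both phases sequentially in one process: a user-supplied Map function emits (key, value) pairs, the pairs are grouped by key (the shuffle), and a user-supplied Reduce function aggregates each group. This models only the logical data flow; it is not the Hadoop or Dryad API, where Map and Reduce tasks run in parallel on many workers with fault tolerance.

```python
# Toy, single-process illustration of the MapReduce logical data flow (word count).
from collections import defaultdict

def map_fn(key, value):
    """User-defined Map: the value is a line of text; emit (word, 1) pairs."""
    for word in value.split():
        yield word.lower(), 1

def reduce_fn(key, values):
    """User-defined Reduce: aggregate all values that share a key."""
    return key, sum(values)

def map_reduce(records):
    groups = defaultdict(list)
    for key, value in records:                  # Map phase over input (key, value) pairs
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)   # shuffle: group intermediate values by key
    return [reduce_fn(k, vs) for k, vs in groups.items()]   # Reduce phase

if __name__ == "__main__":
    lines = [(0, "the quick brown fox"), (1, "the lazy dog")]
    print(map_reduce(lines))   # e.g. [('the', 2), ('quick', 1), ('brown', 1), ...]
```

Overriding `map_fn` and `reduce_fn` is the whole user-facing contract described above; everything else (grouping, scheduling, fault tolerance) is the framework's job.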