Parallel Processing

This document discusses data-intensive computing systems, which process large volumes of data, often terabytes or petabytes in size, known as big data. It proposes a data-intensive computer system consisting of an HPC cluster, a massively parallel database, and an intermediate operating system layer to process petascale datasets. The operating system exploits parallelism in the database and optimizes data flow between the cluster and the database. A data-object-oriented operating system is proposed to support high-level data objects such as multi-dimensional arrays. User applications compile to code executing both on the cluster and inside the database. The system supports collaborative work, where large datasets are created and processed by many users.


Introduction

Data-intensive computing is a class of parallel computing applications which use a data-parallel approach to process large volumes of data, typically terabytes or petabytes in size, and typically referred to as big data. Computing applications which devote most of their execution time to computational requirements are deemed compute-intensive, whereas computing applications which require large volumes of data and devote most of their processing time to I/O and manipulation of data are deemed data-intensive. Scientific instruments, as well as simulations, generate increasingly large datasets, changing the way we do science. We propose that processing petascale datasets will be carried out in a data-intensive computer, a system consisting of an HPC cluster, a massively parallel database and an intermediate operating system layer. The operating system will run on dedicated servers and will exploit massive parallelism in the database, as well as numerous optimization strategies, to deliver high-throughput, balanced and regular data flow for I/O operations between the HPC cluster and the database. The programming model of sequential file storage is not appropriate for data-intensive computations, so we propose a data-object-oriented operating system, in which support for high-level data objects, such as multi-dimensional arrays, is built in. User application programs will be compiled into code that is executed both on the HPC cluster and inside the database. The data-intensive operating system is, however, non-local, so that user applications running on a remote PC will be compiled into code executing both on the PC and inside the database. This model supports a collaborative environment, in which a large data set is typically created and processed by a large group of users. We have implemented a software library, MPI-DB, which is a prototype of the data-intensive operating system. It is currently being used to ingest the output of a simulation of turbulent channel flow into the database.
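
MPI-DB's actual programming interface is not reproduced in this document, so the following is only a minimal sketch of the ingest idea under invented names: a simulation's 3-D array is cut into fixed-size blocks and each block is written as a row of a database table, with an in-memory SQLite database standing in for the massively parallel database.

    # Hedged sketch: ingesting a multi-dimensional simulation array by splitting it
    # into blocks. Function and table names are hypothetical; a real MPI-DB layer
    # would stream the blocks over MPI into a parallel database, not local SQLite.
    import sqlite3
    import numpy as np

    def ingest_array(db, name, data, block=16):
        """Cut a 3-D array into block**3 chunks and store each chunk as one row."""
        db.execute("CREATE TABLE IF NOT EXISTS blocks "
                   "(name TEXT, i INT, j INT, k INT, payload BLOB)")
        nx, ny, nz = data.shape
        for i in range(0, nx, block):
            for j in range(0, ny, block):
                for k in range(0, nz, block):
                    chunk = np.ascontiguousarray(data[i:i+block, j:j+block, k:k+block])
                    db.execute("INSERT INTO blocks VALUES (?, ?, ?, ?, ?)",
                               (name, i, j, k, chunk.tobytes()))
        db.commit()

    if __name__ == "__main__":
        db = sqlite3.connect(":memory:")       # stand-in for the parallel database
        velocity = np.random.rand(64, 64, 64)  # stand-in for one simulation time step
        ingest_array(db, "channel_flow_step_0001", velocity)
        print(db.execute("SELECT COUNT(*) FROM blocks").fetchone()[0], "blocks stored")

Storing fixed-size blocks keyed by their grid offsets is what would later allow the operating system layer to fetch only the sub-volumes that a query actually touches.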

Examples
1. The large-scale structure of the Universe
Contemporary research in astrophysics has deep and important connections to particle physics. Observations of large structures in the universe led physicists to the discovery of dark matter and dark energy, and understanding these new forms of matter will change our view of the universe on all scales, including the particle scale and the human scale. Theoretical developments in astrophysics must be tested against vast amounts of data collected by instruments such as the Hubble Space Telescope, as well as against the results of supercomputer simulation experiments like the Millennium Run [5]. These data sets are available in public databases and are being mined by scientists to gain intuition and to make new discoveries, but the researchers are limited by the technological means available to access the data. In order to analyze astrophysical data, researchers write scripts that perform database queries, transfer the resulting data sets to their local computers and store them as flat files. Such limited access has already produced important discoveries. For example, a new log-power density spectrum was recently discovered by such analysis of the data in the Millennium Run database [6]. This is the most efficient quantitative description of the distribution of matter density in the Universe obtained so far.
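
As a rough illustration of this query-and-flat-file workflow, the sketch below runs an SQL query and writes the result set to a CSV file. The schema (a halos table with mass and position columns) is invented for the example, and a local in-memory SQLite database stands in for the remote Millennium Run database.

    # Hedged sketch of the typical access pattern: query the database, pull the
    # result set to the local machine, and store it as a flat file for later analysis.
    # Schema and data are invented; a real script would connect to the remote
    # Millennium Run database instead of this local stand-in.
    import csv
    import random
    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE halos (id INT, mass REAL, x REAL, y REAL, z REAL)")
    db.executemany("INSERT INTO halos VALUES (?, ?, ?, ?, ?)",
                   [(i, random.lognormvariate(12, 1),
                     random.random(), random.random(), random.random())
                    for i in range(10_000)])

    # Select the most massive halos, as a researcher might for a clustering study.
    rows = db.execute("SELECT id, mass, x, y, z FROM halos "
                      "WHERE mass > ? ORDER BY mass DESC", (2.0e5,)).fetchall()

    with open("massive_halos.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "mass", "x", "y", "z"])
        writer.writerows(rows)

    print(f"wrote {len(rows)} rows to massive_halos.csv")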
2. Computational modeling of the cochlea
The human cochlea is a remarkable, highly nonlinear transducer that extracts vital information from sound pressure and converts it into neuronal impulses that are sent to the auditory cortex. The cochlea's accuracy, amplitude range and frequency range are orders of magnitude better than those of man-made transducers. Understanding its function has tremendous medical and engineering significance. The two most fundamental questions of cochlear research are to provide a mathematical description of the transform computed by the cochlea and to explain the biological mechanisms that compute this transform. Presently there is no adequate answer to either of these two questions. Signal processing in the cochlea is carried out by a collection of coupled biological processes occurring on length scales ranging from one centimeter down to a fraction of a nanometer. A comprehensive model describing the coupling of the dynamics of the biological processes occurring on multiple scales is needed in order to achieve a system-level understanding of cochlear signal processing. A model of cochlear macro-mechanics was constructed in 1999-2002 by Givelberg and Bunn [18], who used supercomputers to generate very large data sets containing the results of simulation experiments. These results were stored as flat files which were subsequently analyzed by the authors on workstations using specially developed software. A set of web pages devoted to this research [19] is widely and frequently accessed; however, the data was never exposed to the wider community for analysis, since no tools to ingest simulation output into a database existed when the cochlea model was developed.

 Characteristics:

Several common characteristics of data-intensive computing systems distinguish them from other forms of computing:

(1) The principle of collocation of the data and the programs or algorithms used to perform the computation. To achieve high performance in data-intensive computing, it is important to minimize the movement of data.[19] This characteristic allows processing algorithms to execute on the nodes where the data resides, reducing system overhead and increasing performance.[20] Newer technologies such as InfiniBand allow data to be stored in a separate repository and provide performance comparable to collocated data.
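
A minimal sketch of this collocation principle, using an invented in-process "cluster": each worker process stands in for a node that already holds one partition of the data, so only partition identifiers are sent out and only small (sum, count) summaries come back, never the raw records.

    # Hedged sketch of "move the program to the data": each worker process stands in
    # for a cluster node that already holds one partition; the raw records never
    # leave the worker, only a tiny (sum, count) summary is returned.
    from concurrent.futures import ProcessPoolExecutor

    def load_local_partition(node_id):
        """Stand-in for reading the partition already stored on this node's disk."""
        return range(node_id, 1_000_000, 4)

    def local_summary(node_id):
        partition = load_local_partition(node_id)   # data stays on the "node"
        return sum(partition), len(partition)

    if __name__ == "__main__":
        with ProcessPoolExecutor(max_workers=4) as pool:
            partials = list(pool.map(local_summary, range(4)))
        total = sum(s for s, _ in partials)
        count = sum(n for _, n in partials)
        print("global mean =", total / count)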

(2) The programming model utilized. Data-intensive computing systems utilize a machine-independent approach in which applications are expressed in terms of high-level operations on data, and the runtime system transparently controls the scheduling, execution, load balancing, communications, and movement of programs and data across the distributed computing cluster.[21] The programming abstraction and language tools allow the processing to be expressed in terms of data flows and transformations, incorporating new dataflow programming languages and shared libraries of common data manipulation algorithms such as sorting.
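
The sketch below illustrates this programming model with a toy, invented runtime: the application is expressed as high-level transformations (parse, filter, sort), and the runtime applies them partition by partition and then merges the sorted partitions, standing in for the scheduling and data movement a real system would handle transparently.

    # Hedged sketch of the dataflow programming model: the user supplies high-level
    # transformations; the toy runtime applies them per partition and merges the
    # sorted partitions, much as a shared sort library would.
    import heapq

    def run_pipeline(partitions, parse, keep, key):
        """Toy runtime: parse and filter each partition, sort locally, merge globally."""
        sorted_parts = []
        for part in partitions:                        # a real runtime would schedule
            records = [parse(line) for line in part]   # these on different nodes
            records = [r for r in records if keep(r)]
            sorted_parts.append(sorted(records, key=key))
        return heapq.merge(*sorted_parts, key=key)     # lazy merge of sorted streams

    if __name__ == "__main__":
        partitions = [["carol 17", "alice 3", "dave 99"],
                      ["bob 42", "erin 8", "frank 1"]]
        result = run_pipeline(
            partitions,
            parse=lambda line: (line.split()[0], int(line.split()[1])),
            keep=lambda rec: rec[1] > 2,               # drop small values
            key=lambda rec: rec[1])                    # global sort by value
        for name, value in result:
            print(name, value)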
(3) A focus on reliability and availability. Large-scale systems with hundreds or
thousands of processing nodes are inherently more susceptible to hardware failures,
communications errors, and software bugs. Data-intensive computing systems are
designed to be fault resilient. This typically includes redundant copies of all data files
on disk, storage of intermediate processing results on disk, automatic detection of
node or processing failures, and selective re-computation of results.
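
A minimal sketch of this style of fault resilience, with an invented checkpoint layout: each partition's intermediate result is written to disk as soon as it is computed, failed tasks are detected, and only the partitions that still lack a checkpoint are recomputed on the next pass.

    # Hedged sketch of fault-resilient processing: intermediate results are
    # checkpointed to disk, failures are detected, and only partitions without a
    # checkpoint are recomputed.
    import json
    import os
    import random

    CHECKPOINT_DIR = "checkpoints"

    def process_partition(pid):
        if random.random() < 0.3:                      # simulated transient node failure
            raise RuntimeError(f"node running partition {pid} failed")
        return {"partition": pid, "row_count": 1000 + pid}

    def run_with_recovery(partition_ids):
        os.makedirs(CHECKPOINT_DIR, exist_ok=True)
        # Partitions already checkpointed by an earlier run are not redone.
        pending = {pid for pid in partition_ids
                   if not os.path.exists(f"{CHECKPOINT_DIR}/{pid}.json")}
        while pending:
            for pid in sorted(pending):
                try:
                    result = process_partition(pid)
                except RuntimeError as err:            # detected failure; stays pending
                    print("detected:", err)
                    continue
                with open(f"{CHECKPOINT_DIR}/{pid}.json", "w") as f:
                    json.dump(result, f)               # checkpoint the intermediate result
                pending.discard(pid)
        return [json.load(open(f"{CHECKPOINT_DIR}/{pid}.json")) for pid in partition_ids]

    if __name__ == "__main__":
        print(run_with_recovery(range(8)))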

(4) The inherent scalability of the underlying hardware and software architecture. Data-intensive computing systems can typically be scaled in a linear fashion to accommodate virtually any amount of data, or to meet time-critical performance requirements, simply by adding more processing nodes. The number of nodes and processing tasks assigned to a specific application can be variable or fixed, depending on the hardware, software, communications, and distributed file system architecture.
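
The back-of-the-envelope calculation below illustrates this linear scaling for a time-critical job; the dataset size, per-node throughput and deadline are invented figures, not measurements.

    # Hedged sketch: how many nodes a full scan needs to finish by a deadline,
    # assuming throughput adds linearly as nodes are added. Figures are illustrative.
    import math

    dataset_bytes = 500e12          # 500 TB to scan
    node_throughput = 400e6         # 400 MB/s of useful bandwidth per node
    deadline_seconds = 2 * 3600     # results needed within two hours

    nodes_needed = math.ceil(dataset_bytes / (node_throughput * deadline_seconds))
    print("nodes needed:", nodes_needed)                               # 174 under these figures
    runtime_2x = dataset_bytes / (2 * nodes_needed * node_throughput)
    print(f"runtime with twice the nodes: {runtime_2x / 3600:.2f} h")  # roughly halves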

 The data-intensive computer differs from the traditional computer in a number of important aspects:

A. Direct I/O between memory and database.
B. Moving the program to the data.
C. Data-object-oriented operating system.
D. Operating system support for distributed data objects (see the sketch after this list).
E. Collaborative, non-local operating system services.
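
To make items C and D concrete, here is a minimal sketch, under invented names, of a distributed data object: a 2-D array whose row blocks live on different "nodes", with a read method that fetches only the blocks overlapping the requested rows.

    # Hedged sketch of a distributed data object: a 2-D array stored as row blocks
    # spread over several nodes; reading a row range touches only the owning blocks.
    # Class name and storage layout are invented for illustration.
    import numpy as np

    class DistributedArray:
        def __init__(self, shape, block_rows, n_nodes):
            self.shape, self.block_rows, self.n_nodes = shape, block_rows, n_nodes
            # node_id -> {block_start_row: block}; stands in for per-node storage.
            self.nodes = {n: {} for n in range(n_nodes)}

        def _owner(self, start_row):
            return (start_row // self.block_rows) % self.n_nodes

        def write(self, data):
            for start in range(0, self.shape[0], self.block_rows):
                self.nodes[self._owner(start)][start] = data[start:start + self.block_rows]

        def read_rows(self, lo, hi):
            """Assemble rows [lo, hi) by fetching only the blocks that overlap them."""
            parts = []
            first = (lo // self.block_rows) * self.block_rows
            for start in range(first, hi, self.block_rows):
                block = self.nodes[self._owner(start)][start]
                parts.append(block[max(lo - start, 0):hi - start])
            return np.vstack(parts)

    if __name__ == "__main__":
        data = np.arange(1000 * 8).reshape(1000, 8)
        arr = DistributedArray(shape=data.shape, block_rows=100, n_nodes=4)
        arr.write(data)
        window = arr.read_rows(250, 420)       # touches only blocks 200, 300 and 400
        assert np.array_equal(window, data[250:420])
        print(window.shape)                    # (170, 8)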

Why Is Big Data Important?

The importance of big data doesn't revolve around how much data you have, but around what you do with it. You can take data from any source and analyze it to find answers that enable cost reductions, time reductions, new product development, optimized offerings, and smarter decision making. When you combine big data with high-powered analytics, you can accomplish business-related tasks such as:

 Determining root causes of failures, issues and defects in near-real time.

 Generating coupons at the point of sale based on the customer’s buying habits.

 Recalculating entire risk portfolios in minutes.

 Detecting fraudulent behavior before it affects your organization.


References:

1. J. Gray, "Distributed Computing Economics," ACM Queue, Vol. 6, No. 3, 2008, pp. 63-68.
2. I. Gorton, P. Greenfield, A. Szalay, and R. Williams, "Data-Intensive Computing in the 21st Century," IEEE Computer, Vol. 41, No. 4, 2008, pp. 30-32.
3. R.E. Bryant, "Data Intensive Scalable Computing," 2008.
4. Data Intensive Computer (https://www.scribd.com).
5. "Big Data: What It Is and Why It Matters," SAS (https://www.sas.com/en_us/insights/big-data/what-is-big-data.html).
