Data Intensive Computing
Adam Carter
EPCC, The University of Edinburgh
EUDAT Training Coordinator
• My working definition:
– I/O-bound computations
• Data is (generally) too big to fit in memory
– Efficient disk access is required to get the data to the CPU on time
– Having the data in the right place at the right time is vital
• Cluster Computing
• Grid Computing
• Supercomputing
• Cloud Computing
The Role of Data Infrastructures in Data Intensive Computing
• Traditionally, we bring the data to the compute
• In the future, we’ll want to bring the compute to
the data
– So where is the data?
– More than likely it’s in a repository…
– Maybe an “archive”…
What will the data in the archive look like?
• Files?
• Rows & Tables in a Relational Database?
• Triples in a Triple Store?
How might you bring in compute?
• As (relational) database queries (SQL)
• As queries against an RDF store (SPARQL)
• As VMs which can mount local disks
• As scripts or executables that you allow a user to
run
• As services that you, as a data service provider, offer with some kind of API (e.g. as a web service)
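A minimal sketch of the first approach above (pushing the computation into the data store as SQL), assuming a hypothetical SQLite file measurements.db holding a table readings(sensor, value); only the small aggregated result leaves the store, not the raw rows:

    import sqlite3

    # Hypothetical archive: a SQLite file holding a table readings(sensor, value).
    conn = sqlite3.connect("measurements.db")

    # Compute-to-data: the database performs the reduction and returns one row
    # per sensor, so only the aggregate crosses the network, not every reading.
    query = "SELECT sensor, AVG(value) FROM readings GROUP BY sensor"
    for sensor, mean_value in conn.execute(query):
        print(sensor, mean_value)

    conn.close()

The same pattern applies to SPARQL against an RDF store: the query travels to the data, and only the answer travels back.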
• These are the approaches that will need to be
offered by “repositories” holding large amounts
of “live” data.
– Many will probably also be relevant to archives
• How can a user get the information back out of
the archive?
– As complete files?
– Over the Internet (and your network!)?
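To see why the network matters when pulling complete files back out of an archive, a back-of-the-envelope sketch (the dataset size, link speed and efficiency below are made-up examples, not figures from the slides):

    # Rough transfer-time estimate: time = data volume / effective bandwidth.
    data_bytes = 10 * 1024**4        # example: a 10 TiB dataset
    link_bits_per_s = 1e9            # example: a 1 Gbit/s link
    efficiency = 0.8                 # assume ~80% of nominal bandwidth is achievable

    seconds = (data_bytes * 8) / (link_bits_per_s * efficiency)
    print(f"~{seconds / 3600:.1f} hours to move the data")   # roughly 30 hours here

At that scale, shipping the computation to the archive is often cheaper than shipping the archive to the computation.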
Back to the Compute…
• Need to understand the performance of your
computations and your data transfers
• Do you know how fast your program runs?
• Do you know if it’s spending all of its time on
compute or if it’s spending its time waiting for
data?
• Where is your data bottleneck?
• Benchmarking is key
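A minimal benchmarking sketch, assuming a hypothetical input file data.bin: timing the read phase and the compute phase separately gives a first indication of whether the program is doing useful work or waiting for data.

    import time

    def timed(label, fn):
        # Crude wall-clock timing of one phase; enough to locate the bottleneck.
        start = time.perf_counter()
        result = fn()
        print(f"{label}: {time.perf_counter() - start:.2f} s")
        return result

    # Phase 1: I/O - read the whole (hypothetical) input file.
    data = timed("read", lambda: open("data.bin", "rb").read())

    # Phase 2: compute - work proportional to the data size.
    timed("compute", lambda: sum(data))

If the read phase dominates, the run is I/O-bound and a faster CPU will not help; profile the data path instead.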
Amdahl’s Other Laws
• Gene Amdahl’s quantification of the balance
required for data-intensive applications:
– One bit of I/O per compute cycle
– Memory Size (bytes) / instructions per second = 1
– One IOop per 50,000 instructions
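Taken at face value, these balance rules give rough hardware targets; a sketch of the arithmetic for a hypothetical node executing 10^9 instructions per second:

    instructions_per_s = 1e9                       # hypothetical node: 10^9 instructions/s

    io_bits_per_s = instructions_per_s * 1         # one bit of I/O per compute cycle
    memory_bytes  = instructions_per_s * 1         # memory (bytes) / instructions per second = 1
    io_ops_per_s  = instructions_per_s / 50_000    # one IOop per 50,000 instructions

    print(f"I/O bandwidth: {io_bits_per_s / 8 / 1e6:.0f} MB/s")   # ~125 MB/s
    print(f"Memory:        {memory_bytes / 1e9:.0f} GB")          # ~1 GB
    print(f"I/O ops:       {io_ops_per_s:.0f} per second")        # ~20,000

The point is the balance: a node whose storage falls far short of these figures will leave its CPU waiting for data.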
A Data Intensive Computer
[Diagram: example machine design, USB2 × 120]
Making best use of a machine designed for data-intensive computing
• Work on streams of data, not files
– Not (so easily) searchable
– Not (so easily) sortable
• Storm (http://storm-project.net)
– Low latency
– Real-time
– No writes to disk at intermediate stages
– Reportedly does not scale quite as well in terms of throughput
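A minimal sketch of the streaming idea, using plain Python generators to stand in for a system like Storm and a hypothetical input file events.log: records flow through the pipeline one at a time, and intermediate results are never written to disk.

    def source(path):
        # Stream records one at a time instead of loading the whole file.
        with open(path) as f:
            for line in f:
                yield line.strip()

    def transform(records):
        # Intermediate stage: results are handed straight on, not written to disk.
        for record in records:
            yield record.upper()

    def sink(records):
        # Final stage: here, just count what flowed through.
        return sum(1 for _ in records)

    # The pipeline uses constant memory regardless of the size of the input.
    print(sink(transform(source("events.log"))))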
…or use whole datacentre(s)
…or the internet?
Conclusions
• Data-intensive computing is a new(ish) kind of computing
– necessitated by the huge amounts of data
– and offering new opportunities
• Need to think about new ways of doing computing
– It’s usually parallel computing, but not “traditional HPC”
• Matters for data preservation. Either:
– you’re preserving huge amounts of data that need to be easily reused, or
– you need to process large amounts of data to do a meaningful reduction so that the stored data retains its value