
03 Data


Data Handling: Import, Cleaning and Visualisation

Lecture 3:
A Brief Introduction to Data and Data Processing

Dr. Aurélien Sallin


Recap and warm-up
Basic programming concepts

· Values, variables
· Vectors
· Matrices
· Loops
· Logical statements
· Control statements
· Functions
Three tutorials

· Compute the mean with your own function


· Evolution in action: fast and slow sloths -> exercise session
· Append and lists -> exercise session
Warm-up

# Vectors
some_numbers <- c(30, 50, 60)

some_numbers[c(2,3)]
some_numbers > 3
some_numbers * 5
Warm-up

What is total_sum?

numbers <- 1:4


total_sum <- 0
n <- length(numbers)

# start loop
for (i in 1:n) {
  if (i %% 2 == 0) {
    total_sum <- total_sum + numbers[i]
  } else {
    total_sum <- total_sum + 2 * numbers[i]
  }
}
Don’t forget…
Data Processing
The binary system

Microprocessors can only represent two signs (states):

· ‘Off’ = 0
· ‘On’ = 1
The binary counting frame

· Only two signs: 0, 1.
· Base 2.
· Columns: $2^0 = 1$, $2^1 = 2$, $2^2 = 4$, and so forth.
The binary counting frame

What is the decimal number 139 in the binary counting frame?

· Solution:
$(1 \times 2^7) + (1 \times 2^3) + (1 \times 2^1) + (1 \times 2^0) = 139$.

· More precisely:
$(1 \times 2^7) + (0 \times 2^6) + (0 \times 2^5) + (0 \times 2^4) + (1 \times 2^3) + (0 \times 2^2) + (1 \times 2^1) + (1 \times 2^0) = 139$.

· That is, the number 139 in the decimal system corresponds to 10001011 in
the binary system.
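
The conversion can be checked in R. The following is a minimal sketch: `to_binary` is a hypothetical helper written for illustration (it is not part of base R) and assumes a non-negative whole number as input. It divides by 2 repeatedly and collects the remainders, which is the usual pen-and-paper method.

```r
# Convert a non-negative whole number to its binary representation
# by repeatedly dividing by 2 and collecting the remainders.
# `to_binary` is a hypothetical helper, not a base R function.
to_binary <- function(n) {
  if (n == 0) return("0")
  digits <- c()
  while (n > 0) {
    digits <- c(n %% 2, digits)  # the remainder is the next digit, right to left
    n <- n %/% 2                 # integer division moves to the next column
  }
  paste(digits, collapse = "")
}

to_binary(139)

# Base R can go the other way: read a string as a base-2 number.
strtoi("10001011", base = 2)
```

The first call should return the binary string derived above, and `strtoi()` should give back 139.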
Conversion between binary and decimal

Number   128  64  32  16   8   4   2   1

  0 =      0   0   0   0   0   0   0   0
  1 =      0   0   0   0   0   0   0   1
  2 =      0   0   0   0   0   0   1   0
  3 =      0   0   0   0   0   0   1   1
139 =      1   0   0   0   1   0   1   1
The binary counting frame

· Sufficient to represent all natural numbers in the decimal system.


· Representing fractions is tricky
- e.g. 1/3 = 0.333… would require an infinite sequence of 0s and 1s.
- Solution: ‘floating point numbers’ (an approximation, not 100% accurate)
Floating point numbers: a strange phenomenon

# Subtract two floating-point numbers; mathematically, the result is 0.1
x <- 0.3 - 0.2
y <- 0.1

# Check if they are equal
result <- x == y

print(x)

## [1] 0.1

print(y)

## [1] 0.1

print(result)

## [1] FALSE
Floating point numbers: a strange phenomenon

print(format(x, digits = 20)) # prints a more precise value of x

## [1] "0.099999999999999977796"

print(format(y, digits = 20)) # prints a more precise value of y

## [1] "0.10000000000000000555"

tolerance <- 1e-9


equal <- abs(x - y) < tolerance
print(equal)

## [1] TRUE
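
Base R also ships a function for this kind of comparison. As a small sketch of an alternative to the hand-written tolerance: `all.equal()` compares two numbers up to a small default tolerance, and wrapping it in `isTRUE()` yields a plain TRUE/FALSE.

```r
x <- 0.3 - 0.2
y <- 0.1

x == y                    # exact comparison of the stored values
isTRUE(all.equal(x, y))   # comparison up to a small default tolerance
```

The `isTRUE()` wrapper matters because `all.equal()` returns a description of the difference, not FALSE, when the values differ.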
Decimal numbers in a computer

If computers only understand 0 and 1, how can they express decimal numbers
like 139?

· Standards define how symbols, colors, etc are shown on the screen.
· Facilitates interaction with a computer (our keyboards do not only consist of
a 0/1 switch).
What time is it?
The hexadecimal system

· Binary numbers can become quite long rather quickly.


· Computer Science: refer to binary numbers with the hexadecimal system.
The hexadecimal system

· 16 symbols:
- 0-9 (used like in the decimal system)…
- and A-F (for the numbers 10 to 15).
· 16 symbols >>> base 16: each digit represents an increasing power of 16 ($16^0$, $16^1$, etc.).
The hexadecimal system

What is the decimal number 139 expressed in the hexadecimal system?

· Solution:
$(8 \times 16^1) + (11 \times 16^0) = 139$.

· More precisely:
$(8 \times 16^1) + (\text{B} \times 16^0) = \text{8B} = 139$.

· Hence: 10001011 (in binary) = 8B (in hexadecimal) = 139 in decimal.
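
These conversions can be reproduced with base R functions. The sketch below uses `sprintf()`, `as.hexmode()` and `strtoi()`; the value 139 is the one from the example above.

```r
# Decimal to hexadecimal
sprintf("%X", 139L)        # formats an integer with upper-case hex digits
format(as.hexmode(139))    # base R's "hexmode" class, lower-case digits

# Hexadecimal and binary strings back to decimal
strtoi("8B", base = 16)
strtoi("10001011", base = 2)
```

Both `strtoi()` calls should return the same decimal number, since the two strings are just different spellings of it.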


The hexadecimal system

Advantages (when working with binary numbers)

1. Shorter than raw binary representation


2. Much easier to translate back and forth between binary and hexadecimal
than between binary and decimal.

WHY?
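
One way to see the reason: since $16 = 2^4$, every hexadecimal digit corresponds to exactly four binary digits, so a binary number can be translated group by group. The following sketch illustrates this for 10001011; the variable names are chosen for illustration only.

```r
bits <- "10001011"

# Split the 8 bits into two groups of 4 bits
groups <- substring(bits, c(1, 5), c(4, 8))

# Translate each group separately: 4 bits cover exactly the values 0 to 15
hex_digits <- sprintf("%X", strtoi(groups, base = 2))

paste(hex_digits, collapse = "")
```

No such shortcut exists for decimal, because 10 is not a power of 2.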

😆
Character Encoding
Computers and text

How can a computer understand text if it only understands 0s and 1s?

A modified version of South Korean Dubeolsik (two-set type) for old hangul letters. (Illustration by Yes0song 2010, Creative Commons Attribution-Share Alike 3.0
Unported)
· Standards define how 0s and 1s correspond to specific letters/characters of different human languages.
· These standards are usually called character encodings.
· They are coded character sets that map a unique number (ultimately stored as a binary value) to each character in the set.
· For example, ASCII (American Standard Code for Information Interchange), now superseded by UTF-8 (Unicode).

ASCII logo. (public domain).


ASCII Table

Binary      Hexadecimal   Decimal   Character
0011 1111   3F            63        ?
0100 0001   41            65        A
0110 0010   62            98        b
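
The rows of this table can be looked up in R. A minimal sketch using base R functions: `utf8ToInt()` and `intToUtf8()` translate between characters and their decimal codes, `charToRaw()` shows the byte in hexadecimal, and `intToBits()` exposes the individual bits.

```r
utf8ToInt("A")     # character to its decimal code
intToUtf8(98)      # decimal code to character
charToRaw("?")     # character to its byte, printed in hexadecimal

# Reproduce one table row: binary, hexadecimal and decimal for "A"
code <- utf8ToInt("A")
bits <- rev(as.integer(intToBits(code))[1:8])  # intToBits returns the lowest bit first
paste(bits, collapse = "")
sprintf("%X", code)
```

For these characters the UTF-8 codes are identical to the ASCII codes, which is why the UTF-8 helper functions reproduce the ASCII table.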
Character encodings: why should we care?

· In practice, Data Science means handling digital data of all formats and
shapes.
- Diverse sources.
- Different standards.
- Different languages (Japanese vs English).
- read/store data.
· At the lowest level, this means understanding/handling encodings.
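
To illustrate why the encoding matters, the sketch below stores the same word under two different standards and compares the bytes. The example word "Zürich" is an assumption made for illustration; it is written with a Unicode escape so that the result does not depend on how the script file itself is saved.

```r
# "Zürich", written with a Unicode escape (\u00fc is the letter ü)
city_utf8 <- enc2utf8("Z\u00fcrich")

# Convert the same text to the older Latin-1 encoding
city_latin1 <- iconv(city_utf8, from = "UTF-8", to = "latin1")

charToRaw(city_utf8)     # ü takes two bytes in UTF-8
charToRaw(city_latin1)   # ü takes one byte in Latin-1

Encoding(city_utf8)      # how R has marked the string
```

The same six letters are stored as different byte sequences, so a program that assumes the wrong encoding will display the wrong characters.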
Computer Code and Text-Files
Putting the pieces together…

Two core themes of this course:

1. How can data be stored digitally and be read by/imported to a computer?


2. How can we give instructions to a computer by writing computer code?

In both of these domains we mainly work with one simple type of document:
text files.
Text-files

· A collection of characters stored in a designated part of the computer memory/hard drive.
· An easy-to-read representation of the underlying information (0s and 1s)!
· Common device to store data:
- Structured data (tables)
- Semi-structured data (websites)
- Unstructured data (plain text)
· Typical device to store computer code.
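
The two views of a text file can be tried out directly. In this sketch a small file is written to a temporary location and then read back twice: once as text, once as raw bytes. The file content (`name,age` and `Ann,30`) is made up for illustration.

```r
# Write two lines of text to a temporary file
path <- tempfile(fileext = ".txt")
writeLines(c("name,age", "Ann,30"), path)

# Read the file as text: the easy-to-read representation
readLines(path)

# Read the first bytes of the same file: the underlying values
# (printed in hexadecimal)
readBin(path, what = "raw", n = 8)
```

Each byte returned by `readBin()` is the code of one character of the first line.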
Text-editors: RStudio, Atom, VS Code

Install RStudio from here!

Install Atom from here!

Install VS Code from here!

Install Sublime Text from here!


Data Processing Basics
The ‘blackbox’ of data processing.
Components of a standard computing environment

Basic components of a standard computing environment.


Central Processing Unit

· R runs on one CPU core by default.


· All modern CPUs have multiple cores.
· Advanced: explore parallelization with the plyr, doParallel and future packages.
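
As a first impression of what using several cores looks like, here is a minimal sketch. It uses the `parallel` package, which ships with R, in place of the packages named above, and the task (squaring four numbers) is deliberately trivial. The choice of two workers is an assumption; a real task this small gains nothing from parallelization.

```r
library(parallel)  # ships with R, no installation needed

detectCores()      # number of CPU cores R can see on this machine

# A deliberately simple task: square each number
square <- function(v) v^2

# Sequential version, using one core
sequential <- lapply(1:4, square)

# Parallel version: start two worker processes and split the work
cl <- makeCluster(2)
parallel_result <- parLapply(cl, 1:4, square)
stopCluster(cl)

identical(sequential, parallel_result)
```

Both versions should produce the same list; only the way the work is distributed differs.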
Random Access Memory

large_matrix <- matrix(1, nrow=1e8, ncol=1e8)

## Error in matrix(1, nrow = 1e+08, ncol = 1e+08): vector is too large

· Try to create a matrix with $10^8 \times 10^8$ elements.
· Assuming each number is stored using 8 bytes, this matrix would require $8 \times 10^{16}$ bytes of RAM (more on bytes in the next lecture).
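
The arithmetic behind this estimate can be written out in R. The sketch below repeats the calculation under the same assumption of 8 bytes per number, and then measures a much smaller matrix with `object.size()` for comparison; the size of 1000 × 1000 is chosen for illustration.

```r
n_elements <- 1e8 * 1e8          # number of cells in the matrix
bytes_needed <- n_elements * 8   # assuming 8 bytes per number
bytes_needed
bytes_needed / 1e9               # the same amount in gigabytes (10^9 bytes)

# A much smaller matrix fits easily; object.size() reports its memory use
small_matrix <- matrix(1, nrow = 1000, ncol = 1000)
object.size(small_matrix)
```

The measured size of the small matrix should be slightly above 8 million bytes, i.e. 8 bytes per cell plus a small overhead.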
Mass storage: hard drive
Network: Internet, cloud, etc.
Putting the pieces together…

Recall the initial example (survey) of this course.

1. Access a website (over the Internet), use keyboard to enter data into a
website (a Google sheet in that case).

2. R program accesses the data of the Google sheet (again over the Internet),
downloads the data, and loads it into RAM.

3. Data processing: produce output (in the form of statistics/plots), output on screen.
5468616E6B7320616E642073656520796F75206E657874207765656B21
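
The line above can be decoded with the tools from this lecture. A minimal sketch, assuming that every pair of hexadecimal digits stands for one ASCII character:

```r
hex_message <- "5468616E6B7320616E642073656520796F75206E657874207765656B21"

# Split the string into pairs of hexadecimal digits (one byte each)
starts <- seq(1, nchar(hex_message), by = 2)
pairs <- substring(hex_message, starts, starts + 1)

# Turn each pair into its decimal code, then into a character
codes <- strtoi(pairs, base = 16)
intToUtf8(codes)
```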

🤓
Q&A