Data Science Algorithmen Master - 02 Data Handling
Getting data
Visualizing data
Characterizing data
Manipulating data
http://archive.ics.uci.edu/ml/index.php
https://aws.amazon.com/fr/datasets/
Meta portals:
http://dataportals.org/
https://www.opendatamonitor.eu
Given a file with 3 columns of values
10 0 cold
25 0 warm
15 5 cold
20 3 warm
18 7 cold
20 10 cold
22 5 warm
24 6 warm
Read data into a list
import csv

def csv_file_to_list(csv_file_name):
    # text mode for Python 3; newline='' as recommended for csv
    with open(csv_file_name, 'r', newline='') as f:
        reader = csv.reader(f, delimiter=' ')  # the sample file is space-separated
        data = list(reader)
    return data
Read data into a dictionary
(key: the first two columns as a tuple, value: the last column)
def load_3row_data_to_dic(input_file):
    dic = {}
    with open(input_file, 'r') as f:
        for entry in f.read().splitlines():
            values = entry.split(' ')
            dic[int(values[0]), int(values[1])] = values[2]
    return dic
Write dictionary back to file
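A possible counterpart to the loader above, a minimal sketch assuming the dictionary shape just produced (keys are (int, int) tuples, value is the label; the function name is my own):

```python
def save_3row_dic_to_file(dic, output_file):
    """Write each (value1, value2) -> label entry back as one
    space-separated line, mirroring the input file format."""
    with open(output_file, 'w') as f:
        for (v1, v2), label in dic.items():
            f.write(f"{v1} {v2} {label}\n")
```

Round-tripping a file through load and save should reproduce it line for line (Python 3.7+ dictionaries preserve insertion order).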
Bar charts
Pie charts
Line plots
Scatter plots
Histograms
…
Simple library: matplotlib (matplotlib.pyplot, imported as plt)
plt.show()
from collections import Counter
import matplotlib.pyplot as plt

grades = [83,95,91,87,70,0,85,82,100,67,73,77,0]
# round down to the nearest ten (integer division)
decile = lambda grade: grade // 10 * 10
histogram = Counter(decile(grade) for grade in grades)
# one bar per decile, width 8 so adjacent bars do not touch
plt.bar(histogram.keys(), histogram.values(), 8)
plt.xlabel("Decile")
plt.ylabel("# of Students")
plt.title("Distribution of Exam 1 Grades")
plt.show()
import matplotlib.pyplot as plt

# y value series
variance = [1,2,4,8,16,32,64,128,256]
bias_squared = [256,128,64,32,16,8,4,2,1]
total_error = [x + y for x, y in zip(variance, bias_squared)]
# x values
xs = range(len(variance))
# we can make multiple calls to plt.plot
# to show multiple series on the same chart
# green solid line, red dot-dashed line, blue dotted line
plt.plot(xs, variance, 'g-', label='variance')
plt.plot(xs, bias_squared, 'r-.', label='bias^2')
plt.plot(xs, total_error, 'b:', label='total error')
plt.legend(loc=9)  # loc=9 means "top center"
# friends and minutes are assumed parallel lists, one entry per user
plt.scatter(friends, minutes)
plt.show()
plt.pie([0.95, 0.05],
labels=["Uses pie charts", "Knows better"])
num_friends = [100,49,41,40,25,21,21,19,19,18,18,
16,15,15,15,15,14,14,13,13,13,13,12,12,11,
10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,
9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,
8,8,8,8,8,8,8,8,8,8,8,8,8,
7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,
5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,
4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,
3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,
2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]
Vector calculations
Matrix calculations
def dot(v, w):
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

def sum_of_squares(v):
    """v_1 * v_1 + ... + v_n * v_n"""
    return dot(v, v)
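The "Matrix calculations" bullet has no code on the slide; a minimal sketch in the same list-of-lists style (function names are my own):

```python
def shape(A):
    """(number of rows, number of columns) of a list-of-lists matrix"""
    num_rows = len(A)
    num_cols = len(A[0]) if A else 0
    return num_rows, num_cols

def matrix_multiply(A, B):
    """naive O(n^3) product of two list-of-lists matrices"""
    n, m = shape(A)
    m2, p = shape(B)
    assert m == m2, "inner dimensions must match"
    return [[sum(A[i][k] * B[k][j] for k in range(m))
             for j in range(p)]
            for i in range(n)]
```

For real workloads a library such as NumPy would replace these helpers; here the point is only the list-of-lists representation.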
friend_counts = Counter(num_friends)
xs = range(101)
ys = [friend_counts[x] for x in xs]
plt.bar(xs, ys)
plt.axis([0,101,0,25])
plt.title("Histogram of Friend Counts")
plt.xlabel("# of friends")
plt.ylabel("# of people")
plt.show()
num_points = len(num_friends) # 204
sorted_values = sorted(num_friends)
smallest_value = sorted_values[0] # 1
second_smallest_value = sorted_values[1] # 1
second_largest_value = sorted_values[-2] # 49
Average (mean)
def mean(x):
return sum(x) / len(x)
mean(num_friends)
def median(v):
    """finds the 'middle-most' value of v"""
    n = len(v)
    sorted_v = sorted(v)
    midpoint = n // 2
    if n % 2 == 1:
        # if odd, return the middle value
        return sorted_v[midpoint]
    else:
        # if even, return the average of the two middle values
        lo = midpoint - 1
        hi = midpoint
        return (sorted_v[lo] + sorted_v[hi]) / 2
Generalization of the median
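The quantile is the generalization meant here: the median is the 0.5 quantile. A minimal sketch, consistent with the interquartile_range function used later on these slides:

```python
def quantile(x, p):
    """returns the value below which the p-th fraction of the data lies"""
    p_index = int(p * len(x))
    return sorted(x)[p_index]
```

So quantile(x, 0.5) is (close to) the median, and quantile(x, 0.25) / quantile(x, 0.75) bound the middle half of the data.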
def mode(x):
    """returns a list, might be more than one mode"""
    counts = Counter(x)
    max_count = max(counts.values())
    return [x_i for x_i, count in counts.items()  # items() in Python 3
            if count == max_count]
mode(num_friends) # [1, 6]
Measure how spread out the data is
◦ If near 0 -> hardly spread out
◦ If large number -> spread out
def data_range(x):
return max(x) - min(x)
def de_mean(x):
"""translate x by subtracting its mean
(so the result has mean 0)"""
x_bar = mean(x)
return [x_i - x_bar for x_i in x]
def variance(x):
"""assumes x has at least two elements"""
n = len(x)
deviations = de_mean(x)
return sum_of_squares(deviations) / (n - 1)
Variance is expressed in the squared units of the data
Therefore the standard deviation is introduced
as the square root of the variance
import math

def standard_deviation(x):
    return math.sqrt(variance(x))
def interquartile_range(x):
return quantile(x, 0.75) - quantile(x, 0.25)
Correlation
Compare data sets to find commonalities
Example:
◦ Amount of time people spend on our web site is related to the number of friends
◦ Start with a list of daily_minutes each user spends on the web site
daily_minutes = [1,68.77,51.25,52.08,38.36,44.54,57.13,51.4,41.42,31.22,34.76,
54.01,38.79,47.59,49.1,27.66,41.03,36.73,48.65,28.12,46.62,35.57,
32.98,35,26.07,23.77,39.73,40.57,31.65,31.21,36.32,20.45,21.93,
26.02, 27.34,23.49,46.94,30.5,33.8,24.23,21.4,27.94,32.24,40.57,
25.07,19.42,22.39,18.42,46.96,23.72,26.41,26.97,36.76,40.32,
35.02,29.47,30.2,31,38.11,38.18,36.31,21.03,30.86,36.07,28.66,
29.08,37.28,15.28,24.17,22.31,30.17,25.53,19.85,35.37,44.6,17.23,
13.47,26.33,35.02,32.09,24.81,19.33,28.77,24.26,31.98,25.73,
24.86,16.28,34.51,15.23,39.72,40.8,26.06,35.76,34.76,16.13,44.04,
18.03,19.65,32.62,35.59,39.43,14.18,35.24,40.13,41.82,35.45,36.07,
43.67,24.61,20.9,21.9,18.79,27.61,27.21,26.61,29.77,20.59,27.53,
13.82,33.2,25,33.1,36.65,18.63,14.87,22.2,36.81,25.53,24.62,
26.25,18.21,28.08,19.42,29.79,32.8,35.99,28.32,27.79,35.88,29.06,
36.28,14.1,36.63,37.49,26.9,18.58,38.48,24.48,18.95,33.55,14.24,
29.04,32.51,25.63,22.22,19,32.73,15.16,13.9,27.2,32.01,29.27,33,
13.74,20.42,27.32,18.23,35.35,28.48,9.08,24.62,20.12,35.26,19.92,
31.02,16.49,12.16,30.7,31.22,34.65,13.13,27.51,33.2,31.57,14.1,
33.42,17.44,10.12,24.42,9.82,23.39,30.93,15.03,21.67,31.09,33.29,
22.61,26.89,23.48,8.38,27.81,32.35,23.84]
Analogous to the variance:
the covariance is the dot product of the two data sets'
deviations from their mean values
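This can be sketched directly with the helpers from the earlier slides (mean, de_mean, dot are repeated here so the block is self-contained):

```python
import math

def mean(x):
    return sum(x) / len(x)

def de_mean(x):
    """translate x by subtracting its mean (repeated from earlier slide)"""
    x_bar = mean(x)
    return [x_i - x_bar for x_i in x]

def dot(v, w):
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

def covariance(x, y):
    """dot product of the two deviation vectors, scaled by n - 1"""
    n = len(x)
    return dot(de_mean(x), de_mean(y)) / (n - 1)

def correlation(x, y):
    """covariance normalized to [-1, 1] by both standard deviations"""
    stdev_x = math.sqrt(covariance(x, x))
    stdev_y = math.sqrt(covariance(y, y))
    if stdev_x > 0 and stdev_y > 0:
        return covariance(x, y) / stdev_x / stdev_y
    return 0  # if either set shows no variation, correlation is zero
```

Applied to the two X/Y pairs below: the first pair has correlation 0 although Y = |X| depends completely on X, the second has correlation 1 despite the very different scales.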
X = [-2, -1, 0, 1, 2]
Y = [ 2, 1, 0, 1, 2]    # Y = |X|: correlation is 0, yet Y depends entirely on X
X = [-2, -1, 0, 1, 2]
Y = [99.98, 99.99, 100, 100.01, 100.02]   # correlation 1: scale does not matter
If E and F are independent, then P(E, F) = P(E) P(F)
(Probability tree diagram: from Start, each child is a boy with P(B) or a girl with P(G))
import random

def random_kid():
    return random.choice(["boy", "girl"])

both_girls = 0
older_girl = 0
either_girl = 0

random.seed(0)
for _ in range(10000):
    younger = random_kid()
    older = random_kid()
    if older == "girl":
        older_girl += 1
    if older == "girl" and younger == "girl":
        both_girls += 1
    if older == "girl" or younger == "girl":
        either_girl += 1

print("P(both | older):", both_girls / older_girl)    # ~ 1/2
print("P(both | either):", both_girls / either_girl)  # ~ 1/3
Question:
given P(F|E), what is the conditional probability P(E|F)?
Bayes' theorem: P(E|F) = P(E) * P(F|E) / P(F)
P(Disease|TestPositive): P(D|T)
We know
◦ P(T|D) : 99% = 0.99 -> P(T|¬D) : 0.01
◦ P(D) : 1/10000 = 0.0001 -> P(¬D) : 0.9999
P(D|T) = P(T|D)P(D) / [P(T|D)P(D) + P(T|¬D)P(¬D)]
P(D|T) = 0.99 * 0.0001 / (0.99 * 0.0001 + 0.01 * 0.9999)
P(D|T) ≈ 0.0098 — less than 1% of positive tests indicate the disease!
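The arithmetic can be checked directly (the function name is my own):

```python
def bayes_posterior(p_t_given_d, p_t_given_not_d, p_d):
    """P(D|T) via Bayes' theorem, expanding P(T) over D and not-D"""
    p_not_d = 1 - p_d
    numerator = p_t_given_d * p_d
    return numerator / (numerator + p_t_given_not_d * p_not_d)

# 99% accurate test, 1-in-10000 disease
print(bayes_posterior(0.99, 0.01, 0.0001))  # ≈ 0.0098
```

The counterintuitive result comes from the base rate: false positives among the 9999/10000 healthy people vastly outnumber true positives.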
Uniform Distribution
◦ Equal weight on all numbers between 0 and 1
◦ -> weight for single point = 0!
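Because a single point has weight 0, the distribution is described by a density (pdf) and a cumulative distribution (cdf); a minimal sketch for the uniform case on [0, 1):

```python
def uniform_pdf(x):
    """density of the uniform distribution on [0, 1)"""
    return 1 if 0 <= x < 1 else 0

def uniform_cdf(x):
    """probability that a uniform random variable is <= x"""
    if x < 0:
        return 0      # the variable is never below 0
    elif x < 1:
        return x      # e.g. P(X <= 0.3) = 0.3
    else:
        return 1      # the variable is always below 1
```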
http://en.wikipedia.org/wiki/Binary_search_algorithm
Filters & Conversions
Data values may differ hugely in magnitude
For some applications the relative position of
the data is more interesting
scaled = (value-min)/(max-min)
To get rid of extremes you can use the
quantile / percentile function
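Both ideas can be sketched together: min-max scaling as given by the formula above, and clipping extremes with a quantile helper like the one used for the interquartile range (function names are my own):

```python
def min_max_scale(values):
    """map values linearly to [0, 1]: scaled = (value - min) / (max - min)"""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def quantile(values, p):
    """value below which the p-th fraction of the data lies"""
    return sorted(values)[int(p * len(values))]

def clip_to_quantiles(values, p_low=0.05, p_high=0.95):
    """replace extremes by the given quantiles to tame outliers"""
    lo, hi = quantile(values, p_low), quantile(values, p_high)
    return [min(max(v, lo), hi) for v in values]
```

Clipping before scaling keeps a single outlier from squeezing all other values into a tiny part of the [0, 1] range.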