
02 - Data Handling

 Getting data

 Visualizing data

 Characterizing data

 Manipulating data
 Data set repositories:
 http://archive.ics.uci.edu/ml/index.php
 https://aws.amazon.com/fr/datasets/

 Meta portals:
 http://dataportals.org/
 https://www.opendatamonitor.eu

 Given a file with 3 columns of values

10 0 cold
25 0 warm
15 5 cold
20 3 warm
18 7 cold
20 10 cold
22 5 warm
24 6 warm
 Read data into a list

import csv

def csv_file_to_list(csv_file_name):
    with open(csv_file_name, 'r') as f:  # text mode ('r') for Python 3's csv module
        reader = csv.reader(f, delimiter=' ')  # the sample file is space-separated
        data = list(reader)
    return data
 Read data into a dictionary
(all keys, value is last column)

def load_3row_data_to_dic(input_file):
    dic = {}
    with open(input_file, 'r') as f:
        entries = f.read().splitlines()
    for entry in entries:
        values = entry.split(' ')
        # key: the two numeric columns as a tuple, value: the last column
        dic[int(values[0]), int(values[1])] = values[2]
    return dic
 Write dictionary back to file

def save_3row_data_from_dic(output_file, data):
    with open(output_file, 'w') as f:
        for key, value in data.items():
            f.write(str(key[0]) + ' ' + str(key[1]) + ' ' + value + '\n')

import sys

def printf(format, *args):
    sys.stdout.write(format % args)

 printf('hello %s world', 'good')

 printf('pi is %f', 3.1415)


Plots
 http://matplotlib.org

 Bar charts
 Pie charts
 Line plots
 Scatter plots
 Histograms
 …
 Simple library

 Diagram is set up internally step by step


 Finally, show() displays the result

 Many features -> see online manual


import matplotlib.pyplot as plt

years = [1950, 1960, 1970, 1980, 1990, 2000, 2010]


gdp = [300.2, 543.3, 1075.9, 2862.5, 5979.6, 10289.7, 14958.3]

# create a line chart, years on x-axis, gdp on y-axis


plt.plot(years, gdp, color='green', marker='o', linestyle='solid')

# add a title (GDP gross domestic product)


plt.title("Nominal GDP")

# add a label to the y-axis


plt.ylabel("Billions of $")
plt.show()
movies = ["Annie Hall", "Ben-Hur", "Casablanca", "Gandhi",
"West Side Story"]
num_oscars = [5, 11, 3, 8, 10]

# bars are by default width 0.8


xs = [i for i, _ in enumerate(movies)]

# plot bars with left x-coordinates [xs], heights [num_oscars]


plt.bar(xs, num_oscars)
plt.ylabel("# of Academy Awards")
plt.title("My Favorite Movies")

# label x-axis with movie names at bar centers


plt.xticks([i for i, _ in enumerate(movies)], movies)

plt.show()
from collections import Counter

grades = [83,95,91,87,70,0,85,82,100,67,73,77,0]
# round down to the nearest ten ('//' is integer division)
decile = lambda grade: grade // 10 * 10
histogram = Counter(decile(grade) for grade in grades)

# give each bar a width of 8


plt.bar([x for x in histogram.keys()], histogram.values(), 8)
# x-axis -5 .. 105, y-axis 0 .. 5 , labels 0 .. 100
plt.axis([-5, 105, 0, 5])
plt.xticks([10 * i for i in range(11)])

plt.xlabel("Decile")
plt.ylabel("# of Students")
plt.title("Distribution of Exam 1 Grades")

plt.show()
# y value series
variance = [1,2,4,8,16,32,64,128,256]
bias_squared = [256,128,64,32,16,8,4,2,1]

# zip() combines two data series to tuples


total_error = [x + y for x, y in zip(variance, bias_squared)]

# x values
xs = range(len(variance))
# we can make multiple calls to plt.plot
# to show multiple series on the same chart
# green solid line, red dot-dashed line, blue dotted line
plt.plot(xs, variance, 'g-', label='variance')
plt.plot(xs, bias_squared, 'r-.', label='bias^2')
plt.plot(xs, total_error, 'b:', label='total error')

# because we've assigned labels to each series


# we can get a legend for free
# loc=9 means "top center"
plt.legend(loc=9)
plt.xlabel("model complexity")
plt.title("The Bias-Variance Tradeoff")
plt.show()
friends = [ 70, 65, 72, 63, 71, 64, 60, 64, 67]
minutes = [175, 170, 205, 120, 220, 130, 105, 145, 190]
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']

plt.scatter(friends, minutes)

# label each point


for label, friend_count, minute_count in zip(labels, friends, minutes):
    plt.annotate(label, xy=(friend_count, minute_count),
                 xytext=(5, -5), textcoords='offset points')

plt.title("Daily Minutes vs. Number of Friends")


plt.xlabel("# of friends")
plt.ylabel("daily minutes spent on the site")

plt.show()
plt.pie([0.95, 0.05],
labels=["Uses pie charts", "Knows better"])

# make sure pie is a circle and not an oval


plt.axis("equal")
plt.show()
Statistics
 Small data sets can simply be represented by
giving the numbers
 For larger data sets this quickly becomes opaque
(imagine 1 million numbers …)

 -> we need statistics


from __future__ import division  # only needed in Python 2, where / would round to int
from collections import Counter
from linear_algebra import sum_of_squares, dot
import math
import matplotlib.pyplot as plt

num_friends = [100,49,41,40,25,21,21,19,19,18,18,
16,15,15,15,15,14,14,13,13,13,13,12,12,11,
10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,
9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,
8,8,8,8,8,8,8,8,8,8,8,8,8,
7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,
5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,
4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,
3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,
2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]
 Vector calculations
 Matrix calculations

def dot(v, w):
    """v_1 * w_1 + ... + v_n * w_n"""
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

def sum_of_squares(v):
    """v_1 * v_1 + ... + v_n * v_n"""
    return dot(v, v)
friend_counts = Counter(num_friends)
xs = range(101)
ys = [friend_counts[x] for x in xs]

plt.bar(xs, ys)
plt.axis([0,101,0,25])
plt.title("Histogram of Friend Counts")
plt.xlabel("# of friends")
plt.ylabel("# of people")

plt.show()
num_points = len(num_friends) # 204

largest_value = max(num_friends) # 100


smallest_value = min(num_friends) # 1

sorted_values = sorted(num_friends)
smallest_value = sorted_values[0] # 1
second_smallest_value = sorted_values[1] # 1
second_largest_value = sorted_values[-2] # 49
 Average (mean)

def mean(x):
    return sum(x) / len(x)

mean(num_friends)

 The average depends on every single value
◦ Outliers (runaway values) can heavily influence the average (see the example below)
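
A minimal illustration (the values are invented for this sketch): a single extreme value drags the mean far away.

data = [1, 2, 3, 4, 5]
print(mean(data))           # 3.0
print(mean(data + [1000]))  # ~169.2, one outlier dominates the mean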
 Median value
(middle-most value of the data set)

 To find the middle value the data set must be
sorted

 if (number of points is odd)
◦ take the middle one
 else
◦ take the mean value of the left & right values
def median(v):
    """finds the 'middle-most' value of v"""
    n = len(v)
    sorted_v = sorted(v)
    midpoint = n // 2

    if n % 2 == 1:
        # if odd, return the middle value
        return sorted_v[midpoint]
    else:
        # if even, return the average of the two middle values
        lo = midpoint - 1
        hi = midpoint
        return (sorted_v[lo] + sorted_v[hi]) / 2
 Generalization of the median

 The p-quantile is the value below which a
fraction p of the data set lies

def quantile(x, p):
    """returns the pth-percentile value in x"""
    p_index = int(p * len(x))
    return sorted(x)[p_index]
 quantile(num_friends, 0.10) # 1
 quantile(num_friends, 0.25) # 3
 quantile(num_friends, 0.75) # 9
 quantile(num_friends, 0.90) # 13
 Might be more than one value

def mode(x):
    """returns a list, might be more than one mode"""
    counts = Counter(x)
    max_count = max(counts.values())
    return [x_i for x_i, count in counts.items()  # iteritems() in Python 2
            if count == max_count]

mode(num_friends)  # [1, 6]
 Measure how spread out the data is
◦ If near 0 -> hardly spread out
◦ If large number -> spread out

 Range of data values

def data_range(x):
    return max(x) - min(x)

def de_mean(x):
    """translate x by subtracting its mean
    (so the result has mean 0)"""
    x_bar = mean(x)
    return [x_i - x_bar for x_i in x]

def variance(x):
    """assumes x has at least two elements"""
    n = len(x)
    deviations = de_mean(x)
    return sum_of_squares(deviations) / (n - 1)
 The variance is in the squared unit of the data
 Therefore the standard deviation is
introduced as the square root of the variance

def standard_deviation(x):
    return math.sqrt(variance(x))

 These metrics are once more dependent on
the number of items and sensitive to extreme
values (outliers)
 Take the difference between the 75% and 25%
percentile values

def interquartile_range(x):
    return quantile(x, 0.75) - quantile(x, 0.25)
Correlation
 Compare data sets to find commonalities

 Example:
◦ Hypothesis: the amount of time people spend on our web site is related to their number of friends
◦ Start with a list of daily_minutes each user spends on the web site

daily_minutes = [1,68.77,51.25,52.08,38.36,44.54,57.13,51.4,41.42,31.22,34.76,
54.01,38.79,47.59,49.1,27.66,41.03,36.73,48.65,28.12,46.62,35.57,
32.98,35,26.07,23.77,39.73,40.57,31.65,31.21,36.32,20.45,21.93,
26.02, 27.34,23.49,46.94,30.5,33.8,24.23,21.4,27.94,32.24,40.57,
25.07,19.42,22.39,18.42,46.96,23.72,26.41,26.97,36.76,40.32,
35.02,29.47,30.2,31,38.11,38.18,36.31,21.03,30.86,36.07,28.66,
29.08,37.28,15.28,24.17,22.31,30.17,25.53,19.85,35.37,44.6,17.23,
13.47,26.33,35.02,32.09,24.81,19.33,28.77,24.26,31.98,25.73,
24.86,16.28,34.51,15.23,39.72,40.8,26.06,35.76,34.76,16.13,44.04,
18.03,19.65,32.62,35.59,39.43,14.18,35.24,40.13,41.82,35.45,36.07,
43.67,24.61,20.9,21.9,18.79,27.61,27.21,26.61,29.77,20.59,27.53,
13.82,33.2,25,33.1,36.65,18.63,14.87,22.2,36.81,25.53,24.62,
26.25,18.21,28.08,19.42,29.79,32.8,35.99,28.32,27.79,35.88,29.06,
36.28,14.1,36.63,37.49,26.9,18.58,38.48,24.48,18.95,33.55,14.24,
29.04,32.51,25.63,22.22,19,32.73,15.16,13.9,27.2,32.01,29.27,33,
13.74,20.42,27.32,18.23,35.35,28.48,9.08,24.62,20.12,35.26,19.92,
31.02,16.49,12.16,30.7,31.22,34.65,13.13,27.51,33.2,31.57,14.1,
33.42,17.44,10.12,24.42,9.82,23.39,30.93,15.03,21.67,31.09,33.29,
22.61,26.89,23.48,8.38,27.81,32.35,23.84]
 Analogous to the variance
 Dot product of the two data sets' deviations
from their mean values

def covariance(x, y):
    n = len(x)
    return dot(de_mean(x), de_mean(y)) / (n - 1)

 May be hard to interpret
◦ Large positive covariance: if x is large, y tends to be large too
◦ Large negative covariance: if x is large, y tends to be small
 Divide out the standard deviation

def correlation(x, y):
    stdev_x = standard_deviation(x)
    stdev_y = standard_deviation(y)
    if stdev_x > 0 and stdev_y > 0:
        return covariance(x, y) / stdev_x / stdev_y
    else:
        return 0  # if no variation, correlation is zero

correlation(num_friends, daily_minutes)  # 0.247


 Single outstanding values can distort the plot

 Always check your data for outliers

 Maybe it is safe to ignore them?

outlier = num_friends.index(100)  # index of the outlier

num_friends_good = [x
                    for i, x in enumerate(num_friends)
                    if i != outlier]
daily_minutes_good = [x
                      for i, x in enumerate(daily_minutes)
                      if i != outlier]
 Simpson's Paradox

 Beware if boundary conditions are different
◦ "The only difference is the observation"
◦ "All else is equal"

 Confounding variables can influence the
correlation
 Always check AND UNDERSTAND your data
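
A numeric illustration with invented data, reusing the correlation() function from above: within each group the correlation is perfectly positive, yet the pooled data correlates negatively because the two groups have different baselines.

group_a_x = [1, 2, 3]
group_a_y = [10, 11, 12]  # rises with x
group_b_x = [11, 12, 13]
group_b_y = [1, 2, 3]     # also rises with x

correlation(group_a_x, group_a_y)   # 1.0 within group A
correlation(group_b_x, group_b_y)   # 1.0 within group B
correlation(group_a_x + group_b_x,
            group_a_y + group_b_y)  # negative for the pooled data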
 Correlation of zero means that there is no
linear relationship between the two variables

X = [-2, -1, 0, 1, 2]
Y = [ 2, 1, 0, 1, 2]

 have zero correlation, but they have a
relationship (which is non-linear)
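
A quick check with the correlation() function from above (here Y = |X| exactly):

X = [-2, -1, 0, 1, 2]
Y = [ 2,  1, 0, 1, 2]
correlation(X, Y)  # 0.0, although Y depends on X exactly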
 Correlation tells nothing about how large the
relationship is

X = [-2, -1, 0, 1, 2]
Y = [99.98, 99.99, 100, 100.01, 100.02]

 The data is perfectly correlated, but are you
interested in this relationship?
 Correlation is NOT causation

 If data correlates, this might be because
◦ There is an underlying relationship
◦ There is something bigger causing this behavior
(external forces)
◦ The correlation is by coincidence and means
nothing

 Conduct randomized experiments to confirm
the results
Probability
 Quantify the uncertainty that certain
events will occur

 Given an event E we describe the probability
of the event happening as P(E)
 Two events E and F are dependent if knowing
something about whether E happens gives us
information about whether F happens.
 Otherwise, they are independent

 If independent, then

P(E, F) = P(E) * P(F)


 If events E, F are not necessarily independent:

P(E|F) = P(E, F) / P(F)

P(E, F) = P(E|F) * P(F)

 P(E|F) is the probability that E happens if we
know that F happens
 Assumptions:
◦ Each child is equally likely to be a boy or a girl
◦ The gender of the second child is independent of the
gender of the first one

◦ P(B) = P(G) = 0.5

[Probability tree: from Start, the first child is Boy or Girl with P(B) = P(G) = 1/2;
each branch splits again for the second child, yielding the four outcomes
P(B,B) = P(B,G) = P(G,B) = P(G,G) = 1/4.]

 Conditional probability that both kids are girls
given that the first one is a girl:

P(both girls | first girl) = P(G,G) / P(G)

P(both girls | first girl) = ¼ / ½ = ½
 Conditional probability of two girls given that at
least one kid is a girl:

P(2girls | min1girl) = P(2girls, min1girl) / P(min1girl)

P(2girls | min1girl) = ¼ / ¾ = 1/3

 So if you know that there is at least one girl in a family with
two kids, the odds are 2:1 that the other kid is a boy.
import random

def random_kid():
    return random.choice(["boy", "girl"])

both_girls = 0
older_girl = 0
either_girl = 0
random.seed(0)
for _ in range(10000):
    younger = random_kid()
    older = random_kid()
    if older == "girl":
        older_girl += 1
    if older == "girl" and younger == "girl":
        both_girls += 1
    if older == "girl" or younger == "girl":
        either_girl += 1

print("P(both | older):", both_girls / older_girl)    # 0.514 ~ 1/2
print("P(both | either):", both_girls / either_girl)  # 0.342 ~ 1/3
 Provides a way to "reverse" a conditional
probability:

P(E, F) = P(E) * P(F|E)

P(E|F) = P(E, F) / P(F)

 Question: what is P(E|F) if we only know P(F|E)?

P(E|F) = P(E) * P(F|E) / P(F)

 P(F) can be split in two parts:

P(F) = P(F,E) + P(F,¬E)

 The probability that F happens together with E,
plus the probability that F happens while E does not

P(E|F) = P(F|E)P(E) / [P(F|E)P(E) + P(F|¬E)P(¬E)]


 A given disease affects 1 out of 10000 people
 The test gives the correct result in 99% of cases

 What is the probability that you are sick if the
test is positive?

 P(Disease|TestPositive): P(D|T)
 We know
◦ P(T|D) = 0.99 -> P(T|¬D) = 0.01
◦ P(D) = 1/10000 = 0.0001 -> P(¬D) = 0.9999

P(D|T) = P(T|D)P(D) / [P(T|D)P(D) + P(T|¬D)P(¬D)]
P(D|T) = 0.99 * 0.0001 / (0.99 * 0.0001 + 0.01 * 0.9999)
P(D|T) ≈ 0.0098, i.e. less than 1%
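
The same computation as a small Python sketch (the function name and parameter names are illustrative):

def bayes_posterior(p_t_given_d, p_d, p_t_given_not_d):
    # P(D|T) via Bayes' theorem, using the quantities from above
    p_not_d = 1 - p_d
    numerator = p_t_given_d * p_d
    return numerator / (numerator + p_t_given_not_d * p_not_d)

print(bayes_posterior(0.99, 0.0001, 0.01))  # ~0.0098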
 Uniform Distribution
◦ Equal weight on all numbers between 0 and 1
◦ -> the weight of any single point is 0!

 Better representation using the Probability
Density Function (pdf)
 Cumulative Distribution Function (cdf):

cdf(x) = 0 for x < 0
cdf(x) = x for 0 <= x <= 1
cdf(x) = 1 for x > 1
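
A minimal sketch of both functions for the uniform distribution on [0, 1):

def uniform_pdf(x):
    # constant density 1 on [0, 1), 0 elsewhere
    return 1 if 0 <= x < 1 else 0

def uniform_cdf(x):
    """probability that a uniform random variable is <= x"""
    if x < 0:
        return 0
    elif x < 1:
        return x
    else:
        return 1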
 Bell-curve distribution

f(x | µ, σ) = 1 / √(2πσ²) · exp( -(x - µ)² / (2σ²) )
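
Transcribed directly into Python, the density might look like this sketch:

import math

def normal_pdf(x, mu=0, sigma=1):
    # density of the normal distribution with mean mu and standard deviation sigma
    sqrt_two_pi = math.sqrt(2 * math.pi)
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sqrt_two_pi * sigma)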
 Cumulative Distribution Function
 Sometimes it is necessary to find the x for a
given probability
 There is no simple closed form for that, but we
can use binary search to find the value
(see the sketch below)

 http://en.wikipedia.org/wiki/Binary_search_algorithm
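
One possible sketch of the binary search, assuming the normal cdf is computed via math.erf; the tolerance and search bounds are illustrative choices:

import math

def normal_cdf(x, mu=0, sigma=1):
    return (1 + math.erf((x - mu) / (math.sqrt(2) * sigma))) / 2

def inverse_normal_cdf(p, mu=0, sigma=1, tolerance=0.00001):
    """find an approximate inverse of the normal cdf via binary search"""
    # reduce to the standard normal case
    if mu != 0 or sigma != 1:
        return mu + sigma * inverse_normal_cdf(p, tolerance=tolerance)
    low_z, hi_z = -10.0, 10.0  # normal_cdf(-10) ~ 0, normal_cdf(10) ~ 1
    while hi_z - low_z > tolerance:
        mid_z = (low_z + hi_z) / 2
        if normal_cdf(mid_z) < p:
            low_z = mid_z  # midpoint too low, search the upper half
        else:
            hi_z = mid_z   # midpoint too high, search the lower half
    return low_z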
Filters & Conversions
 Data values might differ hugely in magnitude
 For some applications the relation between the
values is more interesting than their absolute size

 Rescaling to percent of (MIN, MAX):

scaled = (value - min) / (max - min)
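
As a list-based sketch (the function name is illustrative):

def rescale_min_max(values):
    # map each value into [0, 1] relative to the data's min and max
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

rescale_min_max([10, 15, 20, 25])  # [0.0, 0.333..., 0.666..., 1.0]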
 To get rid of extremes you can use the
quantile / percentile function

 Find high/low extremes and cut them off,
then re-evaluate (see the sketch below)
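
One possible sketch, reusing the quantile() function defined earlier; the 5% / 95% cutoffs are arbitrary:

def trim_outliers(values, lower=0.05, upper=0.95):
    # drop values outside the [lower, upper] quantile band, then re-evaluate
    lo = quantile(values, lower)
    hi = quantile(values, upper)
    return [v for v in values if lo <= v <= hi]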
 Simulation of a Galton board

 Write a program which simulates
a Galton board with 10 levels.
The probability of deviation at
each level can be parameterized;
when using p = q = 0.5 a
Gaussian distribution is expected
 With the program, run experiments for at least
3 different p/q settings (p + q = 1)
 Use different numbers of marbles
(1E2, 1E4, 1E6, 1E8 and 1E10)

 Document the results in histograms and graphs


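A minimal sketch of the simulation loop (parameter names are illustrative; for the largest marble counts one would sample a binomial distribution directly instead of looping over single marbles):

import random
from collections import Counter

def galton_board(num_marbles, p=0.5, levels=10):
    # each marble deviates to the right with probability p at each of the
    # `levels` pegs; its final bin is the number of right-deviations (0..levels)
    bins = Counter()
    for _ in range(num_marbles):
        bin_index = sum(1 for _ in range(levels) if random.random() < p)
        bins[bin_index] += 1
    return bins

print(galton_board(10000))         # roughly bell-shaped for p = 0.5
print(galton_board(10000, p=0.7))  # skewed toward the higher bins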