This document discusses histograms and stem-and-leaf plots for analyzing and visualizing the distribution of a single set of numerical data. It provides examples using yearly precipitation data from New York City to demonstrate how to create histograms and stem-and-leaf plots in R. Histograms partition data into bins to show the frequency or relative frequency of observations in each bin, while stem-and-leaf plots list the "stems" and "leaves" of values to show their distribution.
2. Graphics for a Single Set of Numbers
• The techniques of this lecture apply in the following situation:
– We will assume that we have a single collection of numerical values.
– The values in the collection are all observations or measurements of a common type.
• It is very common in statistics to have a set of values like this.
• Such a situation often results from taking numerical measurements on items obtained by random sampling from a larger population.
3. Example: Yearly Precipitation in New York City
The following table shows the number of inches of (melted) precipitation, yearly, in New York City (1869-1957).
43.6 37.8 49.2 40.3 45.5 44.2 38.6 40.6 38.7 46.0
37.1 34.7 35.0 43.0 34.4 49.7 33.5 38.3 41.7 51.0
54.4 43.7 37.6 34.1 46.6 39.3 33.7 40.1 42.4 46.2
36.8 39.4 47.0 50.3 55.5 39.5 35.5 39.4 43.8 39.4
39.9 32.7 46.5 44.2 56.1 38.5 43.1 36.7 39.6 36.9
50.8 53.2 37.8 44.7 40.6 41.7 41.4 47.8 56.1 45.6
40.4 39.0 36.1 43.9 53.5 49.8 33.8 49.8 53.0 48.5
38.6 45.1 39.0 48.5 36.7 45.0 45.0 38.4 40.8 46.9
36.2 36.9 44.4 41.5 45.2 35.6 39.9 36.2 36.5
The annual rainfall in Auckland is 47.17 inches, so this is quite comparable.
4. Data Input
As always, the first step in examining a data set is to enter the values into the computer. The R functions scan or read.table can be used (a sketch follows the data entry below), or the values can be entered directly.
> rain.nyc =
c(43.6, 37.8, 49.2, 40.3, 45.5, 44.2, 38.6, 40.6, 38.7,
46.0, 37.1, 34.7, 35.0, 43.0, 34.4, 49.7, 33.5, 38.3,
41.7, 51.0, 54.4, 43.7, 37.6, 34.1, 46.6, 39.3, 33.7,
40.1, 42.4, 46.2, 36.8, 39.4, 47.0, 50.3, 55.5, 39.5,
35.5, 39.4, 43.8, 39.4, 39.9, 32.7, 46.5, 44.2, 56.1,
38.5, 43.1, 36.7, 39.6, 36.9, 50.8, 53.2, 37.8, 44.7,
40.6, 41.7, 41.4, 47.8, 56.1, 45.6, 40.4, 39.0, 36.1,
43.9, 53.5, 49.8, 33.8, 49.8, 53.0, 48.5, 38.6, 45.1,
39.0, 48.5, 36.7, 45.0, 45.0, 38.4, 40.8, 46.9, 36.2,
36.9, 44.4, 41.5, 45.2, 35.6, 39.9, 36.2, 36.5)
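If the values were instead stored in a text file, scan could read them in directly. A minimal sketch, assuming a file rain.txt (a hypothetical name) containing the values separated by whitespace:
> # read whitespace-separated numeric values from a text file
> rain.nyc <- scan("rain.txt")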
5. Plots for a Collection of Numbers
• Often we have no idea what features a set of numbers may exhibit.
• Because of this it is useful to begin examining the values with very general purpose tools.
• In this lecture we’ll examine such general purpose tools.
• If the number of values to be examined is not too large, stem and leaf plots can be useful.
7. Stem-and-Leaf Plots
> stem(rain.nyc, scale = 0.5)
The decimal point is 1 digit(s) to the right of the |
3 | 344444
3 | 55666667777777888889999999999
4 | 000000011112222334444444
4 | 55555666677778999
5 | 0000113344
5 | 666
The argument scale = 0.5 is used above to compress the scale of the plot. Values of scale greater than 1 can be used to stretch the scale. (It only makes sense to use values of scale which are 1, 2, or 5 times a power of 10.) To read the plot: since the decimal point is one digit to the right of the bar, the first row, 3 | 344444, represents values that round to 33, 34, 34, 34, 34, and 34.
8. Stem-and-Leaf Plots
• Stem and leaf plots are very “busy” plots, but they show a number of data features.
– The location of the bulk of the data values.
– Whether there are outliers present.
– The presence of clusters in the data.
– Skewness of the distribution of the data.
• It is possible to retain many of these good features in a less “busy” kind of plot.
9. Histograms
• Histograms provide a way of viewing the general distribution of a set of values.
• A histogram is constructed as follows (a sketch of the counting step appears below):
– The range of the data is partitioned into a number of non-overlapping “cells”.
– The number of data values falling into each cell is counted.
– The observations falling into a cell are represented as a “bar” drawn over the cell.
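The first two steps can be carried out directly with cut and table. A minimal sketch, with one arbitrary choice of cells:
> # partition the range into cells and count the values in each
> cells <- cut(rain.nyc, breaks = seq(30, 60, by = 5))
> table(cells)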
10. Types of Histogram
Frequency Histograms
The height of the bars in the histogram gives the number of observations which fall in the cell.
Relative Frequency Histograms
The area of the bars gives the proportion of observations which fall in the cell.
Warning
Drawing frequency histograms when the cells have different widths misrepresents the data.
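R guards against this: when the breakpoints are not equally spaced, hist defaults to the relative frequency (density) scale, so that bar areas rather than heights give the proportions. A minimal sketch with deliberately unequal cells:
> # unequal cell widths: hist switches to the density scale by default
> hist(rain.nyc, breaks = c(30, 35, 40, 45, 55, 60),
+      main = "Unequal cells",
+      xlab = "Precipitation in Inches")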
11. Histograms in R
• The R function which draws histograms is called hist.
• The hist function can draw either frequency or relative frequency histograms and gives full control over cell choice.
• The simplest use of hist produces a frequency histogram with a default choice of cells.
• The function chooses approximately log2 n cells which cover the range of the data and whose end-points fall at “nice” values (see the check below).
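This default cell count corresponds to Sturges’ rule, which R exposes as the helper nclass.Sturges. A minimal check (the comment reflects my own arithmetic, not output from the slides):
> # Sturges’ rule: ceiling(log2(n) + 1) cells;
> # for the n = 89 precipitation values this suggests 8 cells
> nclass.Sturges(rain.nyc)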
12. Example: Simple Histograms
Here are several examples of drawing histograms with R.
(1) The simplest possible call.
> hist(rain.nyc,
main = "New York City Precipitation",
xlab = "Precipitation in Inches" )
(2) An explicit setting of the cell breakpoints.
> hist(rain.nyc, breaks = seq(30, 60, by=2),
main = "New York City Precipitation",
xlab = "Precipitation in Inches")
(3) A request for approximately 20 bars.
> hist(rain.nyc, breaks = 20,
main = "New York City Precipitation",
xlab = "Precipitation in Inches" )
13. New York City Precipitation
[Figure: frequency histogram of the NYC precipitation; x-axis “Precipitation in Inches” (30 to 60), y-axis “Frequency” (0 to 30).]
14. New York City Precipitation
[Figure: frequency histogram of the NYC precipitation; x-axis “Precipitation in Inches” (35 to 55), y-axis “Frequency” (0 to 8).]
15. Example: Histogram Options
Optional arguments can be used to customise histograms.
> hist(rain.nyc, breaks = seq(30, 60, by=3),
prob = TRUE, las = 1, col = "lightgray",
main = "New York City Precipitation",
xlab = "Precipitation in Inches")
The following options are used here.
1. prob=TRUE makes this a relative frequency histogram.
2. col="lightgray" colours the bars light gray.
3. las=1 rotates the y axis tick labels.
16. New York City Precipitation
[Figure: relative frequency histogram of the NYC precipitation; x-axis “Precipitation in Inches” (30 to 60), y-axis “Density” (0.00 to 0.08).]
17. Histograms and Perception
1. Information in histograms is conveyed by the heights of the bar tops.
2. Because the bars all have a common base, the encoding is based on “position on a common scale.”
18. Comparison Using Histograms
• Sometimes it is useful to compare the distribution of the values in two or more sets of observations.
• There are a number of ways in which it is possible to make such a comparison.
• One common method is to use “back to back” histograms.
• This is often used to examine the structure of populations broken down by age and gender.
• These are referred to as “population pyramids.”
19. New Zealand Population (1996 Census)
[Figure: back-to-back population pyramid; males on the left, females on the right, age groups 0-4 up to 95+ on the vertical axis, and “Percent of Population” (4 to 0 to 4) on the horizontal axis.]
20. Back to Back Histograms and Perception
• Comparisons within either the “male” or “female” sides of this graph are made on a “common scale.”
• Comparisons between the male and female sides of the graph must be made using length, which does not work as well as position on a common scale.
• A better way of making this comparison is to superimpose the two histograms.
• Since it is only the bar tops which are important, they are the only thing which needs to be drawn (see the sketch below).
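A minimal sketch of this idea, assuming two hypothetical samples x and y; the breakpoints brk must span the range of both samples, and the density scale keeps samples of different sizes comparable:
> # compute the two histograms without plotting them,
> # then draw only the bar tops as step functions
> brk <- seq(30, 60, by = 2)
> hx <- hist(x, breaks = brk, plot = FALSE)
> hy <- hist(y, breaks = brk, plot = FALSE)
> plot(range(brk), range(0, hx$density, hy$density), type = "n",
+      xlab = "Value", ylab = "Density")
> lines(hx$breaks, c(hx$density, 0), type = "s")
> lines(hy$breaks, c(hy$density, 0), type = "s", lty = 2)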
21. New Zealand Population − 1996
[Figure: superimposed male and female population histograms drawn as bar-top outlines; x-axis “Age” (0 to 100), y-axis “% of population” (0 to 4), with separate lines labelled Male and Female.]
22. Superposition and Perception
• Superimposing one histogram on another works quite well.
• The separate histograms provide a good way of examining the distribution of values in each sample.
• Comparison of two (or more) distributions is easy.
23. The Effect of Cell Choice
• Histograms are very sensitive to the choice of cell boundaries.
• We can illustrate this by drawing a histogram for the NYC precipitation with two different choices of cells:
– seq(31, 57, by=2)
– seq(32, 58, by=2)
• These different choices of cell boundaries produce quite different looking histograms, as the calls below show.
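The two versions can be drawn side by side for comparison. A minimal sketch (the par call splits the plotting region; the main titles here are just labels for the comparison):
> # the same data with two shifted sets of cell boundaries
> par(mfrow = c(1, 2))
> hist(rain.nyc, breaks = seq(31, 57, by = 2),
+      main = "Cells starting at 31",
+      xlab = "Precipitation in Inches")
> hist(rain.nyc, breaks = seq(32, 58, by = 2),
+      main = "Cells starting at 32",
+      xlab = "Precipitation in Inches")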
26. The Inherent Instability of Histograms
• The shape of a histogram depends on the particular set of histogram cells chosen to draw it.
• This suggests that there is a fundamental instability at the heart of its construction.
• To illustrate this we’ll look at a slightly different way of drawing histograms.
• For an ordinary histogram, the height of each histogram bar provides a measure of the density of data values within the bar.
• This notion of data density is very useful and worth generalising.
27. Single Bar Histograms
• We can use a single histogram cell, centred at a point x and having width w, to estimate the density of data values near x.
• By moving the cell across the range of the data values we will get an estimate of the density of the data points throughout the range of the data.
28. Single Bar Histograms
• The area of the bar gives the proportion of data values which fall in the cell.
• The height, h(x), of the bar provides a measure of the density of points near x.
[Figure: a single histogram bar of width w centred at x; its height h(x) is the density estimate at x.]
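Since the bar’s area is the proportion of values falling in the cell, its height is h(x) = #{i : |x_i − x| ≤ w/2} / (nw). This sliding-cell estimate is easy to compute directly; a minimal sketch, where the cell width w and the grid of evaluation points are arbitrary choices:
> # sliding single-cell density estimate:
> # h(x) = (number of values within w/2 of x) / (n * w)
> w <- 4
> xs <- seq(30, 60, by = 0.1)
> h <- sapply(xs, function(x)
+     sum(abs(rain.nyc - x) <= w/2) / (length(rain.nyc) * w))
> plot(xs, h, type = "l",
+      xlab = "Precipitation in Inches", ylab = "Density")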
30. Stability
• The basic idea of computing and drawing the density of the data points is a good one.
• It seems, however, that using a sliding histogram cell is not a good way of producing a density estimate.
• In the next lecture we’ll look at a way of producing a more stable density estimate.
• This will be our preferred way to look at the distribution of a set of data.