GNGTS 2015
SESSIONE 3.2
CONTRIBUTION OF THE CLUSTER ANALYSIS OF HVSR DATA
FOR NEAR SURFACE GEOLOGICAL RECONSTRUCTION
P. Capizzi1, R. Martorana1, A. D’Alessandro2, D. Luzio1
1 Dip. Scienze della Terra e del Mare, Università di Palermo, Italy
2 Istituto Nazionale di Geofisica e Vulcanologia, Centro Nazionale Terremoti, Italy
Introduction. In many cases (Bonnefoy-Claudet et al., 2006) the HVSR technique makes it
possible to obtain a detailed reconstruction of the top of the seismic bedrock (Di Stefano et
al., 2014) and to identify areas with similar seismic behaviour. Theoretical considerations
(Nakamura, 1989) and experimental tests have shown that the amplification of horizontal motion
between the bottom and the top of a sedimentary cover is closely related to the ratio between the
spectra of the horizontal and vertical components of the ground velocity (Nakamura, 2000). If the
contributions of Love and body waves are neglected, this ratio is a measure of the ellipticity of
Rayleigh wave polarization. Assuming that the subsoil can be represented as a stack of homogeneous
horizontal layers, and imposing some geometric and/or physical constraints, it is possible to
estimate the parameters of the shear wave velocity model (Fäh et al., 2003; Parolai et al., 2000).
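As a minimal sketch of the spectral ratio described above, the following Python fragment computes an H/V curve from three amplitude spectra and picks its highest peak. The function names, the sample spectra and the merging of the two horizontal components by quadratic mean are assumptions made for illustration; the text does not state which combination rule was used.

```python
import math

def hvsr(ns, ew, v):
    """Return the H/V ratio per frequency bin from three amplitude spectra.

    ns, ew: horizontal spectra; v: vertical spectrum (all values > 0).
    """
    if not (len(ns) == len(ew) == len(v)):
        raise ValueError("spectra must have the same number of frequency bins")
    ratio = []
    for n, e, z in zip(ns, ew, v):
        # Quadratic mean of the horizontals: an assumed convention.
        h = math.sqrt((n * n + e * e) / 2.0)
        ratio.append(h / z)
    return ratio

def highest_peak(freqs, ratio):
    """Return (frequency, amplitude) of the largest H/V value."""
    i = max(range(len(ratio)), key=ratio.__getitem__)
    return freqs[i], ratio[i]

# Hypothetical spectra sampled at four frequencies (Hz).
freqs = [0.5, 1.0, 2.0, 4.0]
ns = [1.0, 3.0, 1.0, 1.0]
ew = [1.0, 3.0, 1.0, 1.0]
v = [1.0, 1.0, 1.0, 2.0]
f0, a0 = highest_peak(freqs, hvsr(ns, ew, v))
```

In real use the spectra would be smoothed and averaged over many time windows before any peak is picked.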
The integration of HVSR data with active techniques based on the analysis of surface waves
can greatly reduce the uncertainties of the interpretative models.
Because the inversion of HVSR curves assumes a one-dimensional distribution of Vs, before
inverting the data we used a cluster analysis technique to subdivide them into subsets attributable
to areas with low horizontal velocity gradients and therefore similar seismic responses. The data
of each cluster were then interpreted by imposing conditions of maximum similarity between
the 1D models relating to each measurement point.
Clustering methods are widely used in different research fields (Hartigan, 1975; Adelfio et
al., 2012; D’Alessandro et al., 2013). In general, cluster analysis is a good tool whenever
a large amount of information has to be classified into meaningful and manageable groups.
A modified centroid-based algorithm has been applied to HVSR datasets acquired for seismic
microzonation studies in various Sicilian urban centers (Capizzi et al., 2014). The results
obtained for the towns of Modica and Enna are shown. The HVSR data were first processed to
extract the frequency and amplitude of the peaks, using a code based on the clustering of HVSR
curves determined in sliding time windows (D’Alessandro et al., 2014).
The cluster analysis. Cluster analysis is a procedure that identifies, within a set of objects,
subsets called clusters that tend to be internally homogeneous according to some criteria. The
statistical units are divided into a number of groups according to their level of similarity
(internal cohesion), evaluated from the values that a number of chosen variables take in each
unit. Generally, grouping analysis does not require an interpretative model to be defined in
advance (Fabbris, 1983). The partition is successful if the objects within a cluster are closer
to each other than to objects in different clusters (Barbarito, 1999).
Many clustering algorithms exist (Gan et al., 2007; Everitt et al., 2011); they can be categorized
into two main types: Hierarchical Clustering (HC) and Non-Hierarchical Clustering (NHC).
HC methods have several advantages over NHC methods. They are explorative, so the number of
clusters does not need to be defined a priori. They work with a measure of proximity between the
objects to be grouped, and a type of proximity can be chosen that suits the subject studied and
the nature of the data. One of the results of HC is the dendrogram, which shows the progressive
grouping of the data and makes it easy to gain an idea of a suitable number of classes into which
the data can be grouped.
To evaluate the differences between the various clustering techniques, which can produce
significantly different results, the best way is to assess how well each technique reproduces
the structure of known data. These assessments are typically performed on simulated data; they
are often difficult to interpret and may be contradictory.
The factors that seem to influence the results of this analysis most are:
- shape, size (absolute and relative) and number of clusters;
- the presence of outliers;
- the level of overlap between clusters;
- the type of measure of similarity / distance chosen.
Various studies (Rand, 1971; Ohsumi, 1980) suggest that different grouping strategies
often lead to similar results, while others highlight specific cases of strong divergence
(Everitt et al., 2011; Fabbris, 1983). However, the criteria for choosing between the two types
of algorithm (Hierarchical Clustering and Non-Hierarchical Clustering) have not yet been
sufficiently explored, and the literature contains very different positions. In any case, the
criteria suggested by the authors include objectivity, meaning that researchers working
independently on the same set of data must arrive at the same results, and stability of the
partition when the analysis is applied to equivalent data (Silvestri and Hill, 1964).
In practice, methods that are less sensitive to small changes in the data should be preferred.
For example, it is considered important that the partition changes little when an individual is
removed from the analysis (of course, the elimination of an outlier produces greater variations
within groups), or that the structure of the other branches remains unchanged, or almost so, when
the analysis is repeated without an entire branch of the dendrogram.
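One way to quantify how little a partition changes is the index proposed by Rand (1971), which counts the pairs of units on which two partitions agree. The fragment below is an illustrative sketch only; the function name and the sample labels are hypothetical and are not taken from the data of this study.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of unit pairs treated the same way by two partitions.

    A pair counts as an agreement when both partitions put the two units
    in the same cluster, or both put them in different clusters.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both partitions must label the same units")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        raise ValueError("at least two units are required")
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)

# Hypothetical labels of six units from two clustering runs.
run_1 = [0, 0, 1, 1, 2, 2]
run_2 = [1, 1, 0, 0, 0, 2]
similarity = rand_index(run_1, run_2)
```

The index equals 1 when the two partitions coincide up to a relabelling of the clusters, so it can be evaluated on the units common to the full analysis and to a repetition with one unit removed.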
Broadly, we can say that, when seeking groups of statistical units characterized by high internal
consistency, hierarchical techniques are less effective than non-hierarchical ones.
Cluster analysis of HVSR data. Many HVSR data sets were acquired for seismic microzonation
studies in various Sicilian urban centers. After many tests to assess the best clustering
techniques for our datasets and purposes, we chose to apply an AHC (Agglomerative
Hierarchical Clustering) algorithm to select the HVSR curves determined in sliding time
windows, from which the frequency and amplitude of the peaks are extracted (D’Alessandro et
al., 2014), and a non-hierarchical centroid-based algorithm to group peaks attributable to the
same seismic surface.
The choice of the AHC is motivated by the fact that HC methods are explorative: they do not
require the number of clusters to be defined a priori, and they allow the use of any proximity
measure considered suitable for the data. In HC the process of agglomeration or separation is
carried out on the basis of a proximity measure and a linkage criterion. The proximity between
two objects expresses how similar (similarity) or dissimilar (dissimilarity) they are. Several
proximity measures have been proposed in the literature to quantify the similarity/dissimilarity
between different types of objects (Gan et al., 2007; Everitt et al., 2011). Clearly, the choice
of proximity measure must respect specific criteria and should be made on the basis of the main
aims of the clustering (Gan et al., 2007; Everitt et al., 2011).
The selection of time windows for the best estimate of the average HVSR curve is generally done
by visual inspection of the HVSR curves as a function of time. Starting from the full-length
records, the HVSR curves are determined in consecutive time windows of appropriate length. Time
windows whose HVSR curves appear “anomalous” at a simple visual inspection are generally
deleted and therefore not included in the calculation of the average HVSR curve. It is often very
difficult to identify the correct time windows to be used for the calculation of the mean HVSR.
The lack of a non-arbitrary selection criterion makes the result clearly operator dependent and
therefore not optimal.
To overcome this problem we applied the AHC to our data, using as proximity measure the
Standard Correlation (SCxy) between pairs of HVSR curves, where xi and yi indicate the values
of the spectral ratios at the i-th frequency for a generic pair of analysis windows. The main
output of the hierarchical clustering is the dendrogram, which shows the progressive grouping
of the data (Fig. 1). The clusters are selected by cutting the dendrogram at a specific level of
similarity/dissimilarity. In this application the cutting level was chosen on the basis of
criteria that involve the maximum slope detectable in the bar chart of the levels and the width
of the gap between two successive levels of the hierarchy in the dendrogram (Gan et al., 2007;
Everitt et al., 2011).
Fig. 1 – Dendrogram relative to HVSR curves determined in sliding time windows.
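The window-selection step described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code of D’Alessandro et al. (2014): the similarity is taken to be the Pearson correlation between two HVSR curves, the dissimilarity is one minus that value, the linkage is the average one, and the dendrogram is cut at the widest gap between successive merge levels. All function names and the sample curves are hypothetical.

```python
import math

def correlation(x, y):
    """Pearson correlation of two equally sampled curves (assumed measure)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)  # undefined for a perfectly flat curve

def agglomerate(curves):
    """Average-linkage AHC; returns [(merge level, partition after merge)]."""
    n = len(curves)
    d = [[1.0 - correlation(curves[i], curves[j]) for j in range(n)]
         for i in range(n)]
    clusters = [[i] for i in range(n)]
    history = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Mean dissimilarity over all cross-cluster pairs.
                link = sum(d[i][j] for i in clusters[a] for j in clusters[b])
                link /= len(clusters[a]) * len(clusters[b])
                if best is None or link < best[0]:
                    best = (link, a, b)
        level, a, b = best
        merged = sorted(clusters[a] + clusters[b])
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
        history.append((level, [list(c) for c in clusters]))
    return history

def cut_at_largest_gap(history):
    """Return the partition that precedes the widest jump in merge level."""
    if len(history) < 2:
        raise ValueError("need at least three curves to compare merge levels")
    levels = [level for level, _ in history]
    gaps = [levels[k + 1] - levels[k] for k in range(len(levels) - 1)]
    k = max(range(len(gaps)), key=gaps.__getitem__)
    return sorted(history[k][1])

# Hypothetical HVSR curves from five time windows, sampled at five frequencies.
windows = [
    [1.0, 2.0, 5.0, 2.0, 1.0],
    [1.0, 2.2, 4.8, 2.1, 1.0],
    [1.1, 2.0, 5.2, 1.9, 1.0],
    [5.0, 2.0, 1.0, 1.0, 1.0],
    [4.8, 2.1, 1.0, 1.1, 1.0],
]
partition = cut_at_largest_gap(agglomerate(windows))
```

With these sample curves the first three windows end up in one group and the last two in another; which group is retained for the average HVSR curve remains a choice of the analyst.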
After defining the average HVSR curves, a second multi-parametric clustering procedure was
used to group peaks attributable to the same origin (stratigraphic, tectonic, topographic,
anthropogenic or other sources). A non-hierarchical centroid-based algorithm was implemented
(Capizzi et al., 2014). This clustering is carried out in order to delineate the areas within
which it is possible to assume a continuous trend of the parameters used to describe the subsoil
and of the seismic response of the medium. Hypotheses on the cause of the HVSR peaks are
essential for extracting reliable information on the subsurface from this kind of data (Capizzi
et al., 2014; Di Stefano et al., 2014; Martorana et al., 2014).
In centroid-based methods, clusters are represented by a central vector, which is not
necessarily a member of the data set. When the number of clusters is fixed, the clustering
can be formally regarded as an optimization problem: find the cluster centers and assign each
object to a cluster such that the distances of the objects from their cluster centroid are
minimized. In practice the algorithm iterates two steps: each object is assigned to the nearest
centroid, and the centroids are then recomputed as the means of the observations in the new
clusters. The algorithm converges to a (local) optimum when the assignments no longer change;
there is no guarantee that the global optimum is found.
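The two alternating steps can be illustrated with a generic centroid-based iteration. The sketch below uses a plain squared Euclidean distance and user-supplied initial centroids; it is not the modified algorithm of Capizzi et al. (2014), and the function names and sample points are hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid_clustering(points, centroids, max_iter=100):
    """Alternate assignment and update steps until the labels stop changing."""
    centroids = [list(c) for c in centroids]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(len(centroids)),
                key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments unchanged: a (local) optimum is reached
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical peaks described by (frequency in Hz, amplitude).
peaks = [(1.0, 2.0), (1.2, 2.2), (0.8, 1.8), (5.0, 4.0), (5.2, 4.2), (4.8, 3.8)]
labels, centers = centroid_clustering(peaks, [peaks[0], peaks[3]])
```

Different initial centroids can lead to different final partitions, which is why the result should be regarded as a local optimum.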
The number of clusters, the presence of outliers and the type of parameters used for the
distance measures are the factors that most affect the results of cluster analysis.
Centroid-based algorithms generally require the number k of clusters and the initial centroid
coordinates to be specified in advance. This is considered one of the biggest drawbacks of
these algorithms, because an inappropriate choice of k may yield poor results. Indeed, it is
hard to choose k when external constraints are missing.
The proposed algorithm does not fix the number of clusters and automatically chooses, for
each possible value of k, the initial centroids from the data set. The distance of each unit from
the initial centroids, and from those obtained after each iteration, was calculated as the
weighted sum of the normalized Euclidean distances of all the variables considered: coordinates
(x, y and z), frequency (f), amplitude (A) and lithology (L), with weights a, b, c and d. The
choice of the weights and of the optimal number k of classes was made by maximizing the R
parameter, which takes into account the intra-cluster (DEVIN) and inter-cluster (DEVOUT)
variances. However, the use of a priori information, especially stratigraphic data, is
fundamental for the choice of the number of partitions. Generally, experimental tests showed
that different weights should be used to identify inclined or sub-horizontal seismic
discontinuities.
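The weighted distance and the selection criterion can be sketched as follows. The exact expressions are not reproduced in the text above, so every formula in this fragment is an assumption: one weight is applied to each of the four groups of variables (coordinates, frequency, amplitude, lithology), each term is normalized by a user-supplied scale, lithology contributes 0 or 1 depending on whether two units share the same class, and R is taken as the share of inter-cluster deviance in the total deviance. Function names, dictionary keys and sample values are hypothetical.

```python
import math

def weighted_distance(u, v, weights, scales):
    """Weighted sum of normalized distances (assumed form).

    u, v: dicts with keys 'xyz', 'f', 'A', 'L'.
    weights: (a, b, c, d); scales: normalizing ranges for 'xyz', 'f', 'A'.
    """
    a, b, c, d = weights
    d_xyz = math.dist(u["xyz"], v["xyz"]) / scales["xyz"]
    d_f = abs(u["f"] - v["f"]) / scales["f"]
    d_amp = abs(u["A"] - v["A"]) / scales["A"]
    d_lith = 0.0 if u["L"] == v["L"] else 1.0  # assumed categorical mismatch
    return a * d_xyz + b * d_f + c * d_amp + d * d_lith

def deviance_ratio(values, labels):
    """Assumed R: DEVOUT / (DEVIN + DEVOUT) for a single variable."""
    mean = sum(values) / len(values)
    dev_in = dev_out = 0.0
    for k in set(labels):
        group = [x for x, lab in zip(values, labels) if lab == k]
        g_mean = sum(group) / len(group)
        dev_in += sum((x - g_mean) ** 2 for x in group)   # within clusters
        dev_out += len(group) * (g_mean - mean) ** 2      # between clusters
    return dev_out / (dev_in + dev_out)  # undefined if all values are equal

# Hypothetical measurement points.
p = {"xyz": (0.0, 0.0, 0.0), "f": 1.0, "A": 2.0, "L": "clay"}
q = {"xyz": (3.0, 4.0, 0.0), "f": 2.0, "A": 4.0, "L": "sand"}
dist = weighted_distance(p, q, (1.0, 1.0, 1.0, 1.0),
                         {"xyz": 10.0, "f": 4.0, "A": 4.0})

# Hypothetical peak frequencies with a compact and a scattered partition.
freqs = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
r_compact = deviance_ratio(freqs, [0, 0, 0, 1, 1, 1])
r_scattered = deviance_ratio(freqs, [0, 1, 0, 1, 0, 1])
```

Under these assumptions a search over k and over the weights would keep the combination giving the largest R, subject to the stratigraphic constraints mentioned above.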
Application cases. During 2012, as part of an agreement with the Italian Department of
Civil Protection, an expeditious seismic microzonation was performed in 20 municipalities
of Eastern Sicily considered to be at high seismic hazard. In addition to collecting all
previous geological and geophysical data, we performed a passive seismic campaign to
determine the resonance frequency of the investigated sites by means of the Nakamura technique
(Nakamura, 1989). The HVSR cluster analyses were applied to the HVSR data acquired in the
towns of Modica and Enna. Multichannel Analysis of Surface Waves (MASW) data were acquired
in Enna to constrain the interpretative models and the velocity of shear waves in the subsurface.
HVSR data were inverted using similar starting models for each cluster. The 1D seismic models
were calculated using the code of Lunedei and Albarello (2009), which is based on the assumption
that environmental noise is composed of the superposition of random multimodal plane waves
moving in all directions at the surface of the Earth and propagating as Rayleigh and Love
waves. Since body waves are not considered, this assumption is realistic only if the sources are
located far enough from the receiver. Consequently, all the time windows of the signal showing
noise suspected to be caused by near sources must be removed.
The clusters of HVSR peaks were used to define the seismic layers, each characterized
by a specific range of seismic velocities, and to associate them with the known geological
formations. The inversion models of the different partitions obtained using the centroid-based
algorithm were superimposed on the geological map of the analysed sites to identify possible
correlations with geology and topography.
In the case of Modica the best clustering result appears to be a partition into three groups,
whereas for the Enna site the cluster analysis converges to five groups (Fig. 2). In both cases
the map of the depth of the seismic bedrock (Fig. 3) and a 3D model of the seismic surfaces were
reconstructed.
Discussion. A typical criticism of cluster analysis is that it arrives at indeterminate
solutions, which depend on arbitrary decisions about the initial information and on a subjective
interpretation of the results, and which are not statistically verifiable. Unlike other
statistical procedures, cluster analysis is often used when there are no a priori hypotheses or
in the exploratory phase of an analysis. However, although cluster analysis is essentially an
exploratory method, its application should be preceded and accompanied by the definition of
interpretative models.
The time windows suitable for the determination of the mean HVSR are generally identified
arbitrarily by the operator through a simple visual inspection of the microtremor signals
in the time or spectral domain. This can lead to an incorrect determination of the mean HVSR
curves and to an incorrect interpretation of the main peaks. An automatic procedure based on
cluster analysis has been implemented to determine the appropriate windows to be used for the
average HVSR curve. This procedure allowed us to easily separate the HVSR curves and peaks
mainly linked to site effects from those mainly related to source effects. The analysis of the
HVSR curves as a function of azimuth proved to be a useful tool for the characterization and
discrimination of the major peaks identified on the HVSR curves.
Fig. 2 – Cluster results partition for Enna town. Clusters are reported using different colored circles.
Fig. 3 – Depth of seismic bedrock (left) and seismic sections (right; yellow: seismic soil, green: bedrock).
Furthermore, the use of cluster analysis techniques in noise processing made it possible to
recognize groups of similar HVSR curves and to relate them to similar interpretative seismic
models. In many cases the HVSR cluster analysis applied to the microzonation studies of some
Sicilian towns gave good results, allowing peaks attributable to the same seismic structures
to be grouped. The comparison of the HVSR pattern with the information on outcropping
formations made it possible to assess the geological hypotheses for the heavily urbanized
areas investigated.
However, the results obtained underline that the most appropriate clustering algorithm for a
particular problem often needs to be chosen empirically, and that the choice of the partition is
strongly linked to the choice of the weights used to calculate the distance and to the geological
and stratigraphic knowledge of the area.
Finally, HVSR inversion proves to be a valuable method, when used in accordance with certain
acquisition and processing criteria, not only for estimating site amplification effects but also
for assessing the main geological and tectonic structures that define the seismic bedrock and
the cover deposits.
References
Adelfio G., Chiodi M., D’Alessandro A., Luzio D., D’Anna G. and Mangano G.; 2012: Simultaneous seismic wave
clustering and registration. Computers & Geosciences, 44, 60-69, doi: 10.1016/j.cageo.2012.02.017.
Barbarito L.; 1999: L’analisi di settore: metodologia e applicazioni. Franco Angeli, Milano.
Bonnefoy-Claudet S., Cotton F. and Bard, P.-Y.; 2006: The nature of noise wavefield and its applications for site effects
studies. A literature review. Earth-Science Reviews, 79, 3-4, 205-227. doi: 10.1016/j.earscirev.2006.07.004.
Capizzi P., Martorana R., Stassi G., D’Alessandro A., Luzio D.; 2014: Centroid-based cluster analysis of HVSR
data for seismic microzonation. Near Surface Geoscience 2014 - 20th European Meeting of Environmental and
Engineering Geophysics, Athens, 14-18 September 2014, We Verg 02, doi: 10.3997/2214-4609.20142095.
Di Stefano P., Luzio D., Renda P., Martorana R., Capizzi P., D’Alessandro A., Messina N., Napoli G., Todaro S. and
Zarcone G.; 2014: Integration of HVSR measures and stratigraphic constraints for seismic microzonation studies:
the case of Oliveri (ME). Natural Hazards and Earth System Sciences Discussion, 2, 2597-2637,
doi: 10.5194/nhessd-2-2597-2014.
D’Alessandro A., Luzio D., Martorana R. and Capizzi P.; 2014: Improvement of the Horizontal to Vertical Noise
Spectral Ratio technique by cluster analysis. Submitted to Computers & Geosciences.
D’Alessandro A., Mangano G., D’Anna G. and Luzio D.; 2013: Waveforms clustering and single-station location of
microearthquake multiplets recorded in the northern Sicilian offshore region. Geophysical Journal International,
194, 3, 1789-1809, doi: 10.1093/gji/ggt192.
Everitt B.S., Landau S., Leese M., Stahl D.; 2011: Cluster Analysis, Wiley Series in Probability and Statistics, 5th
Edition, pp. 332, ISBN-10: 0470749911, ISBN-13: 978-0470749913.
Fabbris L.; 1983: Analisi esplorativa di dati multidimensionali. Cleup editore.
Fäh D., Kind F. and Giardini D.; 2003: Inversion of local S wave velocity structures from average H/V ratios, and
their use for the estimation of site-effects. Journal of Seismology, 7, 449–467,
doi: 10.1023/B:JOSE.0000005712.86058.42.
Gan G., Ma C., Wu J.; 2007: Data Clustering: Theory, Algorithms, and Applications, pp. 184, Cambridge University
Press, ISBN: 9780898716238.
Hartigan J.A.; 1975: Clustering algorithms. New York, Wiley.
Lunedei E. and Albarello D.; 2009: On the seismic noise wavefield in a weakly dissipative layered Earth. Geophysical
Journal International, 177, 1001–1014, doi: 10.1111/j.1365-246X.2008.04062.x.
Martorana R., Capizzi P., Avellone G., Siragusa R., D’Alessandro A., Luzio D.; 2014: Seismic characterization by
inversion of HVSR data to improve geological modeling. Near Surface Geoscience 2014 - 20th European Meeting
of Environmental and Engineering Geophysics, Athens, 14-18 September 2014, We Verg 01,
doi: 10.3997/2214-4609.20142094.
Nakamura Y.; 2000: Clear Identification of Fundamental Idea of Nakamura’s Technique and its Applications. 12th
World Conference on Earthquake Engineering, 2656.
Parolai S., Bindi D., Augliera P.; 2000: Application of the Generalized Inversion Technique (GIT) to a microzonation
study: numerical simulations and comparison with different site-estimation techniques. B. Seismol. Soc. Am.,
90, 286–297.
Ohsumi N.; 1980: Evaluation procedure of agglomerative hierarchical clustering methods by fuzzy relations. Data
Analysis and Informatics, Diday et al. (eds.), North Holland.
Rand W.M.; 1971: Objective criteria for the evaluation of clustering methods. J.A.S.A.
Silvestri L., Hill I.R.; 1964: Some problems of the taxometric approach. In: Heywood V.H. and McNeill J. (eds.),
Phenetic and Phylogenetic Classification, Systematics Association, London.
Wathelet M., Jongmans D. and Ohrnberger M.; 2004: Surface-wave inversion using a direct search algorithm and its
application to ambient vibration measurements. Near Surface Geophysics, 2, 211–221,
doi: 10.3997/1873-0604.2004018.