https://doi.org/10.3847/1538-4357/ab5a7a
The Astrophysical Journal, 889:153 (9pp), 2020 February 1
© 2020. The American Astronomical Society.

Objectively Determining States of the Solar Wind Using Machine Learning

D. Aaron Roberts^1, Homa Karimabadi^2, Tamara Sipes^3, Yuan-Kuen Ko^4, and Susan Lepri^5

1 Heliophysics Division, NASA Goddard Space Flight Center, Code 672, Greenbelt, MD 20771, USA; aaron.roberts@nasa.gov
2 Analytics Ventures, 6450 Lusk Blvd, Suite E208, San Diego, CA 92121, USA
3 Optum, 6195 Lusk Blvd, Suite 120, San Diego, CA 92121, USA
4 Space Science Division, Naval Research Laboratory, Washington, DC 20375, USA
5 Climate and Space Sciences and Engineering, University of Michigan, Ann Arbor, MI 48109, USA

Received 2019 August 20; revised 2019 November 20; accepted 2019 November 20; published 2020 February 3

Abstract

Conclusively determining the states of the solar wind will aid in tracing the origins of those states to the Sun, and in the process help to find the wind’s origin and acceleration mechanism(s). Prior studies have characterized the various states of the wind, making lists that are only partially based on objective criteria; different approaches obtain substantially different results. To uncover the unbiased states of the solar wind, we use “k-means clustering,” an unsupervised machine learning method, including constructed multipoint variables. The method allows exploration of different descriptive state variables and numbers of fundamental states (clusters). We show that the clusters reveal structures similar to those found by more ad hoc means, including coronal hole wind, interplanetary coronal mass ejections, “slow wind” (better: noncoronal hole flow), “pseudostreamers,” and stream interaction regions, but with differences that should be useful in refining our previous ideas. These results demonstrate the viability of the approach and warrant further study to understand the origin of remaining discrepancies.
Complexity in k-means characterization of the wind may ultimately point to complexity at the source; studies closer to the Sun with Parker Solar Probe will help. We confirm the utility of a set of variables that can serve as a proxy for composition measurements. This proxy permits studies at high time resolution and where composition is not available. This and our recently developed unsupervised multivariate clustering technique are expected to be beneficial in the automated identification of structures and events in a variety of studies.

Unified Astronomy Thesaurus concepts: Interplanetary physics (827); Solar wind (1534); Interplanetary turbulence (830); Interplanetary magnetic fields (824); Solar coronal mass ejections (310)

Original content from this work may be used under the terms of the Creative Commons Attribution 3.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

1. Introduction

The ubiquitous, continuous flow of fully ionized particles from the Sun, the solar wind, is much more variable than the Earth’s atmosphere. As with our local atmosphere, we would like to have clear and useful characterizations of the variability, things equivalent to “hot, dry, and sunny” versus “cool and foggy,” versus “category 4 hurricane.” For the solar case, such characterizations could be useful in determining the origins and effects of different flows (see, e.g., Zhao et al. 2017, for an approach starting with solar states). Time series of even a simple set of, say, speed, density, and magnetic field variables from the solar wind invite the eye to informally categorize different “states,” most obviously fast and slow, but with further examination including various transient events, “sector boundaries,” and more subtle regions such as shocked plasma or places where electron flows are anomalous. (A detailed overview of studies of solar wind states is provided by Borovsky et al. 2019; below we discuss the most relevant studies for the current work.)

In a general sense, many of these states are understood: magnetically open, X-ray-weak areas of the solar corona (“holes”) are the clear origin of most very fast wind, and slow wind likely comes from near the edges of such holes and from the “streamer stalks” so prominent in deep-exposure eclipse photographs or coronal images from spacecraft. Magnetic arcades known as filaments or prominences, depending on whether they are viewed against the solar surface or off to the side (on the limb), sometimes give rise to transient coronal mass ejections (CMEs) that often become interplanetary flux ropes (a type of interplanetary CME, or ICME) that are observed in situ by spacecraft.

It has proven difficult to definitively characterize the various types of wind from in situ measurements, such that various schemes for classification (e.g., Jian et al. 2006; Zhao et al. 2009; Richardson & Cane 2010; Xu & Borovsky 2015) can disagree a large fraction of the time (Neugebauer et al. 2016). There may well be usefully distinct types of ICMEs or even fast coronal hole flows. All current schemes for classifying solar wind states ultimately depend on (1) a predetermined set of categories, and (2) an often-subjective evaluation of the presence or absence of those categories based on characteristics such as flow speed, the presence of smooth magnetic rotations, and the properties of charge states of particular ions.

There are a number of clear-cut cases, such as the large magnetic field rotation associated with “magnetic-cloud” ICMEs or the seemingly unperturbed flow in the middle of a high-speed stream. Such relatively obvious cases can be the origin of analytical expressions (e.g., Zhao et al. 2009; Xu & Borovsky 2015) or can provide training sets for “supervised” methods (see, in this context, Camporeale et al. 2017) that can then be used to categorize the rest of the events. These approaches can be helpful for performing statistical studies, but the fact that different investigators disagree on events indicates either that we have not determined a fully convincing method of characterization or perhaps that there are no truly reliable general categories to be found. For a general discussion of the larger context and use of “machine learning” methods for the analysis of space physics data, see McGranaghan et al. (2017).

2. Method and Data

2.1. The k-means Algorithm

Much of the previous work in this area has relied on finding clusters in scatter plots of various variables. This can be done visually in two dimensions, but this method rapidly fails when going to higher dimensions. Moreover, when a random set of states of the wind is viewed in such (low-dimensional) plots, it never separates out cleanly into different clumps. Even the “ideal” cases are not completely disentangled. This suggests the possibility that what is needed is a higher dimensional approach, such that additional dimensions sort out the confusion.

It turns out that the problem of finding clusters in high dimensions was largely solved many years ago, and there are now a number of clustering schemes that use different criteria for grouping. One of the most intuitive and easy to calculate is the “k-means” algorithm, developed over 50 years ago and now implemented in various software packages. (The term and a standard exposition are given by MacQueen (1967), but the history goes back earlier^6.) In the implementation used here, the method starts by assuming that there are k clusters, each centered on a point in an N-dimensional state space. The centroids are initially chosen randomly, and the distances from the k centroids to all points are determined. The points closest to each centroid are then used to calculate new centroids, and this process is iterated until very little change occurs. The result is k clumps that are taken to characterize interesting “natural” states of the system described by the variables. In practice, ten or so iterations are used to obtain adequate convergence.

We have chosen the k-means method over others as an initial attempt to do meaningful clustering in a way that is very efficient, allowing many trials with different variables and parameters, but that also captures very directly a physicist’s notion of “natural states.” We have tried some less efficient variants of the basic procedure, such as “Gaussian mixture models”^7 with essentially similar results.

The k-means procedure is equivalent to minimizing the total variance of the states from the centroids. Mathematically, given M states x_i in an N-dimensional state space, the “objective function” to be minimized is

$$J = \sum_{i=1}^{M} \sum_{j=1}^{k} w_{ij}\,\lVert \mathbf{x}_i - \boldsymbol{\mu}_j \rVert^{2} \qquad (1)$$

where μ_j is the N-vector centroid of the jth cluster, w_ij equals one if point i is a member of cluster j and zero otherwise, and the norm is taken here to be the standard Euclidean distance in the N-dimensional state space. Other norms are possible, and in fact the IDL-provided code for k-means uses the “Manhattan” norm of the sum of absolute values of component differences instead of the sum of squares. The alternating steps of the minimization choose w_ij by determining the points nearest to the current centroids, and then finding a new set of values for μ_j by averaging over the new w_ij sets. Given IDL’s nonstandard norm and other nonstandard features (e.g., the mean of the cluster values does not end up being exactly the reported centroid), we wrote our own version of the algorithm (still in IDL), exactly implementing the process described above. We found that some initial conditions selected essentially isolated points that never acquired many other points, so we rejected initial centroid positions in which one of the initial clusters had fewer than ten points. It may be that some of the nearly isolated points are interesting outliers, but we leave this issue to future work.

If the natural states (clusters) occur in well-separated convex regions, this method works quickly and well. Possible pitfalls are finding a local minimum in the space but missing the global minimum; not correctly characterizing the shapes of the clusters; and having the “wrong” value of k. Multiple runs with different ks and initial centroids can help to sort out the first and third of these problems, and visual examination of 2D projections of the distribution can give some reassurance about the second. Here we found that 30 trials were enough to arrive at a very slowly changing J, and results presented here are based on this; there is never a complete convergence in complex data such as these, but we plan to study more carefully how the results evolve with decreasing J. Changing k values will be dealt with below.

While there are a variety of clustering techniques (Pedregosa et al. 2011),^8 k-means is a widely used technique due to its simplicity and general utility. This technique is not sensitive to any temporal correlations of points in the time series data. Note that many of the recent advances in artificial intelligence, such as convolutional neural nets, have centered around “supervised” methods that require large numbers of labeled data sets. Such techniques are clearly not directly applicable to the solar wind categorization problem, where the objective is to find the proper “labels” for the different states of the solar wind. However, a recently developed unsupervised technique leverages advances in artificial intelligence and takes into account any underlying temporal correlations in time series data (Madiraju et al. 2018). We will explore application of this technique to the solar wind classification problem in a separate publication.

6 See https://en.wikipedia.org/wiki/k-means_clustering.
7 See https://scikit-learn.org/stable/modules/mixture.html#mixture.
8 See scikit-learn https://scikit-learn.org/stable/modules/clustering.html#clustering.

2.2. Data Set Used

The data examined here are essentially the same as those used in the study of Neugebauer et al. (2016), namely magnetic field (from the MAG instrument), plasma parameters (density, velocity, alpha-to-proton ratio; SWEPAM), and ion composition (SWICS), all from the Advanced Composition Explorer (ACE) spacecraft upstream of the Earth at the L1 point. Hourly binned quantities were determined for each instrument so that they could be compared directly. The time period analyzed is 2002 November through 2004 May (a total of 20,448 points), when there was sufficient solar activity to produce many ICMEs. All quantities, z, were normalized to the range from 0 to 1 by taking z_n = (z − min(z))/(max(z) − min(z)). This is necessary to make density (say) with a range below 20 cm^-3 contribute as much to distances in the state space as, for example, the solar wind speeds of 250–1000 km s^-1. Future work will examine the effects of weighting the variables differently using multipliers, but here all weights of the normalized variables are taken to be equal.
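The normalization and the alternating k-means steps described in Sections 2.1 and 2.2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' IDL code: the function names (`minmax_normalize`, `kmeans`, `best_of_trials`) and their parameters are hypothetical, initial centroids are assumed to be drawn from the data points themselves, and an empty cluster is assumed to keep its previous centroid. The sketch uses the squared Euclidean norm of Equation (1), redraws initial centroids when any initial cluster is smaller than a chosen size, and keeps the trial with the lowest J.

```python
import random


def minmax_normalize(columns):
    """Scale each variable to [0, 1] via z_n = (z - min(z)) / (max(z) - min(z))."""
    scaled = []
    for col in columns:
        lo, hi = min(col), max(col)
        span = hi - lo
        # Assumption: a constant variable is mapped to 0 to avoid dividing by zero.
        scaled.append([(z - lo) / span if span > 0 else 0.0 for z in col])
    return scaled


def sq_dist(p, q):
    """Squared Euclidean distance, the norm used in Equation (1)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centroids):
    """Label each point with the index of its nearest centroid (the w_ij step)."""
    return [min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points]


def update(points, labels, centroids):
    """Move each centroid to the mean of its members (the mu_j step)."""
    new = []
    for j, old in enumerate(centroids):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if members:
            new.append(tuple(sum(c) / len(members) for c in zip(*members)))
        else:
            new.append(old)  # assumption: an empty cluster keeps its centroid
    return new


def kmeans(points, k, n_iter=10, min_initial_size=1, rng=None, max_draws=1000):
    """Run one k-means trial and return (labels, centroids, J)."""
    rng = rng or random.Random()
    for _ in range(max_draws):
        centroids = rng.sample(points, k)  # assumption: start from data points
        labels = assign(points, centroids)
        if min(labels.count(j) for j in range(k)) >= min_initial_size:
            break  # every initial cluster is large enough
    else:
        raise ValueError("no acceptable initial centroids found")
    for _ in range(n_iter):
        centroids = update(points, labels, centroids)
        new_labels = assign(points, centroids)
        converged = new_labels == labels
        labels = new_labels
        if converged:
            break
    centroids = update(points, labels, centroids)
    j_value = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, centroids, j_value


def best_of_trials(points, k, n_trials=30, seed=0, **kwargs):
    """Repeat k-means from different starts and keep the lowest-J result."""
    rng = random.Random(seed)
    trials = [kmeans(points, k, rng=rng, **kwargs) for _ in range(n_trials)]
    return min(trials, key=lambda trial: trial[2])


# Toy usage with made-up speed (km/s) and density (cm^-3) values.
speed = [300.0, 320.0, 310.0, 700.0, 720.0, 710.0]
density = [10.0, 12.0, 11.0, 3.0, 2.0, 4.0]
points = list(zip(*minmax_normalize([speed, density])))
labels, centroids, j_value = best_of_trials(points, k=2)
```

Passing `min_initial_size=10` would mimic the rejection of sparsely populated initial clusters mentioned in Section 2.1, and the default `n_trials=30` matches the number of trials quoted there; both are exposed as parameters so that other choices can be explored.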
In some cases, explicitly indicated, the log of the variable is used to spread the distribution of a variable such as density that tends to be concentrated at small values. Often a normalization of zero mean and unit standard deviation is used for the variables, but this was not found to give as clear results in this case, perhaps due to the highly non-Gaussian distributions of the variables. where vp is the wind (proton) speed, and anything else is noncoronal hole flow. A primary issue addressed by this set of criteria is that it is not a priori obvious where the boundary is between “fast” wind and “slow” wind; in fact, at times “slow” wind has most of the usual characteristics of “fast” wind, so this division does not seem accurate (see, e.g., Marsch et al. 1981; Roberts et al. 1987). The use of the charge states is based on the idea that these states tell us what the temperature of the solar corona was where the ratios of charged states were formed, in the collisional region close to the Sun, and thus these states provide a physically motivated choice for wind types. The ICME identifications come from a careful study of clear cases based on other criteria (see Zhao et al. 2009). 2.3. “Nonlocal” Variables Given the normalized variables, any set or combination of them can be used to represent the solar wind state. In addition to a number of direct measurements, we use the time correlations known as the “cross-helicity” sc = 2 ádb · dvñ á(d b )2 ñ + á(d v )2 ñ (2 ) 3. Results and the “residual energy” sr = 3.1. Clusters Using Charge State Information á(d v )2 ñ - á(d b )2 ñ á(d b )2 ñ + á(d v )2 ñ We apply the k-means method to the above ACE data set, initially choosing eight variables based on the relevance of these variables in previous studies. To capture ICMEs, we include charge state information. 
In the first example of the kmeans process shown here, eight variables describe the state (proton density, np; proton speed, Vp; σc, σr, o7to6, the average ionization of iron, the ratio of charged iron to charged oxygen density (fetoo), and magnetic field strength B). We show k=8 here because it provides many connections to previously studied states; there is nothing special about the equality of k and the number of variables. Higher values of k can divide clusters and yield, for example, two sets of ICMEs, and lower k values lump clusters together. In the k-means context, the choice of variables and number of clusters cannot be automated, and it is the primary subjective aspect of this method. However, this subjectivity is quite different from that involved in the point-by-point determination of wind states. The results of the present method for the case of eight variables and eight states are shown in Figures 1 and 2, which are based on an illustrative subset (2003 May–August) of the entire time range. The top panel in Figure 1 shows the proton speed, Vp, at the top, normalized x (black) and y (red) components of the magnetic fields in Geocentric Solar Ecliptic coordinates at the bottom, and traces in the middle that show high values for the presence of ICMEs as determined by Jian et al. (2006) (blue, and lower max) and Richardson & Cane (2010) (black). (An immediate note is the substantial differences between the two sets of ICMEs, indicating the need for a more objective classification of these states; see Neugebauer et al. 2016.) The colors of the Vp trace show states as identified by k-means clustering. The bottom panel in Figure 1 is a plot of o7to6 colored by the same states as above. The analytic classification is shown by a dashed horizontal line at o7to6=0.145, which is the maximum for coronal hole wind, and by a brown dotted trace that provides the lower limit for ICME identification. Noncoronal hole wind is between the dashed and dotted lines. 
The dotted line is not shown when it goes below 0.145. As is standard with clustering methods, the clusters do not tell us what they are in physical terms. Here we use prior expert identifications of physical regions to make associations. One qualitative test of the validity of the results is the ease with which this can be done. To begin, we immediately see the recurrence of the 27 day solar rotation in the plot of proton (3 ) where the brackets áñ represent averages, here taken over three data points, although the main results are essentially the same with, e.g., seven-point averages. The δs indicate that a running mean has been subtracted out over the averaging interval. Here,v is the wind velocity andb is the magnetic field in “Alfvén speed units, ” which allowsb andv to be compared directly: b= 21.8B / nT n / cm-3 (4 ) for a magnetic fieldB and ion density n. These two quantities have proven quite useful in the study of solar wind turbulence (e.g., Bavassano et al. 1998; Wicks et al. 2013), and here we will find that they are useful for categorizing wind states. Note that this implies that the “state” of the system depends not just on values of quantities at a given time, but also on the variables nearby in time (shear, currents, etc.). The cross-helicity is a measure of how purely the mode of the wind fluctuations is that of an Alfvén wave (σc = ±1). Such waves strongly tend to propagate outward from the Sun, and the sign is determined by the direction of the mean magnetic field. Thus, σc provides an indication of whether the flow is from the southern or northern hemisphere of the Sun. 2.4. Prior ICME Identifications The validity of the results cannot be determined by simple means, since we are not assuming any a priori categories. As guides to significance, we use two sets of ICME identifications (Jian et al. 2006; Richardson & Cane 2010) and the analytical formulae of Zhao et al. (2009). 
The latter workers identify three states: coronal hole flows (nominally “fast wind”), noncoronal hole flows (nominally “slow wind”), and transients (ICMEs). They use the value of the ratio of the number densities of two charge states of oxygen, O+7 to O+6 (“o7to6”). Coronal hole flows have o7to6 < 0.145. The ICMEs have o7to6 > 6.008 exp ( - 0.00578vp) (5) 3 The Astrophysical Journal, 889:153 (9pp), 2020 February 1 Roberts et al. Figure 1. A k=8 set of clusters for four months of 2003 seen as colors superimposed on the plasma speed. Top panel: speed colored by cluster (see Table 1), along with expertly determined ICMEs as deviations from a constant (Richardson: taller black, Jian: shorter blue), and with Bx and By as black and red traces below to indicate the magnetic polarity of the solar wind. Lower panel: plot of o7to6 ratio along with a horizontal line showing an empirical upper limit on coronal hole flows and brown dots from Equation (5) showing a lower limit for ICME identification (Zhao et al. 2009). they are otherwise found to be very similar. Nearly all the points identified by the analytic method as coronal hole flow are either black or red, and almost none of the latter points appear above the 0.145 line. ICMEs are found to agree significantly with Jian, Richardson, and SWICS lists and criteria, with some interesting differences. The k-means method does not tell us which states are ICMEs, but looking at the first two sets of independently identified ICMEs (nearly all in May) shows that the green state is the natural choice. The first case shows the green agreeing best with the narrower Jian identification, but with a green point slightly later that is still within the Richardson identification. The Jian portion of the interval clearly meets the analytic criterion, but the rest of the interval is marginal, with values nearly equal to the selection criterion. This case is typical of ICME identifications by the various means used here. 
Detailed investigation of many cases will be needed to see whether k-means can be regarded as more reliable than other methods, but it does provide a rapid, unbiased view. Some cases where only one of the two expert sets shows a “hit” do not show up in the k-means test (see early July, where the formula marginally predicts an ICME, but k-means does not). There are some identifications that agree with the formula, but not with the lists (around July 10 and 15, for example). One case identifies an ICME in both lists, but in neither k-means nor the formula (the first of the pair in mid-June; the region looks to be complex). Consistent with other work in expert identification, the “green state” of the solar wind has a large average Fe charge state as well as a large value of the field magnitude, B0. The low value of σc is consistent with having typical ICMEs connected at both ends to the Sun, and thus having waves propagating in both directions along the field lines. What appear to be interaction regions between fast and slow flow (orange) are detected mainly via compression (high density values); they also have higher relative velocity fluctuations as seen in Figure 2 (high σr). Of particular interest is the identification of slow flow regions that are not associated with changes in the magnetic polarity of the flow. These are now generally accepted to be “pseudostreamers” that come from streamer-like flows at the Sun but with bipolar regions below the open field lines, leading to the unipolar flow. (See Figure 2. The clusters from Figure 1 seen in the σc–σr projection of the state space. speed, Vp, at the top of Figure 1 of particular colors (states), especially black and red. 
The sector boundary structure, determined by whether the magnetic field is coming from the southern or northern pole of the Sun, is also apparent in, for example, the alternation of black and red as the spacecraft crosses the sector boundary as indicated by the flip in sign of the field in the black and red traces below the main plot in Figure 1 (top). Note that the red sector has systematically slow flow speeds, speeds that would often be termed “slow wind,” but both k-means and the analytic formulae agree (the latter having values in the lower panel of Figure 1 that are below the dotted line threshold of 0.145) that these winds are in the same class as the typical fast winds. Figure 2 shows that the black and red states differ in the sign of σc, meaning that both contain outward traveling Alfvén waves but in opposite sectors. The input set of variables included wind speed, but this variable did not dominate the choice of category. When the average features of the red and black labeled flows are examined (see Table 1), 4 The Astrophysical Journal, 889:153 (9pp), 2020 February 1 Roberts et al. Table 1 Mean Values (Centroids) of the Eight Quantities Used in the Analysis (Columns) for Each of Eight Types of Wind (Rows) Color Red (CH+) Blue (NCH−) Green (ICME) Purple (NCH+) Cyan (NCH−) Magenta (PS) Black (CH−) Orange (IR) Vp np σc o7to6 σr fetoo aveqfe b0 Nk/N 504.1889 406.2666 464.4140 394.6382 475.5256 477.4309 682.3386 451.6607 4.8210 7.5016 8.2996 6.6545 5.4553 5.4904 3.3511 11.3938 0.8140 −0.4670 0.0950 0.5520 −0.6446 −0.8311 −0.8070 −0.1592 0.1684 0.2762 0.5480 0.2840 0.4547 0.1415 0.0665 0.2711 −0.2705 −0.6343 −0.6402 −0.5866 −0.4425 −0.2954 −0.2579 0.0484 0.1576 0.1881 0.2215 0.2010 0.1667 0.1507 0.1052 0.1829 10.1349 10.0653 11.5302 9.9442 11.9122 9.9792 10.4443 10.3147 7.5191 6.8543 10.4567 6.8353 6.8356 7.1535 6.9685 9.9361 0.246 0.113 0.095 0.154 0.043 0.150 0.146 0.052 Note. 
The categories are CH, coronal hole; NCH, noncoronal hole; ICME; PS, pseudostreamer; and IR, interaction region; with + and − indicating the magnetic sector. Nk/N is the fraction of the number of points in the cluster. Xu & Borovsky 2015 and references therein.) Here these regions appear to be the magenta states. They show slow speed but high cross-helicity. This is consistent with a careful examination of the results of Ko et al. (2018), where row 3, column 1 of their Figure 5 has systematically higher green points (pseudostreamers) than black points (current sheet crossings). A summary of the expert identifications of regions by cluster is given in the first column of Table 1. The values of the centroids of the clusters shown there are consistent with the verbal descriptions above, e.g., ICMEs have low cross-helicity, high mean field (b0), and high values of average iron charge state (aveqfe), as expected. Coronal holes are characterized by generally higher speeds (vp) and cross-helicity. Pseudostreamers are characterized by high cross-helicity but low speed, and interaction regions have the expected high density resulting from plasma compression. All these characteristics are consistent with the discussion above. typically fast with low o7to6 and high cross-helicity. There are some remarkable agreements with the case above. For example, the first ICME, in May, is split as above, and the possible ICME in early July is not found here or above. Interestingly, the weakly identified ICME in mid-August (the first of a possible pair) is more strongly represented here. The two cases in July that are not found manually are found both here and above. The green labeled state has an enhanced average charge state of iron and o7to6 ratio (as also above), as well as large magnetic field magnitude. Thus the new set of variables captures charge state information without the need for a detailed composition instrument, which is seldom included on spacecraft. 
Further evidence of the efficacy of k-means is shown in Figure 4, which should be compared to Figure 3 of Xu & Borovsky (2015), which shows a similar structure although the present case is not based on pre-chosen events. As an example of the importance of the choice of state variables, Figure 5 shows what happens if the temperature variable is omitted from the set of three. In this case, the plot of Alfvén speed versus entropy gives a full characterization of the resulting clusters, and all the algorithm can do is make four more-or-less equal groups. This constitutes a test that the algorithm is working, but it does not give a useful clustering. 3.2. Clusters without Charge State Information To see whether it is possible to identify the similar sets of solar wind regions without the use of composition variables, we use a set of variables developed by Xu & Borovsky (2015) in their analytic formulation of the state determinations. The use of Texp = 1.2 ´ 10 4 (Vp /235.0)3 (the “expected” temperature for a given speed of wind, Vp) gives one variable as the ratio of this quantity to the measured proton temperature: Tratio=Texp/Tp. A second variable is the Alfvén speed Va = ∣b∣ (in “Alfvén speed units” as in Equation (4)). The third variable is the entropy S=Tp/n2/3. It is necessary to take the log of all values, as done by Xu and Borovosky, to spread out the somewhat clumped distribution of low values of the quantities. This kind of scaling is common in k-means analysis. It represents a strength and a weakness of the method in that it gives great flexibility in the description, but puts an increased burden on the verification process. In our first use of these variables, we choose four clusters to see how results compare to the study of Xu and Borovsky. Figure 3, which is in the same format as Figure 1, shows that the variable set is a good proxy (as expected) for the composition-based variable set. 
The four colors nicely correspond to the Xu–Borovsky states: purple is the coronal hole wind, red the current sheet wind, blue the streamer belt (and related pseudostreamer wind), and green the transients (ICMEs). These identifications are further confirmed by the values in Table 2, in which, for example, the ICME row shows low cross-helicity, high iron charge states, and high mean field (B0), and coronal hole flow is

Figure 3. A k=4 set of clusters for four months of 2003 using the variables defined by Xu and Borovsky. The panels are as in Figure 1.

Table 2
Means of Eight Properties (Columns) of Four Types of Wind (Rows) Found with the Three Variables of Xu & Borovsky (2015)

Color          Vp        np      |σc|    o7to6   σr       fetoo   aveqfe   b0      Nk/N
Red (SR)       378.4190  9.6784  0.5074  0.3325  −0.5070  0.2001  10.2638  6.4704  0.215
Blue (PS/SB)   475.9861  5.6409  0.6919  0.1659  −0.3734  0.1532  10.1628  7.4252  0.449
Green (ICME)   458.9792  6.6945  0.4618  0.6466  −0.4685  0.2550  11.4062  9.1196  0.083
Purple (CH)    622.5383  3.2403  0.7157  0.1304  −0.2853  0.1333  10.3572  8.2802  0.253

Note. The categories identified are SR, sector reversal; PS/SB, pseudostreamer/streamer-belt; ICME; and CH, coronal hole. Note that the absolute value of the cross-helicity is used, since the two sectors cannot be distinguished by the three variables used here.

3.3. Further Comparison of the Two Cases Above

To make a more direct comparison with the first case above, the variable σc, which is not in the Xu–Borovsky list but is also independent of composition, is added to the list to keep track of magnetic sectors. For what is shown here, k is taken to be 8 for agreement with the first case above. In Figures 6 and 7, orange and blue represent the coronal hole wind in the two sectors, black shows ICMEs, red captures pseudostreamers, and green and cyan the noncoronal hole wind. Although density was not included, interaction regions appear as magenta and purple (depending on sector), but they are not well distinguished from other noncoronal hole wind. The split of the ICME in May, the nonoccurrence of an ICME in early July, the possible new ICMEs in July, and other features are common to this case and the first. There are details that differ (e.g., still a stronger first member of the possible pair of ICMEs in mid-August); these cases will guide future work.

There are various statistical methods used to determine how “good” the clustering is, as well as to attempt to determine the “best” value of k. The “elbow” method finds J for k from 1 to past the point where little change occurs, and typically takes as the best value the k at which the decrease in J becomes less dramatic. (The value of J goes to zero as k→N.) This is a highly subjective criterion, and, at least in the case here, it adds little to the discussion. Figure 8 shows the plot of J versus k corresponding to the set of eight variables above. A case could be made for a “best” k of 3 or 4, whereas it is clear from the above discussion that meaningful distinctions are apparent up to at least k=8. Even the k=10 case introduces a clearly identifiable new category of very fast ICMEs (not shown in the figures above). Whether or not this is a distinct category in terms of its origin at the Sun is not clear, but the method does suggest a possibly important distinction. Deeper insight will require detailed examination of each case using other solar imaging and in situ data.

The “silhouette” method measures the compactness of each cluster compared to the distance to the next closest cluster. For k-means, the clusters are, by definition, distinct and nonoverlapping, so all this test shows is how compact the clusters are. From the σc–σr plots and other pairs we find that the clusters are not highly compact. This implies that the edges of the distributions are somewhat fuzzy, perhaps due to the evolution of the plasma from the Sun. We can hope that measurements from Parker Solar Probe (PSP) nearer the Sun will yield more compact clusters and thus clearer states, but the physics-based criteria used here imply that the clusters found are meaningful.

Figure 4. Clusters in the previous figure, plotted in the VA−Sp space; see Figure 3 of Xu & Borovsky (2015).

Figure 5. Clusters for two variables, Sp and VA, and k=4. Note that all the algorithm can do in this case is make four similar partitions in the 2D space, unlike the divisions in the previous figure where a third variable allows more physically significant distinctions.

4. Conclusions

The above results illustrate the likely utility of unsupervised machine learning techniques in general, and the k-means method in particular, for identification of distinct states of the solar wind. The method is computationally fast and only involves subjective decisions in the choice of state variables, not in any subsequent decisions on how to categorize wind states. The ability to vary the number of clusters provides a way of systematically looking for substructures. The natural next steps are to decipher the meaning of the various complex wind regions and the cases of discrepancies that remain between various methods. More quantitative tests will be useful to see the degree to which, for example, different sets of variables identify the same states.

There are possible complications that none of the current solar wind classification methods address. For example, it may be that the solar wind consists of distinct as well as mixed/hybrid states. This possibility should be considered and can be exposed to some extent using the clustering techniques. There are already suggestions of this in the solar wind. For example, ICMEs found here are often mixed with other states in time, but there may be deeper meanings to this.
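The two diagnostics discussed in Section 3.3, the curve of J versus k used by the elbow method and the silhouette measure, can be illustrated with a short sketch. It is not the code behind the figures above. It assumes that J in Equation (1) is the usual within-cluster sum of squared distances, which scikit-learn reports as `inertia_`, and it uses synthetic Gaussian blobs (the arrays `centers` and `X` are hypothetical stand-ins, not solar wind data).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the state variables (NOT solar wind data):
# three Gaussian blobs in a 3D feature space.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0, 0.0], [6.0, 0.0, 0.0], [0.0, 6.0, 6.0]])
X = np.vstack([c + rng.normal(scale=1.0, size=(200, 3)) for c in centers])

# Put all variables on a comparable scale before clustering.
X_scaled = StandardScaler().fit_transform(X)

ks = list(range(1, 11))
J = []            # within-cluster sum of squared distances for each k
silhouette = {}   # mean silhouette coefficient (defined only for k >= 2)
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    J.append(model.inertia_)
    if k >= 2:
        silhouette[k] = silhouette_score(X_scaled, model.labels_)

# Tabulate both diagnostics; the "elbow" is judged by eye from the J column.
for k, j in zip(ks, J):
    s = silhouette.get(k)
    s_text = "   n/a" if s is None else f"{s:6.3f}"
    print(f"k={k:2d}  J={j:10.2f}  silhouette={s_text}")
```

Reading an elbow off the J column remains a judgment call, which is the subjectivity noted in Section 3.3. The silhouette column summarizes only how compact the clusters are relative to their nearest neighbors.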
The standard clustering methods such as k-means are based on the underlying assumption that each point can be treated independently and is not causally connected with its neighboring points. To partially account for this, we explicitly introduced temporal correlations through the use of cross-helicity and residual energy as variables. (Surprisingly, an interesting set of correlations arises in the “σc–σr space.”) Ideally, the unsupervised method should find such temporal relationships on its own. A recently developed unsupervised technique addresses this shortcoming by taking into account any underlying temporal correlations in time series data (Madiraju et al. 2018). Whether this technique will yield better solutions to the categorization of the solar wind remains to be seen. As a prelude to such a study, it would be interesting to examine the average lifetime of the different states derived from clustering.

Figure 6. A k=8 set of clusters for four months of 2003 using the variables defined by Xu and Borovsky along with σc. The panels are as in Figure 1. The clusters are identifiable as CH– (orange), CH+ (blue), ICMEs (black), PS/SB– (red), PS/SB+ (green), NCH– (cyan), IR–(?) (magenta), IR+ (purple).

Figure 7. A k=8 set of clusters for the same case as in the previous figure.

Another deficiency of the standard clustering techniques is that they do not include learning of the feature space; the state variables must be specified by the user. The proper choice of feature space can have a dramatic impact on the clustering results, as illustrated by Figure 5. We are currently working on the development of a deep learning-based clustering technique that includes learning representations, and we will explore its application to the solar wind.

One significant outcome of this study was the further establishment of a set of variables that can serve as a proxy for composition measurements, namely the set for Figures 3 and 4. This was known to some degree before, but the comparisons here confirmed the efficacy of these variables as proxies for composition measurement. The non-composition proxy not only opens up times and regions where composition data are not available, but also provides a straightforward means of performing classifications at higher time resolution. The ease and speed of the k-means method are also conducive to studies at higher time resolution.

The techniques used here may be helpful in determining the origin of solar wind parcels, and may also be applicable to many other space plasmas. The methods may provide an automated “scientist in the loop” for determining when to download burst mode data from spacecraft. There is a wide range of other possible directions in which to extend this work, including using other clustering schemes, seeing what different sets of variables reveal, and using these unsupervised learning states to provide labeled states to train supervised methods, such as neural networks.

Figure 8. The value of J from Equation (1) as a function of k for the set of variables used in Figure 1.

Our most immediate next step will be to examine higher resolution data to determine whether the current clusters become more complex or just add more points to each cluster region. If the latter is true, then we can have increasing confidence in the identifications of wind state made by k-means methods. If the picture becomes increasingly complex at higher resolution, or even just in light of existing discrepancies between the cases shown above, we will need to examine the details of these cases both in situ and in correlation with studies of projections back to the Sun.
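The suggestion above of using the unsupervised states as labels for a supervised method can be sketched as follows. This is only an illustration under stated assumptions, not a method used in this work: the data are synthetic blobs (the arrays `centers` and `X` are hypothetical stand-ins), and a small scikit-learn `MLPClassifier` stands in for the neural network, with layer size and iteration count chosen arbitrarily.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the state variables (NOT solar wind data).
rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0], [8.0, 0.0], [0.0, 8.0], [8.0, 8.0]])
X = np.vstack([c + rng.normal(scale=1.0, size=(150, 2)) for c in centers])
X_scaled = StandardScaler().fit_transform(X)

# Step 1: unsupervised k-means provides a state label for every point.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_scaled)
labels = kmeans.labels_

# Step 2: treat those labels as the training targets of a supervised model.
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, labels, test_size=0.25, random_state=0, stratify=labels
)
clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
clf.fit(X_train, y_train)

# Agreement between the network and the k-means labels on held-out points.
accuracy = clf.score(X_test, y_test)
print(f"held-out agreement with k-means labels: {accuracy:.3f}")
```

A classifier trained this way can only reproduce the clustering it was given. Any added value would have to come from applying it to new intervals or from supplying inputs that the clustering did not use.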
It will also be revealing to see whether the state identification becomes clearer closer to the Sun in PSP data, which should help to sort out complexities at the source from those due to propagation.

D.A.R. is supported in this work by Research and Technology Operating Plan (RTOP) funding through the NASA HP Internal Scientist Funding Model. H.K. is supported by Analytics Ventures. T.S. is self-supported for this effort. Y.K.K. was supported by NNH10AO82I and the Chief of Naval Research. S.T.L. acknowledges funding from NASA grants NNX16AH01G and NNX17AI94G and NSF grant AGS1460170. We thank the ACE MAG, SWICS, and SWEPAM teams for provision of the data via NASA’s CDAWeb. The specific data sets are given by NASA SPASE IDs: spase://VSPO/NumericalData/ACE/SWICS_SWIMS/L2/PT1H, spase://VSPO/NumericalData/ACE/SWEPAM/L2/PT1H, and spase://VSPO/NumericalData/ACE/MAG/L2/PT1H. Entering the ID as a text restriction in the Heliophysics Data Portal at https://heliophysicsdata.gsfc.nasa.gov will bring up the resource, and clicking on its name will give access and other information.

Facility: ACE.

ORCID iDs
D. Aaron Roberts https://orcid.org/0000-0001-6565-2921
Yuan-Kuen Ko https://orcid.org/0000-0002-8747-4772
Susan Lepri https://orcid.org/0000-0003-1611-227X

References
Bavassano, B., Pietropaolo, E., & Bruno, R. 1998, JGR, 103, 6521
Borovsky, J. E., Denton, M. H., & Smith, C. W. 2019, JGRA, 124, 2406
Camporeale, E., Carè, A., & Borovsky, J. E. 2017, JGRA, 122, 10910
Jian, L., Russell, C. T., Luhmann, J. G., & Skoug, R. M. 2006, SoPh, 239, 393
Ko, Y.-K., Roberts, D. A., & Lepri, S. T. 2018, ApJ, 864, 139
MacQueen, J. 1967, Proc. of the Fifth Berkeley Symp. on Mathematical Statistics and Probability (Berkeley, CA: Univ. California Press), 281, https://projecteuclid.org/euclid.bsmsp/1200512992
Madiraju, N. S., Sadat, S. M., Fisher, D., & Karimabadi, H. 2018, arXiv:1802.01059
Marsch, E., Rosenbauer, H., Schwenn, R., Muehlhaeuser, K.-H., & Denskat, K. U. 1981, JGR, 86, 9199
McGranaghan, R. M., Bhatt, A., Matsuo, T., et al. 2017, JGRA, 122, 12586
Neugebauer, M., Reisenfeld, D., & Richardson, I. G. 2016, JGRA, 121, 8215
Pedregosa, F., Varoquaux, G., Gramfort, A., et al. 2011, J. Mach. Learn. Res., 12, 2825
Richardson, I. G., & Cane, H. V. 2010, SoPh, 264, 189
Roberts, D. A., Goldstein, M. L., Klein, L. W., & Matthaeus, W. H. 1987, JGR, 92, 12023
Wicks, R. T., Roberts, D. A., Mallet, A., et al. 2013, ApJ, 778, 177
Xu, F., & Borovsky, J. E. 2015, JGRA, 120, 70
Zhao, L., Landi, E., Lepri, S. T., et al. 2017, ApJ, 846, 135
Zhao, L., Zurbuchen, T. H., & Fisk, L. A. 2009, GeoRL, 36, L14104