Sampling Fraction. In probability sampling, the sampling fraction is the (known) probability with which cases in the population are selected into the sample. For example, if you were to take a simple random sample with a sampling fraction of 1/10,000 from a population of 1,000,000 cases, then each case would have a 1/10,000 probability of being selected into the sample, which will consist of approximately 1/10,000 * 1,000,000 = 100 observations.

S.D. Ratio. In a regression problem, the ratio of the prediction error standard deviation to the standard deviation of the original output data. A lower S.D. ratio indicates a better prediction; its square equals one minus the proportion of variance explained by the model. See Multiple Regression, Neural Networks.
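A minimal sketch of how this ratio could be computed from observed and predicted values (using numpy; the function and variable names are illustrative only):

    import numpy as np

    def sd_ratio(y_true, y_pred):
        """Ratio of the prediction-error SD to the SD of the observed outputs."""
        residuals = np.asarray(y_true) - np.asarray(y_pred)
        return np.std(residuals, ddof=1) / np.std(y_true, ddof=1)

    # Values near 0 indicate accurate predictions; values near 1 indicate a model
    # that does little better than predicting the mean.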

Scalable Software Systems. Software (e.g., a database management system, such as MS SQL Server or Oracle) that can be expanded to meet future requirements without the need to restructure its operation (e.g., split data into smaller segments) to avoid a degradation of its performance. For example, a scalable network allows the network administrator to add many additional nodes without the need to redesign the basic system. An example of a non-scalable architecture is the DOS directory structure (adding files will eventually require splitting them into subdirectories). See also Enterprise-Wide Systems.

Scaling. Altering original variable values (according to a specific function or algorithm) so that they fall into a range that meets particular criteria (e.g., positive numbers, fractions, numbers less than 10E12, numbers with a large relative variance).

Scatterplot, 2D. The scatterplot visualizes a relation (correlation) between two variables X and Y (e.g., weight and height). Individual data points are represented in two-dimensional space, where the axes represent the variables (X on the horizontal axis and Y on the vertical axis).

The two coordinates (X and Y) that determine the location of each point correspond to its specific values on the two variables.

See also, Data Reduction.

Scatterplot, 2D - Categorized Ternary Graph. The points representing the proportions of the component variables (X, Y, and Z) in a ternary graph are plotted in a 2-dimensional display for each level of the grouping variable (or user-defined subset of data). One component graph is produced for each level of the grouping variable (or user-defined subset of data) and all the component graphs are arranged in one display to allow for comparisons between the subsets of data (categories).

See also, Data Reduction.

Scatterplot, 2D - Double-Y. This type of scatterplot can be considered to be a combination of two multiple scatterplots for one X-variable and two different sets (lists) of Y-variables. A scatterplot for the X-variable and each of the selected Y-variables will be plotted, but the variables entered into the first list (called Left-Y) will be plotted against the left-Y axis, whereas the variables entered into the second list (called Right-Y) will be plotted against the right-Y axis. The names of all Y-variables from the two lists will be included in the legend, followed by the letter (L) or (R), denoting the left-Y or right-Y axis, respectively.

The Double-Y scatterplot can be used to compare several correlations by overlaying them in a single graph. Because independent scaling is used for the two lists of variables, it can facilitate comparisons between variables with values in different ranges.

See also, Data Reduction.

Scatterplot, 2D - Frequency. Frequency scatterplots display the frequencies of overlapping points between two variables in order to visually represent data point weight or other measurable characteristics of individual data points.

See also, Data Reduction.

Scatterplot, 2D - Multiple. Unlike the regular scatterplot in which one variable is represented by the horizontal axis and one by the vertical axis, the multiple scatterplot consists of multiple plots and represents multiple correlations: one variable (X) is represented by the horizontal axis, and several variables (Y's) are plotted against the vertical axis. A different point marker and color are used for each of the multiple Y-variables and referenced in the legend, so that the individual plots representing different variables can be distinguished in the graph.

The Multiple scatterplot is used to compare several correlations by overlaying them in a single graph that uses one common set of scales (e.g., to reveal the underlying structure of factors or dimensions in Discriminant Function Analysis).

See also, Data Reduction.

Scatterplot, 2D - Regular. The regular scatterplot visualizes a relation between two variables X and Y (e.g., weight and height). Individual data points are represented by point markers in two-dimensional space, where the axes represent the variables. The two coordinates (X and Y) that determine the location of each point correspond to its specific values on the two variables. If the two variables are strongly related, then the data points form a systematic shape (e.g., a straight line or a clear curve). If the variables are not related, then the points form an irregular "cloud".

Fitting functions to scatterplot data helps identify the patterns of relations between variables.

For more examples of how scatterplot data helps identify the patterns of relations between variables, see Outliers and Brushing. See also, Data Reduction.

Scatterplot, 3D. 3D scatterplots visualize a relationship between three or more variables, representing the X, Y, and one or more Z (vertical) coordinates of each point in 3-dimensional space.

See also, 3D Scatterplot - Custom Ternary Graph, Data Reduction and Data Rotation (in 3D space).

Scatterplot, 3D - Raw Data. An unsmoothed surface (no smoothing function is applied) is drawn through the points in the 3D scatterplot.

See also, Data Reduction.

Scatterplot, 3D - Ternary Graph. In this type of ternary graph, the triangular coordinate systems are used to plot four (or more) variables (the components X, Y, and Z, and the responses V1, V2, etc.) in three dimensions (ternary 3D scatterplots or surface plots). Here, the responses (V1, V2, etc.) associated with the proportions of the component variables (X, Y, and Z) in a ternary graph are plotted as the heights of the points.

See also, Data Reduction.

Scatterplot Smoothers. In 2D scatterplots, various smoothing methods are available to fit a function through the points to best represent (summarize) the relationship between the variables.
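For illustration, a LOWESS smoother can be fitted with statsmodels (a sketch on synthetic data; the frac setting is an arbitrary choice):

    import numpy as np
    from statsmodels.nonparametric.smoothers_lowess import lowess

    rng = np.random.default_rng(0)
    x = np.linspace(0, 10, 200)
    y = np.sin(x) + rng.normal(scale=0.3, size=x.size)   # noisy relationship

    # Returns smoothed (x, y) pairs; frac controls the share of the data used
    # for each local fit (larger values give smoother curves).
    smoothed = lowess(y, x, frac=0.3)
    x_smooth, y_smooth = smoothed[:, 0], smoothed[:, 1]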

Scheffe's test. This post hoc test can be used to determine the significant differences between group means in an analysis of variance setting. Scheffe's test is considered to be one of the most conservative post hoc tests (for a detailed discussion of different post hoc tests, see Winer, Brown, & Michels, 1991). For more details, see the General Linear Models chapter. See also, Post Hoc Comparisons. For a discussion of statistical significance, see Elementary Concepts.

Score Statistic. This statistic is used to evaluate the statistical significance of parameter estimates computed via maximum likelihood methods. It is also sometimes called the efficient score statistic. The test is based on the behavior of the log-likelihood function at the point where the respective parameter estimate is equal to 0.0 (zero); specifically, it uses the derivative (slope) of the log-likelihood function evaluated at the null hypothesis value of the parameter (parameter = 0.0). While this test is not as accurate as explicit likelihood-ratio test statistics based on the ratio of the likelihoods of the model that includes the parameter of interest, over the likelihood of the model that does not, its computation is usually much faster. It is therefore the preferred method for evaluating the statistical significance of parameter estimates in stepwise or best-subset model building methods.

An alternative statistic is the Wald statistic.
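In the single-parameter case, the test statistic can be sketched as follows (a standard formulation consistent with the description above, with U the slope of the log-likelihood and I the Fisher information, both evaluated at the null value of the parameter):

S = \frac{U(\theta_0)^2}{I(\theta_0)}, \qquad U(\theta) = \frac{\partial \log L(\theta)}{\partial \theta}, \qquad \theta_0 = 0

Under the null hypothesis, S is referred to a chi-square distribution with 1 degree of freedom.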

Scree Plot, Scree Test. The eigenvalues for successive factors can be displayed in a simple line plot. Cattell (1966) proposed that this scree plot can be used to graphically determine the optimal number of factors to retain.

The scree test involves finding the place where the smooth decrease of eigenvalues appears to level off to the right of the plot. To the right of this point, presumably, one finds only "factorial scree" -- "scree" is the geological term referring to the debris which collects on the lower part of a rocky slope. Thus, no more than the number of factors to the left of this point should be retained.
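A minimal sketch of how such a plot can be produced (using numpy and matplotlib on synthetic data; in practice the eigenvalues would come from the correlation matrix of the variables being factor-analyzed):

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 8))                        # 200 cases, 8 variables
    R = np.corrcoef(X, rowvar=False)                     # correlation matrix
    eigenvalues = np.sort(np.linalg.eigvalsh(R))[::-1]   # largest first

    plt.plot(np.arange(1, eigenvalues.size + 1), eigenvalues, marker="o")
    plt.xlabel("Factor number")
    plt.ylabel("Eigenvalue")
    plt.title("Scree plot")
    plt.show()
    # Retain only the factors to the left of the point where the curve levels off.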

For more information on procedures for determining the optimal number of factors to retain, see the section on Reviewing the Results of a Principal Components Analysis in the Factor Analysis chapter and How Many Dimensions to Specify in the Multi-dimensional Scaling chapter.

Semi-Partial (or Part) Correlation. The semi-partial or part correlation is similar to the partial correlation statistic. Like the partial correlation, it is a measure of the correlation between two variables that remains after controlling for (i.e., "partialling" out) the effects of one or more other predictor variables. However, while the squared partial correlation between a predictor X1 and a response variable Y can be interpreted as the proportion of (unique) variance accounted for by X1, in the presence of other predictors X2, ... , Xk, relative to the residual or unexplained variance that cannot be accounted for by X2, ... , Xk, the squared semi-partial or part correlation is the proportion of (unique) variance accounted for by X1 relative to the total variance of Y. Thus, the semi-partial or part correlation is a better indicator of the "practical relevance" of a predictor, because it is scaled to (i.e., relative to) the total variability in the dependent (response) variable.
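In terms of squared multiple correlations, the distinction can be written as follows (a standard identity, added here for reference; R^2_{Y.12...k} uses all predictors, R^2_{Y.2...k} omits X1):

sr_1^2 = R^2_{Y \cdot 12 \ldots k} - R^2_{Y \cdot 2 \ldots k}

pr_1^2 = \frac{R^2_{Y \cdot 12 \ldots k} - R^2_{Y \cdot 2 \ldots k}}{1 - R^2_{Y \cdot 2 \ldots k}}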

See also Correlation, Spurious Correlations, partial correlation, Basic Statistics, Multiple Regression, General Linear Models, General Stepwise Regression, Structural Equation Modeling (SEPATH).

SEMMA. See Models for Data Mining. See also Data Mining Techniques.

Sensitivity Analysis (in Neural Networks). A sensitivity analysis indicates which input variables are considered most important by that particular neural network. Sensitivity analysis can be used purely for informative purposes, or to perform input pruning.

Sensitivity analysis can give important insights into the usefulness of individual variables. It often identifies variables that can be safely ignored in subsequent analyses, and key variables that must always be retained. However, it must be deployed with some care, for reasons that are explained below.

Input variables are not, in general, independent - that is, there are interdependencies between variables. Sensitivity analysis rates variables according to the deterioration in modeling performance that occurs if that variable is no longer available to the model. In so doing, it assigns a single rating value to each variable. However, the interdependence between variables means that no scheme of single ratings per variable can ever reflect the subtlety of the true situation.

Consider, for example, the case where two input variables encode the same information (they might even be copies of the same variable). A particular model might depend wholly on one, wholly on the other, or on some arbitrary combination of them. Then sensitivity analysis produces an arbitrary relative sensitivity to them. Moreover, if either is eliminated the model may compensate adequately because the other still provides the key information. It may therefore rate the variables as of low sensitivity, even though they might encode key information. Similarly, a variable that encodes relatively unimportant information, but is the only variable to do so, may have higher sensitivity than any number of variables that mutually encode more important information.

There may be interdependent variables that are useful only if included as a set. If the entire set is included in a model, they can be accorded significant sensitivity, but this does not reveal the interdependency. Worse, if only part of the interdependent set is included, their sensitivity will be zero, as they carry no discernible information.

In summary, sensitivity analysis does not rate the "usefulness" of variables in modeling in a reliable or absolute manner. You must be cautious in the conclusions you draw about the importance of variables. Nonetheless, in practice it is extremely useful. If a number of models are studied, it is often possible to identify key variables that are always of high sensitivity, others that are always of low sensitivity, and "ambiguous" variables that change ratings and probably carry mutually redundant information.

How does sensitivity analysis work? Each input variable is treated in turn as if it were "unavailable" (Hunter, 2000). There is a missing value substitution procedure, which is used to allow predictions to be made in the absence of values for one or more inputs. To define the sensitivity of a particular variable, v, we first run the network on a set of test cases, and accumulate the network error.  We then run the network again using the same cases, but this time replacing the observed values of v with the value estimated by the missing value procedure, and again accumulate the network error.

Given that we have effectively removed some information that presumably the network uses (i.e. one of its input variables), we would reasonably expect some deterioration in error to occur. The basic measure of sensitivity is the ratio of the error with missing value substitution to the original error. The more sensitive the network is to a particular input, the greater the deterioration we can expect, and therefore the greater the ratio.

If the ratio is one or lower, then making the variable "unavailable" either has no effect on the performance of the network, or actually enhances it (!).

Once sensitivities have been calculated for all variables, they may be ranked in order.
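A minimal sketch of this procedure (mean substitution stands in here for whatever missing-value substitution the network actually uses; model.predict and the error function are generic placeholders, not a specific library's API):

    import numpy as np

    def sensitivity_ratios(model, X_test, y_test, error):
        """Error ratio for each input when its values are replaced by the mean."""
        baseline = error(y_test, model.predict(X_test))
        ratios = {}
        for j in range(X_test.shape[1]):
            X_sub = X_test.copy()
            X_sub[:, j] = X_sub[:, j].mean()   # simple missing-value substitution
            ratios[j] = error(y_test, model.predict(X_sub)) / baseline
        return ratios                          # ratios well above 1: the network relies on that input

    # Example error function: mean squared error
    mse = lambda y, yhat: float(np.mean((np.asarray(y) - np.asarray(yhat)) ** 2))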

Sequential Contour Plot, 3D. This contour plot presents a 2-dimensional projection of the spline-smoothed surface fit to the data (see 3D Sequential Surface Plot). Successive values of each series are plotted along the X-axis, with each successive series represented along the Y-axis.

Sequential/Stacked Plots. In this type of graph, the sequences of values from the selected variables are stacked on one another.

Sequential/Stacked Plots, 2D - Area. The sequence of values from each selected variable will be represented by consecutive areas stacked on one another in this type of graph.

Sequential/Stacked Plots, 2D - Column. The sequence of values from each selected variable will be represented by consecutive segments of vertical columns stacked on one another in this type of graph.

Sequential/Stacked Plots, 2D - Lines. The sequence of values from each selected variable will be represented by consecutive lines stacked on one another in this type of graph.

Sequential/Stacked Plots, 2D - Mixed Line. In this type of graph, the sequences of values of variables selected in the first list will be represented by consecutive areas stacked on one another while the sequences of values of variables selected in the second list will be represented by consecutive lines stacked on one another (over the area representing the last variable from the first list).

Sequential/Stacked Plots, 2D - Mixed Step. In this type of graph, the sequences of values of variables selected in the first list will be represented by consecutive step areas stacked on one another while the sequences of values of variables selected in the second list will be represented by consecutive step lines stacked on one another (over the step area representing the last variable from the first list).

Sequential/Stacked Plots, 2D - Step. The sequence of values from each selected variable will be represented by consecutive step lines stacked on one another in this type of graph.

Sequential/Stacked Plots, 2D - Step Area. The sequence of values from each selected variable will be represented by consecutive step areas stacked on one another in this type of graph.

Sequential Surface Plot, 3D. In this sequential plot, a spline-smoothed surface is fitted to the data points. Successive values of each series are plotted along the X-axis, with each successive series represented along the Y-axis.

Sets of Samples in Quality Control Charts. While monitoring an ongoing process, it often becomes necessary to adjust the center line values or control limits, as those values are being refined over time. Also, one may want to compute the control limits and center line values from a set of samples that are known to be in control, and apply those values to all subsequent samples. Thus, each set is defined by a set of computation samples (from which various statistics are computed, e.g., sigma, means, etc.) and a set of application samples (to which the respective statistics, etc. are applied). Of course, the computation samples and application samples can be (and often are) not the same. To reiterate, you may want to estimate sigma from a set of samples that are known to be in control (the computation set), and use that estimate for establishing control limits for all remaining and new samples (the application set).

Note that each sample must be uniquely assigned to one application set; in other words, each sample has control limits based on statistics (e.g., sigma) computed for one particular set. The assignment of application samples to sets proceeds in a hierarchical manner, i.e., each sample is assigned to the first set where it "fits" (where the definition of the application sample set would include the respective sample). This hierarchical search always begins at the last set that the user specified, and not with the all-samples set. Hence, if the user-specified sets encompass all valid samples, the default all-samples set will actually become empty (since all samples will be assigned to one of the user-defined sets).

Shapiro-Wilk W test. The Shapiro-Wilk W test is used in testing for normality. If the W statistic is significant, then the hypothesis that the respective distribution is normal should be rejected. The Shapiro-Wilk W test is the preferred test of normality because of its good power properties as compared to a wide range of alternative tests (Shapiro, Wilk, & Chen, 1968). Some software programs implement an extension to the test described by Royston (1982), which allows it to be applied to large samples (with up to 5,000 observations). See also Kolmogorov-Smirnov test and Lilliefors test.
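For example, the test is available in SciPy (a small added illustration):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    x = rng.normal(loc=10.0, scale=2.0, size=50)   # sample to be tested for normality

    w_statistic, p_value = stats.shapiro(x)
    if p_value < 0.05:
        print(f"W = {w_statistic:.3f}, p = {p_value:.3f}: reject normality")
    else:
        print(f"W = {w_statistic:.3f}, p = {p_value:.3f}: no evidence against normality")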

Shewhart Control Charts. This is a standard graphical tool widely used in statistical Quality Control. The general approach to quality control charting is straightforward: One extracts samples of a certain size from the ongoing production process, produces line charts of the variability in those samples, and considers their closeness to target specifications. If a trend emerges in those lines, or if samples fall outside pre-specified limits, then the process is declared to be out of control, and the operator will take action to find the cause of the problem. These types of charts are sometimes also referred to as Shewhart control charts (named after W. A. Shewhart, who is generally credited as being the first to introduce these methods; see Shewhart, 1931).

For additional information, see also Quality Control charts; Assignable causes and actions.

Short Run Control Charts. The short run quality control chart, for short production runs, plots transformations of the observations of variables or attributes for multiple parts, each of which constitutes a distinct "run," on the same chart. The transformations rescale the variable values of interest such that they are of comparable magnitudes across the different short production runs (or parts). The control limits computed for those transformed values can then be applied to determine whether the production process is in control, to monitor continuing production, and to establish procedures for continuous quality improvement.

Shuffle data (in Neural Networks). Randomly assigning cases to the training and verification sets, so that these are (as far as possible) statistically unbiased. See, Neural Networks.

Shuffle, Back Propagation (in Neural Networks). Presenting training cases in a random order on each epoch, to prevent various undesirable effects which can otherwise occur (such as oscillation and convergence to local minima). See, Neural Networks.

Sigma Restricted Model. A sigma restricted model uses the sigma-restricted coding to represent effects for categorical predictor variables in general linear models and generalized linear models. To illustrate the sigma-restricted coding, suppose that a categorical predictor variable called Gender has two levels (i.e., male and female). Cases in the two groups would be assigned values of 1 or -1, respectively, on the coded predictor variable, so that if the regression coefficient for the variable is positive, the group coded as 1 on the predictor variable will have a higher predicted value (i.e., a higher group mean) on the dependent variable, and if the regression coefficient is negative, the group coded as -1 on the predictor variable will have a higher predicted value on the dependent variable. This coding strategy is aptly called the sigma-restricted parameterization, because the values used to represent group membership (1 and -1) sum to zero.
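The same sum-to-zero convention extends to factors with more than two levels (an added illustration, not part of the original example): a three-level factor with levels A, B, and C requires two coded columns,

A \rightarrow (1, 0), \qquad B \rightarrow (0, 1), \qquad C \rightarrow (-1, -1)

so that each coded column sums to zero across the levels and each coefficient represents the deviation of the corresponding group mean from the grand mean.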

See also categorical predictor variables, design matrix; or General Linear Models.

Sigmoid function. An S-shaped curve, with a near-linear central response and saturating limits.

See also, logistic function and hyperbolic tangent function.

Signal detection theory (SDT). Signal detection theory (SDT) is an application of statistical decision theory used to detect a signal embedded in noise. SDT is used in psychophysical studies of detection, recognition, and discrimination, and in other areas such as medical research, weather forecasting, survey research, and marketing research.

A general approach to estimating the parameters of the signal detection model is via the use of the generalized linear model. For example, DeCarlo (1998) shows how signal detection models based on different underlying distributions can easily be considered by using the generalized linear model with different link functions.

For discussion of the generalized linear model and the link functions which it uses, see the Generalized Linear Models chapter.

Simple Random Sampling (SRS). Simple random sampling is a type of probability sampling where observations are randomly selected from a population with a known probability or sampling fraction. Typically, one begins with a list of N observations that comprises the entire population from which one wishes to extract a simple random sample (e.g., a list of registered voters); one can then generate k random case numbers (without replacement) in the range from 1 to N, and select the respective cases into the final sample (with a sampling fraction or known selection probability of k/N).
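A minimal sketch of this selection scheme (the population size and sample size are placeholders matching the numbers used under Sampling Fraction):

    import random

    N = 1_000_000                    # size of the population list (e.g., registered voters)
    k = 100                          # desired sample size

    random.seed(42)                  # for a reproducible example
    sample_ids = random.sample(range(1, N + 1), k)   # k case numbers drawn without replacement

    sampling_fraction = k / N        # known selection probability for every case (here 1/10,000)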

Refer to, for example, Kish (1965) for a detailed discussion of the advantages and characteristics of probability samples and EPSEM samples.

Simplex algorithm. A nonlinear estimation algorithm that does not rely on the computation or estimation of the derivatives of the loss function. Instead, at each iteration the function will be evaluated at m+1 points in the m dimensional parameter space. For example, in two dimensions (i.e., when there are two parameters to be estimated), the program will evaluate the function at three points around the current optimum. These three points would define a triangle; in more than two dimensions, the "figure" produced by these points is called a Simplex.
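For illustration, the simplex (Nelder-Mead) method is available in SciPy; a minimal sketch with an arbitrary two-parameter loss function:

    import numpy as np
    from scipy.optimize import minimize

    def loss(params):
        """Arbitrary example loss with two parameters to be estimated."""
        a, b = params
        return (a - 3.0) ** 2 + (b + 1.0) ** 2 + np.sin(a * b) ** 2

    result = minimize(loss, x0=[0.0, 0.0], method="Nelder-Mead")
    print(result.x)   # estimated parameters; no derivatives of the loss were required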

Single and Multiple Censoring. There are situations in which censoring can occur at different times (multiple censoring), or only at a particular point in time (single censoring). Consider an example experiment where we start with 100 light bulbs, and terminate the experiment after a certain amount of time. If the experiment is terminated at a particular point in time, then a single point of censoring exists, and the data set is said to be single-censored. However, in biomedical research multiple censoring often exists, for example, when patients are discharged from a hospital after different amounts (times) of treatment, and the researcher knows that the patient survived up to those (differential) points of censoring.

Data sets with censored observations can be analyzed via Survival Analysis or via Weibull and Reliability/Failure Time Analysis. See also, Type I and II Censoring and Left and Right Censoring.

Singular Value Decomposition. An efficient algorithm for optimizing a linear model.

See also, pseudo-inverse.

Six Sigma (DMAIC). Six Sigma is a well-structured, data-driven methodology for eliminating defects, waste, or quality control problems of all kinds in manufacturing, service delivery, management, and other business activities. Six Sigma methodology is based on the combination of well-established statistical quality control techniques, simple and advanced data analysis methods, and the systematic training of all personnel at every level in the organization involved in the activity or process targeted by Six Sigma.

Six Sigma methodology and management strategies provide an overall framework for organizing company wide quality control efforts. These methods have recently become very popular, due to numerous success stories from major US-based as well as international corporations. For reviews of Six Sigma strategies, refer to Harry and Schroeder (2000), or Pyzdek (2001).

The activities that make up the Six Sigma effort are organized into five categories: Define (D), Measure (M), Analyze (A), Improve (I), and Control (C); or DMAIC for short.

Define. The Define phase is concerned with the definition of project goals and boundaries, and the identification of issues that need to be addressed to achieve the higher sigma level.

Measure. The goal of the Measure phase is to gather information about the current situation, to obtain baseline data on current process performance, and to identify problem areas.

Analyze. The goal of the Analyze phase is to identify the root cause(s) of quality problems, and to confirm those causes using the appropriate data analysis tools.

Improve. The goal of the Improve phase is to implement solutions that address the problems (root causes) identified during the previous (Analyze) phase.

Control. The goal of the Control phase is to evaluate and monitor the results of the previous phase (Improve).

Six Sigma Process. A six sigma process is one that can be expected to produce only 3.4 defects per one million opportunities. The concept of the six sigma process is important in Six Sigma quality improvement programs. The idea can be summarized as follows.

The term Six Sigma derives from the goal of achieving process variation small enough that ± 6 * sigma (where sigma is the estimate of the population standard deviation) "fits" inside the lower and upper specification limits for the process. In that case, even if the process mean shifts by 1.5 * sigma in one direction (e.g., by +1.5 sigma toward the upper specification limit), the process will still produce very few defects.

For example, suppose we expressed the area above the upper specification limit in terms of one million opportunities to produce defects. The 6 * sigma process shifted upwards by 1.5 * sigma will only produce 3.4 defects (i.e., "parts" or "cases" greater than the upper specification limit) per one million opportunities.

Shift. An ongoing process that at some point was centered will shift over time. Motorola, in their implementation of Six Sigma strategies, determined that it is reasonable to assume that a process will shift over time by approximately 1.5 * sigma (see, for example, Harry and Schroeder, 2000). Hence, most standard Six Sigma calculators will be based on a 1.5 * sigma shift.

One-sided vs. two-sided limits. In the example above, the area beyond the upper specification limit (greater than the USL) is defined as one million opportunities to produce defects. Of course, in many cases any "outcomes" (e.g., parts) that fall below the lower specification limit can be equally defective. In that case one may want to consider the lower tail of the respective (shifted) normal distribution as well. However, in practice one usually ignores the lower tail of the normal curve because (1) in many cases, the process "naturally" has one-sided specification limits (e.g., very low delay times are not really a defect, only very long times; very few customer complaints are not a problem, only very many, etc.), and (2) when a 6 * sigma process has been achieved, the area under the normal curve below the lower specification limit is negligible.

Yield. The discussion above focuses on the number of defects that a process produces. The number of non-defects can be considered the Yield of the process. Six Sigma calculators will compute the number of defects per million opportunities (DPMO) as well as the yield, expressed as the percent of the area under the normal curve that falls below the upper specification limit.
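The 3.4 DPMO figure and the corresponding yield can be reproduced directly from the normal distribution (a small added check, assuming the conventional 1.5 * sigma shift and a one-sided limit):

    from scipy.stats import norm

    sigma_level = 6.0
    shift = 1.5
    z = sigma_level - shift                 # distance from the shifted mean to the USL, in sigmas

    dpmo = norm.sf(z) * 1_000_000           # defects per million opportunities
    yield_pct = norm.cdf(z) * 100           # percent of output inside the upper limit

    print(f"DPMO  = {dpmo:.1f}")            # approximately 3.4
    print(f"Yield = {yield_pct:.5f}%")      # approximately 99.99966%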

Skewness. Skewness (this term was first used by Pearson, 1895) measures the deviation of the distribution from symmetry. If the skewness is clearly different from 0, then that distribution is asymmetrical, while normal distributions are perfectly symmetrical.

\text{Skewness} = \frac{n\,M_3}{(n-1)\,(n-2)\,\sigma^3}

where
M_3     is equal to: \sum (x_i - \bar{x})^3
\sigma^3     is the standard deviation (sigma) raised to the third power
n        is the valid number of cases.
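The same adjusted coefficient can be obtained, for example, with SciPy (an added cross-check of the formula above):

    import numpy as np
    from scipy.stats import skew

    x = np.array([2.0, 3.0, 5.0, 8.0, 13.0, 21.0])
    print(skew(x, bias=False))   # bias-corrected skewness, matching the formula above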

See also, Descriptive Statistics Overview.

Smoothing. Smoothing techniques can be used in two different situations. Smoothing techniques for 3D Bivariate Histograms allow you to fit surfaces to 3D representations of bivariate frequency data. Thus, every 3D histogram can be turned into a smoothed surface providing a sensitive method for revealing non-salient overall patterns of data and/or identifying patterns to use in developing quantitative models of the investigated phenomenon.

In Time Series analysis, the general purpose of smoothing techniques is to "bring out" the major patterns or trends in a time series, while de-emphasizing minor fluctuations (random noise). Visually, as a result of smoothing, a jagged line pattern should be transformed into a smooth curve.

See also, Exploratory Data Analysis and Data Mining Techniques, and Smoothing Bivariate Distributions.

SOFMs (Self-organizing feature maps; Kohonen Networks). Neural networks based on the topological properties of the human brain, also known as Kohonen Networks (Kohonen, 1982; Fausett, 1994; Haykin, 1994; Patterson, 1996).

Softmax. A specialized activation function for one-of-N encoded classification networks. Performs a normalized exponential (i.e. the outputs add up to 1). In combination with the cross entropy error function, allows multilayer perceptron networks to be modified for class probability estimation (Bishop, 1995; Bridle, 1990). See, Neural Networks.

Space Plots. This type of graph offers a distinctive means of representing 3D Scatterplot data through the use of a separate X-Y plane positioned at a user-selectable level of the vertical Z-axis (which "sticks up" through the middle of the plane).

The Space Plot's specific layout may facilitate exploratory examination of specific types of three-dimensional data. It is recommended to assign variables to axes such that the variable that is most likely to discriminate between patterns of relation among the other two is specified as Z.

See also, Data Rotation (in 3D space) in the Graphical Techniques chapter.

Spearman R. Spearman R can be thought of as the regular Pearson product-moment correlation coefficient (Pearson r), interpretable in terms of the proportion of variability accounted for, except that Spearman R is computed from ranks. Spearman R assumes that the variables under consideration were measured on at least an ordinal (rank order) scale; that is, the individual observations (cases) can be ranked into two ordered series. Detailed discussions of the Spearman R statistic, its power and efficiency can be found in Gibbons (1985), Hays (1981), McNemar (1969), Siegel (1956), Siegel and Castellan (1988), Kendall (1948), Olds (1949), or Hotelling and Pabst (1936).
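For example, Spearman R can be computed alongside Pearson r in SciPy (a small added illustration):

    from scipy.stats import spearmanr, pearsonr

    x = [1, 2, 3, 4, 5, 6, 7, 8]
    y = [2, 1, 4, 3, 7, 8, 6, 9]

    rho, p_value = spearmanr(x, y)   # Pearson r computed on the ranks of x and y
    r, _ = pearsonr(x, y)
    print(rho, p_value, r)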

Spectral Plot. The original application of this type of plot was in the context of spectral analysis in order to investigate the behavior of non-stationary time series. On the horizontal axes one can plot the frequency of the spectrum against consecutive time intervals, and indicate on the Z-axis the spectral densities at each interval (see for example, Shumway, 1988, page 82).

Spectral plots have clear advantages over regular 3D Scatterplots when you are interested in examining how a relationship between two variables changes across the levels of a third variable. For example, a spectral plot can make it easier to see that a relationship between Pressure and Yield changes from an "inverted U" shape to a "U" shape across the levels of the third variable.

See also, Data Rotation (in 3D space) in the Graphical Techniques chapter.

Spikes (3D graphs). In this type of graph, individual values of one or more series of data are represented along the X-axis as a series of "spikes" (point symbols with lines descending to the base plane). Each series to be plotted is spaced along the Y-axis. The "height" of each spike is determined by the respective value of each series.

Spline (2D graphs). A curve is fitted to the XY coordinate data using the cubic spline smoothing procedure.

Spline (3D graphs). A surface is fitted to the XYZ coordinate data using the bicubic spline smoothing procedure.

Split Selection (for Classification Trees). Split selection for classification trees refers to the process of selecting the splits on the predictor variables that are used to predict membership in the classes of the dependent variable for the cases or objects in the analysis. Given the hierarchical nature of classification trees, these splits are selected one at a time, starting with the split at the root node, and continuing with splits of the resulting child nodes until splitting stops, and the child nodes that have not been split become terminal nodes.

The split selection process is described in the Computational Methods section of the Classification Trees chapter.

Spurious Correlations. Correlations that are due mostly to the influences of one or more "other" variables. For example, there is a correlation between the total amount of losses in a fire and the number of firemen putting out the fire; however, this correlation does not mean that calling fewer firemen would lower the losses. There is a third variable (the initial size of the fire) that influences both the amount of losses and the number of firemen. If you "control" for this variable (e.g., consider only fires of a fixed size), then the correlation will either disappear or perhaps even change its sign. The main problem with spurious correlations is that we typically do not know what the "hidden" agent is. However, in cases when we know where to look, we can use partial correlations that control for (i.e., partial out) the influence of specified variables.
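When the "hidden" variable Z is known, the first-order partial correlation that controls for it can be computed from the three pairwise correlations (a standard formula, added here for reference):

r_{XY \cdot Z} = \frac{r_{XY} - r_{XZ}\, r_{YZ}}{\sqrt{(1 - r_{XZ}^2)(1 - r_{YZ}^2)}}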

See also Correlation, Partial Correlation, Basic Statistics, Multiple Regression, Structural Equation Modeling (SEPATH).

SQL. SQL (Structured Query Language) enables you to query an outside data source about the data it contains. You can use a SQL statement in order to specify the desired tables, fields, rows, etc. to return as data. For information on SQL syntax, please consult an SQL manual.

Square Root of the Signal to Noise Ratio (f). This standardized measure of effect size is used in the Analysis of Variance to characterize the overall level of population effects, and is very similar to the RMSSE. It is the square root of the sum of squared standardized effects divided by the number of effects. For example, in a 1-Way ANOVA with J groups, f is calculated as shown below.
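One way to write this, consistent with the verbal definition above (with \mu_j denoting the j-th group mean, \mu the grand mean, and \sigma the common within-group standard deviation):

f = \sqrt{ \frac{1}{J} \sum_{j=1}^{J} \left( \frac{\mu_j - \mu}{\sigma} \right)^2 }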

For more information see the chapter on Power Analysis.

Stacked Generalization. See Stacking.

Stacking (Stacked Generalization). The concept of stacking (short for Stacked Generalization) applies to the area of predictive data mining, to combine the predictions from multiple models. It is particularly useful when the types of models included in the project are very different.

Suppose your data mining project includes tree classifiers, such as C&RT and CHAID, linear discriminant analysis (e.g., see GDA), and Neural Networks. Each computes predicted classifications for a cross-validation sample, from which overall goodness-of-fit statistics (e.g., misclassification rates) can be computed. Experience has shown that combining the predictions from multiple methods often yields more accurate predictions than can be derived from any one method (e.g., see Witten and Frank, 2000). In stacking, the predictions from different classifiers are used as input into a meta-learner, which attempts to combine the predictions to create a final best predicted classification. So, for example, the predicted classifications from the tree classifiers, the linear model, and the neural network classifier(s) can be used as input variables into a neural network meta-classifier, which will attempt to "learn" from the data how to combine the predictions from the different models to yield maximum classification accuracy.
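A minimal sketch of this idea using scikit-learn (the choice of base classifiers, meta-learner, and data set here is illustrative only and does not correspond to the modules named above):

    from sklearn.datasets import load_iris
    from sklearn.ensemble import StackingClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.neural_network import MLPClassifier
    from sklearn.model_selection import cross_val_score

    X, y = load_iris(return_X_y=True)

    base_learners = [
        ("tree", DecisionTreeClassifier(max_depth=3)),
        ("lda", LinearDiscriminantAnalysis()),
    ]
    # The meta-learner combines the base classifiers' predictions into a final classification.
    stacked = StackingClassifier(estimators=base_learners,
                                 final_estimator=MLPClassifier(max_iter=2000))

    print(cross_val_score(stacked, X, y, cv=5).mean())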

Other methods for combining the predictions from multiple models or methods (e.g., from multiple datasets used for learning) are Boosting and Bagging (Voting).

Standard Deviation. The standard deviation (this term was first used by Pearson, 1894) is a commonly used measure of variation. The standard deviation of a population of values is computed as:

\sigma = \left[ \frac{\sum (x_i - \mu)^2}{N} \right]^{1/2}

where
\mu     is the population mean
N    is the population size.
The sample estimate of the population standard deviation is computed as:

s = \left[ \frac{\sum (x_i - \bar{x})^2}{n-1} \right]^{1/2}

where
\bar{x}   is the sample mean
n        is the sample size.

See also, Descriptive Statistics Overview.

Standard Error. The standard error (this term was first used by Yule, 1897) is the standard deviation of a mean and is computed as:

\text{std.err.} = \sqrt{s^2/n}

where
s^2 is the sample variance
n is the sample size.

Standard Error of the Mean. The standard error of the mean (first used by Yule, 1897) is the theoretical standard deviation of all sample means of size n drawn from a population and depends on both the population variance (sigma^2) and the sample size (n) as indicated below:

\sigma_{\bar{x}} = \left( \sigma^2 / n \right)^{1/2}

where
\sigma^2   is the population variance and
n      is the sample size.

Since the population variance is typically unknown, the best estimate for the standard error of the mean is then calculated as:

s_{\bar{x}} = \left( s^2 / n \right)^{1/2}

where
s^2    is the sample variance (our best estimate of the population variance) and
n    is the sample size.

See also, Descriptive Statistics Overview.

Standard Error of the Proportion. This is the standard deviation of the distribution of the sample proportion over repeated samples. If the population proportion is p, and the sample size is N, the standard error of the proportion when sampling from an infinite population is

s_p = \sqrt{\frac{p\,(1-p)}{N}}

For more information see the chapter on Power Analysis.

Standard residual value. This is the standardized residual value (observed minus predicted divided by the square root of the residual mean square).

See also, Mahalanobis distance, deleted residual and Cook’s distance.

Standardization. While in everyday language the term "standardization" means converting to a common standard or making something conform to a standard (i.e., its meaning is similar to the term "normalization" in data analysis, see normalization), in statistics this term has a very specific meaning: it refers to the transformation of data by subtracting a reference value (typically the sample mean) from each value and dividing the result by the standard deviation (typically the sample SD). This transformation brings all values (regardless of their distributions and original units of measurement) to compatible units: a distribution with a mean of 0 and a standard deviation of 1. It has a wide variety of applications because it makes the distributions of values easy to compare across variables and/or subsets. If applied to the input data, standardization also makes the results of a variety of statistical techniques entirely independent of the ranges of values or the units of measurement (see the discussion of these issues in Elementary Concepts, Basic Statistics, Multiple Regression, Factor Analysis, and others).
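In symbols, the standardized (z) score of each value is

z_i = \frac{x_i - \bar{x}}{s}

where \bar{x} is the sample mean and s is the sample standard deviation.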

Standardized DFFITS. This is another measure of the impact of the respective case on the regression equation. The formula for standardized DFFITS is

SDFIT_i = \frac{DFFIT_i}{s_{(i)}\,\sqrt{1/N + h_i}}

where
h_i is the leverage for the ith case.
See also, DFFITS, studentized residuals, and studentized deleted residuals. For more information see Hocking (1996) and Ryan (1997).

Standardized Effect (Es). A statistical effect expressed in convenient standardized units. For example, the standardized effect in a 2-sample t-test is the difference between the two means, divided by the standard deviation, i.e.,

E_s = \frac{\mu_1 - \mu_2}{\sigma}

For more information see the chapter on Power Analysis.

Stationary Series (in Time Series). In Time Series analysis, a stationary series has a constant mean, variance, and autocorrelation through time (i.e., seasonal dependencies have been removed via Differencing).

Statistical Power. The probability of rejecting a false statistical null hypothesis.

For more information see the chapter on Power Analysis.

Statistical Process Control (SPC). The term Statistical Process Control (SPC) is typically used in the context of manufacturing processes (although it may also pertain to services and other activities), and it denotes statistical methods used to monitor and improve the quality of the respective operations. By gathering information about the various stages of the process and performing statistical analysis on that information, the SPC engineer is able to take necessary action (often preventive) to ensure that the overall process stays in control and to allow the product to meet all desired specifications. SPC involves monitoring processes, identifying problem areas, recommending methods to reduce variation and verifying that they work, optimizing the process, assessing the reliability of parts, and other analytic operations. SPC uses such basic statistical quality control methods as quality control charts (Shewhart, Pareto, and others), capability analysis, gage repeatability/reproducibility analysis, and reliability analysis. However, specialized experimental methods (DOE) and other advanced statistical techniques are also often part of global SPC systems. Important components of effective, modern SPC systems are real-time access to data and facilities to document and respond to incoming QC data on-line, efficient central QC data warehousing, and groupware facilities allowing QC engineers to share data and reports (see also Enterprise SPC).

See also, Quality Control and Process Analysis.

For more information on process control systems, see the ASQC/AIAG's Fundamental statistical process control reference manual (1991).

Statistical Significance (p-level). The statistical significance of a result is an estimated measure of the degree to which it is "true" (in the sense of "representative of the population"). More technically, the value of the p-level represents a decreasing index of the reliability of a result. The higher the p-level, the less we can believe that the observed relation between variables in the sample is a reliable indicator of the relation between the respective variables in the population. Specifically, the p-level represents the probability of error that is involved in accepting our observed result as valid, that is, as "representative of the population." For example, a p-level of .05 (i.e., 1/20) indicates that there is a 5% probability that the relation between the variables found in our sample is a "fluke." In other words, assuming that in the population there was no relation between those variables whatsoever, and we were repeating experiments like ours one after another, we could expect that in approximately every 20 replications of the experiment there would be one in which the relation between the variables in question would be as strong as or stronger than in ours. In many areas of research, a p-level of .05 is customarily treated as a "borderline acceptable" error level.

See also, Elementary Concepts.

STATISTICA Advanced Linear/Nonlinear Models. StatSoft's STATISTICA Advanced Linear/Nonlinear Models offers a wide array of the most advanced modeling and forecasting tools on the market, including automatic model selection facilities and extensive interactive visualization tools.

General Linear Models
Generalized Linear/Nonlinear Models
General Regression Models
General Partial Least Squares Models
Variance Components
Survival Analysis
Nonlinear Estimation
Fixed Nonlinear Regression
Log-Linear Analysis of Frequency Tables
Time Series/Forecasting
Structural Equation Modeling, and more.

STATISTICA Automated Neural Networks (SANN). StatSoft's STATISTICA Automated Neural Networks (SANN) contains the most comprehensive neural network algorithms and training methods.

Automatic Search for Best Architecture and Network Solutions
Multilayer Perceptrons
Radial Basis Function Networks
Self-Organizing Feature Maps
Time Series Neural Networks for both Regression and Classification problems
A variety of algorithms for fast and efficient training of Neural Network Models including Gradient Descent, Conjugate Gradient, and BFGS
Numerous analytical graphs to aid in generating results and drawing conclusions
Sampling of data into subsets for optimizing network performance and enhancing the generalization ability
Sensitivity Analysis, Lift Charts, and ROC Curves
Creation of Ensembles out of already existing standalone networks
C-code and PMML (Predictive Model Markup Language) Neural Network Code Generators that are easy to deploy.

STATISTICA Base. StatSoft's STATISTICA Base offers a comprehensive set of essential statistics in a user-friendly package and all the performance, power, and ease of use of the STATISTICA technology.

All STATISTICA graphics tools
Basic Statistics, Breakdowns, and Tables
Distribution Fitting
Multiple Linear Regression
Analysis of Variance
Nonparametrics, and more.

STATISTICA Data Miner. StatSoft's STATISTICA Data Miner offers the most comprehensive selection of data mining solutions on the market, with an icon-based, extremely easy-to-use user interface (optionally Web browser based via WebSTATISTICA) and a deployment engine. It features a selection of completely integrated and automated, ready-to-deploy "as is" (but also easily customizable) systems of specific data mining solutions for a wide variety of business applications. A designated SPC version (QC Data Miner) to mine/analyze large streams of QC data is also available. The data mining solutions are driven by powerful procedures from five modules:

General Slicer/Dicer Explorer (with optional OLAP)
General Classifier
General Modeler/Multivariate Explorer
General Forecaster
General Neural Networks Explorer, and more

STATISTICA Data Warehouse. StatSoft's STATISTICA Data Warehouse is a complete, powerful, scalable, and customizable intelligent data warehouse solution, which also optionally offers the most complete analytic functionality available on the market, fully integrated into the system. STATISTICA Data Warehouse consists of a suite of powerful, flexible component applications, including:

STATISTICA Data Warehouse Server Database
STATISTICA Data Warehouse Query (featuring WebSTATISTICA Query)
STATISTICA Data Warehouse Analyzer (featuring WebSTATISTICA Data Miner, WebSTATISTICA Text Miner, WebSTATISTICA QC Miner, or the complete set of WebSTATISTICA analytics)
STATISTICA Data Warehouse Reporter (featuring WebSTATISTICA Knowledge Portal and/or WebSTATISTICA Interactive Knowledge Portal)
STATISTICA Data Warehouse Document Repository (featuring WebSTATISTICA Document Management System)
STATISTICA Data Warehouse Scheduler
STATISTICA Data Warehouse Real Time Monitor and Reporter (featuring WebSTATISTICA Enterprise or WebSTATISTICA Enterprise/QC)

If you are new to data warehousing, StatSoft consultants will guide you step by step through the entire process of designing the optimal data warehouse architecture - from a comprehensive review of your information storage and extraction/analysis needs, to the final training of your employees and support of your daily operations.

Crucial features and benefits. The crucial features and benefits of STATISTICA Data Warehouse solutions include, among many others:
Complete data warehousing application tailored to your business
Platform independent architecture for seamless integration with your existing infrastructure
Facilities to integrate data from a wide variety of sources
Virtually unlimited scalability
Options to update/synchronize data from multiple sources via automatic schedulers or on demand
Completely Web-enabled system architecture to provide ultimate enterprise functionality for all company locations around the world (e.g., access via Web browsers from any location)
Advanced security model and authentication of users
Complete document management options to optimize management of documents of any types and satisfy regulatory requirements (e.g., FDA 21 CFR Part 11, ISO 9000)
Advanced analytic components to clean/verify data and to integrate automated data mining, artificial intelligence, and real-time process monitoring
Options to automatically run and post on Knowledge Portals (or broadcast) highly customized reports, including interactive (i.e., drillable, sliceable, and user-customizable) reports and results of advanced analytics
Backup and archiving options
Programmable, customizable, and expandable to adapt to specific mission profiles (open architecture, exposed to extensions using the most industry standard languages, such as VB, C++, Java, HTML)
Built on robust, well tested, highly scalable, cutting-edge technology to leverage your investment [including highly optimized in-place database processing (IDP) technology, true multithreading, distributed/parallel processing, and support for pooling CPU resources of multiple servers to deliver supercomputer-like performance]

STATISTICA Data Warehouse is a complete intelligent data storage and information delivery/distribution solution that enables you to customize the flow of information through your organization, provide all authorized members of your organization with flexible, secure, and rapid access to critical information and intelligent reporting.

The system is virtually platform independent and will fit into any existing database architecture and hardware environment. It will efficiently combine information from multiple database formats and sources (from manual data entry forms to large batteries of automatic data collection devices). The system can be further enhanced through integration with other fully compatible components of the STATISTICA line of applications and solutions; to name just a few:

STATISTICA Data Miner for advanced data mining and artificial intelligence (e.g., neural networks) based solutions to provide decision support through cutting-edge methods for knowledge extraction and prediction
Quality Control Miner and Enterprise/QC for tight integration with quality control, process control, and yield management activities
STATISTICA Text Miner for automatic processing of unstructured information in documents, databases, or Web directories (Web-crawling of URLs)
STATISTICA Knowledge Portal for presenting summary reports, charts, and action items to end users (management, sales force, engineers, etc.) through secure access portals via the Web; to deliver key intelligence and decision support to stakeholders worldwide (e.g., you access the STATISTICA Knowledge Portal via standard Web browsers from anywhere in the world)

Architecture and connectivity. STATISTICA Data Warehouse connects to any platform, database, or data source, and will scale to businesses and applications of any size. The program is built on a database and database schema customized for your particular business. The solution can be installed either inclusive of a high performance database engine (SQL Server) or as a (virtual) database schema compatible with most industry standard databases; therefore, it will seamlessly integrate into existing database systems. Because STATISTICA Data Warehouse does not depend on one particular database vendor or hardware platform, it is itself entirely platform-independent. The main Data Warehouse software will connect to any database format and, hence, can efficiently combine and pool information from multiple sources. STATISTICA Data Warehouse application software will run on servers with multiple processors or banks of multiple-processor servers for supercomputer-like performance. The system will scale effortlessly and economically to even huge data sizes and analysis (intelligence) problems.

Web enablement. STATISTICA Data Warehouse extracts information from sources anywhere in the world and delivers intelligence anywhere in the world.

The Web component of the system is built on the proven WebSTATISTICA technology that is used by organizations worldwide to provide secure access via standard Web browsers. Unlike other Web-based solutions, STATISTICA Data Warehouse does not require any additional components to be installed on the (thin) client machines. Hence, the system can be utilized by (authorized and authenticated) users worldwide from hotel rooms via dial-up modems, from home, or from office and production facilities located at the most remote places on earth (e.g., via satellite Web links).

Advanced security and authentication. The STATISTICA Data Warehouse implements a detailed and sophisticated security system to ensure that your proprietary knowledge and intelligence is safe from unauthorized access. The system will likely become the most important repository of business intelligence and decision support resources in your organization. Therefore, the security of the system is a crucial priority so that those valuable resources are shielded from unauthorized access.

STATISTICA Data Warehouse implements the highest level of security by establishing groups of users with different levels of authority (regarding the information that is accessible and the operations that can be performed), requiring regularly updated passwords, etc. Special methods are also in place to detect and guard against systematic electronic intrusions ("hacking").

Document control. STATISTICA Data Warehouse enables full document management, compliant with government and industry standards.

STATISTICA Document Management System can be seamlessly integrated into your STATISTICA Data Warehouse application to optimize the flow of information within your organization and thus increase your productivity. This system can also be configured to comply with all (corporate) documentation management policies or regulatory requirements for document security, audit trails, and electronic signatures/authentication (as, for example, stipulated by FDA 21 CFR Part 11: Electronic Records; Electronic Signatures; or ISO 9001 4.5: Document and data control).

Advanced analytics. STATISTICA Data Warehouse can incorporate the most advanced data analysis and knowledge extraction methods available; you can go far beyond OLAP to simplify and extract knowledge about even the most complex - and inaccessible to other applications - patterns in the data.

Because STATISTICA Data Warehouse is built from the same high performance components as the entire STATISTICA line of analytic solutions software, those analytic solutions can easily and seamlessly be integrated into your Data Warehouse. STATISTICA offers the most comprehensive set of tools for data mining, text mining, data analysis, graphics and visualization, quality and process control (including Six Sigma), etc. on the market. These resources and technologies can be connected to the data sources in the STATISTICA Data Warehouse to leverage the most advanced technologies and algorithms available for analyzing and extracting key intelligence from all sources. For example, you can apply hundreds of neural networks architectures, highest performance tree classifiers (e.g., stochastic gradient boosting trees), flexible root cause analyses, control charting methods, powerful business forecasting methods, or sophisticated analytic graphics methods to convert raw data in the Data Warehouse into useful and actionable intelligence with clear implications for decisions affecting your business.

Programmability and customizability. STATISTICA Data Warehouse is an open-architecture system that will not lock you into a relationship with a single vendor or solution; you can respond quickly to new business demands and requirements that need to be incorporated into the Data Warehouse.

As with all applications and solutions in the STATISTICA family of products, STATISTICA Data Warehouse is fully programmable and customizable, using industry-standard programming tools such as Visual Basic, C++, Java, or HTML. This feature is of key importance when your business depends on your ability to quickly adapt to new information and business realities. Because you can customize the system without being forced to rely on the programmers of a single vendor or on knowledge of idiosyncratic scripting conventions (required by many competing solutions), you have the freedom to develop your own proprietary extensions to the data warehouse and to add not only your own reports but also custom analytic and data transformation/cleaning procedures, using widely available resources and industry-standard tools (e.g., VB, C++, Java, or HTML tools and programmers). Of course, StatSoft can always offer you a full complement of consulting, system integration, and programming services delivered by an experienced staff, if you choose to work with us.

STATISTICA Design of Experiments. StatSoft's STATISTICA Design of Experiments features the largest selection of DOE and related visualization techniques including interactive desirability profilers (a comprehensive tool for Six Sigma methods).

Fractional Factorial Designs
Mixture Designs
Latin Squares
Search for Optimal 2^(k-p) Designs
Residual Analysis and Transformations
Optimization of single/multiple response variables
Central Composite Designs
Taguchi Designs
Minimum Aberration & Maximum Unconfounding
2^(k-p) Fractional Factorial Designs with Blocks
Constrained Surfaces
D- and A-Optimal Designs
Desirability profilers, and more

STATISTICA Document Management System (SDMS). StatSoft's STATISTICA Document Management System (SDMS) is a complete, highly scalable, database solution package for managing electronic documents. With the STATISTICA Document Management System, you can quickly, efficiently, and securely manage documents of any type [e.g., find them, access them, search for content, review, organize, edit (with trail logging and versioning), approve, etc.].

Extremely transparent and easy to use
Flexible, customizable (optionally browser/Web-enabled) user interface
Electronic signatures
Comprehensive audit trails, approvals
Optimized searches
Document comparison tools
Security
Satisfies the FDA 21 CFR Part 11 requirements
Satisfies ISO 9000 (9001, 14001) documentation requirements
Unlimited scalability (from desktop or network Client-Server versions, to the ultimate size, Web-based worldwide systems)
Open architecture and compatibility with industry standards

STATISTICA Enterprise. StatSoft's STATISTICA Enterprise is an integrated multi-user system designed for general-purpose data analysis and business intelligence applications in research. STATISTICA Enterprise can optionally offer the statistical functionality available in any or all STATISTICA products.

Integration with data warehouses
Intuitive query and filtering tools
Easy-to-use administration tools
Automatic report distribution
Alarm notification, and more

STATISTICA Enterprise/QC. StatSoft's STATISTICA Enterprise/QC is designed for local and global enterprise quality control and improvement applications including Six Sigma. STATISTICA Enterprise/QC offers a high-performance database (or an optimized interface to existing databases), real-time monitoring and alarm notification for the production floor, a comprehensive set of analytical tools for engineers, sophisticated reporting features for management, Six Sigma reporting options, and much more.

Web-enabled user interface and reporting tools; interactive querying tools
User-specific interfaces for operators, engineers, etc.
Groupware functionality for sharing queries, special applications, etc.
Open-ended alarm notification including cause/action prompts
Scalable, customizable, and can be integrated into existing database/ERP systems, and more

STATISTICA Monitoring and Alerting Server (MAS). StatSoft's STATISTICA Monitoring and Alerting Server (MAS) is a system that enables users to automate the continual monitoring of hundreds or thousands of critical process and product parameters. The ongoing monitoring is an automated and efficient method for:

Monitoring many critical parameters simultaneously
Providing status "snapshots" from the results of these monitoring activities to personnel based on their responsibilities
Providing dashboards associated with specific users and groups

STATISTICA MultiStream. StatSoft's STATISTICA MultiStream is a solution package for identifying and implementing effective strategies for advanced multivariate process monitoring and control. STATISTICA MultiStream was designed for process industries in general, but it is particularly well suited to help power generation facilities leverage the data collected in their existing specialized process databases for multivariate and predictive process control and for actionable advisory systems.

STATISTICA MultiStream is a complete enterprise system built on a robust, advanced client-server (and fully Web-enabled) architecture. It offers central administration and management of model deployment, as well as cutting-edge root-cause analysis and predictive data mining technology, and its analytics are seamlessly integrated with a built-in document management system.

Automated (nonlinear) root cause analysis and feature selection for thousands of parameters, to clearly identify which ones are the most likely responsible for process problems
Automated and interactive commonality analysis to identify parameters and processes that shifted or moved from normal operations during particular time intervals
Advanced linear and nonlinear (e.g., SVM, Recursive Partitioning, Neural Nets) models for creating sensitive multivariate control schemes and work flows to identify multivariate shifts and drifts early, before they cause problems
Advanced data mining algorithms for predicting and optimizing key performance and quality indicators
Tracks hundreds of data streams simultaneously
Delivers simple summaries relevant to critical process parameters and outcomes via efficient and simple dashboards and drill-down workflows
Delivers standard and customized analytic workflows for root cause analysis, leveraging cutting-edge data analysis and data mining technologies
Warns of (predicted) problems and equipment failures before they occur (predictive alarming), thus avoiding costly shut-downs and unscheduled maintenance
Watches "everything" that impacts your process performance in real time

STATISTICA Multivariate Exploratory Techniques. StatSoft's STATISTICA Multivariate Exploratory Techniques offers a broad selection of exploratory techniques for various types of data, with extensive, interactive visualization tools.

Cluster Analysis
Factor Analysis
Principal Components/Classification Analysis
Canonical Analysis
Discriminant Analysis
General Discriminant Analysis Models
Reliability/Item Analysis
Classification Trees
Correspondence Analysis
Multidimensional Scaling, and more.

STATISTICA Multivariate Statistical Process Control (MSPC). StatSoft's STATISTICA Multivariate Statistical Process Control (MSPC) is a complete solution for multivariate statistical process control, deployed within a scalable, secure analytics software platform.

Univariate and multivariate statistical methods for quality control, predictive modeling, and data reduction
Functions to determine the most critical process, raw materials, and environment factors and their optimal settings for delivering products of the highest quality
Monitoring of process characteristics interactively or automatically during production stages
Building, evaluating, and deploying predictive models based on the known outcomes from historical data
Historical analysis, data exploration, data visualization, predictive model building and evaluation, model deployment to monitoring server
Interactive monitoring with dashboard summary displays and automatic-updating results
Automated monitoring with rules, alarm events, and configurable actions
Multivariate techniques including Partial Least Squares, Principal Components, Neural Networks, Recursive Partitioning (Tree) Methods, Support Vector Machines, Independent Components Analysis, Cluster Analysis, and more

STATISTICA PI Connector. StatSoft's STATISTICA PI Connector is an optional STATISTICA add-on component that allows for direct integration with data stored in the PI data historian. The STATISTICA PI Connector utilizes the PI user access control and security model, allows for interactive browsing of tags, and takes advantage of dedicated PI functionality for interpolation and snapshot data. STATISTICA integrated with the PI system is being used for streamlined and automated analyses in applications such as Process Analytical Technology (PAT) in FDA-regulated industries, Advanced Process Control (APC) systems in the Chemical and Petrochemical industries, and advisory systems for process optimization and compliance in the Energy Utility industry.

STATISTICA Power Analysis. StatSoft's STATISTICA Power Analysis is an extremely precise and user-friendly, specialized tool for analyzing all aspects of statistical power and sample size calculation.

Sample Size Calculation
Confidence Interval Estimation
Statistical Distribution Calculators, and more.

STATISTICA PowerSolutions. StatSoft's STATISTICA PowerSolutions is a solution package designed for power generation companies to optimize power plant performance, increase efficiency, and reduce emissions. This product offers a highly economical alternative to multimillion-dollar investments in new or upgraded equipment (hardware). Based on more than 20 years of experience in applying advanced, data-driven, predictive data mining/optimization technologies for process optimization in various industries, STATISTICA PowerSolutions enables power plants to get the most out of their existing equipment and control systems by leveraging all data collected at their sites to identify opportunities for improvement, even for older designs such as coal-fired Cyclone furnaces (as well as wall-fired or T-fired designs).

STATISTICA Process Analysis. StatSoft's STATISTICA Process Analysis is a comprehensive package for Process Capability, Gage R&R, and other quality control/improvement applications (a comprehensive tool for Six Sigma methods).

Process/Capability Analysis Charts
Ishikawa (Cause and Effect) Diagrams
Gage Repeatability & Reproducibility
Variance Components for Random Effects
Weibull Analysis
Sampling plans, and more.

STATISTICA QC Miner. StatSoft's STATISTICA QC Miner is a powerful software solution designed to monitor processes and identify and anticipate problems related to quality control and improvement with unmatched sensitivity and effectiveness. STATISTICA QC Miner integrates all Quality Control Charts, Process Capability Analyses, Experimental Design procedures, and Six Sigma methods with a comprehensive library of cutting-edge techniques for exploratory and predictive data mining.

Predict QC problems with cutting edge data mining methods
Discover root causes of problem areas
Monitor and improve ROI (Return On Investment)
Generate suggestions for improvement
Monitor processes in real time over the Web
Create and deploy QC/SPC solutions over the Web
Use multithreading and distributed processing to rapidly process extremely large streams of data

STATISTICA Quality Control Charts. StatSoft's STATISTICA Quality Control Charts offers fully customizable (e.g., callable from other environments), versatile, quick and easy-to-use charts with a selection of automation options and user-interface shortcuts to simplify routine work (a comprehensive tool for Six Sigma methods).

Multiple Chart (Six Sigma Style) Reports and displays
X-bar and R Charts; X-bar and S Charts; Np, P, U, C Charts
Pareto Charts
Process Capability and Performance Indices
Moving Average/Range Charts, EWMA Charts
Short Run Charts (including Nominal and Target)
CuSum (Cumulative Sum) Charts
Runs Tests
Interactive causes and actions, customizable alarms, analytic brushing, and more.

STATISTICA Sequence, Association and Link Analysis (SAL). StatSoft's STATISTICA Sequence, Association and Link Analysis (SAL) is designed to address the needs of clients in the retailing, banking, insurance, and other industries by implementing the fastest known, highly scalable algorithm with the ability to derive Association and Sequence rules in a single analysis. The program is a stand-alone module that can be used for both model building and deployment. All tools in STATISTICA Data Miner can be quickly and effortlessly leveraged to analyze and "drill into" results generated via STATISTICA SAL.

Uses a tree-building technique to extract Association and Sequence rules from data
Uses efficient and thread-safe local relational database technology to store Association and Sequence models
Handles multiple response, multiple dichotomy, and continuous variables in one analysis
Performs Sequence Analysis while mining for Association rules in a single analysis
Simultaneously extracts Association and Sequence rules for more than one dimension
Given the ability to perform multidimensional Association and Sequence mining and the capacity to extract only rules for specific items, the program can be used for Predictive Data Mining
Performs Hierarchical Single-Linkage Cluster Analysis, which can detect the clusters of items that are most likely to occur together; this has highly practical, real-world applications, e.g., in retailing.

STATISTICA Text Miner. StatSoft's STATISTICA Text Miner is a powerful software solution for text mining, document retrieval, and mining of unstructured data. It is an optional add-on product for STATISTICA Data Miner, designed and optimized for accessing and analyzing documents (unstructured information) in a variety of formats: .txt (text), .pdf (Adobe), .ps (PostScript), .html, .xml (Web formats), and most Microsoft Office formats (e.g., .doc, .rtf); optimized access to Web pages (URL addresses) is also provided.

Efficiently index very large collections of text documents; identify key terms and similarities between documents and terms, and extract the information relevant to your mission and goals
Apply stop-lists (words to ignore) and language-specific stemming algorithms (various languages are supported)
Includes numerous options for converting documents into numeric information for further processing (e.g., mapping, clustering, predictive data mining, classification of documents, etc.)
Full support for multithreaded operation on multi-processor server installations for extremely fast indexing and searching of huge document repositories
Can also be used to index, analyze, and mine other unstructured input, such as sound or image files (after domain-specific pre-processing is applied)
Fully integrated into the STATISTICA and WebSTATISTICA systems; hence, the large number of available methods for supervised and unsupervised learning (clustering), mapping, data visualization, etc., are directly and immediately available; many of the algorithms available in STATISTICA Data Miner, such as the machine learning algorithms (k-Nearest Neighbor, Naive Bayes classifiers, advanced Support Vector Machines and Kernel classifiers), are particularly well suited for text mining or the analysis of other unstructured information

STATISTICA Variance Estimation and Precision (VEPAC). StatSoft's STATISTICA Variance Estimation and Precision (VEPAC) offers a comprehensive set of techniques for analyzing data from experiments that include both fixed and random effects using REML (Restricted Maximum Likelihood Estimation). With STATISTICA VEPAC, you can obtain estimates of variance components and use them to make precision statements while at the same time comparing fixed effects in the presence of multiple sources of variation.

Variability plots
Multiple plot layouts to allow direct comparison of multiple dependent variables
Expected mean squares and variance components with confidence intervals
Flexible handling of multiple dependent variables: analyze several variables with the same or different designs at once
Graph displays of variance components

WebSTATISTICA Server. StatSoft's WebSTATISTICA Server is the ultimate enterprise system that offers full Web enablement, including the ability to run STATISTICA interactively or in batch from a Web browser on any computer (including Linux, UNIX), offload time consuming tasks to the servers (using distributed processing), use multi-tier Client-Server architecture, manage projects over the Web, and collaborate "across the hall or across continents" (supporting multithreading and distributed/parallel processing that scales to multiple server computers).

Steepest Descent Iterations. When initial values for the parameters are far from the ultimate minimum, the approximate Hessian used in the Gauss-Newton procedure may fail to yield a proper step direction during iteration. In this case, the program may iterate into a region of the parameter space from which recovery (i.e., successful iteration to the true minimum point) is not possible. One option offered by Structural Equation Modeling is to precede the Gauss-Newton procedure with a few iterations utilizing the "method of steepest descent." In the steepest descent approach, values of the parameter vector q on each iteration are obtained as

q_(k+1) = q_k + α_k*g_k

where g_k is the gradient at iteration k and α_k is the step length.

In simple terms, what this means is that the Hessian is not used to help find the direction for the next step. Instead, only the first derivative information in the gradient is used.
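As an illustration only (a minimal sketch in plain NumPy, not the Structural Equation Modeling implementation), the following code performs a few steepest descent iterations on a simple quadratic function; the fixed step length alpha is an assumption chosen for the example.

import numpy as np

def steepest_descent(grad, q0, alpha=0.1, n_iter=50):
    # A few steepest descent iterations: q_(k+1) = q_k + alpha*g_k,
    # where g_k is the descent direction (negative gradient).
    q = np.asarray(q0, dtype=float)
    for _ in range(n_iter):
        g = -grad(q)        # first-derivative information only; no Hessian
        q = q + alpha * g   # fixed step length (a line search could be used instead)
    return q

# Toy example: minimize f(q) = (q1 - 3)^2 + 2*(q2 + 1)^2
grad_f = lambda q: np.array([2.0 * (q[0] - 3.0), 4.0 * (q[1] + 1.0)])
print(steepest_descent(grad_f, [0.0, 0.0]))   # approaches [3, -1]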

Hint for beginners. Inserting a few Steepest Descent Iterations may help in situations where the iterative routine "gets lost" after only a few iterations.

Stemming. An important pre-processing step before indexing input documents for text mining is the stemming of words. The term stemming refers to the reduction of words to their roots so that, for example, different grammatical forms or declinations of verbs are identified and indexed (counted) as the same word. For example, stemming will ensure that both "travel" and "traveled" will be recognized by the program as the same word.
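The short sketch below illustrates the idea with NLTK's Porter stemmer (one of many stemming algorithms; not necessarily the stemmer used by any particular text mining product):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["travel", "traveled", "traveling", "travels"]:
    print(word, "->", stemmer.stem(word))   # all four forms reduce to the root "travel"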

For more information, see Manning and Schütze (2002).

Steps. Repetitions of a particular analytic or computational operation or procedure. For example, in neural network time series analysis, the number of consecutive time steps from which input variable values are drawn and fed into the neural network's input units.

Stepwise Regression. A model-building technique which finds subsets of predictor variables that most adequately predict responses on a dependent variable by linear (or nonlinear) regression, given the specified criteria for adequacy of model fit.
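As a rough illustration (not STATISTICA's implementation), the sketch below performs forward stepwise selection on simulated data, using AIC as the criterion for adequacy of model fit; F-to-enter/F-to-remove criteria could be substituted in the same loop.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                            # five candidate predictors
y = 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(size=200) # only columns 0 and 2 matter

remaining, selected = list(range(X.shape[1])), []
best_aic = sm.OLS(y, np.ones((len(y), 1))).fit().aic     # intercept-only model
while remaining:
    # Fit one candidate model per remaining predictor and record its AIC.
    aics = {j: sm.OLS(y, sm.add_constant(X[:, selected + [j]])).fit().aic
            for j in remaining}
    j_best = min(aics, key=aics.get)
    if aics[j_best] >= best_aic:                         # no candidate improves the criterion
        break
    best_aic = aics[j_best]
    selected.append(j_best)
    remaining.remove(j_best)

print("selected predictors:", selected)                  # should include columns 0 and 2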

For an overview of stepwise regression and model fit criteria see the General Stepwise Regression chapter, or the Multiple Regression chapter; for nonlinear stepwise and best subset regression, see the Generalized Linear Models chapter.

Stiffness Parameter (in Fitting Options). The weighting function used in the fitting procedure is controlled by the Stiffness parameter, which can be modified. Thus, the stiffness parameter determines the degree to which the fitted curve depends on local configurations of the analyzed values.

The lower the coefficient, the more the shape of the curve is influenced by individual data points (i.e., the curve "bends" more to accommodate individual values and subsets of values).

The range of the stiffness parameter is 0 < s < 1. Large values of the parameter produce smoother curves that adequately represent the overall pattern in the data set at the expense of local details.

See also, McLain, 1974.

Stopping Conditions. During an iterative process (e.g., fitting, searching, training), the conditions which must be true for the process to stop. For example, in neural networks, the stopping conditions include the maximum number of epochs, the target error performance, and the minimum error improvement thresholds.

Stopping Conditions (in Neural Networks). The iterative gradient-descent training algorithms (back propagation, Quasi-Newton, conjugate gradient descent, Levenberg-Marquardt, quick propagation, Delta-bar-Delta, and Kohonen) all attempt to reduce the training error on each epoch.

You specify a maximum number of epochs for these iterative algorithms. However, you can also define stopping conditions that may cause training to terminate earlier.

Specifically, training may be stopped when the target error level is reached, when the error fails to improve by at least a specified minimum amount over a given window of epochs, or when the maximum number of epochs is reached; these conditions are described below.

The conditions are cumulative; i.e., if several stopping conditions are specified, training ceases when any one of them is satisfied. In particular, a maximum number of epochs must always be specified.

The error-based stopping conditions can also be specified independently for the error on the training set and the error on the selection set (if any).

Target Error. You can specify a target error level, for the training subset, the selection subset, or both. If the RMS error falls below this level, training ceases.

Minimum Improvement. Specifies that the RMS error on the training subset, the selection subset, or both must improve by at least this amount, or training will cease (if the Window parameter is non-zero).

Sometimes error improvement may slow down for a while or even rise temporarily (particularly if the shuffle option is used with back propagation, or non-zero noise is specified, as these both introduce an element of noise into the training process).

To prevent this option from aborting the run prematurely, specify a longer Window.

It is particularly recommended to monitor the selection error for minimum improvement, as this helps to prevent over-learning.

Specify a negative improvement threshold if you want to stop training only when a significant deterioration in the error is detected. The algorithm will stop when a number of epochs pass during which the error is always the given amount worse than the best it ever achieved.

Window. The window factor is the number of epochs across which the error must fail to improve by the specified amount, before the algorithm is deemed to have slowed down too much and is stopped.

By default the window is zero, which means that the minimum improvement stopping condition is not used at all.
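A minimal sketch of these stopping conditions in a generic training loop is given below; the step and evaluate callables are hypothetical placeholders (not a STATISTICA API), standing in for one training epoch and for the RMS error evaluation.

def train(step, evaluate, max_epochs=1000, target_error=0.01,
          min_improvement=1e-4, window=20):
    best_error = float("inf")
    epochs_without_improvement = 0
    for epoch in range(1, max_epochs + 1):        # maximum number of epochs: always in force
        step()
        error = evaluate()                        # could equally be the selection-set error
        if error <= target_error:                 # target error condition
            return epoch, "target error reached"
        if best_error - error >= min_improvement:
            best_error = error
            epochs_without_improvement = 0
        elif window > 0:                          # a window of 0 disables this condition
            epochs_without_improvement += 1
            if epochs_without_improvement >= window:
                return epoch, "minimum improvement not met within window"
    return max_epochs, "maximum number of epochs reached"

# Toy usage: an "error" that decays towards 0.05 and therefore eventually stalls.
state = {"error": 1.0}
def step():     state["error"] = 0.05 + 0.9 * (state["error"] - 0.05)
def evaluate(): return state["error"]
print(train(step, evaluate, window=10))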

Stopping Rule (in Classification Trees). The stopping rule for a classification tree refers to the criteria that are used for determining the "right-sized" classification tree, that is, a classification tree with an appropriate number of splits and optimal predictive accuracy. The process of determining the "right-sized" classification tree is described in the Computational Methods section of the Classification Trees chapter.

Stratified Random Sampling. In general, random sampling is the process of randomly selecting observations from a population, to create a subsample that "represents" the observations in that population (see Kish, 1965; see also Probability Sampling, Simple Random Sampling, EPSEM Samples; see also Representative Sample for a brief exploration of this often misunderstood notion). In stratified sampling one usually applies specific (identical or different) sampling fractions to different groups (strata) in the population to draw the sample.

Over-sampling particular strata to over-represent rare events. In some predictive data mining applications it is often necessary to apply stratified sampling to systematically over-sample (apply a greater sampling fraction) to particular "rare events" of interest. For example, in catalog retailing the response rate to particular catalog offers can be below 1%, and when analyzing historical data (from prior campaigns) to build a model for targeting potential customers more successfully, it is desirable to over-sample past respondents (i.e., the "rare" respondents who ordered from the catalog); one can then apply the various model building techniques for classification (see Data Mining) to a sample consisting of approximately 50% responders and 50% non-responders. Otherwise, if one were to draw a simple random sample for the analysis (with 1% of responders), then practically all model building techniques would likely predict a simple "no-response" for all cases, and would be (trivially) correct in 99% of the cases.
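A minimal sketch of such over-sampling is shown below, assuming a hypothetical pandas DataFrame with a rare binary "responded" column (about 1% responders): all responders are kept and an equally sized random sample of non-responders is drawn, yielding a roughly 50/50 modeling sample.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"responded": rng.random(100_000) < 0.01,   # ~1% "rare" responders
                   "spend": rng.gamma(2.0, 50.0, 100_000)})

responders = df[df["responded"]]
non_responders = df[~df["responded"]].sample(n=len(responders), random_state=0)
balanced = pd.concat([responders, non_responders]).sample(frac=1.0, random_state=0)

print(balanced["responded"].mean())   # 0.5 by construction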

Stub and Banner Tables (Banner Tables). Stub-and-banner tables are essentially two-way tables, except that two lists of categorical variables (instead of just two individual variables) are crosstabulated. In the Stub-and-banner table, one list will be tabulated in the columns (horizontally) and the second list will be tabulated in the rows (vertically) of the Scrollsheet.

For more information, see the Stub and Banner Tables section of the Basic Statistics chapter.

Student's t Distribution. The Student's t distribution has the density function (for ν = 1, 2, ...):

f(x) = Γ[(ν+1)/2] / [Γ(ν/2) * sqrt(ν*π)] * (1 + x^2/ν)^(-(ν+1)/2)

where
     ν is the degrees of freedom
     Γ (gamma) is the Gamma function
     π is the constant Pi (3.14...)


The animation above shows various tail areas (p-values) for a Student's t distribution with 15 degrees of freedom.
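As an independent check of the density and tail areas above, SciPy's implementation of the t distribution can be used (shown here only as an illustration):

from scipy.stats import t

df = 15
print(t.pdf(0.0, df))        # density at x = 0 for 15 degrees of freedom
print(2 * t.sf(2.13, df))    # two-tailed p-value for |t| = 2.13, approximately 0.05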

Studentized Deleted Residuals. In addition to standardized residuals several methods (including studentized residuals, studentized deleted residuals, DFFITS, and standardized DFFITS) are available for detecting outlying values (observations with extreme values on the set of predictor variables or the dependent variable). The formula for studentized deleted residuals is given by

SDRESID_i = DRESID_i / s_(i)

for

DRESID_i = e_i / (1 - h*_i)

and where

s_(i) = 1/(C-p-1)^(1/2) * ((C-p)*s^2/(1 - h*_i) - DRESID_i^2)^(1/2)

e_i    is the error for the ith case
h_i    is the leverage for the ith case
p      is the number of coefficients in the model
C      is the number of cases
s      is the standard error of estimate

and

h*_i = 1/N + h_i    (N is the number of cases)

For more information see Hocking (1996) and Ryan (1997).

Studentized Residuals. In addition to standardized residuals several methods (including studentized residuals, studentized deleted residuals, DFFITS, and standardized DFFITS) are available for detecting outlying values (observations with extreme values on the set of predictor variables or the dependent variable). The formula for studentized residuals is

SRES_i = (e_i / s) / (1 - h*_i)^(1/2)

where
e_i    is the error for the ith case
h_i    is the leverage for the ith case
s      is the standard error of estimate

and h*_i = 1/N + h_i (N is the number of cases)

For more information see Hocking (1996) and Ryan (1997).
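In practice these quantities are rarely computed by hand. The sketch below (simulated data; statsmodels' standard influence diagnostics, which follow the usual textbook formulas rather than the notation above) obtains both studentized and studentized deleted residuals along with the leverages h_i:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
X = sm.add_constant(rng.normal(size=(50, 2)))        # intercept plus two predictors
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=50)

infl = sm.OLS(y, X).fit().get_influence()
print(infl.resid_studentized_internal[:5])   # studentized residuals
print(infl.resid_studentized_external[:5])   # studentized deleted residuals
print(infl.hat_matrix_diag[:5])              # leverages h_i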

Sweeping. The sweeping transformation of matrices is commonly used to efficiently perform stepwise multiple regression (see Dempster, 1969, Jennrich, 1977) or similar analyses; a modified version of this transformation is also used to compute the g2 generalized inverse. The forward sweeping transformation for a column k can be summarized in the following four steps (where the e's refer to the elements of a symmetric matrix):

  1. e_ij = e_ij - e_ik * e_kj / e_kk    for i ≠ k, j ≠ k

  2. e_kj = e_kj / e_kk

  3. e_ik = e_ik / e_kk

  4. e_kk = -1 / e_kk

The reverse sweeping operation reverses the changes effected by these transformations. The sweeping operator is used extensively in General Linear Models, Multiple Regression, and similar techniques.
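A minimal sketch of the forward sweep on column k, following the four steps listed above (an illustration only, not StatSoft's implementation), is shown below; sweeping every column of a symmetric positive definite matrix yields the negative of its inverse, which provides a quick check.

import numpy as np

def sweep(a, k):
    # Forward sweep of the symmetric matrix a on column k.
    a = np.array(a, dtype=float)
    d = a[k, k]
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            if i != k and j != k:
                a[i, j] -= a[i, k] * a[k, j] / d   # step 1
    a[k, :] /= d                                   # step 2 (row k)
    a[:, k] /= d                                   # step 3 (column k)
    a[k, k] = -1.0 / d                             # step 4
    return a

a = np.array([[4.0, 2.0], [2.0, 3.0]])
s = sweep(sweep(a, 0), 1)
print(np.allclose(s, -np.linalg.inv(a)))           # True: sweeping every column gives -A^(-1)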

Sum-squared error function. An error function computed by squaring the differences between the target and actual values and adding these squared differences together (see also Loss Function).

Supervised and Unsupervised Learning. An important distinction in machine learning, and also applicable to data mining, is that between supervised and unsupervised learning algorithms. The term "supervised" learning is usually applied to cases in which a particular classification is already observed and recorded in a training sample, and you want to build a model to predict those classifications (in a new testing sample). For example, you may have a data set that contains information about who from among a list of customers targeted for a special promotion responded to that offer. The purpose of the classification analysis would be to build a model to predict who (from a different list of new potential customers) is likely to respond to the same (or a similar) offer in the future. You may want to review the methods discussed in General Classification and Regression Trees (GC&RT), General CHAID Models (GCHAID), Discriminant Function Analysis and General Discriminant Analysis (GDA), MARSplines (Multivariate Adaptive Regression Splines), and neural networks to learn about different techniques that can be used to build or fit models to data where the outcome variable of interest (e.g., customer did or did not respond to an offer) was observed. These methods are called supervised learning algorithms because the learning (fitting of models) is "guided" or "supervised" by the observed classifications recorded in the data file.

In unsupervised learning, the situation is different. Here the outcome variable of interest is not (and perhaps cannot be) directly observed. Instead, we want to detect some "structure" or clusters in the data that may not be trivially observable. For example, you may have a database of customers with various demographic indicators and variables potentially relevant to future purchasing behavior. Your goal would be to find market segments, i.e., groups of observations that are relatively similar to each other on certain variables; once identified, you could then determine how best to reach one or more clusters by providing certain goods or services you think may have some special utility or appeal to individuals in that segment (cluster). This type of task calls for an unsupervised learning algorithm, because learning (fitting of models) in this case cannot be guided by previously known classifications. Only after identifying certain clusters can you begin to assign labels, for example, based on subsequent research (e.g., after identifying one group of customers as "young risk takers").

There are several methods available for unsupervised learning, including Principal Components and Classification Analysis, Factor Analysis, Multidimensional Scaling, Correspondence Analysis, Neural Networks, Self-Organizing Feature Maps (SOFM, Kohonen networks); particularly powerful algorithms for pattern recognition and clustering are the EM and k-Means clustering algorithms.
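The contrast can be illustrated in a few lines. The sketch below uses scikit-learn (one toolkit among many; the STATISTICA modules named above offer analogous methods) to fit a supervised tree classifier to labeled data and an unsupervised k-means clustering to unlabeled data:

from sklearn.datasets import make_blobs, make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

# Supervised: the outcome (e.g., responded yes/no) is observed and guides the fit.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
model = DecisionTreeClassifier(random_state=0).fit(X, y)
print("training accuracy:", model.score(X, y))   # 1.0 for an unpruned tree; a test
                                                 # sample is needed to assess generalization

# Unsupervised: no outcome is given; the algorithm looks for structure (clusters).
X_u, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_u)
print("cluster sizes:", [int((labels == k).sum()) for k in range(3)])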

Support Value (Association Rules). When applying (in data or text mining) algorithms for deriving association rules of the general form If Body then Head (e.g., If (Car=Porsche and Age<20) then (Risk=High and Insurance=High)), the Support value is computed as the joint probability (relative frequency of co-occurrence) of the Body and Head of each association rule.
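For example, with a handful of hypothetical transactions, the support of the rule If {Car=Porsche, Age<20} then {Risk=High} is simply the fraction of transactions that contain all items in both the Body and the Head:

transactions = [
    {"car=Porsche", "age<20", "risk=High"},
    {"car=Porsche", "age>=20", "risk=Low"},
    {"car=Volvo", "age<20", "risk=Low"},
    {"car=Porsche", "age<20", "risk=High"},
]
body, head = {"car=Porsche", "age<20"}, {"risk=High"}
# Support = relative frequency of transactions containing Body and Head together.
support = sum((body | head) <= t for t in transactions) / len(transactions)
print(support)   # 2 of 4 transactions contain both -> 0.5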

Support Vector. A set of points in the feature space that determines the boundary between objects of different class memberships.

Support Vector Machine (SVM). A classification method based on the maximum margin hyperplane.

Suppressor Variable. A suppressor variable (in Multiple Regression ) has zero (or close to zero) correlation with the criterion but is correlated with one or more of the predictor variables, and therefore, it will suppress irrelevant variance of independent variables. For example, you are trying to predict the times of runners in a 40 meter dash. Your predictors are Height and Weight of the runner. Now, assume that Height is not correlated with Time, but Weight is. Also assume that Weight and Height are correlated. If Height is a suppressor variable, then it will suppress, or control for, irrelevant variance (i.e., variance that is shared with the predictor and not the criterion), thus increasing the partial correlation. This can be viewed as ridding the analysis of noise.

Let t = Time, h = Height, w = Weight, rth = 0.0, rtw = 0.5, and rhw = 0.6.

Weight in this instance accounts for 25% (Rtw**2 = 0.5**2) of the variability of Time. However, if Height is included in the model, then an additional 14% of the variability of Time is accounted for even though Height is not correlated with Time (see below):

Rt.hw**2 = 0.5**2/(1 - 0.6**2) = 0.39
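The arithmetic can be verified directly with the standard formula for the squared multiple correlation with two predictors (here r_th = 0):

r_th, r_tw, r_hw = 0.0, 0.5, 0.6
# Squared multiple correlation of Time with Height and Weight together.
R2 = (r_tw**2 + r_th**2 - 2 * r_tw * r_th * r_hw) / (1 - r_hw**2)
print(round(R2, 2))             # 0.39, versus 0.25 for Weight alone
print(round(R2 - r_tw**2, 2))   # the additional 0.14 (14%) explained by adding Height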

For more information, please refer to Pedhazur, 1982.

Surface Plot (from Raw Data). This sequential plot fits a spline-smoothed surface to the data points. Successive values of each series are plotted along the X-axis, with each successive series represented along the Y-axis.

Survival Analysis. Survival analysis (exploratory and hypothesis testing) techniques include descriptive methods for estimating the distribution of survival times from a sample, methods for comparing survival in two or more groups, and techniques for fitting linear or non-linear regression models to survival data. A defining characteristic of survival time data is that they usually include so-called censored observations, i.e., observations that "survived" to a certain point in time and then dropped out of the study (e.g., patients who are discharged from a hospital). Instead of discarding such observations from the data analysis altogether (and thus unnecessarily losing potentially useful information), survival analysis techniques can accommodate censored observations and "use" them in statistical significance testing and model fitting.

Typical survival analysis methods include life table, survival distribution, and Kaplan-Meier survival function estimation, and additional techniques for comparing the survival in two or more groups. Finally, Survival analysis includes the use of regression models for estimating the relationship of (multiple) continuous variables to survival times.
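As a small illustration of how censored observations are "used" rather than discarded, the sketch below computes a Kaplan-Meier (product-limit) estimate of the survival function from a handful of hypothetical survival times (event = 0 marks a censored case):

import numpy as np

time  = np.array([5, 6, 6, 2, 4, 4, 9, 3])   # observed failure or censoring times
event = np.array([1, 0, 1, 1, 1, 0, 1, 1])   # 1 = failure observed, 0 = censored

surv = 1.0
for t in np.unique(time[event == 1]):        # step only at observed event times
    at_risk = np.sum(time >= t)              # cases still "at risk" just before t
    deaths = np.sum((time == t) & (event == 1))
    surv *= 1.0 - deaths / at_risk           # Kaplan-Meier product-limit step
    print(f"t = {t}: S(t) = {surv:.3f}")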

For more information, see the Survival Analysis chapter.

Survivorship Function. The survivorship function (commonly denoted as R(t)) is the complement to the cumulative distribution function (i.e., R(t)=1-F(t)); the survivorship function is also referred to as the reliability or survival function (since it describes the probability of not failing or of surviving until a certain time t; e.g., see Lee, 1992).

For additional information see also the Survival Analysis chapter, or the Weibull and Reliability/Failure Time Analysis section in the Process Analysis chapter.

Symmetric Matrix. A matrix is symmetric if it is equal to its transpose (i.e., A = A'). In other words, the lower triangle of the square matrix is a "mirror image" of the upper triangle (in the example below, with 1's on the diagonal).

|1 2 3 4|
|2 1 5 6|
|3 5 1 7|
|4 6 7 1|

Symmetrical Distribution. If you split the distribution in half at its mean (or median), then the distribution of values would be a "mirror image" about this central point.

See also, Descriptive Statistics Overview.

Synaptic Functions (in Neural Networks).

Dot product. Dot product units perform a weighted sum of their inputs, minus the threshold value. In vector terminology, this is the dot product of the weight vector with the input vector, plus a bias value. Dot product units have equal output values along hyperplanes in pattern space. They attempt to perform classification by dividing pattern space into sections using intersecting hyperplanes.

Radial. Radial units calculate the square of the distance between two points in N-dimensional space (where N is the number of inputs): the point represented by the input pattern vector and the point represented by the unit's weight vector. Radial units have equal output values lying on hyperspheres in pattern space. They attempt to perform classification by measuring the distance of normalized cases from exemplar points in pattern space (the exemplars being stored by the units). The squared distance is multiplied by the threshold (which is, therefore, actually a deviation value in radial units) to produce the post-synaptic value of the unit (which is then passed to the unit's activation function).

Dot product units are used in multilayer perceptron and linear networks, and in the final layers of radial basis function, PNN, and GRNN networks.

Radial units are used in the second layer of Kohonen, radial basis function, Clustering, and probabilistic and generalized regression networks.  They are not used in any other layers of any standard network architecture.

Division. This is specially designed for use in generalized regression networks, and should not be employed elsewhere. It expects one incoming weight to equal +1, one to equal -1, and the others to equal zero. The post-synaptic value is the +1 input divided by the -1 input.
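The dot-product and radial post-synaptic functions can be summarized in a few lines of plain NumPy (a schematic sketch, not the STATISTICA Neural Networks implementation):

import numpy as np

def dot_product_unit(x, w, threshold):
    # Weighted sum of the inputs minus the threshold (i.e., dot product plus a bias).
    return np.dot(w, x) - threshold

def radial_unit(x, w, deviation):
    # Squared distance between the input pattern and the unit's weight vector,
    # multiplied by the threshold, which acts as a deviation value here.
    return deviation * np.sum((x - w) ** 2)

x = np.array([0.2, 0.7, 0.1])
w = np.array([0.5, 0.5, 0.5])
print(dot_product_unit(x, w, threshold=0.3))   # passed on to the unit's activation function
print(radial_unit(x, w, deviation=1.0))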






© Copyright StatSoft, Inc., 1984-2008
STATISTICA is a trademark of StatSoft, Inc.