Chapter 7. Analysis tools

Data Analysis Tools

Data analysis tools package contains various tools for statistical data analysis, random number generation and data sampling. To use these tools select the ``Data Analysis...'' entry in the ``Tools'' menu. This gives you a list of tools to choose from. Select one of the tools from the list and press the OK button. The tools are described below.

All the tools have the same output options. The results can be printed out into a new sheet, into a new workbook, or into the same sheet within a given output range. To select the output method just select one of the radio buttons bellow the ``Output options:'' label. Note that if the output is pasted into a range that is too small, some of the results will not fit in it. Also not that the old data in the output range is lost.

Anova: Single Factor Tool

Use this tool to make a single factor analysis of the variances of given variables. The variables are specified by the ``Input Range:'' entry. The given range can be groupped either by columns or by rows. ``Alpha:'' entry specifies the level of significance which is by default 95%.

If you have labels, for example the names of the variables, in the first row of the given range, you should check the ``Labels'' button on. This leaves the first row (or the first column if the data is groupped by rows) of the input range out of the analysis. In addition, the names of the variables are pasted nicely into the output table.

The tool calculates the source of variation between groups, within groups, and the total source of variation. The values of these are given in the ''SS'' column. The degrees of freedom of are given in the ''df'' column. In ''MS'' column the sources of variation are divided by the degrees of freedom. The results of these divisions are used in calculating the F value in the ''F'' column. The F value is the division of MS in ''Between columns'' and ''MS'' in ''Within columns''. You can compare this value with the ''F critical'' value in the last column. The F critical value is the largest value of F that is statistically significant using the given significance level (''Alpha''). Bellow the ''P-value'' is the result of the F-test.

In addition, the tool calculates the count, sum, average, and the variance of all the variables.

Correlation Tool

The correlation tool calculates correlation coefficients of given variables. Use this tool to calculate any number of correlation coefficients at the same time. The variables for which the correlations are calulated are specified by the ``Input Range:'' entry. The given range can be groupped either by columns or by rows.

For example, you want to calculate the correlation between two variables, one in a column A and the other in a column B. Both variables have 10 values in rows between one and ten. First you should enter A1:B10 to the ``Input Range:'' entry. Secondly, select the ``Columns'' radio button next to the ``Groupped By:'' label. Then specify the output options described above and press the OK button.

If you have labels, for example the names of the variables, in the first row of the given range, you should check the ``Labels'' button on. This leaves the first row (or the first column if the data is groupped by rows) of the input range out of the analysis. In addition, the names of the variables are printed nicely into the output table.

The results are printed into a table where each column and row is labeled according to the names of the variables. If the names are not given in the input range, Gnumeric generates them for you. The calculated correlations are given in the table. For example, the correlation of the previous example, i.e, correlation between a variable in a column A and B, can be read in the second column and third row of the results table (bellow the column labeled ``Column 1'' and in the row labeled ``Column 2'').

Covariance Tool

The covariance tool calculates covariance of given variables. Use this tool to calculate any number of covariances at the same time. The variables for which the covariances are calulated are specified by the ``Input Range:'' entry. The given range can be groupped either by columns or by rows.

For example, you want to calculate the covarance of two variables, one in a column A and the other in a column B. Both variables have 10 values in rows between one and ten. First you should enter A1:B10 to the ``Input Range:'' entry. Secondly, select the ``Columns'' radio button next to the ``Groupped By:'' label. Then specify the output options described above and press the OK button.

If you have labels, for example the names of the variables, in the first row of the given range, you should check the ``Labels'' button on. This leaves the first row (or the first column if the data is groupped by rows) of the input range out of the analysis. In addition, the names of the variables are printed nicely into the output table.

The results are printed into a table where each column and row is labeled according to the names of the variables. If the names are not given in the input range, Gnumeric generates them for you. The calculated covariances are given in the table. For example, the covariance of the previous example, i.e, covariance between a variable in a column A and B, can be read in the second column and third row of the results table (bellow the column labeled ``Column 1'' and in the row labeled ``Column 2'').

Descriptive Statistics Tool

The descriptive statistics tool calculates some statistical information of the given variables. Use this tool to calculate general statistical measures and confidence levels for mean for any number variables at the same time. The variables for which the statistical analysis is made are specified by the ``Input Range:'' entry. The given range can be groupped either by columns or by rows.

This tool can produce four different kinds of statistical data. The summary statistics is produced if the button ``Summary Statistics'' is checked on. This tool calculates the mean, standard error, median, mode, standard deviation, sample variance, kurtosis, skewness, range, minimum, maximum, sum, and count for each given variable.

If ``Confidence Level for Mean'' button is checked on, the tool calculates the confidence level for mean for each variable. You should specify the percentage of the confidence in the entry box. The default value is to calculate the confidence level of 95%.

If ``Kth Largest:'' button is checked on, the tool finds the kth largest value of each of the variables. Specify the k in the entry box next to the check button.

If ``Kth Smallest:'' button is checked on, the tool finds the kth smallest value of each of the variables. Specify the k in the entry box next to the check button.

If you have labels, for example the names of the variables, in the first row of the given range, you should check the ``Labels in First Row'' button on. This leaves the first row of the input range out of the analysis. In addition, the names of the variables are printed nicely into the output table.

F-Test:Two-Sample for Variances Tool

Use the F-Test tool to analyze wheater the variances of two variables are significantly different or not. The values of the first variable are specified in the ``Variable 1 Range:'' entry. Correspondingly, the values of the second variable are given in the ``Variable 2 Range:'' entry. ``Alpha:'' entry specifies the level of significance which is by default 95%.

If the output is printed into a range, it should contain at least three columns and eight rows.

The results are given in a table. The tool calculates and prints the mean, variance, count of observations and the degree of freedom for both variables. It also calculates the F-value, the one-tailed probability for the F-value, and the F Critical value for one-tailed test. The one-tailed probability for the F-value (``P(F<=f) one-tail'' row) describes the risk of making a Type I error of one-tailed test.

Random Number Generation Tool

Use the random number generation tool to generate random numbers. This tool can generate random numbers having different probability distributions.

To use this tool, first specify the number of variables in the ``Number of Variables:'' entry box. This determines the number of columns of random values to be produced. Secondly, specify the number of random numbers. This determines the number of rows of random values to be produced. Thirdly specify the random distribution by selecting one of the list items from the random distribution list. The following random distributions are supported: Discrete, Normal, Bernoulli, and Uniform. Then specify the parameters of the distribution.

Random Distributions

For Discrete random distribution, specify the value and probability input range in the ``Value and Probability Input Range:'' entry box. That is a table consisting of two columns and any number of rows. The first column specifies the discrete random values and the second column the probabilities for them. The discrete random values do not have to be numbers, for example, strings will do as well. The sum of the probabilities in the second column should be one. For example, if you have the values A, B, C, and D in A1:A4 and values 0.1, 0.4, 0.2, and 0.3 in B1:B4, you could specify value and probability input range to be A1:B4.

For Normal random distribution, specify the mean and the standard deviation. The default values for them are 0 for mean and 1 for standard deviation.

For Bernoulli random distribution, specify the ``p Value''. This is a probability value between 0 and 1. Bernoulli distribution has two random values 0 and 1, and the ``p Value'' specifies the probability of value 1. The mean of a random variable that has a Bernoulli distribution is E(X) = 1(p) + 0(1-p) = p, and the variance is var(X) = p(1-p).

For Uniform random distribution, specify the range of the random variables by filling the ``Between:'' and ``And:'' entries. The default values for them are 0 and 1 that gives random numbers between zero and one.

Sampling Tool

Use the sampling tool to take a sample of a data set. This tool can take both a random sample of a given size or a periodic sample where, for example, every fourth value of the data set is taken to the sample.

To use this tool, first specify the data set by setting the ``Input Range:'' entry. Then select the sampling method which can be either periodic or random. Periodic sampling requires that you specify the period by typing a number to the ``Period:'' entry. For example, if you specify the period to be three then you will get a sample that has every third value of the original data set. Random sampling requires that you specify the size of the random sample in ``Number of Samples:'' entry. The size cannot be larger than the original data set since each value of the original data set is taken at most once to the sample.

z-Test:Two Samples for Means Tool

Use the z-Test tool to analyze wheater the means of two variables are significantly different or not. The values of the first variable are specified in the ``Variable 1 Range:'' entry. Correspondingly, the values of the second variable are given in the ``Variable 2 Range:'' entry. You should also specify the variances of both variables in the ``Variable 1 Variance (known):'' and ``Variable 2 Variance (known):'' entries.

If you want to test wheater the difference of the means is larger than a given value, specify also the ``Hypothesized Mean Difference:'' entry. ``Alpha:'' entry specifies the level of significance which is by default 95%.

If the output is printed into a range, it should have at least three columns and ten rows.

The results are given in a table. The tool calculates and prints the mean, known variance, and count of observations of both variables. It also calculates the z-value, the one-tailed and two-tailed probability for the z-value, and the z Critical value for one-tailed and two-tailed tests. The one-tailed probability for the z-value (``P(Z<=z) one-tail'' row) describes the risk of making a Type I error of one-tailed test.