\n",
"\n",
"Counts and Percentages | \n",
" | \n",
"Aggregated counts and fractional occurrences of an event or attribute of interest present within the raw data set. | \n",
"Counts and percentages of certain quantities within a data set are often used to produce base rates, establish incidence of an event, understand differences in group stratifications and distributions, and begin to form a characterization of the data. | \n",
"
\n",
"\n",
"Mean | \n",
" | \n",
"Average value for the collected sample. | \n",
"The sample mean varies from study to study. Ideally, the sample mean and the true population mean should be close. The mean is the most sensitive to outliers, which can translate to overestimation or underestimation of the most common values. | \n",
"
\n",
"\n",
"Median | \n",
" | \n",
"Middle value in the collected sample found when ordering all data points. | \n",
"The median is less likely to be influenced by outliers and is more resilient to skewed distributions. It is the preferred measure to the mean in this case. | \n",
"
\n",
"\n",
"Mode | \n",
" | \n",
"The value that occurs the most frequently in a data set (i.e. it occurs the highest number of times). | \n",
"The mode is least likely to be influenced by outliers. It is possible to have no mode, a single mode, or more than one mode as in a bimodal data distribution, for example. The mode is particularly useful when using nominal data. | \n",
"
\n",
"\n",
"Range | \n",
" | \n",
"A measure of the total distance between the smallest and largest data points within a sample. | \n",
"Range is a measure of the total spread in a data set. This metric has the advantage of direct interpretation because it is measured in the same units as the values in a data set. Range is perhaps most useful when comparing data collected at different points in time or from different samples, however, it is a measure not often used. | \n",
"
\n",
"\n",
"Variance | \n",
" | \n",
"A measure of the average squared distance between all data points within a sample. | \n",
"Ideally, the population and sample variance should be close. The variance is a building block for other important statistics such as the standard deviation, standard errors, and confidence intervals. | \n",
"
\n",
"\n",
"Standard Deviation | \n",
" | \n",
"A measure of how dispersed the data points are in reference to the mean of the sample. | \n",
"A low standard deviation means that the data points within the sample are tightly clustered around the mean. The standard deviation is equal to the square root of the variance. | \n",
"
\n",
"\n",
"Percentile Rank | \n",
" | \n",
"Given a rank-ordered data set, percentiles are calculated by segmenting values in the ordered data set into 100 equal parts. In general, a value having a percentile rank of n indicates that it is greater than n% of the other values within that data set. | \n",
"In this case, a value at the 50th percentile rank corresponds to the median value of the data set. | \n",
"
\n",
"\n",
"Quartile Rank | \n",
" | \n",
"Given a rank-ordered data set, quartiles are calculated by segmenting values in the ordered data set into 4 equal parts. | \n",
"There is a relationship between quartile ranks and percentile ranks. That is, the first quartile corresponds to the 25th percentile, the second quartile corresponds to the 50th percentile (and also the median value), and the third quartile corresponds to the 75th percentile. | \n",
"
\n",
"\n",
"Standard Scores (z-scores) | \n",
" | \n",
"The formula for calculating a z-score is z = (x-μ)/σ, where x is the data point, μ is the population mean, and σ is the standard deviation. | \n",
"Standard scores (also known as z-scores) indicate the number of standard deviations that a single sample data point is from the population mean of the data set. When comparing multiple samples we instead want to describe the sample deviations of those sample means by using the standard error σ/√n vs the standard deviation σ in the formula's denominator. In this case, the z-score describes the number of standard errors that are between the sample mean (x) and the population mean (μ). | \n",
"
\n",
"\n",
"Correlation | \n",
" | \n",
"A measure of the strength of the statistical relationship between two random variables. | \n",
"Common types are Pearson and Spearman Rank coefficients. Since the data set is a random sample or a representative sample of the underlying population, proper inference on the association being measured between two variables must be carefully ascertained. In order to test if the measured correlation is statistically significant, t-tests are typically used. Finally, observed correlations are not evidence for causality. | \n",
"
\n",
"\n",
"