**Statistics can be made to prove anything -- even the truth if necessary.**

# REGRESSION ANALYSIS BASICS

The Right Way To Do It

Different regression techniques give different results for the regression equation. Simple or linear regression is the most common form used in petrophysical analysis, giving an equation of the form:

Y = A * X + B

**Multiple regression relates the dependent variable Y to a number of independent variables, for example Y = A1 * X1 + A2 * X2 + ... + B.**

**Non-linear or polynomial regression provides relationships that involve powers, roots, or other non-linear functions, such as logarithms or exponentials.**

**Excel and Lotus 1-2-3 offer some simple linear and non-linear regression models, but more sophisticated software is required for multiple regression. A good freeware package is Statcato (www.statcato.org), a Java-based program. My copy is HERE.**

**The "Y-on-X" line is the one that will result from use of spreadsheet software. Y is the dependent axis (the predicted variable) and X is the independent axis (the variable doing the predicting). The line minimizes the errors in the vertical direction (Y axis) using a least-squares solution.**

**The "X-on-Y" line reverses the roles of the two axes, minimizing the error in the horizontal direction (as the graph is drawn here).**

**The RMA line, or reduced major axis, assumes that neither axis depends on the other and lies very nearly halfway between the first two lines. It minimizes the error at right angles to the line. The ER, or error ratio, line minimizes the error in both the X and Y directions. There is not usually much difference between the RMA and ER lines. All four lines intersect at the centroid of the data.**

**SIMPLE LINEAR REGRESSION and BASIC STATISTICS**

The equations used are as follows:

**Slope of Best Fit Lines**

1: A1 = (Sum (XiYi) - Sum (Xi) * Sum (Yi) / Ns) / (Sum (Xi ^ 2) - (Sum (Xi) ^ 2) / Ns)

2: A2 = (Sum (XiYi) - Sum (Yi) * Sum (Xi) / Ns) / (Sum (Yi ^ 2) - (Sum (Yi) ^ 2) / Ns)

**Intercepts of Best Fit Lines**

3: B1 = (Sum (Yi) - A1 * Sum (Xi)) / Ns

4: B2 = (Sum (Xi) - A2 * Sum (Yi)) / Ns

**Equations of Best Fit Lines**

5: Y = A1 * X + B1 (Y is dependent axis)

6: X = A2 * Y + B2 (X is dependent axis)
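As a sketch, equations 1 through 6 can be coded directly from the running sums in a single pass through the data. The function name and sample data below are illustrative, not from the original:

```python
# Sketch of equations 1-6 using running sums only (one pass through the data).
def fit_lines(x, y):
    ns = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    syy = sum(yi * yi for yi in y)
    a1 = (sxy - sx * sy / ns) / (sxx - sx ** 2 / ns)  # eq 1: slope, Y-on-X
    a2 = (sxy - sy * sx / ns) / (syy - sy ** 2 / ns)  # eq 2: slope, X-on-Y
    b1 = (sy - a1 * sx) / ns                          # eq 3: intercept, Y-on-X
    b2 = (sx - a2 * sy) / ns                          # eq 4: intercept, X-on-Y
    return a1, b1, a2, b2

# eq 5: Y = a1 * X + b1;  eq 6: X = a2 * Y + b2
```

For perfectly collinear data such as Y = 2X + 1, both fitted lines describe the same relationship; they differ only when there is scatter.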

**The Reduced Major Axis regression line is the regression line that usually represents the most useful relationship between the X and Y axes. It assumes that both axes are equally error prone. An approximation to this line is halfway between the two independent regression lines. Solve equation 6 for Y:**

7: Y = (1/A2) * X - B2 / A2

**Average slope and intercept of equations 5 and 7:**

8: A3 = (A1 + 1/A2) / 2

9: B3 = (B1 - B2 / A2) / 2

10: Y = A3 * X + B3 (reduced major axis)
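A minimal sketch of equations 8 through 10, assuming the slopes and intercepts from equations 1 through 4 are already in hand (the function name is illustrative):

```python
# Sketch of equations 8-10: average the Y-on-X line with the inverted X-on-Y line.
def rma_line(a1, b1, a2, b2):
    a3 = (a1 + 1 / a2) / 2   # eq 8: average slope
    b3 = (b1 - b2 / a2) / 2  # eq 9: average intercept (eq 7: Y = X/A2 - B2/A2)
    return a3, b3            # eq 10: Y = a3 * X + b3
```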

**Coefficient of Determination**

11: Cd = (B1 * Sum (Yi) + A1 * Sum (XiYi) - (Sum (Yi) ^ 2) / Ns) / (Sum (Yi ^ 2) - (Sum (Yi) ^ 2) / Ns)

**The coefficient of determination is a measure of "best fit" and can be calculated as data is entered and processed (e.g., as in a hand calculator). Other measures of fit require two passes through the data - the first to find the average X and average Y values, then a second pass to find the differences between each individual X and the average X, and the differences between each individual Y and the average Y.**

**An alternate form of the above equation is:**

12: Cd = ((Sum (XiYi) - Sum (Xi) * Sum (Yi) / Ns) ^ 2) / ((Sum (Xi ^ 2) - (Sum (Xi) ^ 2) / Ns) * (Sum (Yi ^ 2) - (Sum (Yi) ^ 2) / Ns))

**Both equations give the same answer.**
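The agreement can be checked numerically. This sketch (illustrative data and names) computes Cd by both forms from the same running sums; note that both must yield the squared correlation:

```python
# Sketch: compute Cd two ways and confirm they agree.
def cd_two_ways(x, y):
    ns = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    a1 = (sxy - sx * sy / ns) / (sxx - sx ** 2 / ns)
    b1 = (sy - a1 * sx) / ns
    # eq 11: single-pass form built from the fitted line
    cd_11 = (b1 * sy + a1 * sxy - sy ** 2 / ns) / (syy - sy ** 2 / ns)
    # eq 12: squared-covariance form
    cd_12 = (sxy - sx * sy / ns) ** 2 / (
        (sxx - sx ** 2 / ns) * (syy - sy ** 2 / ns))
    return cd_11, cd_12
```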

**These data are used in the following statistical measures.**

**Arithmetic Mean**

13: Xbar = Sum (Xi) / Ns

14: Ybar = Sum (Yi) / Ns

**Variance**

15: Vx = Sum ((Xi - Xbar) ^ 2) / (Ns - 1)

16: Vy = Sum ((Yi - Ybar) ^ 2) / (Ns - 1)

**Standard Deviation**

17: Sx = Vx ^ 0.5

18: Sy = Vy ^ 0.5

**Correlation Coefficient**

19: Rxy = A1 * Sx / Sy

**T Ratio**

20: Txy = Rxy * ((Ns - 2) / (1 - (Rxy ^ 2))) ^ 0.5
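Equations 13 through 20 can be sketched as follows (function name and data are illustrative; a1 is the Y-on-X slope from equation 1):

```python
import math

# Sketch of equations 13-20: two-pass statistics plus Rxy and the T ratio.
def basic_stats(x, y, a1):
    ns = len(x)
    xbar, ybar = sum(x) / ns, sum(y) / ns              # eqs 13-14: means
    vx = sum((xi - xbar) ** 2 for xi in x) / (ns - 1)  # eq 15: variance of X
    vy = sum((yi - ybar) ** 2 for yi in y) / (ns - 1)  # eq 16: variance of Y
    sx, sy = math.sqrt(vx), math.sqrt(vy)              # eqs 17-18: std deviations
    rxy = a1 * sx / sy                                 # eq 19: correlation coeff
    txy = rxy * math.sqrt((ns - 2) / (1 - rxy ** 2))   # eq 20: T ratio
    return rxy, txy
```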

**Skew**

21: Ux = (Sum ((Xi - Xbar) ^ 3) / Ns) / ((Sum ((Xi - Xbar) ^ 2)
/ Ns) ^ 1.5)

22: Uy = (Sum ((Yi - Ybar) ^ 3) / Ns) / ((Sum ((Yi - Ybar) ^ 2)
/ Ns) ^ 1.5)

**Kurtosis**

23: Kx = (Sum ((Xi - Xbar) ^ 4) / Ns) / ((Sum ((Xi - Xbar) ^ 2)
/ Ns) ^ 2)

24: Ky = (Sum ((Yi - Ybar) ^ 4) / Ns) / ((Sum ((Yi - Ybar) ^ 2) / Ns) ^ 2)
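Equations 21 through 24 follow the same pattern for either axis, so one helper covers both. A sketch using population central moments (Ns divisor, as in the equations above; names are illustrative):

```python
# Sketch of equations 21-24: skew and kurtosis from central moments.
def skew_kurtosis(v):
    ns = len(v)
    vbar = sum(v) / ns
    m2 = sum((a - vbar) ** 2 for a in v) / ns  # second central moment
    m3 = sum((a - vbar) ** 3 for a in v) / ns  # third central moment
    m4 = sum((a - vbar) ** 4 for a in v) / ns  # fourth central moment
    return m3 / m2 ** 1.5, m4 / m2 ** 2  # skew (eqs 21-22), kurtosis (eqs 23-24)
```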

**Geometric Mean**

25: Gx = (PROD (Xi)) ^ (1 / Ns)

26: Gy = (PROD (Yi)) ^ (1 / Ns)

**Harmonic Mean**

27: Hx = Ns / (Sum (1 / Xi))

28: Hy = Ns / (Sum (1 / Yi))
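A sketch of equations 25 through 28 for one axis (the geometric mean requires positive values, the harmonic mean nonzero values; the function name is illustrative):

```python
import math

# Sketch of equations 25-28. Geometric mean needs positive values;
# harmonic mean needs nonzero values.
def geo_harm(v):
    ns = len(v)
    g = math.prod(v) ** (1 / ns)    # eqs 25-26: geometric mean
    h = ns / sum(1 / a for a in v)  # eqs 27-28: harmonic mean
    return g, h
```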

**WHERE:**

A1 = slope of best fit line, Y-on-X (Y dependent)

A2 = slope of best fit line, X-on-Y (X dependent)

A3 = slope of best fit line (reduced major axis)

B1 = intercept of best fit line, Y-on-X (Y dependent)

B2 = intercept of best fit line, X-on-Y (X dependent)

B3 = intercept of best fit line (reduced major axis)

Cd = coefficient of determination

Gx = geometric mean of X values

Gy = geometric mean of Y values

Hx = harmonic mean of X values

Hy = harmonic mean of Y values

Kx = kurtosis of X values

Ky = kurtosis of Y values

Ns = number of X - Y pairs or number of samples

Rxy = correlation coefficient

Sx = standard deviation of X values

Sy = standard deviation of Y values

Txy = T ratio

Ux = skew of X values

Uy = skew of Y values

Vx = variance of X values

Vy = variance of Y values

Xi = individual X data values

Xbar = arithmetic mean of X values

XiYi = product of individual X - Y pairs

Yi = individual Y data values

Ybar = arithmetic mean of Y values

**MULTIPLE LINEAR REGRESSION**

**The model for a multiple regression takes the form:**

30: Y = b0 + b1 * X1 + b2 * X2 + b3 * X3 + ...

**The b's are termed the "regression coefficients". Instead of fitting a line to data, we are now fitting a plane (for 2 independent variables) or a higher-dimensional surface (for 3 or more independent variables).**

**The estimation can still be done according to the principles of linear least squares. The algebraic formulae for the solution (i.e., finding all the b's) are UGLY. However, the matrix solution is elegant:**

**The matrix model is: **

31: [Y] = [X] * [B]

The solution is:

32: [B] = ([X'] * [X]) ^ -1 * [X'] * [Y]
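Here [X'] denotes the transpose of [X]. A sketch of equation 32 with NumPy, using hypothetical data for two independent variables (in practice numpy.linalg.lstsq is numerically preferable to forming the inverse explicitly):

```python
import numpy as np

# Sketch of eq 32: [B] = ([X'][X])^-1 [X'][Y].
# A leading column of ones makes b0 the constant term.
# Data are hypothetical, built from Y = 1 + 2*X1 + 3*X2.
X = np.column_stack([np.ones(5), [1, 2, 3, 4, 5], [2, 1, 4, 3, 5]])
Y = np.array([9.0, 8.0, 19.0, 18.0, 26.0])
B = np.linalg.inv(X.T @ X) @ X.T @ Y  # recovers approximately [1, 2, 3]
```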