Least-squares fitting of data

This applet determines a linear or polynomial function that is least-squares fitted to a discrete set of data points. Functions of one variable, Y(x), can be plotted. Such functions are: linear of degree one and polynomial of any degree. For details, click here.

Linear fit
Polynomial fit
Output
Accuracy

Linear fit
Given a discrete sampling of N data points having coordinates

P_i = (x_i1, x_i2, ... x_in, y_i) i = 1, N

it is assumed that the value of y can be correlated to the values of the remaining n coordinates

X = (x₁, x₂, ... x_n)

via an approximate function Y having the form:

Y(x₁, x₂, ... x_n) = A₁x₁ + A₂x₂ + ... + A_nx_n + Constant (1)

which is a linear expansion of n independant variables (x₁, x₂, ... x_n) or, more simply stated, a linear expansion of degree n. The expansion coefficients A_i are determined by least-squares fitting the data points to this expression. The resulting continuous function may then be used to estimate the value of y over the entire n-dimensional region of space X where the approximation has been applied.

input:

The degree n of the linear expansion (1) is entered in the text field "Degree" of the user interface.

The user can choose whether or not to compute the constant by checking the appropriate box of the user interface. To not compute the constant effectively sets Constant=0 in (1).

Before entering the data points, the program must be told how the input coordinates are ordered: will each input point be entered by first giving the value of y_i followed by the corresponding x_i values, or vice-versa? This information is determined by the value of "Data ordering" of the user interface. The following table indicates how points will be entered depending on the value of "Data ordering".

Input according to value of "Data Ordering"
x's; y	y; x's
x₁₁, x₁₂, ... x_1n, y₁ x₂₁, x₂₂, ... x_2n, y₂ . . x_N1, x_N2, ... x_Nn, y_N	y₁, x₁₁, x₁₂, ... x_1n y₂, x₂₁, x₂₂, ... x_2n . . y_N, x_N1, x_N2, ... x_Nn

Once the preceeding information has been given, the user is ready to enter the data points. To do so, just click on the "Data" button that pops up a text area and enter the data points as indicated in the preceeding table. Note that N, the number of data points, is subject to the condition N >= n. Go to user interface

Polynomial fit
Given a discrete sampling of N data points having coordinates

P_i = (x_i, y_i) i = 1, N

it is assumed that the value of y can be correlated to the value of the x coordinate via an approximate function Y having the form:

Y(x) = A_nxⁿ + A_n-1x^n-1 + ... + A₁x + Constant (2)

which corresponds to an nth degree polynomial expansion. The expansion coefficients A_i are determined by least-squares fitting the data points to this expression. The resulting continuous function may then be used to estimate the value of y over the entire x region where the approximation has been applied.

input:

The degree n of the polynomial expansion (2) is entered in the text field "Degree" of the user interface.

The user can choose whether or not to compute the constant by checking the appropriate box of the user interface. To not compute the constant effectively sets Constant=0 in (2).

Before entering the data points, the program must be told how the input coordinates are ordered: will each input point be entered by first giving the value of y_i followed by the corresponding value x_i, or vice-versa? This information is determined by the value of "Data ordering" of the user interface. The following table indicates how points will be entered depending on the value of "Data ordering".

Input according to value of "Data Ordering"
x's; y	y; x's
x₁, y₁ x₂, y₂ . . x_N, y_N	y₁, x₁ y₂, x₂ . . y_N, x_N

Output
On output, the expansion coefficients A_i of the functional approximation (1) or (2) are printed. Input x(s) and y followed by the y estimates - as computed by the approximating function - are printed. Finally the standard deviation and R-squared (square of the correlation coefficient) values, which can be used to measure the effectiveness of least-squares smoothing, are given. Note that the standard deviation is defined under the condition that the number of data points is greater than the number of fitting parameters, including the computed constant. An "Undefined" standard deviation value usually implies an obvious perfect fit: an example would be to find the 2 parameter least-squares line passing through only 2 points.

The output window is editable, so the user can conveniently select, copy and paste it's content into a local file.

If the approximating function depends on one variable, the user will be given the option to view the resulting graph Y(x). Such functions are: linear of degree one and polynomial of any degree. The input (x, y) points are also rendered on the graph. Go to user interface

Accuracy
The equation-solving technique used by the program is Gauss elimination using double precision arithmetic. The issue of accuracy of statistical software has been adressed by the National Institute of Standards and Technology (NIST) that has compiled Statistical Reference Datasets (StRD) with certified computational results. These datasets are ordered by level of difficulty (lower, average and higher). The certified values are the result of very high precision computations (accurate to 500 digits). To quote the Institute:

"Data were read in exactly as multiple precision numbers and all calculations were made with this very high precision. The results were output in multiple precision, and only then rounded to fifteen significant digits. These multiple precision results are an idealization. They represent what would be achieved if calculations were made without roundoff or other errors. Any typical numerical algorithm (i.e. not implemented in multiple precision) will introduce computational inaccuracies, and will produce results which differ slightly from these certified values."

It is a straightforward exercise to test the program using the data from the NIST site. To do so, open 2 browser windows: one containing the applet and another dedicated to browsing the NIST site. Any relevant dataset from the NIST site can easily be copied to the input window of the applet without any reorganization whatsoever. Just make sure that the appropriate "Data ordering" box is checked. The NIST site also gives the graphs of the fitted functions, so it is also instructive to compare these to the program's rendering.

All 11 reference datasets handled by this program have been tested with Netscape 4.05 for Windows. These include 2 sets of lower difficulty (a 1 degree linear set and a quadratic set), 2 sets of average difficulty (both are one degree linear sets) and 7 sets of higher difficulty (a 6th degree linear set, five 5th degree polynomial sets and one 10th degree polynomial set). Except for the higher difficulty sets of Filippelli and Wampler-1, the program yields the certified values of the fitting parameters, standard deviation and R-squared to at least 7 digits.

As concerns the Wampler-1 set (a 5th degree polynomial), the computed parameters and R-squared values are also identical to the certified results. The only discrepancy lies with the value of the standard deviation: the certified value is 0, but the program returns NaN, as a result of having taken the square root of a computed mean-square value less than zero.

The largest discrepancies between the computed and certified values occur with the 10th degree polynomial set of Filippelli. The computed standard deviation and R-squared values are 0.003637 and 0.996139, as compared to the certified values of 0.003348 and 0.996727. This is an example of the well-known roundoff errors associated with solutions of high-degree polynomial fits via the Gauss elimination algorithm. Of course in practice relatively low order polynomials are used for data fitting, since higher order polynomials have a tendency to reproduce the noise in the data anyways. Nevertheless, it is instructive to attempt to improve on the accuracy of the computed results. To do so, we follow the suggestion of the NIST:

"If your code fails to produce correct results for a dataset of higher level of difficulty, one possible remedy is to center the data and rerun the code. Centering the data, i.e., subtracting the mean for each predictor variable, reduces the degree of multicollinearity. The code may produce correct results for the centered data. You can judge this by comparing predicted values from the fit of centered data with those from the certified fit."

I have applied this prescription to both the Filippelli and Wampler-1 sets, with excellent results. Indeed, the standard deviation of the Wampler-1 set (previously NaN) turns out to be effectively 0 (the certified result) when the mean-centered data is used. Furthermore, the standard deviation and R-squared values of the Filippelli set turn out to be identical to the certified values.
Mean-centered data of Filippelli set
Mean-centered data of Wampler-1 set

Based on these tests, we conclude that accurate results can be expected for most practical applications of this program. If there is good reason to fit the data to a high degree polynomial, the user should also run the program with mean-centered data. If a discrepancy between the 2 outputs occurs, the user should associate a higher level of confidence to the mean-centered data results. Go to user interface