Data Preparation for Data Mining - P10: Ever since the Sumerian and Elam peoples living in the Tigris and Euphrates River basin some 5500 years ago invented data collection using dried mud tablets marked with tax records, people have been trying to understand the meaning of, and get use from, collected data. More directly, they have been trying to determine how to use the information in that data to improve their lives and achieve their objectives.

TABLE 8.3 The effect of missing values (.) on the summary values of x and y.

n     x      y      x²     y²     xy
1     0.55   0.53   0.30   0.28   0.29
2     0.75   0.37   0.56   0.14   0.28
3     0.32   0.83   0.10   0.69   0.27
4     0.21   0.86   0.04   0.74   0.18
5     0.43   0.54   0.18   0.29   0.23
Sum   2.26   3.13   1.20   2.14   1.25

1     0.55   0.53   0.30   0.28   0.29
2     .      0.37   .      0.14   .
3     0.32   0.83   0.10   0.69   0.27
4     0.21   .      0.04   .      .
5     0.43   0.54   0.18   0.29   0.23
Sum   .      .      .      .      .

The problem is what to do if values are missing when the complete totals for all the values are needed. Regressions simply do not work with any of the totals missing. Yet if any single number is missing, it is impossible to determine the necessary totals. Even a single missing x value destroys the ability to know the sums for x, x², and xy. What to do?

Since getting the aggregated values correct is critical, the modeler requires some method to determine the appropriate values, even with missing values. This sounds a bit like pulling oneself up by one's bootstraps: estimate the missing values to estimate the missing values! However, things are not quite so difficult.

In a representative sample, for any particular joint distribution, the ratios between the various values Σx and Σx², and Σy and Σy², remain constant. So too do the ratios between Σx and Σxy, and Σy and Σxy. When these ratios are found, they are the equivalent of setting the value of n to 1.
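The effect shown in Table 8.3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: missing values are represented as None and converted to NaN, the helper names (totals, slope_from_totals, slope_from_means) are hypothetical, and the slope formula is the standard least-squares one rather than anything quoted from the text. The last function shows one reading of "setting n to 1": once every total is divided by n, the same slope comes out of a formula in which n no longer appears.

```python
import math

# Table 8.3 values as (x, y) pairs; None marks a missing value ("." in the table).
complete = [(0.55, 0.53), (0.75, 0.37), (0.32, 0.83), (0.21, 0.86), (0.43, 0.54)]
with_missing = [(0.55, 0.53), (None, 0.37), (0.32, 0.83), (0.21, None), (0.43, 0.54)]


def totals(pairs):
    """Return n and the sums of x, y, x^2, y^2 and xy.

    A missing value is turned into NaN, so every sum it touches becomes NaN.
    """
    sums = {"x": 0.0, "y": 0.0, "x2": 0.0, "y2": 0.0, "xy": 0.0}
    for x, y in pairs:
        x = math.nan if x is None else x
        y = math.nan if y is None else y
        sums["x"] += x
        sums["y"] += y
        sums["x2"] += x * x
        sums["y2"] += y * y
        sums["xy"] += x * y
    return len(pairs), sums


def slope_from_totals(n, s):
    """Standard least-squares slope of y on x, written with the raw totals."""
    return (n * s["xy"] - s["x"] * s["y"]) / (n * s["x2"] - s["x"] ** 2)


def slope_from_means(n, s):
    """The same slope after dividing each total by n; n drops out of the formula."""
    mx, my, mx2, mxy = (s[k] / n for k in ("x", "y", "x2", "xy"))
    return (mxy - mx * my) / (mx2 - mx ** 2)


n, s = totals(complete)
print({k: round(v, 2) for k, v in s.items()})
print(slope_from_totals(n, s), slope_from_means(n, s))

n_m, s_m = totals(with_missing)
print(s_m)                          # every total is NaN
print(slope_from_totals(n_m, s_m))  # so the slope is NaN as well
```

With the complete data the two slope functions agree. With one x and one y missing, all five totals become NaN and no slope can be computed, which is the situation the table illustrates.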
One way to see why finding the ratios is equivalent to setting n to 1 is that, in any representative sample, the ratios are constant regardless of the number of instance values, and that includes n = 1. More mathematically, the effect of the number of instances cancels out. The end result is that when using ratios, n can be set to unity. In the linear regression formulae, values are multiplied by n, and multiplying a value by 1 leaves the original value unchanged. When multiplying by n = 1, the n can be left out of the expression. In the calculations that follow, that piece is dropped since it has no effect on the result. The