Data Preparation for Data Mining - P6: Ever since the Sumerian and Elam peoples living in the Tigris and Euphrates River basin some 5500 years ago invented data collection using dried mud tablets marked with tax records, people have been trying to understand the meaning of, and get use from, collected data. More directly, they have been trying to determine how to use the information in that data to improve their lives and achieve their objectives.

The mean of the five instance values is

\[ \frac{49 + 63 + 44 + 25 + 16}{5} = 39.4 \]

so squaring the instance value minus the mean gives

\[
\begin{aligned}
(49 - 39.4)^2 &= 9.6^2 = 92.16 \\
(63 - 39.4)^2 &= 23.6^2 = 556.96 \\
(44 - 39.4)^2 &= 4.6^2 = 21.16 \\
(25 - 39.4)^2 &= (-14.4)^2 = 207.36 \\
(16 - 39.4)^2 &= (-23.4)^2 = 547.56
\end{aligned}
\]

and since the variance is the mean of these squared differences,

\[ \frac{92.16 + 556.96 + 21.16 + 207.36 + 547.56}{5} = 285.04 \]

This number, 285.04, is the mean of the squares of the differences. It is therefore a variance of 285.04 square units. If these numbers represent some item of interest, say percentage return on investments, it turns out to be hard to know exactly what a variance of 285.04 square percent actually means. Square percentage is not a very familiar or meaningful measure in general. In order to make the measure more meaningful in everyday terms, it is usual to take the square root (the opposite of squaring), which would give 16.88. For this example, this would now represent a much more meaningful measure of spread of 16.88 percent. The square root of the variance is called the standard deviation. The standard deviation is a very useful thing to know.
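The arithmetic above can be checked with a few lines of Python. This is only a minimal sketch: the names `values` and `population_variance` are illustrative and do not come from the text, and the calculation divides by the number of instances, as the worked example does.

```python
import math
import statistics

# Instance values from the worked example (read as percentage returns).
values = [49, 63, 44, 25, 16]


def population_variance(xs):
    """Return the mean of the squared differences from the mean."""
    mean = sum(xs) / len(xs)
    squared_diffs = [(x - mean) ** 2 for x in xs]
    return sum(squared_diffs) / len(xs)


variance = population_variance(values)
std_dev = math.sqrt(variance)  # square root of the variance

print(round(variance, 2))  # 285.04
print(round(std_dev, 2))   # 16.88

# Cross-check against the standard library's population statistics.
assert math.isclose(variance, statistics.pvariance(values))
assert math.isclose(std_dev, statistics.pstdev(values))
```

The hand-written function follows the steps of the example one by one. In practice, the standard library's `statistics.pvariance` and `statistics.pstdev` would normally be used directly.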
There is a neat mathematical notation for doing all of the things just illustrated:

\[ \text{Standard deviation} = \sqrt{\frac{\sum (x - m)^2}{n}} \]

where

√ means to take the square root of everything under it
Σ means to sum everything in the brackets following it
x is the instance value
m is the mean
n is the number of instances

For various technical reasons that we don't need to get into here, when the sum is divided by n it is known as the standard deviation of the population, and when divided by n - 1, as the standard deviation of the sample. For large numbers of instances, which will usually be dealt with in data mining, the difference is minuscule.

There is another formula for finding the value of the standard deviation that can be found in any elementary work on statistics. It is the mathematical equivalent of the formula shown above, but gives a different perspective and reveals something else that is going on inside this