Data Preparation for Data Mining - P3

Ever since the Sumerian and Elam peoples living in the Tigris and Euphrates River basin some 5500 years ago invented data collection using dried mud tablets marked with tax records, people have been trying to understand the meaning of, and get use from, collected data. More directly, they have been trying to determine how to use the information in that data to improve their lives and achieve their objectives.

letters are used to identify other programs. However, by the time only the records that are relevant to the gold card upgrade program are extracted into a separate file, the variable "program name" becomes a constant, containing only "G" in this data set. The variable is a defining feature for the object and thus becomes a constant. Nonetheless, a variable in a data set that does not change its value does not contribute any information to the modeling process. Since constants carry no information within a data set, they can and should be discarded for the purposes of mining the data.

Two-Valued Variables

At least variables with two values do vary. Actually, this is a very important type of variable, and when mining, it is often useful to deploy various techniques specifically designed to deal with these dichotomous variables. An example of a dichotomous variable is "gender." Gender might be expected to take on only values of male and female in normal use. In fact, there are always at least three values for gender in any practical application: male, female, and unknown.

Empty and Missing Values: A Preliminary Note

A small digression is needed here. When preparing data for modeling, there are a number of problems that need to be addressed. One of these is missing data. Dealing with the problem is discussed more fully later, but it needs to be mentioned here that even dichotomous variables may actually take on four values. These are the two values it nominally contains and the two values "missing" and "empty."
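As a rough illustration of the points above, the following sketch counts the distinct values in each column of a small extract, flags columns that have collapsed to a constant, and drops them. The record layout, the column names (`program_name`, `gender`, `balance`), and the helper functions are all hypothetical, chosen only to mirror the gold card example; they are not taken from the book.

```python
from collections import Counter


def profile_columns(records):
    """Count how often each distinct value appears in each column."""
    counts = {}
    for row in records:
        for name, value in row.items():
            counts.setdefault(name, Counter())[value] += 1
    return counts


def constant_columns(records):
    """Return the names of columns that hold one single value in every record."""
    profile = profile_columns(records)
    return sorted(name for name, seen in profile.items() if len(seen) == 1)


def drop_columns(records, names):
    """Return copies of the records without the named columns."""
    drop = set(names)
    return [{k: v for k, v in row.items() if k not in drop} for row in records]


# Hypothetical extract: only gold card upgrade records remain,
# so program_name holds nothing but "G".
records = [
    {"program_name": "G", "gender": "male", "balance": 1200},
    {"program_name": "G", "gender": "female", "balance": 800},
    {"program_name": "G", "gender": "unknown", "balance": 950},
]

constants = constant_columns(records)      # columns carrying no information
cleaned = drop_columns(records, constants)  # data set without the constants
gender_values = sorted(profile_columns(records)["gender"])
```

In this made-up extract, `program_name` is reported as a constant and removed, while the nominally two-valued `gender` column turns out to hold three distinct values, which is the situation the text describes.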
It is often the case that there will be variables whose values are missing. A missing value for a variable is one that has not been entered into the data set, but for which an actual value exists in the world in which the measurements were made. This is a very important point. When preparing a data set, the miner needs to fix missing values and other problems in some way. It is critical to differentiate, if at all possible, between values that are missing and those that are empty. An empty .