Data Preparation for Data Mining - P5

Data Preparation for Data Mining - P5: Ever since the Sumerian and Elam peoples living in the Tigris and Euphrates River basin some 5500 years ago invented data collection using dried mud tablets marked with tax records, people have been trying to understand the meaning of, and get use from, collected data. More directly, they have been trying to determine how to use the information in that data to improve their lives and achieve their objectives.

... is possible for the output. Usually, the level of detail in the input streams needs to be at least one level of aggregation more detailed than the required level of detail in the output. Knowing the granularity available in the data allows the miner to assess the level of inference or prediction that the data could potentially support. It is only potential support because many other factors influence the quality of a model, but granularity is particularly important because it sets a lower bound on what is possible.

For instance, the marketing manager at FNBA is interested, in part, in the weekly variance of predicted approvals to actual approvals. To support this level of detail, the input stream requires at least daily approval information. With daily approval rates available, the miner will also be able to build inferential models when the manager wants to discover the reason for the changing trends.

There are cases where the rule of thumb does not hold, such as predicting Stock Keeping Unit (SKU) sales based on summaries from higher in the hierarchy chain. However, even when these exceptions do occur, the level of granularity still needs to be known.

Consistency

Inconsistent data can defeat any modeling technique until the inconsistency is discovered and corrected. A fundamental problem here is that different things may be represented by the same name in different systems, and the same thing may be represented by different names in different systems.
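To make the granularity rule of thumb from the FNBA example concrete, here is a minimal Python sketch. It assumes one possible reading of "weekly variance of predicted approvals to actual approvals": the variance, within each week, of the daily prediction error. The function name and the approval counts are hypothetical and made up for illustration; they do not come from the text.

```python
from collections import defaultdict
from datetime import date, timedelta
from statistics import pvariance


def weekly_error_variance(daily_rows):
    """Group (day, predicted, actual) rows by ISO week and return the
    population variance of the daily prediction error in each week."""
    errors_by_week = defaultdict(list)
    for day, predicted, actual in daily_rows:
        iso = day.isocalendar()
        week_key = (iso[0], iso[1])  # (ISO year, ISO week number)
        errors_by_week[week_key].append(predicted - actual)
    return {week: pvariance(errs) for week, errs in errors_by_week.items()}


# Hypothetical daily approval counts for two weeks (illustrative numbers only).
start = date(2024, 1, 1)  # a Monday, so each block of 7 days is one ISO week
predicted = [10, 12, 11, 9, 10, 8, 7, 10, 10, 10, 10, 10, 10, 10]
actual = [9, 12, 13, 9, 8, 8, 7, 10, 10, 10, 10, 10, 10, 10]
rows = [
    (start + timedelta(days=i), p, a)
    for i, (p, a) in enumerate(zip(predicted, actual))
]

result = weekly_error_variance(rows)
for week, variance in sorted(result.items()):
    print(week, round(variance, 4))
```

If only weekly totals were available, each week would contribute a single error value and no within-week variance could be computed, which is why the input needs to be at least one level of aggregation finer than the output.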
One data assay for a major metropolitan utility revealed that almost 90% of the data volume was in fact duplicate. However, it was highly inconsistent, and rationalization itself took a vast effort. The perspective with which a system of variables (mentioned in Chapter 2) is built has a huge effect on what is intended by the labels attached to the data. Each system is built for a specific purpose, almost certainly different from the purposes of other systems. Variable content, however labeled, is defined by the ...
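As a rough illustration of the "same thing, different names" problem, the sketch below maps each system's field names onto one canonical name, normalizes values, and then reports records that appear in more than one system. The text does not describe how the utility's data was actually rationalized; the system names, field names, and records here are all hypothetical.

```python
# Hypothetical mapping from each system's field names to one canonical name.
FIELD_ALIASES = {
    "billing": {"cust_no": "customer_id", "addr": "service_address"},
    "metering": {"account": "customer_id", "location": "service_address"},
}


def to_canonical(system, record):
    """Rename a record's fields to canonical names and normalize values."""
    aliases = FIELD_ALIASES[system]
    canonical = {}
    for field, value in record.items():
        name = aliases.get(field, field)
        # Collapse whitespace and uppercase so trivially different
        # spellings of the same value compare equal.
        canonical[name] = " ".join(str(value).split()).upper()
    return canonical


def find_duplicates(records_by_system):
    """Return canonical keys that appear in more than one system."""
    seen = {}
    for system, records in records_by_system.items():
        for record in records:
            row = to_canonical(system, record)
            key = (row["customer_id"], row["service_address"])
            seen.setdefault(key, set()).add(system)
    return {key: systems for key, systems in seen.items() if len(systems) > 1}


# Made-up records: the first billing row and the metering row describe
# the same customer under different labels and spellings.
data = {
    "billing": [
        {"cust_no": "a17", "addr": "12 Main  St"},
        {"cust_no": "b02", "addr": "9 Oak Ave"},
    ],
    "metering": [{"account": "A17", "location": "12 main st"}],
}

duplicates = find_duplicates(data)
print(duplicates)
```

This only addresses one half of the problem. The opposite case, where the same name means different things in different systems, cannot be caught by renaming and generally requires knowing the purpose each system was built for.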
