9. E.g. classification rules for the weather forecasting problem: If outlook = sunny and humidity = high then play = no; if outlook = rainy and windy = true then play = no; if outlook = overcast then play = yes.
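The three rules above can be sketched as a small Python function. The function name `play` and the choice to return `None` when no rule fires are assumptions made for illustration; the slide only gives the rules themselves.

```python
def play(outlook, humidity, windy):
    """Apply the three weather rules in order; return None if no rule fires."""
    if outlook == "sunny" and humidity == "high":
        return "no"
    if outlook == "rainy" and windy:
        return "no"
    if outlook == "overcast":
        return "yes"
    return None  # assumption: the three rules do not cover this case


print(play("sunny", "high", False))
print(play("overcast", "normal", True))
```

Note that the rules do not cover every combination of values, for example a sunny day with normal humidity.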
11. Same as classification learning, but the outcome to be predicted is a numeric quantity rather than a discrete class
19. Flat file: each dataset is represented as a matrix of instances versus attributes, which in database terms is a single relation, or a flat file
23. In database terms, take two relations and join them together to make one, a process of flattening that is technically called de-normalization
25. We are trying to find the ‘sister of’ relationship. With each row of the tree mapped to an instance, we can't make sense of the data with respect to our requirement or concept. Therefore ...
26. We de-normalize these tables to get a single table in which we can clearly see the ‘sister of’ relationship
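A minimal sketch of this flattening step, assuming a tiny family table. The names, the table layout and the definition used for "sister of" (the second person is female and has the same parents as the first) are all hypothetical and chosen only to illustrate the idea.

```python
from itertools import permutations

# Hypothetical family table: name -> (gender, parent1, parent2).
PEOPLE = {
    "Ann": ("female", "Mary", "Tom"),
    "Bob": ("male", "Mary", "Tom"),
    "Cat": ("female", "Sue", "Jim"),
}


def denormalize(people):
    """Join the table with itself: one row per ordered pair of distinct people."""
    rows = []
    for first, second in permutations(people, 2):
        _, *parents1 = people[first]
        gender2, *parents2 = people[second]
        same_parents = parents1 == parents2
        rows.append({
            "first": first,
            "second": second,
            "second_gender": gender2,
            "same_parents": same_parents,
            # Assumed definition: the second person is a sister of the first.
            "sister_of": "yes" if gender2 == "female" and same_parents else "no",
        })
    return rows


ROWS = denormalize(PEOPLE)
for row in ROWS:
    if row["sister_of"] == "yes":
        print(row["second"], "is a sister of", row["first"])
```

Even with three people the joined table has six rows, which hints at how quickly such tables grow.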
27. Problems with de-normalization: if relationships among a large number of items are required, the tables become huge; it produces apparent regularities in the data that are completely spurious; relations might not be finite (in that case use inductive logic programming). Overlay data: sometimes data relevant to the problem at hand needs to be collected from outside the organization. This is called overlay data.
28. Data Integration: integrating system-wide databases is difficult because different departments use different styles of record keeping, different conventions, different degrees of data aggregation, different types of errors, different time periods and different primary keys. These issues are addressed by the idea of a company-wide database, a process known as data warehousing.
29. Data Cleaning: data cleaning is the careful checking of data. It helps in resolving many architectural issues with different databases, and it usually requires good domain knowledge.
30. Attribute-Relation File Format (ARFF). Definition: an ARFF file is an ASCII text file that describes a list of instances sharing a set of attributes. Conventions used in the ARFF header: lines beginning with % are comments; to declare the relation: @relation <name of relation>; to declare an attribute: @attribute <attribute> <data type>. ARFF data section: the actual data starts with @data, followed by row-wise comma-separated data.
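The header and data conventions above can be illustrated with a minimal reader in Python. This is a sketch only: the function name `parse_arff` and the sample file are made up for illustration, and the reader assumes no value contains a comma or quotes, which a complete ARFF reader would have to handle.

```python
# Illustrative sample file; the values are not taken from the slides.
SAMPLE = """\
% weather data
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute play {yes, no}
@data
sunny, 85, no
overcast, 83, yes
"""


def parse_arff(text):
    """Return (relation name, [(attribute, type)], data rows as lists of strings)."""
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        keyword = line.lower()
        if in_data:
            rows.append([value.strip() for value in line.split(",")])
        elif keyword.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif keyword.startswith("@attribute"):
            _, name, dtype = line.split(None, 2)
            attributes.append((name, dtype))
        elif keyword.startswith("@data"):
            in_data = True
    return relation, attributes, rows


relation, attributes, rows = parse_arff(SAMPLE)
print(relation, attributes, rows)
```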
31. Data types for ARFF: numeric values can be real or integer numbers. Nominal values are defined by providing a <nominal-specification> listing the possible values: {nm-value1, nm-value2, ...}, e.g. {yes, no}; values containing spaces must be quoted. String attributes allow us to create attributes containing arbitrary textual values. The date type is declared as: @attribute <name> date [<date-format>]; the default date format is the ISO-8601 combined date and time format: "yyyy-MM-dd'T'HH:mm:ss". Missing values are represented by ?
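As a rough illustration of these data types, the sketch below converts one raw ARFF value to a Python value according to its declared type. The helper name `convert` is hypothetical, only the default date format is handled (custom `<date-format>` strings are not), and quoted values are not unquoted.

```python
from datetime import datetime

# Python equivalent of the default ARFF date format yyyy-MM-dd'T'HH:mm:ss.
ISO_FORMAT = "%Y-%m-%dT%H:%M:%S"


def convert(value, dtype):
    """Convert one raw value according to its declared ARFF type."""
    value = value.strip()
    if value == "?":
        return None  # missing value
    kind = dtype.strip()
    if kind.lower() in ("numeric", "real", "integer"):
        return float(value)
    if kind.startswith("{"):
        # Nominal type: the value must be one of the listed values.
        allowed = [v.strip() for v in kind.strip("{}").split(",")]
        if value not in allowed:
            raise ValueError(f"{value!r} is not one of {allowed}")
        return value
    if kind.lower() == "date":
        return datetime.strptime(value, ISO_FORMAT)
    return value  # string attribute: keep the text as-is


print(convert("85", "numeric"))
print(convert("?", "{yes, no}"))
print(convert("2001-04-03T12:12:12", "date"))
```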
32. Sparse ARFF files: sparse ARFF files are very similar to ARFF files, but data with value 0 are not explicitly represented. They have the same header as ARFF but a different data section. Instead of representing each value in order, like this: @data 0, X, 0, Y, "class A" the non-zero attributes are explicitly identified by attribute number (starting from zero) and their value is stated, like this: @data {1 X, 3 Y, 4 "class A"}
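The conversion from a dense row to its sparse form can be sketched in a few lines of Python. The function name `to_sparse` is made up for illustration, and the sketch assumes that each entry is written as the attribute index, a space, and the value, with values containing spaces quoted.

```python
def to_sparse(row):
    """Return the sparse ARFF form of a dense row given as a list of values."""
    parts = []
    for index, value in enumerate(row):
        if value == 0:
            continue  # zero values are left out of the sparse form
        text = str(value)
        if " " in text:
            text = f'"{text}"'  # values containing spaces must be quoted
        parts.append(f"{index} {text}")
    return "{" + ", ".join(parts) + "}"


print(to_sparse([0, "X", 0, "Y", "class A"]))  # {1 X, 3 Y, 4 "class A"}
```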