Larger Databases
Databases with hundreds of fields and tables, millions of records, and multigigabyte size are commonplace, and terabyte (10^12 bytes) databases are beginning to appear. Dealing with such data volumes calls for more efficient algorithms, sampling, approximation, and perhaps massively parallel processing.
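As one illustration of the sampling option, the sketch below draws a fixed-size uniform sample from a stream of records in a single pass, so the full table never has to fit in memory. It uses reservoir sampling, which is one standard choice and not a method named in the text; the function name `reservoir_sample` and the `seed` parameter are assumptions made for this example.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(records: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Return up to k records chosen uniformly at random from a stream."""
    rng = random.Random(seed)  # seeded only so the example is repeatable
    sample: List[T] = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            sample.append(record)
        else:
            # Keep the new record with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                sample[j] = record
    return sample


# A generator stands in for a table too large to load at once.
sample = reservoir_sample((row_id for row_id in range(100_000)), k=5)
print(sample)
```

A model fitted on such a sample is only an approximation of one fitted on the full data, so the sample size has to be chosen with the rarity of the patterns of interest in mind.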
High dimensionality
Not only is there often a large number of records in the database, but there can also be a large number of fields (attributes, variables), so the dimensionality of the problem is high. A high-dimensional data set increases the size of the search space for model induction in a combinatorially explosive manner. It also increases the chances that a data-mining algorithm will find spurious patterns that are not valid in general. Approaches to this problem include methods to reduce the effective dimensionality of the problem and the use of prior knowledge to identify irrelevant variables.
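A very simple way to reduce effective dimensionality is to drop fields that barely vary, since they cannot help distinguish one record from another. The sketch below is a minimal variance filter under that assumption; the column-oriented `table` layout, the `threshold` value, and the function name are hypothetical choices for this example, and real applications would usually combine such a filter with domain knowledge.

```python
from statistics import pvariance
from typing import Dict, List


def drop_low_variance_fields(
    table: Dict[str, List[float]], threshold: float = 1e-8
) -> Dict[str, List[float]]:
    """Keep only the fields whose population variance exceeds the threshold."""
    return {
        name: values
        for name, values in table.items()
        if pvariance(values) > threshold
    }


# Toy data: one field is constant and carries no information.
table = {
    "age": [23.0, 35.0, 47.0, 51.0],
    "constant_flag": [1.0, 1.0, 1.0, 1.0],
    "income": [30.0, 42.0, 58.0, 61.0],
}
reduced = drop_low_variance_fields(table)
print(sorted(reduced))  # ['age', 'income']
```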
Overfitting
When the algorithm searches for the best parameters for one particular model using a limited set of data, it can model not only the general patterns in the data but also any noise specific to the data set, resulting in poor performance of the model on test data. Possible solutions include cross-validation, regularization, and other sophisticated statistical strategies.
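The cross-validation remedy can be sketched as follows. The example compares a model that predicts the training mean with a "memorizer" that returns the target of the nearest training point, on data that is pure noise around a constant. All names here (`k_fold_indices`, `fit_mean`, `fit_memorizer`) and the synthetic data are assumptions for illustration, not part of any particular library.

```python
import random
from statistics import mean
from typing import Callable, List, Sequence, Tuple

Predictor = Callable[[float], float]
Fitter = Callable[[Sequence[float], Sequence[float]], Predictor]


def k_fold_indices(n: int, k: int) -> List[Tuple[List[int], List[int]]]:
    """Split indices 0..n-1 into k (train, test) pairs; each index is tested once."""
    folds = [list(range(start, n, k)) for start in range(k)]
    splits = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(n) if i not in held_out]
        splits.append((train_idx, test_idx))
    return splits


def mse(predict: Predictor, xs: Sequence[float], ys: Sequence[float]) -> float:
    """Mean squared error of a predictor on the given points."""
    return mean([(predict(x) - y) ** 2 for x, y in zip(xs, ys)])


def cross_validated_mse(fit: Fitter, xs: Sequence[float], ys: Sequence[float], k: int = 5) -> float:
    """Average held-out error over k folds."""
    fold_errors = []
    for train_idx, test_idx in k_fold_indices(len(xs), k):
        predict = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        fold_errors.append(
            mse(predict, [xs[i] for i in test_idx], [ys[i] for i in test_idx])
        )
    return mean(fold_errors)


def fit_mean(xs: Sequence[float], ys: Sequence[float]) -> Predictor:
    """General pattern only: always predict the training mean."""
    center = mean(ys)
    return lambda x: center


def fit_memorizer(xs: Sequence[float], ys: Sequence[float]) -> Predictor:
    """Models the noise too: return the target of the nearest training point."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


rng = random.Random(0)
xs = [float(i) for i in range(40)]
ys = [5.0 + rng.gauss(0.0, 1.0) for _ in xs]  # constant signal plus noise

for name, fit in [("mean", fit_mean), ("memorizer", fit_memorizer)]:
    train_error = mse(fit(xs, ys), xs, ys)
    cv_error = cross_validated_mse(fit, xs, ys, k=5)
    print(f"{name}: training MSE={train_error:.3f}, cross-validated MSE={cv_error:.3f}")
```

The memorizer reproduces its training data exactly, so its training error is zero, yet its held-out error is not. That gap between training and held-out error is the signal cross-validation is meant to expose.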
Assessing statistical significance
A problem (related to overfitting) occurs when the system is searching over many possible models. For example, if a system tests N models at the 0.001 significance level, then on average, with purely random data, N/1000 of these models will be accepted as significant. This point is frequently missed in initial attempts at KDD. One way to deal with this problem is to use methods that adjust the test statistic as a function of the search, for example, Bonferroni adjustments for independent tests or randomization testing.
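The arithmetic behind this can be made concrete. If N models are tested on purely random data at significance level alpha, about N × alpha of them pass by chance; a Bonferroni adjustment compensates by requiring each p-value to fall below alpha / N. The sketch below works through both steps; the function names and the example p-values are invented for illustration.

```python
from typing import List


def expected_false_positives(n_models: int, alpha: float) -> float:
    """Expected number of models accepted by chance on purely random data."""
    return n_models * alpha


def bonferroni_threshold(n_models: int, alpha: float) -> float:
    """Per-test significance level that keeps the overall level near alpha."""
    return alpha / n_models


def significant_after_bonferroni(p_values: List[float], alpha: float) -> List[int]:
    """Indices of the tests that survive the Bonferroni-adjusted threshold."""
    threshold = bonferroni_threshold(len(p_values), alpha)
    return [i for i, p in enumerate(p_values) if p < threshold]


# Searching 10,000 models at the 0.001 level yields about 10 chance "discoveries".
print(expected_false_positives(10_000, 0.001))

# Hypothetical p-values from three candidate models, tested at alpha = 0.001.
# The adjusted threshold is 0.001 / 3, so only the second model survives.
p_values = [0.0005, 0.00001, 0.2]
print(significant_after_bonferroni(p_values, alpha=0.001))  # [1]
```

Note that the first model would have been accepted at the unadjusted 0.001 level, which is exactly the kind of chance finding the adjustment is meant to screen out.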
Changing data and knowledge
Rapidly changing (non-stationary) data can make previously discovered patterns invalid. In addition, the variables measured in a given application database can be modified, deleted, or augmented with new measurements over time. Possible solutions include incremental methods for updating the patterns and treating change as an opportunity for discovery by using it to cue the search for patterns of change only.
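As a small example of an incremental method, the sketch below maintains the mean and variance of a field one record at a time, so the summary can be kept current as new data arrives without rescanning the database. It uses Welford's online update, which is one well-known technique and an assumption of this example; the class name `RunningStats` is hypothetical.

```python
class RunningStats:
    """Running mean and population variance, updated one value at a time."""

    def __init__(self) -> None:
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        """Fold one new observation into the summary (Welford's update)."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        """Population variance of all values seen so far (0.0 if none)."""
        return self._m2 / self.n if self.n else 0.0


stats = RunningStats()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(value)

print(stats.mean, stats.variance)
```

Comparing such a running summary against one computed over an earlier window is one simple way to notice that the data have shifted.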
Missing and noisy data
This problem is especially acute in business databases. U.S. census data reportedly have error rates as great as 20 percent in some fields. Important attributes can be missing if the database was not designed with discovery in mind. Possible solutions include more sophisticated statistical strategies to identify hidden variables and dependencies.
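Before applying any statistical remedy, it usually helps to measure how much is missing in each field. The sketch below is only such a first diagnostic, not one of the more sophisticated strategies mentioned above; the record layout (dictionaries with `None` for a missing value) and the function name are assumptions for this example.

```python
from typing import Dict, List, Optional

Record = Dict[str, Optional[float]]


def missing_rates(records: List[Record], fields: List[str]) -> Dict[str, float]:
    """Fraction of records in which each field is absent or None."""
    total = len(records)
    rates: Dict[str, float] = {}
    for field in fields:
        missing = sum(1 for record in records if record.get(field) is None)
        rates[field] = missing / total if total else 0.0
    return rates


# Toy records: "age" is missing once, "income" is missing twice
# (once as an explicit None, once because the key is absent).
records: List[Record] = [
    {"age": 34.0, "income": 52.0},
    {"age": None, "income": 61.0},
    {"age": 45.0},
    {"age": 29.0, "income": None},
]
rates = missing_rates(records, ["age", "income"])
print(rates)  # {'age': 0.25, 'income': 0.5}
```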
Complex relationships between fields
Hierarchically structured attributes or values, relations between attributes, and more sophisticated means for representing knowledge about the contents of a database will require algorithms that can effectively use such information. Historically, data-mining algorithms have been developed for simple attribute-value records, although new techniques for deriving relations between variables are being developed.
Understandability of patterns
In many applications, it is important to make the discoveries more understandable by humans. Possible solutions include graphic representations, rule structuring, natural language generation, and techniques for visualization of data and knowledge. Rule-refinement strategies can be used to address a related problem: The discovered knowledge might be implicitly or explicitly redundant.
User interaction and prior knowledge
Many current data mining methods and tools are not truly interactive and cannot easily incorporate prior knowledge about a problem except in simple ways. The use of domain knowledge is important in all the steps of the KDD process. Bayesian approaches use prior probabilities over data and distributions as one form of encoding prior knowledge. Others employ deductive database capabilities to discover knowledge that is then used to guide the data-mining search.
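To make the Bayesian route concrete, the sketch below encodes prior knowledge about an unknown rate (for example, the fraction of customers who respond to an offer) as a Beta distribution and updates it with observed counts. The Beta-binomial model, the class name `BetaBelief`, and the numbers are assumptions chosen for illustration; they are not taken from any specific system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BetaBelief:
    """Belief about an unknown rate, expressed as a Beta(alpha, beta) distribution."""

    alpha: float  # prior pseudo-count of successes
    beta: float   # prior pseudo-count of failures

    @property
    def mean(self) -> float:
        """Expected value of the rate under this belief."""
        return self.alpha / (self.alpha + self.beta)

    def update(self, successes: int, failures: int) -> "BetaBelief":
        """Posterior after observing new counts (conjugate Beta-binomial update)."""
        return BetaBelief(self.alpha + successes, self.beta + failures)


# Prior knowledge: the rate is believed to be around 20 percent.
prior = BetaBelief(alpha=2.0, beta=8.0)

# New data: 3 successes in 10 trials.
posterior = prior.update(successes=3, failures=7)

print(prior.mean, posterior.mean)  # 0.2 0.25
```

The posterior mean of 0.25 sits between the prior belief (0.2) and the observed rate (0.3); a stronger prior, with larger pseudo-counts, would move less in response to the same data.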
Integration with other systems
A standalone discovery system might not be very useful. Typical integration issues include integration with a database management system (for example, through a query interface), integration with spreadsheets and visualization tools, and accommodating real-time sensor readings.
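Integration through a query interface can be as simple as letting the database do the aggregation and handing only the summarized rows to the discovery step. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the `purchases` table, its columns, and its rows are invented for this example.

```python
import sqlite3

# An in-memory database stands in for an existing DBMS.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?)",
    [("ann", 10.0), ("bob", 5.0), ("ann", 30.0)],
)

# The DBMS computes per-customer summaries; the mining code sees only these rows.
rows = conn.execute(
    "SELECT customer, AVG(amount), COUNT(*) "
    "FROM purchases GROUP BY customer ORDER BY customer"
).fetchall()
conn.close()

for customer, average_amount, n_purchases in rows:
    print(customer, average_amount, n_purchases)
```

Pushing aggregation into the query keeps data movement small, which matters when the underlying tables are far larger than this toy example.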