
Some of the current primary research and application challenges for KDD are outlined here. This list is by no means exhaustive and is intended to give the reader a feel for the types of problems that KDD practitioners wrestle with every day.

Larger databases

Databases with hundreds of fields and tables, millions of records, and a multigigabyte size are commonplace, and terabyte (10^12 bytes) databases are beginning to appear. Methods for dealing with large data volumes include more efficient algorithms, sampling, approximation, and massively parallel processing.
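
As one illustration of the sampling approach, the sketch below draws a fixed-size uniform sample from a record stream in a single pass, without knowing the stream length in advance (reservoir sampling). The function name `reservoir_sample`, the `seed` parameter, and the stand-in record stream are assumptions made for this example, not part of any particular KDD system.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(records: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Return up to k records chosen uniformly at random from a stream."""
    rng = random.Random(seed)
    sample: List[T] = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            sample.append(record)
        else:
            # Keep the new record with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = record
    return sample


# Hypothetical usage: range() stands in for rows read from a large table.
sample = reservoir_sample(range(100_000), k=5)
print(sample)
```

Only k records are held in memory at any time, so the same routine works whether the source has thousands or billions of rows.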





High dimensionality

Not only is there often a large number of records in the database, but there can also be a large number of fields (attributes, variables); so, the dimensionality of the problem is high. A high-dimensional data set creates problems in terms of increasing the size of the search space for model induction in a combinatorially explosive manner. In addition, it increases the chances that a data-mining algorithm will find spurious patterns that are not valid in general. Approaches to this problem need to include methods to reduce the effective dimensionality of the problem and the use of prior knowledge to identify irrelevant variables.
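
To give a rough sense of the combinatorial growth described above, the following sketch counts how many attribute subsets a model search could consider as the number of fields grows. The helper names `subset_count` and `subsets_up_to`, and the choice of subset size 3, are illustrative assumptions.

```python
from math import comb


def subset_count(num_fields: int) -> int:
    """Number of distinct attribute subsets (including the empty set)."""
    return 2 ** num_fields


def subsets_up_to(num_fields: int, max_size: int) -> int:
    """Number of non-empty subsets containing at most max_size attributes."""
    return sum(comb(num_fields, size) for size in range(1, max_size + 1))


for d in (10, 20, 40):
    print(d, subset_count(d), subsets_up_to(d, 3))
```

Even when the search is restricted to models over at most three attributes, the number of candidates grows roughly with the cube of the number of fields, which is one reason reducing the effective dimensionality matters.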

Overfitting

When the algorithm searches for the best parameters for one particular model using a limited set of data, it can model not only the general patterns in the data but also any noise specific to the data set, resulting in poor performance of the model on test data. Possible solutions include cross-validation, regularization, and other sophisticated statistical strategies.
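
Cross-validation, one of the solutions mentioned above, can be sketched as follows: the data are split into k folds, the model is fitted on k-1 of them, and its error is measured on the held-out fold. The least-squares line, the synthetic data, and all function names here are assumptions chosen to keep the example self-contained.

```python
import random
from statistics import mean
from typing import Callable, List, Sequence, Tuple

Predictor = Callable[[float], float]


def k_fold_indices(n: int, k: int, seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Split indices 0..n-1 into k (train, test) pairs with disjoint test folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    return [
        ([j for m, fold in enumerate(folds) if m != i for j in fold], folds[i])
        for i in range(k)
    ]


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Predictor:
    """Ordinary least squares for y = a*x + b; returns a predictor function."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    b = my - a * mx
    return lambda x: a * x + b


def cross_val_mse(xs, ys, fit: Callable[..., Predictor], k: int = 5) -> float:
    """Average mean squared error on the held-out fold, over k folds."""
    scores = []
    for train, test in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        scores.append(mean((model(xs[i]) - ys[i]) ** 2 for i in test))
    return mean(scores)


# Synthetic data (an assumption for this example): a line plus Gaussian noise.
rng = random.Random(1)
xs = [i / 10 for i in range(50)]
ys = [2 * x + 1 + rng.gauss(0, 0.1) for x in xs]
print(cross_val_mse(xs, ys, fit_line))
```

Because each record is scored only by a model that never saw it during fitting, the reported error reflects general patterns rather than noise specific to the training set.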

Assessing statistical significance

A problem (related to overfitting) occurs when the system is searching over many possible models. For example, if a system tests N models at the 0.001 significance level, then on average, with purely random data, N/1000 of these models will be accepted as significant. This point is frequently missed in initial attempts at KDD. One way to deal with this problem is to use methods that adjust the test statistic as a function of the search, for example, Bonferroni adjustments for independent tests or randomization testing.
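
The N/1000 figure can be checked with a small simulation. The sketch below assumes that, on purely random data, each model's p-value is uniformly distributed on [0, 1], and it compares the unadjusted 0.001 threshold with a Bonferroni-adjusted one. The function names and the choice of N = 100,000 are assumptions for illustration.

```python
import random


def expected_false_positives(num_models: int, alpha: float) -> float:
    """Expected number of models accepted by chance alone."""
    return num_models * alpha


def bonferroni_threshold(alpha: float, num_models: int) -> float:
    """Per-test threshold that keeps the overall error rate near alpha."""
    return alpha / num_models


def count_accepted(p_values, threshold: float) -> int:
    """Number of models whose p-value falls below the threshold."""
    return sum(p < threshold for p in p_values)


rng = random.Random(0)
num_models = 100_000
alpha = 0.001

# Simulated p-values for models tested on pure noise (uniform under the null).
p_values = [rng.random() for _ in range(num_models)]

naive = count_accepted(p_values, alpha)
adjusted = count_accepted(p_values, bonferroni_threshold(alpha, num_models))

print("expected by chance:", expected_false_positives(num_models, alpha))
print("accepted without adjustment:", naive)
print("accepted with Bonferroni adjustment:", adjusted)
```

Without adjustment, roughly N/1000 noise models pass the test; the Bonferroni threshold of alpha/N is strict enough that a purely random model is rarely accepted.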

Changing data and knowledge

Rapidly changing (non-stationary) data can make previously discovered patterns invalid. In addition, the variables measured in a given application database can be modified, deleted, or augmented with new measurements over time. Possible solutions include incremental methods for updating the patterns and treating change as an opportunity for discovery by using it to cue the search for patterns of change only.
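
As a minimal sketch of an incremental method, the class below updates a summary (mean and variance) one record at a time using Welford's algorithm, so the summary can track new data without rescanning the database. Treating a mean and variance as the "pattern" is a simplifying assumption for this example, and `RunningStats` is a hypothetical name.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Mean and population variance maintained incrementally."""

    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the current mean

    def update(self, value: float) -> None:
        """Fold one new observation into the summary in O(1) time."""
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)

    @property
    def variance(self) -> float:
        """Population variance of the observations seen so far."""
        return self.m2 / self.count if self.count else 0.0


stats = RunningStats()
for value in [2, 4, 4, 4, 5, 5, 7, 9]:
    stats.update(value)
print(stats.mean, stats.variance)
```

A large shift in such a running summary between time windows is also one simple cue that the underlying data have changed.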

Missing and noisy data

This problem is especially acute in business databases. U.S. census data reportedly have error rates as great as 20 percent in some fields. Important attributes can be missing if the database was not designed with discovery in mind. Possible solutions include more sophisticated statistical strategies to identify hidden variables and dependencies.

Complex relationships between fields

Hierarchically structured attributes or values, relations between attributes, and more sophisticated means for representing knowledge about the contents of a database will require algorithms that can effectively use such information. Historically, data-mining algorithms have been developed for simple attribute-value records, although new techniques for deriving relations between variables are being developed.

Understandability of patterns

In many applications, it is important to make the discoveries more understandable by humans. Possible solutions include graphic representations, rule structuring, natural language generation, and techniques for visualization of data and knowledge. Rule-refinement strategies can be used to address a related problem: The discovered knowledge might be implicitly or explicitly redundant.



User interaction and prior knowledge

Many current data mining methods and tools are not truly interactive and cannot easily incorporate prior knowledge about a problem except in simple ways. The use of domain knowledge is important in all the steps of the KDD process. Bayesian approaches use prior probabilities over data and distributions as one form of encoding prior knowledge. Others employ deductive database capabilities to discover knowledge that is then used to guide the data-mining search.

Integration with other systems

A standalone discovery system might not be very useful. Typical integration issues include integration with a database management system (for example, through a query interface), integration with spreadsheets and visualization tools, and accommodation of real-time sensor readings.
