NSF Award Search: Award # 0121175 - Scalable Decision Tree Construction

Award Abstract # 0121175

Scalable Decision Tree Construction

NSF Org:	IIS Div Of Information & Intelligent Systems
Recipient:	CORNELL UNIVERSITY
Initial Amendment Date:	September 13, 2001
Latest Amendment Date:	July 7, 2006
Award Number:	0121175
Award Instrument:	Continuing Grant
Program Manager:	Maria Zemankova IIS Div Of Information & Intelligent Systems CSE Direct For Computer & Info Scie & Enginr
Start Date:	October 1, 2001
End Date:	September 30, 2007 (Estimated)
Total Intended Award Amount:	$210,000.00
Total Awarded Amount to Date:	$2,187,700.00
Funds Obligated to Date:	FY 2001 = $75,000.00 FY 2002 = $305,000.00 FY 2003 = $590,000.00 FY 2004 = $433,100.00 FY 2005 = $334,600.00 FY 2006 = $450,000.00
History of Investigator:	Johannes Gehrke (Principal Investigator) johannes@cs.cornell.edu
Recipient Sponsored Research Office:	Cornell University 341 PINE TREE RD ITHACA NY US 14850-2820 (607)255-5014
Sponsor Congressional District:	19
Primary Place of Performance:	Cornell University 341 PINE TREE RD ITHACA NY US 14850-2820
Primary Place of Performance Congressional District:	19
Unique Entity Identifier (UEI):	G56PUALJ3KT5
Parent UEI:
NSF Program(s):	INFORMATION & KNOWLEDGE MANAGE
Primary Program Source:	app-0101 app-0102 app-0103 app-0104 app-0105 app-0106
Program Reference Code(s):	1655, 9216, 9218, HPCC
Program Element Code(s):	685500
Award Agency Code:	4900
Fund Agency Code:	4900
Assistance Listing Number(s):	47.070

ABSTRACT

Data mining is one of the most promising information technologies today. This project studies decision trees, one of the most widely used data mining models. The approach addresses three complementary components of decision tree construction: bias in split selection, pruning, and regression tree construction. Bias in split selection is an important problem because choosing the "wrong" split attribute destroys the interpretability of the decision tree, and users can no longer trust the information the tree provides. Through a large experimental study and a theoretical investigation, this project develops a framework for devising split selection methods with zero bias. The new methods will let users of decision trees interpret a tree without doubting whether it is misinforming them. The second topic addresses pruning of decision trees. Through a large experimental study of pruning for large datasets, the project investigates the computational and qualitative trade-offs between different pruning methods, resolving an ongoing debate about how to prune with large datasets. Third, this research investigates scalable regression tree construction, developing methods to construct regression trees with linear models in the leaf nodes and multivariate splits at intermediate nodes, all of which scale to very large datasets with millions of records. The results are implemented in a publicly available decision tree construction tool and performance testbed, offered as a software contribution to the research community. This research has many applications in electronic commerce, scientific data analysis, and computational biology.
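As an illustration of what bias in split selection can look like, the following minimal sketch measures the average information gain of attributes that are pure noise, against class labels that are also random. It is not the framework developed in this project; it only demonstrates the well-known tendency of information gain to favor attributes with many distinct values. All function names and parameters (`entropy`, `information_gain`, `average_gain`, the record and trial counts) are assumptions chosen for this example.

```python
import math
import random
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(values, labels):
    """Entropy reduction from partitioning labels by a categorical attribute."""
    groups = defaultdict(list)
    for v, y in zip(values, labels):
        groups[v].append(y)
    total = len(labels)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def average_gain(num_categories, num_records=200, trials=200, seed=0):
    """Mean information gain of a random attribute against random binary labels.

    Neither the attribute nor the labels carry any signal, so an unbiased
    split criterion would score every such attribute the same on average.
    """
    rng = random.Random(seed)
    gains = []
    for _ in range(trials):
        labels = [rng.randint(0, 1) for _ in range(num_records)]
        values = [rng.randrange(num_categories) for _ in range(num_records)]
        gains.append(information_gain(values, labels))
    return sum(gains) / trials


# Hypothetical comparison: a 2-valued noise attribute vs. a 20-valued one.
gain_binary = average_gain(num_categories=2)
gain_many_valued = average_gain(num_categories=20)
print(f"2 categories:  {gain_binary:.4f}")
print(f"20 categories: {gain_many_valued:.4f}")
```

Because both attributes are uninformative, any systematic difference between the two averages comes from the criterion itself rather than from the data. A tree builder that ranks attributes by raw information gain would tend to pick the many-valued attribute, which is the kind of misleading split the abstract describes.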

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

Alexandre Evfimievski, Ramakrishnan Srikant, Rakesh Agrawal, and J. E. Gehrke "Privacy Preserving Mining of Association Rules" Information Systems, invited paper for the "Special Issue with Papers from KDD 2002" , v.29 , 2004 , p.343

Johannes Gehrke "Letter from the Special Issue Editor" IEEE Data Engineering Bulletin , v.27 , 2004 , p.2

Johannes Gehrke, Joseph M. Hellerstein "Guest Editorial to the special issue on data stream processing" VLDB Journal , v.13 , 2004 , p.317

Manuel Calimlim, Jim Cordes, Alan J. Demers, Julia Deneva, Johannes Gehrke, Daniel Kifer, Mirek Riedewald, Jayavel Shanmugasundaram "A Vision for PetaByte Data Management and Analysis Services for the Arecibo Telescope" IEEE Data Engineering Bulletin , v.27 , 2004 , p.12

P.S. Bradley, J. E. Gehrke, R. Ramakrishnan and R. Srikant "Scaling Mining Algorithms to Large Databases" Communications of the ACM , v.45 , 2002 , p.38

S. Muthukrishnan, Rajmohan Rajaraman, Anthony Shaheen, Johannes Gehrke "Online Scheduling to Minimize Average Stretch" SIAM Journal on Computing , v.34 , 2004 , p.344

T. Joachims, F. Radlinski "Search Engines that Learn from Implicit Feedback" IEEE Computer , v.40 , 2007

T. Joachims, L. Granka, B. Pang, H. Hembrooke, and G. Gay "Evaluating the Accuracy of Implicit Feedback from Clicks and Query Reformulations in Web Search" ACM Transactions on Information Systems , v.25 , 2007
