When University of Illinois Professor Anhai Doan went shopping for a house recently, he couldn't get what he wanted.
Oh, he found a house all right, by driving around neighborhoods and scoping out for-sale signs, among other low-tech means.
What he wanted was a way to query electronically, say, for houses less than 3,000 square feet in the Champaign-Urbana area with a particular number of bedrooms, bathrooms and the like and the possibility for certain kinds of financing. He wanted to get back a single, homogeneous list of prospects – preferably one that would update itself automatically and notify him when it did so, and let him know about changes in mortgage rates, too.
Pretty much all that information exists in electronic form. But there was no one place for Doan to even ask for it, let alone get back the unified package he desired.
"There's no way you can express this in a keyword query to Google," the computer science professor said recently.
Doan and UI colleagues in a variety of fields are working to make that kind of thing possible in the future, however, not only to make your Web searches more fruitful but for purposes as diverse as studying literature, teasing new medical treatments from vast databanks of genes and identifying terrorist activity.
In the process, they are helping shift the kind of searching we know, love and sometimes loathe from Yahoo and Google to data mining: harvesting useful patterns and knowledge, often unexpected, from increasingly vast repositories of digital materials, like diamonds from mountains of rock.
The outcome looks to be potentially beneficial and even exciting. But the technology also raises some questions, particularly with respect to what's left of our privacy. How much are we willing to reveal about ourselves to get better results? Where our data is part of the mix, who controls it, where's it stored and how safely?
John Unsworth, the dean of the UI Graduate School of Library and Information Science, pointed to recent scandals involving information theft from the computers of major credit companies and the sale of personal cellphone records on the Web, as well as the government's controversial monitoring of library records, Internet communications and phone calls in anti-terrorism efforts.
"All of these things add up eventually in the public consciousness to sort of a justified paranoia about releasing any kind of information," Unsworth said.
The driving force behind efforts to improve what Michael Welge, a researcher at the UI-based National Center for Supercomputing Applications, likes to call "knowledge discovery in databases" is that we have so much stuff in digital form and online these days. More of this stuff comes by the second, and a lot of it is "born digital," meaning it will never meet paper unless it's printed out.
"I think the motivation is obvious, because now we've got a huge amount of data everywhere," said Jiawei Han, a professor who heads the Data Mining Research Group in the UI Computer Science Department.
The sheer volume makes finding what you need a challenge, never mind the behind-the-scenes challenge of dealing with the various formats it might be in and variations in the computer systems on which it resides.
At the same time, Han said, exponential increases in computing power have made it possible to use that data as never before. We don't necessarily have to sample pieces of it statistically to infer information. We can crunch whole sets: every book purchase on Amazon or every gene in the human body. Some observers tout it as an entirely new way of thinking.
Unsworth is a principal in an international test project using digitized 18th- and 19th-century American and British literature – it's called NORA, for no one remembers acronyms, or the name of a character in a William Gibson science fiction novel titled "Pattern Recognition" – to show how data mining can work in humanities research.
Unlike people, computers can keep thousands of words in memory and analyze them across multiple works looking for patterns that might be interesting. One test predicted which Emily Dickinson poems human scholars would rate as erotic.
Welge and his collaborators built a data mining system, called Evolution Highway, that an international team of researchers used to cross-compare human and mammal genomes and identify genetic locales where cancer and other diseases gain a toehold.
UI information science Professor Bruce Schatz and colleagues are working on a system, called BeeSpace, that will allow scientists to work not only with genetic data from bees but also with millions of scientific articles and hundreds of years of natural history observations by beekeepers and others – from the same interface at the same time.
It could be a model for analyzing other animals, including humans, and, Schatz thinks, for making sense of the dispersed and diverse information mass the Web represents.
UI electrical and computer engineering Professor Thomas Huang's lab has been developing new ways to catalog and search databases of images, video and other nontextual digital materials.
Other UI researchers are working on ways to make what you get from the Web more relevant and delivered with a lot less work by creating search technologies that understand you better on a personal level.
"We think that you are the center of the Web," UI engineering professor David Goldberg said.
"There are so many things you can do," Han said. "It's just unlimited for data mining."
Plenty of challenges remain, not least the amount of computing power and time some of those things consume and the half-life of digital materials in a world where Web pages, software formats and storage media are impermanent.
Computers also aren't good at semantics at this point. They have a hard time telling the difference between Java the island, Java the programming language and java a cup of coffee.
A big issue is what control we should have over our data, and how much of it we're willing to give up for better Web searches or to help identify the genetic similarities among diabetes sufferers. Securing what we don't want public is likewise a point of contention, along with how far we want the government poking around in it, even in the name of security.
"People need to have the ability to make decisions about their data and how much they want to make exposed," said Welge, whose data-mining software D2K caught the eye of Admiral John Poindexter, the controversial defense surveillance czar, for one.
Welge and Unsworth said people also need guarantees that the data they give up for anonymous use, as in a medical study, will remain anonymous.
Unsworth has an interesting take on the debate. He thinks it will eventually lead to a professional class of trained, bonded, "trusted stewards of information," which library and information science schools like his could be put to work producing.
Meanwhile, UI computer science Professor ChengXiang Zhai's system for improving your Google searches by learning your personal habits approaches the privacy question by keeping the data entirely on your computer.
"You never release information to the outside," he said.
Goldberg thinks transparency is key in Web personalization efforts.
"It has to be done in a way that discloses what's being done," he said, "that gives you control over what information is learned about you."