Online news and information are so vast that they can be unmanageable at times. This is precisely why news aggregation and search Web sites, like Yahoo News and Google News, are growing like crazy. Topix.net is said to be the world's largest online news Web site, with over 300,000 news categories and news sources from around the world.
Rich Skrenta, CEO of Topix.net and co-founder of Netscape’s Open Directory Project, took a few minutes out of his news day to explain his huge online news creation.
DANA GREENLEE: What does your site do and how does it work?
RICH SKRENTA: Topix.net is constantly reading all of the news published online and categorizing stories, both geographically and by subject. We've got a basic news roll-up for every zip code in the United States, covering 30,000 towns. We also have 300,000 subject categories. We have a page about every sports team, celebrity, and music style. We even have a page about mobile home manufacturing, which has a surprisingly large amount of news on it.
GREENLEE: You can't fit 30,000 zip codes on your front page. Does the cream of the crop rise to the top so that when you go to your home page you see some of the major hot topics?
SKRENTA: We have links to the major cities in the country, and users can type in zip codes and go to a page just for news about their town. We've got some of the categories surfaced on our front page: U.S. and World news, journalism news, health, technology; basically a random assortment of categories from deep within our system. But to get the full experience of Topix, you have to click around and see its full breadth by viewing the internal parts of the site.
GREENLEE: There are a lot of people saying you're the largest news Web site that's ever been created. Would that be a true statement?
SKRENTA: Based on the number of categories, yes. If you look at Google News, they've got eight categories: health, politics, entertainment, sports and so on, basically corresponding to the standard Associated Press taxonomy. Yahoo News has 100 full-coverage sections. We have 300,000 pages. Our goal is to have a page constantly updated from the broadest variety of sources about every person, place and thing in the world. We've got a page about every public company. There are 5,500 public companies. We're tracking references to every disease and drug – both brand and generic – as well as 21,000 sports personalities, 45,000 celebrities, anyone who's ever been in a movie, anyone who's ever put out a music album.
GREENLEE: Building out your keyword database, you must have spent a lot of long nights working.
SKRENTA: Yeah, it's a massive knowledge base which drives our system in conjunction with some artificial intelligence that we developed. The knowledge base knows the name of every street in the country, every bridge, tunnel, hospital, school, body of water, baseball stadium and park, in addition to the other subjects and keywords it's looking for. It's about 10 million lines of text that are constantly being looked for in every story that comes through our system.
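To make the idea of scanning every story against a list of known phrases concrete, here is a minimal sketch in Python. It is not Topix's implementation, which the interview does not describe; the tiny `KNOWLEDGE_BASE` dictionary, its entries, and the `find_phrases` function are all hypothetical, and a list of 10 million phrases would likely call for a more efficient structure than this simple longest-match scan.

```python
import re

# Hypothetical miniature knowledge base: known phrase -> topic page.
# These four entries are invented for illustration only.
KNOWLEDGE_BASE = {
    "golden gate bridge": "San Francisco, CA",
    "ibm": "IBM",
    "mobile home": "Mobile Home Manufacturing",
    "bordeaux": "Wine",
}

# Longest phrase length in words, so the scan knows how far to look ahead.
MAX_WORDS = max(len(phrase.split()) for phrase in KNOWLEDGE_BASE)


def find_phrases(text):
    """Return (phrase, topic) pairs for every known phrase found in text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    hits = []
    i = 0
    while i < len(words):
        # Try the longest candidate first so a multi-word phrase wins
        # over any shorter phrase starting at the same word.
        for n in range(min(MAX_WORDS, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            if candidate in KNOWLEDGE_BASE:
                hits.append((candidate, KNOWLEDGE_BASE[candidate]))
                i += n - 1  # skip the words consumed by this phrase
                break
        i += 1
    return hits


if __name__ == "__main__":
    print(find_phrases("IBM opened an office near the Golden Gate Bridge."))
```

A match found this way only says that a phrase occurs in a story, not that the story is about the topic; that second step is a separate decision.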
GREENLEE: What's the difference between what Topix.net is doing and what Google is doing in terms of the broad picture of what you're actually indexing?
SKRENTA: If you go to Google News and want to get information about, say, IBM, you'd find a lot of stories that contained the three letters IBM, but they might not give you a good, relevant overview of IBM's current business. When we look at a story, we're trying to determine not just if a story contains certain keywords, but actually if it's about the concept that our topic is about.
A story we recently saw said, "Dot com survivors have aged like fine bordeaux." Now this is a reference to a style of wine and we have a wine page in Topix, but it's not a wine story. It's a business story. It's a stray reference to something else. If you search for "bordeaux" on Google News, you would get this story. But it's not what you're looking for. Our system can tell the difference between stray references to concepts and stories that are actually about the concept.
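The contrast between a keyword match and an "about" decision can be illustrated with a toy example. The sketch below is an assumption-laden stand-in, not the classifier described in the interview: the topic vocabularies, the 10% threshold, and the body sentence appended to the quoted headline are all invented, and the scoring rule (share of a story's words that belong to a topic's vocabulary) is only one simple heuristic among many.

```python
import re

# Hypothetical topic vocabularies, invented for illustration.
TOPIC_TERMS = {
    "Wine": {"bordeaux", "wine", "vineyard", "vintage", "grape", "winery"},
    "Business": {"company", "revenue", "profit", "investors", "startup", "shares"},
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def keyword_match(text, keyword):
    """Plain keyword search: true if the word appears anywhere in the text."""
    return keyword.lower() in tokenize(text)


def topic_scores(text):
    """Share of the story's words that fall in each topic's vocabulary."""
    words = tokenize(text)
    total = len(words) or 1  # avoid dividing by zero on empty text
    return {
        topic: sum(word in terms for word in words) / total
        for topic, terms in TOPIC_TERMS.items()
    }


def is_about(text, topic, threshold=0.1):
    """Treat a story as 'about' a topic only if that topic scores above
    the (assumed) threshold and is also the best-scoring topic."""
    scores = topic_scores(text)
    return scores[topic] >= threshold and scores[topic] == max(scores.values())


if __name__ == "__main__":
    # Headline from the interview plus an invented body sentence.
    story = ("Dot com survivors have aged like fine bordeaux. The company "
             "reported record revenue and profit, and investors cheered.")
    print(keyword_match(story, "bordeaux"))  # the keyword search finds it
    print(is_about(story, "Wine"))
    print(is_about(story, "Business"))
```

Here the keyword search returns the story for "bordeaux", while the vocabulary-share heuristic files it under Business rather than Wine, because a single stray wine term is outweighed by the business terms around it.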
GREENLEE: You come from an incredible background of creating things we now take for granted on the web. Have you now created an artificial intelligence that's proprietary to Topix and that no one else has duplicated?
SKRENTA: Yeah, it's pretty unique in the industry. I haven't seen anything else like it. We looked at what had been done in academic AI. If you're going to develop an AI technique, 85% accuracy is pretty good. But for our purposes, we had to get far above 85% to make the stories look good on a page. If our AI was only 85% accurate, that would mean that on every one of our pages, 2-3 stories would be bad. We had to get way above 99%.
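The arithmetic behind the "2-3 bad stories" figure can be checked with a short calculation. The interview does not say how many stories appear on a page, so the page sizes of 15 and 20 below are assumptions chosen because they reproduce that range at 85% accuracy; the function name is likewise hypothetical.

```python
def expected_bad_stories(accuracy, stories_per_page):
    """Expected number of miscategorized stories on one page, assuming
    each story is wrong independently with probability 1 - accuracy."""
    return (1.0 - accuracy) * stories_per_page


if __name__ == "__main__":
    # Page sizes are assumed; the accuracy levels come from the interview.
    for accuracy in (0.85, 0.99):
        for stories_per_page in (15, 20):
            bad = expected_bad_stories(accuracy, stories_per_page)
            print(f"accuracy {accuracy:.0%}, {stories_per_page} stories: "
                  f"about {bad:.2f} bad per page")
```

Under these assumptions, 85% accuracy gives roughly 2 to 3 bad stories per page, while 99% still leaves about one bad story on every five to seven pages, which is consistent with the stated need to get well above 99%.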
GREENLEE: It's a pleasing site. You don't get just a list of headlines; it's formatted beautifully and very well organized. It's a pleasure to read and it's exciting.
SKRENTA: I'm glad to hear you say that. When we looked at creating a look and feel for the site, we looked at a bunch of newspaper sites out there. I didn't really feel like a lot of them looked like a newspaper. The Wall Street Journal Online is the one that was closest to a newspaper look and feel.
We did some research and found that newspaper layout design is actually a rich field with a 150-year history, and there are books and guidelines about rules to follow to make things visually appealing in a print newspaper. For example, if a story has a photo and the photo shows a person in profile, the person should be facing the text of the story. These are very subtle, not obvious rules about how to do newspaper layout properly. When we looked at online newspapers, many didn't follow any of these rules at all. I couldn't figure out why; maybe it was the separation between the print and online divisions at the company. We thought we'd bring some of these rules to bear on a web site design, adapt them to the web and come up with something a little more reminiscent of a newspaper.
GREENLEE: You were one of the founders of the Open Directory Project. What did you learn from building the Open Directory that you've applied to this new site?
SKRENTA: The Open Directory was built with 60,000 volunteer editors. We built a giant web directory similar to Yahoo, but 3-4 times bigger than the Yahoo Directory. It's actually the directory tab on Google.com, in addition to being used by AOL and Netscape. It was a very successful project, but we sort of took the opposite tack with Topix.net. We have zero human editors at Topix; it's all done with AI (artificial intelligence). Humans are really good at some tasks, but the scale of what we're doing here is just so vast that we couldn't have humans manually editing stories or selecting topics or categorizing them. It's too big a project for even 60,000 people to undertake.
GREENLEE: What's fascinating is that it seems like what you're building is somewhat of a dynamically populated directory. It looks like you still have that focus on categorization of content, which is different from a regular search engine. Was your vision to create a directory-type search engine?
SKRENTA: What we're trying to do is classify text by concept instead of keyword. When you go to Google and type in a name like Scott Peterson or Janet Jackson, these are actually relatively common names. There are thousands of people in the country with those names. Some of them make it into the media, besides the ones we commonly think of. We wanted a system that could be intelligent enough to decide whether a document was actually about that concept, as opposed to being a strict keyword match.