When you GOOGLE, you should thank Mr. Amit Singhal (http://singhal.info/) and his team, who are responsible for the Google search algorithms. He is the master of what Google calls its “ranking algorithm”: the formulas that decide which Web pages best answer each user’s question.
My research interests are in the area of information retrieval (IR), its application to web search, web graph analysis, and user interfaces for search. Here are some of my selected publications (chronologically ordered). At Google I have worked on using IR techniques to improve web search. Before joining Google in 2000, I did research in the following sub-areas of information retrieval:
- Speech Retrieval: Increasing amounts of spoken communication are stored in digital form for archival purposes (for instance, broadcast material). With advances in automatic speech recognition (ASR) technology, it is now possible to automatically transcribe speech with reasonable accuracy. Once transcribed, IR methods can be used to search speech collections. Think of this as a search engine for speech. The interesting problem, however, is how to search speech given the large number of automatic speech recognition errors. More recently I have done some work in this area. While I was at AT&T Labs, we developed SCAN, a system that combines speech recognition, information retrieval and user interface techniques to provide a multimodal interface to speech archives.
- Document Ranking: Also called text/document searching/retrieval (that makes four phrases by the way), this is the best known part of our field. If you are reading this page, chances are that you have already used a "search engine" before. Document ranking is what search engines do: given a user query, how to rank a large collection of documents (web pages, news articles, your email, someone else's email that you happen to have hacked, ...) so that what you are looking for is ranked ahead of other less useful (or useless) documents.
- Question Answering: People have questions and they need answers, not documents. Automatic question answering will definitely be a significant advance in the state of the art of information retrieval technology. Systems that can do reliable question answering without domain restrictions have not been developed yet. I organized the first few runnings of the QA Track under the Text REtrieval Conference (TREC) umbrella to advance this sub-field of language processing.
- Document Routing/Filtering: This is the "query by example" version of document ranking. Once you point the system to a few "good documents", the system then tracks all NEW documents and points you to only those that you should be looking at. Typically the system tries to find new documents that are similar to the documents that you said were good.
- Automatic Text Summarization: Documents are huge and we don't always want to read them all. (I don't know about you but I certainly don't have the patience. And given the stuff you find on the web ...) Techniques that automatically "summarize" documents will be tremendously useful. Domain-independent text summarization is very hard, at times even for humans; typically machines do summarization by text extraction. Relevant pieces of text (sentences, paragraphs, ...) are extracted and presented as a "summary".
- Miscellaneous (TREC): Since 1992, the National Institute of Standards and Technology (NIST), along with DARPA, has sponsored an annual conference called the Text REtrieval Conference (TREC) to support research within the information retrieval community by providing the infrastructure necessary for large-scale evaluation of text retrieval methodologies. I have been actively participating in TRECs since TREC-3 (held in 1994).
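
To make the document ranking idea from the list above a little more concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions and is not Google's ranking algorithm or any system described on this page: it assumes a textbook TF-IDF weighting with cosine similarity, and the names `tokenize`, `tfidf_vectors`, `cosine` and `rank` are hypothetical helpers made up for this example.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase the text and split on anything that is not a letter or digit."""
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
    return cleaned.split()


def tfidf_vectors(docs):
    """Return ({term: weight} per document, idf table) using tf * idf weights."""
    counts_per_doc = [Counter(tokenize(d)) for d in docs]
    n_docs = len(docs)
    doc_freq = Counter()
    for counts in counts_per_doc:
        doc_freq.update(counts.keys())  # each document counts once per term
    # Smoothed idf (an assumption of this sketch): always positive.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = [{t: tf * idf[t] for t, tf in counts.items()} for counts in counts_per_doc]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, docs):
    """Return document indices, most similar to the query first (ties by index)."""
    vectors, idf = tfidf_vectors(docs)
    q_counts = Counter(tokenize(query))
    # Query terms never seen in the collection are ignored.
    q_vec = {t: tf * idf[t] for t, tf in q_counts.items() if t in idf}
    scored = [(cosine(q_vec, vec), i) for i, vec in enumerate(vectors)]
    return [i for _, i in sorted(scored, key=lambda pair: (-pair[0], pair[1]))]


if __name__ == "__main__":
    docs = [
        "speech recognition for broadcast archives",
        "ranking web pages for a user query",
        "summarizing long documents by extracting sentences",
    ]
    print(rank("web pages ranking", docs))
```

The same similarity score could plausibly serve the routing/filtering case too, by using a few "good documents" in place of the query, though real systems are considerably more involved than this toy.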
