With the rapid development of the Internet, the amount of Web information has grown enormously, and for users, finding information in this ocean is like looking for a needle in a haystack. Search engine technology addresses exactly this problem by providing information retrieval services. It has become an object of research and development in both the computer industry and academia.
Search engine technology has developed gradually since 1995, alongside the rapid growth of Web information. According to the article “Accessibility of Information on the Web”, published in the journal Science in July 1999, the Web was then estimated to contain more than 800 million pages and more than 9 TB of data, and was still doubling every four months. In such a vast ocean of information, users inevitably find themselves “looking for a needle in a haystack”.
Search engine technology emerged to solve this problem. A search engine uses certain strategies to discover and collect information on the Internet, then understands, extracts, and organizes that information and offers users a retrieval service, thereby serving as a navigation aid. Navigation has become a very important network service on the Internet, and search engine sites are also known as “web portals”. Search engine technology has therefore become an object of research and development in both the computer industry and academia. This paper briefly introduces the key technologies of search engines, in the hope that it will prove useful.
First, search engine classification
According to how they collect information and the services they provide, search engine systems can be divided into three categories:
Directory search engines: information is collected manually or semi-automatically. After editors review it, they write a summary by hand and place it in a predetermined classification framework. Most of the information describes websites, and the engine offers directory browsing and direct retrieval. Because human intelligence is involved, the information is accurate and the navigation quality is high. The drawbacks are the need for manual intervention, the heavy maintenance workload, the limited amount of information, and slow updates. Representatives of this type include Yahoo, LookSmart, Open Directory, and Go Guide.
Robot search engines: a robot program called a spider automatically discovers and collects information on the Internet according to some strategy. An indexer builds an index of the collected information, and a retriever searches the index according to the user's query and returns the results to the user. The service offered is full-text retrieval of web pages. The advantages of this type are the large amount of information, timely updates, and the absence of manual intervention. The drawback is that too much information is returned, much of it irrelevant, so users must sift through the results. Representatives of this type include AltaVista, Northern Light, Infoseek, Inktomi, FAST, Lycos, and Google; in China, they include “Tianwang” and OpenFind.
Metasearch engines: this kind of search engine holds no data of its own. It submits the user's query to several search engines at once, removes duplicates from the returned results, re-ranks them, and returns them to the user as its own results. The service offered is full-text retrieval of web pages. The advantage of this type is that it returns more, and more complete, information. The drawbacks are that it cannot make full use of the features of the underlying search engines and that users have to do more filtering. Representatives of this type include WebCrawler and InfoMarket.
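The merge step of a metasearch engine (remove duplicates, then re-rank) can be sketched as follows. This is only an illustration under stated assumptions: the function name `merge_results`, the reciprocal-rank scoring rule, and the crude URL normalization are choices made for this example and are not taken from any of the engines named above.

```python
from collections import defaultdict

def merge_results(result_lists):
    """Merge ranked URL lists from several engines into one ranking.

    Assumption: a URL scores 1/rank in each engine that returns it,
    and the scores are summed across engines.
    """
    scores = defaultdict(float)
    for results in result_lists:
        seen = set()
        for rank, url in enumerate(results, start=1):
            # Crude normalization (assumption): ignore case and trailing slash.
            key = url.rstrip("/").lower()
            if key in seen:
                continue  # count each URL once per engine
            seen.add(key)
            scores[key] += 1.0 / rank
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda k: (-scores[k], k))

engine_a = ["http://a.example/", "http://b.example", "http://c.example"]
engine_b = ["http://b.example/", "http://d.example"]
merged = merge_results([engine_a, engine_b])
```

A URL returned by several engines accumulates score, so agreement between engines pushes a result up the merged list.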
Second, performance metrics
Web search can be treated as an information retrieval problem: retrieving, from a document collection made up of web pages, the documents relevant to a user's query. We can therefore use the traditional measures of information retrieval system performance, recall and precision, to measure the performance of a search engine.
Recall is the ratio of the number of relevant documents retrieved to the number of all relevant documents in the collection; it measures how completely the retrieval system (search engine) finds relevant material. Precision is the ratio of the number of relevant documents retrieved to the total number of documents retrieved; it measures how exact the retrieval system (search engine) is. For a given retrieval system, recall and precision cannot both be maximized: when recall is high, precision is low, and when precision is high, recall is low. For this reason the average of the precision values at 11 recall levels (11-point average precision) is often used to measure a retrieval system. For search engines, recall is difficult to calculate because no search engine can collect all Web pages. Current search engine systems are concerned chiefly with precision.
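The measures just defined can be computed as in the sketch below. The function names are illustrative, and the 11-point measure uses the common interpolation rule (at each recall level, take the highest precision observed at that recall or above), which is an assumption about the variant intended here.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for an unordered result set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def eleven_point_average_precision(ranking, relevant):
    """Average interpolated precision at recall 0.0, 0.1, ..., 1.0."""
    relevant = set(relevant)
    points = []  # (recall, precision) measured at each relevant hit
    hits = 0
    for i, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / i))
    levels = [i / 10 for i in range(11)]
    interpolated = [
        max((p for r, p in points if r >= level), default=0.0)
        for level in levels
    ]
    return sum(interpolated) / len(levels)

ranking = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3"}
p, r = precision_recall(ranking, relevant)
avg = eleven_point_average_precision(ranking, relevant)
```

Note that `precision_recall` needs the full set of relevant documents, which is exactly what a Web search engine cannot know; this is why recall is hard to measure on the Web.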
Many factors affect the performance of a search engine system. The main one is the information retrieval model, which covers how documents and queries are represented, the matching strategy used to evaluate the relevance of documents to a user query, the method for ranking query results, and the mechanism for user relevance feedback.
Third, main technologies
A search engine consists of four parts: the searcher, the indexer, the retriever, and the user interface.
1. The searcher
The searcher's function is to roam the Internet, discovering and collecting information. It is usually a computer program that runs around the clock. It must collect as much new information of every kind as possible, as quickly as possible. Because information on the Internet changes quickly, it must also regularly refresh the old information it has already collected, in order to avoid dead links and invalid links. There are two strategies for collecting information:
- Start from a set of seed URLs and follow the hyperlinks found in those pages, discovering information iteratively in breadth-first, depth-first, or heuristic order. The seed URLs can be arbitrary, but they are often very popular sites that contain many links (such as Yahoo!).
- Divide the Web space by domain name, IP address, or country domain, with each searcher responsible for exhaustively searching one subspace.
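The first strategy can be sketched as a breadth-first traversal of a link graph. To keep the example self-contained, the Web is simulated with an in-memory dictionary; `get_links` is a hypothetical callback that a real searcher would replace with fetching a page and extracting its hyperlinks.

```python
from collections import deque

def crawl_breadth_first(seeds, get_links, max_pages=100):
    """Visit pages in breadth-first order starting from the seed URLs."""
    seeds = list(dict.fromkeys(seeds))  # drop duplicate seeds, keep order
    seen = set(seeds)
    queue = deque(seeds)
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()  # queue.pop() here would give depth-first order
        visited.append(url)
        for link in get_links(url):
            if link not in seen:  # never enqueue the same URL twice
                seen.add(link)
                queue.append(link)
    return visited

# Toy link graph standing in for the Web (assumption for illustration).
toy_web = {"a": ["b", "c"], "b": ["d"], "c": ["d", "a"], "d": []}
order = crawl_breadth_first(["a"], lambda url: toy_web.get(url, []))
```

A heuristic strategy would replace the plain queue with a priority queue ordered by some estimate of page importance.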
The searcher collects many types of information, including HTML, XML, newsgroup articles, FTP files, word-processing documents, and multimedia.
The searcher is often implemented with distributed and parallel computing techniques in order to speed up the discovery and updating of information. The searchers of commercial search engines can discover millions of pages every day.
2. The indexer
The indexer's function is to understand the information collected by the searcher, extract index terms from it to represent the documents, and generate the index table of the document collection.
Index terms are of two types: objective index terms and content index terms. Objective index terms are independent of the semantic content of the document, such as the author's name, URL, update time, encoding, length, and link popularity. Content index terms reflect the content of the document, such as keywords and their weights, phrases, and words. Content index terms can be further divided into single-term index terms and multi-term (phrase) index terms. For English, single index terms are English words, which are easy to extract because words are separated by a natural delimiter (the space). For Chinese and other languages written without word delimiters, word segmentation must be performed first.
In a search engine, each single index term is generally assigned a weight that indicates how well the term discriminates the document; the weight is also used when computing the relevance of query results. The methods commonly used are statistical, information-theoretic, and probabilistic. Methods for extracting phrase index terms include statistical, probabilistic, and linguistic methods.
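As one example of a statistical weighting method, the sketch below computes TF-IDF weights. The text does not name a specific formula, so the choice of TF-IDF and this particular variant (relative term frequency times the logarithm of inverse document frequency) are assumptions made for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    df = Counter()  # number of documents containing each term
    for tokens in docs:
        df.update(set(tokens))
    weights = []
    for tokens in docs:
        tf = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n / df[term])
            for term, count in tf.items()
        })
    return weights

docs = [
    ["search", "engine", "index"],
    ["search", "engine", "crawler"],
    ["search", "index"],
]
weights = tf_idf(docs)
```

A term that appears in every document, such as "search" here, gets weight zero because it does nothing to distinguish one document from another, while a rare term such as "crawler" gets a higher weight.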
The index table generally uses some form of inverted list, that is, a structure that maps each index term to the documents that contain it. The index table may also record the positions at which index terms occur in each document, so that the retriever can compute adjacency or proximity relationships between index terms.
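A minimal sketch of an inverted list that records term positions, together with a proximity check, is shown below. The names `build_positional_index` and `near` are hypothetical, and tokenization is simplified to lowercasing and splitting on whitespace.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map each term to {doc_id: [positions]} for a dict of doc_id -> text."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            index[term][doc_id].append(position)
    return index

def near(index, term_a, term_b, max_distance=1):
    """Return ids of documents where the two terms occur within max_distance words."""
    postings_a = index.get(term_a, {})
    postings_b = index.get(term_b, {})
    matches = []
    for doc_id in postings_a.keys() & postings_b.keys():
        if any(abs(pa - pb) <= max_distance
               for pa in postings_a[doc_id]
               for pb in postings_b[doc_id]):
            matches.append(doc_id)
    return sorted(matches)

docs = {
    1: "search engine technology",
    2: "engine for web search",
    3: "search the web",
}
index = build_positional_index(docs)
adjacent = near(index, "search", "engine", max_distance=1)
```

Without the stored positions, the index could only say that documents 1 and 2 both contain "search" and "engine"; with them, it can tell that the two words are adjacent only in document 1.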
The indexer can use a centralized or a distributed indexing algorithm. When the volume of data is very large, instant indexing must be implemented; otherwise the index cannot keep up with the sharp growth of information. The indexing algorithm has a great effect on the performance of the indexer (for example, the response speed under large-scale peak query loads). The effectiveness of a search engine depends largely on the quality of its index.
3. The retriever
The retriever's function is to quickly find documents in the index according to the user's query, evaluate the relevance of each document to the query, sort the results to be output, and implement some form of user relevance feedback.
The information retrieval models commonly used by retrievers fall into four kinds: set-theoretic models, algebraic models, probabilistic models, and hybrid models.
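As an illustration of an algebraic model, the sketch below ranks documents with the vector space model, using cosine similarity over raw term counts. The text does not single out this model, so it is shown only as one representative; the function names and the whitespace tokenization are assumptions.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def rank(query, docs):
    """Return ids of matching documents, most similar to the query first."""
    q = Counter(query.lower().split())
    scored = [(cosine(q, Counter(text.lower().split())), doc_id)
              for doc_id, text in docs.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in scored if score > 0]

docs = {
    "d1": "web search engine",
    "d2": "search engine index search",
    "d3": "cooking recipes",
}
ranking = rank("search engine", docs)
```

In practice the raw counts would usually be replaced by index term weights, so that rare, discriminating terms contribute more to the score.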
4. The user interface
The user interface accepts user queries, displays the query results, and provides a mechanism for user relevance feedback. Its main purpose is to make the search engine convenient to use, so that users can obtain useful and timely information from it efficiently and in a variety of ways. The design and implementation of the user interface draw on the theories and methods of human-computer interaction, so that it fits human habits of thought.
User input interfaces can be divided into two kinds: simple interfaces and complex interfaces.
A simple interface provides only a text box in which the user enters a query string. A complex interface lets the user restrict the query, for example with logical operations (AND, OR, NOT, -), proximity relationships (adjacent, NEAR), domain scope (e.g., .edu, .com), position of occurrence (e.g., title, body), time of the information, length, and so on. Some companies and institutions are considering standards for query options.
Fourth, future trends
Search engines have become a new field of research and development. Because they draw on theories and techniques from information retrieval, artificial intelligence, computer networks, distributed processing, databases, data mining, digital libraries, and natural language processing, the field is both comprehensive and challenging. And because search engines have a large number of users and considerable economic value, they have attracted great attention from computer science and the information industry around the world. Research and development are currently very active, and a number of notable trends have appeared.
1. Great emphasis on improving the precision of results and the effectiveness of retrieval
When users query a search engine, they care less about how many results are returned than about whether the results meet their needs. For a single query, a traditional search engine often returns hundreds of thousands or even millions of documents, and users have to sift through them. There are currently several approaches to the problem of too many results. The first is to use various methods to obtain the real intent that the user did not express in the query. This includes using intelligent agents to track the user's retrieval behavior and analyze the user model, and using relevance feedback, in which the user tells the search engine which documents are relevant to the need (and to what degree) and which are not, so that the query is refined step by step over several interactions. The second is to classify the results with text categorization techniques and display the category structure with visualization techniques, so that users can browse only the categories that interest them. The third is to cluster or categorize results by site, reducing the total amount of information.
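The relevance feedback idea can be illustrated with the Rocchio formula, a standard textbook technique that the text itself does not name; it is substituted here purely as an example. The coefficient values `alpha`, `beta`, and `gamma` are conventional defaults and are assumptions, as is the representation of queries and documents as term-weight dictionaries.

```python
from collections import Counter

def rocchio(query, relevant_docs, nonrelevant_docs,
            alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query vector toward relevant documents and away from others."""
    new_query = Counter()
    for term, weight in query.items():
        new_query[term] += alpha * weight
    for docs, coefficient in ((relevant_docs, beta), (nonrelevant_docs, -gamma)):
        for doc in docs:
            for term, weight in doc.items():
                # Each group contributes its average (centroid) vector.
                new_query[term] += coefficient * weight / len(docs)
    # Terms pushed to zero or below are dropped from the refined query.
    return {term: w for term, w in new_query.items() if w > 0}

query = {"jaguar": 1.0}
marked_relevant = [{"jaguar": 1.0, "car": 2.0}]
marked_nonrelevant = [{"jaguar": 1.0, "cat": 2.0}]
refined = rocchio(query, marked_relevant, marked_nonrelevant)
```

After one round of feedback the refined query contains "car", which the user never typed, and excludes "cat"; repeating the step over several interactions refines the query gradually.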
2. Information filtering and personalized services based on intelligent agents
The information intelligent agent is another mechanism for making use of Internet information. It uses automatically acquired domain models (such as knowledge of the Web, information processing, information resources related to the user's interests, and the organizational structure of the domain) and user models (such as the user's background, interests, behavior, and style) to collect, index, and filter information (including both interest filtering and the filtering of undesirable content), and automatically submits to the user the information that is interesting and useful. An intelligent agent can learn continuously and adapt to dynamic changes in the information and in the user's interests, thereby providing a personalized service. An intelligent agent can run either on the client or on the server.
3. Distributed architectures to improve system scale and performance
A search engine can be implemented with either a centralized or a distributed architecture, and each approach has its merits. But once the system reaches a certain scale (for example, a billion pages), some distributed method must be adopted to improve system performance. Every component of a search engine except the user interface can be distributed. Searchers can cooperate and divide the work across several machines to speed up the discovery and updating of information. The indexer can distribute the index over different machines to reduce the demands on any single machine. The retriever can perform document retrieval in parallel on different machines to improve retrieval speed and performance.
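The idea of distributing the index and retrieving in parallel can be sketched on a single machine as follows. Threads stand in for separate machines, documents are assigned to shards by document id modulo the number of shards, and each shard is a plain dictionary; all of these are simplifying assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def build_shards(docs, num_shards):
    """Partition documents across shards; each shard stores doc_id -> term set."""
    shards = [dict() for _ in range(num_shards)]
    for doc_id, text in docs.items():
        shards[doc_id % num_shards][doc_id] = set(text.lower().split())
    return shards

def search_shard(shard, term):
    """Search one shard; on a real system this would run on its own machine."""
    return [doc_id for doc_id, terms in shard.items() if term in terms]

def search_all(shards, term):
    """Query every shard in parallel and merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda shard: search_shard(shard, term), shards))
    return sorted(doc_id for partial in partials for doc_id in partial)

docs = {0: "web search", 1: "search engine", 2: "web crawler", 3: "index table"}
shards = build_shards(docs, num_shards=2)
web_hits = search_all(shards, "web")
```

Each shard holds only part of the collection, which lowers the memory and processing demands on each machine, at the cost of a merge step over the partial results.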
4. Emphasis on research and development of cross-language retrieval
Cross-language information retrieval means that the user submits a query in his or her native language and the search engine retrieves information from databases in several languages, returning documents in any language that can answer the user's question. If machine translation is added, the results can be displayed in the user's native language. This technique is still at an early stage of research. The main difficulty lies in the uncertainty of how expressions and meanings correspond between languages. For today's globalized economy and an Internet that crosses national borders, however, it is of great significance.
Fifth, academic research
Commercial development of search engines is currently very active. The major search engine companies are investing heavily in developing search engine systems, and new search engine products with distinctive features keep appearing; search engines have become an industry within the information sector. Against this background, academic research on search engine technology has also gained the attention of universities and research institutes. For example, Stanford University developed the Google search engine as part of its digital library project, and has achieved very good results in research on efficient collection of Web information, evaluation of document relevance, and large-scale indexing.
Steve Lawrence and C. Lee Giles of the NEC Research Institute in the United States published articles on search engine research in the journals Nature and Science in two consecutive years, 1998 and 1999. The well-known information retrieval conference TREC has also included a Web Track since 1998, which examines how the retrieval of Web documents differs from that of other types of documents and tests the performance of information retrieval algorithms on a large-scale Web collection (for example, 100 gigabytes).
The international search engine conference sponsored by the American company Infonortics has been held once a year since 1996 to review, discuss, and look ahead on search engine technology. Its participants are scholars from well-known search engine companies, universities, and research institutions, and it has done much to promote search engine technology. Other venues, such as IEEE international conferences and conferences on human-computer interaction and on the Web, have also published more and more research articles on search engine technology.
In China, universities and research institutes such as Peking University, Tsinghua University, and the national intelligence research center have carried out research on search engine technology and developed several good systems. For example, the “Tianwang” Chinese search engine developed by the network laboratory of Peking University (http://pccms.pku.edu.cn:8000/gbindex.php) has reached the technical level of mid-range foreign search engine systems in both system scale and system performance. It provides domestic users with a good Internet search service and has been well received by them.