Chapter 6 Experimental Results and Analysis
6.1 Experimental Results
In our experiment, we crawled a total of 2500 papers. Among them, 1686 papers are cited by other papers within our collection, with 72471 citations in total, i.e. each cited paper is cited by roughly 43 papers on average. From these papers we extracted a total of 160046 comment sentences, an average of about 95 comment sentences per cited paper; in other words, each time a paper is cited by another paper, it receives on average about 2.2 comment sentences.
Given these ratios, if the interface ultimately needs to display 5 comments per paper, then a paper cited by even one or two other papers already comes close to this target (about 2.2 to 4.4 comments on average), and a paper cited by five other papers will have more than enough comments to form a comment set that works very well.
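To make the arithmetic behind these ratios explicit, the short Python sketch below recomputes the averages from the corpus totals and estimates how many comment sentences a paper cited by k other papers can be expected to accumulate; the variable names and the expected_comments helper are purely illustrative and are not part of the actual system.

# Corpus totals reported above
total_papers = 2500          # papers crawled
cited_papers = 1686          # papers cited at least once within the collection
total_citations = 72471      # citation links found among the crawled papers
comment_sentences = 160046   # comment sentences extracted in total

# Averages derived from the totals
citations_per_cited_paper = total_citations / cited_papers   # about 43
comments_per_cited_paper = comment_sentences / cited_papers   # about 95
comments_per_citation = comment_sentences / total_citations   # about 2.2

# Expected number of comment sentences for a paper cited by k other papers,
# compared against the 5 comments we want to display in the interface.
def expected_comments(k, rate=comments_per_citation):
    return k * rate

for k in (1, 2, 3, 5):
    print(k, round(expected_comments(k), 1))   # 2.2, 4.4, 6.6, 11.0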
6.2 Detailed Analysis
To better illustrate the effect of our system, we randomly select a paper with a relatively large number of comments as an example, and use it to show the comments we obtain and the role of the generated summary [Elkiss et al., 2008].
Paper Name:
Three-level caching for efficient query processing in large Web search engines
From the title we can see that this paper uses a three-level cache to handle the large query load of a Web search engine.
Abstract:
Large web search engines have to answer thousands of queries per second with interactive response times. Due to the sizes of the data sets involved, often in the range of multiple terabytes, a single query may require the processing of hundreds of megabytes or more of index data. To keep up with this immense workload, large search engines employ clusters of hundreds or thousands of machines, and a number of techniques such as caching, index compression, and index and query pruning are used to improve scalability. In particular, two-level caching techniques cache results of repeated identical queries at the frontend, while index data for frequently used query terms are cached in each node at a lower level. We propose and evaluate a three-level caching scheme that adds an intermediate level of caching for additional performance gains. This intermediate level attempts to exploit frequently occurring pairs of terms by caching intersections or projections of the corresponding inverted lists. We propose and study several offline and online algorithms for the resulting weighted caching problem, which turns out to be surprisingly rich in structure. Our experimental evaluation based on a large web crawl and real search engine query log shows significant performance gains for the best schemes, both in isolation and in combination with the other caching levels. We also observe that a careful selection of cache admission and eviction policies is crucial for best overall performance.
The abstract first outlines how heavy the workload of a large search engine is; it then notes that the existing two-level cache has certain limitations, and that the authors therefore built a three-level cache by adding an intermediate level to the existing ones; finally, it states that the paper proposes several algorithms for this problem and that the experimental results show good performance.
By reading the abstract, we already get an overview of the paper and how its work came about.
Comment:
(1)They may be considered separate and complementary to a cache-based approach. Raghavan and Sever [the cited paper], in one of the first papers on exploiting user query history, propose using a query base, built upon a set of persistent "optimal" queries submitted in the past, to improve the retrieval effectiveness for similar future queries. Markatos [10] shows the existence of temporal locality in queries, and compares the performance of different caching policies.
(2)Our results show that even under the fairly general framework adopted in this paper, geographic search queries can be evaluated in a highly efficient manner and in some cases as fast as the corresponding text-only queries. The query processor that we use and adapt to geographic search queries was built by Xiaohui Long, and earlier versions were used in [26, 27]. It supports variants of all the optimizations described in Subsection 1.
(3)the survey by Gaede and Günther in [17]. In particular, our algorithms employ spatial data organizations based on R∗-tree [5], grid files [the cited paper], and space-filling curves - see [17, 36] and the references therein. A geographic search engine may appear similar to a Geographic Information System (GIS) [20] where documents are objects in space with additional non-spatial attributes (the words they contain).
We now analyze each of the comments above in turn.
From (1) we can see that this comment does not mention the three-level cache structure of the source paper; instead, it highlights one of the techniques used there: based on the history of user queries, a query base is built from relatively good ("optimal") queries submitted in the past, so as to speed up the handling of similar queries in the search engine. This comment clearly points out one of the techniques behind the three-level cache in the source paper, and it also shows that the technique is not restricted to three-level caching; it can likewise be applied to areas such as personalized search.
From (2) we can see that this comment states that its authors used the query processor from the source paper to build a geographic search engine. From this comment we learn about the follow-up work on the source paper and what the paper is actually useful for: its contribution is not limited to the three-level cache structure, and its query processing model may well be even more widely applicable.
From (3) we can see that the source paper uses a system or algorithm based on grid files which, combined with algorithms such as the R*-tree and space-filling curves, can form a special kind of spatial data structure. This also represents one line of follow-up work on the source paper, and it helps readers view the paper from a broader perspective.
Impact-based Summary:
(1)This motivates the search for new techniques that can increase the number of queries per second that can be sustained on a given set of machines, and in addition to index compression and query pruning, caching techniques have been widely studied and deployed.
(2)Our experimental evaluation based on a large web crawl and real search engine query log shows significant performance gains for the best schemes, both in isolation and in combination with the other caching levels.
(3)To do so, the engine traverses the inverted list of each query term, and uses the information embedded in the inverted lists, about the number of occurrences of the terms in a document, their positions, and context, to compute a score for each document containing the search terms.
(4)Query characteristics: We first look at the distribution of the ratios and total costs for queries with various numbers of terms, by issuing these queries to our query processor with caching completely turned off.
(5)Thus, recent queries are analyzed by the greedy algorithm to allocate space in the cache for projections likely to be encountered in the future, and only these projections are allowed into the cache.
Finally, we analyze the impact-based summary that was obtained. To save space, only the first five sentences are taken for analysis here.