Chapter 6 Experimental Results and Analysis
6.1 Experimental Results
In our experiment, we crawled a total of 2500 papers. Among them, 1686 papers are cited by other papers within our collection, with 72471 citations in total, i.e. each cited paper is cited by roughly 43 papers on average. From these papers we extracted a total of 160046 comment sentences, an average of about 95 comment sentences per cited paper; in other words, each time a paper is cited by another paper, it receives on average about 2.2 comment sentences.
Given these ratios, if the interface ultimately needs to display 5 comments per paper, then a paper cited by even one or two other papers already comes close to this target (about 2.2 to 4.4 comments on average), and a paper cited by five other papers will have more than enough comments to form a comment set that works very well.
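To make the arithmetic behind these ratios explicit, the short Python sketch below recomputes the averages from the corpus totals and estimates how many comment sentences a paper cited by k other papers can be expected to accumulate; the variable names and the expected_comments helper are purely illustrative and are not part of the actual system.

# Corpus totals reported above
total_papers = 2500          # papers crawled
cited_papers = 1686          # papers cited at least once within the collection
total_citations = 72471      # citation links found among the crawled papers
comment_sentences = 160046   # comment sentences extracted in total

# Averages derived from the totals
citations_per_cited_paper = total_citations / cited_papers   # about 43
comments_per_cited_paper = comment_sentences / cited_papers   # about 95
comments_per_citation = comment_sentences / total_citations   # about 2.2

# Expected number of comment sentences for a paper cited by k other papers,
# compared against the 5 comments we want to display in the interface.
def expected_comments(k, rate=comments_per_citation):
    return k * rate

for k in (1, 2, 3, 5):
    print(k, round(expected_comments(k), 1))   # 2.2, 4.4, 6.6, 11.0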
6.2 Detailed Analysis
To better illustrate the effect of our system, we randomly select a paper with a relatively large number of comments as an example, and use it to show the comments we obtain and the role of the generated summary [Elkiss et al., 2008].
Paper Name:
Three-level caching for efficient query processing in large Web search engines
From the title we can see that this paper uses a three-level cache to handle the large query load of a Web search engine.
Abstract:
Large web search engines have to answer thousands of queries per second with interactive response times. Due to the sizes of the data sets involved, often in the range of multiple terabytes, a single query may require the processing of hundreds of megabytes or more of index data. To keep up with this immense workload, large search engines employ clusters of hundreds or thousands of machines, and a number of techniques such as caching, index compression, and index and query pruning are used to improve scalability. In particular, two-level caching techniques cache results of repeated identical queries at the frontend, while index data for frequently used query terms are cached in each node at a lower level. We propose and evaluate a three-level caching scheme that adds an intermediate level of caching for additional performance gains. This intermediate level attempts to exploit frequently occurring pairs of terms by caching intersections or projections of the corresponding inverted lists. We propose and study several offline and online algorithms for the resulting weighted caching problem, which turns out to be surprisingly rich in structure. Our experimental evaluation based on a large web crawl and real search engine query log shows significant performance gains for the best schemes, both in isolation and in combination with the other caching levels. We also observe that a careful selection of cache admission and eviction policies is crucial for best overall performance.
The abstract first outlines how heavy the workload of a large search engine is; it then notes that the existing two-level cache has certain limitations, and that the authors therefore built a three-level cache by adding an intermediate level to the existing ones; finally, it states that the paper proposes several algorithms for this problem and that the experimental results show good performance.
By reading the abstract, we already get an overview of the paper and how its work came about.
Comment:
(1)They may be considered separate and complementary to a cache-based approach. Raghavan and Sever [the cited paper], in one of the first papers on exploiting user query history, propose using a query base, built upon a set of persistent "optimal" queries submitted in the past, to improve the retrieval effectiveness for similar future queries. Markatos [10] shows the existence of temporal locality in queries, and compares the performance of different caching policies.
(2)Our results show that even under the fairly general framework adopted in this paper, geographic search queries can be evaluated in a highly efficient manner and in some cases as fast as the corresponding text-only queries. The query processor that we use and adapt to geographic search queries was built by Xiaohui Long, and earlier versions were used in [26, 27]. It supports variants of all the optimizations described in Subsection 1.
(3)the survey by Gaede and Günther in [17]. In particular, our algorithms employ spatial data organizations based on R∗-tree [5], grid files [the cited paper], and space-filling curves - see [17, 36] and the references therein. A geographic search engine may appear similar to a Geographic Information System (GIS) [20] where documents are objects in space with additional non-spatial attributes (the words they contain).
We now analyze each of the comments above in turn.
From (1) we can see that this comment does not mention the three-level cache structure of the source paper; instead, it highlights one of the techniques used there: based on the history of user queries, a query base is built from relatively good ("optimal") queries submitted in the past, so as to speed up the handling of similar queries in the search engine. This comment clearly points out one of the techniques behind the three-level cache in the source paper, and it also shows that the technique is not restricted to three-level caching; it can likewise be applied to areas such as personalized search.
From (2) we can see that this comment states that its authors used the query processor from the source paper to build a geographic search engine. From this comment we learn about the follow-up work on the source paper and what the paper is actually useful for: its contribution is not limited to the three-level cache structure, and its query processing model may well be even more widely applicable.
From (3) we can see that the source paper uses a system or algorithm based on grid files which, combined with algorithms such as the R*-tree and space-filling curves, can form a special kind of spatial data structure. This also represents one line of follow-up work on the source paper, and it helps readers view the paper from a broader perspective.
Impact-based Summary:
(1)This motivates the search for new techniques that can increase the number of queries per second that can be sustained on a given set of machines, and in addition to index compression and query pruning, caching techniques have been widely studied and deployed.
(2)Our experimental evaluation based on a large web crawl and real search engine query log shows significant performance gains for the best schemes, both in isolation and in combination with the other caching levels.
(3)To do so, the engine traverses the inverted list of each query term, and uses the information embedded in the inverted lists, about the number of occurrences of the terms in a document, their positions, and context, to compute a score for each document containing the search terms.
(4)Query characteristics: We first look at the distribution of the ratios and total costs for queries with various numbers of terms, by issuing these queries to our query processor with caching completely turned off.
(5)Thus, recent queries are analyzed by the greedy algorithm to allocate space in the cache for projections likely to be encountered in the future, and only these projections are allowed into the cache.
Finally, we analyze the impact-based summary that was obtained. To save space, only the first five sentences are taken for analysis here.