24 Feb
My first paper ‘A Probabilistic Retrieval Model for Semistructured Data’ (co-work with Xiaobing Xue and W. Bruce Croft) is to be presented in ECIR’09 at Toulouse, France.
It started as a course project in Advanced Database . As a natural intersection of Database and IR, semi-structured (XML) documeent retrieval problem drew my attention.
A simple literature review revealed that most of past work focused on setting the right granularity (XML element) for the retrieval. Also, most of those work assumed a structured query (XPath) rather than keyword query.
I wanted to see the problem differently. The first obvious thing was that it’s beyond the capability of average users to formulate XPath query — it’s hard even for me!.
And a thought on the typical user’s querying behavior made me realize that we implicitly map each query-term into some aspect of the item we are looking for.
Let’s assume a user trying to find a movie ‘French Kiss’ with partial information about cast (‘meg ryan’) and genre (‘romance’). He or she may type ‘meg ryan romance’ yet it is clear which aspect of data (movie) user meant by each query-word. And we can infer this mapping between query-term and document field by bayesian estimation (more detail on paper).
<?xml version="1.0" encoding="ISO-8859-1"?>
<movie>
<title>French Kiss</title>
<year>1995</year>
<releasedate>USA:5 May 1995</releasedate>
...
<language>English</language>
...
<genre>Comedy</genre>
<genre>Romance</genre>
...
<country>USA</country>
<actors>
...
<actress>Ryan, Meg I</actress>
</actors>
<team>
<director>Kasdan, Lawrence</director>
...
</team>
<plot>
An American woman Kate(Meg Ryan) goes on trip to France
in a desperate effort to find her romance back.
...
</plot>
</movie>
Given this observation and taking into account that each aspect of information is encoded in different XML element, it is natural that raking algorithm for this kind of document can benefit from this mapping between query-word and document. The solution is to put a higher weight for the element which seems to be what user intended. In above example, ‘cast’ element needs to be weighted higher for ‘meg ryan’ and the same can be said about ‘genre’ element and ‘romance’.
This simple idea later turned out to improve retrieval performance significantly. The performance gain was more noticable for collection with clear semantics (e.g. movie descriptions) since it was easier for a system to map each query-word into correct document field.
I’m currently working on applying this retrieval model for the desktop search problem, XML data were replaced with documents with metadata fields.
Tags :
Paper
Print
Comments
Trackback
12 Jul
PubDate(2004), PubPlace(SIGIR) Author(Nallapati)
keyword(Discriminative Model,SVM,Maximum Entropy)
Summary
Applying discriminative model to IR task resulted in improved result for Homepage finding task that typically requires combining many arbitrary features.
Content
Background
- People most have approached IR using generative classifier
- LMIR : Assuming that each document forms its own class, it’s similar to classifying given query to best-matching document class using Naive-bayes classifier
Contribution
- Problem for IR as classification
- Out-of-vocabulary problem : query-word may not appear in document
- Overcome by aggregate feature
- Unbalance of relevance/non-relevance class
- Overcome by undersampling nr class
Experiment
Future Work
Comment
Reference
Tags :
Paper,
IR
Print
Comments
Trackback
10 Jul
PubDate(2008), PubPlace(TIS) Author(Cohen,Domshlak,Zwerdling)
keyword(Desktop search,Learning to rank,)
Summary
Desktop search experiment with known-item search task.
Content
Background
- Stuff I’ve seen(SIS) : users sort result by last-update date more frequently than by IR ranking
- The older the data, the less often it is used
- Type of information stored in desktop
- Ephemeral : reminder, short-lived
- Working : related to ongoing work
- Archive : long-term resource
Contribution
- Novel Feature
- Level : the distance of a file from uppermost directory
- DirRank : The probability to open a file in specific directory is proportional to the number of files previous opened from this (and its sub) directory. (normalized by files in directory)
- Selectivity ; combine values of content-similarity feature by the inverse of of the file number with non-zero feature value.
- e.g. If user query was match with the filename field of 100 files, the score of name field is divided by 100.
Experiment
- Queries are grouped by the no. of results returned
- As more results are returned by each query, date-related features became more useful. (selectivity)
- Combination of result by selectivity was proven to be as effective as learning-based methods
Future Work
Comment
Reference
Tags :
Paper,
IR
Print
Comments
Trackback
10 Jul
PubDate(1999), PubPlace() Author(Voorhees)
keyword(NLP,IR)
Summary
NLP doesn’t help ad-hoc retrieval.
Content
Background
Contribution
- Term-normalization may help
- Term-mismatch is major performance degrading factor.
- No evidence that ‘meaning’ representation can boost perf.
- Queries are troublesome for NLP-IR.
- With sufficient context, BoW model implicitly disambiguates query-term.
- e.g. Using document similarity measure to build word sense classifier (Word occurrences in similar documents can be considered to have the same meaning)
Experiment
- Will ‘concept indexing’ based on WordNet synset improve effectiveness?
- Using extended VSM : linear interpolation of similarity btw. the query and several vectors of document representation
- Disambiguation : nouns are disambiguated into a synset id
- 3 subvectors : 1) words not disambiguated 2) synset ids of disambiguated nouns 3) stems of disambiguated nouns
- Concept indexing hurt performance in most cases
- More of term mismatches due to errors in word sense disambiguation
- Some queries were helped
- ‘101’ run where non-stemmed verbs and adjectives are not counted as match showed worse perf. than baseline — impact of term mismatch again
Future Work
Comment
Reference
Tags :
Paper,
IR
Print
Comments
Trackback
12 Jul
PubDate(2002), PubPlace(SIGKDD) Author(Joachims)
keyword(SVM,Clickthrough,Pairwise preference,Learning to rank,Metasearch)
Content
Background
- LTR based on expert-judged relevance
Contribution
- Generation pairwise pref. using clickthrough
- Training this binary ordering using SVM
- By minimizing rank correlation to optimal ranking via Kendall’s

- Meta-search (Strive) engine to compare the learned ranking function with existing methods
- To perform unbiased comparison of different rankings with clickthrough data, rank combination method that equally presents the links from each system.
Experiment
- As more data was used for training, the algorithm showed lower rate of error.
Future Work
Comment
The author used the search engine himself to verify his result. I may need to build metasearch engine myself, which can be useful for a variety of tasks.
Reference
Tags :
Paper,
IR
Print
Comments
Trackback