24 Feb

A Probabilistic Retrieval Model for Semistructured Data

My first paper ‘A Probabilistic Retrieval Model for Semistructured Data’ (co-work with Xiaobing Xue and W. Bruce Croft) is to be presented in ECIR’09 at Toulouse, France.

It started as a course project in Advanced Database . As a natural intersection of Database and IR, semi-structured (XML) documeent retrieval problem drew my attention.
A simple literature review revealed that most of past work focused on setting the right granularity (XML element) for the retrieval. Also, most of those work assumed a structured query (XPath) rather than keyword query.

I wanted to see the problem differently. The first obvious thing was that it’s beyond the capability of average users to formulate XPath query — it’s hard even for me!.
And a thought on the typical user’s querying behavior made me realize that we implicitly map each query-term into some aspect of the item we are looking for.

Let’s assume a user trying to find a movie ‘French Kiss’ with partial information about cast (‘meg ryan’) and genre (‘romance’). He or she may type ‘meg ryan romance’ yet it is clear which aspect of data (movie) user meant by each query-word. And we can infer this mapping between query-term and document field by bayesian estimation (more detail on paper).

    <?xml version="1.0" encoding="ISO-8859-1"?>
    <movie>
       <title>French Kiss</title>
       <year>1995</year>
       <releasedate>USA:5 May 1995</releasedate>
       ...
       <language>English</language>
       ...
       <genre>Comedy</genre>
       <genre>Romance</genre>
       ...
       <country>USA</country>
       <actors>
          ...
          <actress>Ryan, Meg I</actress>
       </actors>

       <team>
          <director>Kasdan, Lawrence</director>
          ...
       </team>
       <plot>
          An American woman Kate(Meg Ryan) goes on trip to France
          in a desperate effort to find her romance back.
          ...
       </plot>
    </movie>

Given this observation and taking into account that each aspect of information is encoded in different XML element, it is natural that raking algorithm for this kind of document can benefit from this mapping between query-word and document. The solution is to put a higher weight for the element which seems to be what user intended. In above example, ‘cast’ element needs to be weighted higher for ‘meg ryan’ and the same can be said about ‘genre’ element and ‘romance’.

This simple idea later turned out to improve retrieval performance significantly. The performance gain was more noticable for collection with clear semantics (e.g. movie descriptions) since it was easier for a system to map each query-word into correct document field.

I’m currently working on applying this retrieval model for the desktop search problem, XML data were replaced with documents with metadata fields.

Tags : Paper Print Comments Trackback
12 Jul

Discriminative Models for Information Retrieval

PubDate(2004), PubPlace(SIGIR) Author(Nallapati)
keyword(Discriminative Model,SVM,Maximum Entropy)

Summary

Applying discriminative model to IR task resulted in improved result for Homepage finding task that typically requires combining many arbitrary features.

Content

Background

  • People most have approached IR using generative classifier
    • LMIR : Assuming that each document forms its own class, it’s similar to classifying given query to best-matching document class using Naive-bayes classifier

Contribution

  • Problem for IR as classification
    • Out-of-vocabulary problem : query-word may not appear in document
      • Overcome by aggregate feature
    • Unbalance of relevance/non-relevance class
      • Overcome by undersampling nr class

Experiment

  • Application of SVM to

Future Work

Comment

Reference

Tags : Paper,IR Print Comments Trackback
10 Jul

On Ranking Techniques for Desktop Search

PubDate(2008), PubPlace(TIS) Author(Cohen,Domshlak,Zwerdling)
keyword(Desktop search,Learning to rank,)

Summary

Desktop search experiment with known-item search task.

Content

Background

  • Stuff I’ve seen(SIS) : users sort result by last-update date more frequently than by IR ranking
    • The older the data, the less often it is used
  • Type of information stored in desktop
    • Ephemeral : reminder, short-lived
    • Working : related to ongoing work
    • Archive : long-term resource

Contribution

  • Novel Feature
    • Level : the distance of a file from uppermost directory
    • DirRank : The probability to open a file in specific directory is proportional to the number of files previous opened from this (and its sub) directory. (normalized by files in directory)
  • Selectivity ; combine values of content-similarity feature by the inverse of of the file number with non-zero feature value.
    • e.g. If user query was match with the filename field of 100 files, the score of name field is divided by 100.

Experiment

  • Queries are grouped by the no. of results returned
  • As more results are returned by each query, date-related features became more useful. (selectivity)
  • Combination of result by selectivity was proven to be as effective as learning-based methods

Future Work

Comment

Reference

Tags : Paper,IR Print Comments Trackback
10 Jul

Natural Language Processing and Information Retrieval

PubDate(1999), PubPlace() Author(Voorhees)
keyword(NLP,IR)

Summary

NLP doesn’t help ad-hoc retrieval.

Content

Background

Contribution

  • Term-normalization may help
    • Term-mismatch is major performance degrading factor.
    • No evidence that ‘meaning’ representation can boost perf.
  • Queries are troublesome for NLP-IR.
  • With sufficient context, BoW model implicitly disambiguates query-term.
    • e.g. Using document similarity measure to build word sense classifier (Word occurrences in similar documents can be considered to have the same meaning)

Experiment

  • Will ‘concept indexing’ based on WordNet synset improve effectiveness?
    • Using extended VSM : linear interpolation of similarity btw. the query and several vectors of document representation
    • Disambiguation : nouns are disambiguated into a synset id
    • 3 subvectors : 1) words not disambiguated 2) synset ids of disambiguated nouns 3) stems of disambiguated nouns
  • Concept indexing hurt performance in most cases
    • More of term mismatches due to errors in word sense disambiguation
    • Some queries were helped
  • ‘101’ run where non-stemmed verbs and adjectives are not counted as match showed worse perf. than baseline — impact of term mismatch again

Future Work

Comment

Reference

Tags : Paper,IR Print Comments Trackback
12 Jul

Optimizing Search Engines using Clickthrough Data

PubDate(2002), PubPlace(SIGKDD) Author(Joachims)
keyword(SVM,Clickthrough,Pairwise preference,Learning to rank,Metasearch)

Content

Background

  • LTR based on expert-judged relevance

Contribution

  • Generation pairwise pref. using clickthrough
  • Training this binary ordering using SVM
    • By minimizing rank correlation to optimal ranking via Kendall’s
  • Meta-search (Strive) engine to compare the learned ranking function with existing methods
    • To perform unbiased comparison of different rankings with clickthrough data, rank combination method that equally presents the links from each system.

Experiment

  • As more data was used for training, the algorithm showed lower rate of error.

Future Work

Comment

The author used the search engine himself to verify his result. I may need to build metasearch engine myself, which can be useful for a variety of tasks.

Reference

Tags : Paper,IR Print Comments Trackback