Given an *n*-word document
*a* = {*w*_{1},
*w*_{2}, ..., *w*_{n}} and a set of
*n* recognized words, one can
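represent *q* and *a* each as a vector of word frequencies,
$\mathbf{q}$ and $\mathbf{a}$, whose *i*-th components $q_i$ and $a_i$ count the occurrences of the *i*-th recognized word. A common measure of
similarity between two word frequency vectors $\mathbf{q}$ and $\mathbf{a}$, weighted by inverse
document frequency (*idf*), is the
cosine similarity between them:

$$\cos(\mathbf{q}, \mathbf{a}) = \frac{\sum_i q_i \, a_i \, idf_i^{2}}{\sqrt{\sum_i (q_i \, idf_i)^{2}} \; \sqrt{\sum_i (a_i \, idf_i)^{2}}}$$

where the sums run over the recognized words and

$$idf_i = \log \frac{|D|}{|D_i|}$$

with $D$ the document collection and $D_i$ the subset of documents in $D$ that contain the *i*-th recognized word.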
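As an illustration of the idf-weighted cosine measure described above, here is a minimal Python sketch. It rests on several assumptions that are not in the original text: documents and queries are lists of word tokens, idf is taken in its common form log(|D| / number of documents containing the word), recognized words that occur in no document get weight 0, and the function names `idf_weights` and `weighted_cosine` are hypothetical.

```python
import math
from collections import Counter


def idf_weights(vocabulary, collection):
    """Map each recognized word to log(|collection| / document frequency)."""
    n_docs = len(collection)
    weights = {}
    for word in vocabulary:
        doc_freq = sum(1 for doc in collection if word in doc)
        # Assumption: a word that occurs in no document gets weight 0.
        weights[word] = math.log(n_docs / doc_freq) if doc_freq else 0.0
    return weights


def weighted_cosine(query, document, idf):
    """Cosine of the angle between idf-weighted word-frequency vectors."""
    q_freq, a_freq = Counter(query), Counter(document)
    # One component per recognized word: raw frequency times idf.
    q_vec = [q_freq[w] * idf[w] for w in idf]
    a_vec = [a_freq[w] * idf[w] for w in idf]
    dot = sum(x * y for x, y in zip(q_vec, a_vec))
    norm = math.sqrt(sum(x * x for x in q_vec)) * math.sqrt(sum(y * y for y in a_vec))
    # Assumption: similarity is 0 when either weighted vector is all zeros.
    return dot / norm if norm else 0.0


# Toy collection, made up for illustration only.
collection = [["cat", "sat"], ["dog", "sat"], ["cat", "ran"]]
idf = idf_weights(["cat", "dog", "sat", "ran"], collection)
score = weighted_cosine(["cat", "sat"], ["cat", "ran"], idf)
```

Note that a word appearing in every document receives an idf of log 1 = 0 under this form, so it contributes nothing to the score; that is the intended effect of the weighting.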

Sandeep Pandey 2003-03-05