
SIMI

Time: Sep 2014 - Nov 2014

Group Project

My Role: Algorithm Design

Result Page
Start Page
Project Plan
Software Development Process

Simi is a Java application built as a Software Engineering course project, using an agile development process and JUnit testing. It computes the similarity between two articles. I was responsible for designing and implementing the algorithms.

 

Simi offers three algorithms: a simple algorithm, a stemming algorithm, and a cosine algorithm. The simple algorithm accumulates the occurrences of words that appear in both articles. The stemming algorithm uses a stemmer class that implements the Porter stemming algorithm to reduce each term to its stem. The cosine algorithm uses the TF-IDF model to vectorize the two articles and then computes their cosine similarity.
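As a rough illustration of the cosine algorithm, the sketch below weights word counts by IDF and compares the two resulting vectors. It is not Simi's actual code: the class name CosineSketch, the tokenization rule (lower-casing and splitting on non-letters), the caller-supplied IDF map, and the default weight of 1.0 for words missing from that map are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of TF-IDF cosine similarity; not Simi's real implementation.
final class CosineSketch {

    // Counts word occurrences. Lower-casing and splitting on non-letters is an assumption.
    static Map<String, Integer> termCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Multiplies each raw count by the word's IDF.
    // Words missing from the map get weight 1.0 (an assumption for this sketch).
    static Map<String, Double> tfIdf(Map<String, Integer> counts, Map<String, Double> idf) {
        Map<String, Double> weights = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            weights.put(e.getKey(), e.getValue() * idf.getOrDefault(e.getKey(), 1.0));
        }
        return weights;
    }

    // Cosine of the angle between two sparse vectors; 0.0 if either vector is empty.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / (normA * normB);
    }

    private static double norm(Map<String, Double> v) {
        double sum = 0.0;
        for (double x : v.values()) {
            sum += x * x;
        }
        return Math.sqrt(sum);
    }

    // Convenience wrapper: vectorize both articles, then compare them.
    static double similarity(String a, String b, Map<String, Double> idf) {
        return cosine(tfIdf(termCounts(a), idf), tfIdf(termCounts(b), idf));
    }
}
```

With this sketch, two identical articles score approximately 1, and two articles that share no words score 0.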

 

The most interesting part of this project is that I used Google search in the cosine algorithm to simulate finding the documents that contain a specific word in a large corpus, because Simi does not have a corpus of its own. This yields the IDF, IDF(t) = log(n / docs(t)). I defined the constant n as the number of web pages that Google crawls and indexes. The value of docs(t) is obtained by searching Google with t as the query. However, querying Google for every word in a document is time-consuming, so I set up an IDFList file that stores every word that has already been searched along with its IDF. To compute the IDF of a word, the algorithm first looks it up in the file. If the file doesn’t contain the word, the algorithm queries Google to compute the IDF and stores the (word, IDF) pair in the file.
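The lookup-then-store idea behind the IDFList can be sketched as follows. This is a simplified illustration, not Simi's code: the class name IdfCacheSketch is hypothetical, the Google query is replaced by a caller-supplied function that returns docs(t), the list is kept in memory instead of in a file, the logarithm is assumed to be the natural log, and a result count of zero is treated as one so the logarithm stays finite.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.ToLongFunction;

// Hypothetical sketch of a cached IDF lookup; file persistence is omitted.
final class IdfCacheSketch {
    private final Map<String, Double> cache = new HashMap<>();
    private final double totalDocs;                // the constant n
    private final ToLongFunction<String> docCount; // stand-in for the search returning docs(t)

    IdfCacheSketch(double totalDocs, ToLongFunction<String> docCount) {
        this.totalDocs = totalDocs;
        this.docCount = docCount;
    }

    // Returns IDF(t) = log(n / docs(t)), running the slow lookup only on a cache miss.
    double idf(String term) {
        Double cached = cache.get(term);
        if (cached != null) {
            return cached;
        }
        long docs = docCount.applyAsLong(term);
        // Assumption: treat zero results as one document to avoid dividing by zero.
        double value = Math.log(totalDocs / Math.max(1L, docs));
        cache.put(term, value);
        return value;
    }

    // Number of words whose IDF has been stored so far.
    int size() {
        return cache.size();
    }
}
```

Asking for the same word twice triggers the lookup function only once; the second call is answered from the stored value.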
