KEYCONCEPT API DOCUMENTATION



First-time users of the project should study the MAINREADME and the additional README files in the tarball before proceeding further.

NOTE: All paths are relative to the project main directory.

The project broadly consists of the following modules:

INDEX

 

-         Indexes documents using the standard tf-idf approach.

-         Indexing can keep both the dictionary and the postings in memory (memory-based version), or keep the dictionary in memory while building the postings from files stored on disk (file-based version).
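
As a rough illustration of the tf-idf weighting mentioned above, the following is a minimal sketch, not the code in index/index.cc. The function name TfIdfWeights, its parameters, and the choice of raw term frequency with a natural-log idf are all assumptions made for this example; the project may use a different tf-idf variant.

```cpp
#include <cassert>
#include <cmath>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper (not part of the KeyConcept API): computes
// weight(term) = tf(term) * ln(totalDocs / df(term)) for one document.
std::map<std::string, double> TfIdfWeights(
    const std::vector<std::string>& docTerms,
    const std::map<std::string, int>& docFreq,
    int totalDocs) {
  assert(totalDocs > 0);

  // Count how often each term occurs in this document (raw tf).
  std::map<std::string, int> tf;
  for (const std::string& term : docTerms) ++tf[term];

  std::map<std::string, double> weights;
  for (const auto& entry : tf) {
    auto it = docFreq.find(entry.first);
    if (it == docFreq.end() || it->second <= 0) continue;  // skip unknown terms
    double idf = std::log(static_cast<double>(totalDocs) / it->second);
    weights[entry.first] = entry.second * idf;
  }
  return weights;
}
```

A term that occurs in every document gets an idf of zero under this variant, so it contributes nothing to the document vector.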

 

RETRIEVE

 

-         Performs keyword-based retrieval.

-         Can be run as command line module or via a web interface as a search engine.

-         In the training stage, we retrieve concept IDs rather than document IDs.

TRAIN

 

-         Shell script invoking INDEX.

-         Utilities in odptools are used to obtain a list of training documents (>=30) for each concept.

-         Those training documents are then indexed to create training indices.


CATEGORIZE

 

-         Categorizes documents using cosine measure between document and category keyword vectors.

-         Requires training (building the dictionary and postings files from a collection of documents for each category).
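
The cosine measure between a document vector and a category vector can be sketched as follows. This is an illustrative example only, not the code in categorize/categorize.cc; the TermVector type and the CosineSimilarity name are assumptions introduced here.

```cpp
#include <cassert>
#include <cmath>
#include <map>
#include <string>

// Hypothetical sparse vector type: term -> weight.
using TermVector = std::map<std::string, double>;

// Returns dot(a, b) / (|a| * |b|), or 0 if either vector has zero length.
double CosineSimilarity(const TermVector& a, const TermVector& b) {
  double dot = 0.0, normA = 0.0, normB = 0.0;
  for (const auto& entry : a) {
    normA += entry.second * entry.second;
    auto it = b.find(entry.first);
    if (it != b.end()) dot += entry.second * it->second;  // shared term
  }
  for (const auto& entry : b) normB += entry.second * entry.second;
  if (normA == 0.0 || normB == 0.0) return 0.0;

  double cosine = dot / (std::sqrt(normA) * std::sqrt(normB));
  assert(std::fabs(cosine) <= 1.0 + 1e-9);  // Cauchy-Schwarz sanity check
  return cosine;
}
```

Under this sketch, a document would be compared against every category vector and the categories ranked by the resulting score.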


CATINDEX

 

-         Builds a word/concept index again using the tf-idf approach.


CATRETRIEVE

 

-         Performs retrieval operation on the word/concept index.

-         Retrieval can be done using keywords, concepts or both. Invoked by the web-based interface or from the command line.


CREATE PROFILE

 

-         Creates a profile from collected Web pages.

-         The profile consists of categories of interest and their weights.
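
A profile of categories and weights could be represented along the following lines. This is a sketch under stated assumptions: the Profile type, the AddPageToProfile name, and the idea of summing per-page category weights are hypothetical and are not taken from createprofile/createprofile.cc.

```cpp
#include <cassert>
#include <map>
#include <utility>
#include <vector>

// Hypothetical types: a profile maps a category id to an accumulated weight.
using CategoryId = int;
using Profile = std::map<CategoryId, double>;

// Adds the (category, weight) pairs obtained for one Web page to the profile.
// Summation is an assumed update rule chosen for illustration.
void AddPageToProfile(
    const std::vector<std::pair<CategoryId, double>>& pageCategories,
    Profile& profile) {
  for (const auto& categoryWeight : pageCategories) {
    assert(categoryWeight.second >= 0.0);
    profile[categoryWeight.first] += categoryWeight.second;
  }
}
```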


 

INDEX

 

 

Main program source file

 

 

index/indexpgm.cc
index/createInverted.cc

 

API Definition source file

 

 

index/index.cc

 

API Function

 

 

void IndexFiles (HTMLPathname, FilesFilename, Dict, Post, PreFilename, NormFlag, MemFlag, NumDocs)

Parameters to API function

 

HTMLPathname

 

Path to the directory with HTML files to be indexed

FilesFilename

Path of the file with filenames of all files to be indexed.

 

Dict

 

Path of the dictionary file to be created.

Post

 

Path of the posting file to be created

PreFilename

 

Path to file specifying document pre-processing options

NormFlag

Normalization flag (0=do not normalize, 1=normalize the weights).

MemFlag

Flag that selects how the index files are created (0=file-based, 1=memory-based)

 

NumDocs

Number of documents to be extracted from HTML directory for indexing

 

 

Compilation instructions

 

 

In the root directory, type 'make index'.

 

Execution instructions

 

 

Run the shell script bin/index.sh (described in README and HOWTO_RUN_INDEX). The module expects either a directory or a file listing filenames.

 

 



RETRIEVE

 

 

Main program source file

 

retrieve/retrievepgm.cc for command line version
retrieve/retrievepgm.cgi.cc for web based version

 

API Definition source file

 

retrieve/retrieve.cc

 

Include file

 

include/retrieve.h

 

API Function

 

void RetrieveDocs(dict, post, prefilename, queryFile, outfilename, fileFilenames, totalDocs, numDocs, cgi)

Parameters

dict

 

Path of the dictionary file

post

Path of the postings file

 

prefilename

Path of the file with preprocessing instructions

 

queryFile

Path of the file with query terms and weights

 

outfilename

Output file containing the documents and weights

 

fileFilenames

File with filenames and IDs (as used when indexing; used by the web-based version to display the original filenames)

 

totalDocs

Number of documents in the collection

 

numDocs

Number of documents to be returned

 

cgi

CGI flag to display the results in HTML format (1=HTML format, 0=plain text format)

 

normFlag

Normalization flag to display the results normalized with values between 0 and 1. (1=normalize, 0=don't normalize)
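
One simple way to bring result weights into the range 0 to 1 is to divide each by the largest weight. The sketch below shows that approach; it is an assumption made for illustration, since this document does not state which normalization retrieve/retrieve.cc applies, and the NormalizeScores name is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper: scales non-negative scores so the largest becomes 1.0.
// Leaves the input unchanged if it is empty or has no positive score.
void NormalizeScores(std::vector<double>& scores) {
  if (scores.empty()) return;
  double maxScore = *std::max_element(scores.begin(), scores.end());
  if (maxScore <= 0.0) return;
  for (double& score : scores) {
    assert(score >= 0.0);
    score /= maxScore;
  }
}
```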

 

Compilation instructions

In the root directory, type 'make retrieve'.

 

Execution instructions

Set the parameters in config/retrieve.cfg. Run the shell script bin/retrieve.sh (Description in retrieve/README and bin/HOWTO_RUN_RETRIEVE)

 

 


 


TRAIN

 

Execution instructions: Run the shell script training/train.sh
Please view training/train.sh for training requirements and for information on setting parameters and paths.


CATEGORIZE

 

 

Main program source file

 

categorize/categorizepgm.cc

 

API Definition source file

 

categorize/categorize.cc

 

Include file

 

include/categorize.h

 

API Function

 

void Categorize(tdict, tpost, prefile, input, inputType, output, categorynames, totalDocs, topCats)

Parameters

tdict

 

Path of the trained dictionary file

tpost

Path of the trained postings file

 

prefilename

Path of the file with preprocessing instructions

 

input

Filename, directory or file with filenames to be categorized (defined with inputType)

 

inputType

1-single file, 2-directory, 3-file with filenames

 

output

Output file containing the categories and weights, or directory with output files if multiple files are categorized

 

categorynames

File with category names and IDs (as used when training; used by the web-based version to display the original category)

 

totalDocs

Number of categories in the collection

 

topWords

Number of top ranked words from the document to be used for categorizing

 

topCats

Number of top matching categories to be returned
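
Selecting the top matching categories from a list of scored candidates can be sketched with std::partial_sort. The Scored type and the TopMatches name are hypothetical and serve only to illustrate what a parameter such as topCats controls; the actual selection code in categorize/categorize.cc may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical pair: (category id, weight).
using Scored = std::pair<int, double>;

// Returns the k highest-weighted entries, best first.
// Ties are broken by the smaller id so the result is deterministic.
std::vector<Scored> TopMatches(std::vector<Scored> scored, std::size_t k) {
  k = std::min(k, scored.size());
  assert(k <= scored.size());
  std::partial_sort(scored.begin(),
                    scored.begin() + static_cast<std::ptrdiff_t>(k),
                    scored.end(),
                    [](const Scored& a, const Scored& b) {
                      if (a.second != b.second) return a.second > b.second;
                      return a.first < b.first;
                    });
  scored.resize(k);
  return scored;
}
```

A partial sort avoids ordering the whole candidate list when only a few top entries are needed.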

 

Compilation instructions

In the root directory, type 'make categorize'.

 

Execution instructions

Set the parameters in config/categorize.cfg. Run the shell script bin/categorize.sh (Description in categorize/README and bin/HOWTO_RUN_CATEGORIZE)

 

 


CATINDEX

 

 

Main program source file

 

catindex/catindexpgm.cc

 

API Definition source file

 

catindex/catindex.cc

 

Include file

 

include/catindex.h

 

API Function

 

void CatIndexFiles(FilesFilename, DictFilename, PostFilename, TrainDictFilename, TrainPostFilename, PreFilename, NumCats, TopCats, TopWords);

Parameters

FilesFilename

Path to the file listing the files to be categorized and used to build the category index

 

DictFilename

Path to the conceptual dictionary file to be created (word/concept index)

  

PostFilename

Path to the conceptual posting file to be created (word/concept index)

  

TrainDictFilename

Path to the training dictionary file

 

 

TrainPostFilename

 

 

Path to the training posting file

 

PreFilename

Path to the file with preprocessing instructions

 

 

NumCats

Estimated number of categories a document can belong to

 

 

TopCats

 

 

Number of top categories to be considered for each document indexed

 

TopWords

 

Number of top words to be considered while categorizing the documents

Compilation instructions

In the root directory, type 'make catindex'

Execution instructions

Run the script bin/catindex.sh (Description in catindex/README and bin/HOWTO_RUN_CATINDEX)

 




CATRETRIEVE

 

 

Main program source file

 

catretrieve/catretrievepgm.cc

 

API Definition source file

 

catretrieve/catretrieve.cc

 

Include file

 

include/catretrieve.h

 

API Function

 

void CatRetrieve(dict, post, cat_dict, cat_post, prefile, queryfile, catfile, outfile, filesfile, category_names, numdocs, alpha, cgi);

Parameters

dict

 

Path to the word dictionary file

post

Path to word posting file

 

cat_dict

Path to concept dictionary file

 

cat_post

Path to concept posting file

 

prefile

Path to the file with preprocessing instructions

 

queryfile

File containing the query keywords

 

catfile

File containing concept ids

 

outfile

Output filename (if run from the command line)

 

filesfile

Path to the file with original filenames for the documents in the collection and their ids

 

category_names

Path to the file with original category names and their ids

 

numdocs

Number of matching documents to be displayed

 

alpha

Alpha value to determine relative importance of keyword, concept in retrieval (0<=alpha<=1)

 

cgi

Flag to determine the output format (cgi=0 output in a file, cgi=1 output in HTML format)
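
The role of alpha can be illustrated with a linear combination of a keyword score and a concept score. This is a sketch only: the CombinedScore name is hypothetical, and the assumption that alpha weights the concept match while (1 - alpha) weights the keyword match is not confirmed by this document; the convention in catretrieve/catretrieve.cc may be the reverse.

```cpp
#include <cassert>

// Hypothetical helper: blends two match scores for one document.
// ASSUMPTION: alpha = 1 means concept-only, alpha = 0 means keyword-only.
double CombinedScore(double keywordScore, double conceptScore, double alpha) {
  assert(alpha >= 0.0 && alpha <= 1.0);
  return alpha * conceptScore + (1.0 - alpha) * keywordScore;
}
```

Under this assumed convention, alpha = 0.5 gives keyword and concept matches equal weight.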

 

Compilation instructions

In the root directory, type 'make catretrieve'

 

Execution instructions

Set the parameters in config/catretrieve.cfg. Run catretrieve.cgi from the command line or use the web interface. (Description in MAINREADME and bin/HOWTO_RUN_CATRETRIEVE)

 




CREATE PROFILE

 

 Main program source file

 

createprofile/createprofilepgm.cc

 API Definition source file

 

createprofile/createprofile.cc

 Include file

 

include/createprofile.h

 API Function

 

void create_profile(prefilename, dict, post, inputType, input, wtFlag, subjectTree, lcaHTMLsDir, output, numWords, numCat, prune_threshold, min_weightThold, update, date_processed)

Parameters

dict

 

Path of the training dictionary file

post

Path of the training postings file

 

prefilename

Path to file specifying document pre-processing options