KEYCONCEPT API DOCUMENTATION
First-time users of the project should study the MAINREADME and the additional README files
in the tarball before proceeding further.
NOTE: All paths are relative to the project main directory.
The project broadly consists of the following modules:
INDEX
-
Indexes documents using the standard tf-idf approach.
-
Indexing can be performed either by creating the dictionary and postings files in memory
(memory-based version), or by keeping the dictionary in memory while creating the postings
from files stored on disk (file-based version).
RETRIEVE
-
Performs keyword-based retrieval.
-
Can be run as a command-line module or via a web interface as a search engine.
-
In the training stage, we retrieve concept-ids rather than
document-ids.
TRAIN
-
Shell script invoking INDEX.
-
Utilities in odptools are used to obtain a list of training
documents (>=30) for each concept.
-
Those training documents are then indexed to create training
indices.
CATEGORIZE
-
Categorizes documents using cosine measure between document and category keyword vectors.
-
Requires training (building the dictionary and postings files from a collection of documents for each category).
CATINDEX
-
Builds a word/concept index again using the tf-idf approach.
CATRETRIEVE
-
Performs retrieval operation on the word/concept index.
-
Retrieval can be done using keywords, concepts or both. Invoked by the
web-based interface or through a command line.
CREATE PROFILE
-
Creates a profile from collected Web pages.
-
The profile consists of categories of interest and their weights.
INDEX
|
Main program
source file
|
index/indexpgm.cc index/createInverted.cc
|
|
API Definition
source file
|
index/index.cc
|
|
API Function
|
void IndexFiles (HTMLPathname, FilesFilename, Dict, Post, PreFilename,
NormFlag, MemFlag, NumDocs)
|
|
Parameters to
API function
|
HTMLPathname
|
Path to the directory with HTML files to be indexed
|
|
FilesFilename
|
Path of the file with filenames of all files to be indexed.
|
|
Dict
|
Path of the dictionary file to be created.
|
|
Post
|
Path of the posting file to be created
|
|
PreFilename
|
Path to file specifying document pre-processing options
|
|
NormFlag
|
Normalization flag (0=do not normalize, 1=normalize the weights).
|
|
MemFlag
|
Flag that determines how the files are created (0=file-based, 1=memory-based)
|
|
NumDocs
|
Number of documents to be extracted from HTML directory for indexing
|
|
Compilation
instructions
|
In the root directory, type 'make index'.
|
|
Execution
instructions
|
Run the shell script bin/index.sh (Description in README and HOWTO_RUN_INDEX).
The module expects either a directory or a file with filenames.
|
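The tf-idf weighting and the role of NormFlag can be illustrated with a minimal sketch. It is not taken from index/index.cc: the function name `ComputeTfIdf`, the in-memory token lists, the `tf * log(N/df)` formula, and the unit-length normalization are all assumptions chosen for illustration, and the actual module may weight terms differently.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Term weights of one document, keyed by term.
using TermWeights = std::map<std::string, double>;

// Hypothetical helper (not part of the project API).
// docs[i] holds the already pre-processed tokens of document i.
// normalize mirrors the idea of NormFlag: when true, each document
// vector is scaled to unit Euclidean length (an assumed scheme).
std::vector<TermWeights> ComputeTfIdf(
    const std::vector<std::vector<std::string>>& docs, bool normalize) {
  assert(!docs.empty());
  std::map<std::string, int> df;          // documents containing each term
  std::vector<TermWeights> weights(docs.size());
  for (std::size_t i = 0; i < docs.size(); ++i) {
    for (const std::string& term : docs[i]) weights[i][term] += 1.0;  // raw tf
    for (const auto& entry : weights[i]) ++df[entry.first];
  }
  const double n = static_cast<double>(docs.size());
  for (TermWeights& doc : weights) {
    double sumSquares = 0.0;
    for (auto& entry : doc) {
      entry.second *= std::log(n / df[entry.first]);  // tf * idf
      sumSquares += entry.second * entry.second;
    }
    if (normalize && sumSquares > 0.0) {
      const double length = std::sqrt(sumSquares);
      for (auto& entry : doc) entry.second /= length;
    }
  }
  return weights;
}
```

A term that occurs in every document receives weight 0 under this formula, which is why very common words contribute nothing to later matching.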
RETRIEVE
|
Main program
source file
|
retrieve/retrievepgm.cc for command line version
retrieve/retrievepgm.cgi.cc for web based version
|
|
API Definition
source file
|
retrieve/retrieve.cc
|
|
Include file
|
include/retrieve.h
|
|
API Function
|
void RetrieveDocs(dict, post, prefilename, queryFile, outfilename, fileFilenames, totalDocs,
numDocs, cgi)
|
|
Parameters
|
dict
|
Path of the dictionary file
|
|
post
|
Path of the postings file
|
|
prefilename
|
Path of the file with preprocessing
instructions
|
|
queryFile
|
Path of the file with query terms and
weights
|
|
outfilename
|
Output file containing
the documents and weights
|
|
fileFilenames
|
File with filenames and IDs (as used when
indexing; used by the web-based version to display the original filenames)
|
|
totalDocs
|
Number of documents in the collection
|
|
numDocs
|
Number of documents to
be returned
|
|
cgi
|
CGI flag to display the results
in HTML format (1=HTML format, 0=plain text format)
|
|
normFlag
|
Normalization flag to display
the results normalized with values between 0 and 1. (1=normalize, 0=don't normalize)
|
|
Compilation
instructions
|
In the root directory, type 'make retrieve'.
|
|
Execution
instructions
|
Set the parameters in config/retrieve.cfg. Run the shell script bin/retrieve.sh
(Description in retrieve/README and bin/HOWTO_RUN_RETRIEVE)
|
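Keyword-based retrieval over a postings structure can be sketched as follows. This is an illustration only, not the code in retrieve/retrieve.cc: the types `Posting`, `InvertedIndex`, and `Result`, the function `RankDocs`, the in-memory index, and the choice to normalize by dividing by the top score are all assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical in-memory stand-ins for the dictionary and postings files.
struct Posting { int docId; double weight; };
using InvertedIndex = std::map<std::string, std::vector<Posting>>;
struct Result { int docId; double score; };

// Scores each document as the sum of queryWeight * postingWeight over the
// query terms, then returns the numDocs best matches. When normalize is
// true, scores are divided by the top score so they fall between 0 and 1
// (an assumed normalization scheme).
std::vector<Result> RankDocs(
    const InvertedIndex& index,
    const std::vector<std::pair<std::string, double>>& query,
    std::size_t numDocs, bool normalize) {
  assert(numDocs > 0);
  std::map<int, double> scores;
  for (const auto& q : query) {
    auto it = index.find(q.first);
    if (it == index.end()) continue;  // term not in the dictionary
    for (const Posting& p : it->second) scores[p.docId] += q.second * p.weight;
  }
  std::vector<Result> ranked;
  for (const auto& s : scores) ranked.push_back({s.first, s.second});
  std::sort(ranked.begin(), ranked.end(),
            [](const Result& a, const Result& b) {
              if (a.score != b.score) return a.score > b.score;
              return a.docId < b.docId;  // deterministic tie-break
            });
  if (ranked.size() > numDocs) ranked.resize(numDocs);
  if (normalize && !ranked.empty() && ranked.front().score > 0.0) {
    const double top = ranked.front().score;
    for (Result& r : ranked) r.score /= top;
  }
  return ranked;
}
```

In this sketch the query is a list of (term, weight) pairs, matching the description of queryFile as a file of query terms and weights.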
TRAIN
Execution instructions: Run the shell script training/train.sh
Please view training/train.sh for training requirements and for information
on setting parameters and paths.
CATEGORIZE
|
Main program
source file
|
categorize/categorizepgm.cc
|
|
API Definition
source file
|
categorize/categorize.cc
|
|
Include file
|
include/categorize.h
|
|
API Function
|
void Categorize(tdict, tpost, prefile, input, inputType, output, categorynames, totalDocs, topCats)
|
|
Parameters
|
tdict
|
Path of the trained dictionary file
|
|
tpost
|
Path of the trained postings file
|
|
prefilename
|
Path of the file with preprocessing
instructions
|
|
input
|
Filename, directory or file with filenames to be
categorized (defined with inputType)
|
|
inputType
|
1=single file, 2=directory,
3=file with filenames
|
|
output
|
Output file containing
the categories and weights, or directory with output files if multiple files are categorized
|
|
categorynames
|
File with category names and IDs (as used when
training; used by the web-based version to display the original category)
|
|
totalDocs
|
Number of categories in the collection
|
|
topWords
|
Number of top ranked words
from the document to be used for categorizing
|
|
topCats
|
Number of top matching categories to
be returned
|
|
Compilation
instructions
|
In the root directory, type 'make categorize'.
|
|
Execution
instructions
|
Set the parameters in config/categorize.cfg. Run the shell script bin/categorize.sh (Description in categorize/README and
bin/HOWTO_RUN_CATEGORIZE)
|
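The cosine measure between a document vector and the category keyword vectors, together with the selection of the top matching categories, can be sketched as below. The names `TermVector`, `CategoryScore`, `Cosine`, and `TopCategories` are hypothetical and the vectors are held in memory for simplicity; the real module in categorize/categorize.cc reads the trained dictionary and postings files instead.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

using TermVector = std::map<std::string, double>;  // term -> weight
struct CategoryScore { int categoryId; double similarity; };

// Cosine of the angle between two sparse term vectors; 0 if either is empty.
double Cosine(const TermVector& a, const TermVector& b) {
  double dot = 0.0, normA = 0.0, normB = 0.0;
  for (const auto& e : a) {
    normA += e.second * e.second;
    auto it = b.find(e.first);
    if (it != b.end()) dot += e.second * it->second;
  }
  for (const auto& e : b) normB += e.second * e.second;
  if (normA == 0.0 || normB == 0.0) return 0.0;
  return dot / (std::sqrt(normA) * std::sqrt(normB));
}

// Ranks all categories by similarity to doc and keeps the best topCats.
std::vector<CategoryScore> TopCategories(
    const TermVector& doc, const std::map<int, TermVector>& categories,
    std::size_t topCats) {
  assert(topCats > 0);
  std::vector<CategoryScore> scored;
  for (const auto& cat : categories)
    scored.push_back({cat.first, Cosine(doc, cat.second)});
  std::sort(scored.begin(), scored.end(),
            [](const CategoryScore& x, const CategoryScore& y) {
              if (x.similarity != y.similarity)
                return x.similarity > y.similarity;
              return x.categoryId < y.categoryId;  // deterministic tie-break
            });
  if (scored.size() > topCats) scored.resize(topCats);
  return scored;
}
```

Because the cosine measure divides by both vector lengths, a long category vector does not outrank a short one merely by containing more text.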
CATINDEX
|
Main program
source file
|
catindex/catindexpgm.cc
|
|
API Definition
source file
|
catindex/catindex.cc
|
|
Include file
|
include/catindex.h
|
|
API Function
|
void CatIndexFiles(FilesFilename, DictFilename, PostFilename,
TrainDictFilename, TrainPostFilename, PreFilename, NumCats, TopCats, TopWords);
|
|
Parameters
|
FilesFilename
|
Path to the file listing the filenames to be categorized and used to build the category
index
|
|
DictFilename
|
Path to the conceptual
dictionary file to be created (word/concept index)
|
|
PostFilename
|
Path to the conceptual
posting file to be created (word/concept index)
|
|
TrainDictFilename
|
Path to the training
dictionary file
|
|
TrainPostFilename
|
Path to the training
posting file
|
|
PreFilename
|
Path to the file with preprocessing instructions
|
|
NumCats
|
Estimated number of categories a document can belong to
|
|
TopCats
|
Number of top categories to be considered for each
document indexed
|
|
TopWords
|
Number of top words to be considered while
categorizing the documents
|
|
Compilation
instructions
|
In the root directory, type 'make catindex'
|
|
Execution
instructions
|
Run the script bin/catindex.sh (Description in
catindex/README and bin/HOWTO_RUN_CATINDEX)
|
CATRETRIEVE
|
Main program
source file
|
catretrieve/catretrievepgm.cc
|
|
API Definition
source file
|
catretrieve/catretrieve.cc
|
|
Include file
|
include/catretrieve.h
|
|
API Function
|
void CatRetrieve(dict, post, cat_dict, cat_post, prefile, queryfile,
catfile, outfile, filesfile, category_names, numdocs, alpha, cgi);
|
|
Parameters
|
dict
|
Path to the word dictionary file
|
|
post
|
Path to word
posting file
|
|
cat_dict
|
Path to concept
dictionary file
|
|
cat_post
|
Path to concept
posting file
|
|
prefile
|
Path to the file with preprocessing
instructions
|
|
queryfile
|
File containing the
query keywords
|
|
catfile
|
File containing concept
ids
|
|
outfile
|
Output filename (if run from
the command line)
|
|
filesfile
|
Path to the file
with original filenames for the documents in the collection and their ids
|
|
category_names
|
Path to the file
with original category names and their ids
|
|
numdocs
|
Number of matching documents to be
displayed
|
|
alpha
|
Alpha value to determine relative
importance of keyword, concept in retrieval (0<=alpha<=1)
|
|
cgi
|
Flag to determine
the output format (cgi=0 output in a file, cgi=1 output in HTML format)
|
|
Compilation
instructions
|
In the root directory, type 'make catretrieve'
|
|
Execution
instructions
|
Set the parameters in config/catretrieve.cfg. Run
catretrieve.cgi from the command line or use the web interface.
(Description in MAINREADME and bin/HOWTO_RUN_CATRETRIEVE)
|
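One plausible reading of the alpha parameter is a linear blend of the keyword match and the concept match. The sketch below assumes that alpha weights the concept score and (1 - alpha) weights the keyword score; the documentation above does not state the direction or the exact formula, so both are assumptions, as are the names `CombineScores` and `RankCombined`.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <utility>
#include <vector>

// Assumed blend: alpha = 1 uses only the concept match,
// alpha = 0 uses only the keyword match.
double CombineScores(double keywordScore, double conceptScore, double alpha) {
  assert(alpha >= 0.0 && alpha <= 1.0);
  return alpha * conceptScore + (1.0 - alpha) * keywordScore;
}

// Merges per-document keyword and concept scores (docId -> score) and
// returns (docId, combined score) pairs sorted by descending score.
// A document missing from one map is treated as scoring 0 there.
std::vector<std::pair<int, double>> RankCombined(
    const std::map<int, double>& keywordScores,
    const std::map<int, double>& conceptScores, double alpha) {
  std::map<int, double> combined;
  for (const auto& k : keywordScores)
    combined[k.first] += CombineScores(k.second, 0.0, alpha);
  for (const auto& c : conceptScores)
    combined[c.first] += CombineScores(0.0, c.second, alpha);
  std::vector<std::pair<int, double>> ranked(combined.begin(), combined.end());
  std::sort(ranked.begin(), ranked.end(),
            [](const std::pair<int, double>& a,
               const std::pair<int, double>& b) {
              if (a.second != b.second) return a.second > b.second;
              return a.first < b.first;  // deterministic tie-break
            });
  return ranked;
}
```

Under this assumption, retrieval by keywords only or by concepts only corresponds to the two endpoints of the alpha range.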
CREATE PROFILE
|
Main program
source file
|
createprofile/createprofilepgm.cc
|
|
API Definition
source file
|
createprofile/createprofile.cc
|
|
Include file
|
include/createprofile.h
|
|
API Function
|
void create_profile(prefilename, dict, post, inputType, input,
wtFlag, subjectTree, lcaHTMLsDir, output, numWords, numCat,
prune_threshold, min_weightThold, update, date_processed)
|
|
Parameters
|
dict
|
Path of the training dictionary file
|
|
post
|
Path of the training postings
file
|
|
prefilename
|
Path to file specifying document
pre-processing options
|