Semantic Annotation for ACM Categories

Given a research paper, we annotate it with the Wikipedia article that most closely matches its content.

Project Description

The aim of this project is to assign a label to a document using semantic analysis. We test our algorithm on the ACM dataset, using the Wikipedia pages of the ACM categories.

Methodology

We followed these steps:

Pre-process the data, for example by removing non-alphanumeric characters and punctuation.
Divide the data into testing and training sets.
Convert the data into a vector space model using the Word2Vec and Doc2Vec algorithms.
Train a classifier using the word vectors as features.
Cross-validate and report the results.
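
The splitting and vectorization steps above can be sketched in plain Python. This is a minimal illustration, not the project's code: `TOY_VECTORS` is a hypothetical hand-written table standing in for trained Word2Vec vectors, and `split_dataset` and `document_vector` are names invented for this example. Averaging word vectors is one common way to get a document feature vector; the project may combine them differently.

```python
import random

# Hypothetical toy embedding table standing in for trained Word2Vec vectors.
TOY_VECTORS = {
    "graph": [1.0, 0.0],
    "algorithm": [0.8, 0.2],
    "database": [0.0, 1.0],
    "query": [0.2, 0.8],
}


def split_dataset(items, test_fraction=0.2, seed=0):
    """Shuffle a copy of items and split it into (train, test)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def document_vector(tokens, vectors, dim=2):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(component) / len(known) for component in zip(*known)]


docs = [
    ["graph", "algorithm"],
    ["database", "query"],
    ["graph", "query"],
    ["unknown"],
    ["database"],
]
train, test = split_dataset(docs)
features = [document_vector(d, TOY_VECTORS) for d in train]
```

The resulting `features` list is what a classifier would be trained on in the next step.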

Source Code

ACM

run.py

This script represents the dataset in a vector space model and saves the resulting models.

pre_process.py

This script cleans the dataset.
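
A cleaning step of this kind might look like the sketch below. It assumes that cleaning means lowercasing the text and stripping punctuation and other non-alphanumeric characters; `clean_text` is a hypothetical name, not a function taken from pre_process.py, and stemming is left out to keep the example dependency-free.

```python
import re


def clean_text(text):
    """Lowercase text, replace non-alphanumeric characters with spaces, and tokenize."""
    lowered = text.lower()
    # Anything that is not a lowercase letter, digit, or whitespace becomes a space.
    stripped = re.sub(r"[^a-z0-9\s]", " ", lowered)
    return stripped.split()


tokens = clean_text("Graph-based Algorithms: a survey (2nd ed.)")
print(tokens)
# → ['graph', 'based', 'algorithms', 'a', 'survey', '2nd', 'ed']
```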

classify.py

This script trains the classifier and reports the cross-validation results.
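
To illustrate what training with cross-validation involves, here is a small self-contained sketch. The source does not say which classifier classify.py uses, so a nearest-centroid classifier is swapped in purely for illustration; all function names and the toy data are assumptions. Folds are contiguous and unshuffled to keep the code short; a real run would normally shuffle or stratify first.

```python
import math


def k_fold_indices(n_items, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_items) if i < start or i >= start + size]
        yield train, test
        start += size


def centroids(features, labels):
    """Average the feature vectors belonging to each label."""
    grouped = {}
    for vec, label in zip(features, labels):
        grouped.setdefault(label, []).append(vec)
    return {
        label: [sum(c) / len(vecs) for c in zip(*vecs)]
        for label, vecs in grouped.items()
    }


def predict(vec, label_centroids):
    """Return the label whose centroid is closest in Euclidean distance."""
    return min(label_centroids, key=lambda lb: math.dist(vec, label_centroids[lb]))


def cross_validate(features, labels, k=3):
    """Return the mean accuracy over k folds."""
    accuracies = []
    for train, test in k_fold_indices(len(features), k):
        model = centroids([features[i] for i in train], [labels[i] for i in train])
        correct = sum(predict(features[i], model) == labels[i] for i in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / len(accuracies)


# Toy document vectors for two hypothetical categories, "A" and "B".
toy_features = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0], [0.1, 0.9], [0.2, 0.8]]
toy_labels = ["A", "A", "A", "B", "B", "B"]
print(cross_validate(toy_features, toy_labels, k=3))
```

On this cleanly separated toy data every fold is classified correctly; real document vectors will be far noisier.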

Analysis

We present our analysis graphically.

MAP value (with Stemming): MAP Comparison

NDCG value (with Stemming): NDCG Comparison
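
For readers unfamiliar with the two metrics, the sketch below shows one standard way to compute MAP and NDCG from ranked results. It is not the project's evaluation code: the function names are hypothetical, average precision is normalized by the number of relevant items that appear in the ranking (which assumes every relevant item was retrieved), and DCG uses the common gain / log2(rank + 1) form.

```python
import math


def average_precision(ranked_relevance):
    """Average precision for one ranking, given 0/1 relevance flags in rank order."""
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(rankings):
    """Mean of average precision over several rankings (one per query or paper)."""
    return sum(average_precision(r) for r in rankings) / len(rankings)


def dcg(gains):
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg(gains):
    """DCG divided by the DCG of the ideal (descending) ordering of the same gains."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


print(mean_average_precision([[1, 0, 1], [0, 1]]))
print(ndcg([0, 1]))
```

Both metrics reward placing the correct Wikipedia article near the top of the ranked candidates; NDCG additionally supports graded rather than binary relevance.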

Link to resources

Presentation

Video

Report

Contributors

Sakala Venkata Krishna Rohit @rohitsakala
Sharvil Katariya @scorpionhiccup
Nikhil Chavanke