Project Description
The aim of this project is to annotate a document with a label using semantic analysis. We test our algorithm on the ACM dataset with the help of the Wikipedia pages of the categories.
Methodology
We followed this process:
- Pre-process the data, for example by removing non-alphanumeric characters such as punctuation.
- Divide the data into testing and training sets.
- Represent the data in a vector space model using the Word2Vec and Doc2Vec algorithms.
- Train a classifier using the word vectors as features.
- Cross-validate and report the results.
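The train/test split in the second step could look like the following minimal sketch. The function name `train_test_split`, the 20% test fraction, and the fixed seed are assumptions made for illustration; the project's own split may differ.

```python
import random

def train_test_split(documents, labels, test_fraction=0.2, seed=0):
    """Shuffle (document, label) pairs and split them into train and test sets."""
    if len(documents) != len(labels):
        raise ValueError("documents and labels must have the same length")
    pairs = list(zip(documents, labels))
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(pairs)
    # Hold out at least one pair so the test set is never empty.
    n_test = max(1, int(len(pairs) * test_fraction))
    return pairs[n_test:], pairs[:n_test]

# Toy data standing in for the real documents and their category labels.
docs = ["doc %d" % i for i in range(10)]
labels = ["A" if i % 2 == 0 else "B" for i in range(10)]
train, test = train_test_split(docs, labels)
print(len(train), len(test))
```

Shuffling before splitting avoids a test set made up of a single category when the input is ordered by label.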
Source Code
ACM
run.py
This script represents the dataset in a vector space model and saves the resulting models.
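One common way to turn word vectors into a fixed-length document feature is to average them. The sketch below illustrates that idea only; whether run.py averages vectors is an assumption, and `toy_vectors` is a hypothetical lookup table standing in for a trained Word2Vec model.

```python
def average_vector(tokens, word_vectors, dim):
    """Average the vectors of known tokens; return a zero vector if none are known."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return [0.0] * dim
    # zip(*known) groups the i-th component of every vector together.
    return [sum(component) / len(known) for component in zip(*known)]

# Hypothetical 2-dimensional vectors; a real model would supply these.
toy_vectors = {
    "neural": [1.0, 0.0],
    "network": [0.0, 1.0],
    "retrieval": [1.0, 1.0],
}

# "unseen" is not in the table, so it is skipped.
doc_vector = average_vector(["neural", "network", "unseen"], toy_vectors, dim=2)
print(doc_vector)  # [0.5, 0.5]
```

Skipping out-of-vocabulary tokens keeps the average well defined for documents that contain words the model never saw.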
pre_process.py
This script cleans the dataset.
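A cleaning step of this kind might look like the sketch below. The rules shown (lowercasing, replacing anything other than letters, digits, and whitespace with a space) and the name `clean_text` are assumptions, not the actual contents of pre_process.py.

```python
import re

# Hypothetical rule: anything that is not a lowercase letter, digit, or whitespace.
NON_ALNUM = re.compile(r"[^a-z0-9\s]")

def clean_text(text):
    """Lowercase the text, replace other characters with spaces, and tokenize."""
    lowered = text.lower()
    stripped = NON_ALNUM.sub(" ", lowered)
    return stripped.split()

print(clean_text("Semantic Annotation: Word2Vec, Doc2Vec & k-NN!"))
# ['semantic', 'annotation', 'word2vec', 'doc2vec', 'k', 'nn']
```

Replacing punctuation with a space rather than deleting it keeps hyphenated words from being fused into one token.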
classify.py
This script trains the classifier and reports the cross-validation results.
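The tags below mention KNN, so the following sketch assumes a k-nearest-neighbour classifier over document vectors with cosine similarity, evaluated by k-fold cross-validation. The function names, the strided fold assignment, and the toy data are illustrative assumptions rather than the code in classify.py.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def knn_predict(train, query, k=3):
    """train is a list of (vector, label); return the majority label of the k most similar."""
    nearest = sorted(train, key=lambda pair: cosine(pair[0], query), reverse=True)[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def cross_validate(data, n_folds=5, k=3):
    """Return per-fold accuracy; fold i holds out every n_folds-th item starting at i."""
    scores = []
    for i in range(n_folds):
        test = data[i::n_folds]
        train = [item for j, item in enumerate(data) if j % n_folds != i]
        correct = sum(knn_predict(train, vec, k) == label for vec, label in test)
        scores.append(correct / len(test))
    return scores

# Toy, clearly separable 2-dimensional data: five "A" and five "B" documents.
data = [([1.0, 0.1 * i], "A") for i in range(1, 6)]
data += [([0.1 * i, 1.0], "B") for i in range(1, 6)]
scores = cross_validate(data, n_folds=5, k=3)
print(sum(scores) / len(scores))
```

Each item is held out exactly once, so the mean of the fold scores uses every labelled document for evaluation.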
Analysis
We present our analysis graphically.
MAP value (with Stemming):
NDCG value (with Stemming):
MAP value (with Stemming):
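For reference, MAP and NDCG can be computed from a ranked list of predicted labels as in this sketch. It uses the standard definitions (average precision normalized by the number of relevant items, and DCG with a log2 discount); the function names and the category labels in the example are hypothetical, and the project's own evaluation code may differ.

```python
import math

def average_precision(ranked, relevant):
    """AP for one document: mean of precision@k over ranks k that hold a relevant label."""
    hits, total = 0, 0.0
    for k, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(rankings, relevants):
    """MAP: the mean of the per-document average precision values."""
    pairs = list(zip(rankings, relevants))
    return sum(average_precision(r, rel) for r, rel in pairs) / len(pairs)

def ndcg(ranked, gains, k=None):
    """NDCG with gains given as {label: graded relevance}; missing labels count as 0."""
    k = len(ranked) if k is None else k

    def dcg(values):
        # Rank 1 is discounted by log2(2) = 1, rank 2 by log2(3), and so on.
        return sum(v / math.log2(i + 2) for i, v in enumerate(values))

    actual = dcg([gains.get(item, 0) for item in ranked[:k]])
    ideal = dcg(sorted(gains.values(), reverse=True)[:k])
    return actual / ideal if ideal else 0.0

# Hypothetical ranking of category labels for one document.
ranked = ["H.3", "I.2", "D.2"]
ap = average_precision(ranked, {"H.3", "D.2"})
score = ndcg(ranked, {"H.3": 1, "D.2": 1})
print(round(ap, 4), round(score, 4))
```

MAP rewards placing correct labels early in the ranking, while NDCG additionally supports graded relevance through the gain values.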
Tags
- Information Retrieval and Extraction Course
- IIIT Hyderabad
- Semantic
- Annotation
- Documents
- Word2Vec
- Doc2Vec
- ACM
- Research Papers
- Neural Networks
- KNN
- NDCG
- MAP
- IIIT-H
- Major Project
Link to resources
Contributors
- Sakala Venkata Krishna Rohit @rohitsakala
- Sharvil Katariya @scorpionhiccup
- Nikhil Chavanke