Dual-Mode Text Similarity Checker using TF-IDF and GloVe Embeddings in Flask
Abstract:
This project presents a web-based application designed to compute the similarity between two text inputs using two distinct Natural Language Processing (NLP) approaches: TF-IDF with Cosine Similarity and GloVe word embeddings. Built using Flask, the application allows users to input two sentences and choose the similarity method. The TF-IDF approach focuses on word frequency patterns, while GloVe captures semantic relationships between words. This dual-mode functionality enables a broader understanding of text similarity, which can be applied in domains like plagiarism detection, duplicate content detection, and semantic search.
Introduction:
Text similarity is a fundamental task in Natural Language Processing (NLP): given two texts, it estimates how closely they match in wording or meaning. It has widespread applications in chatbots, recommendation systems, plagiarism detection, and information retrieval. This project introduces a lightweight, user-friendly web application that lets users compare two widely used similarity methods, TF-IDF and GloVe embeddings, offering flexibility and insight into the strengths of each approach. The platform uses Flask for backend logic and HTML for the frontend, so similarity can be checked interactively through a browser interface.
Problem Statement:
Traditional applications of text similarity often rely on a single method, making them less adaptable to diverse semantic and syntactic structures in language. There is a need for an accessible tool that allows users to compare and contrast different text similarity models to better understand and interpret text-based relationships.
Existing System and Disadvantages:
Existing System:
- Existing text similarity tools typically use either TF-IDF or embedding-based approaches, but not both side by side.
- Many of them are limited to offline scripts, command-line interfaces, or require high computational resources.
Disadvantages:
- Lack of a user-friendly interface for real-time usage.
- Inflexibility in selecting or comparing different algorithms.
- TF-IDF alone cannot capture semantic relationships such as synonymy, because it only matches terms the two texts share.
Proposed System and Advantages:
Proposed System:
This project proposes a dual-mode web-based text similarity tool. It combines:
- TF-IDF + Cosine Similarity: Focuses on token frequency.
- GloVe Embeddings + Cosine Similarity: Captures semantic relationships.
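The lexical mode can be illustrated with a minimal sketch using scikit-learn. The function name `tfidf_similarity` is hypothetical, and fitting the vectorizer on only the two input texts is an assumption; the project may fit it on a larger corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def tfidf_similarity(text_a: str, text_b: str) -> float:
    """Return the cosine similarity of the two texts' TF-IDF vectors (0.0 to 1.0)."""
    # Assumption: the vectorizer is fitted on just the two inputs, so the
    # "corpus" used for the IDF weights contains only two documents.
    try:
        matrix = TfidfVectorizer().fit_transform([text_a, text_b])
    except ValueError:
        # scikit-learn raises ValueError when no usable tokens are found
        # (empty vocabulary), e.g. when both inputs are empty.
        return 0.0
    return float(cosine_similarity(matrix[0], matrix[1])[0, 0])


if __name__ == "__main__":
    # Texts with some shared words score between 0 and 1.
    print(tfidf_similarity("the cat sat on the mat", "the cat lay on the rug"))
    # Paraphrases with no shared words score 0.0 in this mode.
    print(tfidf_similarity("great film", "excellent movie"))
```

The second call shows why a semantic mode is useful: "great film" and "excellent movie" mean nearly the same thing, yet they share no tokens, so a purely lexical score is zero.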
Advantages:
- Easy-to-use web interface built using Flask.
- Users can select their preferred similarity method.
- Supports semantic as well as lexical similarity analysis.
- Can be integrated into larger applications for text comparison.
Modules:
- User Interface Module: Frontend HTML form for inputting text and selecting method.
- Text Preprocessing Module: Converts text to lowercase, removes extra spaces, and prepares input.
- TF-IDF Similarity Module: Converts input text to TF-IDF vectors and calculates cosine similarity.
- GloVe Similarity Module: Uses pre-trained GloVe vectors to compute semantic similarity.
- Results Display Module: Shows similarity percentage and stores results (optional).
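One way these modules might fit together in Flask is sketched below. Everything named here is an assumption made for illustration: the `create_app` factory, the inline `PAGE` template, the form field names `text1`, `text2` and `method`, and the idea of passing the similarity functions in as a dictionary. The project's actual routes and templates may differ.

```python
from typing import Callable, Dict

from flask import Flask, render_template_string, request

# A scorer takes two preprocessed texts and returns a similarity in 0.0-1.0.
Scorer = Callable[[str, str], float]

# Hypothetical inline template; a real project would likely use a template file.
PAGE = """
<form method="post">
  <textarea name="text1">{{ text1 }}</textarea>
  <textarea name="text2">{{ text2 }}</textarea>
  <select name="method">
    {% for name in methods %}<option value="{{ name }}">{{ name }}</option>{% endfor %}
  </select>
  <button type="submit">Compare</button>
</form>
{% if score is not none %}<p>Similarity: {{ "%.2f"|format(score) }}%</p>{% endif %}
{% if error %}<p>{{ error }}</p>{% endif %}
"""


def preprocess(text: str) -> str:
    """Lowercase the text and collapse runs of whitespace into single spaces."""
    return " ".join(text.lower().split())


def create_app(scorers: Dict[str, Scorer]) -> Flask:
    """Build the app; `scorers` maps a method name to its similarity function."""
    app = Flask(__name__)

    @app.route("/", methods=["GET", "POST"])
    def index():
        raw1 = request.form.get("text1", "")
        raw2 = request.form.get("text2", "")
        score, error = None, None
        if request.method == "POST":
            text1, text2 = preprocess(raw1), preprocess(raw2)
            scorer = scorers.get(request.form.get("method", ""))
            if scorer is None:
                error = "Unknown similarity method."
            elif not text1 or not text2:
                error = "Please enter both texts."
            else:
                # Convert the 0.0-1.0 score into a percentage for display.
                score = 100.0 * scorer(text1, text2)
        return render_template_string(
            PAGE, text1=raw1, text2=raw2, methods=list(scorers),
            score=score, error=error,
        )

    return app
```

With this layout, each similarity module only has to supply a function of two strings, for example `create_app({"tfidf": some_tfidf_function, "glove": some_glove_function}).run()`, where both function names are placeholders.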
Algorithms Used:
- TF-IDF (Term Frequency-Inverse Document Frequency):
  - Weights each term by how often it appears in a text, discounted by how many texts in the collection contain it.
  - Measures similarity between the resulting vectors using Cosine Similarity.
- GloVe (Global Vectors for Word Representation):
  - Pre-trained word embeddings that capture word semantics.
  - Sentence vectors are created by averaging individual word vectors.
  - Similarity between sentence vectors is measured using Cosine Similarity.
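The GloVe steps above (average the word vectors, then take the cosine) can be sketched with NumPy. The function names are hypothetical, the whitespace tokenization is an assumption, and the three-dimensional vectors are invented purely for illustration; real GloVe vectors are loaded from a pre-trained file.

```python
from typing import Dict, Optional

import numpy as np


def sentence_vector(text: str, vectors: Dict[str, np.ndarray]) -> Optional[np.ndarray]:
    """Average the vectors of the words present in `vectors`; None if none are known."""
    # Assumption: simple lowercase + whitespace tokenization.
    known = [vectors[word] for word in text.lower().split() if word in vectors]
    if not known:
        return None
    return np.mean(known, axis=0)


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between a and b; 0.0 if either vector has zero length."""
    denom = float(np.linalg.norm(a) * np.linalg.norm(b))
    return float(np.dot(a, b)) / denom if denom else 0.0


def glove_similarity(text_a: str, text_b: str, vectors: Dict[str, np.ndarray]) -> float:
    """Cosine similarity of the two averaged sentence vectors."""
    vec_a = sentence_vector(text_a, vectors)
    vec_b = sentence_vector(text_b, vectors)
    if vec_a is None or vec_b is None:
        return 0.0  # no known words in at least one text
    return cosine(vec_a, vec_b)


# Toy vectors, made up for this example only.
toy = {
    "film": np.array([0.9, 0.1, 0.0]),
    "movie": np.array([0.8, 0.2, 0.0]),
    "great": np.array([0.1, 0.9, 0.1]),
    "excellent": np.array([0.3, 0.7, 0.1]),
    "banana": np.array([0.0, 0.1, 0.9]),
}

if __name__ == "__main__":
    print(glove_similarity("great film", "excellent movie", toy))  # close to 1
    print(glove_similarity("great film", "banana", toy))           # much lower
```

Two caveats apply to this sketch. Averaging discards word order, so "dog bites man" and "man bites dog" get identical sentence vectors. Cosine similarity between embeddings can also be negative, so a percentage display may need to clamp or rescale the value.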
Software Requirements:
- Python 3.x
- Flask
- Scikit-learn
- NumPy
- Pandas
- Pre-trained GloVe file (glove.6B.50d.txt)
- HTML/CSS (for frontend)
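The pre-trained GloVe file is plain text, with one token per line followed by its space-separated vector components. A minimal loader might look like the sketch below; the function names `parse_glove` and `load_glove` are hypothetical, and the default path assumes the file sits in the working directory.

```python
from typing import Dict, Iterable

import numpy as np


def parse_glove(lines: Iterable[str]) -> Dict[str, np.ndarray]:
    """Parse lines of the form 'word v1 v2 ... vN' into a word -> vector dict."""
    vectors: Dict[str, np.ndarray] = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        # First field is the token; the rest are the vector components.
        vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors


def load_glove(path: str = "glove.6B.50d.txt") -> Dict[str, np.ndarray]:
    """Read the whole file into memory (assumed location: current directory)."""
    with open(path, encoding="utf-8") as handle:
        return parse_glove(handle)
```

Because the file is large, it is usually worth calling a loader like this once when the application starts and reusing the dictionary across requests, rather than reading the file for every comparison.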
Hardware Requirements:
- Minimum 2 GB RAM
- Processor: Intel i3 or equivalent
- Disk Space: 500 MB+
- Any OS with Python support (Windows/Linux/macOS)
Conclusion:
The “Dual-Mode Text Similarity Checker” serves as a lightweight yet powerful platform to compare sentence similarities using both lexical and semantic-based approaches. With TF-IDF and GloVe under one roof, users get insights into both frequency-based and meaning-based similarities. The system bridges the gap between academic models and practical usage through an intuitive interface.
Future Enhancement:
- Add support for more advanced models like BERT or Sentence Transformers.
- Include visualization of word vectors or similarity matrix.
- Allow file uploads for batch similarity checks.
- Support multi-language input using multilingual embeddings.
- Integrate with plagiarism detection systems or search engines.

