Tired of sifting through endless PDFs to find the information you need? pdFRAG is a RAG chatbot that changes how you interact with documents.
pdFRAG is a Retrieval-Augmented Generation (RAG) chatbot built with Python and Streamlit. You ask questions about your PDF documents and receive summarized responses based on their content. It combines text embedding, semantic search, and language models to produce context-aware answers. Whether you're a researcher, student, or professional, pdFRAG can help you get more out of academic papers and technical documents. Dive in, ask questions, and let the chatbot do the heavy lifting for you.
Key Features
- PDF Text Extraction: Extracts text from uploaded PDF documents.
- Text Chunking: Splits large documents into smaller, manageable chunks for processing.
- Semantic Search: Uses FAISS for efficient similarity search to retrieve relevant text chunks.
- Language Model Integration: Generates summaries and responses using Ollama and its language models (e.g., Qwen2.5, deepseek, or llama3.*).
- Interactive Web Interface: Built with Streamlit, providing an easy-to-use chat interface.
- Metadata Extraction: Extracts document metadata (e.g., title, authors, DOI) for better context.
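To make the Text Chunking feature concrete, here is a minimal sketch of splitting extracted text into overlapping windows. It is not pdFRAG's actual splitter (the project lists langchain, which may handle this step); the function name `chunk_text` and the `chunk_size` and `overlap` values are assumptions chosen for illustration.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Hypothetical helper: the sizes are illustrative, not pdFRAG's settings.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    for piece in chunk_text("abcdefghij", chunk_size=4, overlap=1):
        print(piece)
```

The overlap means a sentence that straddles a chunk boundary still appears in full in at least one chunk, which tends to help retrieval.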
Advantages in Information Security
- Local Processing: All document processing, embedding, and summarization are performed locally, ensuring that sensitive data never leaves your environment.
- No Cloud Dependency: Unlike cloud-based solutions, this chatbot runs offline once the embedding and language models have been downloaded, reducing the risk of data breaches or unauthorized access.
- Customizable Security: You can control the entire pipeline, from document storage to model inference, ensuring compliance with your organization's security policies.
- Privacy-Preserving: No third-party APIs or external services are used, guaranteeing full data ownership and privacy.
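The local-only design described above can be illustrated with a sketch of how a question and its retrieved chunks might be sent to an Ollama server on the same machine. This uses Ollama's REST endpoint on its default port (`localhost:11434`) with the standard library only; pdFRAG itself may go through the `ollama` Python package instead. The prompt wording, the function names, and the default `llama2` model are assumptions.

```python
import json
import urllib.request

# Ollama's default local address; nothing here points at a remote host.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_prompt(question, chunks):
    """Combine retrieved chunks and the question (illustrative wording)."""
    context = "\n\n".join(chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


def build_request(question, chunks, model="llama2"):
    """Prepare a POST request for the local Ollama server (not yet sent)."""
    payload = {
        "model": model,
        "prompt": build_prompt(question, chunks),
        "stream": False,  # ask for a single JSON reply instead of a stream
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question, chunks, model="llama2"):
    """Send the request; requires `ollama serve` to be running locally."""
    with urllib.request.urlopen(build_request(question, chunks, model)) as resp:
        return json.loads(resp.read())["response"]
```

Calling `ask(...)` only works while the Ollama server is running and the chosen model has already been pulled.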
Use Cases
- Research Assistance: Quickly extract and summarize information from academic papers or reports.
- Document Q&A: Ask questions about the content of PDFs and get instant answers.
- Knowledge Management: Organize and interact with large collections of documents.
Technologies Used
- Python Libraries: streamlit, pypdf, langchain, sentence-transformers, faiss, ollama.
- NLP Models: all-MiniLM-L6-v2 for embeddings, llama2 (or other Ollama models) for summarization.
- Vector Database: FAISS for fast and efficient similarity search.
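As a rough picture of what the similarity search computes, the sketch below ranks chunk vectors by cosine similarity to a query vector using plain Python. It is a brute-force stand-in, not FAISS and not pdFRAG's code: FAISS performs the same kind of nearest-neighbor lookup far more efficiently, and which index type and distance metric the project uses is not stated here. The tiny 2-dimensional vectors are made up; all-MiniLM-L6-v2 produces 384-dimensional embeddings.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def top_k(query, chunk_vectors, k=3):
    """Return (chunk_index, score) pairs for the k most similar chunks."""
    q = normalize(query)
    scored = []
    for idx, vec in enumerate(chunk_vectors):
        v = normalize(vec)
        score = sum(a * b for a, b in zip(q, v))
        scored.append((score, idx))
    # Highest cosine similarity first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(idx, score) for score, idx in scored[:k]]


if __name__ == "__main__":
    # Toy vectors standing in for chunk embeddings.
    vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    for idx, score in top_k([1.0, 0.0], vectors, k=2):
        print(idx, round(score, 3))
```

The indices returned by `top_k` would then be used to look up the matching text chunks, which become the context passed to the language model.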
Getting Started
- Clone the repository:
git clone https://github.com/Yu-optibayeslab/pdFRAG.git
- Or download the code from the GitHub page:
https://github.com/Yu-optibayeslab/pdFRAG
- Install dependencies:
pip install -r requirements.txt
- Start the Ollama server:
ollama serve
- Run the Streamlit app:
streamlit run RAG_Chatbot_StandaloneLLM.py
- Open the provided URL in your browser and start chatting with your PDFs!
Introductory video: https://youtu.be/SMGSSTOlulk