
Drive Doc Bot demonstrates how deterministic ingestion combined with retrieval-augmented chat can bring order to sprawling Drive ecosystems. It provides a chat-based search interface over Google Drive documents, linking all answers back to their sources. This turns siloed Drive folders into a conversational knowledge base without migrating files out of Google Workspace.
Drive Doc Bot automates the end-to-end ingestion and retrieval process. A backend script monitors Drive folders and downloads new or updated files, embed_gdrive.py splits each document into semantic chunks of roughly 1,500 characters, and OpenAI embeddings index the content. A FastAPI backend serves the indexed data to a React/Tailwind chat UI. Every answer carries metadata (file type, chunk index, and Drive webView link), so responses are grounded in the original documents and remain explainable and auditable. The result is a clear provenance trail: every answer shows its original source in Drive, which supports governance and audit requirements.
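To make the provenance idea concrete, here is a minimal sketch of how chunk metadata might be turned into citations. The class name, function names, the `file_name` field, and the placeholder link are assumptions for illustration; the source only states that file type, chunk index, and the Drive webView link are stored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkMetadata:
    """Provenance kept alongside each embedded chunk (field names are assumed)."""
    file_name: str       # assumed field, not listed in the source
    file_type: str
    chunk_index: int
    web_view_link: str


def format_citation(meta: ChunkMetadata) -> str:
    """Render a one-line source citation to show under an answer."""
    return (
        f"{meta.file_name} ({meta.file_type}, chunk {meta.chunk_index}): "
        f"{meta.web_view_link}"
    )


def cite_sources(chunks: list[ChunkMetadata]) -> list[str]:
    """Drop duplicate citations while keeping retrieval order."""
    seen: set[str] = set()
    citations: list[str] = []
    for meta in chunks:
        line = format_citation(meta)
        if line not in seen:
            seen.add(line)
            citations.append(line)
    return citations


if __name__ == "__main__":
    # FILE_ID is a placeholder, not a real Drive file.
    meta = ChunkMetadata(
        "handbook.pdf", "pdf", 3, "https://drive.google.com/file/d/FILE_ID/view"
    )
    print(cite_sources([meta, meta]))
```

Keeping the metadata in a small immutable record makes it straightforward to attach the same citation to every chunk retrieved for an answer.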
Configure your LLM integration settings.
Drag and drop your JSON file for quick setup.
Connect to Google Drive with a single click.
View a window displaying existing files in your Drive.
Sync Google Drive data automatically, so new and updated files are picked up without manual steps.
Access a list of all synced files.
Embed data from Google Drive into Pinecone for AI application readiness.
Review all embedded files, including the total number of chunks processed.
Chat with your Google Drive data using Generative AI services.
Submit queries and receive AI-generated answers grounded in your Drive documents.
Transforms siloed Drive folders into an on-demand, conversational knowledge base without migrating files out of Google Workspace.
Sales, support, and consulting teams get instant answers from up-to-date internal documents, reducing hours spent on research.
Automates document ingestion and semantic search, accelerating tasks like proposal writing, onboarding, and compliance audits.
Every answer includes a link to the original source in Drive, maintaining governance and audit trails for compliance teams.
Breaks files into ~1500-character chunks while detecting topic shifts (via cosine similarity) to avoid splitting in the middle of a thought.
Each chunk is indexed with metadata (file type, chunk index, webView link, processing method) so answers can point back to the exact source file and chunk in Drive.
Tracks file changes so only new or modified documents are reprocessed, using batching and concurrency controls for efficiency.
Utilizes UnstructuredLoader to handle 60+ formats (PDFs, Word docs, spreadsheets, etc.), making virtually all Drive content searchable.
Structured logs, status trackers, and optional streaming responses ensure full visibility into the ingestion and query processes.
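The semantic chunking described above can be sketched as follows. This is an illustration under stated assumptions, not the logic of embed_gdrive.py: a toy bag-of-words vector stands in for real embeddings, and the function names and the 0.1 similarity threshold are hypothetical. Only the roughly 1,500-character limit and the use of cosine similarity to detect topic shifts come from the description.

```python
import math
import re
from collections import Counter


def embed(sentence: str) -> Counter:
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(
    text: str, max_chars: int = 1500, threshold: float = 0.1
) -> list[str]:
    """Group sentences into chunks, breaking at topic shifts or the size limit."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks: list[str] = []
    current: list[str] = []
    for sentence in sentences:
        if current:
            # Would appending this sentence push the chunk past the limit?
            too_long = len(" ".join(current)) + 1 + len(sentence) > max_chars
            # Low similarity to the previous sentence is treated as a topic shift.
            topic_shift = cosine(embed(current[-1]), embed(sentence)) < threshold
            if too_long or topic_shift:
                chunks.append(" ".join(current))
                current = []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


if __name__ == "__main__":
    sample = (
        "Invoices are due in thirty days. Late invoices incur a fee. "
        "The office cat sleeps all afternoon."
    )
    for chunk in semantic_chunks(sample):
        print(repr(chunk))
```

Because the sketch only breaks between sentences, a single sentence longer than the limit stays in one oversized chunk. A production version would need a fallback split for that case.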
Python backend with embed_gdrive.py using LangChain Unstructured and Pinecone libraries.
OpenAI text-embedding-3-small model for embeddings, stored in Pinecone vector indexes.
scikit-learn for cosine similarity, numpy for vector math, nltk as a fallback tokenizer, and multithreading for concurrent file processing.
FastAPI (with Uvicorn) backend handling ingestion, queries, and streaming chat responses.
React + Next.js with Tailwind UI for the chat interface, file pickers, and citation cards.
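As a rough illustration of the change tracking and concurrent file processing mentioned above, the sketch below reprocesses only files whose `modifiedTime` differs from the last recorded value. The `id` and `modifiedTime` keys mirror fields of Drive API file resources. The function names, the in-memory `seen` dictionary, and the placeholder `process_file` are assumptions, not the project's actual code.

```python
from concurrent.futures import ThreadPoolExecutor


def select_changed(files: list[dict], seen: dict[str, str]) -> list[dict]:
    """Return files that are new or whose modifiedTime changed since the last run."""
    return [f for f in files if seen.get(f["id"]) != f["modifiedTime"]]


def process_file(file: dict) -> tuple[str, str]:
    """Placeholder for download, chunking, and embedding of one file."""
    return file["id"], file["modifiedTime"]


def sync(files: list[dict], seen: dict[str, str], max_workers: int = 4) -> list[str]:
    """Process changed files in a thread pool and record their new state."""
    changed = select_changed(files, seen)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order.
        results = list(pool.map(process_file, changed))
    for file_id, modified in results:
        seen[file_id] = modified
    return [file_id for file_id, _ in results]


if __name__ == "__main__":
    state = {"a": "2024-01-01T00:00:00Z"}
    listing = [
        {"id": "a", "modifiedTime": "2024-01-01T00:00:00Z"},
        {"id": "b", "modifiedTime": "2024-02-01T00:00:00Z"},
    ]
    print(sync(listing, state))  # only the new file is processed
    print(sync(listing, state))  # nothing left to do on the second run
```

In a real deployment the `seen` state would have to persist between runs, for example in a small database or a JSON file.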
Drive Doc Bot demonstrates how retrieval-augmented generation can be applied directly to enterprise file systems in a controlled and transparent way. By combining deterministic ingestion, semantic chunking, and metadata-driven retrieval, it enables reliable conversational access to Google Drive content without compromising accuracy or governance. The system highlights a practical pattern for turning document repositories into trustworthy, AI-powered knowledge layers that scale with enterprise needs.

Point Drive Doc Bot at your Google Drive workspace, kick off the embedding job, and start chatting with your internal knowledge base. Every answer links straight to the source file. For more information and related resources, visit the GenAI Protos website and our blog.