Bonus Section: Real-time RAG with LlamaIndex and Pathway

What is a Bonus Section? Check here: https://iitk-bhu-llm.gitbook.io/coursework/welcome-to-the-course.-bienvenue/course-syllabus-and-timelines#what-are-bonus-sections-resources


LlamaIndex is a popular RAG framework designed to augment LLMs by seamlessly integrating external, domain-specific data.

If you're already familiar with LlamaIndex and want to pair it with Pathway for enhanced data processing and retrieval, this bonus section aims to facilitate that integration. First off, below is a Tweet that should be easy to follow if you've already built RAG applications (even if not with real-time data).

Now let's quickly understand this and help you get started, beginning with the basics.

What are Retrievers in LlamaIndex?

Retrievers play a critical role in the LlamaIndex ecosystem. They are tasked with fetching the most relevant context for a given user query or message. This process involves:

  • Efficiently retrieving relevant context from an index based on the query.

  • Being a crucial component in query engines and chat engines for delivering pertinent information.

  • They can be built on top of indexes or defined independently, which underscores their versatility.
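To make the idea concrete, here is a minimal, library-free sketch of what a retriever does: score stored chunks against a query and return the best matches. The bag-of-words "embedding" and the sample chunks are illustrative assumptions, not LlamaIndex internals.

```python
from collections import Counter
import math

def embed(text):
    """Toy 'embedding': a bag-of-words count vector (illustrative only)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class ToyRetriever:
    """Fetches the top-k most relevant chunks for a query, like a retriever does."""
    def __init__(self, chunks):
        self.index = [(chunk, embed(chunk)) for chunk in chunks]

    def retrieve(self, query, top_k=2):
        q = embed(query)
        ranked = sorted(self.index, key=lambda c: cosine(q, c[1]), reverse=True)
        return [chunk for chunk, _ in ranked[:top_k]]

retriever = ToyRetriever([
    "Pathway is a data processing framework for live data",
    "LlamaIndex connects LLMs to external data",
    "Docker packages applications into containers",
])
print(retriever.retrieve("what is pathway data processing", top_k=1))
```

A real retriever swaps the toy vectors for dense embeddings and the linear scan for a vector index, but the contract is the same: query in, most relevant context out.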

Pathway Retriever and its Integration with LlamaIndex

Key Features of the Integration:

  • Live Data Indexing Pipeline: Monitors various data sources for changes, parses and embeds documents using LlamaIndex methods, and builds a vector index.

  • Simple to Complex Pipelines: While the basic pipeline focuses on indexing files from cloud storage, Pathway supports more sophisticated operations like SQL-like operations, time-based grouping, and a wide range of connectors for comprehensive data pipeline construction.

  • Ease of Setup: The integration process involves installing necessary packages, setting up environment variables, and configuring data sources to be tracked by Pathway.
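As a rough illustration of the live-indexing idea (a conceptual sketch, not Pathway's actual mechanism, which is reactive and incremental), a watcher can poll a directory for modification times and re-index only what changed:

```python
import os

def scan(directory):
    """Map each file path under `directory` to its last-modified time."""
    return {
        os.path.join(root, name): os.path.getmtime(os.path.join(root, name))
        for root, _, files in os.walk(directory)
        for name in files
    }

def changed_files(previous, current):
    """Return paths that are new or modified since the previous scan."""
    return [path for path, mtime in current.items()
            if previous.get(path) != mtime]

# Usage sketch: poll periodically, diff against the last scan, and
# re-embed/re-index only the returned paths.
```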

For example, here's an interesting hosted showcase built by combining LlamaIndex with real-time data processing via Pathway, which you can try on your own. On the left bar of the Streamlit interface, you can connect your SharePoint or Google Drive folder and then see the tool in action. Interestingly, this is a very popular use case for companies.

Sample Tutorial/Implementation

Creating a real-time Retrieval-Augmented Generation (RAG) application using Pathway and LlamaIndex involves several steps, from setting up your environment to running a fully integrated application. Here's a step-by-step tutorial to guide you through the process:

Prerequisites

  • Ensure Docker, Dropbox, and Python are installed on your machine.

  • Familiarity with Docker and Python programming is beneficial.

  • Important Note: While the steps below describe a non-Dockerized setup, using Docker is highly recommended as a best practice: it ensures consistency across different environments and simplifies the setup process.
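If you prefer the Dockerized route, a minimal Dockerfile for this tutorial could look like the sketch below. The base image, file names, and entry point (`app.py`) are assumptions for illustration; adjust them to your project layout.

```docker
# Minimal image for the Pathway + LlamaIndex tutorial (illustrative sketch)
FROM python:3.11-slim

WORKDIR /app

# Install the packages used in this tutorial
RUN pip install --no-cache-dir \
    pathway \
    llama-index \
    llama-index-embeddings-openai \
    llama-index-retrievers-pathway

# Copy your application code and data directory into the image
COPY . .

# Pass the API key at runtime, e.g. docker run -e OPENAI_API_KEY=... <image>
CMD ["python", "app.py"]
```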

  1. Installation

First, we need to install necessary packages. This includes LlamaIndex for retrieval functionalities and Pathway for data processing and indexing.

# Install LlamaIndex and Pathway packages using pip
pip install llama-index-embeddings-openai  # For embeddings using OpenAI models
pip install llama-index-retrievers-pathway  # For the Pathway retriever in LlamaIndex
pip install pathway  # The Pathway package for data processing and indexing
pip install llama-index  # Main LlamaIndex package
  2. Preparing Your Data

Create a directory to store your data and download a sample dataset. This is where Pathway will monitor for any changes to re-index the updated content.

# Create a directory for data and download sample data
mkdir -p data/
wget 'https://gist.githubusercontent.com/link_to_your_data' -O data/sample_data.md

Replace the wget URL with the actual link to your sample data.

  3. Configuring Your Environment

Set up your environment variables, including the OpenAI API key if you're using OpenAI models for embeddings. This key is required for accessing OpenAI's API services.

import os
import getpass

# Set up the OpenAI API key for embedding operations
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API Key: ")
  4. Logging Configuration

Configuring logging helps monitor the pipeline's execution and debug if necessary.

import logging
import sys

# Configure basic logging to stdout to monitor the process
logging.basicConfig(stream=sys.stdout, level=logging.ERROR)
logging.getLogger().addHandler(logging.StreamHandler(sys.stdout))
  5. Defining Data Sources

Specify which data sources Pathway should monitor. This can include local directories, cloud storage, etc. Pathway supports a variety of sources, making it versatile for different use cases.

import pathway as pw

# Define the data sources Pathway will monitor
data_sources = [
    pw.io.fs.read("./data", format="binary", mode="streaming", with_metadata=True)
    # Add more sources as needed
]
  6. Creating the Indexing Pipeline

This section defines the document processing pipeline. We split the text and then embed it using OpenAI models before indexing.

from llama_index.retrievers.pathway import PathwayVectorServer
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core.node_parser import TokenTextSplitter

# Setup for embedding model
embed_model = OpenAIEmbedding(embed_batch_size=10)

# Define transformations for the indexing pipeline
transformations_example = [
    TokenTextSplitter(chunk_size=150, chunk_overlap=10, separator=" "),
    embed_model,
]

# Initialize the processing pipeline with defined transformations
processing_pipeline = PathwayVectorServer(
    data_sources,
    transformations=transformations_example,
)
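To see what the splitter step contributes, here's a library-free sketch of token chunking with overlap, mirroring the chunk_size/chunk_overlap idea above. Splitting on whitespace is an illustrative assumption; TokenTextSplitter uses a real tokenizer.

```python
def chunk_tokens(text, chunk_size=150, chunk_overlap=10, separator=" "):
    """Split text into overlapping chunks of roughly chunk_size tokens."""
    tokens = text.split(separator)
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append(separator.join(window))
        if start + chunk_size >= len(tokens):
            break
    return chunks

# Each chunk shares `chunk_overlap` tokens with the previous one, so text
# cut at a boundary still appears intact in at least one chunk.
chunks = chunk_tokens("t0 t1 t2 t3 t4 t5 t6 t7 t8 t9", chunk_size=4, chunk_overlap=2)
print(chunks)
```

The overlap is what keeps retrieval robust: a sentence straddling a chunk boundary would otherwise be split across two embeddings and match neither well.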
  7. Running the Server

Start the Pathway server to begin monitoring the data sources and indexing new or updated documents.

# Specify host and port for the Pathway server
PATHWAY_HOST = "127.0.0.1"
PATHWAY_PORT = 8754

# Run the Pathway server in a non-blocking mode
processing_pipeline.run_server(host=PATHWAY_HOST, port=PATHWAY_PORT, with_cache=False, threaded=True)

Configure LlamaIndex to use the indexed data for retrieval. This involves setting up the PathwayRetriever.

from llama_index.retrievers.pathway import PathwayRetriever

# Initialize the PathwayRetriever with the server's host and port
retriever = PathwayRetriever(host=PATHWAY_HOST, port=PATHWAY_PORT)

Now you can perform queries against the indexed data:

# Perform a retrieval query; the result is a list of scored nodes
response = retriever.retrieve("What is Pathway?")
print(response)
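The retrieve-then-generate loop these pieces build toward can be sketched without any libraries: retrieved chunks are stuffed into the prompt that goes to the LLM. The prompt template and the call_llm stub below are illustrative assumptions, not Pathway or LlamaIndex APIs.

```python
def build_prompt(question, retrieved_chunks):
    """Assemble a RAG prompt: retrieved context first, then the user question."""
    context = "\n".join(f"- {chunk}" for chunk in retrieved_chunks)
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}\n"
        "Answer:"
    )

def call_llm(prompt):
    """Stand-in for a real LLM call (e.g. via the OpenAI API)."""
    return f"[LLM would answer from {prompt.count('- ')} context chunks]"

chunks = ["Pathway is a data processing framework.",
          "It keeps vector indexes up to date as sources change."]
print(call_llm(build_prompt("What is Pathway?", chunks)))
```

Because Pathway re-indexes documents as they change, the chunks fed into this prompt stay fresh without any manual re-ingestion step.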

This setup provides a foundation for building applications that require real-time data processing and retrieval. Remember, deploying this setup within a Docker container is recommended: it avoids dependency conflicts and ensures consistency and ease of deployment.

Conclusion

This integration guide between Pathway and LlamaIndex serves as a comprehensive tutorial for you to get started. Below are a few additional links and examples which may be helpful.

If you're a first-time LLM/RAG app developer, you can consider going for a more minimalistic approach to showcase an impactful project.

While so far we've used Pathway's LLM App, you might know that Pathway stands out as an open data processing framework, ideal for developing data transformation pipelines and machine learning applications that deal with live and evolving data sources. Interestingly, it's the world's fastest framework for stream data processing (as benchmarked in the ArXiv paper).

Now, the integration with LlamaIndex is facilitated through the PathwayReader and the PathwayRetriever. Here our focus is on the PathwayRetriever, which taps into Pathway's dynamic indexing capabilities to provide always up-to-date answers. The linked documentation is also quite comprehensive, but let us give you a quick walkthrough.


The key thing is the utility of your project, not so much whether you're using Pathway's LLM App end-to-end or coupling it with LlamaIndex, LangChain, etc., to harness the power of real-time LLM applications.

  • Pathway Retriever | LlamaIndex Documentation
  • Pathway Reader | LlamaIndex Documentation
  • Connecting various data sources such as a Google Drive | Pathway documentation
  • Showcase: Pathway + LlamaIndex + Streamlit | GitHub