Over the last couple of years, AI and Machine Learning have changed dramatically. The affordability of GPUs has democratized access to powerful computing resources, facilitating the training of large language models. With cheaper GPUs, researchers and developers can experiment more freely, accelerating innovation in natural language processing. Lower costs enable wider participation in AI research, fostering collaboration and diversity of perspectives. This accessibility has fuelled breakthroughs in understanding and generating human-like text, pushing the boundaries of what's possible in language modelling. Ultimately, the affordability of GPUs has catalysed progress towards more advanced and inclusive AI technologies.
Everybody these days is talking about AI and wants to add AI functionality to their products or services.
What is RAG? Why should we use it when we have LLMs trained with billions of parameters?
LLMs are large language models trained on huge amounts of internet data. Whenever we ask a question or tell one to perform a certain task, it doesn't just give an output; it acts more like a reasoning engine that derives the answer from a context.
LLMs have a knowledge cut-off limitation. If they are trained on older data (let's say data available until Jan 2022), they can't answer questions that refer to anything after Jan 2022. That is a big limitation of these models.
An LLM can answer questions about information that was available on the internet during its training. But what about questions related to our personal or organization-level data?
To overcome the above problems we can implement RAG (Retrieval Augmented Generation), a technique that retrieves relevant pieces of our own data and supplies them to a large language model as the context, so the model can answer questions about that data without being retrained or fine-tuned.
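Conceptually, the flow looks roughly like the sketch below (the helper names vector_store and llm are hypothetical placeholders, not the actual implementation we build later):

# A minimal sketch of the RAG flow; vector_store and llm are hypothetical placeholders.
def answer_with_rag(question, vector_store, llm):
    # 1. Retrieve the chunks of our own data most similar to the question.
    chunks = vector_store.similarity_search(question, k=3)
    context = "\n".join(chunk.page_content for chunk in chunks)
    # 2. Augment the prompt with the retrieved context.
    prompt = f"Answer the question based on the context below.\n\nContext: {context}\n\nQuestion: {question}"
    # 3. Generate the answer with the LLM.
    return llm.invoke(prompt)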
We will implement it from scratch to create a RAG-based app that transcribes a YouTube video and answers questions using that transcription as the context. Below is the code walkthrough with a detailed explanation.
Install all the dependencies.
# You can use conda or pip to install the required packages
# pip install python-dotenv
# !pip install langchain
# !pip install openai
# !pip install whisper
# !pip install pytube
# !pip install youtube
# !pip install ffmpeg
# !pip install git+https://github.com/openai/whisper.git
# !pip install ffmpeg-python
# !choco install ffmpeg
# !pip install librosa
# !pip install langchain-openai
# !pip install "langchain[docarray]"
# !pip install docarray
# !pip install langchain-pinecone
# !pip install -U sentence-transformers
# !pip install --upgrade pip
# !pip install --upgrade --quiet "docarray"
# !pip install pinecone-client
Before creating the app from the transcription, we should understand some core concepts of LangChain and language models so that we can write effective prompts and create some amazing chains.
Importing environment variables:
import os
from dotenv import load_dotenv, dotenv_values

load_dotenv()
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
# OPENAI_API_KEY = "xyz"  # in case you want to use the API key directly (not recommended)

YOUTUBE_VIDEO = "https://www.youtube.com/watch?v=zduSFxRajkE"
Setting up the model:
from langchain_community.chat_models import ChatOpenAI

model = ChatOpenAI(openai_api_key=OPENAI_API_KEY, model="gpt-3.5-turbo")
# We are using OpenAI's "gpt-3.5-turbo" model.
model.invoke("Write a poem on rain in 6 lines")  # invoke sends the given prompt to the model.
Output: AIMessage(content="Raindrops falling from the sky,\nCreating patterns as they fly.\nPitter patter on the ground,\nA soothing symphony around.\nNature's tears, cleansing our souls,\nIn the rain, we find our roles.", response_metadata={'token_usage': {'completion_tokens': 44, 'prompt_tokens': 16, 'total_tokens': 60}, 'model_name': 'gpt-3.5-turbo', 'system_fingerprint': 'fp_c2295e73ad', 'finish_reason': 'stop', 'logprobs': None}) This is the raw output we received from our language model. It comes wrapped in an AIMessage with metadata, which we don't need.
Parsing the output to get it in a nicer format:
from langchain_core.output_parsers import StrOutputParser

parser = StrOutputParser()  # This strips the metadata and returns just the text.
chain = model | parser
chain.invoke('Write a poem on rain in 6 lines')
Output: "Raindrops falling from the sky,\nCreating patterns as they fly.\nPitter patter on the ground,\nA soothing symphony around.\nNature's tears, cleansing our souls,\nIn the rain, we find our roles."
Adding prompts using prompt templates:
from langchain.prompts import ChatPromptTemplate

template = """Answer the question based on the context below. If you don't know the answer, reply "I don't know".

Context: {context}

Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)
prompt
Output: ChatPromptTemplate(input_variables=['context', 'question'], messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'question'], template='Answer the question based on the context below. If you don\'t know the answer, reply "I don\'t know".\n\nContext: {context}\n\nQuestion: {question}\n'))])
prompt.format(context="Sidhartha works at Google", question="Who is Sidhartha?")
Output: 'Human: Answer the question based on the context below. If you don\'t know the answer, reply "I don\'t know".\n\nContext: Sidhartha works at Google\n\nQuestion: Who is Sidhartha?\n'
Our prompt is ready, the model is ready, and the output parser is ready. We can chain them all together to get our final output.
chain = prompt | model | parser  # Chaining the prompt, model, and parser together.
chain
Output: ChatPromptTemplate(input_variables=['context', 'question'], messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'question'], template='Answer the question based on the context below. If you don\'t know the answer, reply "I don\'t know".\n\nContext: {context}\n\nQuestion: {question}\n'))]) | ChatOpenAI(client=<openai.resources.chat.completions.Completions object at 0x00000232C8010E80>, async_client=<openai.resources.chat.completions.AsyncCompletions object at 0x00000232C8012860>, openai_api_key='sk-***', openai_proxy='') | StrOutputParser()
chain.invoke({
    "context": "Sidhartha works at Google",
    "question": "Who is Sidhartha?"
})  # invoke runs the chain with the given inputs.
Output: 'Sidhartha is an employee at Google.'
By being a bit creative, we can use combinations of prompts and chains to do some crazy stuff.
Translating from one language to another:
from operator import itemgetter

translation_prompt = ChatPromptTemplate.from_template(
    "Translate {answer} to {language}"
)
translation_chain = (
    {"answer": chain, "language": itemgetter("language")} | translation_prompt | model | parser
)
translation_chain.invoke({
    "context": "Sidhartha has 2 bikes. One is a Triumph and the other is a Ninja.",
    "question": "How many cars does Sidhartha have?",
    "language": "Hindi",
})
Output: ‘मुझे नहीं पता। (mujhe nahin pata)’
We are done with the basics. We can now transcribe our YouTube video and supply it as the context to our language model.
Transcribing the YouTube video:
import tempfile
import whisper
from pytube import YouTube

# Let's do this only if we haven't created the transcription file yet.
if not os.path.exists("transcription.txt"):
    youtube = YouTube(YOUTUBE_VIDEO)
    audio = youtube.streams.filter(only_audio=True).first()

    # Let's load the base model. This is not the most accurate
    # model but it's fast.
    whisper_model = whisper.load_model("base")

    with tempfile.TemporaryDirectory() as tmpdir:
        file = audio.download(output_path=tmpdir)
        transcription = whisper_model.transcribe(file, fp16=False)["text"].strip()

        with open("transcription.txt", "w") as file:
            file.write(transcription)
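Note that the transcription variable is only set when the file is created in the same run. If transcription.txt already exists from an earlier run, we should read it back from disk before using it below (a small addition for completeness):

# Read the transcription back from disk so the variable exists even on re-runs.
with open("transcription.txt") as file:
    transcription = file.read()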
Using the transcription as the context:
try:
    chain.invoke({
        "context": transcription,
        "question": "who is siddharth?"
    })
except Exception as e:
    print(e)
# The error message below tells us that the transcription text is too large to pass as input to the model, since it accepts at most 16385 tokens.
Output: Error code: 400 — {'error': {'message': "This model's maximum context length is 16385 tokens. However, your messages resulted in 25843 tokens. Please reduce the length of the messages.", 'type': 'invalid_request_error', 'param': 'messages', 'code': 'context_length_exceeded'}}
We are getting this error because our language model has a limited context length, i.e. 16385 tokens, but we provided 25843 tokens. To overcome this issue we can split the context into multiple chunks and provide only the relevant chunks as the context.
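As an optional aside, we can measure the token count ourselves with the tiktoken library (this snippet is an assumption on my part and not required for the rest of the walkthrough):

import tiktoken

# Count how many tokens the transcription occupies with gpt-3.5-turbo's tokenizer.
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
print(len(encoding.encode(transcription)))  # Should be close to the 25843 tokens reported above.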
Splitting the documents into chunks and providing them as the context:
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

loader = TextLoader("transcription.txt")
text_documents = loader.load()

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=60)
text_splitter.split_documents(text_documents)[:10]  # Preview the first 10 chunks.
documents = text_splitter.split_documents(text_documents)
Embedding the chunks to create vectors:
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import DocArrayInMemorySearch

model_name = 'text-embedding-ada-002'
embeddings = OpenAIEmbeddings(
    model=model_name,
    openai_api_key=OPENAI_API_KEY
)
vectorstore2 = DocArrayInMemorySearch.from_documents(documents, embeddings)
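Before wiring the in-memory store into a chain, we can optionally sanity-check it with a quick similarity search (the query string here is just an example):

# Retrieve the chunks most similar to a test query from the in-memory store.
vectorstore2.similarity_search("what is tokenization?", k=2)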
from langchain_core.runnables import RunnableParallel, RunnablePassthrough

chain = (
    {"context": vectorstore2.as_retriever(), "question": RunnablePassthrough()}
    | prompt
    | model
    | parser
)
chain.invoke("What is the summary of the discussion, explained in 20 words?")
Output: 'Training tokenizer for text compression, introducing new tokens to vocabulary, potential for transformers to process multiple modalities simultaneously.' (The output you get will depend on the video link you use.)
Now we need to upload the embeddings to a vector database to store them. There are multiple vector stores like FAISS, ChromaDB, Pinecone, etc. Here we will use Pinecone. We need to create an account on Pinecone by following the steps below.
You should register on Pinecone. https://www.pinecone.io/
Then we can see the dashboard below and create an index.
We can give the index a name; here it is youtube-rag. We can click the "Setup by model" button, select the model, select the region, and click "Create index".
Once the vector store setup is done, get the Pinecone API key from the API Keys tab on the left and store it in an environment variable (or in a regular variable). Then we can pass the index name to do a similarity search against the vector store.
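If you prefer creating the index from code instead of the dashboard, a sketch with the pinecone-client (v3+) might look like the snippet below; the cloud, region, and dimension (1536 for text-embedding-ada-002) are assumptions you should match to your own account and embedding model:

from pinecone import Pinecone, ServerlessSpec

# Hypothetical programmatic alternative to the dashboard steps above.
pc = Pinecone(api_key=os.environ.get('PINECONE_API_KEY'))
if "youtube-rag" not in pc.list_indexes().names():
    pc.create_index(
        name="youtube-rag",
        dimension=1536,  # text-embedding-ada-002 returns 1536-dimensional vectors
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
    )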
from langchain_pinecone import PineconeVectorStore

pinecone_api_key = os.environ.get('PINECONE_API_KEY')
# pinecone_api_key = "YOUR_PINECONE_APIKEY"
pinecone = PineconeVectorStore.from_documents(
    documents, embeddings, index_name="youtube-rag"
)
pinecone.similarity_search("what is tokenization?")
Output: [Document(page_content=”so here they mentioned that they trained on two trillion tokens of data and so on. So we’re going to build our own tokenizer luckily the byte bearing coding algorithm is not that super complicated and we can build it from scratch ourselves and we’ll see exactly how this works. Before we dive into code I’d like to give you a brief taste of some of the complexities that come from the tokenization because I just want to make sure that we motivate it sufficiently for why we are doing all of this and why this is so gross. So tokenization is at the heart of a lot of weirdness in large language models and I would advise that you do not brush it off. A lot of the issues that may look like just issues with the neural work architecture or the large language model itself are actually issues with the tokenization and fundamentally trace back to it. So if you’ve noticed any issues with large language models can not able to do spelling tasks very easily that’s usually due to tokenization. Simple”, metadata={‘source’: ‘transcription.txt’}), Document(page_content=”the GPT2 paper and if you scroll down here to the section input representation this is where they cover tokenization the kinds of properties that you’d like the tokenization to have and they conclude here that they’re going to have a tokenizer where you have a vocabulary of 50,200 and 57 possible tokens and the context size is going to be 1,024 tokens. So in the attention layer of the transformer neural network every single token is attending to the previous tokens in the sequence and it’s going to see up to 1,024 tokens. So tokens are this like fundamental unit the atom of large language models if you will and everything is in units of tokens everything is about tokens and tokenization is the process for translating strings or text into sequences of tokens and vice versa. When you go into the Lama 2 paper as well I can show you that when you search token you’re going to get 63 hits and that’s because tokens are again pervasive so here they mentioned that they trained on two trillion”, metadata={‘source’: ‘transcription.txt’})]
Creating our final chain using the Pinecone vector store:
pinecone_chain = (
    {"context": pinecone.as_retriever(), "question": RunnablePassthrough()}
    | prompt
    | model
    | parser
)
pinecone_chain.invoke("What is byte pair encoding?")
Output: 'Byte pair encoding is an algorithm that allows for compressing byte sequences by a variable amount, enabling support for larger vocabulary sizes while still using the utf-8 encoding of strings.'
pinecone_chain.invoke("What are the primary ideas of the context?")
Output: 'Tokenization, sentence piece, transformer neural network, vocabulary, tokens, modality processing, architecture of transformers.'
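As a small usage example, we could wrap the final chain in a minimal question loop (just a sketch, not part of the original walkthrough):

# Ask questions about the video interactively; press Enter on an empty line to stop.
while True:
    question = input("Ask a question about the video (blank to quit): ").strip()
    if not question:
        break
    print(pinecone_chain.invoke(question))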
The video link used here is the lecture on Tokenization by Andrej Karpathy.
References:
https://python.langchain.com/docs/get_started/introduction/
https://www.youtube.com/watch?v=BrsocJb-fAo&t=728s
This is my second story. If you like it, hit the like button and share. If you have any doubts or run into any issues while implementing the code, you can ask in the comments here or reach me on my social handles.
Linkedin : https://www.linkedin.com/in/smohanty93/