After scraping the data to be used as input for the LLM to generate output, we wanted to create a dataset to fine-tune an LLM for Tarwiiga AdGen, a Google Ads generator built with AI at Tarwiiga. The tool takes an input and produces a JSON output. While we were relying on LLMs like OpenAI's GPT, Google's Gemini, and Anthropic's Claude to generate ads with specific prompts, using LangChain parsers to get the JSON, we wanted to use this approach for generating the dataset. Here, I'm discussing our approach to generating a 10K dataset.
But before that, I want to mention that I first tried to make the LLM generate everything, from input to output. I asked it to give me a list of 10 inputs and then looped through this list to generate the JSON output and save it in a CSV file. However, I found that every time I asked for a list of inputs, it generated many duplicates. I think this happened because the LLM's API was caching the responses. While this issue could be worked around to reduce the number of duplicates, I decided to work with real data of the kind I expect to receive in the future when the tool is used. Besides, it was taking too long to generate all the inputs and then proceed to generate the output.
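For context, the discarded input-generation step looked roughly like the sketch below. The prompt wording, model name, and helper name are placeholders, not the actual AdGen code.

```python
# Hypothetical sketch of the discarded approach: ask the LLM for inputs directly.
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

llm = ChatOpenAI(model="gpt-3.5-turbo")  # placeholder model choice

input_prompt = ChatPromptTemplate.from_template(
    "Give me a list of 10 short business descriptions, one per line."
)

def generate_inputs() -> list[str]:
    """Ask the LLM for a batch of candidate inputs (hypothetical helper)."""
    raw = (input_prompt | llm).invoke({}).content
    return [line.strip() for line in raw.splitlines() if line.strip()]

# Repeated calls returned many near-identical lines, which is why this
# approach was dropped in favor of scraped inputs.
inputs = generate_inputs()
```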
That's why I scraped data to use as input. With the approach I followed, as mentioned in that article, I was able to scrape millions of data points. Specifically, I scraped data from 12 categories, with each category containing 5,000 pages. Each page had about 20 inputs, resulting in a total of 12 * 5,000 * 20 = 1,200,000 inputs, or roughly one million two hundred thousand. In reality, some pages contained more than 20 inputs, so I ended up with 1,239,232 data points. There were a lot of duplicate inputs (1,173,847 to be exact), leaving me with 65,385 unique data points. While this approach didn't completely eliminate duplicate inputs, it was much faster to get inputs from another source rather than relying on the LLM. Now, the LLM could focus solely on generating outputs.
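As a concrete illustration, counting and dropping duplicates is straightforward with pandas; the file and column names here are assumptions, not the real pipeline.

```python
import pandas as pd

# Assumed layout: all scraped inputs concatenated into one CSV with an "input" column.
df = pd.read_csv("scraped_inputs.csv")

total = len(df)                     # e.g. 1,239,232 rows
unique = df["input"].nunique()      # e.g. 65,385 unique inputs
duplicates = total - unique         # e.g. 1,173,847 duplicate rows

print(f"total={total}, unique={unique}, duplicates={duplicates}")

# Keep only the unique inputs for the generation step.
df.drop_duplicates(subset="input").to_csv("unique_inputs.csv", index=False)
```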
Since I was sending requests to LLM APIs, I needed to find a way to manage the generation process efficiently. I started with one category and looped through 200 pages, each containing around 20 inputs, sometimes a bit more or less. This allowed me to generate around 3,859 data points for the first category. For another category, I generated around 3,899 data points, and for a third, 2,171 data points. In total, this amounted to 3,859 + 3,899 + 2,171 = 9,929 data points, which is roughly a 10K dataset.
During the generation process, I was able to fine-tune Google's Gemma 2B on a 1K dataset, which yielded very good results. I'll discuss fine-tuning in a future post, but for now, I want to focus on how I handled the generation process.
The process is basic, and I didn't do any optimization at first; I just wanted to start and see how things would go. To understand it, let's start from the bottom up. First, there is the AdGen code that takes an input and generates a JSON output representing the Google Ad components. This is built around a specific prompt and parsers that extract the JSON.
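A minimal sketch of that bottom layer is shown below, assuming a LangChain-style prompt-plus-parser chain; the prompt text, JSON keys, and model name are illustrative and not the actual AdGen prompt.

```python
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import JsonOutputParser

llm = ChatOpenAI(model="gpt-3.5-turbo")  # any of the LLMs mentioned above could be swapped in

ad_prompt = ChatPromptTemplate.from_template(
    "Generate a Google Ad for the following business as JSON with keys "
    "'headlines' and 'descriptions'.\n\nBusiness: {business}"
)

# Prompt -> LLM -> JSON parser, mirroring the prompt-plus-parser setup described above.
ad_chain = ad_prompt | llm | JsonOutputParser()

def generate_ad(business: str) -> dict:
    """Take one input and return the ad components as a Python dict."""
    return ad_chain.invoke({"business": business})

print(generate_ad("Family-owned bakery in Austin offering custom cakes"))
```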
With around 20 inputs per page, I divided them into chunks of size 5. Above this sits a loop that goes through pages to get inputs. I made it loop through 10 pages, take the roughly 20 inputs from each page, and split them into chunks of 5. For each input, a request was sent to the LLM, and the output was saved to a CSV file. This resulted in a category folder with 200 subfolders for pages, and inside each page folder there were 4 dataset CSV files.
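A rough sketch of that loop is below. It reuses the `generate_ad` sketch from earlier, and `get_inputs` stands in for the scraper from the previous article; both are hypothetical names, not the real code.

```python
import csv
from pathlib import Path

def get_inputs(category: str, page: int) -> list[str]:
    """Placeholder for the scraper described in the previous article (hypothetical)."""
    raise NotImplementedError

def chunked(items, size=5):
    """Split a list into consecutive chunks of the given size."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def generate_for_category(category: str, start_page: int, end_page: int) -> None:
    for page in range(start_page, end_page):
        inputs = get_inputs(category, page)          # roughly 20 inputs per page
        page_dir = Path(category) / f"page_{page}"
        page_dir.mkdir(parents=True, exist_ok=True)

        # 20 inputs in chunks of 5 -> 4 dataset CSV files per page folder.
        for idx, chunk in enumerate(chunked(inputs, 5)):
            with open(page_dir / f"dataset_{idx}.csv", "w", newline="") as f:
                writer = csv.writer(f)
                writer.writerow(["input", "output"])
                for business in chunk:
                    # generate_ad is the prompt-plus-parser sketch shown earlier.
                    writer.writerow([business, generate_ad(business)])
```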
This process took a long time on some LLMs, like GPT-4, and was faster on others, like GPT-3.5 and Gemini 1.5 Pro. I think GPT-4 was slower because it was busy with other users' requests, though I'm not sure. There were also some issues with Gemini making a lot of retries. I ended up running the same script multiple times, changing the range of pages each time: the first script from page 0 to 10, the second from page 10 to 20, and so on.
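One way to run the same loop over different page ranges is a small command-line wrapper like the sketch below; it assumes the `generate_for_category` sketch above and is an illustration of the idea, not the exact script.

```python
import argparse

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Generate one slice of the dataset.")
    parser.add_argument("--category", required=True)
    parser.add_argument("--start-page", type=int, default=0)
    parser.add_argument("--end-page", type=int, default=10)
    args = parser.parse_args()

    # e.g. launch one copy with --start-page 0 --end-page 10,
    # another with --start-page 10 --end-page 20, and so on.
    generate_for_category(args.category, args.start_page, args.end_page)
```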
While I think this approach could be optimized and improved, my goal was to quickly generate a dataset for fine-tuning. With this approach, I was able to generate a 10K dataset, which is good enough for fine-tuning an LLM, though it contains duplicate inputs. The unique inputs, as mentioned above, number around 65K. Generating a 65K dataset would require optimizing the code to make it faster, but that's not necessary for now; it can be done later.
I hope this article was helpful to you. Please don't hesitate to ask me any questions; you can reach me on Twitter (X) and LinkedIn.