OpenAI’s latest blunder shows the challenges facing Chinese AI models

In fact, among the few long Chinese tokens in GPT-4o that aren’t either pornography or gambling nonsense, two are “socialism with Chinese characteristics” and “People’s Republic of China.” The presence of these phrases suggests that a significant part of the training data actually comes from Chinese state media writings, where formal, long expressions are extremely common.

OpenAI has historically been very tight-lipped about the data it uses to train its models, and it probably will never tell us how much of its Chinese training database is state media and how much is spam. (OpenAI didn’t respond to MIT Technology Review’s detailed questions sent on Friday.)

But it is not the only company struggling with this problem. People inside China who work in its AI industry agree there’s a lack of quality Chinese text data sets for training LLMs. One reason is that the Chinese internet was, and largely remains, divided up among big companies like Tencent and ByteDance. They own most of the social platforms and aren’t going to share their data with competitors or third parties to train LLMs.

In fact, this is also why search engines, including Google, kinda suck when it comes to searching in Chinese. Since WeChat content can only be searched on WeChat, and content on Douyin (the Chinese TikTok) can only be searched on Douyin, this data is not accessible to a third-party search engine, let alone an LLM. But these are the platforms where actual human conversations are happening, rather than some spam website that keeps trying to draw you into online gambling.

The lack of quality training data is a much bigger problem than the failure to filter out the porn and general nonsense in GPT-4o’s token-training data. If there isn’t an existing data set, AI companies have to put in significant work to identify, source, and curate their own data sets and filter out inappropriate or biased content.

It doesn’t seem OpenAI did that, which in fairness makes some sense, given that people in China can’t use its AI models anyway.

Still, there are many people living outside China who want to use AI services in Chinese. And they deserve a product that works properly as much as speakers of any other language do.

How can we solve the problem of the lack of good Chinese LLM training data? Tell me your idea at zeyi@technologyreview.com.
