NYT Appreciation/AI Failed a Vietnamese Language Test: Which "Low-Resource Languages" Are Being Overlooked in the Tech Era?

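Lelapa AI, a company in Johannesburg, South Africa, is pursuing research grounded in real societal needs to support the development of AI technology for African languages. (The New York Times)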

When AI Fails the Language Test, Who Is Left Out of the Conversation?

AI沒能通過測驗 哪些語言遭忽略?

Stanford researchers gave a popular artificial intelligence chatbot a language test.

史丹福大學研究員針對熱門人工智慧聊天機器人進行語言測試。

They asked the bot in Vietnamese to write a traditional poem in the form known as “song thất lục bát” that follows a pattern of lines made up of seven, seven, six, then eight words. When the bot spit out an answer, it wrote a poem but didn’t follow the format.

他們用越南語要求機器人寫一首傳統詩歌,以詩句依序爲七字、七字、六字接着八字的「雙七六八體」格式撰寫。機器人吐出答案,寫了一首詩,但沒有遵循格式。

The team tried a different prompt, asking what the proper Vietnamese word was for a mother’s younger brother, and it responded with the words for a father’s younger and older siblings.

這個團隊試了不同指令,詢問稱呼母親的弟弟的適當越南語單字是什麼,它卻回答了稱呼父親的弟妹和兄姊的越南語單字。

While the use of AI has exploded in the West, much of the rest of the world has been left out of the conversation since most of the technology is trained in English. AI experts worry that the language gap could exacerbate technological inequities and that it could leave many regions and cultures behind.

儘管西方人工智慧使用量激增,世界其他許多地方卻被排除在對話外,因爲這項科技大部分以英語訓練。人工智慧專家憂心,語言鴻溝可能加劇科技不平等,也可能將許多地區和文化拋在後頭。

A delay in access to good technology of even a few years “can potentially lead to a few decades of economic delay,” said Sang Truong, a doctoral candidate at the Stanford Artificial Intelligence Laboratory at Stanford University. He is on the team that built a Vietnamese language model and tested it against others.

史丹福大學「史丹福人工智慧實驗室」博士候選人張創,是負責打造並測試越南語模型團隊的成員。他說,只是晚了短短幾年才取得優良科技,「也可能導致經濟延遲發展數十年」。

The tests his team ran found that AI tools across the board could get facts and diction wrong when working with Vietnamese, likely because it is a “low-resource” language by industry standards, which means that there aren’t sufficient data sets and content available online for the AI model to learn from.

他的團隊進行測試發現,整體而言人工智慧工具在處理越南語時,可能發生事實和措辭上的錯誤,這可能是因爲以行業標準而言,越南語是個「低資源語言」,意味着越南語在線上沒有足夠的資料集和內容讓人工智慧模型學習。

Low-resource languages are spoken by tens and sometimes hundreds of millions of people around the world, but they yield less digital data because AI tech development and online engagement are centered in the United States and China.

低資源語言被世界各地上千萬甚至上億人使用,但它們產生的數位資料較少,因爲人工智慧科技開發和線上參與集中在美國和中國。

An analysis of top websites by W3Techs, a tech survey company, found that English makes up more than 60% of the internet’s language data. While English is widely spoken globally, native English speakers make up about 5% of the world’s population, according to Ethnologue, a research organization that collects language data. Mandarin and Spanish are other examples of languages with a significant online presence and reliable digital data sets.

科技調查公司W3Techs針對主要網站的一項分析發現,英語佔網際網路語言資料的60%以上。收集語言資料的研究組織「民族語」指出,儘管英語在全球被廣爲使用,但英語母語者約佔世界人口的5%。中文和西班牙文是具有重大線上存在感和可信數位資料集語言的其他範例。

“Large companies like Google, Apple, OpenAI, for example, have not necessarily trained their models for tools that serve these markets,” Chinasa T. Okolo, a fellow at the Center for Technology Innovation at the Brookings Institution, said about communities with low-resource languages. “They don’t provide enough market value for them to do so.”

布魯金斯研究院科技創新中心研究員奇娜薩·奧科洛提到使用低資源語言的社羣時表示:「谷歌、蘋果和OpenAI這類大型公司,未必會爲了服務這些市場的工具來訓練他們的模型。它們沒有提供足夠的市場價值讓這些公司這麼做。」

By Sara Ruberg / Translated by 羅方妤