Unravel the past: Large language model “Xunzi” is steeped in the classics and packed with computing power
Source: Converged Media Editing Platform; Author: Zhang Yuying; Date: 2024-01-22
Thousands of years ago, texts appeared on animal bones, bronzes, bamboo slips, and silk brocades (織錦) before they were written on paper. Now these ancient Chinese texts have a “new container” in the modern age.

A research team from Nanjing Agricultural University recently rolled out Xunzi, a large language model (LLM), and XunziChat, its companion dialogue model, in association with Gulian, a leading publisher of ancient Chinese texts.

Wang Dongbo, the leader of the research team, said that the large language model was named after Xunzi because Xunzi was not only a prominent Confucian philosopher during the late Warring States Period (475-221 BC), but also a pioneer in presenting and explaining theories of linguistics in ancient China.

When asked why he and his partners made the large language model, Wang explained that “traditional Chinese characters, vertical layout (豎版), the absence of pausing and punctuation (句讀) are all obstacles that readers have to overcome when they read traditional texts”.

To create Xunzi the LLM, Wang and his partners first needed to do a lot of research. Since 2013, his team has worked tirelessly to digitize Chinese classics like the Siku Quanshu, or the Complete Library in Four Sections. “The hard work involves a large-scale corpus (語料庫) of two billion Chinese characters, which has laid a solid foundation for the large language model,” said Wang.

Their efforts seem to have paid off. Xunzi the LLM can now tag (標記), translate, punctuate, and understand scraps (片段) of ancient Chinese texts. It can even do part-of-speech analysis and retrieve (檢索) specific information, such as names, events, and places, from a text.

With this LLM, more Chinese people, including students, can access ancient Chinese texts. For instance, if users type “shangu” into the chat box, they will not only discover that it translates to “valley” but also see that it can refer to a person’s courtesy name (字) in certain ancient Chinese texts. Through Xunzi’s retrieval function, users can get more specific cultural information based on courtesy names.

“The model can help us mine for more information hidden in our cultural legacy and find unnoticed models and connections,” said Wang.

But Wang and his team aren’t focused only on target users in China. They are aiming at the rest of the world as well. They have shared the LLM on GitHub and other websites, allowing users to download and use it for free. “Our team is committed to the philosophy of making our data and model globally accessible. We hope this will encourage more people to appreciate traditional Chinese culture,” Wang explained.

This article is taken from 21st Century English Newspaper (《21世紀英文報》), Senior Three edition, issue 831.