
Monthly Archives: August 2011

中文信息处理

作者:江铭虎 [清华大学(北京)教授、博士生导师]
译者:梅中伟

中国文字出现于3500年前,即公元前1400年前后的中国商代的甲骨文。当时字符被刻在兽骨或龟甲上。通过对甲骨文到现代汉语的演化过程的调研,发现由甲骨铭文发展而来的汉语历经了3000多年的历史后,在形态、结构和义汇方面虽然发生了深刻变化,但它们仍然属于同一语言体系;甲骨文和现代汉语的关系呈现出连续的线性关系。汉字作为世界上唯一沿用至今的表意文字使用区域广泛,但存在问题众多。中国人占世界人口的1/4强,他们都用汉语来相互交流, 也就是说,有超过14亿的人在用汉语进行阅读,书写,并借助汉语进行其它活动。目前,随着中国经济的崛起和计算机信息处理技术的快速发展,世界各国已经建立了数千所孔子学院,它的目的是促进中国知识的传播。 因此通过汉语来进行沟通显得愈加重要。与此同时,中文计算机信息处理,如机器翻译、信息检索等显得越来越重要;该领域的研究必将促进不同语言之间的沟通,改变我们的生活方式。中文不同于英语;汉语句子没有时态和形态的变化。人脑对汉语词汇信息的反应主要是基于概念和语义属性,很少用汉语的句法和语法特征来体现。英语的自然语言处理技术不能直接用来处理汉语,因此中文信息处理领域还有很多研究要做,让我们携手共进吧!

Chinese information processing

From Minghu Jiang

 

Chinese characters first appeared about 3,500 years ago as oracle-bone inscriptions, the ancient script of the Shang dynasty around 1400 B.C. These characters were engraved on animal bones or tortoise shells. A survey of the evolution from oracle-bone inscriptions to the modern Chinese language shows that, although morphology, structure and vocabulary have changed greatly over more than 3,000 years, the two belong to the same language system: modern Chinese descends from the oracle-bone inscriptions in one continuous line. Chinese characters are an ideographic script that has remained in continuous use to this day; the regions where they are used are broad, and the issues they raise are many. Chinese speakers make up more than a quarter of the world's population, i.e., more than 1.4 billion people read, write and communicate in Chinese. Currently, with the rapid rise of China's economy and the fast development of computer information processing technology, countries around the world have established thousands of Confucius Institutes, whose aim is to spread knowledge of China, and it is increasingly important to communicate in Chinese. At the same time, Chinese computer information processing, such as machine translation and information retrieval, is becoming increasingly important; this research may enhance communication between languages and change our way of life. Chinese is different from English: Chinese sentences do not change for tense or morphology. The human brain's response to Chinese lexical information is based mainly on conceptual and semantic attributes and seldom relies on Chinese syntax and grammar features. Natural language processing technology developed for English cannot be applied directly to Chinese, so much work remains to be done in Chinese information processing. Let us begin now.

《迷失的语言》

作者:王保硕

《迷失的语言: 无法破译的世界之谜》(Robinson, 2002)[1] 这本书重点阐述了有关汉语语音方面的观点。

以下是几句引文:

所有的书写系统都是表音和表义单位的组合。在芬兰语中,表音占主导地位;而汉语却是表意居首优先,尽管她的表音成分比人们印象中的要多。无论是字母文字还是汉语、日语的书写系统 – 抑或是埃及的象形文字和巴比伦楔形文字 – 都是用符号来表示声音(即,符号具有音值);同样,书写系统都使用的注音符号和表义符号来共同表示词和概念(即,表意图形)。表音符号在书写中系统中的比例越高,越容易猜出的书面字符的发音。相比而言,该比例在英语中高,在汉语中低。因此,英语逐音拼写的准确率高于汉语(普通话);但对芬兰语而言,其准确率则更高。芬兰语的字母书写系统最易拼读,而汉语、日语的书写符号则很难拼读;至于密码符号,拼读辨音的效率几乎为零。

参考文献

Robinson, A., 2002. Lost Languages: The Enigma of the World’s Undeciphered Scripts. New York: McGraw-Hill.

  • [1] 译者注:Robinson  Andrew (1957- )作者在该书中对世界现存的一些难以辨认的书写系统做了较为深入的研究。对于自然语言处理专业的研究着,该书有一定的参考价值。点击作者名字,你会看到作者对语言的书写系统的诸多著作。所译内容不代表译者观点。

Lost Languages

From Paul Wang

The book Lost Languages: The Enigma of the World’s Undeciphered Scripts (Robinson, 2002) makes several points about the phonetic aspect of the Chinese language.

Here is the quotation:

All writing systems are a mixture of phonetic and logographic elements. In Finnish, phoneticism predominates, while the Chinese script is chiefly logographic, though it contains more phoneticism than many people think. Both alphabets and the Chinese and Japanese scripts – and indeed the Egyptian hieroglyphs and Babylonian cuneiform – use symbols to represent sounds (i.e., signs with phonetic values); and similarly all written systems use a mixture of phonetic signs and semantic signs standing for words and concepts (i.e., logograms). The higher the proportion of phonetic symbols in a script, the easier it is to guess the pronunciation of written words. In English the proportion is high, in Chinese it is low. Thus English spelling represents English speech sound by sound more accurately than Chinese characters represent Mandarin speech (Putonghua); but Finnish spelling represents the Finnish language even better. The Finnish alphabetic script is phonetically highly efficient, while the Chinese and Japanese logosyllabic script is phonetically seriously deficient; and cryptographic codes have a phonetic efficiency of almost zero.

References
Robinson, A., 2002. Lost Languages: The Enigma of the World’s Undeciphered Scripts. New York: McGraw-Hill.

Thomas Jefferson on Chinese Language

From Paul Wang

 

It seems to me that Thomas Jefferson cared more deeply about the Chinese language than most Chinese do! In his letter to Charles Jared Ingersoll 192 years ago, written at his beloved Monticello, he concisely pointed out the problems of the language and the need for reform. It is interesting that his points are still largely relevant today, because the problems have never been solved satisfactorily! Source: Jefferson’s Letters, arranged by Wilson Whitman, E. M. Hale and Company, Eau Claire, Wisconsin.

Chinese Alphabet

July 20, 1818

Sir,—On my return the day before yesterday after a long absence from this place, I found here your favor of July 4 with the two Chinese works from Mr. Wilcox which accompanied it. I pray you to accept my thanks for the trouble you have taken in forwarding them, and if you are in correspondence with Mr. Wilcox and should have other occasion to write to him I must request you to express to him my sense of his kind attention in sending me these works.

They are real curiosities and give us a better idea of the state of science in China than the relations of travellers have effected. It is surely impossible that they can make much progress with characters so complicated, so voluminous and inadequate as these are. It must take a life to learn the characters only and then their expression of ideas must be very imperfect. I imagine that some fortuitous circumstances will someday call their attention to the simple alphabets of Europe, which with proper improvements may be made to express the sounds of their language as well as of others, and that then they may enter on the field of science. I think missionaries to instruct them in our alphabets would be more likely to take good effect and lead them to the object of our religious missionaries than an abrupt introduction of new doctrines for which their minds are in no wise prepared.

The Chinese language has been quite abused by the literate

From Paul Wang

“For some two thousand years, from 221 B.C. to 1912 A.D., a man who had to be able to read, write and keep accounts in a language so difficult that many believe it was purposefully designed that way only the sons of the rich could afford the education needed to perform the job.” From Hannah Pakula, “The Last Empress: Madame Chiang Kai-Shek and the Birth of Modern China”, Simon & Schuster, New York, 2009.

For centuries, Chinese intellectuals had made language their own special domain of cultural activity. It was the lofty position to which the classical language had been elevated in the course of Chinese history that guaranteed some semblance of autonomy to intellectuals, who otherwise were thoroughly subordinated to nobles and, later, emperors.

Before 403 B.C.    Before Confucius’ birth, ru had been in charge of burial ceremonies, astrological calendars and ancestor worship.

403-211 B.C.    Shi had adopted a martial spirit, manifested in vigorous intellectual controversy over statecraft and philosophy.

221 B.C.    The ancient shi were a special class of servant-advisors, their literary skill needed by a warlike but illiterate nobility.

202 B.C.-A.D. 202    Intellectuals lost some of the autonomy and vigor of the Warring States period. Shi were unable to demand contractual mutuality. They wrapped themselves up in scarves, hats and feathers, symbols that signaled their unique relation to language and books.

A.D. 960-A.D. 1127    Scholars were known as rujia; their sphere of autonomy became even more circumscribed when the examination system became the formal means of recruiting into the service of the state.

 

Symposium on Computing with Words

The Symposium on Computing with Words is part of the 2013 IEEE SSCI, which will be held in Singapore on April 16-19, 2013.

China, Country of Contrasts

From Paul Wang:

China, Country of Contrasts, by Mary A. Nourse and Delia Goetz, Harcourt, Brace and Co., New York, published in 1944.

The Renaissance movement to deal with the difficulty of the classical Chinese language is still in our memory. One of the major leaders of this movement was Hu Shih, who received his PhD degree from Columbia University. [If my memory serves me right, Lotfi A. Zadeh also received his PhD degree from Columbia.] Hu Shih wanted Chinese scholars to give up the old, stilted, classical written language which no one but a few scholars understood, and write their books in the spoken language. Novels, he pointed out, had always been written in colloquial Chinese and they were living literature. In time his idea gained support, and today newspapers, magazines, and books use this simpler language. Another idea of his was to get all China to adopt one spoken language. The Mandarin or Peking [Beijing] official dialect has been chosen and is called the national language. All this happened less than one hundred years ago.

Amid the hype from computer scientists, huge sums of research money have been spent, only to find out that AI (artificial intelligence) is actually a linguistic issue.

Despite the steady evolution of the Chinese language and the many efforts to improve it since the Hu Shih era, it still has many problems of varying difficulty. On the other hand, the Chinese language is very different from English. Most users know its strengths and its weaknesses.

If we are serious about CWW research, then it is a great idea to tackle both problems at the same time!

 

Computing with Words (CW or CWW)—A Paradigm Shift by Lotfi A. Zadeh

Computing with Words (CW or CWW) is a system of computation which offers an important capability that traditional systems of computation do not have—a capability to compute with information described in a natural language. In the main, CW is concerned with solution of problems which are stated in a natural language. The importance of CW derives from the fact that much of human knowledge is perception-based and is described in a natural language.

The point of departure in CW is a question, q, of the form: What is the value of a variable, X? q is associated with a question-relevant information set, I, an association expressed as X is I, meaning that the answer to q, Ans(q/I), is to be deduced (computed) from I. Typically, I consists of a collection of propositions, p1, …, pn, which individually or collectively are carriers of information about the value of X. In I, some or all of the pi, i=1, …, n,  are expressed in a natural language. Some of the pi may be drawn from external sources of information, typically from world knowledge. I is open if it includes propositions drawn from external sources of information. I is closed if inclusion is not allowed.

Precisiation of meaning is a prerequisite to computation with information which is described in a natural language. If p is a proposition drawn from a natural language, then precisiation of p leads to a computation-ready proposition, p*, which may be viewed as a computational meaning of p or, equivalently, as a model of p. p* is assumed to be mathematically well-defined and is intended to serve as an object of computation. In CW, there are two levels of generality, Level 1 and Level 2. In Level 1, the carriers of information are words. In the more recent Level 2 CW, the carriers of information are words and propositions. Today, the bulk of the literature is still focused on Level 1 CW. In particular, the widely used calculi of fuzzy if-then rules fall within the province of Level 1 CW.
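As a rough illustration of precisiation, the sketch below turns a hypothetical proposition p, "height is tall", into a computation-ready p* in the form of a fuzzy membership function. The class name, the variable name `height_cm` and the breakpoints 165 and 185 are assumptions made for this example only; they do not come from Zadeh's text.

```python
from dataclasses import dataclass
from typing import Callable


def ramp_up(a: float, b: float) -> Callable[[float], float]:
    """Return a membership function rising linearly from 0 at a to 1 at b."""
    def mu(x: float) -> float:
        if x <= a:
            return 0.0
        if x >= b:
            return 1.0
        return (x - a) / (b - a)
    return mu


@dataclass
class PrecisiatedProposition:
    """p*: a mathematically well-defined stand-in for a natural-language p."""
    variable: str                          # the variable X the proposition is about
    label: str                             # the word being precisiated
    membership: Callable[[float], float]   # degree to which a value fits the word

    def degree(self, value: float) -> float:
        return self.membership(value)


# Hypothetical precisiation of p = "height is tall"; breakpoints are illustrative.
tall = PrecisiatedProposition("height_cm", "tall", ramp_up(165.0, 185.0))

print(tall.degree(160.0), tall.degree(175.0), tall.degree(190.0))
```

Here the word "tall" is the carrier of information, which corresponds to Level 1 CW as described above; a different choice of membership function would give a different, equally legitimate precisiation.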

Computation of Ans(q/I) is carried out in two phases. Phase 1, called Precisiation, involves precisiation of q and I, leading to precisiated q, q*, and precisiated I, I*. Phase 2, called Computation, involves computation with q* and I*, leading to Ans(q/I). This is done through the use of an aggregation function which has the pi* as its arguments. In CW, the pi* are represented as generalized assignment statements or, equivalently, as generalized constraints. Computation with the pi* involves propagation and counterpropagation of generalized constraints. In CW, precisiation and computation employ the machinery of fuzzy logic.
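The two phases can be pictured with a toy example. The sketch below assumes a question q, "What is the value of X?", where X is a temperature, and an information set I made of two invented propositions. Phase 1 maps each proposition to a membership function through a small hypothetical lexicon; Phase 2 aggregates the resulting constraints with a pointwise minimum, which is one common fuzzy-logic choice for conjunction and not necessarily the aggregation function Zadeh has in mind. All names and numbers are assumptions.

```python
from typing import Callable, Dict, List

Membership = Callable[[float], float]


def triangle(a: float, b: float, c: float) -> Membership:
    """Triangular membership function: 0 at a and c, 1 at the peak b."""
    def mu(x: float) -> float:
        if x <= a or x >= c:
            return 0.0
        if x <= b:
            return (x - a) / (b - a)
        return (c - x) / (c - b)
    return mu


def complement(mu: Membership) -> Membership:
    """Standard fuzzy negation: 1 minus the membership degree."""
    return lambda x: 1.0 - mu(x)


# Hypothetical lexicon for temperatures in degrees Celsius (illustrative values).
LEXICON: Dict[str, Membership] = {
    "warm": triangle(15.0, 22.0, 29.0),
    "hot": triangle(25.0, 35.0, 45.0),
}


def precisiate(proposition: str) -> Membership:
    """Phase 1: map 'X is <word>' or 'X is not <word>' to a constraint on X."""
    words = proposition.lower().split()
    if len(words) not in (3, 4) or words[:2] != ["x", "is"]:
        raise ValueError(f"unsupported proposition: {proposition!r}")
    negated = len(words) == 4
    if negated and words[2] != "not":
        raise ValueError(f"unsupported proposition: {proposition!r}")
    mu = LEXICON[words[-1]]
    return complement(mu) if negated else mu


def compute(constraints: List[Membership], candidates: List[float]) -> Dict[float, float]:
    """Phase 2: aggregate the constraints by pointwise minimum over candidates."""
    return {x: min(mu(x) for mu in constraints) for x in candidates}


information = ["X is warm", "X is not hot"]           # the information set I
constraints = [precisiate(p) for p in information]    # the precisiated pi*
answer = compute(constraints, [float(t) for t in range(10, 41)])
best = max(answer, key=answer.get)                    # most compatible candidate
print(best, answer[best])
```

The answer is itself a fuzzy set over candidate values of X; picking the single most compatible candidate is only one way to summarize it.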

CW has important applications to decision analysis, question-answering systems, system modeling, specification and optimization, and mechanization of natural language understanding. Basically, CW opens the door to a wide-ranging enlargement of the role of natural languages in scientific theories.

 

CWW: why do we need a major reform of the Chinese language?

From Paul Wang:

As you are aware, my heritage is Chinese culture and language. Frankly, I was shocked to learn the following facts about the Chinese language.

1) There were two major reforms in the 20th century alone. Both made some progress, but fell far short of solving the fundamental problems!

2) Can you believe there is no “grammar” for the Chinese language? Several books are beginning to appear, but all are aimed at teaching non-Chinese speakers how to learn Chinese! Most are self-styled, and some of their authors, like myself, have very little training in linguistics.

3) Many books, some of which I found to be a hundred or hundreds of years old, were written by European and American experts in linguistics or by highly regarded literary authors. They all frankly pointed out the “weaknesses” of the Chinese language as compared with other languages. The Chinese language is by no means at the bottom, but it is certainly not at the top!

On the other side of the coin, the Chinese language has many “strong” points as well. We will avoid that debate here, because the issue has always been widely and intensively debated!

An important factor must be taken into consideration at this juncture: the use of the digital computer. Only because of it do we now know what to expect of, or what ought to be, a language that is friendly to the “computer”. An important case in point is CWW. CWW is the product of research in AI and the design of expert systems.

However, my argument is that there are many things which are fundamental to the usage of a language. I quote a few of them here simply to show the need for a major reform:

1) In the US alone, a trend has developed in which many grammar school kids choose to learn Chinese over, say, German. If the trend continues, a huge demand for learning Chinese must follow. If you look at the language section at Barnes & Noble or Borders, many publications aimed at Chinese language learning already crowd the book displays.

2) Historical experience has taught us that the majority of Chinese-speaking people are against abandoning the Chinese language. Digital libraries and the preservation of Chinese arts etc. must therefore be dealt with.

3) A huge knowledge base is there to help us reform the Chinese language, and the outcome could be a real surprise in terms of efficiency, productivity and economic growth!

The conclusion is that we must get together to discuss the issues. This is, frankly, something Chinese-speaking governments everywhere must do. However, some of them are so carried away by the huge scope of research in digital computers that they neglect the most fundamental problem important to them!

Please do wake up!

We are serious about our first effort to organize a smaller-scale workshop on CWW and the Chinese language, and we need your help in spreading the word and in suggesting topics as well as possible speakers or contributors.

Some of you have indeed advised many PhD theses on this or related topics, and we really do need your assistance in order to get the ball rolling!

To those of you who have already responded: we do appreciate your sincere effort and your high ideal of glorifying a wonderful “ideographically based” language, and of making it useful, functional and elegant!

We look forward to your help and participation!
