Big Data Reaches Into Literature, Art and Film
Any list of the leading novelists of the 19th century, writing in English, would almost surely include Charles Dickens, Thomas Hardy, Herman Melville, Nathaniel Hawthorne and Mark Twain.
But they do not appear at the top of a list of the most influential writers of their time. Instead, a recent study has found, Jane Austen, author of “Pride and Prejudice,” and Sir Walter Scott, the creator of “Ivanhoe,” had the greatest effect on other authors, in terms of writing style and themes.
These two were “the literary equivalent of Homo erectus, or, if you prefer, Adam and Eve,” Matthew L. Jockers wrote in research published last year. He based his conclusion on an analysis of 3,592 works published from 1780 to 1900. It was a lot of digging, and a computer did it.
The study, which involved statistical parsing and aggregation of thousands of novels, made other striking observations. For example, Austen’s works cluster tightly together in style and theme, while those of George Eliot (a k a Mary Ann Evans) range more broadly, and more closely resemble the patterns of male writers. Using similar criteria, Harriet Beecher Stowe was 20 years ahead of her time, said Mr. Jockers, whose research will soon be published in a book, “Macroanalysis: Digital Methods and Literary History” (University of Illinois Press).
These findings are hardly the last word. At this stage, this kind of digital analysis is mostly an intriguing sign that Big Data technology is steadily pushing beyond the Internet industry and scientific research into seemingly foreign fields like the social sciences and the humanities. The new tools of discovery provide a fresh look at culture, much as the microscope gave us a closer look at the subtleties of life and the telescope opened the way to faraway galaxies.
“Traditionally, literary history was done by studying a relative handful of texts,” says Mr. Jockers, an assistant professor of English and a researcher at the Center for Digital Research in the Humanities at the University of Nebraska. “What this technology does is let you see the big picture — the context in which a writer worked — on a scale we’ve never seen before.”
Mr. Jockers, 46, personifies the digital advance in the humanities. He received a Ph.D. in English literature from Southern Illinois University, but was also fascinated by computing and became a self-taught programmer. Before he moved to the University of Nebraska last year, he spent more than a decade at Stanford, where he was a founder of the Stanford Literary Lab, which is dedicated to the digital exploration of books.
Today, Mr. Jockers describes the tools of his trade in terms familiar to an Internet software engineer — algorithms that use machine learning and network analysis techniques. His mathematical models are tailored to identify word patterns and thematic elements in written text. The number and strength of links among novels determine influence, much the way Google ranks Web sites.
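The idea that the number and strength of links among novels determine influence can be illustrated with a small weighted PageRank computation. This is only a sketch under stated assumptions: the article does not give Mr. Jockers's actual model, and the function name `influence_scores`, the placeholder titles and the link strengths below are all hypothetical.

```python
def influence_scores(links, damping=0.85, iterations=100):
    """Weighted PageRank by power iteration (illustrative only).

    links maps (later_work, earlier_work) -> similarity strength (> 0).
    A later work "votes" for the earlier works it resembles.
    """
    nodes = sorted({name for pair in links for name in pair})
    n = len(nodes)

    # Total outgoing link strength per work, used to normalise its votes.
    out_weight = {name: 0.0 for name in nodes}
    for (src, _dst), weight in links.items():
        out_weight[src] += weight

    rank = {name: 1.0 / n for name in nodes}
    for _ in range(iterations):
        # Works with no outgoing links spread their rank evenly.
        dangling = sum(rank[name] for name in nodes if out_weight[name] == 0.0)
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {name: base for name in nodes}
        for (src, dst), weight in links.items():
            new_rank[dst] += damping * rank[src] * weight / out_weight[src]
        rank = new_rank
    return rank


# Hypothetical links: placeholder titles and made-up strengths.
example_links = {
    ("Novel C", "Novel A"): 0.9,
    ("Novel C", "Novel B"): 0.1,
    ("Novel D", "Novel A"): 0.8,
    ("Novel D", "Novel B"): 0.2,
}

scores = influence_scores(example_links)
for title, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{title}: {score:.3f}")
```

In this toy graph, "Novel A" ranks highest because both later works link to it strongly. A real study would first have to derive the links from measured stylistic and thematic similarity, which this sketch does not attempt.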
It is this ability to collect, measure and analyze data for meaningful insights that is the promise of Big Data technology. In the humanities and social sciences, the flood of new data comes from many sources including books scanned into digital form, Web sites, blog posts and social network communications.
Data-centric specialties are growing fast, giving rise to a new vocabulary. In political science, this quantitative analysis is called political methodology. In history, there is cliometrics, which applies econometrics to history. In literature, stylometry is the study of an author’s writing style, and these days it leans heavily on computing and statistical analysis. Culturomics is the umbrella term used to describe rigorous quantitative inquiries in the social sciences and humanities.
“Some call it computer science and some call it statistics, but the essence is that these algorithmic methods are increasingly part of every discipline now,” says Gary King, director of the Institute for Quantitative Social Science at Harvard.
Cultural data analysts often adapt biological analogies to describe their work. Mr. Jockers, for example, called his research presentation “Computing and Visualizing the 19th-Century Literary Genome.”
Such biological metaphors seem apt, because much of the research is a quantitative examination of words. Just as genes are the fundamental building blocks of biology, words are the raw material of ideas.
“What is critical and distinctive to human evolution is ideas, and how they evolve,” says Jean-Baptiste Michel, a postdoctoral fellow at Harvard.
Mr. Michel and another researcher, Erez Lieberman Aiden, led a project to mine the virtual book depository known as Google Books and to track the use of words over time, compare related words and even graph them.
Google cooperated and built software that makes the graphs available to the public. The initial version of Google’s cultural exploration site went online at the end of 2010, based on more than five million books, dating from 1500. By now, Google has scanned 20 million books, and the site is used 50 times a minute. For example, type in “women” in comparison to “men,” and you see that for centuries the number of references to men dwarfed those for women. The crossover came in 1985, with women ahead ever since.
In work published in Science magazine in 2011, Mr. Michel and the research team tapped the Google Books data to find how quickly the past fades from books. For instance, references to “1880,” which peaked in that year, fell to half by 1912, a lag of 32 years. By contrast, “1973” declined to half its peak by 1983, only 10 years later. “We are forgetting our past faster with each passing year,” the authors wrote.
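The "half-life" measurement described above amounts to simple arithmetic on a yearly frequency series: find the peak year, then the first later year in which mentions have dropped to half the peak or less. The sketch below is an illustration, not the researchers' code. The helper `half_life` is a hypothetical name, and the frequency values are synthetic, chosen only so that the result matches the 10-year lag the article reports for “1973.”

```python
def half_life(freq_by_year):
    """Return (peak_year, half_year, lag) for a {year: frequency} series.

    half_year is the first year after the peak whose frequency is at
    most half of the peak value; it is None if that never happens.
    """
    peak_year = max(freq_by_year, key=freq_by_year.get)
    half = freq_by_year[peak_year] / 2.0
    for year in sorted(y for y in freq_by_year if y > peak_year):
        if freq_by_year[year] <= half:
            return peak_year, year, year - peak_year
    return peak_year, None, None


# Synthetic relative frequencies for the term "1973" (made-up numbers).
mentions_of_1973 = {1970: 2.0, 1973: 10.0, 1975: 8.0, 1980: 6.0, 1983: 5.0, 1990: 3.0}

print(half_life(mentions_of_1973))  # (1973, 1983, 10)

# The two lags quoted in the article, as plain subtraction.
print(1912 - 1880, 1983 - 1973)  # 32 10
```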
Jon Kleinberg, a computer scientist at Cornell, and a group of researchers approached collective memory from a very different perspective.
Their work, published last year, focused on what makes spoken lines in movies memorable. Sentences that endure in the public mind are evolutionary success stories, Mr. Kleinberg says, comparing “the fitness of language and the fitness of organisms.”
As a yardstick, the researchers used the “memorable quotes” selected from the popular Internet Movie Database, or IMDb, and the number of times that a particular movie line appears on the Web. Then they compared the memorable lines to the complete scripts of the movies in which they appeared — about 1,000 movies.
To train their statistical algorithms on common sentence structure, word order and most widely used words, they fed their computers a huge archive of articles from news wires. The memorable lines consisted of surprising words embedded in sentences of ordinary structure. “We can think of memorable quotes as consisting of unusual word choices built on a scaffolding of common part-of-speech patterns,” their study said.
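One simple way to put a number on "surprising words" is to score a sentence against word frequencies learned from ordinary text. The sketch below uses an add-one smoothed unigram model and reports the average surprisal in bits per word. It assumes far more than the article states: the study's real models and its news-wire training data are not described here, so the three-sentence background corpus and the function names are hypothetical.

```python
import math
from collections import Counter


def train_unigram(corpus_sentences):
    """Count word occurrences in a list of plain-text sentences."""
    counts = Counter(
        word for sentence in corpus_sentences for word in sentence.lower().split()
    )
    return counts, sum(counts.values())


def mean_surprisal(sentence, counts, total):
    """Average surprisal in bits per word under add-one smoothing."""
    vocab_size = len(counts) + 1  # one extra slot for unseen words
    words = sentence.lower().split()
    bits = [-math.log2((counts[w] + 1) / (total + vocab_size)) for w in words]
    return sum(bits) / len(bits)


# Tiny stand-in for a background corpus of everyday sentences (made up).
background = [
    "i love the smell of coffee in the morning",
    "the coffee in the morning is good",
    "i love the morning",
]
counts, total = train_unigram(background)

ordinary = mean_surprisal("i love the smell of coffee in the morning", counts, total)
memorable = mean_surprisal("i love the smell of napalm in the morning", counts, total)
print(f"ordinary: {ordinary:.2f} bits/word, memorable: {memorable:.2f} bits/word")
```

Because “napalm” never occurs in the background text, the second sentence scores as more surprising even though every other word is identical.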
Consider the line “You had me at hello,” from the movie “Jerry Maguire.” It is, Mr. Kleinberg notes, basically the same sequence of parts of speech as the quotidian “I met him in Boston.” Or consider this line from “Apocalypse Now”: “I love the smell of napalm in the morning.” Only one word separates that utterance from this: “I love the smell of coffee in the morning.”
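The comparison in these examples can be made concrete by mapping each word to a part-of-speech tag and comparing the resulting sequences. The sketch below uses a tiny hand-assigned tag table, `TOY_TAGS`, which is a hypothetical stand-in for a real tagger and covers only the words in the quoted lines. Treating “hello” as a noun is an assumption made for illustration.

```python
# Hand-assigned tags for the words in the quoted lines only (an assumption,
# not the output of a trained part-of-speech tagger).
TOY_TAGS = {
    "you": "PRON", "i": "PRON", "me": "PRON", "him": "PRON",
    "had": "VERB", "met": "VERB", "love": "VERB",
    "at": "PREP", "in": "PREP", "of": "PREP",
    "the": "DET",
    "hello": "NOUN", "boston": "NOUN", "smell": "NOUN",
    "napalm": "NOUN", "coffee": "NOUN", "morning": "NOUN",
}


def tokens(sentence):
    """Lower-case the sentence and strip simple punctuation from each word."""
    return [word.strip(".,!?\"'").lower() for word in sentence.split()]


def pos_pattern(sentence):
    """Return the sequence of toy part-of-speech tags for a sentence."""
    return [TOY_TAGS[word] for word in tokens(sentence)]


def differing_words(first, second):
    """Pair up words position by position and keep the pairs that differ.

    Assumes both sentences have the same number of words.
    """
    return [(a, b) for a, b in zip(tokens(first), tokens(second)) if a != b]


print(pos_pattern("You had me at hello."))
print(pos_pattern("I met him in Boston."))
print(differing_words(
    "I love the smell of napalm in the morning.",
    "I love the smell of coffee in the morning.",
))
```

Under these toy tags the first two lines share one pattern (PRON VERB PRON PREP NOUN), and the last two sentences differ only in the pair napalm and coffee.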
This kind of analysis can be used for all kinds of communications, including advertising. Indeed, Mr. Kleinberg’s group also looked at ad slogans. Statistically, the ones most similar to memorable movie quotes included “Quality never goes out of style,” for Levi’s jeans, and “Come to Marlboro Country,” for Marlboro cigarettes.