[Milton-L] RE: Culturomics? Genome?
Gilliatt, Cynthia Ann - gilliaca
gilliaca at jmu.edu
Fri Dec 17 11:31:08 EST 2010
.this in today's Guardian about two "culturomics" researchers at Harvard who are using Google data and $ to study the English language "genome":
"In their initial analysis of the database, the team found that around 8,500 new words enter the English language every year and the lexicon grew by 70% between 1950 and 2000. But most of these words do not appear in dictionaries. "We estimated that 52% of the English lexicon – the majority of words used in English books – consist of lexical 'dark matter' undocumented in standard references," they wrote in the journal Science (the full paper is available with free online registration)."
So how did their computer know they were words? And what dictionaries did they use? Did they include proper names?
"Let's talk a bit about terms like "culturomics" and "genome" and the apparent need to sound like a scientist (a wacky scientist at that) in order to be taken seriously by the media and govt grant dispensers these days."
"But first, let me try to cast some doubt on the notion that 52 % of the English lexicon (as represented by 4 % of the books ever published in English) the majority of words used in English books do not appear in any dictionaries or other reference books."
Which 4% of books printed in English? Who chose? Did they include texts in Early Modern English? Or were the texts all 20th/21st c?
"This claim falls so far outside my experience as a reader and dictionary user that I want say. Are you kidding? Maybe their computer algorithm is good at searching a word database and very very poor at using a dictionary. I suspect that their search algorithm (Harvard's, not Google's) fails to allow for any sort of conjugation and inflection, so, for example, the word, "indirectly" comes up as "dark matter.""
Dark matter indeed. Well worth discussing. Thanks.
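
To make the suspicion quoted above concrete, here is a minimal sketch in Python of how an exact-match dictionary lookup could count an inflected form such as "indirectly" as undocumented, while a lookup that strips a few common suffixes would not. Everything in it is an assumption made for illustration: the tiny headword list, the suffix list, and the function names are hypothetical, and nothing here is claimed to be the method the Harvard researchers actually used.

```python
# Hypothetical headword list; a real dictionary would be far larger.
HEADWORDS = {"indirect", "word", "dictionary", "search"}

# A few common English suffixes, for illustration only.
SUFFIXES = ("ing", "ly", "ed", "es", "s")


def exact_lookup(token):
    """Return True only if the token itself is a headword."""
    return token.lower() in HEADWORDS


def inflection_aware_lookup(token):
    """Return True if the token, or the token minus one suffix, is a headword."""
    word = token.lower()
    if word in HEADWORDS:
        return True
    return any(
        word.endswith(suffix) and word[: -len(suffix)] in HEADWORDS
        for suffix in SUFFIXES
    )


def dark_matter(tokens, lookup):
    """List the tokens that the given lookup fails to find."""
    return [t for t in tokens if not lookup(t)]


tokens = ["indirectly", "words", "searched", "culturomics"]
print(dark_matter(tokens, exact_lookup))
print(dark_matter(tokens, inflection_aware_lookup))
```

With the exact lookup, all four sample tokens count as "dark matter"; with the suffix-stripping lookup, only "culturomics" does. Real morphology is much messier than dropping a suffix (consider "ran" or "better"), so this only shows how much the 52% figure could depend on how the matching was done.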