I started writing this post about making order-of-magnitude estimates, but it turned into more of a question of what counts as a “hit” when we search through scanned books.

Part I: Estimating Numbers

A few months back I had a question: how many volumes does Google Books actually have in the mainland Nordic languages?

It turns out answering this question is a little tricky. While you can constrain a query to a given language in the “Advanced Search” function of Google Books, you can’t just return a list of all books in that language. If you try to leave the “Search For” field blank, in an attempt to return a list of all books, you just get redirected to the home page of google.com, with nary an apology. So you have to search for something — anything — that will trigger at least one hit per book.

What can we search for that will be in every published volume in a language? The words for “I” wouldn’t necessarily work — there might be a history book without any dialogue. Searching for “said” would return nearly every novel, but very few plays.

What I settled on was to search for indefinite articles — the equivalent of “a” and “an” in English. (Definite articles in these languages are suffixed onto the noun, so that wouldn’t work.) Modern Nordic languages (except some kinds of Norwegian) have two grammatical genders, endearingly called neutrum and utrum in Swedish, or neuter and common-gender. (Yes, in the Swedish language at least, gender has indeed collapsed.) With some minor orthographic variations, these work out to “en” and “ett” in most cases. To sum up: if we search on either of these two words, we should get our one required hit in 99.999% of all texts.

(As an aside, you may wonder whether searching for one indefinite article or the other influences the search results. The answer is it does, but at very small levels that I suspect may be noise, or slightly-out-of-date indexes.)

To generate a quick and dirty graph, I chose to chunk the results in decade-long sections. I figured this would give me a sense of change over time, without making me do 100 separate queries times three languages.
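To make the saving concrete, here is a minimal Python sketch of the query plan. The year range (1820 to 1919) and the helper names `decade_ranges` and `query_plan` are my own assumptions for illustration; the searches themselves were done by hand in the Advanced Search form, not through any API.

```python
def decade_ranges(first_year, last_year):
    """Yield (start, end) year pairs covering the span in ten-year chunks."""
    for start in range(first_year, last_year + 1, 10):
        yield start, min(start + 9, last_year)


def query_plan(languages, first_year, last_year):
    """List one (language, start_year, end_year) query per language and decade."""
    return [
        (language, start, end)
        for language in languages
        for start, end in decade_ranges(first_year, last_year)
    ]


# Hypothetical span of 100 years; the real survey's range may have differed.
languages = ["Danish", "Norwegian", "Swedish"]
plan = query_plan(languages, 1820, 1919)

print(len(plan))   # 30 decade-sized queries
print(100 * 3)     # 300 queries if done year by year
```

Each tuple in `plan` corresponds to one manual search restricted to a language and a date range.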

So! What are the results of this admittedly unscientific survey?

[Graph: full-text Scandinavian books digitized per decade]

First off, we should say that publishing did not cease in Scandinavia around 1920. I restricted my queries to full-text-available only, which essentially means no longer in copyright. Between libraries’ decisions to only scan out-of-copyright texts, and the search engine obeying my command to only display full-text books, the data just appears to go to zero.

Secondly, there is the question of which collections these books came from. In spot checking some of these results, I’ve found books from collections ranging from Harvard to the NYPL to the Bavarian State Library. But we also know that the Royal Library in Denmark (which serves as both the national library and the library of the University of Copenhagen) has an agreement with Google Books to digitize parts of its collections. (See this presentation for more details.) So the very noticeable spikes in the Danish material may be an artifact of one library giving Google everything from 1900-1910 as a test, for example. We would need to compare this graph to an actual history of publishing in the three Nordic countries to know how well it maps to what was actually printed.

All told, however, my results seem to suggest that Google Books had roughly 160,000 full-text volumes in Norwegian, Danish and Swedish as of March 2010. Keep in mind, that’s texts of all types, including things like church records, some serials, and no doubt a lot of ephemera.

Part II: Results and Context

In the time since I did this quick and dirty analysis, Google has deployed a refinement in their search system: an attempt to weld together their discrete searches for media such as videos, images, books, status updates, etc. But this change has had a negative effect on our ability to conduct even approximate counting such as that shown in the graph above.

Most commonly called “The New Sidebar” in online discussions, this feature seemed to deploy in a distributed fashion between November 2009 and May 2010. Either as part of this update, or during roughly the same timeframe, Google Books stopped counting the number of volumes with a given word you searched for, and started counting the number of hits. So for example I can now tell you that the word “en” was used 83,000 times in Swedish-language books in the year 1900, but I can’t tell you how many books that represents. (Clearly, Sweden did not publish 227 books every day that year, or I would have never gotten through my General Exam reading list.)
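The 227 figure is just the hit count divided by the days in the year. A quick sketch of that arithmetic in Python, using only the number reported above (the variable names are mine):

```python
hits_for_en_in_1900 = 83_000   # hits reported for "en", Swedish, year 1900
days_in_1900 = 365             # 1900 was not a leap year

# If every hit were a separate book, this is how many books per day
# Sweden would have had to publish.
books_per_day = hits_for_en_in_1900 / days_in_1900

print(round(books_per_day))    # 227
```

The implausibility of that rate is what shows the counter is tallying occurrences rather than volumes.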

The new system is, undoubtedly, more useful for certain things, such as finding the frequency of term occurrence over time. But unless I’m missing some big obvious button somewhere, we have lost the ability to count works that include a word or phrase. The only way we can get the total number of works, it seems, would be to go to the end of the results pages for each query, count the total pages and then multiply by 10, the number of results on every page.
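The page-counting workaround can be sketched as a small Python function. This is only an illustration under my own assumptions: the function name `estimate_works` and the example page counts are hypothetical, and the optional correction for a partly filled last page is my addition to the plain multiply-by-10 rule.

```python
def estimate_works(total_pages, results_per_page=10, last_page_count=None):
    """Estimate how many works a query returned from its page count.

    If last_page_count is given, the final page is counted exactly
    rather than being assumed full.
    """
    if total_pages <= 0:
        return 0
    if last_page_count is None:
        return total_pages * results_per_page
    return (total_pages - 1) * results_per_page + last_page_count


# Hypothetical example: a query whose results run to 12 pages.
print(estimate_works(12))                      # 120
print(estimate_works(12, last_page_count=3))   # 113
```

Multiplying by 10 alone can overcount by up to nine works per query, which matters little for one search but adds up across thirty decade-sized queries.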

Finally, it’s worth noting that the accuracy of Google Books’ metadata has been the subject of some debate. Even when the system reported the number of works containing a given search term, as in the chart above, I was trusting both the publication date and the language fields. Geoff Nunberg’s investigation and critique of precisely these fields received a lot of attention last year; what I’ll link to, however, is the interesting post by Google’s Jon Orwant in the comments section below. My feeling is that in my own field, Scandinavian Literature, we may be facing questions of metadata quality that are as yet unexposed, because far fewer people read these languages than read English.

Still, there’s no doubt that eventually Google Books will be a scholarly resource of first measure, irrespective of whether it was originally intended to be or, indeed, is presently run to be. What will help is input and feedback from people working in literature (and corpus linguistics, to name a field with several decades’ more experience with problems like these). Sussing out what constitutes a “hit” (linguistic lemma or published volume) is one of the more interesting questions we’ll all have to think about in the future.

A little Taliesin Associates-esque mid-century modern, in the middle of the Utah desert.

Thunderbird Inn

Cisco Cius

Part of CEO John Chambers’s speech here at the annual Cisco Convention was a surprise product announcement: a new business-focused tablet computer based on Google’s Android operating system. You can read more coverage of the intro from Engadget and Gizmodo, but on the show floor itself the new device was imprisoned behind glass:

Cisco Cius

Despite the business focus of the product itself, the device’s unveiling during the keynote used primary education as the context. Actors portraying students, parents and teachers put the tablet to work pitching the video-conferencing and e-textbook capabilities. (The latency of a satellite hookup to a research vessel scotched the dream of seamless telepresence during the demo, unfortunately.)

But whether Cisco chooses to focus on the classroom or the boardroom (or both), several questions remain about its entry into a crowded tablet market. Non-phone devices based on Android have had a rocky road getting the key differentiator of that operating system, the open Android market, to work. Companies that have brought Android-based tablets to market, such as Archos, have found themselves both stuck with older versions of the OS and locked out of the vibrant Marketplace: a software distribution system much more open and less controlled than Apple’s App Store, but paradoxically unavailable for any device Google refuses to authorize.

Put another way: take away the apps that Google requires co-branding for (Gmail) and won’t allow tablets to use (Marketplace), and you end up with a much less compelling story for a competitor to the current market leader, Apple’s iPad. Though Cisco’s expertise in enterprise features such as IP telephony and video telepresence can make the Cius a well-fitting cog in a corporation’s existing IT infrastructure, users may wonder why they’re kept out of the dynamic and ever-growing Android software marketplace for seemingly arbitrary reasons.

How does a Fortune 500 company present its CEO to 10,000 customers during an economic downturn?

John Chambers keynote

With a lot of confidence, apparently. Despite backing the wrong horse in 2008, John Chambers is bullish on the economic recovery, as any CEO whose bottom line depends on expanding businesses would be.

John Chambers keynote

There’s evidence he has good reason to believe in Cisco’s performance. Since heralding a bold new assault on the data center last year — going to battle against systems integrators such as Dell and HP — Cisco has proven itself an unexpectedly strong competitor to traditional hardware companies in selling integrated server systems, incorporating everything from CPU to disks to, naturally, the routers and switches.

Cisco Live 2010

Instead of building all these elements itself, Cisco has partnered with vendors such as EMC and its subsidiary VMware. Put the server, disk subsystem, and network switch into one box and you end up with a “VBlock”, which Cisco will ship to your door ready to go, like the Lunchables of the data center:

Cisco Live 2010

More interesting than the back-office equipment, however, is Cisco’s new play for a “business tablet” — the Cius. I’ll cover that in my next post.

Cisco Live 2010

Returning to Cisco’s annual convention this summer (after a one-year absence) finds me in Las Vegas during 109° weather and reminds me that Nevada-based tech conferences are much more enjoyable in March than in the beginning of July. Luckily the hotel (Luxor) connects to the convention center (Mandalay Bay) through indoor passages, where one walks past oversized styrofoam logos such as these:

Cisco Live 2010

Icelandic Coffee

Icelandic Coffee

Trophy Cupcake

Trophy Cupcakes

Editing a chapter

Thesis time

Carsonville, Sanilac County, 1912

Who knew there was a website dedicated to small train stations in rural Michigan? In this case, the page on Carsonville, Sanilac County offers a few different views of the same station that my great-great grandfather sent as a postcard in 1912, above.

Easter Eggs 2010

Easter 2010

Some pictures from the Mad Cow String Band’s set at El Corazon:

Mad Cow String Band

Finishing up the March trip to Tucson, here’s what is the northernmost of a string of missions that stretches down into Mexico.

Mission San Xavier del Bac
