Google and Michigan's Library
Google Books is an astonishing, epochal development. But I am tired of journalists who write about it ignoring the fact that the core of the digitized collection is the 7 million volumes at the University of Michigan. It is always mentioned that Harvard and Oxford are participating. Last I knew, Oxford offered them access to 40,000 volumes. But U-M President Mary Sue Coleman offered Google access to everything we have, and has been good at giving the project an intellectual rationale. Google is going through the shelves and running optical character recognition on every book they encounter, in every language.
One problem: I am already finding poorly done books, where every other page is blurred beyond reading. This is very bad because I don't know when it would ever be corrected, and no one would have an incentive to carry out this sort of project once Google has.
I hope Google will tighten its quality control, and will commit to redoing flawed scanning jobs. For many research projects, solid results can be reached only if everything is there. I fear that some of its subcontractors may be taking advantage and doing shoddy work, especially in languages other than English.
A second, general problem with Google is that on the whole it is no good at searching by date. Why is that so hard to put in a search engine? Is it that programmers just don't appreciate the desirability of being able to study instances of the word "liberte" in France, 1700-1789? You can put dates in the searches, but in my experience that doesn't return satisfactory results. If Google wants the project to have maximum impact, they need to address this problem. (It would be nice to address it in their general web search engine, too. Have you ever tried to find a document put up on the Web in 1998, where you don't remember whole search strings?) Otherwise, I see a business opportunity for a historian who has good programming skills . . .
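To make the request concrete, here is a minimal sketch of the kind of query I have in mind. It assumes each scanned text carries a publication year and a place of publication as metadata. The `Page` record, the `search` function, and the sample titles are all hypothetical and say nothing about how Google actually stores or indexes books. Accents are folded so that typing "liberte" also finds "liberté".

```python
import unicodedata
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical record: one scanned text plus its catalogue metadata."""
    title: str
    country: str
    year: int
    text: str


def fold(s):
    """Lowercase and strip accents so 'liberte' matches 'liberté'."""
    decomposed = unicodedata.normalize("NFKD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()


def search(pages, term, start_year, end_year, country=None):
    """Return pages that contain the term and fall inside the year range."""
    needle = fold(term)
    return [
        p for p in pages
        if start_year <= p.year <= end_year
        and (country is None or p.country == country)
        and needle in fold(p.text)
    ]


# Invented sample data, for illustration only.
SAMPLE = [
    Page("Pamphlet A", "France", 1765, "La liberté de la presse"),
    Page("Pamphlet B", "France", 1795, "Liberté, égalité"),
    Page("Tract C", "England", 1750, "On liberty and law"),
]

hits = search(SAMPLE, "liberte", 1700, 1789, country="France")
print([p.title for p in hits])
```

The filter itself is trivial. The hard part is presumably having trustworthy dates attached to every text in the first place.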
Google is about to change everything in scholarship and maybe in the history of reading. Anyone interested in that issue should look at Elizabeth Eisenstein's "The Printing Press as an Agent of Change" and the works of Roger Chartier.
What is more, Google is bringing 1,000 jobs to the state of Michigan, which is the best news we've had since the 1967 riots that were Detroit's Katrina. I sometimes get readers writing that they had been under the impression that Detroit had come back. Well, there is the Renaissance Center. But most of the city limps along and is still firing police and firemen and making teachers take pay cuts. A Washtenaw county high tech corridor won't solve the problem, but anything that brings jobs and investment into the area is better than a kick in the head.
I wish there were a way to put an emphasis on training some young Detroiters in programming and begin reversing the brain drain that started in 1967.

13 Comments:
Frankly the whole state needs a boost. As a resident of Lansing for three years, I saw a continued decline in the area's economy, with the only bright spot being East Lansing (Go Green?!), but outside that not much else (although the city mayors were sure trying).
I wonder if Detroit politicians regret handing over Poletown to GM (or was it Ford?) only to have them abandon town a decade or so later?
At least the Tigers are back =)
I agree about the quality of Google Books, including their downloads. In doing research on Aristotle, I am struck by how few of his works are available for download. Often, as you say, there is a lot of fuzziness.
There was a long discussion on DailyKos yesterday about what topics and events should be included for YearlyKos07. The location has not been decided yet, and someone suggested Detroit. That was seconded with a suggestion to invite you, Dr. Cole. (My vote goes to New Orleans as first choice, Detroit second. Both with the reasoning that those cities need conventions.)
About the programming and date recognition etc., I've been involved with testing both library cataloguing and, not least, genealogy programs. If Google doesn't do user group testing, they aren't serious. Date recognition is tricky, something that developers of genealogy programs have to work on a lot. The changeover from the Julian to the Gregorian calendar happened in different places on different dates throughout more than a couple of centuries. Not to mention the calendars you are familiar with which don't date from the birth of Christ. Sounds like there is too great an emphasis on quantity over quality.
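A rough sketch of why the changeover is tricky: the same written date means a different day depending on where it was recorded. The code below assumes a tiny, hand-picked table of changeover dates (Spain in 1582, Great Britain in 1752, Russia in 1918), and the names `GREGORIAN_START` and `normalize` are made up for the illustration. It converts a Julian date through its Julian Day Number. It does not handle the days skipped during each changeover, regions that switched back and forth, or years that began on a date other than January 1, all of which a real genealogy program would have to deal with.

```python
from datetime import date

# First day on which each region used the Gregorian calendar.
# Illustrative subset only; a real table would be far longer.
GREGORIAN_START = {
    "spain": (1582, 10, 15),
    "great_britain": (1752, 9, 14),
    "russia": (1918, 2, 14),
}

# Offset between a Julian Day Number and Python's proleptic Gregorian ordinal.
_JDN_TO_ORDINAL = 1721425


def julian_to_jdn(year, month, day):
    """Julian Day Number of a date written in the Julian calendar."""
    a = (14 - month) // 12
    y = year + 4800 - a
    m = month + 12 * a - 3
    return day + (153 * m + 2) // 5 + 365 * y + y // 4 - 32083


def normalize(year, month, day, region):
    """Return the Gregorian date for a date as written in the given region."""
    if (year, month, day) >= GREGORIAN_START[region]:
        # Already a Gregorian date in that region.
        return date(year, month, day)
    # Written before the changeover, so interpret it as a Julian date.
    return date.fromordinal(julian_to_jdn(year, month, day) - _JDN_TO_ORDINAL)


# The same written date is two different days in Madrid and in London.
print(normalize(1700, 3, 1, "spain"))          # 1700-03-01
print(normalize(1700, 3, 1, "great_britain"))  # 1700-03-12
```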
last year i scanned one 100-page book and one 400-page book, originally published in the 40s and 50s respectively, for a personal research project comparing the works of two authors using statistical text analysis. the scans were run through omnipage, an optical character recognition program, to convert the scans to editable text files. because of the still-fledgling nature of ocr, much human labor still must be invested in error-correction. quality control is key, especially if your work depends on 100% accuracy.
since this was a personal project, with no budget but my time, and since i could trust no one else to do it correctly, especially if they weren't being paid, i did this all myself, in my free time. let me make one thing clear: scanning and ocr is tedious and time-consuming. the most i could get through in one sitting was twenty pages. i usually did ten, often settled for five. attempting to move faster only increased the probability for error. and if you care about whether the double-quotes are in fact double-quotes and not two single quotes, whether the em-dashes are not hyphens and whether the e-acutes all have the correctly-angled accents, you will check and double-check and triple-check.
once i had finished my analysis, i went the further step of importing the complete text of the two books, and their scanned illustrations, into quarkxpress, and created new layouts, faithful to the original publications. i output the two books as pdfs, which now sit on my hard disk (as well as being archived to cd), where they will remain since i do not have the rights to redistribute them.
in all, the work consumed six months. if i were being paid to do this full time, i would budget up to 50 pages per day per person just for the scanning and ocr, and partner that person with a teammate to do the error-correction. the two could occasionally trade places for welcome variety. at this rate you'd need an enormous staff and several years to tackle millions of volumes. but unless you hire people who care about the quotes, em-dashes and e-acutes, you're going to end up with shoddy work. and if you care more about the volume of pages you can get through, you're going to end up with shoddy work.
but, unless all this is rendered irrelevant by an errant asteroid, i'm pretty confident that we'll eventually convert all our important printed matter to digital format.
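The staffing estimate above can be turned into a back-of-the-envelope calculation. Only the 50 pages per day per person and the two-person team come from the comment, and the 7 million volumes come from the post. The 300 pages per volume, the 250 working days per year, and the ten-year schedule are assumptions picked purely for illustration, and the function names are made up.

```python
import math


def person_years(volumes, pages_per_volume, pages_per_day, working_days_per_year):
    """Person-years of scanning and OCR at a fixed daily page rate."""
    total_pages = volumes * pages_per_volume
    pages_per_person_year = pages_per_day * working_days_per_year
    return total_pages / pages_per_person_year


def staff_needed(volumes, pages_per_volume, pages_per_day,
                 working_days_per_year, years, people_per_team=2):
    """Headcount needed to finish in the given number of years.

    people_per_team=2 pairs each scanner with an error-correcting teammate.
    """
    scanning = person_years(volumes, pages_per_volume,
                            pages_per_day, working_days_per_year)
    return math.ceil(scanning * people_per_team / years)


# 7 million volumes (from the post), 50 pages/day (from the comment);
# 300 pages per volume and 250 working days per year are assumed.
print(person_years(7_000_000, 300, 50, 250))      # 168000.0
print(staff_needed(7_000_000, 300, 50, 250, 10))  # 33600
```

Under those assumptions, careful hand-checked work at this pace would take tens of thousands of people a decade, which is why a project of this size ends up trading accuracy for speed.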
As someone who works in a higher tech job in Grand Rapids, I don't see the Ann Arbor thing as being a strong catalyst for Michigan by itself. Michigan has a relatively small number of high tech jobs, yet when companies do need to hire high tech talent, they have a hard time finding it in Michigan. I am not sure why they don't find it in Michigan; is it because there isn't a high tech community, or have the high tech workers all given up and gone elsewhere? Maybe Google can break this cycle, but I see them bringing advertising jobs, not engineering jobs. Engineers at the Big Three make a lot more than $50,000 on average.
For looking at old versions of webpages, www.archive.org works ok.
http://web.archive.org/web/19990203001539/
Hmmm, Juan, sounds like a little "Go Blue" U of Mich spirit seeped in here, though justified by the University's genuine contribution! Re your wish that there was a way to train Detroiters in computer skills, there is: get rid of these damn Republican right-wing administrations that starve the public sector (except for defense and "security") and depend solely upon "trickle down" for everything else. Haven't Americans been trickled on enough yet?
You report "finding poorly done books, where every other page is blurred beyond reading." Does the Google library digitization project produce digital text files or optical images? The latter might be nice to conserve illustrations or facsimiles of hand manuscripts, but are NOT good to preserve texts per se. Humble HTML files are searchable and offer the widest, easiest access. However, a crooked or dirty scan with no proofreading produces unreadable files riddled with errors. Is there no proofreading? Who does Google hire and what do they pay? This cannot be a bibliophobe task to be measured by gross pages per day. It can take a full day to convert a single chapter from a yellowed old book with small type in two columns into an error-free HTML file.
Is it unreal to dream that the various university presses will put all their scholarly works into the database? A library composed only of public domain items published before 1900 has its limitations, particularly if the selection process is simply the order of library index number, meaning a long wait to digitize history tomes (900s). But I'd not bet small town folks in Pierre or Saskatchewan will have any easy dialup access to OUP's historical series any time soon.
Google is following its own lights and I'm afraid that its stunning expertise, of the mathematical kind, in information science has gone to its head.
It is astounding that they have obviously not consulted working scholars such as yourself during the design of their database, or at least in the design of its access tools. Your discussion of the lack of a means of specifying epochs makes it clear that they have not done so.
Google is in severe danger of becoming a corporate NSA, and this is doubly dangerous to all of us humans on the planet because they are actually competent and able to catalogue all the information that the NSA can only collect and sit upon.
Google guarantees it will read all your mail, and all your correspondents' mail to you along with it. And people line up to become customers! I will not correspond with a gmail address and you shouldn't either.
And now people are voluntarily going to let google deliver a search engine to their own computers to add all their "private" data to the googlebase? And they're going to start storing their word processor documents and spreadsheets in the googlebase?
I guess there is one born every nanosecond.
It looks like their google library is getting exactly the amount of attention a PR project requires, while the googlebase is growing like Topsy.
I found a snippet of a page on Google Books with information on something I am researching. It was from a scholarly journal published since the 1920s. The only identification was that it came from "page 35." No year, no volume or issue number.
The University of Michigan's "Making of America" website has been a terrific online resource for years (Cornell's is even better).
There's nothing new about companies doing a lousy job. I remember reading in a book on counterinsurgency in the Philippines that the microfilm copies of government records are all but unreadable because of the lousy job they did photographing them. Of course the originals are worse, which means they're probably lost.
In Google, if you also want to search by date, you can use the advanced book search options (see the small-text link next to the search box) or click on the link below:
http://books.google.com/advanced_book_search
It seems to work for me. Also there are some options for searching by when the page was indexed. Check the link below:
http://faculty.valencia.cc.fl.us/infolit/Google/help.htm
Dr. Cole,
The ability to narrow a book search to a specific publication date range already exists with Google Advanced Book Search (http://books.google.com/advanced_book_search).