Transcribing billions of pages of handwritten documents is no easy task. Between the effects of time on paper and ink, the vagaries of individual handwriting, and history’s less-than-consistent spelling conventions, making sure historic records are intelligible (much less full-text searchable!) is easier said than done.
What tools and processes do you suggest we use to transcribe NARA’s billions of pages of handwritten documents quickly and efficiently?
What are some examples of the technologies that exist for this now? (For those of us unfamiliar.)
Crowdsourcing? Make high-quality scans and post them someplace where anybody with an account can add keywords or transcriptions. Just post them under some sort of multi-level category, not just NARA scans.
Perhaps human eyes, brains and typing fingers are the only viable solution at this stage. I’d be interested to know if there are any automated character recognition programs that could cope with a variety of handwriting without self-destructing within minutes.
Maybe just opening it up to the public en masse is the answer. Let the thousands of eyes, brains and fingers that are out there do the work. The Australian National Library’s historic newspapers project seems to be going well using this method.
http://newspapers.nla.gov.au/ndp/del/home
We should also seriously consider why we want to transcribe collection material en masse. What are the real benefits? Is it really worth it?
On the one hand it’s great to open up enhanced access to these resources. On the other hand however, I can’t help thinking that it’s an awful lot of work for how much real benefit?
Personally I really like the adventure of poring over handwritten documents, discovering the little gems that jump out from the written words. Technology-generated text just leaves me a bit cold by comparison.
And are we inviting a generation of mindless keyword searchers to continue their frivolous ways, or should we just be encouraging ‘readers’ instead?
Not to be too much of a Negative-Nellie; more so the Devil’s Advocate perhaps?
Cheers,
Craig Tibbitts
Official and Private Records
Research Centre
Australian War Memorial.
Get involved with crowd-sourced or micro-volunteering efforts such as:
http://www.beextra.org/
Have you looked at the reCAPTCHA program? It uses the challenge images that gate access to web sites as a channel for crowdsourced text recognition. Framing the process as a game or some other form of activity may help as well.
This is a very interesting question with some great contributions above.
On my first view of documents with the NARA partner FOOTNOTE’s website I was impressed with the way they use crowdsourcing coupled with a sophisticated user interface, which allows users to box or highlight sections of handwritten or otherwise difficult-to-decipher text and give a transcription.
Like the National Library of Australia project which Craig Tibbitts refers to, the Footnote method also permits credit to named or screen-named users. This encourages users, lets them self-index work they wish to return to, and enables other users to track down similar topics of interest by following the indexer’s work.
It may also be useful to educate users in methods of interpreting archaic handwriting to increase productivity and effectiveness. Useful tips pages like this one:
http://archivesoutside.records.nsw.gov.au/useful-tips-for-reading-handwritten-documents/
can help; I’m sure NARA must have something similar. An online training course may also be useful for users. All such things take time and money to implement.
The National Library of Australia project to utilize the transcription of the many may (or may not) already have been superseded by Google’s Newspaper Archiving project, which has done some of the same work better.
I give examples of that last point in my blog piece about it “Should the National Library of Australia throw their lot in with Google News Archives” [ I won’t link to that piece lest I contravene your comment rules ]
The implication is that the NLA perhaps could not have foreseen that Google would do the same work better, but should now think about options other than their crowdsourced solution.
I would be interested to know if Google could develop some technology to scan, interpret and index NARA’s handwritten documents.
Then there’s the question of NARA partnering with a public company like Google. That is a Rubicon already crossed by the partnership with Footnote, but one with anti-trust implications and also longevity questions.
A minor aside, but well worth considering as an example of what one person examining handwritten documents can do, is the work of Irma Havlicek of the Powerhouse Museum on the historical letters of the Sydney Observatory, a project for the International Year of Astronomy:
http://www.sydneyobservatory.com.au/historicalletters/
Redundant crowdsourcing is the most economical approach right now. You could also do what Ancestry.com did for old Federal Census returns — outsource the work to a Chinese company, which had staff manually review and enter the handwritten data into an online database.
Or, wait five years and see what Google or a stealth startup can manage in the field of advanced OCR.
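For readers unfamiliar with the term, "redundant" crowdsourcing usually means showing the same line of text to several volunteers and accepting a reading only when enough of them agree. The following is a minimal sketch of that idea, not a description of how any project named in this thread actually works; the function names, the normalization rule, and the agreement threshold are all assumptions chosen for illustration.

```python
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not split votes."""
    return " ".join(text.lower().split())


def consensus(transcriptions, min_agreement=2):
    """Return the agreed reading, or None if the line should go to a human reviewer.

    A reading is accepted only if at least `min_agreement` volunteers
    (a hypothetical threshold) submitted it and it holds a strict majority.
    """
    votes = Counter(normalize(t) for t in transcriptions)
    if not votes:
        return None
    reading, count = votes.most_common(1)[0]
    total = sum(votes.values())
    if count >= min_agreement and count * 2 > total:
        return reading
    return None


# Three volunteers transcribe the same line; two agree after normalization.
print(consensus(["John Smith", "john  smith", "Jon Smith"]))
```

Lines that return None would be queued for an experienced transcriber, which is one way to limit the damage a single careless or malicious volunteer can do.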
Hello all- this has been a really insightful discussion so far! I wanted to thank you for your feedback and to let you know that we’ve posted a follow-up response here: http://blogs.archives.gov/online-public-access/?p=465. We decided to write it as a separate post rather than a comment since it’s pretty link-heavy and we didn’t want to hamper your reading pleasure by making you slog through all the URL coding. Thanks!
-Kristen
Marc Moskowitz,
Thanks for your suggestion about reCaptcha. We’ve been in contact with Professor Von Ahn, the developer of reCaptcha. His program has not yet been used for transcription of handwriting, but I’m sure this is something we will keep in mind.
Thanks again!
Jill (admin)
Have you considered using the same method that was used to transcribe the Ellis Island records and the Freedman’s Bank records? It’s all volunteer and therefore free.
Thanks for the suggestion Sherry! We are really excited about the great response we have received about transcription in the archives, and have created a new blog entry responding to your ideas at http://blogs.archives.gov/online-public-access/?p=465. Thanks again, and keep up the good work!
Whenever you open up volunteer transcription, you always have one bad apple who gets online and does really bad things. Free transcription requires monitoring of some kind. There are pros and cons to everything. Banks do have programs to OCR signatures, but so far have found nothing that extends to full documents.
There is no good solution for this problem right now. It is quite hard to handle handwritten papers.
I think there is no solution for this problem. It’s really tough to handle those papers.
Thanks
Jhon
As Ian Lamont pointed out, China might represent the best solution for now to transcribe billions of pages of handwritten documents. Efficiently, I don’t know, but quickly, for sure! 🙂
We are not writing letters on paper anymore because email is fast and more convenient. We don’t use the pen or pencil to take notes in libraries because our research is done online and notes are often stored in digital scrapbooks or Word documents for easy retrieval.
I found this on a record and would like to know what it links to T91451