My Stories: library


Saturday, 07 November 2009

ePubs and quality

You may have heard news about the release of "bookserver" by the good folks at the Internet Archive. This is a DRM-free ePub ecosystem, initially stocked with the prodigious output of Google's book scanning project and the Internet Archive's own book scanning project.

To see how the NZETC stacked up against the much larger (and better funded) collection, I picked one of our Maori language dictionaries. Month after month, our Maori and Pacifica dictionaries make up the bulk of our top five most-used resources, so they're in demand. They're also an appropriate choice because when the NZETC encoded them into TEI, the decision was made not to use full dictionary encoding but a cheaper and easier tradeoff, which didn't capture the linguistic semantics of the underlying entries and instead treated them as typeset text. I was interested in how well this tradeoff was wearing.

I did my comparison using the new Firefox ePub plugin; things will be slightly different if you're reading these ePubs on an iPhone or Kindle.

The ePub I looked at was A Dictionary of the Maori Language by Herbert W. Williams. The NZETC has the 1957 sixth edition. There are two versions of the work on bookserver: an 1852 second edition scanned by Google Books (original at the New York Public Library) and an 1871 third edition scanned by the Internet Archive in association with Microsoft (original in the University of California library system). All the processing of both works appears to have been done in the U.S. The original print used macrons (NZETC), acutes (Google) and breves (Internet Archive) to mark long vowels. Find them here.

Let's take a look at some entries from each, starting at 'kapukapu':

NZETC:

kapukapu. 1. n. Sole of the foot.

2. Apparently a synonym for kaunoti, the firestick which was kept steady with the foot. Tena ka riro, i runga i nga hanga a Taikomako, i te kapukapu, i te kaunoti (M. 351).

3. v.i. Curl (as a wave). Ka kapukapu mai te ngaru.

4. Gush.

5. Gleam, glisten. Katahi ki te huka o Huiarau, kapukapu ana tera.

Kapua, n. 1. Cloud, bank of clouds. E tutakitaki ana nga kapua o te rangi, kei runga te Mangoroa e kopae pu ana (P.).

2. A flinty stone. = kapuarangi.

3. Polyprion oxygeneios, a fish. = hapuku.

4. An edible species of fungus.

5. Part of the titi pattern of tattooing.

Kapuarangi, n. A variety of matā, or cutting stone, of inferior quality. = kapua, 2.

Kāpuhi, kāpuhipuhi, n. Cluster of branches at the top of a tree.

Kāpui, v.t. 1. Gather up in a bunch. Ka kapuitia nga rau o te kiekie, ka herea.

2. Lace up or draw in the mouth of a bag.

3. Earth up crops, or cover up embers with ashes to keep them alight.

kāpuipui, v.t. Gather up litter, etc.

Kāpuka, n. Griselinia littoralis, a tree. = papauma.

Kapukiore, n. Coprosma australis, a shrub. = kanono.

Kāpuku = kōpuku, n. Gunwale.

Google Books:

Kapukapu, s. Sole of the foot,

Eldpukdpu, v. To curl* as a

wave.

Ka kapukapu mai te ngaru; The wave curls over.

Kapunga, v. To take up with both hands held together,

Kapungatia he kai i te omu; Take up food from the oven.

(B. C,

Kapura, s. Fire, -' Tahuna he kapura ; Kindle a fire.

Kapurangi, s. Rubbish; weeds,

Kara, s. An old man,

Tena korua ko kara ? How are you and the old man ?

Kara, s> Basaltic stone.

He kara te kamaka nei; This stone is kara.

Karaha, s. A calabash. ♦Kardhi, *. Glass,

Internet Archive:

kapukapu, n. sole of the foot.

kapukapu, v. i. 1. curl (as a wave). Ka kapukapu mai te ngaru. 2. gush.

kakapii, small basket for cooked food.

Kapua, n. cloud; hank of clouds,

Kapunga, n. palm of the hand.

kapunga, \. t. take up in both hands together.

Kapiira, n. fire.

Kapiiranga, n. handful.

kapuranga, v. t. take up by hand-fuls. Kapurangatia nga otaota na e ia. v. i. dawn. Ka kapuranga te ata.

Kapur&ngi, n. rubbish; uveds.

I. K&r&, n. old man. Tena korua ko kara.

II. K&r&, n. secret plan; conspiracy. Kei te whakatakoto kara mo Te Horo kia patua.

k&k&r&, D. scent; smell.

k&k&r&, a. savoury; odoriferous.

k^ar&, n. a shell-iish.

Unlike the other two, the NZETC version has accents, bold and italics in the right place. It's the only one with a workable and useful table of contents. It is also the edition which has been extensively revised and expanded. Google's second edition has many character errors, while the Internet Archive's third edition has many 'á' mis-recognised as '&'. The Google and Internet Archive versions are also available as PDFs but, of course, without fancy tables of contents these PDFs are pretty challenging to navigate, and because they're built from page images, they're huge.
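One rough way to put a number on the character errors in the entries above is to flag headwords containing characters that should never appear in a headword at all. The sketch below is purely my own illustration, not something any of the three projects does; the allowed character set and the names `headword` and `suspect_headwords` are assumptions.

```python
import re
import unicodedata

# Assumption: a clean headword contains only letters (including accented
# vowels), optionally joined by single spaces or hyphens.
CLEAN_HEADWORD = re.compile(r"^[^\W\d_]+(?:[ -][^\W\d_]+)*$")

def headword(entry: str) -> str:
    """Return the text before the first comma, normalised to composed form."""
    return unicodedata.normalize("NFC", entry.split(",", 1)[0].strip())

def suspect_headwords(entries):
    """List headwords containing characters outside the allowed set."""
    return [headword(e) for e in entries if not CLEAN_HEADWORD.match(headword(e))]

sample = [
    "kapukapu, n. sole of the foot.",
    "Kapur&ngi, n. rubbish; uveds.",
    "k^ar&, n. a shell-iish.",
    "Kāpuka, n. Griselinia littoralis, a tree.",
]
print(suspect_headwords(sample))
```

This only catches damage that leaves an impossible character behind. A misreading such as 'Kapiira' consists entirely of letters and would pass, so a count like this is a lower bound at best.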

It's tempting to say that the NZETC version is better than either of the others, and from a naïve point of view it is, but it's more accurate to say that it's different. It's a digitised version of a book revised more than a hundred years after the 1852 second edition scanned by Google Books. People who are interested in the history of the language are likely to pick the 1852 edition over the 1957 edition nine times out of ten.

Technical work is currently underway to enable third parties like the Internet Archive's bookserver to more easily redistribute our ePubs. For some semi-arcane reasons it's linked to upcoming new search functionality.

What LibraryThing metadata can the NZETC reasonably stuff inside its CC'd epubs?

This is the second blog post following on from an excellent talk about LibraryThing by LibraryThing's Tim, given at VUW in Wellington after his trip to LIANZA.

The NZETC publishes all of its works as epubs (a file format primarily aimed at mobile devices), which are literally processed crawls of its website bundled with some metadata. For some of the NZETC works (such as Erewhon and The Life of Captain James Cook), LibraryThing has a lot more metadata than the NZETC, because many LibraryThing users have the works and have entered metadata for them. Bundling as much metadata as possible into the epubs makes sense, because they are commonly designed for offline use: call-back hooks are unlikely to be available.
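An epub carries its metadata in a package (OPF) file, largely as Dublin Core elements, so bundling extra metadata mostly means adding more elements there. Here is a minimal sketch using only the Python standard library. The function name `build_metadata`, the identifier value and the two subject tags are hypothetical placeholders, not real LibraryThing data, and this is not how the NZETC actually builds its epubs.

```python
import xml.etree.ElementTree as ET

OPF = "http://www.idpf.org/2007/opf"
DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("opf", OPF)
ET.register_namespace("dc", DC)

def build_metadata(title, creator, identifiers, subjects):
    """Build an OPF <metadata> element from simple Python values."""
    meta = ET.Element(f"{{{OPF}}}metadata")
    ET.SubElement(meta, f"{{{DC}}}title").text = title
    ET.SubElement(meta, f"{{{DC}}}creator").text = creator
    # One dc:identifier per scheme, labelled with an opf:scheme attribute.
    for scheme, value in identifiers.items():
        ident = ET.SubElement(meta, f"{{{DC}}}identifier")
        ident.set(f"{{{OPF}}}scheme", scheme)
        ident.text = value
    # User tags map naturally onto repeated dc:subject elements.
    for tag in subjects:
        ET.SubElement(meta, f"{{{DC}}}subject").text = tag
    return meta

meta = build_metadata(
    "Erewhon",
    "Samuel Butler",
    {"LibraryThing": "work-id-placeholder"},  # hypothetical identifier
    ["satire", "utopia"],                     # hypothetical tags
)
print(ET.tostring(meta, encoding="unicode"))
```

Everything here travels inside the file, which is the point: a reader that is offline can still see the identifiers and tags without calling back to either site.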

So what kinds of data am I interested in?
1) Traditional bibliographic metadata. Both LT and NZETC have this down really well.
2) Images. LT has many many cover images, NZETC has images of plates from inside many works too.
3) Unique identification (ISBNs, ISSNs, work ids, etc). LT does very well at this, NZETC very poorly.
4) Genre and style information. LT has tags to do fancy statistical analysis on, and does. NZETC has full text to do fancy statistical analysis on, but doesn't.
5) Intra-document links. LT has work as the smallest unit. NZETC reproduces original document tables of contents and indexes, cross references and annotations.
6) Inter-document links. LT has none. NZETC captures both 'mentions' and 'cites' relationships between documents.

Most current-generation ebook readers, of course, can do nothing with most of this metadata, but I'm looking forward to the day when we have full-fledged OpenURL resolvers which can do interesting things, primarily picking the best copy (most local / highest quality / most appropriate format / cheapest) of a work to display to a user, and browsing works by genre (LibraryThing does genre very well, via tags).
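To make "picking the best copy" concrete, here is a small sketch of one possible ranking. It is not an OpenURL resolver and implements no part of that standard; the `Copy` fields, the quality scale and the strict priority order (local first, then quality, then format, then price) are all my assumptions, taken from the order of the list in the paragraph above.

```python
from dataclasses import dataclass

@dataclass
class Copy:
    source: str
    is_local: bool
    quality: int   # higher is better; the scale is hypothetical
    fmt: str
    price: float

def best_copy(copies, preferred_formats):
    """Pick a copy by comparing criteria in strict priority order."""
    def rank(c):
        # Unknown formats sort after every preferred format.
        if c.fmt in preferred_formats:
            fmt_rank = preferred_formats.index(c.fmt)
        else:
            fmt_rank = len(preferred_formats)
        # Tuples compare left to right, so earlier criteria dominate.
        return (not c.is_local, -c.quality, fmt_rank, c.price)
    return min(copies, key=rank)

copies = [
    Copy("remote page-image scan", False, 2, "pdf", 0.0),
    Copy("local transcription", True, 5, "epub", 0.0),
    Copy("local page-image scan", True, 3, "epub", 0.0),
]
print(best_copy(copies, ["epub", "pdf"]).source)
```

A strict priority order is the simplest choice but probably not the right one; a historian of the language might well rank edition above everything else.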

Tuesday, 15 September 2009

Thoughts on koha

The Koha community is currently undergoing a spasm, with a company apparently forking the code.
As a result, a bunch of people are looking at where the community should go from here and how it should be led. In particular, the idea of a not-for-profit foundation has been floated and is to be discussed at a meeting early tomorrow morning.
My thoughts on this issue are pretty simple:

A not-for-profit is a fabulous idea
Reusing one of the existing software not-for-profits (Apache, Software in the Public Interest, etc) introduces a layer of non-library complexity. Libraries have a long history with consortia, but tend to very much flock together with their own kind; I can see them being leery of a non-library entity.
A clear description of a forward-looking plan written in plain language that everyone can understand is vital to communicate the vision of the community, particularly to those currently on the fringes.

Tuesday, 05 May 2009

Why card-based records aren't good enough

Card catalogs have a long tradition in librarianship, dating back, I'm told, to the book stock-take in the French Revolution. Librarians understand card catalogs in a deep way that comes from generations of librarians having used them as a core professional tool all their professional lives. Librarians understand card catalogs in ways that I, as a computer scientist, never will. I still recall one of my first visits to a university library, when I asked a librarian where I might find books by a particular author. They found the work for me arguably as fast as I can now find works with the new whizzy electronic catalog.

It is natural, when faced with something new, to understand it in terms of what we already know and already understand. Unfortunately, understanding the new by analogy to the old can lead to the form of the old being assumed in the new. When libraries digitized their card catalogs in the 1970s and 1980s, the results were more or less exactly digital versions of their card catalog predecessors, because their content was limited to old data from the cards and new data from cataloging processes (which were unchanged from the card catalog era), and because librarians and users had come to equate a library catalog with a card catalog: it was what they expected.

MARC is a perfect example of this kind of thing. As a data format to directly replace a card catalog of printed books, it can hardly be faulted.

Unfortunately, digital metadata has capabilities undreamt of at the time of the French Revolution, and card catalogs and MARC do a poor job of handling these capabilities.

A whole range of people have come up with criticisms of MARC that involve materials and methodologies not routinely held in libraries at the time of the French Revolution (digital journal subscriptions and music, for example), but I view these as postdating card catalogs and thus the criticism as unfair.

So what was held in libraries in 1789 that MARC struggles with? Here's a list:

Systematically linking discussion of particular works with instances of those works
Systematically linking discussion of particular instances with those instances ("Was person X the transcriber of manuscript Y?")
Handling ambiguity ("This play may have been written by Shakespeare. It might also have been a later forgery by Francis Bacon, Christopher Marlowe or Edward de Vere")

All of these relate to core questions which have been studied in libraries for centuries. They're well understood issues, which changed little in the hundred years until the invention of the computer (which is when all the usually-cited issues with MARC began).
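To make the three items in the list above concrete, here is a sketch of what treating attributions as claims, rather than as fixed fields on a card, might look like. It is an illustration only, not an existing vocabulary or a proposal for one; every class name, field name and source label in it is hypothetical.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Claim:
    subject: str      # the work, or the particular instance, being discussed
    role: str         # e.g. "author", "forger", "transcriber"
    agent: str        # the person the claim is about
    asserted_by: str  # where the claim is made (hypothetical free-text label)

@dataclass
class Record:
    title: str
    claims: list = field(default_factory=list)

    def candidates(self, role):
        """Everyone claimed in a given role, without picking a winner."""
        return sorted({c.agent for c in self.claims if c.role == role})

play = Record("A disputed play")
play.claims += [
    Claim("work", "author", "William Shakespeare", "hypothetical source A"),
    Claim("work", "forger", "Francis Bacon", "hypothetical source B"),
    Claim("work", "forger", "Christopher Marlowe", "hypothetical source B"),
    Claim("work", "forger", "Edward de Vere", "hypothetical source C"),
    Claim("manuscript Y", "transcriber", "person X", "hypothetical source D"),
]
print(play.candidates("forger"))
```

The record never has to decide who wrote the play. Competing claims sit side by side, each tied to the discussion that makes it, and the same structure covers claims about a single manuscript as well as claims about the work.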

The real question is why we're still expecting an approach that didn't solve the problems two hundred years ago to solve our problems now. Computers are not magic in this area; they just seem to be helping us do the wrong things faster, more reliably and for larger collections.

We need a new approach to bibliographic metadata, one which is not ontologically bound to little slips of paper. There is a whole range of alternatives out there (including a bevy of RDF vocabularies), but I've yet to run into one which both allows clear representation of existing data (because let's face it, I'm not going to re-enter WorldCat, and neither are you, not in our lifetimes) and admits non-card-based metadata as first-class elements.

</rant>
