Showing posts with label nzetc. Show all posts

Saturday, 07 November 2009

ePubs and quality

You may have heard news about the release of "bookserver" by the good folks at the Internet Archive. This is a DRM-free ePub ecosystem, initially stocked with the prodigious output of Google's book scanning project and the Internet Archive's own book scanning project.

To see how the NZETC stacked up against the much larger (and better funded) collection, I picked one of our Maori language dictionaries. Month after month, our Maori and Pacifica dictionaries make up the bulk of our top five most used resources, so they're in demand. They're also an appropriate choice because when they were encoded by the NZETC into TEI, the decision was made not to use full dictionary encoding, but a cheaper/easier tradeoff which didn't capture the linguistic semantics of the underlying entries, but treated them as typeset text. I was interested in how well this tradeoff was wearing.

I did my comparison using the new Firefox ePub plugin; things will be slightly different if you're reading these ePubs on an iPhone or Kindle.

The ePub I looked at was A Dictionary of the Maori Language by Herbert W. Williams. The NZETC has the 1957 sixth edition. There are two versions of the work on bookserver: an 1852 second edition scanned by Google Books (original at the New York Public Library) and an 1871 third edition scanned by the Internet Archive in association with Microsoft (original in the University of California library system). All the processing of both works appears to have been done in the U.S. The original print used macrons (NZETC), acutes (Google) and breves (Internet Archive) to mark long vowels. Find them here.


Let's take a look at some entries from each, starting at 'kapukapu':


NZETC:

kapukapu. 1. n. Sole of the foot.

2. Apparently a synonym for kaunoti, the firestick which was kept steady with the foot. Tena ka riro, i runga i nga hanga a Taikomako, i te kapukapu, i te kaunoti (M. 351).

3. v.i. Curl (as a wave). Ka kapukapu mai te ngaru.

4. Gush.

5. Gleam, glisten. Katahi ki te huka o Huiarau, kapukapu ana tera.

Kapua, n. 1. Cloud, bank of clouds. E tutakitaki ana nga kapua o te rangi, kei runga te Mangoroa e kopae pu ana (P.).

2. A flinty stone. = kapuarangi.

3. Polyprion oxygeneios, a fish. = hapuku.

4. An edible species of fungus.

5. Part of the titi pattern of tattooing.

Kapuarangi, n. A variety of matā, or cutting stone, of inferior quality. = kapua, 2.

Kāpuhi, kāpuhipuhi, n. Cluster of branches at the top of a tree.

Kāpui, v.t. 1. Gather up in a bunch. Ka kapuitia nga rau o te kiekie, ka herea.

2. Lace up or draw in the mouth of a bag.

3. Earth up crops, or cover up embers with ashes to keep them alight.

kāpuipui, v.t. Gather up litter, etc.

Kāpuka, n. Griselinia littoralis, a tree. = papauma.

Kapukiore, n. Coprosma australis, a shrub. = kanono.

Kāpuku = kōpuku, n. Gunwale.



Google Books:

Kapukapu, s. Sole of the foot,

Eldpukdpu, v. To curl* as a

wave.

Ka kapukapu mai te ngaru; The wave curls over.

Kapunga, v. To take up with both hands held together,

Kapungatia he kai i te omu; Take up food from the oven.

(B. C,

Kapura, s. Fire, -' Tahuna he kapura ; Kindle a fire.

Kapurangi, s. Rubbish; weeds,

Kara, s. An old man,

Tena korua ko kara ? How are you and the old man ?

Kara, s> Basaltic stone.

He kara te kamaka nei; This stone is kara.

Karaha, s. A calabash. ♦Kardhi, *. Glass,



Internet Archive:

kapukapu, n. sole of the foot.

kapukapu, v. i. 1. curl (as a wave). Ka kapukapu mai te ngaru. 2. gush.

kakapii, small basket for cooked food.

Kapua, n. cloud; hank of clouds,

Kapunga, n. palm of the hand.

kapunga, \. t. take up in both hands together.

Kapiira, n. fire.

Kapiiranga, n. handful.

kapuranga, v. t. take up by hand-fuls. Kapurangatia nga otaota na e ia. v. i. dawn. Ka kapuranga te ata.

Kapur&ngi, n. rubbish; uveds.

I. K&r&, n. old man. Tena korua ko kara.

II. K&r&, n. secret plan; conspiracy. Kei te whakatakoto kara mo Te Horo kia patua.

k&k&r&, D. scent; smell.

k&k&r&, a. savoury; odoriferous.

k^ar&, n. a shell-iish.


Unlike the other two, the NZETC version has accents, bold and italics in the right place. It's the only one with a workable and useful table of contents. It is also the edition which has been extensively revised and expanded. Google's second edition has many character errors, while the Internet Archive's third edition has many 'á' mis-recognised as '&'. The Google and Internet Archive versions are also available as PDFs, but of course, without fancy tables of contents these PDFs are pretty challenging to navigate, and because they're built from page images, they're huge.
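A minimal sketch of how this kind of OCR damage could be put into numbers, assuming that characters such as '&', '^' and '\' never belong in a dictionary entry. The `SUSPECT` character set and the `suspect_ratio` helper are hypothetical, chosen from the excerpts above rather than taken from any real tool:

```python
import re

# Characters that should not occur in a dictionary entry. This set is an
# assumption based on the excerpts above, not an exhaustive list.
SUSPECT = re.compile(r"[&^\\*>♦]")


def suspect_ratio(lines):
    """Return the fraction of non-empty lines containing a suspect character."""
    lines = [line for line in lines if line.strip()]
    if not lines:
        return 0.0
    flagged = sum(1 for line in lines if SUSPECT.search(line))
    return flagged / len(lines)


# A few entries copied from the excerpts above.
internet_archive = [
    "Kapur&ngi, n. rubbish; uveds.",
    "I. K&r&, n. old man. Tena korua ko kara.",
    "k^ar&, n. a shell-iish.",
    "Kapunga, n. palm of the hand.",
]
nzetc = [
    "Kāpuka, n. Griselinia littoralis, a tree. = papauma.",
    "Kapukiore, n. Coprosma australis, a shrub. = kanono.",
]

print(suspect_ratio(internet_archive))
print(suspect_ratio(nzetc))
```

A check like this only catches errors that produce impossible characters. Mis-readings such as "uveds" for "weeds" or "hank" for "bank" are made of ordinary letters and would need a word list to spot.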

It's tempting to say that the NZETC version is better than either of the others, and from a naïve point of view it is, but it's more accurate to say that it's different. It's a digitised version of a book revised more than a hundred years after the 1852 second edition scanned by Google Books. People who're interested in the history of the language are likely to pick the 1852 edition over the 1957 edition nine times out of ten.

Technical work is currently underway to enable third parties like the Internet Archive's bookserver to more easily redistribute our ePubs. For some semi-arcane reasons it's linked to upcoming new search functionality.

What LibraryThing metadata can the NZETC reasonably stuff inside its CC'd epubs?

This is the second blog post following on from an excellent talk about LibraryThing by LibraryThing's Tim, given at VUW in Wellington after his trip to LIANZA.

The NZETC publishes all of its works as epubs (a file format primarily aimed at mobile devices), which are literally processed crawls of its website bundled with some metadata. For some of the NZETC works (such as Erewhon and The Life of Captain James Cook), LibraryThing has a lot more metadata than the NZETC, because many LibraryThing users have the works and have entered metadata for them. Bundling as much metadata as possible into the epubs makes sense, because these are commonly designed for offline use---call-back hooks are unlikely to be available.
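To make "bundling metadata" concrete: an epub carries its metadata in an OPF package file, using Dublin Core elements. Here is a minimal sketch of building such a metadata element with Python's standard library. The `build_metadata` helper and the subject values are hypothetical, and mapping LibraryThing tags onto `dc:subject` is an assumption, not a description of the NZETC pipeline:

```python
import xml.etree.ElementTree as ET

# Namespaces used by the OPF package format and Dublin Core.
DC = "http://purl.org/dc/elements/1.1/"
OPF = "http://www.idpf.org/2007/opf"
ET.register_namespace("dc", DC)
ET.register_namespace("opf", OPF)


def build_metadata(title, creator, subjects):
    """Build an OPF <metadata> element with Dublin Core children."""
    metadata = ET.Element(f"{{{OPF}}}metadata")
    ET.SubElement(metadata, f"{{{DC}}}title").text = title
    ET.SubElement(metadata, f"{{{DC}}}creator").text = creator
    # One dc:subject per tag; the tag values below are made up for illustration.
    for subject in subjects:
        ET.SubElement(metadata, f"{{{DC}}}subject").text = subject
    return metadata


metadata = build_metadata("Erewhon", "Samuel Butler", ["satire", "utopia"])
xml_text = ET.tostring(metadata, encoding="unicode")
print(xml_text)
```

Because everything ends up inside the epub itself, a reader application can use it without a network connection.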

So what kinds of data am I interested in?
1) Traditional bibliographic metadata. Both LT and NZETC have this down really well.
2) Images. LT has many many cover images, NZETC has images of plates from inside many works too.
3) Unique identification (ISBNs, ISSNs, work ids, etc). LT does very well at this, NZETC very poorly.
4) Genre and style information. LT has tags to do fancy statistical analysis on, and does. NZETC has full text to do fancy statistical analysis on, but doesn't.
5) Intra-document links. LT has work as the smallest unit. NZETC reproduces original document tables of contents and indexes, cross references and annotations.
6) Inter-document links. LT has none. NZETC captures both 'mentions' and 'cites' relationships between documents.

Most current-generation ebook readers, of course, can do nothing with most of this metadata, but I'm looking forward to the day when we have full-fledged OpenURL resolvers which can do interesting things, primarily picking the best copy (most local / highest quality / most appropriate format / cheapest) of a work to display to a user; and browsing works by genre (LibraryThing does genre very well, via tags).

Saturday, 06 June 2009

Legal Māori Archive


Now that the Legal Māori Archive is live, I thought I'd highlight a couple of my favourite texts from the corpus.

The first is a great example of reinforcing cultural confusion.
"The Laws of England, Compiled and translated into the Māori language" by Judge Francis Dart Fenton is a bilingual compendium of the laws of England, but extraordinarily uses Bible quotes as examples.

The second example is actually a collection of texts, the works of Rev. Henry Hanson Turton, who compiled thousands of pages of land deeds and associated documents into six volumes. I can see these seeing a lot of use by Treaty researchers.

Friday, 09 January 2009

Excellent stuff from New Zealand Geographic Board Ngā Pou Taunaha o Aotearoa

A while ago, motivated by the need for an authoritative list of New Zealand place names for our work at the NZETC, I criticised the NZGB fairly roundly.
While they haven't produced what I/we want/need, in the last couple of months they've made huge progress in an unambiguously right direction.
Their primary work is the New Zealand Gazetteer of Official Geographic Names, a list of all official place names in New Zealand. It has its flaws: it uses a peculiar definition of "official" (= mentioned in legislation or a Treaty of Waitangi settlement); it has very few names of inhabited places (and no linking with the much larger lists maintained by official bodies such as the police and fire service); it has no elevation data for mountains and passes (which are defined by their height); and it defines some things as points when they appear to be areas (such as Arthur's Pass National Park). But it's much better than the New Zealand Place Names Database since:
  1. It has a statutory reference for every place, giving the source of the officialness of the name
  2. It fully supports macrons
  3. It has a machine-readable list of DoC administered lands --- I can imagine this being used for all sorts of interesting things, getting people out in other scenic and marine reserves.
NZGB sent around an email in which they explicitly addressed some of the points I'd earlier raised (I'm sure I wasn't the only one):
It should be noted that some of the naming practices of the past will have to be lived with, despite inconsistencies. Moving forward, the rules of nomenclature followed by the NZGB are designed to promote standardisation, consistency, and non-ambiguity. The modern format for dual names is '<Maori name> / <non-Maori name', which the NZGB has applied for the past 10 years, though Treaty settlement dual names sometimes deviate from this convention, because the decision is ultimately made by the Minister for Treaty of Waitangi Negotiations. Older forms of dual names, with brackets, will remain depicted as such until changed through the statutory processes of the NZGB Act 2008. These are not generally regarded as alternative names.
Macrons in Maori names have posed problems for electronic databases. Nevertheless they are part of the orthography, recommended by the Maori Language Commission, and the Board endorses their use. The Gazetteer will include macrons where they are formalised as part of the official name. When Section 32 of the new Act comes into force, official documents will be required to show official names, and these will need to include macrons where they have been included as part of the official name (unless the proviso is used). A list of those official names which have macrons is at http://www.linz.govt.nz/placenames/researching-place-names/macrons/index.aspx . LINZ's Customer Services has some solutions for showing macrons in LINZ's own databases and on published maps and charts, and is currently investigating how bulk data extracts might include information about macrons, for the customer's benefit.
Despite the name, it isn't clear in my mind exactly what's official and what isn't. Is the content of the "coordinates" column official? For railway lines this is a reference to the description, which in the case of railways is usually of the form "From X to Y", where X and Y are place names, frequently place names that aren't on the list, and so presumably not official. Unless I'm going blind there is also no indication of accuracy on the physical measurements.

Wednesday, 08 October 2008

fuzzziness

I've been using topic maps in my day job, so I decided to try out http://www.fuzzzy.com/, a social bookmark engine that uses an underlying topic map engine.
I tried to approach fuzzzy with an open mind, but I increasingly stumbled on really annoying (mis-)features.
  1. This is the first bookmark engine I've ever used that doesn't let users migrate their bookmarks with them. This is perhaps the biggest single feature fuzzzy could add to attract new users, since it seems that most people who're likely to use a bookmark engine have already played with another one long enough to have dozens or hundreds of bookmarks they'd like to bring with them. I know this is non-ideal from the point of view of the social bookmark engine they're migrating to, since it makes it hard to do things completely differently, but users have baggage.
  2. While it's possible to vote up or vote down just about everything (bookmarks, tags, bookmark-tags, users, etc), very little is actually done with these votes. If I've viewed a bookmark once and voted it down, why is it added to my "most used Bookmarks"? Surely if I've indicated I don't like it the bookmark should be hidden from me, not advertised to me.
  3. For all the topic map goodness on the site, there is no obvious way to link from the fuzzzy topic map to other topic maps.
  4. There doesn't seem to be much in the way of interfacing with other semantic web standards (i.e. RDF).
  5. The help isn't. Admittedly this may be partly because many of the key participants have English as a second language.
  6. There's a spam problem. But then everywhere has a spam problem.
  7. It's not obvious that I can export my bookmarks out of fuzzzy in a form that any other bookmark engine understands.
These (mis-)features are a pity, because at NZETC we use topic maps for authority (in the librarianship sense), and it would be great to have a compatible third party that could be used for non-authoritative stuff and which would just work seamlessly.

Saturday, 04 October 2008

Place name inconsistencies

I've been looking at the "Dataset of New Zealand Geographic Place Names" from LINZ. This appears to be as close as New Zealand comes to an Official list of place names. I've been looking because it would be great to use as an authority in the NZETC.

Coming to the data I was aware of a number of issues:
  1. Unlike most geographical data users, I'm primarily interested in the names rather than the relative positions
  2. New Zealand is currently going through an extended period of renaming of geographic features to their original Māori names
  3. The names in the dataset are primarily map labels and are subject to cartographic licence
What I didn't expect was the insanity in the names. I know that there are some good historical reasons for this insanity, but that doesn't make it any less insane.
  1. Names can differ only by punctuation. There is a "No. 1 Creek" and a "No 1 Creek".
  2. Names can differ only by presentation. There is a "Crook Burn or 8 Mile Creek", an "Eight Mile Creek or Boundary Creek" and an "Eight Mile Creek" (each in a different province).
  3. There is no consistent presentation of alternative names. There is "Saddle (Mangaawai) Bivouac", "Te Towaka Bay (Burnside Bay)", "Queen Charlotte Sound (Totaranui)", "Manawatawhi/Three Kings Islands", "Mount Hauruia/Bald Rock", "Crook Burn or 8 Mile Creek" and "Omere, Janus or Toby Rock"
  4. There is no machine-readable source of the Māori place names with macrons, and the human-readable version contains subtle differences from the machine-readable database (which contains no non-ASCII characters). For example "Franz Josef Glacier/Kā Roimata o Hine Hukatere (Glacier)" and "Franz Josef Glacier/Ka Roimata o Hine Hukatere" differ by more than the macrons. There appears to be no information on which are authoritative.
Right now I'm finding this rather frustrating.
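To show what coping with the inconsistencies listed above might look like in practice, here is a minimal sketch of loose name matching. The helper names are hypothetical and the rules are assumptions drawn only from the examples quoted in this post, not from any LINZ specification:

```python
import re
import unicodedata


def strip_macrons(name):
    """Remove macrons (combining U+0304) so 'Kā' and 'Ka' compare equal."""
    decomposed = unicodedata.normalize("NFD", name)
    stripped = "".join(ch for ch in decomposed if ch != "\u0304")
    return unicodedata.normalize("NFC", stripped)


def comparison_key(name):
    """Fold case, macrons, full stops and extra spaces for loose matching."""
    key = strip_macrons(name).casefold().replace(".", "")
    return " ".join(key.split())


def split_alternatives(name):
    """Split a label on the separators seen above: '/', 'or', ',' and brackets."""
    parts = re.split(r"\s*/\s*|\s+or\s+|\s*,\s*|\s*[()]\s*", name)
    return [part for part in parts if part]


print(comparison_key("No. 1 Creek") == comparison_key("No 1 Creek"))
print(split_alternatives("Omere, Janus or Toby Rock"))
print(split_alternatives("Te Towaka Bay (Burnside Bay)"))
```

Even this falls over quickly. "Saddle (Mangaawai) Bivouac" splits into three fragments rather than two bivouac names, and nothing here knows that "8 Mile Creek" and "Eight Mile Creek" might be the same thing, which is rather the point: without an authoritative source, every rule is a guess.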

(grammar edit)