Saturday, 07 November 2009
What LibraryThing metadata can the NZETC reasonably stuff inside its CC'd epubs?
The NZETC publishes all of its works as epubs (a file format primarily aimed at mobile devices), which are essentially processed crawls of its website bundled with some metadata. For some of the NZETC works (such as Erewhon and The Life of Captain James Cook), LibraryThing has a lot more metadata than the NZETC, because many LibraryThing users have the works and have entered metadata for them. Bundling as much metadata as possible into the epubs makes sense, because these are commonly designed for offline use, where call-back hooks are unlikely to be available.
So what kinds of data am I interested in?
1) Traditional bibliographic metadata. Both LT and NZETC have this down really well.
2) Images. LT has many many cover images, NZETC has images of plates from inside many works too.
3) Unique identification (ISBNs, ISSNs, work ids, etc.). LT does this very well, NZETC very poorly.
4) Genre and style information. LT has tags to do fancy statistical analysis on, and does. NZETC has full text to do fancy statistical analysis on, but doesn't.
5) Intra-document links. LT has the work as its smallest unit. NZETC reproduces original document tables of contents and indexes, cross-references and annotations.
6) Inter-document links. LT has none. NZETC captures both 'mentions' and 'cites' relationships between documents.
Most current-generation ebook readers, of course, can do nothing with most of this metadata, but I'm looking forward to the day when we have full-fledged OpenURL resolvers which can do interesting things: primarily picking the best copy (most local / highest quality / most appropriate format / cheapest) of a work to display to a user, and browsing works by genre (LibraryThing does genre very well, via tags).
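To make this concrete, here is a rough sketch of where some of this could land in an epub's OPF package file (EPUB 2 carries its metadata as Dublin Core elements); the ISBN and subjects below are placeholders for illustration, not real catalogue data:
<metadata xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:opf="http://www.idpf.org/2007/opf">
<!-- 1) traditional bibliographic metadata -->
<dc:title>Erewhon</dc:title>
<dc:creator opf:role="aut">Samuel Butler</dc:creator>
<!-- 3) unique identification; a placeholder ISBN -->
<dc:identifier opf:scheme="ISBN">0000000000</dc:identifier>
<!-- 4) genre and style information, e.g. distilled from LibraryThing tags -->
<dc:subject>utopian fiction</dc:subject>
<dc:subject>satire</dc:subject>
</metadata>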
Thursday, 15 October 2009
Interlinking of collections: the quest continues
The answer is: middling.
For copies of printed books less than a hundred years old (or reprinted in the last hundred years), and their authors, LibraryThing seems to do very well. These are the books likely to be in active circulation in personal libraries, so it stands to reason that these would be well covered.
I tried half a dozen books from our Nineteenth-Century Novels Collection, and most were missing; Erewhon, of course, was well represented. LibraryThing doesn't have the "Treaty of Waitangi" (a set of manuscripts) but it does have "Facsimiles of the Treaty of Waitangi." It's not clear to me whether these would be merged under their cataloguing rules.
Coverage of non-core bibliographic entities was lacking. Places get a little odd. Sydney is "http://www.librarything.com/place/Sydney,%20New%20South%20Wales,%20Australia" but Wellington is "http://www.librarything.com/place/Wellington", and Anzac Cove appears to be missing altogether. This doesn't look like a sane authority control system for places, as far as I can see. People who are the subjects rather than the authors of books didn't come out so well: I couldn't find Abel Janszoon Tasman, Pōtatau Te Wherowhero or Charles Frederick Goldie, all of whom are near and dear to our hearts.
Here is the spreadsheet of how different web-enabled systems map entities we care about.
Correction: It seems that the correct URL for Wellington is http://www.librarything.com/place/Wellington,%20New%20Zealand which brings sanity back.
Saturday, 19 September 2009
eBook readers need OpenURL resolvers
OpenURL is the nifty protocol that libraries use to find the closest copy of an electronic resource and direct patrons to copies that the library might have already licensed from commercial parties. It's all about finding the version of a resource that is most accessible to the user, dynamically.
Say I've loaded 500 eBooks into my eBook reader: a couple of encyclopaedias and dictionaries; a stack of books I was meant to read in school but only skimmed and have been meaning to get back to; current blockbusters; guidebooks to the half-dozen countries I'm planning on visiting over the next couple of years; classics I've always meant to read (Tolstoy, Chaucer, Cervantes, Plato, Descartes, Nietzsche); and local writers (Baxter, Duff, Ihimaera, Hulme, ...). My eBooks by Nietzsche are going to refer to books by Descartes and Plato; my eBooks by Descartes are going to refer to books by Plato; my encyclopaedias are going to refer to pretty much everything; and most of the works in translation are going to contain terms which I'm going to need help with (help which the encyclopaedias and dictionaries can provide).
Ask yourself, though, whether you'd want to flick between works on the current generation of readers: very painful, since these devices are designed for linear reading of eBooks, not for efficient navigation between them. You can't follow links between them, of course, because on current systems links must point either within the same eBook or out onto the internet; pointing to other eBooks on the same device is verboten. OpenURL can solve this by catching those URLs and making them point to local copies of works (and thus available for free even when the internet is unavailable) where possible, while still retaining their value as pointers to remote copies when no local copy exists.
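For the curious, an OpenURL is just a URL that describes the wanted resource rather than naming one fixed copy of it. Something like the following (the resolver host is hypothetical, and line breaks are added for readability) tells a resolver enough to decide whether a local copy will do:
http://resolver.example.org/openurl?url_ver=Z39.88-2004
&rft_val_fmt=info:ofi/fmt:kev:mtx:book
&rft.btitle=Erewhon
&rft.au=Butler,+Samuel
&rft.date=1872
A resolver baked into the reader could answer that query with the on-device epub when one matches, and fall through to the network when it doesn't.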
Tuesday, 15 September 2009
Thoughts on Koha
The Koha community is currently undergoing a spasm, with a company apparently forking the code.
As a result a bunch of people are looking at where the community should go from here and how it should be led. In particular, the idea of a not-for-profit foundation has been floated and is to be discussed at a meeting early tomorrow morning.
My thoughts on this issue are pretty simple:
- A not-for-profit is a fabulous idea
- Reusing one of the existing software not-for-profits (Apache, Software in the Public Interest, etc.) introduces a layer of non-library complexity. Libraries have a long history with consortia, but tend very much to flock together with their own kind; I can see them being leery of a non-library entity.
- A clear description of a forward-looking plan written in plain language that everyone can understand is vital to communicate the vision of the community, particularly to those currently on the fringes
Monday, 31 August 2009
Data and data modelling and underlying assumptions
The gaps were about data and data modelling, and about the underlying assumption that the way one person / group / institution views a kind of data is the same as the way others view it.
This gap is probably most pronounced in geo-location.
There's a whole bunch of very bright people doing wonderful mashups in geo-location using a put-points-on-a-map model. Typically using Google Maps (or one of a small number of competitors), they give insights into all manner of things by throwing points onto maps, street views, etc, etc. It's a relatively new field, and every time I look they seem to have a whizzy new toy. Whizzy thing of the day for me was http://groups.google.com/group/digitalnz/browse_thread/thread/b5b0c96ce08ca441 . Unfortunately the very success of the 'data as points' model encourages the view that a location is a lat/long pair and that the important metric is the number of significant digits in the lat/long.
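In that model a place boils down to something like this (a KML fragment, for illustration; the coordinates are approximate):
<Placemark>
<name>Wellington</name>
<Point>
<coordinates>174.78,-41.29</coordinates>
</Point>
</Placemark>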
In the GLAM (Galleries, Libraries, Archives and Museums) sector, we have a tradition of using thesauri such as the Getty Thesaurus of Geographic Names. Take a look at the entry for the Wellington region: http://www.getty.edu/vow/TGNFullDisplay?find=wellington&place=&nation=New+Zealand&prev_page=1&english=Y&subjectid=7000512
Yes, it has a lat and a long (with laughable precision), but the lat and long are arguably the least important information on the page. There's a faceted hierarchy, synonyms, linked references and type data. Te Papa have just moved to Getty for place names in their new site (http://collections.tepapa.govt.nz/) and frankly, I'm jealous. They paid a few thousand dollars for a licence to the thesaurus, and it's a joy to use.
The idea of #opengovt is predicated on institutions and individuals speaking the same languages and being able to communicate effectively, and this is clearly a case where we're not. Learning to speak each other's languages seems like it's going to be key to this whole venture.
As something of a worked example, here's something that I'm working on at the moment. It's a page from The Manual of the New Zealand Flora by Thomas Frederick Cheeseman, a core text in New Zealand botany: http://www.nzetc.org/tm/scholarly/tei-CheManu-t1-body1-d22-d5.html . The text is live on our website, but it's not yet fully marked up. I've chosen it because it illustrates two separate kinds of languages and their disparities.
What are the geographic locations on that page?
* Nelson—Mountains flanking the Clarence Valley
* Marlborough—Kaikoura Mountains
* Canterbury—Kowai River
* Canterbury—Coleridge Pass
* Otago—Mount St. Bathan's
The qualifier "2000–5000 ft" (which I believe is the elevation range at which these flourish) applies across all of these. Clearly we're going to struggle to represent these with a finite number of lat/long points, no matter how accurate. In all likelihood I'll not actually mark up these locations, since, with no one working with complex locations, the cost/benefit isn't within sight of being worth it.
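If I were to mark them up, it would have to be something structured along these lines (a purely hypothetical, TEI-flavoured sketch, not a schema we actually use):
<habitat>
<measure type="elevation" unit="ft">2000–5000</measure>
<location><region>Marlborough</region>: <geogName>Kaikoura Mountains</geogName></location>
<location><region>Canterbury</region>: <geogName>Kowai River</geogName></location>
</habitat>
Every element here would need its own authority control before it earned its keep, which is where the cost comes from.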
Te Papa and the NZETC have a small-scale binomial name exercise underway, and for that I'll be scripting the extraction of the following names from that page:
* Notospartium carmichœliœ (synonym Notospartium carmichaeliae)
* Notospartium torulosum
There were a bunch of folks at the #opengovt barcamp who're involved in the "New Zealand Organisms Register" (http://www.nzor.org.nz/) project. As I understand it, they want me to expose the following names from that page:
* Notospartium carmichœliœ, Hook. f.
* Notospartium torulosum, Hook. f.
Of course the names the public want are:
* New Zealand Pink Broom
* ? (Notospartium torulosum appears not to have a common name)
Note that none of these taxonomic names actually appear in full on the page...
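So one string on the page fans out into a small record; whatever the extraction script produces will need a shape something like this (the element names are invented for illustration):
<name canonical="Notospartium carmichaeliae">
<asPrinted>Notospartium carmichœliœ</asPrinted>
<withAuthority>Notospartium carmichœliœ, Hook. f.</withAuthority>
<vernacular>New Zealand Pink Broom</vernacular>
</name>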
This is, clearly, an area where the best can be the enemy of the good and vice versa, but the good needs to at least be aware of the best.
Monday, 27 July 2009
Learning XSLT 2.0, Part 1: Finding Names
We mark up a lot of names, so one of the first things I decided to do was to build an XSLT stylesheet that takes a list of names (from a separate XML file) and tags those names wherever they occur in a document. To make things easier and clearer, I've ignored little things like namespaces, conformant TEI, etc, etc.
First up, the list of names; these are multi-word names. Notice the simple structure: this could easily be built from a comma-separated list or similar:
<?xml version="1.0" encoding="UTF-8"?>
<names>
<name>Papaver argemone</name>
<name>Papaver dubium</name>
<name>Papaver Rhceas</name>
<name>Zanthoxylum novæ-zealandiæ</name>
</names>
Next, some sample text:
<?xml version="1.0" encoding="UTF-8"?>
<doc>
<p> There are several names Papaver argemone in this document Papaver argemone</p>
<p> Some of them are the same as others (Papaver Rhceas Papaver rhceas P. rhceas)</p>
<p> Non ASCII characters shouldn't cause a problem in names like Zanthoxylum novæ-zealandiæ AKA Zanthoxylum novae-zealandiae</p>
</doc>
Finally, the stylesheet. It consists of three parts: a regexp variable that builds a regexp from the names in the file; a default template for everything but text(); and a template for text()s that applies the regexp.
<?xml version="1.0"?>
<xsl:stylesheet version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform" >
<!-- build a regexp of the names -->
<xsl:variable name="regexp">
<xsl:value-of select="concat('(',string-join(document('name-list.xml')//name/text(), '|'), ')')"/>
</xsl:variable>
<!-- generic copy-everything-except-texts template -->
<xsl:template match="@*|*|processing-instruction()|comment()">
<xsl:copy>
<xsl:apply-templates select="@*|*|processing-instruction()|comment()|text()"/>
</xsl:copy>
</xsl:template>
<!-- tag every occurrence of a name from the list within a text node -->
<xsl:template match="text()">
<xsl:analyze-string select="." regex="{$regexp}">
<xsl:matching-substring>
<name type="taxonomic" subtype="matched">
<xsl:value-of select="regex-group(1)"/>
</name>
</xsl:matching-substring>
<xsl:non-matching-substring>
<xsl:value-of select="."/>
</xsl:non-matching-substring>
</xsl:analyze-string>
</xsl:template>
</xsl:stylesheet>
The output looks like:
<?xml version="1.0" encoding="UTF-8"?><doc>
<p> There are several names <name type="taxonomic" subtype="matched">Papaver argemone</name> in this document <name type="taxonomic" subtype="matched">Papaver argemone</name></p>
<p> Some of them are the same as others (<name type="taxonomic" subtype="matched">Papaver Rhceas</name> Papaver rhceas P. rhceas)</p>
<p> Non ASCII characters shouldn't cause a problem in names like <name type="taxonomic" subtype="matched">Zanthoxylum novæ-zealandiæ</name> AKA Zanthoxylum novae-zealandiae</p>
</doc>
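For reference, since XSLT 2.0 support more or less means Saxon at the moment, a run looks something like this (the jar and file names are whatever you have locally):
java -jar saxon9.jar -s:doc.xml -xsl:tag-names.xsl -o:doc-tagged.xml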
As you may notice from the last paragraph of the output, I've not yet worked out the best way to handle the 'æ': the 'ae' spelling of Zanthoxylum novae-zealandiae goes unmatched.
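One possible approach (a sketch only): when building the regexp, rewrite each 'æ' in a name into an alternation, so both spellings match:
<xsl:variable name="regexp">
<xsl:value-of select="concat('(', string-join(for $n in document('name-list.xml')//name/text() return replace($n, 'æ', '(æ|ae)'), '|'), ')')"/>
</xsl:variable>
The nested parentheses become regex-group(2) and upwards, so regex-group(1) in the matching template still returns the whole matched name; the same trick would cover 'œ'.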
Saturday, 06 June 2009
Legal Māori Archive
Now that the Legal Māori Archive is live, I thought I'd highlight a couple of my favourite texts from the corpus.
The first is a great example of reinforcing cultural confusion. "The Laws of England, Compiled and translated into the Māori language" by Judge Francis Dart Fenton is a bilingual compendium of the laws of England, but, extraordinarily, uses Bible quotes as examples.
The second example is actually a collection of texts: the works of Rev. Henry Hanson Turton, who compiled thousands of pages of land deeds and associated documents into six volumes. I can see these seeing a lot of use by Treaty researchers.