
Monday, 31 August 2009

Data and data modelling and underlying assumptions

I feel that there was a huge disconnect between some groups of participants at #opengovt (http://groups.google.co.nz/group/nzopengovtbarcamp) in Wellington last weekend. This is my attempt to illuminate the gaps.

The gaps were about data, data modelling, and the underlying assumption that the way one person / group / institution views a kind of data is the same as the way others view it.

This gap is probably most pronounced in geo-location.

There's a whole bunch of very bright people doing wonderful mashups in geo-location using a put-points-on-a-map model. Typically using google maps (or one of a small number of competitors) they give insights into all manner of things by throwing points onto maps, street views, etc, etc. It's a relatively new field and every time I look they seem to have a whizzy new toy. Whizzy thing of the day for me was http://groups.google.com/group/digitalnz/browse_thread/thread/b5b0c96ce08ca441 . Unfortunately the very success of the 'data as points' model encourages the view that location is a lat / long pair and the important metric is the number of significant digits in the lat / long.

In the GLAM (Galleries, Libraries, Archives and Museums) sector, we have a tradition of using thesauri such as the Getty Thesaurus of Geographic Names. Take a look at the entry for the Wellington region: http://www.getty.edu/vow/TGNFullDisplay?find=wellington&place=&nation=New+Zealand&prev_page=1&english=Y&subjectid=7000512

Yes, it has a lat and a long (with laughable precision), but the lat and long are arguably the least important information on the page. There's a faceted hierarchy, synonyms, linked references and type data. Te Papa have just moved to Getty for place names in their new site (http://collections.tepapa.govt.nz/) and frankly, I'm jealous. They paid a few thousand dollars for a licence to the thesaurus and it's a joy to use.

The idea of #opengovt is predicated on institutions and individuals speaking the same languages and being able to communicate effectively, and this is clearly a case where we're not. Learning to speak each other's languages seems like it's going to be key to this whole venture.

As something of a worked example, here's something that I'm working on at the moment. It's a page from The Manual of the New Zealand Flora by Thomas Frederick Cheeseman, a core text in New Zealand botany; see http://www.nzetc.org/tm/scholarly/tei-CheManu-t1-body1-d22-d5.html. The text is live on our website, but it's not yet fully marked up. I've chosen it because it illustrates two separate kinds of languages and their disparities.

What are the geographic locations on that page?

* Nelson-Mountains flanking the Clarence Valley
* Marlborough—Kaikoura Mountains
* Canterbury—Kowai River
* Canterbury—Coleridge Pass
* Otago—Mount St. Bathan's

The qualifier "2000–5000 ft" (which I believe is an elevation range at which these flourish) applies across these. Clearly we're going to struggle to represent these with a finite number of lat/long points, no matter how accurate. In all likelihood, I'll not actually mark up these locations, since the because no one's working with complex locations, the cost benifit isn't within sight of being worth it.

Te Papa and the NZETC have a small-scale binomial name exercise underway, and for that I'll be scripting the extraction of the following names from that page (a sketch of the kind of script follows the list):

* Notospartium carmichœliœ (synonym Notospartium carmichaeliae)
* Notospartium torulosum
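
For the curious, the extraction amounts to something like the sketch below. The regex and the rule for expanding abbreviated genus names ("N. torulosum" → "Notospartium torulosum") are illustrative only, not our actual pipeline, and real use would filter the candidates against a checklist of known names.

```python
import re

# Candidate binomials: a capitalised genus (or an "N."-style abbreviation)
# followed by a lowercase species epithet (allowing ligatures like "œ").
BINOMIAL = re.compile(r'\b([A-Z][a-z]+|[A-Z]\.)\s+([a-zœ-]+)\b')

def extract_binomials(text):
    names, last_genus = [], None
    for genus, species in BINOMIAL.findall(text):
        if genus.endswith('.'):
            if last_genus is None:
                continue                # abbreviation with nothing to expand to
            genus = last_genus          # expand "N." to the last full genus seen
        else:
            last_genus = genus
        names.append(f'{genus} {species}')
    return names

sample = 'Notospartium carmichœliœ, Hook. f. ... N. torulosum, Hook. f.'
print(extract_binomials(sample))
# ['Notospartium carmichœliœ', 'Notospartium torulosum']
```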

There were a bunch of folks at the #opengovt barcamp who're involved in the "New Zealand Organisms Register" (http://www.nzor.org.nz/) project. As I understand it, they want me to expose the following names from that page:

* Notospartium carmichœliœ, Hook. f.
* Notospartium torulosum, Hook. f.

Of course the name the public want is:

* New Zealand Pink Broom
* ? (Notospartium torulosum appears not to have a common name)

Note that none of these taxonomic names actually appear in full on the page...


This is, clearly, an area where the best can be the enemy of the good and vice versa, but the good needs to at least be aware of the best.

Monday, 2 February 2009

Report from the NDHA's International Perspectives on Digital Preservation

NOTE: I'm a computer scientist by training and this was largely a librarian/archivist gig, so it's entirely possible I've got the wrong end of the stick on one or more points in the summary below. It's also my own summary, and not the position of my employer, even though I was on work time during the event.

The NDHA is about to announce that the project has been completed on time and under budget. This is particularly pleasing in light of the long history of government IT failures over the last 30 years, and is a tribute to all concerned. Indeed, when I was taking undergraduate courses in software engineering, a contemporary national library project was used as a text-book example of how not to run a software development undertaking. It's good to see how far they've come.

The event itself was a one-day affair in the national library auditorium, with a handful of overseas speakers. I'm not entirely certain that a handful of foreigners counts as "international," but maybe that's just me being a snob. Certainly there was a fine turn-out of locals, including many from the National Library, the Ministry for Culture and Heritage and VUW, among them a number of students who couldn't possibly have been there for the free food.

There seemed to be an underlying tension between librarianship and archivistship running through the event. I see this as a really crazy turf war, personally, since the chances of libraries and archives existing as separate entities and disciplines in fifty years seem pretty slim. The separation between the two, the "uniqueness" of objects in an archive, seems to me to be obliterated by the free duplication of digital objects. I've heard people say that archives also provide access controls and embargoes for their depositors, but then so can libraries, particularly those in the military and those working with classified documents.

It seemed to me that the word "reliability" was used in a confusing number of different ways by different people. Without naming the guilty parties:
  1. reliability as the truthfulness of the documents in the library/archive. This is the old problem of ingestors having to determine the absolute veracity of documents
  2. reliability as getting the same metadata every time. This seems odd to me, since systems with audit control give _different_ results every time, because information on the previous accesses is included in the metadata of subsequent accesses
  3. reliability as the degree to which the system conformed to a standard/specification
On reflection this may have been a symptom of the different vocabulary used by librarians and archivists. Whatever the cause, if we're wanting to spend public money, we have to be able to explain to the public what we're doing, and this isn't helping.

The organisers told us the presentations would be up by tonight (the evening of the presentation), but you won't find them on google if you go looking, because they tell google to please f**k off. I guess this is what someone was referring to when they said we had to work to make content accessible to google. The link is http://ndha-wiki.natlib.govt.nz/ndha/pages/IPoDP%2009%20Presentations and most were up at the time of writing.

I was hugely encouraged by the number of pieces of software that seem to be being open sourced, as I see this as a much better economic model than paying vendors for custom software, particularly since it potentially scales out from the national and top-tier libraries/archives/museums to the second and third tier ones, which by dint of their much larger numbers actually serve the most users and hold the most content. It was unfortunate that the national library hasn't looked beyond proprietary software for non-specialist software, and continues to use Adobe Photoshop / Microsoft Windows, which are available only for limited periods of time on certain platforms (which will inevitably become obsolete), rather than OpenOffice, GIMP, etc., which are cross platform and licensed under perpetual licences that include the right to port the software from one platform to another. I guess Photoshop / Windows is what their clients and funders know and use.

With a number of participants I had conversations about preservation. Andrew Wilson in his presentation used the quote:

“traditionally, preserving things meant keeping them unchanged; however our digital environment has fundamentally changed our concept of preservation requirements. If we hold on to digital information without modifications, accessing the information will become increasingly difficult, if not impossible” Su-Shing Chen, “The Paradox of Digital Preservation”, Computer, March 2001, 2-6

If you think about what intellectual objects we have from the Greeks (which is where we Westerners traditionally trace our intellectual history from), the majority fall into two main classes: (a) art works, which have survived primarily through Roman copies, and (b) texts, which have survived by copying, including a body of mathematics which was kept alive in Arabic translation during a period when we Westerners were burning the works in Latin and Greek and claiming that the bible was the only book we needed. I'll grant you that a high-quality book will last maybe 500 years in a controlled environment, maybe even 1000, but for real permanence, you just can't get past physical ubiquity. If we have things truly worthy of long-term preservation, we should be striking deals with The Warehouse to get them into every home in the country, and setting them as translation exercises in our language learning courses.

I had some excellent conversations with other participants at the event, including Phillipa Tocker from Museums Aotearoa / Te Tari o Nga Whare Taonga o te Motu who told me about the http://www.nzmuseums.co.nz/ site they put together for their members.

Looking at the site I'm struck by how similar the search functionality is to http://www.nram.org.nz/. I'm not sure whether their similarity is a good thing (because it enables non-experts to search the holdings) or a bad thing (because by lowering themselves to the lowest common denominator they've devalued their uniqueness). While I'm certain that these websites have vital roles in the museums and archives communities respectively, I can't help but feel that from an end-user's perspective having two sites rather than one seems redundant, and the fact that they don't seem to reference/suggest any other information sources doesn't help. I can't imagine a librarian/archivist not being forthcoming with a suggestion of where to look next if they've run out of local relevant content---why should our websites be any different?

I recently changed the NZETC to point to likely-relevant memory institutions when a search returns no results (or when a user pages through to the end of any list of results).
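The change is trivial in outline; something like this hypothetical handler (the source list and function names are mine, not the actual NZETC code):

```python
# Hedged sketch of the fallback: on an empty result set (or at the end of
# paging), point the user at other likely-relevant memory institutions.
FALLBACK_SOURCES = [
    ("Te Papa Collections Online", "http://collections.tepapa.govt.nz/"),
    ("NZMuseums", "http://www.nzmuseums.co.nz/"),
    ("NRAM", "http://www.nram.org.nz/"),
]

def render_results(query, results):
    if results:
        return "\n".join(hit["title"] for hit in results)
    suggestions = "\n".join(f"  {name}: {url}" for name, url in FALLBACK_SOURCES)
    return f'No local results for "{query}". You could try:\n{suggestions}'

print(render_results("pink broom", []))
```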

I also talked to some chaps from Te Papa about the metadata they're using to represent place names (Getty Thesaurus of Geographic Names) and species names (ad-hoc). At the NZETC we have many place names marked up (in NZ, Europe and the Pacific), but are not currently syncing with an external authority. Doing so would hugely help interoperability. Ideally we'd be using the shiny new New Zealand Gazetteer of Official Geographic Names, but it doesn't yet have enough of the places we need (it basically only covers places mentioned in legislation or treaty settlements). It does have macrons in all the right places though, which is an excellent start. We currently don't mark up species names, but would like to, and again an external authority would be great.
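As a sketch of what syncing with an authority buys you: once a place name carries an external identifier, anyone else using the same authority can join their data to ours. The TGN id below is the Wellington one from the Getty URL earlier; the element and attribute names are illustrative TEI-style markup, not our actual schema.

```python
# Map place names to (authority, identifier) pairs; unmatched names stay plain.
PLACE_AUTHORITY = {
    "Wellington": ("tgn", "7000512"),   # Getty Thesaurus of Geographic Names
}

def markup_place(name):
    record = PLACE_AUTHORITY.get(name)
    if record is None:
        return name                     # no authority record: leave unlinked
    scheme, ident = record
    return f'<placeName ref="{scheme}:{ident}">{name}</placeName>'

print(markup_place("Wellington"))
# <placeName ref="tgn:7000512">Wellington</placeName>
```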

It might have been useful if the day had included an overview of what the NDHA actually was and what had been achieved (maybe I missed this?).

Friday, 9 January 2009

Excellent stuff from New Zealand Geographic Board Ngā Pou Taunaha o Aotearoa

A while ago, motivated by the need for an authoritative list of New Zealand place names for our work at the NZETC, I criticised the NZGB fairly roundly.
While they haven't produced what I/we want/need, in the last couple of months they've made huge progress in an unambiguously right direction.
Their primary work is the New Zealand Gazetteer of Official Geographic Names, a list of all official place names in New Zealand. It has a peculiar definition of "official" (= mentioned in legislation or a Treaty of Waitangi settlement), it has very few names of inhabited places (and no linking with the much larger lists maintained by official bodies such as the police and fire service), it has no elevation data for mountains and passes (which are defined by their height), and it defines some things as points when they appear to be areas (such as Arthur's Pass National Park), but it's much better than the New Zealand Place Names Database since:
  1. It has a statutory reference for every place, giving the source of the officialness of the name
  2. It fully supports macrons
  3. It has a machine-readable list of DoC-administered lands --- I can imagine this being used for all sorts of interesting things, getting people out into other scenic and marine reserves.
NZGB sent around an email in which they explicitly addressed some of the points I'd earlier raised (I'm sure I wasn't the only one):
It should be noted that some of the naming practices of the past will have to be lived with, despite inconsistencies. Moving forward, the rules of nomenclature followed by the NZGB are designed to promote standardisation, consistency, and non-ambiguity. The modern format for dual names is '<Maori name> / <non-Maori name>', which the NZGB has applied for the past 10 years, though Treaty settlement dual names sometimes deviate from this convention, because the decision is ultimately made by the Minister for Treaty of Waitangi Negotiations. Older forms of dual names, with brackets, will remain depicted as such until changed through the statutory processes of the NZGB Act 2008. These are not generally regarded as alternative names.
Macrons in Maori names have posed problems for electronic databases. Nevertheless they are part of the orthography, recommended by the Maori Language Commission, and the Board endorses their use. The Gazetteer will include macrons where they are formalised as part of the official name. When Section 32 of the new Act comes into force, official documents will be required to show official names, and these will need to include macrons where they have been included as part of the official name (unless the proviso is used). A list of those official names which have macrons is at http://www.linz.govt.nz/placenames/researching-place-names/macrons/index.aspx . LINZ's Customer Services has some solutions for showing macrons in LINZ's own databases and on published maps and charts, and is currently investigating how bulk data extracts might include information about macrons, for the customer's benefit.
Despite the name, it isn't clear in my mind exactly what's official and what isn't. Is the content of the "coordinates" column official? For railway lines this is a reference to the description, which in the case of railways is usually of the form "From X to Y", where X and Y are place names, frequently place names that aren't on the list and are thus presumably not official. Unless I'm going blind, there is also no indication of accuracy on the physical measurements.

Wednesday, 9 July 2008

Who should I nominate for the NZ Open Source Awards?

So nominations are open for the New Zealand Open Source Awards and I have to decide who I should nominate. There doesn't seem to be anything stopping me nominating several, but picking one contender and throwing my weight behind them seems like the right thing to do. The ideas I've come up with so far are:

Kiharoa Dear for excellent work in getting firefox, thunderbird and open office working in Māori contexts:

http://kiharoa.dear.maori.nz/

Standards New Zealand for sanity control in the OOXML fiasco:

http://www.standards.co.nz/news/Media+releases/NZ+maintains+negative+vote+on+OOXML+Standard.htm

Hagley Community College for rolling out Ubuntu in a secondary school:

http://computing.hagley.school.nz/about/opensource

Who should I nominate? Is there someone I've missed?

Saturday, 14 June 2008

Kernel Hell and what to do about it

I've been in kernel hell with my home system for the past couple of days. What I want to build is a custom kernel that'll do xen, vserver, vmware, selinux, support both my wireless chipsets and support my video chipset. Ideally it should be built the Debian/Ubuntu way, so it just works on my Ubuntu Hardy Heron system.

So far I've had various combinations of four or five out of six working at once.

I'm not a kernel hacker, but I have a PhD in computer science, so I should be able to at least make progress on this, and the fact that I can't is very frustrating. At work I grabbed a kernel off a co-worker, but it wasn't built the Debian/Ubuntu way.

Standing back and looking at the problem, there seem to be two separate contributing factors:


  1. There are a huge number of organically-grown structural layers. I count git, the kernel build scripts, make, Linus's release system, the Debian kernel building system and the Ubuntu kernel building system. I won't deny that each of these serves a purpose, but that's six different points at which each of the six different things I'm trying to make work can begin their explanation of how to make them work, and six different places for things to go wrong.
  2. There are many Linux distributions, and each of the things I'm trying to get working caters to a different set of them.
In many ways the distribution kernel packagers are victims of their own success: most Ubuntu, Debian and RedHat kernels just work because their packagers keep adding more and more features and more and more drivers to the default kernels. With the default kernels working for so many people, fewer and fewer people build their own kernels and the pool of knowledge shrinks. The depth of that knowledge increases too, with each evolution of the collective build system.

Wouldn't it be great if someone (ideally under the auspices of the OSDL) stepped in and said "This is insane, we need a system to allow users to build their own kernels from a set of <git repository, tag> pairs and a set of flags (a la the current kernel config system). It would download the git repositories and sync to the tags and then compile to the set of flags. Each platform can build their own GUI and their own backend so it works with their widget set and their low level architecture, but here's a prototype."

The system would take the set of repositories and tags, download the sources with git, merge the results, use the flags to configure the build and build the kernel. Of course, sometimes the build won't work (in which case the system sends a copy of the config and the last N lines of output to a central server) and sometimes it will (in which case the system sends a copy of the config and an md5 checksum of the kernel to a central server and optionally uploads the kernel to a local repository), but more than anything it'll make it easy and safe for regular users to compile their own kernels. The system would supplant "building kernels the Debian way" or "building kernels the RedHat way" and enable those projects working at the kernel level to provide meaningful support and help to their users on distributions other than Slackware.
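
A back-of-the-envelope prototype of the backend might look like the following. The repository URL, tag and config flags are placeholders, and a real implementation would need far more care (cross-compilation, module handling, sandboxing), but the shape of it is simple:

```python
import hashlib
import os
import subprocess

# Hypothetical inputs: <git repository, tag> pairs plus kernel config flags.
SOURCES = [
    ("git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git",
     "v2.6.24"),
]
FLAGS = {"CONFIG_XEN": "y", "CONFIG_SECURITY_SELINUX": "y"}

def build_kernel(workdir):
    tree = os.path.join(workdir, "linux")
    base_repo, base_tag = SOURCES[0]
    subprocess.check_call(["git", "clone", "--branch", base_tag, base_repo, tree])
    for repo, tag in SOURCES[1:]:           # merge any feature trees on top
        subprocess.check_call(["git", "pull", repo, tag], cwd=tree)
    with open(os.path.join(tree, ".config"), "w") as cfg:
        cfg.writelines(f"{flag}={value}\n" for flag, value in FLAGS.items())
    # Fill in all unspecified config symbols with their defaults.
    subprocess.check_call('yes "" | make oldconfig', shell=True, cwd=tree)
    result = subprocess.run(["make", "-j4"], cwd=tree,
                            capture_output=True, text=True)
    if result.returncode != 0:
        # Failure: the last lines of output would go to the central server.
        return ("failed", result.stderr.splitlines()[-20:])
    with open(os.path.join(tree, "vmlinux"), "rb") as image:
        # Success: report an md5 checksum of the built kernel.
        return ("ok", hashlib.md5(image.read()).hexdigest())
```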


Potential benefits I can see are:

  1. increasing the number of crash-tolerant users willing to test the latest kernel features (better testing of new kernels and new features, which is something that's frequently asked for on lkml)
  2. easing the path of new device drivers (users get to use their shiny new hardware on linux faster)
  3. increasing the feedback from users to developers, in terms of which features people are using/interested in (better, more responsive, kernel development)
  4. reducing the reliance on Linux packagers to release kernels in which an impossible-to-test number of features work flawlessly (less stressed Debian/Ubuntu/RedHat kernel packagers)
  5. easing the path to advanced kernel use such as virtualisation

You know the great thing about that list? Everyone who would need to cooperate gets some benefit, which means that it might just happen...

Macrons and URLs

Macrons are allowed in the path part of URLs, but not (or at least, not yet) in the machine-name part, so http://extensions.services.openoffice.org/project/māori-papakupu is good, but http://www.taiuru.Māori.nz/ is not (use http://www.taiuru.maori.nz/).
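The concrete difference: path characters outside ASCII are just percent-encoded UTF-8, while hostnames need the separate IDNA ("punycode") encoding, which the .nz registry didn't support for macronised labels at the time of writing.

```python
from urllib.parse import quote

# Path: percent-encode the UTF-8 bytes of the macronised vowel.
print(quote("project/māori-papakupu"))       # project/m%C4%81ori-papakupu

# Hostname: each label must be IDNA-encoded instead.
print("www.taiuru.māori.nz".encode("idna"))  # b'www.taiuru.xn--mori-<suffix>.nz'
```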

A review of how lots of programs handle macrons is at http://research.elabs.govt.nz/macron-support-in-open-source-web-applications/.