XSLT with Python

I just thought it had been a long time since my last post. This one wins even more.

I still really like Lighttpd and until recently was only using Apache for mod_svn and mod_xslt. I don't have much choice with mod_svn short of using svnserve (which I may end up doing), but a few months ago (December 12 by the file date) I took up the challenge of replacing mod_xslt.

I did enjoy mod_xslt and can't complain about its performance or memory usage. The fact that the project is dead is disconcerting, but any time the module stops compiling I'm able to get it working again by looking around, posting on the mailing list, or fixing it myself. Really, the only real qualm I have is that it requires Apache.

As an aside, my love for XML has long since passed and so I just want the system to work and I won't make any future enhancements. In general, I am now anti-XML and pro-JSON and -Bencode. My opinion is that there are still uses for XML, but that it is generally overused.

After some time, I developed this CGI script in Python:

#!/usr/bin/env python

from lxml import etree
import cgi
import sys, os

KEY="SOMESECRETKEY"

def transform(xml, xslt):
    doc = etree.parse(xml)
    doc.xinclude()
    style = etree.XSLT(etree.parse(xslt))

    return style.tostring(style.apply(doc))

if __name__ == "__main__":
    import cgitb
    cgitb.enable()

    form = cgi.FieldStorage()
    if "key" not in form or form["key"].value != KEY:
        print "Status: 403 Forbidden"
        print "Content-Type: text/html"
        print
        print "<html><title>403 Forbidden</title><body><h1>403 Forbidden</h1></body></html>"
        sys.exit()

    xml = form["xmlfile"].value
    xslt = form["xsltfile"].value
    contenttype = form["contenttype"].value

    print "Content-Type:", contenttype
    print
    print transform(xml, xslt)

Luckily I didn't use very many mod_xslt-specific features, so everything seemed to "just work." I did lose Content-Type support, so I have to pass the content type in as a hard-coded GET parameter. Notice that I added the secret key in there since I didn't want to bother with proper security.

Now for the Lighttpd configuration. Since I can no longer use .htaccess files in different directories to change which XSLT is used, I get this more-ugly config:

url.redirect = ( "^/$" => "/recent/" )
url.rewrite-once = (
    "^(/recent/rss/)(?:index\.html|index\.xml)?$" => "/cgi-bin/ejona-xslt.py?key=SOMESECRETKEY&xsltfile=/path/to/htdocs/shared/xsl/application-rss%%2Bxml.xsl&contenttype=application/xml&xmlfile=../$1/index.xml",
    "^(/recent/atom/)(?:index\.html|index\.xml)?$" => "/cgi-bin/ejona-xslt.py?key=SOMESECRETKEY&xsltfile=/path/to/htdocs/shared/xsl/application-atom%%2Bxml.xsl&contenttype=application/atom%%2Bxml&xmlfile=../$1/index.xml",
    "^(/recent/atom/summary/)(?:index\.html|index\.xml)?$" => "/cgi-bin/ejona-xslt.py?key=SOMESECRETKEY&xsltfile=/path/to/htdocs/shared/xsl/application-atom%%2Bxml.summary.xsl&contenttype=application/atom%%2Bxml&xmlfile=../$1/index.xml",
    "^(/recent/atom/0\.3/)(?:index\.html|index\.xml)?$" => "/cgi-bin/ejona-xslt.py?key=SOMESECRETKEY&xsltfile=/path/to/htdocs/shared/xsl/application-atom%%2Bxml.0.3.xsl&contenttype=application/xml&xmlfile=../$1/index.xml",
    "^((?:/recent/|/archive/).*)\.(?:html|xml)$" => "/cgi-bin/ejona-xslt.py?key=SOMESECRETKEY&xsltfile=/path/to/htdocs/shared/xsl/text-html.xsl&contenttype=text/html&xmlfile=../$1.xml",
    "^((?:/recent|/archive)/(?:.*/)?)$" => "/cgi-bin/ejona-xslt.py?key=SOMESECRETKEY&xsltfile=/path/to/htdocs/shared/xsl/text-html.xsl&contenttype=text/html&xmlfile=../$1/index.xml",
)
index-file.names = ( "index.xml" )

Notice the %%2B's in some of the URLs. Those make it additionally ugly, but I still prefer that stuff over dealing with Apache.
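
The %2B is there because a literal + in a query string is decoded as a space by form parsers, including the one behind Python's cgi module. Here is a small sketch of the effect using Python 3's urllib.parse, which is not what the Python 2 script above imports. The doubled %% is presumably how a literal percent sign has to be written in a Lighttpd rewrite target; treat that part as an assumption.

```python
from urllib.parse import parse_qs, quote

# Stylesheet file name as it appears in the rewrite rules above.
name = "application-atom+xml.xsl"

# A raw "+" in a query string is decoded as a space...
raw = parse_qs("xsltfile=" + name)["xsltfile"][0]

# ...so the "+" has to be percent-encoded as %2B to survive the round trip.
encoded = quote(name, safe="")
escaped = parse_qs("xsltfile=" + encoded)["xsltfile"][0]

print(repr(raw))      # the "+" has turned into a space
print(encoded)        # the form that goes into the URL
print(repr(escaped))  # the original name again
```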

All in all, it feels like a reasonably hackish solution, but it works great. I don't care about the loss in performance (honestly, who reads a not-updated-in-over-two-years blog?), and if I really cared I could convert the script into a FastCGI or WSGI script. It is nice to know that this proof-of-concept of a blog is somewhat portable now.
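
For the curious, a WSGI version might look roughly like the sketch below. It is only a sketch under assumptions: it targets Python 3 rather than the Python 2 of the CGI script, make_app is a hypothetical name, the transform function is passed in rather than imported, and the 400 response for missing parameters is an addition (the CGI script would simply fail there).

```python
import hmac
from urllib.parse import parse_qs

def make_app(transform, key):
    """Build a WSGI app that checks the key, then calls transform(xml, xslt)."""
    def app(environ, start_response):
        query = parse_qs(environ.get("QUERY_STRING", ""))
        supplied = query.get("key", [""])[0]
        # compare_digest avoids leaking the key through timing differences.
        if not hmac.compare_digest(supplied.encode(), key.encode()):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"403 Forbidden\n"]
        try:
            xml = query["xmlfile"][0]
            xslt = query["xsltfile"][0]
            content_type = query["contenttype"][0]
        except KeyError:
            # Assumption: answer 400 when a required parameter is missing.
            start_response("400 Bad Request", [("Content-Type", "text/plain")])
            return [b"400 Bad Request\n"]
        body = transform(xml, xslt)
        if isinstance(body, str):
            body = body.encode("utf-8")
        start_response("200 OK", [("Content-Type", content_type),
                                  ("Content-Length", str(len(body)))])
        return [body]
    return app
```

To try it locally, something like wsgiref.simple_server.make_server("", 8000, make_app(transform, KEY)).serve_forever() should do, with transform being the function from the CGI script. The stylesheet is still parsed on every request; caching the compiled etree.XSLT objects would be the obvious next step.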

Gtk, Languages, and Memory

It has been a long time since my last post. Sorry, but now to business.

Recently I have re-tried to open up to Mono. I certainly do like some of its abilities and features, but the memory usage always worries me. When I compared the memory usage of Banshee and Rhythmbox, I found memory utilization of 80 MiB and 20 MiB, respectively. That mostly shot the Mono idea out of the water. However, then I checked Tomboy memory usage and found it to be reasonable, but still a little high at 16.7 MiB on my computer with three notes. For reference, Pidgin uses 6.9 MiB and Nautilus uses 12.0 MiB.

Then I got to wondering about the baseline memory usage of GTK+ when it is used from different languages. Here are the languages, files, and commands I used.

Languages, Files, and Commands

C

hello-world.c:

#include <gtk/gtk.h>

static void
on_destroy (GtkWidget * widget, gpointer data)
{
    gtk_main_quit ();
}

int
main (int argc, char *argv[])
{
    GtkWidget *window;
    GtkWidget *label;

    gtk_init (&argc, &argv);

    window = gtk_window_new (GTK_WINDOW_TOPLEVEL);
    gtk_window_set_title (GTK_WINDOW (window), "Hello World");

    g_signal_connect (G_OBJECT (window), "destroy",
                      G_CALLBACK (on_destroy), NULL);

    label = gtk_label_new ("Hello, World");
    gtk_container_add (GTK_CONTAINER (window), label);
    gtk_widget_show_all (window);
    gtk_main ();

    return 0;
}

compile:
gcc hello-world.c -o hello-world `pkg-config --libs --cflags gtk+-2.0`

run:
./hello-world

C++

hello-world.cpp:

#include <gtkmm/main.h>
#include <gtkmm/window.h>
#include <gtkmm/label.h>

class HelloWorld : public Gtk::Window
{
public:
    HelloWorld();
    virtual ~HelloWorld();

protected:
    Gtk::Label m_label;
};

HelloWorld::HelloWorld()
: m_label("Hello, World")
{
    set_title("Hello World");

    add(m_label);

    show_all();
}

HelloWorld::~HelloWorld()
{
}

int main (int argc, char *argv[])
{
    Gtk::Main kit(argc, argv);

    HelloWorld helloworld;
    Gtk::Main::run(helloworld);

    return 0;
}

compile:
g++ hello-world.cpp -o hello-world `pkg-config --libs --cflags gtkmm-2.4`

run:
./hello-world

Python

hello-world.py:

import gtk

def on_destroy(o):
	gtk.main_quit()

w = gtk.Window()
w.set_title("Hello World")

w.connect("destroy", on_destroy)

l = gtk.Label("Hello, World")
w.add(l)
w.show_all()
gtk.main()

run:
python hello-world.py

C#

hello-world.cs:

using Gtk;
using System;

class Hello {

	static void Main()
	{
		Application.Init ();

		Window window = new Window ("Hello World");

		window.DeleteEvent += delete_event;

		window.Add (new Label ("Hello, World"));

		window.ShowAll ();
		Application.Run ();

	}
	static void delete_event (object obj, DeleteEventArgs args)
	{
		 Application.Quit ();
	}
}

compile:
mcs hello-world.cs -pkg:gtk-sharp

run:
mono hello-world.exe

IronPython

hello-world.py:

import clr
clr.AddReference("gtk-sharp")
import Gtk

def delete_event (o, args):
  Gtk.Application.Quit ()

Gtk.Application.Init ()
w = Gtk.Window ("Hello World")

w.DeleteEvent += delete_event

l = Gtk.Label ("Hello, World")
w.Add(l)
w.ShowAll ()
Gtk.Application.Run ()

run:
mono ipy.exe hello-world.py

Results

All the memory usages were recorded from GNOME's System Monitor. I used the new "Memory" column that is supposed to be less misleading than other measurements.

Language      Memory Usage
C             1.9 MiB
C++           2.7 MiB
Python        6.6 MiB
C#            3.3 MiB
IronPython    29.8 MiB

I tried to do Java as well, but I could not get the Java bindings to compile. I think that this test gave Mono some credibility and removed any consideration of using IronPython from my mind. I was not very surprised by Python's "high" memory usage, since I had already looked into this when looking into GEdit's plugins. This test says nothing about actual memory usage in typical programs in each of the languages, just the baseline for how low the memory usage can go.
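
To repeat the measurement without System Monitor, a rough Python sketch follows. It assumes Linux's /proc/<pid>/statm and approximates the figure as resident pages minus shared pages. That formula is an assumption for getting a comparable number, not necessarily what the "Memory" column computes, and the function names are made up for this example.

```python
import os

def private_mib(statm_text, page_size=4096):
    """Estimate non-shared resident memory in MiB from /proc/<pid>/statm text."""
    fields = statm_text.split()
    resident_pages = int(fields[1])  # second field: resident set size
    shared_pages = int(fields[2])    # third field: resident shared pages
    return (resident_pages - shared_pages) * page_size / (1024 * 1024)

def read_private_mib(pid):
    """Read /proc/<pid>/statm (Linux only) and apply the estimate above."""
    with open("/proc/%d/statm" % pid) as f:
        return private_mib(f.read(), os.sysconf("SC_PAGE_SIZE"))
```

Calling read_private_mib with the PID of a running hello-world gives one number per language to compare.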

Avahi and DNS

I was playing around with Avahi and DNS and decided that since it was so much fun, I should share my experience.

I already had Avahi set up. Avahi is a Zeroconf (aka Bonjour, Rendezvous) implementation. Zeroconf allows you to find services on a network. Previously I just used Avahi for finding iTunes shares (with Rhythmbox) and publishing my services (ssh, http, ipp, rsync, etc.).

The Beginning

Previously I noticed, with avahi-discover, that my workstation was published (because publish-workstation is enabled by default). I had no idea how to use it though. Looking through Planet GStreamer for the first time, I found a post describing how to use this other tidbit of information.

Getting Automatic Local Hostnames

Up to this point, I had been setting hostnames in /etc/hosts for computers on my network. My network maintains fairly stable IPs, so this was not a big issue. But with Avahi, this can be automatic! I emerged nss-mdns and added mdns (most people probably want mdns4 instead) to the hosts line of /etc/nsswitch.conf (it now looks like "hosts: files dns mdns").
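
If you would rather script the nsswitch.conf change than edit the file by hand, here is a minimal sketch. The function name is hypothetical, it only transforms text (writing the result back to /etc/nsswitch.conf is left to you), and it assumes the hosts line carries no trailing comment.

```python
def add_hosts_source(conf_text, source="mdns"):
    """Return nsswitch.conf text with `source` appended to the hosts line."""
    out = []
    for line in conf_text.splitlines():
        name, sep, rest = line.partition(":")
        # Only touch the real "hosts:" entry, not comments or other databases.
        if sep and name.strip() == "hosts":
            if source not in rest.split():
                line = line.rstrip() + " " + source
        out.append(line)
    return "\n".join(out) + "\n"
```

Running it twice leaves the file unchanged the second time, since the source is only appended when it is missing.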

DNS Caching and Local Domain Setting

At this point, I could go to my machine via mastermind.local, and the ersoft server via wife.local. As I removed wife and mastermind from my hosts file, I realized that name lookups were much slower when they were answered by Avahi. Comcast's DNS servers are unbearably slow, so I figured this would be a good time to set up a DNS cache.

I found a post for DNS caching for Ubuntu. Since I already had dnsmasq, I uncommented the listen-address line in /etc/dnsmasq.conf and set it equal to 127.0.0.1 ("listen-address=127.0.0.1"). Then I ran rc-update add dnsmasq default and /etc/init.d/dnsmasq start.

To configure your system to use this cache you need to modify /etc/dhcp/dhclient.conf. It is very possible that you are missing this file. If you are, just emerge dhcp.

This file is also where you can set your default domain name. Setting the domain name allows you to connect to a host via hostname as opposed to hostname.domainname. In my case, without the default domain name set, I would have to connect to mastermind.local to get to my laptop. For most people, the domain name would be local as well. My dhclient.conf now looks like:

prepend domain-name-servers 127.0.0.1;
supersede domain-name local;

If you were previously using dhcpcd in Gentoo, you will want to change /etc/conf.d/net to use dhclient. You can achieve this by:

modules=( "dhclient" )

You will need to restart your network device before it will use the new configuration. If you want to test whether everything is working, dig is a useful command. It is part of bind-tools. Give dig the host you want to look up as its argument. It will give you a lot of good DNS information, including "Query time." The query time from the cache should be 0-1 milliseconds.
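
The same check can be done from Python by timing a lookup through the system resolver. This is a rough sketch, not a replacement for dig: time_lookup is a made-up name, and the measured time includes everything the resolver does (hosts file, DNS, mDNS), not just the DNS query.

```python
import socket
import time

def time_lookup(host, resolver=socket.getaddrinfo):
    """Return how long one name lookup takes, in milliseconds."""
    start = time.perf_counter()
    resolver(host, None)  # raises socket.gaierror if the name does not resolve
    return (time.perf_counter() - start) * 1000.0
```

Call it twice in a row for the same host; with the cache working, the second call should be much quicker than the first.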

Updates

This blog has been very lonely for the past few months and I finally am using it again. I finished a few of the tasks that needed to be done a while back, but not all of them.

Most noticeably, I broke the RSS feed (I have no clue how!). It works perfectly fine from the command line, but not from Apache... Since I don't like RSS that much anyway, I am not too sad about it breaking.

I have now deprecated the Atom 0.3 feed, so there are no more links to it. I thought that the feed was valid, but checking it at feedvalidator.org shows otherwise. The feed is not valid, Atom 1.0 has had many more months to be implemented in aggregators, and Atom 0.3 is now officially deprecated, so I see no reason to put in the effort to make it validate and have decided to deprecate it altogether.

I finally fixed the stupid double slash in my URLs and removed the leading space in most links (not a real issue, but it was just code cosmetics). I also changed one thing in the Atom 1.0 feed, although I do not remember what now.

You will find a new Atom 1.0 Summaries feed. This was going to be in addition to the RSS feed, but since that broke, it will just be a substitute for the RSS feed for those people who like to read posts on the actual site. The summary for a post is now character-limited and cut at a word boundary instead of taking the first paragraph.
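
To illustrate what "character-limited and cut at a word boundary" means, here is a short Python sketch. It is not the XSLT the site uses, and the 200-character default and the trailing ellipsis are assumptions for the example.

```python
def summarize(text, limit=200, ellipsis="…"):
    """Cut text to at most `limit` characters (plus ellipsis) on a word boundary."""
    text = " ".join(text.split())  # collapse runs of whitespace
    if len(text) <= limit:
        return text
    # Look one character past the limit so a word ending exactly at the
    # limit is kept whole.
    cut = text[:limit + 1]
    if " " in cut:
        head = cut.rsplit(" ", 1)[0]  # drop the (possibly partial) last word
    else:
        head = text[:limit]           # a single long word: hard cut
    return head.rstrip() + ellipsis

print(summarize("The quick brown fox jumps", 12))  # → The quick…
```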

Those who are very observant may have noticed that all the CSS files now exist. I split the one stylesheet that I had into several different stylesheets and added a print stylesheet. In addition, I actually created the fixed stylesheet layout, so now the "Choose a style" links work. The original CSS did not completely validate, and this has been fixed as well (so that if I removed the hacks and the Mozilla-specific parts it would validate).

Lighttpd

I have tried out and really enjoyed the lighttpd web server. I hope to eventually swap this server over to it. The main problem with swapping is that I cannot use mod_xslt, so I will need to find or develop a component to perform the XSLT. If you have not tried out lighty, you should.

Unicode in Linux

UTF-8 is a very nice tool that allows easy usage of Unicode. Although swapping to UTF-8 has many benefits, the reason I tell people I use the encoding is to write pretty glyphs and the euro (so what if I live in the US?).

However, in Windows, even though every application can pretty much accept UTF-8 as input, it is difficult to input needed characters. Your best bet is to memorize character codes and do Alt + code or swap to the keyboard where such a character is native. I'm not too game for these approaches. I prefer X’s approach of using a configurable Compose key. You press the compose key and then a key chain for a given character. Each key chain was created by some person so as to make as much sense as possible. For example, a pretty open quote, ' “ ', can be typed by Compose + < + ". Likewise, for a close quote, ' ” ', the key chain is Compose + > + ". This allows you to actually remember how to type each character you care about. If that isn't good enough, or if there isn’t a key chain for a character you want, you can configure the chains with the file ~/.XCompose.

There are two large problems with the Compose key, however: the key has no default setting, and most Linux users have no earthly clue that such a feature is available to them. It is actually easy (relatively, for Linux) to set up the key if you know about it: add “Option "XkbOptions" "compose:menu"” to /etc/X11/xorg.conf, restart X, and you have enabled the feature. The given line makes the Menu key the Compose key (its keycap shows a drop-down menu with a mouse pointer on one of the items), but other keys are allowed. In my install, I have two other options: ralt and rwin. To see the allowed settings for yourself, just look at /usr/X11R6/lib/X11/xkb/rules/xorg.lst (assuming you do not specify an XkbRules option; if you do, view the file with the same rule name instead of xorg.lst). The default key bindings change for each locale/charset.

Now the issue is figuring out how to type a certain character. All of the key chains are in a normal configuration file in /usr/X11R6/lib/X11/locale/%LOCALE%/Compose. My locale is en_US.UTF-8. To find your locale you can use the “locale” command. All you really have to do is grep the file for the character you are looking for. I use “gucharmap” for Unicode information and as a normal character map. Here is a script that I found that might make the grepping a little easier (you must edit $dir to fit your system):

#!/bin/sh
dir=/usr/X11R6/lib/X11/locale/en_US.UTF-8
dir="$dir/`sed -n "s#\([^/]*\)/.*:.*$LANG#\1#p" < $dir/Compose`"
grep -F -- "$1" "$dir/Compose"
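
The same lookup can be sketched in Python, which also splits out the key chain. This is a rough equivalent under assumptions: find_sequences is a made-up name, the function works on the text of a Compose file that you read yourself, and it only understands the simple one-entry-per-line format shown in this post.

```python
def find_sequences(compose_text, char):
    """Return the key chains in Compose-file text that produce `char`."""
    chains = []
    for line in compose_text.splitlines():
        line = line.strip()
        if not line.startswith("<") or ":" not in line:
            continue  # skip comments, includes, and blank lines
        keys, _, result = line.partition(":")
        # The result part looks like: "…" U2026 # HORIZONTAL ELLIPSIS
        if result.strip().startswith('"%s"' % char):
            chains.append(keys.split())
    return chains
```

Feed it the contents of the Compose file for your locale, for example open(path, encoding="utf-8", errors="replace").read(), along with the character you want to type.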

If you create a ~/.XCompose file, then you must include the normal key bindings by using “include "%L".” Here is my ~/.XCompose file:

include "%L"
<Multi_key> <minus> <minus> <underscore> : "‐" U2010 # HYPHEN
<Multi_key> <3> <period> : "…" U2026 # HORIZONTAL ELLIPSIS
<Multi_key> <period> <3> : "…" U2026 # HORIZONTAL ELLIPSIS
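
If you add many entries, a few lines of Python can generate them in the same layout as above. This is only a sketch: compose_line is a hypothetical helper, it handles a single character per entry, and it relies on unicodedata knowing the character's name.

```python
import unicodedata

def compose_line(keys, char):
    """Format one ~/.XCompose entry for a single character."""
    chain = " ".join("<%s>" % k for k in ["Multi_key"] + list(keys))
    # unicodedata.name raises ValueError for characters without a name.
    return '%s : "%s" U%04X # %s' % (chain, char, ord(char),
                                     unicodedata.name(char))

print(compose_line(["period", "3"], "…"))
```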