Weaver
 
... looming on the horizon.

The Weaver Daemon is a server that threads news articles and serves the threads out. A web-based front end exists -- Loom.

Normal news readers fetch the headers for the articles you haven't seen, then thread the headers and display them in a pleasing manner. This is fine, but it isn't really an ideal way for a web-based threading interface to work. Threading is a (relatively speaking) time-consuming task, and if you wish to design a web-based news reader that's able to serve out, say, tens of pages per second, you either have to throw a 64-CPU Sun Enterprise server at the problem, or do something non-traditional.

weaverd (i.e., The Weaver Daemon) is an attempt at doing something clever.

The main idea in weaverd is to keep all the threads in memory. A scratch-buffer computation shows that this means every 10M articles will require 1GB of RAM (10,000,000 articles at the 100 bytes apiece estimated below is 1GB), which is OK.

weaverd will keep its structures mirrored to disk, so that it can start up in a pretty speedy fashion, but won't really read anything from disk once it has bootstrapped itself. (Well, it'll read new articles as they arrive to thread them, but won't touch the data structures otherwise.)
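The page doesn't say how the mirroring works; one plausible mechanism is to mmap the node file MAP_SHARED, so that "writing out dirtied nodes" is just letting the kernel flush the mapping. A minimal sketch under that assumption (the file layout and the map_nodes name are made up):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Map the on-disk node array into memory.  With MAP_SHARED,
       stores into the mapping end up in the file, so weaverd can
       treat the nodes as ordinary memory after bootstrapping and
       never read the file again; msync() (or the kernel's normal
       writeback) keeps the disk copy current. */
    void *map_nodes(const char *file, size_t bytes)
    {
        int fd = open(file, O_RDWR | O_CREAT, 0644);
        if (fd < 0)
            return NULL;
        if (ftruncate(fd, bytes) < 0) {  /* make the file big enough */
            close(fd);
            return NULL;
        }
        void *mem = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
        close(fd);                       /* the mapping outlives the fd */
        return mem == MAP_FAILED ? NULL : mem;
    }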

weaverd has three basic structures:

  • String storage. Many strings (author names, subjects) repeat themselves, so only one copy of each unique string is kept, and all the instances refer to this copy. A hash structure is used to keep track of the strings (see the interning sketch after this list).
  • The group array. Each group is represented by a structure that has the group name, the group id, and, most importantly, the list of articles (or nodes) in two orders -- article order and thread order. (A sketch of this structure also follows the list.)
  • The nodes. Each node (i.e., article) looks like this:
    /* node_id isn't defined on this page; assuming it's a 32-bit
       index into the node array, and assuming a 32-bit time_t and a
       packed layout, the struct below comes to the 42 bytes counted
       below. */
    #include <time.h>

    typedef unsigned int node_id;

    typedef struct {
      unsigned short group_id;
      unsigned int number;         /* Article number in group (32 bits) */
      node_id id;                  /* Unique article id */

      node_id parent;
      node_id next_instance;       /* Used for cross-posting. */

      /* These three are offsets into the string storage. */
      unsigned int subject;
      unsigned int author;
      unsigned int message_id;
      time_t date;

      node_id first_child;
      node_id next_sibling;
    } node;
    
    So each node takes 42 bytes, plus the strings from the string storage. Running over the spool shows that each article has, on average, 58 bytes of strings that weaverd has to store, so that's exactly 100 bytes per article (42 + 58). Scarily enough, that's what I guesstimated before I started.

    I hadn't really counted up the overhead that all the hash tables entail, though. That's about 20 bytes more per article.
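To make the string-storage bullet above concrete, here's a minimal interning sketch. The chained buckets, the djb2-style hash, the fixed-size string block, and all the names are assumptions, not taken from this page:

    #include <stdlib.h>
    #include <string.h>

    #define STRING_BUCKETS 65536

    typedef struct string_entry {
        unsigned int offset;             /* offset into the string block */
        struct string_entry *next;
    } string_entry;

    static char string_block[1 << 24];   /* all unique strings, back to back */
    static unsigned int string_used = 1; /* offset 0 reserved for "none" */
    static string_entry *buckets[STRING_BUCKETS];

    static unsigned int hash(const char *s)
    {
        unsigned int h = 5381;
        while (*s)
            h = h * 33 + (unsigned char)*s++;
        return h % STRING_BUCKETS;
    }

    /* Return the offset of S in the string storage, adding it first
       if it hasn't been seen before.  Every node with the same
       author or subject then refers to the same bytes. */
    unsigned int intern(const char *s)
    {
        unsigned int h = hash(s);
        for (string_entry *e = buckets[h]; e; e = e->next)
            if (strcmp(string_block + e->offset, s) == 0)
                return e->offset;

        string_entry *e = malloc(sizeof *e);
        e->offset = string_used;
        e->next = buckets[h];
        buckets[h] = e;
        strcpy(string_block + string_used, s); /* no overflow check: sketch */
        string_used += strlen(s) + 1;
        return e->offset;
    }

A bucket entry like string_entry is also roughly where the "about 20 bytes more per article" of hash-table overhead mentioned above comes from.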
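And a corresponding sketch of one entry in the group array, reusing node_id from the node struct; the field names and the plain-array representation are likewise assumptions:

    typedef struct {
        unsigned int name;        /* group name: offset into string storage */
        unsigned short group_id;
        unsigned int count;       /* number of articles in the group */
        node_id *numeric;         /* the nodes in article-number order */
        node_id *thread;          /* the same nodes in thread (display) order */
    } group;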

This is the basic algorithm when receiving a new message:

  • Pick out the Date, From, Message-ID, Newsgroups and the last References entry, and look them up in the string storage. (Well, the Date is converted to a time_t.)
  • Enter the resulting node into the numeric (article-order) node array for the group.
  • Find out where in the thread structure the new node fits, and (possibly) alter its parent and previous sibling. Then re-compute the thread-order node array for the group -- either by being clever or by brute force, depending on how fast brute-forcing turns out to be. (A sketch follows this list.)
  • Write out to disk all the nodes that were dirtied.
And that's basically it, except for some bookkeeping details. It's really simple.
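A sketch of the linking step. It assumes node id 0 means "none", appends new replies after their older siblings, and leans on the msgid_lookup() helper sketched in the expiry section below; the brute-force recomputation is just a depth-first walk, called once per thread root in article order:

    /* Hook N into the thread tree: its parent is the node whose
       Message-ID equals the last References entry (a string-storage
       offset), if we know about it. */
    void link_node(node *nodes, node *n, unsigned int last_ref)
    {
        node_id pid = msgid_lookup(last_ref);
        if (!pid) {
            n->parent = 0;                  /* no known parent: thread root */
            return;
        }
        n->parent = pid;
        if (!nodes[pid].first_child) {
            nodes[pid].first_child = n->id;
        } else {
            node_id sib = nodes[pid].first_child;
            while (nodes[sib].next_sibling) /* append as the last sibling */
                sib = nodes[sib].next_sibling;
            nodes[sib].next_sibling = n->id;
        }
    }

    /* Brute-force thread order: depth-first over one subtree,
       writing ids into G->thread in display order and returning the
       new cursor position. */
    unsigned int thread_walk(node *nodes, group *g, node_id id,
                             unsigned int pos)
    {
        g->thread[pos++] = id;
        for (node_id kid = nodes[id].first_child; kid;
             kid = nodes[kid].next_sibling)
            pos = thread_walk(nodes, g, kid, pos);
        return pos;
    }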

Now, expiry adds a twist. It would probably be convenient to be able to go from Message-ID to node, so perhaps a separate Message-ID hash table would be nice, in addition to the general string hash table.
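A sketch of that table. Since the string storage already de-duplicates, an incoming Message-ID can be interned first and then compared by offset; the names and bucket count reuse the interning sketch above:

    typedef struct msgid_entry {
        unsigned int message_id;         /* string-storage offset */
        node_id id;
        struct msgid_entry *next;
    } msgid_entry;

    static msgid_entry *msgid_buckets[STRING_BUCKETS];

    void msgid_insert(unsigned int message_id, node_id id)
    {
        unsigned int h = message_id % STRING_BUCKETS;
        msgid_entry *e = malloc(sizeof *e);
        e->message_id = message_id;
        e->id = id;
        e->next = msgid_buckets[h];
        msgid_buckets[h] = e;
    }

    /* Used by cancel and expiry to go straight from Message-ID to
       node; returns 0 (the "none" id) when the article is unknown. */
    node_id msgid_lookup(unsigned int message_id)
    {
        unsigned int h = message_id % STRING_BUCKETS;
        for (msgid_entry *e = msgid_buckets[h]; e; e = e->next)
            if (e->message_id == message_id)
                return e->id;
        return 0;
    }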

The daemon will listen on a socket and take the following commands:

  • enter [path to message]: Parse the message and enter it into the structures as described above.
  • cancel [message-id]: Remove the message in question from the structures.
  • group-thread [group] [page] [page-size] [last]: This is the main command used by the web interface. It returns the articles on page PAGE, counted in thread order ("threadly speaking"). That is, if there are 1000 articles in the group, asking for 990-1000 will return the last ten articles as the threaded view displays them -- so they won't, in general, be article numbers 990-1000.
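Since both orders are pre-computed, serving that command is just indexing into the group's thread-order array. A sketch, with a made-up output format and ignoring the [last] argument, whose meaning the page doesn't spell out:

    #include <stdio.h>

    /* Serve "group-thread": PAGE_SIZE entries starting at page PAGE
       of the pre-computed thread order.  No threading work happens
       per request, which is the whole point of keeping the arrays
       in memory. */
    void serve_group_thread(FILE *out, group *g, node *nodes,
                            unsigned int page, unsigned int page_size)
    {
        unsigned int start = page * page_size;
        for (unsigned int i = start;
             i < start + page_size && i < g->count; i++) {
            node *n = &nodes[g->thread[i]];
            fprintf(out, "%u\t%s\t%s\n", n->number,
                    string_block + n->author,
                    string_block + n->subject);
        }
    }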