|
... looming on the horizon.
The Weaver Daemon is something that serves out threads. A web-based
front end exists -- Loom.
Normal news readers fetch the headers for the articles you haven't
seen, and then threads the headers and displays them in a pleasing
manner. This is fine, but isn't really an ideal way for a web-based
threading interface to work. Doing threading is a (relatively
speaking) time-consuming task, and if you wish to design a web-based
news reader that's able to serve out, say, tens of pages per second,
you either have to throw a 64 CPU Sun Enterprise server at the
problem, or you do something non-traditional.
weaverd (i.e., The Weaver Daemon) is an attempt at doing something
clever.
The main idea in weaverd is to keep all the threads in memory.
Scratch-buffer computation shows that this means that every 10M
articles will require 1GB RAM, which is OK.
weaverd will keep its structures mirrored to disk, so that it can
start up in a pretty speedy fashion, but won't really read anything
from disk once it has bootstrapped itself. (Well, it'll read new
articles as they arrive to thread them, but won't touch the data
structures otherwise.)
weaverd has three basic structures:
- String storage. Many strings (author names, subjects) repeat
themselves, so only one copy of each unique string is kept, and all
the instances refer to this string. A hash structure is used to keep
track of the strings.
- The group array. Each group is represented by a structure that
has the group name, group id, and most importantly -- the list of
articles (or nodes) in two orders -- article order and thread order.
- The nodes. Each node (i.e., article) looks like this:
typedef struct {
unsigned short group_id;
unsigned int number; /* Article number in group, max 3*/
node_id id; /* Unique article id */
node_id parent;
node_id next_instance; /* Used for cross-posting. */
/* These three are offsets into the string storage. */
unsigned int subject;
unsigned int author;
unsigned int message_id;
time_t date;
node_id first_child;
node_id next_sibling;
} node;
So each node takes 42 bytes, plus the strings from the string storage.
Running over the spool shows that each article has on average 58 bytes
that weaverd has to store, so that's exactly 100 bytes per article.
Scarily enough, that's what I guesstimated before I started.
I hadn't really counted up the overhead that all the hash tables
entail, though. That's about 20 bytes more per article.
This is the basic algorithm when receiving a new message:
- Pick out the Date, From, Message-ID, Newsgroups and the last
References, and look then up in the string storage. (Well, the Date
is converted to a time_t.)
- Enter them into the numeric node array for the group.
- Find out where in the thread structure they fit, and alter
(possibly) the parent and the previous sibling. Then re-compute the
thread node array for the group, possibly being clever, or just
brute-force it, depending on how fast brute-forcing will be.
- Write out to disk all the nodes that were dirtied.
And that's basically it, except for some bookkeeping details. It's
really simple.
Now, expiry adds a twist. It would probably be convenient to be
able to go from Message-ID to node, so perhaps a separate Message-ID
hash table would be nice, in addition to the general string hash
table.
The daemon will listen on a socket and take the following commands:
- enter [path to message]: Parse the message and enter it into the
structures as described above.
- cancel [message-id]: Remove the message in question from the
structures.
- group-thread [group] [page] [page-size] [last]: This is the main
command used by the web interface, and returns the threads starting
with (threadly speaking) on page PAGE. That is, if there are 1000
articles in the group, asking for 990-1000 will return the ten last
displayed articles in the group, if the group was already threaded.
So they won't be article numbers 990-1000.
|