I’ve been looking some more into recent new features added to bleadperl by demerphq, such as Aho-Corasick trie
matching, and how we can effectively support this in SpamAssassin. Here’s the
state of play.
These are the “base
strings” extracted from the SpamAssassin SVN trunk body ruleset (ignore the
odd mangled UTF-8 char in here, it’s suffering from cut-and-paste breakage).
A “base string” is a simplified subset of the regular expression; specifically,
these are the cases where the “base strings” of the rule are simpler than the
full perl regular expression language, and therefore amenable to fast parallel
string matching algorithms.
The base strings appear in that file as “r” lines, like so:
r I am currently out of the office:__BOUNCE_OOO_3 __DOS_COMING_TO_YOUR_PLACE
r I drive a:__DOS_I_DRIVE_A
r I might be c:__DOS_COMING_TO_YOUR_PLACE
r I might c:__DOS_COMING_TO_YOUR_PLACE
The base string is the part after “r” and before the “:”; after that, the rule
names appear.
Now, here are some limitations that make this less easy:
One string to many rules: each one of those strings corresponds to one or
more SpamAssassin rules.
One rule to many strings: each rule may correspond to one or more of those
strings. So it’s not a one-to-one correspondence either way.
No anchors: the strings may match anywhere inside the line, similar to
("foo bar baz" =~ /bar/).
Multiple rules can fire on the same line: each line can cause multiple
rules to fire on different parts of its text.
Subsumption is not permitted: the base-string extractor plugin has already
established cases where subsumption takes place. Each string will not
subsume another string; so a match of the string “food” against the strings
“food” and “foo” should just fire on “food”, not on “foo”.
Overlapping is permitted: on the other hand, overlapping is fine; “foobar”
matched against “foo” and “oobar” should fire on both base strings. (The
above two are basically for re2c compatibility. This is the main reason the
strings are so simple, with no RE metachars — so that this is possible,
since re2c is limited in this way.)
Most rules are more complex: most of the ruleset — as you can see from the
‘orig’ lines in that file — are more complex than the base string alone. So
this means that a base string match often needs to be followed by a
“verification” match using the full regexp.
Now, the problem is to iterate through each line of the (base64-decoded,
encoding-decoded, HTML-decoded, whitespace-simplified) “body text” of a mail
message, with each paragraph appearing as a single “line”, and run all those
base strings in parallel, identifying the rule names that then need to
be run.
This is turning out to be quite tricky with the bleadperl trie code.
For example, if we have 3 base strings, as follows:
hello:RULE_HELLO
hi:RULE_HI
foo:RULE_FOO
At first, it appears that we could use the pattern itself as a key
into a lookup table to determine the pattern that fired:
%base_to_rulename_lookup = (
'hello' => ['RULE_HELLO'],
'hi' => ['RULE_HI'],
'foo' => ['RULE_FOO']
);
if ($line =~ m{(hello|hi|foo)}) {
$rule_fired = $base_to_rulename_lookup{$1};
}
However, that will fail in the face of the string “hi foo!”, since
only one of the bases will be returned as $1, whereas we want
to know about both “RULE_HI” and “RULE_FOO”.
m//gc might help:
%base_to_rulename_lookup = (
'hello' => ['RULE_HELLO'],
'hi' => ['RULE_HI'],
'foo' => ['RULE_FOO']
);
while ($line =~ m{(hello|hi|foo)}gc) {
$rule_fired = $base_to_rulename_lookup{$1};
}
That works pretty well, but not if two patterns overlap: /abc/
and /bcd/, matching on the string “abcd”, for example, will fire
only on “abc”, and miss the “bcd” hit.
Given this, it appears the only option is to run the trie match, and then
iterate on all the regexps for the base strings it contains:
if ($line =~ m{hello|hi|foo}) {
$line =~ /hello/ and rule_fired("HELLO");
$line =~ /hi/ and rule_fired("HI");
$line =~ /foo/ and rule_fired("FOO");
}
Obviously, that doesn’t provide much of a speedup — in fact,
so far, I’ve been unable to get any at all out of this method. :(
This can be optimized a little by breaking into multiple trie/match
sets:
if ($line =~ m{hello|hi}) {
$line =~ /hello/ and rule_fired("HELLO");
$line =~ /hi/ and rule_fired("HI");
...
}
if ($line =~ m{foo|bar}) {
$line =~ /foo/ and rule_fired("FOO");
$line =~ /bar/ and rule_fired("BAR");
...
}
But still, the reduction in regexp OPs vs the addition of logic
OPs to do this, result in an overall slowdown, even given
the faster trie-based REs.
Suggestions, anyone?
(by the way, if you’re curious, the current code is
here in SVN.)
Tags:
aho corasick hacking optimisation perl regexps spamassassin
I first read
Managing Gigabytes seven years ago. It's quite an interesting book on compression and full text searching - there is lots of software that does that now. The weirdest thing is that seven years ago, gigabytes sounded like a large amount of data. These days a terabyte isn't particularly big. How big will my laptop HD be in seven years time?
The web site for the LPW has now been updated with this year's details. It contains the Call for Proposals and a registration page.
"This is like sitting your mother down to teach her computers and beginning by explaining little endian vs big endian and the intimate details of protected mode memory" -- Michael G. Schwern, firmly believing that there must be a better way of installing perl.
Just off the heels of the Chicago Perl Hack-A-Thon, which focused very much on Parrot, I have a great interview with Chip Salzenberg. Chip is a past Perl 5 Pump King and is the current Parrot Pump King. Parrot, as many of you know, is a virtual machine that is intended be a platform for running Perl 6. Actually, the Parrot virtual machine hopes to run many languages other than Perl 6. There is currently beginning implementations of many languages on Parrot including Ruby and TCL. Parrot is different than many other virtual machines that we know of. It is register-based, as opposed to the common stack-based architectures, and it is designed to run dynamic languages.
I'm in the process of taking over the maintainance of Win32::API module from Aldo Calpini, the original author. I'm collecting patches, related web pages and information as I'd like to make Win32::API work with the current VanillaPerl/Mingw environment. I also updated win32.perl.org Wiki to announce this. So, anyone interested in getting your patches, suggestions, rants, ... please get onboard now and communicate with me. If you already have your bug reports filed on http://rt.cpan.org/, be sure I'm working on those (if I ever get to manage all of these bugs;-D )...
While Pudge and I were comparing police brutality videos on YouTube, he also showed me this guy singing about
Code Monkey. Apparently he quit his programming job to be a singer. Those two groups look a lot a like, I guess.
"Code Monkey like Tab and Mountain Dew!"
Just off the heels of the Chicago Perl Hack-A-Thon, which focused very much on Parrot, I have a great interview with Chip Salzenberg. Chip is a past Perl 5 Pump King and is the current Parrot Pump King.
Parrot, as many of you know, is a virtual machine that is intended be a platform for running Perl 6. Actually, the Parrot virtual machine hopes to run many languages other than Perl 6. There is currently beginning implementations of many languages on Parrot including Ruby and TCL. Parrot is different than many other virtual machines that we know of. It is register-based, as opposed to the common stack-based architectures, and it is designed to run dynamic languages.
Listen to hear what Chip has to say about the state of Parrot and what you can do to help out.
I'm trying to get rid of a lot of the crap that's crufting up my office. Today
I went through my old college papers and threw out nearly all of them. On a
handout from a hermeneutics class, "From Notes Made in 1970-1971" (I assume
it's Derrida), I saw the following marginalia from Young Self:
Prof. Olsen would be a level boss in the Philosophy arcade game.
Strange, but true. He would be.
Now with class data support:
package Universe;
use Class::BuildMethods
pi => {
class_data => 1,
default => 3.1415927
};
chip writes "On behalf of the Parrot team, I'm proud to announce Parrot 0.4.7, "Caique"! Parrot is a virtual machine aimed at running all dynamic languages.
I've been getting a fair number of emails about Crypt::Rijndael, so I guess a lot of people must be using it. I thought it was some little used thing that nobody cared about. :)
I think I've fixed most of
the bugs in RT, but I want to wait for more CPAN Tester reports to ensure the systems that had problems previously now have them fixed. I would certainly appreciate any CPAN testers who can add these data points for Crypt::Rijndael:
Systems that had failures in 0.05 but not passes in 0.06_02 yet:
- Linux 2.4.7-0.44.31 (s390x-linux)
- MSWin32 4.0 (MSWin32-x86-multi-thread)
- Solaris 2.9 (sun4-solaris)
- Aix 5.3.0.0 (aix-thread-multi-64int)
- Darwin 8.0 (darwin-thread-multi-2level)
- Freebsd 6.0-release (alpha-freebsd)
Systems with bugs in RT:
- Debian 3.0 on alpha (guessing 64 bit)
- Compaq Tru64 5.0 on alpha
- Solaris 8 x86
- Solaris 8 SPARC
Today we got new windows. I'm talking about real windows, not Vista or similar crap. When we bought our appartment a year ago, the windows where still the orginal ones from when the building was erected (~1905). The looked nice, but the wind was actually blowing through some cracks, and some windows were hard to open/close, because they were warped badly.
So after lots of fuzzing (do we want new windows or do we want to renovate the old ones, what kind of new windows would we want (plastic, wood, wood/alu)...) we ordered new custom made wood windows, which they started to install today.
They basically tore out the old windows, covering all of our appartment with dust (we covered most of our stuff with plastic sheets because we were expecting this..). It was quite cold, but luckyly they worked fast. They started at ~8:00 and were finished at ~15:00. Well, they only finished the rough bits, which means we now have the new windows in place everywhere, but they still need to finish the windowsill and plaster all of the raw brickwork.
Hopefully, they'll finish everything tomorrow, then we should be able to resume normal living on Wednesday (after cleaning up everything...)
Karl Rove recently stated, "Conservatives saw what happened to us on 9/11 and said: we will defeat our enemies. Liberals saw what happened to us and said: we must understand our enemies." This perfectly explains a large part of the ham-handed handling of the "War on Terror" by the current administration. Yes, we must try to stop terrorism, but we cannot do it solely with military force. That's why I recently wrote about problems with the 9/11 Commission report. Understanding your enemy is a key step in defeating him.
So when the US vetoed a UN resolution that would have condemned both Israel and Palestine for their activities in the latest Gaza violence, despite not a single other Security Council member casting a veto, it should come as no surprise that the US, once again, failed to understand the sensitivity of this issue. The US has frequently been the sole nation blocking UN resolutions regarding Israel. Right or wrong, the knee-jerk responses of the US in this matter has long been a grievance of many Arab nations.
Today, Arab ministers, meeting in Cairo, as a direct response to the US veto, have voted to resume aid to the Palestinian government. Ismail Haniyeh, the Palestinian Prime Minister, offered to step down if that would end the boycott. This would have been a positive event. Now, thanks in large part to the US government's lack of sensitivity to political issues in this area, he may not have to. This is going to complicate things tremendously and more people are going to die because of it.
Found over at this UK blogger’s site. So RFID passports are now deployed in both Ireland and the UK…
Tags:
I finally got off my butt and took photos of our
house and put
them up on Flickr.
The photostream should start at the exterior and go through the second floor.
The third floor photos are earlier, because I took them when we were working up
there.
I should also scan the floorplan that was in our appraisal...
--
rjbs
I've released Crypt-Rijndael-0.06_01 to see if the simple change of using <sys/type.h> solves most of the
bugs in RT. In trying to test this, I've discovered that everything in my life is now some sort of BSD. I used to have linux logins, and I guess I should pul that Windows box out of the closet, but I'll wait to see what CPAN Testers say.
I may be a bit hungover this Sunday morning due mainly to the effects of the subject of this post, but — Guinness National Lottery?
is anyone going to fall for that?
From: hamilton jones
Subject: GUINNESS. CUSTORMERS PROMOTION
GUINNESS. CU
This week on the Perl 6 mailing lists "...problem 2 is probably just me being confused (though I'd love an explanation, from @leo ;-))." -- Jonathan Worthington, in 'set_pmc_keyed_int delegates to set_pmc_keyed...?'