Pattern Writing Howto


It's fairly easy to add support for more protocols to l7-filter. All you need to do is add a new pattern file to /etc/l7-protocols. This directory and its subdirectories are searched (non-recursively) for pattern files. (Thus, it will find /etc/l7-protocols/http.pat and /etc/l7-protocols/protocols/http.pat, but not /etc/l7-protocols/foo/bar/http.pat.) Please consider submitting any patterns you write for inclusion into the official distribution.

File Format

Basic Format

The basic format is very simple:

  • The name of the protocol on one line
  • A regular expression defining the protocol on the next line (see regular expressions below)

The name of the file must match the name of the protocol. For example, if the protocol is “ftp”, the file must be “ftp.pat”. Lines starting with '#' and blank lines are ignored. Both the kernel and userspace versions of l7-filter will use the given regular expression. For example, vnc.pat could be:

vnc
^rfb 00[1-9]\.00[0-9]\x0a$

Defining a Separate Userspace Pattern

Sometimes it will be desirable to define a separate regular expression for the kernel and userspace versions or to pass a custom set of flags to the userspace version's regcomp/regexec (see regular expressions below for why). In this case, add either or both of these lines after the two above:

userspace pattern=<userspace pattern>
userspace flags=<regexec and/or regcomp flags, whitespace delimited>

For example, smtp.pat could be:

smtp
^220[\x09-\x0d -~]* (e?smtp|simple mail)
userspace pattern=^220[\x09-\x0d -~]* (E?SMTP|[Ss]imple [Mm]ail)
userspace flags=REG_NOSUB REG_EXTENDED
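The file format described above can be read with a few lines of code. The following is a minimal sketch in Python, not l7-filter's own parser; the function name `parse_pattern_file` and the dictionary keys are made up for illustration, and the sample file includes the protocol name line that the basic format calls for.

```python
def parse_pattern_file(text):
    """Parse the contents of a .pat file into a dict (illustrative sketch)."""
    # Lines starting with '#' and blank lines are ignored.
    lines = [ln for ln in text.splitlines()
             if ln.strip() and not ln.startswith("#")]
    if len(lines) < 2:
        raise ValueError("need a protocol name line and a pattern line")
    result = {
        "name": lines[0].strip(),
        "pattern": lines[1],
        "userspace_pattern": None,
        "userspace_flags": None,
    }
    # Optional lines after the first two are of the form key=value.
    for ln in lines[2:]:
        key, sep, value = ln.partition("=")
        if sep and key == "userspace pattern":
            result["userspace_pattern"] = value
        elif sep and key == "userspace flags":
            result["userspace_flags"] = value.split()
    return result


# Hypothetical sample file, modeled on the smtp.pat example.
SMTP_PAT = r"""# comment lines and blank lines are skipped

smtp
^220[\x09-\x0d -~]* (e?smtp|simple mail)
userspace pattern=^220[\x09-\x0d -~]* (E?SMTP|[Ss]imple [Mm]ail)
userspace flags=REG_NOSUB REG_EXTENDED
"""

parsed = parse_pattern_file(SMTP_PAT)
```

Here `parsed["name"]` is the protocol name, `parsed["pattern"]` is the expression both versions use by default, and the two `userspace_*` entries stay `None` when the optional lines are absent.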


Pattern files that are part of the official distribution need some metadata at the top for display on the web page and for the use of frontends. The top four lines should look like this:

# <Protocol name and some concise detail about the protocol>
# Pattern attributes: [attribute word]*
# Protocol groups: [group name]*
# Wiki: [link]*

Pattern attributes give information about how good the pattern is on various scales. Attribute words can be any of undermatch, overmatch, superset, subset, great, good, ok, marginal, poor, veryfast, fast, notsofast, or slow. Any number of these may be used. They are defined on the protocols page.

Protocol groups are supposed to give frontends a way to group similar protocols. Group names can be whatever you like, but should match existing names if possible. Any number may be used. More relevant groups should be listed first for sorting purposes. Group names in use are:

  • chat
  • document_retrieval
  • file
  • game
  • ietf_draft_standard
  • ietf_internet_standard
  • ietf_proposed_standard
  • ietf_rfc_documented
  • mail
  • monitoring
  • networking
  • obsolete
  • open_source
  • p2p
  • printer
  • proprietary
  • remote_access
  • secure
  • streaming_audio
  • streaming_video
  • time_synchronization
  • version_control
  • voip
  • worm
  • x_consortium_standard
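A frontend could read the four metadata lines roughly as follows. This is only a sketch: the function name `parse_metadata`, the dictionary keys, and the sample header (including its link) are hypothetical, and it assumes the four lines appear in the order given above.

```python
FIELDS = (
    ("Pattern attributes:", "attributes"),
    ("Protocol groups:", "groups"),
    ("Wiki:", "wiki"),
)


def parse_metadata(text):
    """Read the four leading comment lines of a pattern file (sketch)."""
    lines = text.splitlines()[:4]
    if len(lines) < 4 or not all(ln.startswith("#") for ln in lines):
        raise ValueError("expected four metadata comment lines")
    meta = {"description": lines[0][1:].strip()}
    for (label, key), line in zip(FIELDS, lines[1:]):
        body = line[1:].strip()
        if not body.startswith(label):
            raise ValueError(f"expected {label!r}, got {body!r}")
        # Each field holds zero or more whitespace-separated words.
        meta[key] = body[len(label):].split()
    return meta


# Hypothetical header, for illustration only.
HEADER = """# Example - a made-up protocol used to show the layout
# Pattern attributes: good fast
# Protocol groups: remote_access open_source
# Wiki: http://example.org/wiki/Example
"""

meta = parse_metadata(HEADER)
```

Because groups are stored in file order, `meta["groups"][0]` is the most relevant group for sorting.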

Wiki gives zero or more links to pages documenting the pattern and other methods of identifying the protocol on

Regular Expressions

The kernel and userspace versions of l7-filter use different regular expression libraries. Their syntax is largely the same, but there are some differences.

General Information

Because patterns frequently need to use non-printable characters, both versions of l7-filter add perl-style hex matching on top of their stock libraries. This uses \xHH notation, so to match a tab, use “\x09”. Note that regexp control characters are still control characters even when written in hex:

\x24 == $	\x28 == (
\x29 == )	\x2a == *
\x2b == +	\x2e == .
\x3f == ?	\x5b == [
\x5c == \	\x5d == ]
\x5e == ^	\x7b == { (only a control character for the userspace version)
\x7c == |	\x7d == } (only a control character for the userspace version)

Both versions of l7-filter strip out the nulls (\x00 bytes) from network data so that they can treat it as normal C strings. So (1) you can't match on nulls and (2) fields may appear shorter than expected. For example, if a protocol has a 4 byte field and any of those bytes can be null, it can appear to be any length from 0 to 4.
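Both effects can be imitated in a few lines of Python. This is a rough sketch rather than l7-filter's code: `expand_hex` and `strip_nulls` are hypothetical helpers, and the expansion step is needed here because Python's own `re` module would otherwise treat `\x2e` as a literal dot.

```python
import re


def expand_hex(pattern):
    r"""Turn each \xHH escape into the raw byte it names, before compiling."""
    return re.sub(
        rb"\\x([0-9a-fA-F]{2})",
        lambda m: bytes([int(m.group(1), 16)]),
        pattern.encode("ascii"),
    )


def strip_nulls(data):
    """Drop \x00 bytes so the data can be handled like a C string."""
    return data.replace(b"\x00", b"")


# \x2e expands to '.', which the regex engine then treats as a wildcard.
dot = expand_hex(r"a\x2eb")
wildcard_hit = re.search(dot, b"aXb") is not None

# A 4-byte field holding the value 258 appears as only 2 bytes.
field = (258).to_bytes(4, "big")
seen = strip_nulls(field)
```

After running this, `dot` is `b"a.b"`, `wildcard_hit` is true because the expanded dot matches any character, and `seen` is two bytes long.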

Kernel Version

The kernel version of l7-filter uses Henry Spencer's 1987 implementation of Bell Version 8 regular expressions (“V8 regexps”), with a few modifications, noted here. V8 regexps are likely more limited than the regexps you are used to. Notably, you cannot use bounds (“foo{3}”), character classes (“[[:punct:]]”) or backreferences.

Because this library does not have a flag for case-sensitivity, the kernel version of l7-filter is always case insensitive. Upper case in patterns is identical to lower case. (This is true even if you write an uppercase letter in hex!)

The kernel version completely ignores any lines in the pattern file after the second non-comment line.

Userspace Version

The userspace version of l7-filter uses the GNU regular expression library, so its behaviour should be more familiar. This library is documented in man 3 regcomp and man 7 regex.

If only one regular expression is specified in the pattern file (see file format above), the userspace version compiles it with the flags REG_EXTENDED | REG_ICASE | REG_NOSUB and executes it with no flags.

If the userspace pattern and userspace flags lines are given, the userspace pattern will be used instead of the first one. It will be compiled and executed with the given flags. (l7-filter will sort out which flags go to regcomp and which to regexec.)
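The choice of pattern and flags, including the cases where only one of the two optional lines is present, can be summarized as below. This is a sketch of the rules as this page states them, not the actual l7-filter source; the function name `userspace_settings` is hypothetical. The split between compile-time and run-time flags follows POSIX, where REG_NOTBOL and REG_NOTEOL are the regexec flags.

```python
DEFAULT_FLAGS = ["REG_EXTENDED", "REG_ICASE", "REG_NOSUB"]

# POSIX passes these two to regexec(); other REG_* flags go to regcomp().
REGEXEC_FLAGS = {"REG_NOTBOL", "REG_NOTEOL"}


def userspace_settings(pattern, userspace_pattern=None, userspace_flags=None):
    """Return (pattern to use, regcomp flags, regexec flags)."""
    # A userspace pattern, if given, replaces the first pattern.
    chosen = userspace_pattern if userspace_pattern is not None else pattern
    # Without a flags line, the defaults apply and regexec gets no flags.
    flags = userspace_flags if userspace_flags is not None else DEFAULT_FLAGS
    comp = [f for f in flags if f not in REGEXEC_FLAGS]
    execf = [f for f in flags if f in REGEXEC_FLAGS]
    return chosen, comp, execf


only_basic = userspace_settings("^220 (e?smtp)")
both_lines = userspace_settings("^220 (e?smtp)", "^220 (E?SMTP)",
                                ["REG_NOSUB", "REG_EXTENDED"])
only_flags = userspace_settings("^220 (e?smtp)", None,
                                ["REG_EXTENDED", "REG_NOTBOL"])
```

Note that supplying a flags line without REG_ICASE makes the userspace match case sensitive, which is why the smtp.pat example spells out both cases in its userspace pattern.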

If only the userspace pattern line is given, the userspace pattern will be compiled with REG_EXTENDED | REG_ICASE | REG_NOSUB and executed with no flags. If only the userspace flags line is given, the single regular expression will be compiled and executed with the given flags.

What l7-filter Sees and Does

If you have set up your iptables rules correctly (see the HOWTO), l7-filter sees the data going in both directions in the order that it passes through the computer. For instance, in FTP, the first thing it sees is “220 server ready”, then “USER bob”, then “331 send password”, then “PASS frogbeard”, and so on.

l7-filter can match across packets. For instance, with the above FTP example, the match is first attempted on “220 server ready”, then on “220 server readyUSER bob”, then on “220 server readyUSER bob331 send password” (yes, there should be CRLFs in there, which I omitted for clarity), so you could match it with “220.*user.*331”. At each match attempt, the regexp special character ^ will match the beginning of the stream and $ will match the end of the last packet seen so far. Because the Linux kernel's ip_conntrack module tracks connectionless UDP and ICMP sessions as “connections”, this works with them as well as TCP.

Usually the identifying characteristics of a connection are found at the beginning of that connection. For this reason, and to save processing time, l7-filter only looks at the first 10 packets or 2 kB of each connection, whichever is smaller. Any match made within this time is applied to the rest of the connection as well.
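The way matching proceeds over an accumulating stream can be sketched as follows. This is an approximation in Python, not l7-filter's implementation: the helper `first_match` is hypothetical, “2kB” is assumed to mean 2048 bytes, and the way the buffer is cut off at that limit is a guess. `re.DOTALL` is used so that “.” also matches the CRLFs in the stream.

```python
import re

MAX_PACKETS = 10
MAX_BYTES = 2048  # assumption: "2kB" taken as 2048 bytes


def first_match(pattern, packets):
    """Return the 1-based number of the packet on which pattern first matches."""
    regex = re.compile(pattern, re.IGNORECASE | re.DOTALL)
    stream = b""
    for number, payload in enumerate(packets[:MAX_PACKETS], start=1):
        # Append the new data (nulls stripped) and retry on the whole stream.
        stream = (stream + payload.replace(b"\x00", b""))[:MAX_BYTES]
        if regex.search(stream):
            return number
        if len(stream) >= MAX_BYTES:
            break  # size limit reached; stop looking
    return None  # connection stays unidentified


ftp_packets = [
    b"220 server ready\r\n",
    b"USER bob\r\n",
    b"331 send password\r\n",
    b"PASS frogbeard\r\n",
]

spanning = first_match(rb"220.*user.*331", ftp_packets)
greeting = first_match(rb"^220", ftp_packets)
```

With these packets, `spanning` is 3 because the pattern needs data from all of the first three packets, while `greeting` is 1.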

What Makes a Good Pattern

There are three general guidelines:

1) A pattern must be neither too specific nor not specific enough.

Example 1: The pattern “bear” for Bearshare is not specific enough. This pattern could match a wide variety of non-Bearshare connections. For instance, an HTTP request for would be matched.

Example 2: “220 .*ftp.*(\[.*\]|\(.*\))” for FTP is too specific. Not all servers send ()s or []s after their 220. In fact, servers are not even required to send the string “ftp” at any time, but the vast majority do. Good judgement and testing are necessary for instances such as this.

2) It should use a minimum of processing power. If it's possible to reduce the number of instances of *, + and | in your pattern, you should do so. Use the performance testing program included in the patterns package.

3) It should complete its match on the earliest packet possible. The FTP pattern could be “^220[\x09-\x0d -~]*\x0d\x0aUSER[\x09-\x0d -~]*\x0d\x0a331”, but that won't match until the third data packet. Instead, we use “^220[\x09-\x0d -~]*ftp”, which matches on the first data packet.
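A quick way to check a candidate pattern against these guidelines is to run it over sample streams. The sketch below uses Python's `re` module as a stand-in for the real libraries; the sample HTTP request and FTP greeting are invented for illustration, and the helper `matches` is hypothetical.

```python
import re


def matches(pattern, stream):
    """Case-insensitive search in which '.' also matches CR and LF."""
    return re.search(pattern, stream, re.IGNORECASE | re.DOTALL) is not None


# Guideline 1: "bear" also fires on unrelated traffic (made-up HTTP request).
http_request = "GET /polar-bear.html HTTP/1.1\r\nHost: www.example.com\r\n\r\n"
too_loose = matches(r"bear", http_request)

# Guideline 1: requiring ()s or []s misses greetings that lack them.
plain_greeting = "220 ftp.example.com FTP server ready\r\n"
strict_pat = r"220 .*ftp.*(\[.*\]|\(.*\))"
misses_plain = not matches(strict_pat, plain_greeting)

# Guideline 3: see how many packets each pattern needs before it matches.
packets = [plain_greeting, "USER bob\r\n", "331 send password\r\n"]
long_pat = r"^220[\x09-\x0d -~]*\x0d\x0aUSER[\x09-\x0d -~]*\x0d\x0a331"
short_pat = r"^220[\x09-\x0d -~]*ftp"
long_needs = [matches(long_pat, "".join(packets[:n])) for n in (1, 2, 3)]
short_needs = [matches(short_pat, "".join(packets[:n])) for n in (1, 2, 3)]
```

Here `too_loose` and `misses_plain` both come out true, `long_needs` is `[False, False, True]`, and `short_needs` is `[True, True, True]`. Timing results from Python say little about the kernel library, so performance should still be measured with the testing program from the patterns package.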

Miscellaneous Tips

[\x09-\x0d -~] == printable characters, including whitespace
[\x09-\x0d ] == any whitespace
[!-~] == non-whitespace printable characters
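These three bracket expressions can be checked against the ASCII range. The sketch below uses Python's `re` module, which understands `\xHH` inside brackets; the helper `members` is hypothetical.

```python
import re
import string

PRINTABLE = r"[\x09-\x0d -~]"
WHITESPACE = r"[\x09-\x0d ]"
NON_WS = r"[!-~]"


def members(char_class):
    """Return the set of ASCII characters (0-127) that char_class matches."""
    return {chr(c) for c in range(128) if re.fullmatch(char_class, chr(c))}


printable = members(PRINTABLE)
whitespace = members(WHITESPACE)
non_ws = members(NON_WS)
```

The resulting sets agree with Python's `string.printable` and `string.whitespace`, and `non_ws` is exactly the printable set minus the whitespace set.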

Recommended Procedure for Writing Patterns

  • Find and read the spec for the protocol you wish to match. If it's an Internet standard, RFCs are a good place to start, although not all standards are RFCs. If it is a proprietary protocol, it is likely that someone has written a reverse-engineered spec for it. Do a general web search to find it. Skipping this step is a good way to write patterns that are overly specific!
  • Use something like Wireshark (formerly known as Ethereal) to watch packets of this protocol go by in a typical session of its use. (If you failed to find a spec for your protocol, but Wireshark can parse it, reading the Wireshark source code may also be worth your time.)
  • Write a pattern that will reliably match one of the first few packets that are sent in your protocol. Test it. Test its performance.
  • Send your pattern to l7-filter-developers{/-\T}lists*sf*net for it to be incorporated into the official pattern definitions (you must subscribe first).

Howto Send a Packet Dump to the Developer

If you do not feel that you are able to do all of the above yourself, you may want to send some packets you have captured to the developers so that others can do the rest. In order for this to be useful, please follow these guidelines:

  • If you have never done anything like this before, use Wireshark. It's easy to use and available for GNU/Linux, Mac and Windows (and FreeBSD, HP-UX, NetBSD, Solaris…). Use File→Save to save the captured packets.
  • Make sure that you start capturing packets before the application that you are testing has started using the network. l7-filter looks at the opening packets of a connection. If these are not present in the packet dump, it is useless.
  • If it makes sense for the protocol in question, send a recognizable text string so that the relevant connection can be found in the packet dump. For instance, if testing an instant messenger, send a message with “hello hello hello.”
  • Along with your capture, send us anything that could be helpful in picking out the relevant data. For example, this could include the server's IP address, what network operations you performed, the version numbers of all software used, any strings you expect to appear in the packets (such as instant messenger text, e-mail addresses, gaming handles, etc.), etc.
  • Try not to capture an excessive number of packets. In particular:
    • Avoid having other programs use the network during your capture. Assuming their traffic is recognizable, the excess packets can be filtered out, but it's annoying.
    • Avoid sending captures that have many thousands of packets from the same connection. All but the first few are useless.
    • However, if you are not sure when the application opens connections, or if it opens many simultaneous connections, it might be necessary to send a large number of packets. This is ok.
  • Send the packets in libpcap format or something else that Wireshark can read. Do not:
    • send only a text hexdump of the packets. This is unnecessarily hard to read.
    • send only the data portion of the packets. The TCP headers in particular are essential for finding streams. You may anonymize addresses if necessary, but try to avoid it.
    • compress the captured packets with anything other than gzip or bzip2. In fact, no compression is needed unless the file is very large.

If you aren't sure how to follow these guidelines, try your best and send the result to us. If it's wrong, we'll be happy to tell you how to fix it.

Except where otherwise noted, content on this wiki is licensed under Creative Commons Attribution-ShareAlike 1.0