Nicolas Martyanoff — Brain dump About

Counting lines with Common Lisp

A good line counting program has two features: it only counts non-empty lines to get a fair estimate of the size of a project, and it groups line counts by file type to help see immediately which languages are used.

A long time ago I got frustrated with two well known line counters. Sloccount spits out multiple strange Perl warnings about locales, and most of the output is a copyright notice and some absurd cost estimations. Cloc has fourteen Perl packages as dependencies. Writing a simple line counter is an interesting exercise; at the time I was discovering Common Lisp, so I wrote my own version.

I made a few changes years after years, but most of the code stayed the same. I thought it would be interesting to revisit this program and present it part by part as a demonstration of how you can use Common Lisp to solve a simple problem.

We are going to write the program bottom-up, starting with the smallest building blocks and progressively building upon them.

The program

The program is written in Common Lisp. The most convenient way of storing and executing it is a single executable file stored in a directory being part of the PATH environment variable. In my case, the script will be called locc, for “line of code counter”, and will be stored in the ~/bin directory.

We start the file with a shebang indicating how to execute the file. We use the SBCL implementation because it is stable and actively developed. It also makes it easy to execute a simple file:

#!/usr/bin/sbcl --script

Finding files

Our line counter will operate on directories, so it has to be able to list files in them. Path handling functions are very disconcerting at first. Common Lisp was designed a long time ago, and operating systems were different at the time. Let us dig in!

First let us write a simple function to check if a pathname object is a directory pathname:

(defun directory-path-p (path)
  "Return T if PATH is a directory or NIL else."
  (declare (type (or pathname string) path))
  (and (not (pathname-name path))
       (not (pathname-type path))))

Then we write a function to identify hidden files and directories since we do not want to include them:

(defun hidden-path-p (path)
  "Return T if PATH is a hidden file or directory or NIL else."
  (declare (type pathname path))
  (let ((name (if (directory-path-p path)
                  (car (last (pathname-directory path)))
                  (file-namestring path))))
    (and (plusp (length name))
         (eq (char name 0) #\.))))

As you can see we use DIRECTORY-PATH-P to extract the basename of the path, then check if it starts with a full stop (only if it is not empty of course).

Finally we can write the function to actually list files in a directory recursively:

(defun directory-path (path)
  "If PATH is a directory pathname, return it as it is. If it is a file
pathname or a string, transform it into a directory pathname."
  (declare (type (or pathname string) path))
  (if (directory-path-p path)
      path
      (make-pathname :directory (append (or (pathname-directory path)
                                            (list :relative))
                                        (list (file-namestring path)))
                     :name nil :type nil :defaults path)))

(defun find-files (path)
  "Return a list of all files contained in the directory at PATH or any of its
subdirectories."
  (declare (type (or pathname string) path))
  (flet ((list-directory (path)
           (directory
            (make-pathname :defaults (directory-path path)
                           :type :wild :name :wild))))
    (let ((paths nil)
          (children (list-directory (directory-path path))))
      (dolist (child children paths)
        (unless (hidden-path-p child)
          (if (directory-path-p child)
              (setf paths (append paths (find-files child)))
              (push child paths)))))))

We use the DIRECTORY standard function with a path containing a wildcard component to list the files in a directory, and do so recursively.

Counting lines

Now that we have files, we can start counting lines. Let us first write a function to count the number of non-empty lines in a file.

(defun count-file-lines (path)
  "Count the number of non-empty lines in the file at PATH. A line is empty if
it only contains space or tabulation characters."
  (declare (type pathname path))
  (with-open-file (stream path :element-type '(unsigned-byte 8))
    (do ((nb-lines 0)
         (blank-line t))
        (nil)
      (let ((octet (read-byte stream nil)))
        (cond
          ((or (null octet) (eq octet #.(char-code #\Newline)))
           (unless blank-line
             (incf nb-lines))
           (when (null octet)
             (return-from count-file-lines nb-lines))
           (setf blank-line t))
          ((and (/= octet #.(char-code #\Space))
                (/= octet #.(char-code #\Tab)))
           (setf blank-line nil)))))))

We open the file to obtain a steam of octets, and read it octet by octet, keeping track of whether the current line is blank or not. Note how we make sure to count the last line even if it does not end with a newline character.

Reading a file one octet could be a disaster for performances. Fortunately SBCL file streams are buffered, something we can easily check by running our program with strace -e trace=openat,read. We would not rely on this property if we wanted our program to work on multiple Common Lisp implementations, but this is a non issue here.

Identifying the file type

Counting lines is one thing, but we need to identify their content. The simplest way is to do so based on the file extension.

Obviously we will want to ignore various files which are known not to contain text content, so we start by building a hash table containing these extensions:

(defparameter *ignored-extensions*
  (let ((extensions '("a" "bin" "bmp" "cab" "db" "elc" "exe" "gif" "gz"
                      "jar" "jpeg" "jpg" "o" "pcap" "pdf" "png" "ps" "rar"
                      "svg" "tar" "tgz" "tiff" "zip"))
        (table (make-hash-table :test 'equal)))
    (dolist (extension extensions table)
      (setf (gethash extension table) t)))
  "A hash table containing all file extensions to ignore.")

We then create another hash table to associate a type symbol to each known file extension:

(defparameter *extension-types*
  (let ((pairs '(("asm" . assembly) ("s" . assembly)
                 ("adoc" . asciidoc)
                 ("awk" . awk)
                 ("h" . c) ("c" . c)
                 ("hpp" . cpp) ("cpp" . cpp) ("cc" . cpp)
                 ("css" . css)
                 ("el" . elisp)
                 ("erl" . erlang)
                 ("go" . go)
                 ("html" . html) ("htm" . html)
                 ("ini" . ini)
                 ("hs" . haskell)
                 ("java" . java)
                 ("js" . javascript)
                 ("json" . json)
                 ("tex" . latex)
                 ("lex" . lex)
                 ("lisp" . lisp)
                 ("mkd" . markdown) ("md" . markdown)
                 ("rb" . ruby)
                 ("pl" . perl) ("pm" . perl)
                 ("php" . php)
                 ("py" . python)
                 ("sed" . sed)
                 ("sh" . shell) ("bash" . shell) ("csh" . shell)
                 ("zsh" . shell) ("ksh" . shell)
                 ("scm" . scheme)
                 ("sgml" . sgml)
                 ("sql" . sql)
                 ("texi" . texinfo)
                 ("texinfo" . texinfo)
                 ("vim" . vim)
                 ("xml" . xml) ("dtd" . xml) ("xsd" . xml)
                 ("yaml" . yaml) ("yml" . yaml)
                 ("y" . yacc)))
        (table (make-hash-table :test 'equal)))
    (dolist (pair pairs table)
      (setf (gethash (car pair) table) (cdr pair))))
  "A hash table containing a symbol identifying the type of a file for
  each known file extension.")

With these hash tables, the function identifying the type of a file is trivial:

(defun identify-file-type (path)
  "Return a symbol identifying the type of the file at PATH, or UNKNOWN if the
file extension is not known."
  (declare (type pathname path))
  (let ((extension (pathname-type path)))
    (unless (gethash extension *ignored-extensions*)
      (gethash extension *extension-types* 'unknown))))

Collecting file information

Up to this point, we wrote several functions without connecting them. But we now have all the building blocks we need. Let us use them to accumulate information about the files in a list of directories.

(defun collect-line-counts (directory-paths)
  "Collect the line count of all files in the directories located at one of the
paths in DIRECTORY-PATHS and return them grouped by file type as an
association list."
  (declare (type list directory-paths))
  (let ((line-counts (make-hash-table)))
    (dolist (directory-path directory-paths)
      (dolist (path (find-files directory-path))
        (handler-case
            (let ((type (identify-file-type path)))
              (when (and type (not (eq type 'unknown)))
                (let ((nb-lines (count-file-lines path)))
                  (incf (gethash type line-counts 0) nb-lines))))
          (error (condition)
            (format *error-output* "~&error while reading ~A: ~A~%"
                    path condition)))))
    (let ((line-count-list nil))
      (maphash (lambda (type nb-lines)
                 (push (cons type nb-lines) line-count-list))
               line-counts)
      line-count-list)))

We get to use all previous functions. We iterate through directory paths to find non-hidden files using FIND-FILES, then we use IDENTIFY-FILE-TYPE to obtain a type symbol and COUNT-FILE-LINES to count the number of non-empty lines in each file. Results are accumulated by file type in the LINE-COUNTS hash table. During this process, we handle errors that may occur while reading files with a message on the error output. Finally we transform the hash table into an association list and return it.

Presenting results

Executing all these functions in the Lisp REPL is quite practical during developement, but the default pretty printer is not really what you expect for the final command line tool:

REPL output

So let us write a function to format this association list:

(defun format-line-counts (line-counts &key (stream *standard-output*))
  "Format the line counts in the LINE-COUNTS association list to STREAM."
  (declare (type list line-counts)
           (type stream stream))
  (dolist (entry (sort line-counts '> :key 'cdr))
    (let ((type (car entry))
          (nb-lines (cdr entry)))
      (format stream "~12A  ~8@A~%"
              (string-downcase (symbol-name type)) nb-lines))))

We print the list sorted in descending order, meaning that the file type with the most number of lines comes first. Of course the output is padded to make sure numbers are all aligned.

Finalizing the program

The only thing left to do is the entry point of the script. This is the only place where we need to call a non-standard function in order to access command line arguments. If no command line argument were passed to the script, we look for files in the current directory.

(let ((paths (or (cdr sb-ext:*posix-argv*) '("."))))
  (format-line-counts
   (collect-line-counts paths)))

Shell output

Much better!

There does not seem to be any performance issue: on the Linux kernel source tree, this program is almost 11 times faster than sloccount. It would be interesting to profile the program to make sure IO is the bottleneck and improve inefficient parts, but this code is fast enough for my needs.

As you can see, it is not that hard to use Common Lisp to solve problems.

Share the word!

Liked my article? Follow me on Twitter or on Mastodon to see what I'm up to.