Monday, August 18, 2008

Running Lucli in Batch mode

Lucli is an interactive command-line tool that provides functionality similar to Luke, i.e., the ability to look inside a Lucene index. When an index is large and sitting on a remote machine, and you don't need the full power of Luke to query it, it is often more convenient to examine the index with Lucli than to copy it over to your local machine and open it in Luke. There are, of course, other ways to run Luke remotely, such as X-forwarding or VNC, but these require setup on the server side, which may not be feasible in all cases.

Yet another good use for Lucli is to have it called from shell scripts. After all, if an index is worth the effort of examining manually, it's probably worth scripting that work so it can happen without human intervention. However, because Lucli is an interactive tool, this is not possible with the version (2.4 at the time of this writing) available in the Lucene SVN repository. This post describes what I had to do to build this functionality into Lucli, and to add a new custom method that is (perhaps) unique to us.

Adding the --file option

I added a -f (or, GNU style, --file) option that Lucli recognizes on the command line; when present, it creates a ConsoleReader object that reads from the file named after the option. The script File object is set up in the parseArgs() method, shown below.

    private void parseArgs(String[] args) {
      String errorMessage = null;
      if (args.length > 0) {
        if (args.length == 2 &&
            ("--file".equals(args[0]) || "-f".equals(args[0]))) {
          File scriptfile = new File(args[1]);
          if (scriptfile.exists() && scriptfile.canExecute()) {
            this.script = scriptfile;
            return;
          } else {
            errorMessage = "File:" + args[1] + " does not exist or is not executable";
          }
        }
        usage(errorMessage);
        System.exit(1);
      }
    }

And the Lucli constructor would use a null check on the script File object to determine if it should open a no-args ConsoleReader, or a special one that reads from the script file. Like this:

        ConsoleReader cr = null;
        if (script != null) {
          cr = new ConsoleReader(new FileInputStream(script), new PrintWriter(System.out));
        } else {
          cr = new ConsoleReader();
        }
        ...

I also updated the usage() method to report that --file filename is an optional but supported parameter.

    private void usage(String errorMessage) {
        message("Usage: lucli.Lucli [--file script_file]");
        if (errorMessage != null) {
          message("(" + errorMessage + ")");
        }
    }

Finally, I modified the invocation of lucli.Lucli in the shell script to pass command line parameters through.

#!/bin/bash
...
$JAVA_HOME/bin/java -Xmx${LUCLI_MEMORY} -cp $CLASSPATH lucli.Lucli $*

Usage: example script to find the number of records

To find the number of records in an index, I would run Lucli interactively from the command line as follows:

sujit@sirocco:~$ ./run.sh 
Lucene CLI. Using directory 'index'. Type 'help' for instructions.
lucli> index /path/to/my/index
Lucene CLI. Using directory '/path/to/my/index'. Type 'help' for instructions.
Index has 6626 documents 
All Fields:[...]
Indexed Fields:[...]
lucli> quit
sujit@sirocco:~$ 

So to run this in batch mode, we create a file (call it /tmp/script1.lucli) like so:

index /path/to/my/index
quit

And then, to get the number of records, we run the following command pipeline. Obviously, this command can now be embedded in another script that does something with the record count.

sujit@sirocco:~$ ./run.sh --file /tmp/script1.lucli | grep "Index has" |\
  gawk '{print $3}'
6626
sujit@sirocco:~$
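As a sketch of that idea, here is a hypothetical wrapper script. The run.sh path and the sample output are assumptions taken from the examples above; the extraction pipeline is the same grep/awk combination, demonstrated here on a captured sample line so the script is self-contained:

```shell
#!/bin/bash
# Hypothetical wrapper around the batch invocation above.
extract_count() {
  # Pull the third field from a line like "Index has 6626 documents"
  grep "Index has" | awk '{print $3}'
}

# In practice the input would come from:
#   ./run.sh --file /tmp/script1.lucli
# Here we demonstrate on a captured sample of that output instead:
sample_output="Lucene CLI. Using directory '/path/to/my/index'. Type 'help' for instructions.
Index has 6626 documents"
count=$(printf '%s\n' "$sample_output" | extract_count)

# Act on the count, e.g. abort a downstream job if the index looks empty
if [ "${count:-0}" -lt 1 ]; then
  echo "Index looks empty" >&2
  exit 1
fi
echo "Index contains ${count} records"
```

The same pattern extends naturally: any Lucli output line with a stable prefix can be grepped out and fed to the next stage of a batch job.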

New method list([fieldname1;fieldname2;...])

Another question I often have to answer is whether a particular record is in the index, or what new records are available in a freshly built index. Because it is so easy to write this sort of ad-hoc code, I have both a Python and a Java version that I use interchangeably, depending on whether PyLucene is set up on the target environment. However, this seemed like a good time to standardize on one tool, so I added a "list" method to Lucli. You can see it in the help output:

sujit@sirocco:~$ ./run.sh 
lucli> help
 count: Return the number of hits for a search. Example: count foo
 explain: Explanation that describes how the document scored against query. 
          Example: explain foo
 help: Display help about commands
 index: Choose a different lucene index. Example index my_index
 info: Display info about the current Lucene index. Example: info
 list: Lists value of field list (field1;field2;...) or all fields for all 
          records in the selected index
 optimize: Optimize the current index
 quit: Quit/exit the program
 search: Search the current index. Example: search foo
 terms: Show the first 100 terms in this index. Supply a field name to only 
          show terms in a specific field. Example: terms
 tokens: Does a search and shows the top 10 tokens for each document. Verbose! 
          Example: tokens foo
lucli> 

For this, I had to add a new method list() in LuceneMethods, and add code to Lucli to trigger this method when it encounters a list call. This consists of an addCommand("list", LIST, ...) call and a case LIST in the switch in the Lucli.handleCommand() method. The case statement is shown below:

                        case LIST:
                          for (int ii = 1; ii < words.length; ii++) {
                            query += words[ii] + ";";
                          }
                          luceneMethods.list(query);
                          break;

And here are the contents of the list() method in the LuceneMethods class. As you can see, it's fairly straightforward: if field names are given, it returns just those field values; if no field names are given, it returns all fields. Any filtering is done with Unix tools. That is acceptable here, since these are really ad-hoc usages and thus unlikely to impact performance.

  /** Lists out named fields from the index (all records)
   * @throws IOException
   */
  public void list(String query) throws IOException {
    String[] fieldNames = null;
    if ("".equals(query.trim())) {
      getFieldInfo();
      fieldNames = new String[fields.size()];
      for (int i = 0; i < fieldNames.length; i++) {
        fieldNames[i] = (String) fields.get(i);
      }
    } else {
      fieldNames = query.split(";");
    }
    IndexReader indexReader = IndexReader.open(indexName);
    int maxDoc = indexReader.maxDoc();
    for (int i = 0; i < maxDoc; i++) {
      Document doc = indexReader.document(i);
      StringBuffer buf = new StringBuffer();
      for (int j = 0; j < fieldNames.length; j++) {
        if (j > 0) {
          buf.append(";");
        }
        buf.append(doc.get(fieldNames[j]));
      }
      message(buf.toString());
    }
    indexReader.close();
  }

Usage: example script to find records with a specific URL pattern

As mentioned before, we do our pattern matching with Unix tools, which keeps the Lucli method simple and more generic. The example use case is finding the records (identified by title) that match a particular URL pattern. As before, we can experiment in the interactive shell, then build our script file (script2.lucli) like so:

index /path/to/my/index
list title;url
quit

And call it like so:

sujit@sirocco:~$ ./run.sh --file /tmp/script2.lucli | \
  gawk -F';' --source '{if ($2 ~ /my_url_pattern/) printf("%s %s\n", $1, $2)}'
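The list output also serves the "what's new in a freshly built index?" question mentioned earlier. A minimal sketch of that idea: sort the title listings from the old and new indexes and compare them with comm -13, which prints lines unique to the second input. The listings below are inline stand-ins; in practice each would come from something like ./run.sh --file with a script containing "list title", piped through sort:

```shell
#!/bin/bash
# Sketch: find titles present only in a freshly built index by diffing
# two sorted "list title" outputs. The sample listings are stand-ins for
#   ./run.sh --file /tmp/titles.lucli | sort
old=$(mktemp)
new=$(mktemp)
printf 'Title A\nTitle B\n' > "$old"
printf 'Title A\nTitle B\nTitle C\n' > "$new"

# comm -13 suppresses lines unique to the first file (-1) and lines
# common to both (-3), leaving only lines unique to the second (new) file
new_only=$(comm -13 "$old" "$new")
rm -f "$old" "$new"
echo "$new_only"
```

Note that comm requires both inputs to be sorted, which is why the listings are piped through sort first.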

The full patch file

You can patch the stock Lucli with the output of "svn diff" from the src/java/lucli subdirectory. I had to do some hacks (listed below) to get Lucli to compile, which won't be captured in the "svn diff" output, so I am not publishing a diff of the entire module.

Index: LuceneMethods.java
===================================================================
--- LuceneMethods.java (revision 683771)
+++ LuceneMethods.java (working copy)
@@ -352,6 +352,36 @@
     indexReader.close();
   }
 
+  /** Lists out named fields from the index (all records)
+   * @throws IOException
+   */
+  public void list(String query) throws IOException {
+    String[] fieldNames = null;
+    if ("".equals(query.trim())) {
+      getFieldInfo();
+      fieldNames = new String[fields.size()];
+      for (int i = 0; i < fieldNames.length; i++) {
+        fieldNames[i] = (String) fields.get(i);
+      }
+    } else {
+      fieldNames = query.split(";");
+    }
+    IndexReader indexReader = IndexReader.open(indexName);
+    int maxDoc = indexReader.maxDoc();
+    for (int i = 0; i < maxDoc; i++) {
+      Document doc = indexReader.document(i);
+      StringBuffer buf = new StringBuffer();
+      for (int j = 0; j < fieldNames.length; j++) {
+        if (j > 0) {
+          buf.append(";");
+        }
+        buf.append(doc.get(fieldNames[j]));
+      }
+      message(buf.toString());
+    }
+    indexReader.close();
+  }
+  
   /** Sort Hashtable values
    * @param h the hashtable we're sorting
    * from http://developer.java.sun.com/developer/qow/archive/170/index.jsp
Index: Lucli.java
===================================================================
--- Lucli.java (revision 683771)
+++ Lucli.java (working copy)
@@ -55,7 +55,9 @@
  */
 
 import java.io.File;
+import java.io.FileInputStream;
 import java.io.IOException;
+import java.io.PrintWriter;
 import java.io.UnsupportedEncodingException;
 import java.util.Iterator;
 import java.util.Set;
@@ -95,11 +97,13 @@
  final static int INDEX = 7;
  final static int TOKENS = 8;
  final static int EXPLAIN = 9;
+ final static int LIST = 10;
 
  String historyFile;
  TreeMap commandMap = new TreeMap();
  LuceneMethods luceneMethods; //current cli class we're using
  boolean enableReadline; //false: use plain java. True: shared library readline
+ File script = null;
 
  /**
   Main entry point. The first argument can be a filename with an
@@ -124,11 +128,17 @@
   addCommand("index", INDEX, "Choose a different lucene index. Example index my_index", 1);
   addCommand("tokens", TOKENS, "Does a search and shows the top 10 tokens for each document. Verbose! Example: tokens foo", 1);
   addCommand("explain", EXPLAIN, "Explanation that describes how the document scored against query. Example: explain foo", 1);
-
+  addCommand("list", LIST, "Lists value of field list (field1;field2;...) or all fields for all records in the selected index");
+  
   //parse command line arguments
   parseArgs(args);
 
-  ConsoleReader cr = new ConsoleReader();
+  ConsoleReader cr = null;
+  if (script != null) {
+    cr = new ConsoleReader(new FileInputStream(script), new PrintWriter(System.out));
+  } else {
+    cr = new ConsoleReader();
+  }
   //Readline.readHistoryFile(fullPath);
   cr.setHistory(new History(new File(historyFile)));
   
@@ -234,6 +244,12 @@
     }
     luceneMethods.search(query, true, false, cr);
     break;
+   case LIST:
+     for (int ii = 1; ii < words.length; ii++) {
+       query += words[ii] + ";";
+     }
+     luceneMethods.list(query);
+     break;
    case HELP:
     help();
     break;
@@ -315,18 +331,31 @@
  }
 
  /*
-  * Parse command line arguments (currently none)
+  * Only parse command line argument --file (or -f).
   */
  private void parseArgs(String[] args) {
+   String errorMessage = null;
   if (args.length > 0) {
-   usage();
+    if (args.length == 2 && 
+        ("--file".equals(args[0]) || "-f".equals(args[0]))) {
+      File scriptfile = new File(args[1]);
+      if (scriptfile.exists() && scriptfile.canExecute()) {
+        this.script = scriptfile;
+        return;
+      } else {
+        errorMessage = "File:" + args[1] + " does not exist or is not executable";
+      }
+    }
+   usage(errorMessage);
    System.exit(1);
   }
  }
 
- private void usage() {
-  message("Usage: lucli.Lucli");
-  message("(currently, no parameters are supported)");
+ private void usage(String errorMessage) {
+  message("Usage: lucli.Lucli [--file script_file]");
+  if (errorMessage != null) {
+    message("(" + errorMessage + ")");
+  }
  }
 
  private class Command {

To get Lucli to compile locally, I had to make the following changes to build.xml.

  • Change the reference to ../contrib-build.xml in build.xml to contrib-build.xml, and copy contrib-build.xml from svn to my project root directory.
  • Change the reference to ../common-build.xml in contrib-build.xml to common-build.xml, and copy common-build.xml from svn to my project root directory.
  • Change the location of lucene.jar to ${project.root}/lib and copy an existing Lucene jar there, to prevent the "build-lucene" target from firing and producing errors.

Conclusion

As you can see, extending Lucli is quite easy. There are just two classes, and the code is easy to read. Because it does not have too much functionality, when faced with the task of modifying it to suit your own needs, it may seem easier to just go ahead and write your own little subset. The reasons I extended Lucli rather than doing that are:

  • this gives me all the other cool stuff that Lucli already has without my writing code for it,
  • my changes may potentially benefit a larger number of people, and
  • what goes around comes around, and one day I will benefit from somebody else's extension to Lucli. To be fair, I have already benefited a lot from the code and information contributions by others over the years.
