Mozilla Skin

Lusql v09 usage

From CISTI-ICIST LAB WIKI

LuSql Home

Contents

Quick Start

LuSql can index multiple fields using SQL queries that involve complex joins, as well as adding additional information to the Lucene Document through a plugin mechanism. Please refer to the user manual for a tutorial explaining these use cases and for a more complete review of LuSql's features.

An example complex query from the tutorial used to populate LuSql:

select  Publisher.name as pub, Journal.title as jo, Journal.issn, Volume.number as vol,\
  Volume.coverYear as year, Issue.number as iss, Article.id as id, Article.title as ti,\
  Article.abstract as ab, Article.startPage as startPage, Article.endPage as endPage\
  from Publisher, Journal, Volume, Issue, Article \
  where Publisher.id = Journal.publisherId and Journal.id = Volume.journalId \
  and Volume.id = Issue.volumeId and Issue.id = Article.issueId;


Indexing a Single Table with a simple query

In this simple example, we will show how to use LuSql to index the fields from a single table.

Given the following MySQL table:

mysql> describe Article;
  +--------------+-------------+------+-----+---------+-------+
  | Field        | Type        | Null | Key | Default | Extra |
  +--------------+-------------+------+-----+---------+-------+
  | id           | int(11)     | YES  |     | NULL    |       | 
  | articleTitle | varchar(64) | YES  |     | NULL    |       | 
  | abstract     | text        | YES  |     | NULL    |       | 
  | startPage    | varchar(16) | YES  |     | NULL    |       | 
  | journalTitle | text        | YES  |     | NULL    |       | 
  | issn         | char(12)    | YES  |     | NULL    |       | 
  | volumeNumber | varchar(32) | YES  |     | NULL    |       | 
  | volumeYear   | char(16)    | YES  | MUL | NULL    |       | 
  | issueNumber  | varchar(32) | YES  |     | NULL    |       | 
  +--------------+-------------+------+-----+---------+-------+
 mysql> select count(id) from Article;
  select count(id) from Article;
  +-----------+
  | count(id) |
  +-----------+
  |   6409484 | 
  +-----------+
  mysql> select count(id)  from Article where volumeYear > 2007;
  +-----------+
  | count(id) |
  +-----------+
  |       218 | 
  +-----------+

Let's build an index the first 100 articles (all the fields of all the returned records) more recent than 2007 using the following invocation of LuSql:

java -jar lusql.jar \
 -c "jdbc:mysql://dbhost/dbname?user=ID&password=PASS"\ 
 -q "select * from Article where  volumeYear >  2007" \
 -n  100  \
 -l tutorial-1 \
 -v 

Note that the LuSql jar contains all the Apache Commons class libraries, the Lucene 2.3.1 jar, and the JDBC driver for MySQL. If you need to communicate with a different DBMS or JDBC driver, or use more recent versions of the jar files, you need to put all of these in your CLASSPATH (for more information about how to change the JDBC driver, see the user manual), and invoke LuSql in the following fashion:

java ca.nrc.cisti.lusql.LuSqlMain \
 -c "jdbc:mysql://dbhost/dbname?user=ID&password=PASS"\ 
 -q "select * from Article where  volumeYear >  2007" \
 -n  100  \
 -l tutorial-1 \
 -v  

Explanation of command line arguments

  • -c conn : JDBC connect string for the database
  • -q "sql" : this is the SQL query to run, in quotes. It can be any valid (for the database in use) SQL. All fields returned by this query are indexed. The default field parameters are: Field.Index.TOKENIZED, Field.Store.YES, Field.TermVector.YES. You can change these defaults - either globally or on a per-field basis - using the -i or -I option. See the Command Line Arguments below or the user manual for more information.
  • -n integer : the maximum number of records to use from the database
  • -l luceneIndexName : the name of the Lucene index to write to
  • -v : produce verbose output

The above command produces the following output:

Using sql:[select * from Article where volumeYear > 2007]
Using Analyzer:[org.apache.lucene.analysis.standard.StandardAnalyzer]
Using Stop Word FileName:[null]
Using Properties FileName:[null]
Using DB driver name:[com.mysql.jdbc.Driver]
Using DB URL:[jdbc:mysql://dbhost/dbname?user=USERID\&password=PASSWORD]
Using Lucene index:tutorial_1
Using Lucene index RAMBUFFER MBs:48.0
Using multithreaded:true
Using Test:false
Using Field parameters:211
Using setting DB fetchsize=0 (see -m)
Using Num documents to add:100
Using Lucene index directory:tutorial_1
Opening Lucene index: tutorial_1
Opening MySQL connection
Querying:select * from Article where volumeYear > 2007
Indexing
Threading: Queue size=60
Threading: # threads=20

Number of records added= 100

Optimizing index
  Closing index
  Optimizing index time: 0 seconds
Closing JDBC: result set
Closing JDBC: statement
Closing JDBC: connection
*********** Elapsed time: 7 seconds

Lucene index produced by LuSql examined using Luke:

Luke overview panel

Image:Lusql 0.9 luke 1 1.png

Luke Documents panel

Image:Lusql 0.9 luke 1 2.png

Luke Search panel

Image:Lusql 0.9 luke 1 3.png

Command Line Arguments

usage: java -jar lusql OR java ca.nrc.cisti.lusql.LuSqlMain    [-A] [-a <arg>] -c <arg> [-d <arg>] [-f <arg>] [-g <arg>] [-I <arg>] [-i <arg>] -l <arg> [-M <arg>] [-m]
      [-n <arg>] [-N <arg>] [-p <arg>] [-Q] -q <arg> [-r <arg>] [-s <arg>] [-T] [-t] [-v]
-A         Append to existing Lucene index.
-a <arg>   Full name class implementing Lucene Analyzer; Default: org.apache.lucene.analysis.standard.StandardAnalyzer
-c <arg>   JDBC connection URL: REQUIRED
-d <arg>   Full name of DB driver class (should be in CLASSPATH); Default: com.mysql.jdbc.Driver
-f <arg>   Full name class implementing DocumentFilter; Default: ca.nrc.cisti.lusql.core.NullFilter (does nothing).
           This is applied before each Lucene Document is added to the Index. If it returns null, nothing is added
-g <arg>   Set global Document field and value. Format: "field=value" or "NNN:field=value" (See -i for NNN values)
-I <arg>   Global field index parameters. This sets all the fields parameters to this one set. Format: NNN. See -i for
           NNN values. Only one of -i or -I can be used at one time.
-i <arg>   Field index parameters.
           Format: "NNN NNN NNN".One set per field in SQL, and in same order as in SQL. Used only if you want to overide the
           defaults (below). See for more information Field.Index, Field.Store, Field.TermVector inorg.apache.lucene.document.Field
           http://lucene.apache.org/java/2_2_0/api/org/apache/lucene/document/Field.html
           Default: NNN=211
           Field Index Parameter values:
           Index: Default:TOKENIZED
            0:NO
            1:NO_NORMS
            2:TOKENIZED
            3:UN_TOKENIZED
           Store: Default:YES
            0:NO
            1:YES
            2:COMPRESS
           Term vector: Default:NO
            0:NO
            1:YES
            2:WITH_OFFSETS
            3:WITH_POSITIONS
            4:WITH_POSITIONS_OFFSETS
-l <arg>   Lucene directory. Default: index
-M <arg>   Changes the meta replacement string for the -Q command line parameters. Default: @
-m         Turns off need get around MySql driver-caused OutOfMemory problem in large queries. Sets
           Statement.setFetchSize(Integer.MIN_VALUE)
           See http://benjchristensen.wordpress.com/2008/05/27/mysql-jdbc-memory-usage-on-large-resultset
-n <arg>   Number of documents to add. If unset all records from query are used.
-N <arg>   Number of thread for multithreading. Defaults: Runtime.getRuntime().availableProcessors()) *2.5. This
           machine this is:
-p <arg>   Properties file
-Q         Subquery in the form "field|NNN|sql" or "field|NNN NNN...|sql" or "field|sql"  (See -i for NNN values). Note
           that you can have multiple -Qs.
-q <arg>   Primary SQL query (in double quotes). REQUIRED
-r <arg>   LuceneRAMBufferSizeInMBs: IndexWriter.setRAMBufferSizeMB(). Default: 48.0
-s <arg>   Name of stop word file for Lucene to use (relative or full path)
-T         Turn off multithreading. Note that multithreading does not guarantee the ordering of documents. If you want
           the order of Lucene documents to match the ordering of DB records generated by the SQL query, turn-off multithreading
-t         Test mode. Does not open up Lucene index. Prints (-n) records from SQL query
-v         Verbose mode


LuSql Home