Lusql v09 usage
From CISTI-ICIST LAB WIKI
Contents |
Quick Start
LuSql can index multiple fields using SQL queries that involve complex joins, as well as adding additional information to the Lucene Document through a plugin mechanism. Please refer to the user manual for a tutorial explaining these use cases and for a more complete review of LuSql's features.
An example complex query from the tutorial used to populate LuSql:
select Publisher.name as pub, Journal.title as jo, Journal.issn, Volume.number as vol,\ Volume.coverYear as year, Issue.number as iss, Article.id as id, Article.title as ti,\ Article.abstract as ab, Article.startPage as startPage, Article.endPage as endPage\ from Publisher, Journal, Volume, Issue, Article \ where Publisher.id = Journal.publisherId and Journal.id = Volume.journalId \ and Volume.id = Issue.volumeId and Issue.id = Article.issueId;
Indexing a Single Table with a simple query
In this simple example, we will show how to use LuSql to index the fields from a single table.
Given the following MySQL table:
mysql> describe Article; +--------------+-------------+------+-----+---------+-------+ | Field | Type | Null | Key | Default | Extra | +--------------+-------------+------+-----+---------+-------+ | id | int(11) | YES | | NULL | | | articleTitle | varchar(64) | YES | | NULL | | | abstract | text | YES | | NULL | | | startPage | varchar(16) | YES | | NULL | | | journalTitle | text | YES | | NULL | | | issn | char(12) | YES | | NULL | | | volumeNumber | varchar(32) | YES | | NULL | | | volumeYear | char(16) | YES | MUL | NULL | | | issueNumber | varchar(32) | YES | | NULL | | +--------------+-------------+------+-----+---------+-------+
mysql> select count(id) from Article; select count(id) from Article; +-----------+ | count(id) | +-----------+ | 6409484 | +-----------+
mysql> select count(id) from Article where volumeYear > 2007; +-----------+ | count(id) | +-----------+ | 218 | +-----------+
Let's build an index the first 100 articles (all the fields of all the returned records) more recent than 2007 using the following invocation of LuSql:
java -jar lusql.jar \ -c "jdbc:mysql://dbhost/dbname?user=ID&password=PASS"\ -q "select * from Article where volumeYear > 2007" \ -n 100 \ -l tutorial-1 \ -v
Note that the LuSql jar contains all the Apache Commons class libraries, the Lucene 2.3.1 jar, and the JDBC driver for MySQL. If you need to communicate with a different DBMS or JDBC driver, or use more recent versions of the jar files, you need to put all of these in your CLASSPATH (for more information about how to change the JDBC driver, see the user manual), and invoke LuSql in the following fashion:
java ca.nrc.cisti.lusql.LuSqlMain \ -c "jdbc:mysql://dbhost/dbname?user=ID&password=PASS"\ -q "select * from Article where volumeYear > 2007" \ -n 100 \ -l tutorial-1 \ -v
Explanation of command line arguments
-
-c conn: JDBC connect string for the database -
-q "sql": this is the SQL query to run, in quotes. It can be any valid (for the database in use) SQL. All fields returned by this query are indexed. The default field parameters are: Field.Index.TOKENIZED, Field.Store.YES, Field.TermVector.YES. You can change these defaults - either globally or on a per-field basis - using the -i or -I option. See the Command Line Arguments below or the user manual for more information. -
-n integer: the maximum number of records to use from the database -
-l luceneIndexName: the name of the Lucene index to write to -
-v: produce verbose output
The above command produces the following output:
Using sql:[select * from Article where volumeYear > 2007] Using Analyzer:[org.apache.lucene.analysis.standard.StandardAnalyzer] Using Stop Word FileName:[null] Using Properties FileName:[null] Using DB driver name:[com.mysql.jdbc.Driver] Using DB URL:[jdbc:mysql://dbhost/dbname?user=USERID\&password=PASSWORD] Using Lucene index:tutorial_1 Using Lucene index RAMBUFFER MBs:48.0 Using multithreaded:true Using Test:false Using Field parameters:211 Using setting DB fetchsize=0 (see -m) Using Num documents to add:100 Using Lucene index directory:tutorial_1 Opening Lucene index: tutorial_1 Opening MySQL connection Querying:select * from Article where volumeYear > 2007 Indexing Threading: Queue size=60 Threading: # threads=20 Number of records added= 100 Optimizing index Closing index Optimizing index time: 0 seconds Closing JDBC: result set Closing JDBC: statement Closing JDBC: connection *********** Elapsed time: 7 seconds
Lucene index produced by LuSql examined using Luke:
Luke overview panel
Luke Documents panel
Luke Search panel
Command Line Arguments
usage: java -jar lusql OR java ca.nrc.cisti.lusql.LuSqlMain [-A] [-a <arg>] -c <arg> [-d <arg>] [-f <arg>] [-g <arg>] [-I <arg>] [-i <arg>] -l <arg> [-M <arg>] [-m]
[-n <arg>] [-N <arg>] [-p <arg>] [-Q] -q <arg> [-r <arg>] [-s <arg>] [-T] [-t] [-v]
-A Append to existing Lucene index.
-a <arg> Full name class implementing Lucene Analyzer; Default: org.apache.lucene.analysis.standard.StandardAnalyzer
-c <arg> JDBC connection URL: REQUIRED
-d <arg> Full name of DB driver class (should be in CLASSPATH); Default: com.mysql.jdbc.Driver
-f <arg> Full name class implementing DocumentFilter; Default: ca.nrc.cisti.lusql.core.NullFilter (does nothing).
This is applied before each Lucene Document is added to the Index. If it returns null, nothing is added
-g <arg> Set global Document field and value. Format: "field=value" or "NNN:field=value" (See -i for NNN values)
-I <arg> Global field index parameters. This sets all the fields parameters to this one set. Format: NNN. See -i for
NNN values. Only one of -i or -I can be used at one time.
-i <arg> Field index parameters.
Format: "NNN NNN NNN".One set per field in SQL, and in same order as in SQL. Used only if you want to overide the
defaults (below). See for more information Field.Index, Field.Store, Field.TermVector inorg.apache.lucene.document.Field
http://lucene.apache.org/java/2_2_0/api/org/apache/lucene/document/Field.html
Default: NNN=211
Field Index Parameter values:
Index: Default:TOKENIZED
0:NO
1:NO_NORMS
2:TOKENIZED
3:UN_TOKENIZED
Store: Default:YES
0:NO
1:YES
2:COMPRESS
Term vector: Default:NO
0:NO
1:YES
2:WITH_OFFSETS
3:WITH_POSITIONS
4:WITH_POSITIONS_OFFSETS
-l <arg> Lucene directory. Default: index
-M <arg> Changes the meta replacement string for the -Q command line parameters. Default: @
-m Turns off need get around MySql driver-caused OutOfMemory problem in large queries. Sets
Statement.setFetchSize(Integer.MIN_VALUE)
See http://benjchristensen.wordpress.com/2008/05/27/mysql-jdbc-memory-usage-on-large-resultset
-n <arg> Number of documents to add. If unset all records from query are used.
-N <arg> Number of thread for multithreading. Defaults: Runtime.getRuntime().availableProcessors()) *2.5. This
machine this is:
-p <arg> Properties file
-Q Subquery in the form "field|NNN|sql" or "field|NNN NNN...|sql" or "field|sql" (See -i for NNN values). Note
that you can have multiple -Qs.
-q <arg> Primary SQL query (in double quotes). REQUIRED
-r <arg> LuceneRAMBufferSizeInMBs: IndexWriter.setRAMBufferSizeMB(). Default: 48.0
-s <arg> Name of stop word file for Lucene to use (relative or full path)
-T Turn off multithreading. Note that multithreading does not guarantee the ordering of documents. If you want
the order of Lucene documents to match the ordering of DB records generated by the SQL query, turn-off multithreading
-t Test mode. Does not open up Lucene index. Prints (-n) records from SQL query
-v Verbose mode



