Solr

Solr is an Apache search engine. CEHD uses it to search across multiple Drupal sites.

I essentially got this working in a way that I think will serve the college's needs.

A prototype of the setup is located here: http://strader.cehd.tamu.edu/Demos/solr/

The gist is this:

  • Set up Solr 3.1 and Nutch 1.2
  • Use Nutch to crawl the servers we want to search with our system
  • Use the custom front end to get the query from the user, submit it to Solr, retrieve the response in XML format, and display it appropriately (see the example query below).
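
The XML the front end consumes comes straight from Solr's standard select handler. For reference, a query like the following returns results as XML (the host and port here are the Solr defaults, and the search term is just an illustration):

 http://localhost:8983/solr/select?q=education&start=0&rows=10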

To start:

  • cd /usr/local/apache-solr-3.1.0/cehd/
  • /usr/local/java/bin/java -jar start.jar &
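
Once started, you can confirm the instance is answering by requesting the admin page (assuming curl is available; the cehd instance on mycehd listens on 8984, per the note in the Nutch section below):

 curl http://localhost:8984/solr/admin/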

Tutorial

Installation

Example

  • cd solr/example
  • Start application: java -jar start.jar
  • Index data manually: cd exampledocs; java -jar post.jar *.xml
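
After posting the example docs, you can sanity-check the index with the stock query from the Solr tutorial linked below (adjust the q term as needed):

 http://localhost:8983/solr/select/?q=solr&start=0&rows=10&indent=on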

Nutch Crawler

  1. Edit NUTCH_ROOT/conf/nutch-default.xml and set the value of http.agent.name to be the name of your crawler. You can then fill in any other info about your crawler that you wish, but it is not necessary.
  2. Create folder NUTCH_ROOT/crawl
  3. Create file NUTCH_ROOT/urls/nutch and into it type all the URLs you wish to crawl (one per line) - make sure to include ‘http://’ and the trailing slash.
  4. Edit NUTCH_ROOT/conf/crawl-urlfilter.txt – beneath the line ‘# accept hosts in MY.DOMAIN.NAME’ replace MY.DOMAIN.NAME with the first of the URLs you wish to crawl, then add a new line for each of the other URLs (formatted the same way as the first one).
  5. Configure Solr:
    1. Copy all the files from the NUTCH_ROOT/conf into SOLR_ROOT/example/solr/conf (overwrite any files it asks you to).
    2. Edit SOLR_ROOT/example/solr/conf/schema.xml and in line 71 change the stored attribute from false to true.
    3. Edit SOLR_ROOT/example/solr/conf/solrconfig.xml and add the following above the first requestHandler tag (the snippet is reproduced after this list):
  6. Start Solr:
    1. $ cd SOLR_ROOT/example
    2. $ java -jar start.jar
  7. Start the crawl:
    1. $ cd NUTCH_ROOT (mycehd: /usr/local/apache-solr-3.1.0/nutch-1.2)
    2. The crawl command has the following options:
      • -dir names the directory to put the crawled data into
      • -threads determines the number of threads that will fetch in parallel (optional)
      • -depth indicates the link depth from the root page that should be crawled
      • -topN determines the maximum number of URLs to be retrieved at each level up to the depth
      • You can set these numbers to whatever you like; in general, the higher the numbers, the more data you will crawl and the longer the crawl will take. It all depends on your server setup and what you want from the crawl. For example, the following crawl command will take a couple of days to complete:
    3. $ bin/nutch crawl urls -dir crawl -depth 20 -topN 2000000
  8. Index the crawl results:
    1. $ bin/nutch solrindex http://HOST_ADDRESS:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/* (the port number here and in the next step will differ depending on your server setup – check the Solr wiki for more info about that).
    2. Go to http://HOST_ADDRESS:8983/solr/admin for the default Solr admin panel to search the index. You can also retrieve the results XML directly via the right URL – you will see this URL in the address bar when you reach the results.
  Note: the port on mycehd is 8984.
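
The solrconfig.xml snippet referenced in step 5.3 is missing from these notes. The Nutch 1.x/Solr integration tutorials that this walkthrough follows add a dedicated /nutch requestHandler along the lines below; treat the field boosts and highlighting defaults as illustrative, not a tested config:

  <requestHandler name="/nutch" class="solr.SearchHandler">
    <!-- dismax query defaults for searching Nutch-crawled fields -->
    <lst name="defaults">
      <str name="defType">dismax</str>
      <str name="echoParams">explicit</str>
      <float name="tie">0.01</float>
      <!-- boost matches in anchor text and titles over page content -->
      <str name="qf">content^0.5 anchor^1.0 title^1.2</str>
      <str name="pf">content^0.5 anchor^1.5 title^1.2 site^1.5</str>
      <str name="fl">url</str>
      <str name="q.alt">*:*</str>
      <!-- return highlighted snippets from the crawled fields -->
      <bool name="hl">true</bool>
      <str name="hl.fl">title url content</str>
    </lst>
  </requestHandler>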

Working with Solr Index

Examples are from the example installation of Solr or from mycehd.tamu.edu.

http://lucene.apache.org/solr/tutorial.html

Deleting contents of index

  • Deleting a single item:
 java -Ddata=args -Dcommit=no -jar post.jar "<delete><id>SP2514N</id></delete>"
  • Deleting all pages with URLs containing "node":
 /usr/local/java/bin/java -Ddata=args -Durl=http://localhost:8984/solr/update -jar post.jar "<delete><query>url:node</query></delete>"

Integrating Solr with Drupal

  • Using the apachesolr Drupal integration module
  • Placed the schema.xml file from the apachesolr module directory at /usr/local/apache-solr-3.1.0/cehd/solr/conf/schema.xml (kept the Nutch version as schema.xml.nutch)
  • Kill Solr, then start it again (with the new schema.xml file in place)
  • Enable the apachesolr module in Drupal
  • Start the mycehd instance: /usr/local/apache-solr-3.1.0/cehd% /usr/local/java/bin/java -jar start.jar &
  • Index the site through Drupal
  • There are now two instances of Solr running (one for mycehd and one for the marketing sites). The marketing instance is identical to mycehd except that it uses port 8993.
  • To start the second instance: /usr/local/apache-solr-3.1.0/cehd-public% /usr/local/java/bin/java -jar start.jar &
  • I added a third instance for multi-site. This will serve the department dev sites and eventually the college site.
  • To add the new instance, I:
    • Made a copy of the cehd directory under /usr/local/apache-solr-3.1.0/
    • Changed the port to 8995 in solr/conf/scripts.conf, solr/conf/solrconfig.xml.nutch (may not be needed), and etc/jetty.xml
    • Added the third instance to the /root/Scripts/checksolr.pl script (a sketch of that kind of check follows this list)
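
For reference, each instance can be checked the way checksolr.pl presumably does it – by seeing whether Solr answers on its port. A minimal sketch of that check (hypothetical commands, assuming the standard /admin/ping handler is enabled in each instance's solrconfig.xml):

 curl -s http://localhost:8984/solr/admin/ping   # mycehd
 curl -s http://localhost:8993/solr/admin/ping   # marketing
 curl -s http://localhost:8995/solr/admin/ping   # multi-site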
