Per project management and workflow
OpenGrok can be run with or without projects. A project is simply a directory directly underneath the OpenGrok source root directory. A project can have zero or more Source Code Management repositories underneath. In a setup without projects, all of the data has to be indexed at once. With projects, however, each project has its own index so it is possible to index projects separately. What's more, it is possible to reindex them in parallel, potentially speeding up the overall process.
When working with project data, there are 2 types of processing that can take a long time:
- synchronization: updating project data so that it matches its origin
- usually involves running commands like git pull in all the repositories for the given project.
- indexing: updating the index so that it matches the project data
For some projects either or both steps can take a long time. Say you have a repository whose origin resides on an NFS share across the Atlantic, so it has high latency, and it uses a legacy VCS that operates on individual files rather than on changesets; such a repository takes a long time (tens of minutes, if not hours) to synchronize. Or there is a repository with a large number of files, so the initial phase of indexing always takes a long time (due to scanning the whole project directory tree for changed files) even though the incremental changes are small.
Or maybe there are lots of projects that exhibit some of these characteristics.
Previously, it was necessary to index the whole source root in order to discover new projects and add them to the configuration. Starting with OpenGrok 1.1, it is possible to manage and index projects separately.
As a result, indexing the complete source root is only necessary when upgrading across OpenGrok versions with incompatible Lucene indexes.
Combine these procedures with the parallel processing tools (see repository synchronization) and you have per-project management with parallel processing.
The following examples assume that the OpenGrok install base is under the /opengrok directory.
It is possible to start from scratch, or to take an OpenGrok instance that already indexes all projects in one go and convert it to index projects separately and in parallel.
There are some design choices that need to be dealt with:
- The indexer either has to discover projects and their repositories during the indexing preparation or it has to know them in advance.
- The configuration file has to be written once a project was added or modified or indexed for the first time.
- The indexer uploads the configuration to the web app at the end of the indexing.
Thus, when indexing a newly added project, it is necessary to add it to the configuration first, then index it, and lastly make the new configuration persistent.
This page lists all the pieces and how to operate them.
Also see https://github.com/oracle/opengrok/wiki/Tuning-for-large-code-bases#tomcat-threads
The following assumes that the opengrok-projadm, opengrok-groups and opengrok-config-merge tools are in PATH. You can install these from the opengrok-tools Python package available in the release tarball.
Using the opengrok-projadm tool (that utilizes the opengrok-config-merge tool and the RESTful API) it is possible to manage the projects.
The next sections start by suggesting to back up the current configuration. This could be done e.g. by copying the configuration.xml file (that is written by the indexer when using the -W indexer option) aside, taking a file-system snapshot of the directory the configuration is stored in, etc.
This is a precaution in case something goes wrong.
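A minimal sketch of the copy-aside variant, assuming the configuration lives in a single file. The helper name backup_config and the path in the trailing comment are illustrative assumptions, not part of OpenGrok.

```python
import shutil
import time
from pathlib import Path


def backup_config(config_path, suffix=None):
    """Copy the configuration file aside and return the path of the copy."""
    src = Path(config_path)
    if suffix is None:
        # Timestamped suffix so that repeated backups do not overwrite each other.
        suffix = time.strftime("%Y%m%d-%H%M%S")
    dst = src.with_name(f"{src.name}.{suffix}")
    shutil.copy2(src, dst)  # copy2 also preserves modification time and mode
    return dst


# Example with a hypothetical location of the file written with -W:
# backup_config("/opengrok/etc/configuration.xml")
```

If a later step goes wrong, the copy can be moved back over the original file.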
- backup current config
- add the project data to a directory under the source root directory
- this usually involves running a VCS command such as git clone, extracting source code from an archive, etc.
- perform any necessary authorization adjustments
- add the project to configuration (also refreshes the configuration on disk):
opengrok-projadm -b /opengrok -a PROJECT
- change any per-project settings (see Web services)
The indexing part of the wiki explains how to run the indexer in general.
Running the indexer for single project has several constraints:
- scanning for repositories/projects is not wanted - no -P or -S options
- however the indexer has to know the project/repository information so it needs to be either retrieved from the web application or use the persistent configuration on disk
- it is undesirable to write the configuration that is created during the indexer run to disk - no -W option
Thus, running the indexer for single project may look like this:
$ curl -s -X GET http://localhost:8080/source/api/v1/configuration -o fresh_config.xml
$ opengrok-indexer -a /opengrok/dist/lib/opengrok.jar -- \
-c /usr/local/bin/ctags \
-U 'http://localhost:8080/source' \
-R fresh_config.xml \
-H PROJECT_NAME \
PROJECT_NAME
The -U option is important as it pokes the web app to use the most recent index (that was just created). It also updates the web app with the latest repository metadata.
This does not deal with logging to a separate log file. Also, this is not robust when run in parallel due to the configuration handling (it should be stored in temporary file with random name).
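The temporary-file point can be sketched as follows. This is an illustration under assumptions, not how the OpenGrok tools are implemented: the function names are hypothetical, and the jar and ctags paths are simply the ones used in the example above.

```python
import tempfile
import urllib.request


def fetch_config(base_uri):
    """Download the current configuration into a uniquely named temp file."""
    url = base_uri.rstrip("/") + "/api/v1/configuration"
    with urllib.request.urlopen(url) as response:
        data = response.read()
    # delete=False keeps the file for the indexer; the caller removes it later.
    with tempfile.NamedTemporaryFile(
        prefix="opengrok-config-", suffix=".xml", delete=False
    ) as tmp:
        tmp.write(data)
        return tmp.name


def indexer_command(config_path, project, base_uri,
                    jar="/opengrok/dist/lib/opengrok.jar",
                    ctags="/usr/local/bin/ctags"):
    """Build the argument list from the example above (no -P, -S or -W)."""
    return [
        "opengrok-indexer", "-a", jar, "--",
        "-c", ctags,
        "-U", base_uri,
        "-R", config_path,
        "-H", project,
        project,
    ]
```

Because every run gets its own file name, two projects indexed at the same time cannot overwrite each other's downloaded configuration.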
The recommended way is to use the opengrok-reindex-project script. It downloads a fresh configuration from the webapp so that the indexer knows about the indexed project and its repositories. It can also generate the logging configuration on the fly.
Once the project reindex is done, save the configuration. This is necessary so that the indexed flag of the project is persistent; if it is not made persistent and the web app restarts, the project will not be accessible in the web app.
opengrok-projadm -b /opengrok -r
The -R indexer option can be used with opengrok-projadm to supply the path to a read-only configuration so that it is merged with the current configuration.
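The required order of operations for a new project (add to the configuration, index, make the configuration persistent) can be sketched as a list of commands. The helper name new_project_steps is hypothetical, and the indexing command is left to the caller; only the opengrok-projadm invocations shown on this page are used.

```python
def new_project_steps(project, index_command, base="/opengrok"):
    """Return the commands for adding a project, in the required order."""
    return [
        # 1. Add the project to the configuration.
        ["opengrok-projadm", "-b", base, "-a", project],
        # 2. Index it (the caller supplies the indexing command).
        list(index_command),
        # 3. Make the configuration, including the indexed flag, persistent.
        ["opengrok-projadm", "-b", base, "-r"],
    ]
```

Each entry can be passed to subprocess.run(argv, check=True) so that a failing step stops the sequence before the configuration is saved.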
- backup current config
- delete the project from configuration (deletes project's index data and refreshes on disk configuration).
opengrok-projadm -b /opengrok -d PROJECT
- perform any necessary authorization, group, per-project adjustments in read-only configuration (if any)
The -R indexer option can be used with the opengrok-projadm script to supply the path to a read-only configuration so that it is merged with the current configuration.
The opengrok-sync script provides a way to run a sequence of commands for a set of projects in parallel. Thus, it can be used to synchronize and reindex a project.
The script accepts the configuration either in JSON or YAML.
Use e.g. like this:
$ opengrok-sync -c /scripts/sync.conf -d /ws-local/
where the sync.conf file contents might look like this:
commands:
- call:
    uri: http://localhost:8080/source/api/v1/messages
    method: POST
    data:
      cssClass: info
      duration: PT1H
      tags: ['%PROJECT%']
      text: resync + reindex in progress
- command:
    args: [sudo, -u, wsmirror, /opengrok/dist/bin/opengrok-mirror, -c, /opengrok/etc/mirror-config.yml, -U, 'http://localhost:8080/source']
- command:
    args: [sudo, -u, webservd, /opengrok/dist/bin/opengrok-reindex-project, -J=-d64,
      '-J=-XX:-UseGCOverheadLimit', -J=-Xmx16g, -J=-server, --jar, /opengrok/dist/lib/opengrok.jar,
      -t, /opengrok/etc/logging.properties.template, -p, '%PROJ%', -d, /opengrok/log/%PROJECT%,
      -P, '%PROJECT%', -U, 'http://localhost:8080/source', --, --renamedHistory, 'on', -r, dirbased, -G, -m, '256', -c,
      /usr/local/bin/ctags, -U, 'http://localhost:8080/source', -o, /opengrok/etc/ctags.config,
      -H, '%PROJECT%']
    env: {LC_ALL: en_US.UTF-8}
    limits: {RLIMIT_NOFILE: 1024}
- call:
    uri: 'http://localhost:8080/source/api/v1/messages?tag=%PROJECT%'
    method: DELETE
    data: ''
- command: [/scripts/check-indexer-logs.ksh]
cleanup:
- call:
    uri: 'http://localhost:8080/source/api/v1/messages?tag=%PROJECT%'
    method: DELETE
    data: ''
Note: the -U 'http://localhost:8080/source' option appearing twice in the opengrok-reindex-project arguments above is not a typo. It must be specified twice: once for the Python script and once for the indexer.
The above opengrok-sync command will basically take all directories under /ws-local and for each it will run the sequence of commands specified in the sync.conf file. This will be done in parallel, on the project level. The level of parallelism can be specified using the --workers option (by default it will use as many workers as there are CPUs in the system).
Another way to specify the list of projects to be synchronized is to use the --indexed option of opengrok-sync, which will query the webapp configuration for the list of indexed projects and use that list. Otherwise, the --projects option can be specified to process just the specified projects.
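The directory-based project selection and the per-project parallelism can be illustrated with a short sketch. It is not the opengrok-sync implementation: the function names are hypothetical and a thread pool is used only for brevity.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def list_projects(directory):
    """Treat every directory directly under `directory` as a project."""
    return sorted(p.name for p in Path(directory).iterdir() if p.is_dir())


def process_all(projects, handler, workers=None):
    """Run `handler(project)` for every project, several projects at a time."""
    projects = list(projects)
    if workers is None:
        workers = os.cpu_count() or 1  # default: one worker per CPU
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in the same order as the input projects.
        return dict(zip(projects, pool.map(handler, projects)))
```

Here handler stands for "run the whole command sequence for one project", so the commands of a single project still run one after another while different projects proceed concurrently.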
The commands above will basically:
- mark the project with alert (to let the users know it is being synchronized/indexed) using the RESTful API call (the %PROJECT% string is replaced with current project name)
- pull the changes from all the upstream repositories that belong to the project using the opengrok-mirror command
- reindex the project using opengrok-reindex-project
- clear the alert using the second RESTful API call
- execute the /scripts/check-indexer-logs.ksh script to perform some pattern matching in the indexer logs to see if there were any serious failures there. The script can look e.g. like this:
#!/usr/bin/ksh
#
# Check OpenGrok indexer logs in the last 24 hours for any signs of serious
# trouble.
#
if (( $# != 1 )); then
    print -u2 "usage: $0 <project_name>"
    exit 1
fi
project_name=$1
typeset -r log_dir="/opengrok/log/$project_name/"
if [[ ! -d $log_dir ]]; then
    print -u2 "cannot open log directory $log_dir"
    exit 1
fi
# Check the last log file.
if grep SEVERE "$log_dir/opengrok0.0.log"; then
    exit 1
fi
The opengrok-sync script will print any errors to the console and uses file level locking to provide exclusivity of run, so it is handy to run from crontab periodically.
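To illustrate what file level locking buys a cron job, here is a minimal sketch using an advisory lock from the Python standard library (Unix only). It shows the general technique and is an assumption about the approach, not the actual opengrok-sync code; try_lock is a hypothetical helper.

```python
import fcntl


def try_lock(lock_path):
    """Return an open, exclusively locked file, or None if already locked."""
    handle = open(lock_path, "a")
    try:
        # LOCK_NB makes the call fail immediately instead of waiting.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    return handle  # keep it open; closing the file releases the lock
```

A run that gets None back can simply exit, so a slow synchronization is never overlapped by the next scheduled one.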
Each "command" can be either normal command execution (supplying the list of program arguments) or RESTful API call (supplying the HTTP verb and optional payload).
Note that cleanup is a set of commands. If any of them fails (i.e. returns a non-zero value), the process is not interrupted, unlike the main command sequence.
Note that if the web application is listening on a non-standard host or port (localhost and 8080 are the defaults), the URI has to be specified everywhere it matters. Given that opengrok-sync performs RESTful API queries itself, one has to specify the location using the -U option of this script and then again in the configuration file: for any RESTful API calls and for the opengrok-indexer command (which also uses the -U option).
In the configuration, each command that performs an API call can specify a set of HTTP headers via a dictionary under the headers key, e.g.:
- call:
    uri: 'http://192.160.0.1:8080/source/api/v1/messages?tag=%PROJECT%'
    method: DELETE
    data: 'resync + reindex in progress'
    headers:
      'Content-type': 'text/plain'
      'Authorization': 'Bearer foobar'
Also, it is possible to specify a common set of headers to be used for each RESTful API command with the top level headers item:
headers:
  'Authorization': 'Bearer foobar'
  'X-another-header': 'whoohooo'
commands:
- call:
    uri: http://192.160.0.1:8080/source/api/v1/messages
    method: POST
    data:
      messageLevel: warning
      duration: PT1H
      tags: ['%PROJECT%']
      text: resync + reindex in progress
    headers:
      'X-special-header': 'foo'
      'X-just-another-header': 'ha'
...
What happens is that the headers from the top level headers configuration item are merged with the command specific headers. So, in the above example, the RESTful command will have the Authorization, X-another-header, X-special-header and X-just-another-header headers.
Further, if opengrok-sync is run with the -H command line option (that can have multiple arguments, i.e. it is possible to specify multiple headers with it), these headers will be merged with the headers from the top level headers configuration item and the result will be merged with the per command specific headers.
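The merge order described above can be sketched with plain dictionaries. What happens when the same header name appears in more than one place is not stated here; the sketch assumes the more specific source wins, which should be treated as an assumption. merge_headers is a hypothetical helper.

```python
def merge_headers(cli_headers, top_level_headers, command_headers):
    """Combine the three header sources into one dictionary."""
    merged = {}
    # Assumption: on a name clash the more specific (later) source wins.
    for source in (cli_headers, top_level_headers, command_headers):
        merged.update(source or {})
    return merged
```

With the top level and per command headers from the example above and no -H headers, the result contains all four header names.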
If any of the commands in "commands" fails, the "cleanup" commands will be executed. This is handy here since the first RESTful API call marks the project with an alert in the web UI, so if any of the commands that follow fails, the cleanup call will be made to clear the alert.
Normal command execution can also be performed in the cleanup section.
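The failure handling described above (stop the main sequence at the first failure, run the cleanup, and do not let a failing cleanup step interrupt the rest) can be sketched like this. Steps are modelled as plain callables returning an exit status; run_sequence is a hypothetical name and the real script is structured differently.

```python
def run_sequence(commands, cleanup, project):
    """Run commands in order; on the first failure run cleanup and stop.

    Each step is a callable taking the project name and returning an
    exit status, where 0 means success.
    """
    for step in commands:
        if step(project) != 0:
            # Every cleanup step runs even if an earlier cleanup step failed.
            for cleanup_step in cleanup:
                cleanup_step(project)
            return False
    return True
```

In the sync.conf example this is what clears the alert message when mirroring or reindexing fails halfway through.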
Sometimes it is useful to ignore some projects in an opengrok-sync run (assuming it is run e.g. with --indexed so it retrieves the list of projects to process from the web application), such as when a project needs special handling, e.g. a different schedule.
These projects can be specified either in the "ignore_projects" section in the configuration file (holds a list of project names) or using the --ignore_project command line option (can have multiple arguments).
Some projects can be notorious for producing spurious errors, so their errors can be ignored via the "ignore_errors" section.
In the above example it is assumed that opengrok-sync is run as root and that synchronization and reindexing are done under different users. This is done so that the web application cannot tamper with source code even if compromised.
The project name is appended to the arguments of each command unless one of the arguments contains %PROJECT%, in which case it is substituted with the project name and nothing is appended.
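The substitute-or-append rule can be written down in a few lines. This is a sketch of the rule as described, with a hypothetical function name, not the code used by the tools.

```python
PROJECT_PATTERN = "%PROJECT%"


def expand_args(args, project):
    """Substitute the project name, or append it if no argument uses the pattern."""
    if any(PROJECT_PATTERN in arg for arg in args):
        return [arg.replace(PROJECT_PATTERN, project) for arg in args]
    return list(args) + [project]
```

This is why /scripts/check-indexer-logs.ksh in the example receives the project name as its only argument, and why an argument such as %PROJ% passes through untouched.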
For per-project reindexing to work properly, opengrok-reindex-project uses the logging.properties.template to make sure each project has its own log directory. The file can look e.g. like this:
handlers= java.util.logging.FileHandler
.level= FINE
java.util.logging.FileHandler.pattern = /opengrok/log/%PROJ%/opengrok%g.%u.log
# Create one file per indexer run. This makes indexer log easy to check.
java.util.logging.FileHandler.limit = 0
java.util.logging.FileHandler.append = false
java.util.logging.FileHandler.count = 30
java.util.logging.FileHandler.formatter = org.opengrok.indexer.logger.formatter.SimpleFileLogFormatter
java.util.logging.ConsoleHandler.level = WARNING
java.util.logging.ConsoleHandler.formatter = org.opengrok.indexer.logger.formatter.SimpleFileLogFormatter
The %PROJ% template is passed to the script for substitution in the logging template. This pattern must differ from the %PROJECT% pattern, otherwise the sync.py script would substitute it in the command arguments and the substitution in the template file would not happen.
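The template substitution itself amounts to a plain string replacement, sketched below with a hypothetical helper name. Note that the Java logging placeholders %g and %u are left alone because only the exact pattern is replaced.

```python
def render_logging_template(template_text, project, pattern="%PROJ%"):
    """Replace the pattern in the logging template with the project name."""
    return template_text.replace(pattern, project)
```

Applied to the FileHandler.pattern line above with a project called foo, the log path becomes /opengrok/log/foo/opengrok%g.%u.log.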
You can find a logging.properties.template file in the final release tarball, under the doc directory.