Each saffron instance needs a configuration file. Its structure is described below.
Default value: main
Accepts the values main and worker. It specifies whether the saffron instance will be the main instance or a worker-only instance.
If the environment variable SAFFRON_MODE is set to a valid value, it will override this field.
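The override rule can be sketched as follows. This is only an illustration, not saffron's actual code: `resolveMode` is a hypothetical helper, and treating any value other than main or worker as invalid is an assumption.

```typescript
type Mode = "main" | "worker";

// Hypothetical helper: picks the effective mode of an instance.
function resolveMode(configMode: Mode | undefined, envMode: string | undefined): Mode {
  // Assumption: only the exact strings "main" and "worker" count as valid.
  if (envMode === "main" || envMode === "worker") return envMode;
  // Fall back to the configuration file, then to the documented default.
  return configMode ?? "main";
}
```

In a Node.js process the second argument would typically be `process.env.SAFFRON_MODE`.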
Default value: none
A function that will be called after saffron is done scraping and parsing the articles. This function is useful to save the parsed articles in the database.
Arguments:
tableName: The table (or collection) where the articles will be saved.
articles: An array of all the retrieved articles.
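As a rough sketch of such a callback, the example below stores articles in an in-memory object instead of a real database. The `Article` fields (`title`, `link`) and the `db` object are assumptions made for the illustration; the real article type has its own shape.

```typescript
// Minimal stand-in for saffron's Article type (assumed fields).
interface Article { title: string; link: string; }

// In-memory "database": one array per table (or collection).
const db: Record<string, Article[]> = {};

const config = {
  newArticles: (tableName: string, articles: Article[]) => {
    const existing = db[tableName] || [];
    // Skip articles whose link is already stored in this table.
    const fresh = articles.filter(a => !existing.some(e => e.link === a.link));
    db[tableName] = existing.concat(fresh);
  },
};
```

A real implementation would replace the body with an insert or upsert call to the database of your choice.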
Default value: ./sources
The directory where the source files are located.
All .js and .json files will be treated as source files (including those in subdirectories).
Default value: true
If true, saffron will scan all the subdirectories inside the path directory.
An array containing all the implementations for the dynamic source files.
Default value: JSON.parse(fs.readFileSync(filepath))
A custom loader that lets you manually load each file into Saffron.
It was created to allow ES6 projects to load JavaScript files for the old dynamic parser. Now it can be used to preprocess the files before passing them to Saffron.
loader: async (filepath: string) => {
  // Requires `fs` to be imported in the configuration file
  let data = JSON.parse(fs.readFileSync(filepath, 'utf-8'));
  // Process the content of the source file
  // ...
  return data;
}
Default value: []
When not empty it will filter the parsed sources and include only the ones that are mentioned. The source identification is made using the source name (and not the source file name).
This field is useful when testing specific source files.
Default value: []
When not empty, it will exclude the sources mentioned here and keep all the others.
It will exclude sources even if they are listed in the includeOnly option.
The source identification is made using the source name (and not the source file name).
This field is useful when a source file is deprecated and must be excluded. Setting an environment variable for this may be a good choice.
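The interaction between the two lists can be sketched like this. `filterSources` and `NamedSource` are hypothetical names used only for the illustration; the sketch assumes that filtering looks at nothing but the source name.

```typescript
// Hypothetical shape: only the source name matters for filtering.
interface NamedSource { name: string; }

function filterSources<T extends NamedSource>(
  sources: T[],
  includeOnly: string[],
  exclude: string[],
): T[] {
  return sources
    // An empty includeOnly list keeps every source.
    .filter(s => includeOnly.length === 0 || includeOnly.indexOf(s.name) !== -1)
    // Excluded names are dropped even when they also appear in includeOnly.
    .filter(s => exclude.indexOf(s.name) === -1);
}
```

With sources named a, b and c, `includeOnly: ['a', 'b']` and `exclude: ['b']`, only a remains.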
Default value: 1
The number of workers that will be initialized. All workers run in the same instance. There is no clear benefit in having multiple workers, since they will not run in different threads or processes. They exist for the purpose of analytics.
If you want to run saffron in different processes, see Grid.
If you want to assign a name to each worker, you can replace the number with an array of names (strings), like: nodes: ['Worker 1', 'Worker 2', 'Billy'].
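Both accepted forms of nodes can be reduced to a list of worker names. The sketch below is an assumption about how that might look; `workerNames` is a hypothetical helper and the generated "Worker N" names are made up for the example.

```typescript
// Hypothetical helper: turns the `nodes` option into a list of worker names.
function workerNames(nodes: number | string[]): string[] {
  // An array of names is used as given (copied to avoid mutating the config).
  if (typeof nodes !== "number") return nodes.slice();
  const names: string[] = [];
  // Assumed naming scheme for unnamed workers.
  for (let i = 1; i <= nodes; i++) names.push(`Worker ${i}`);
  return names;
}
```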
Default value: 0
A website may rate limit the requests it accepts, so a source file with multiple urls may fail due to this limit.
Increasing the option delayBetweenRequests (in milliseconds) will add a pause between the requests that belong to the same source file.
The Axios configuration that will be applied to the requests made by saffron.
The method of the requests is GET by default, but it can be overridden here.
It also supports a callback, which may be asynchronous, like:
axios: async (source: Source) => {
  return {
    method: 'POST',
    data: {
      key: 'value'
    },
    timeout: 3000
  };
}
A function that will edit the response of the HTTP request before passing it to the parser.
It is helpful in cases where the document must be modified before parsing, for example an RSS file which contains the character & instead of its encoded representation &amp;.
It supports a callback, which may be asynchronous, like:
// type RequestsResult = AxiosResponse | AxiosResponse[];
axios: async (responses: RequestsResult, source: Source) => {
  // ...
  return responses;
}
This function does apply to the dynamic parser.
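For the RSS case mentioned above, the body of such a function might escape bare ampersands before the document reaches the parser. This is a minimal sketch that assumes the response body is available as a string; `escapeBareAmpersands` is a hypothetical helper and not part of saffron.

```typescript
// Replaces every "&" that does not start an entity
// (for example &amp;, &#38; or &#x26;) with "&amp;".
function escapeBareAmpersands(xml: string): string {
  return xml.replace(/&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)/g, "&amp;");
}
```

Note that the sketch does not treat CDATA sections specially, so an ampersand inside CDATA would be escaped as well.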
Default value: 30
The maximum amount of articles that saffron will return for each source scraping job.
Default value: true
Whether the attachments of a generated article will include the urls extracted from the article's content.
Default value: categories
Accepted values are categories or extras.
This field specifies where to store the categories mentioned in the url field of the source file.
Default value: 3600000 (60 minutes)
The interval, in milliseconds, between each source scraping job for a specific source.
This option is also used by the scheduler to spread the source files evenly inside a specific timeframe. This eases the load on your machine by not executing all the source scraping jobs at the same time.
A lower value for this interval means a heavier load on the websites that are scraped.
This field has a recommended minimum value of 5000ms.
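One way to picture the even spreading is to give each source its own start offset inside the interval. The sketch below is an assumption about the idea, not the scheduler's actual algorithm; `spreadOffsets` is a hypothetical helper.

```typescript
// Hypothetical helper: evenly spaced start offsets (in ms)
// for `count` sources inside a single interval.
function spreadOffsets(count: number, interval: number): number[] {
  const offsets: number[] = [];
  for (let i = 0; i < count; i++) {
    offsets.push(Math.floor((i * interval) / count));
  }
  return offsets;
}

// spreadOffsets(4, 3600000) -> [0, 900000, 1800000, 2700000]
```

With the default interval and four sources, one job would start every 15 minutes instead of all four starting together.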
Default value: 86400000 (24 hours)
When a source scraping job fails 10 consecutive times, it will trigger the freeze mechanism of saffron and schedule the source to be executed again after heavyJobFailureInterval milliseconds.
It is recommended to have a high value for this field, as it will be used when a website is down.
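The freeze rule can be sketched as a choice between two delays. `nextRunDelay` is a hypothetical helper, and the assumption that a job below the failure limit is simply retried after the normal interval is not confirmed by the text above.

```typescript
// Consecutive failures before the freeze mechanism triggers, as described above.
const FAILURE_LIMIT = 10;

// Hypothetical helper: delay (in ms) before the next run of a source.
function nextRunDelay(
  consecutiveFailures: number,
  interval: number,
  heavyJobFailureInterval: number,
): number {
  // Assumption: below the limit the normal interval is used.
  return consecutiveFailures >= FAILURE_LIMIT ? heavyJobFailureInterval : interval;
}
```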
Default value: 2
A multiplier of the source interval. When a job exceeds that product without reporting back, saffron will re-assign it to another worker.
For example, for an interval of 1 minute and a noResponseThreshold of 2, if a worker does not report the source scraping job as passed or failed after interval * noResponseThreshold = 2 minutes, saffron will replace the current worker.
This field is useful in cases where a remote node has crashed or lost connection with the main instance.
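The arithmetic of the example can be written as a small check. `shouldReassign` is a hypothetical helper; timestamps are assumed to be in milliseconds, and "exceeding" is read as strictly greater than the product.

```typescript
// Hypothetical helper: true when a job has been silent for longer than
// interval * noResponseThreshold milliseconds.
function shouldReassign(
  startedAt: number,
  now: number,
  interval: number,
  noResponseThreshold: number,
): boolean {
  return now - startedAt > interval * noResponseThreshold;
}
```

For an interval of 60000 and a noResponseThreshold of 2, the check becomes true once more than 120000 ms have passed since the job started.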
Default value: -300 to 300 seconds
The scheduler adds/subtracts a random time from the interval of each job to avoid making requests at fixed intervals. The result of the function must be in milliseconds.
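randomizeInterval: () => {
  // Assumed bounds in seconds, chosen to match the documented default range
  const low = 0, high = 300;
  const random = Math.floor(Math.random() * (high - low) + low) * 1000;
  return Math.random() >= 0.5 ? random : -random;
}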
Default value: false
In the main instance, it will start a socket.io-server and wait for nodes to connect.
In a worker instance, it will try to connect to the main instance using socket.io-client.
Default value: false
Use HTTPS instead of HTTP server.
The host of the server that the worker nodes will connect to, without the http:// or https:// prefix and the :<port> suffix.
Default value: 3000
The port where the main node will listen and the worker nodes will connect.
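Put together, the HTTPS flag, the host and the port describe the address a worker connects to. The sketch below shows how they might combine; the option names `useHTTPS`, `serverHost` and `serverPort` are placeholders invented for the example, since the real keys may differ.

```typescript
// Placeholder option names; the real configuration keys may differ.
interface GridOptions {
  useHTTPS: boolean;
  serverHost: string; // without scheme prefix and without port
  serverPort: number;
}

// Hypothetical helper: builds the address of the main node.
function gridUrl(options: GridOptions): string {
  const scheme = options.useHTTPS ? "https" : "http";
  return `${scheme}://${options.serverHost}:${options.serverPort}`;
}
```

This is why the host must be given without the scheme prefix and the port: both are added from the other options.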
A string that will authenticate a worker node to the main node.
If the authentication fails, the socket will be disconnected.
The key used by the HTTPS server.
The certificate used by the HTTPS server.
Default value: all
The log level of saffron.
Each level is a subset of the next one: none < errors < info < all
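The ordering of the levels can be expressed as a simple comparison. `shouldLog` is a hypothetical helper written for this illustration and does not describe saffron's internal logger.

```typescript
type LogLevel = "none" | "errors" | "info" | "all";

// Levels from least to most verbose, in the documented order.
const LEVEL_ORDER: LogLevel[] = ["none", "errors", "info", "all"];

// Hypothetical helper: should a message of `messageLevel` be printed
// when the configured level is `configured`?
function shouldLog(configured: LogLevel, messageLevel: LogLevel): boolean {
  if (messageLevel === "none") return false; // "none" is not a message level
  return LEVEL_ORDER.indexOf(messageLevel) <= LEVEL_ORDER.indexOf(configured);
}
```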
Default value: 0
The time, in milliseconds, by which events will be delayed before being delivered.
Saffron supports different configurations based on the environment.
The configuration will first receive the root configuration:
const config = {
  newArticles: (tableName: string, articles: Article[]) => {
    // push to production database
  },
  // ...
  misc: {
    log: "errors" // log only errors
  },
  // ...
};
and then read the child object development, production or testing, based on the environment:
const config = {
  newArticles: (tableName: string, articles: Article[]) => {
    // push to production database
  },
  // ...
  misc: {
    log: "errors" // log only errors
  },
  // ...
  development: {
    newArticles: (tableName: string, articles: Article[]) => {
      // push to local database
    },
    // ...
    misc: {
      log: "all" // log everything
    },
    // ...
  }
};
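The layering of the root configuration and the environment child can be pictured as a merge. The sketch below assumes a deep merge of nested objects, which the text above does not confirm; `mergeConfig` and `resolveConfig` are hypothetical helpers, not saffron functions.

```typescript
type Plain = { [key: string]: unknown };

function isPlain(value: unknown): value is Plain {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Recursively copies `override` on top of `base`.
// Functions, arrays and primitives are replaced as a whole.
function mergeConfig(base: Plain, override: Plain): Plain {
  const out: Plain = { ...base };
  for (const key of Object.keys(override)) {
    const a = out[key];
    const b = override[key];
    out[key] = isPlain(a) && isPlain(b) ? mergeConfig(a, b) : b;
  }
  return out;
}

// Hypothetical resolution step: apply the child object of the current environment.
function resolveConfig(config: Plain, env: "development" | "production" | "testing"): Plain {
  const child = config[env];
  return isPlain(child) ? mergeConfig(config, child) : config;
}
```

Under this assumption, a development block that only sets misc.log would keep every other misc field from the root configuration.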