-
Notifications
You must be signed in to change notification settings - Fork 210
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Limit the maximum number of feed items when parsing #199
Comments
UpdateI did a little bit of digging and found out implementing this is actually not as easy at it seems.
So in the results above we compare:
|
This is a great idea. I also am keen on introducing a config object, and putting some other things in there as well like user agent, proxy, strictness settings. We could put it behind a setter, as you suggested, or cut a v2 breaking change and clean up the interface a bit. |
I'm been kind of mulling over what options we might want in this config object, that have been requested in the past, or might be beneficial to give control over. These are what I've come up with. Let me know your thoughts on these if you get a chance. type ParseOptions struct {
// MaxItems specifies the maximum number of feed items to parse.
// The default value is 0, which means no limit.
MaxItems int
// ParseDates determines if the feed parser will attempt to parse dates into `time.Time` objects.
// The default value is true.
ParseDates bool
// KeepOriginalFeed specifies if the parser should retain the raw feed in the `Feed` struct's `RawFeed` field.
// The default value is false.
KeepOriginalFeed bool
// StrictMode determines if strict parsing rules should be applied.
// If set to true, all strictness settings are enabled.
// The default value is false, and all StrictnessOptions default to their least strict settings.
StrictMode bool
// StrictnessOptions holds the options for controlling the strictness of the parsing.
StrictnessOptions StrictnessOptions
// RequestOptions holds the options for the HTTP request.
RequestOptions RequestOptions
}
type StrictnessOptions struct {
// StripInvalidCharacters specifies if invalid feed characters should be stripped out.
// The default value is true.
StripInvalidCharacters bool
// AutoCloseTags specifies if the parser should automatically close unclosed tags.
// The default value is true.
AutoCloseTags bool
// AllowUndisclosedXMLNamespaces specifies if the parser should allow undisclosed XML namespaces.
// The default value is true.
AllowUndisclosedXMLNamespaces bool
// AllowIncorrectDateFormats specifies if the parser should allow incorrect date formats.
// The default value is true.
AllowIncorrectDateFormats bool
// AllowUnescapedMarkup specifies if the parser should allow unescaped / naked markup in feed elements.
// The default value is true.
AllowUnescapedMarkup bool
}
type RequestOptions struct {
// Timeout is the maximum amount of time to wait for the request to complete.
// The default value is 0, which means no timeout.
Timeout time.Duration
// UserAgent is the value of the `User-Agent` header to be sent in the request.
// This header is used to identify the client software and version to the server.
// The default value is "gofeed/2.0.0" (the current version of gofeed).
UserAgent string
// IfNoneMatch is the value of the `If-None-Match` header to be sent in the request.
// This header is used to make a conditional request for a resource, and the server will return the resource only if it does not match the provided entity tag.
// If the server has a matching `Etag`, it will not respond with the full feed, and instead return a `304 Not Modified` response.
// The default value is an empty string, which means no `If-None-Match` header will be sent.
IfNoneMatch string
// IfModifiedSince is the value of the `If-Modified-Since` header to be sent in the request.
// This header is used to make a conditional request for a resource, and the server will return the resource only if it has been modified since the provided time.
// If the server determines that the resource has not been modified since the provided `If-Modified-Since` time, it will not respond with the full feed, and instead return a `304 Not Modified` response.
// The default value is an empty string, which means no `If-Modified-Since` header will be sent.
IfModifiedSince string
} I also realized that the universal feed parser's |
I thought about the options. Overall, I like your proposal! I have been working on extracting feeds for over a year now (for a personal project). Given the challenges I faced, I thought I share my suggestions: 1- This is very subjective, but I think the request fetching is out of the scope of gofeed, which is concerned with parsing. The fact that it can parse from urls is fantastic, but I think if someone needs a very advanced fetching mechanism, they can implement them themselves. For instance, some RSS feeds have huge body sizes and require special headers, auth, etc. Overall, I think parsing differs from fetching, and it may not be worth it if it makes the code/API more complex. But this is just my opinion, obviously :) 2- Regarding the 3- Related to the previous suggestion, I think gofeed can have separate utility functions that are helpful when working with feeds. For instance, we can detect RSS links based on some heuristics. Other utilities could be URL resolution and HTML sanitization. I may be able to help with these. All in all, I think the proposal is very nice! Thank you again for sharing gofeed with us! :) |
Thanks @imirzadeh . I think you are right. I'll leave the ParseURL functions, but will encourage people who want to have custom behavior to use their own client implementation. |
Motivations
1- Some feeds are not well-maintained, especially podcasts like NYTimes Daily or this one. In those feeds, usually all/most of the history (hundreds of items) is in the feed. This leads to a very long parsing time.
2- Usually, people use RSS parser libraries to check the latest news and don't necessarily need all the items to be parsed. Given that gofeed parses xml using the pull method, it can stop processing at certain points and save time and compute.
Suggestion
For the reasons explained above, I think we can add an optional
ParseConfig
object to theParser
to fetch the posts faster.To make things backward compatible, we can add a
SetConfig
method that users can explicitly call to change options.The text was updated successfully, but these errors were encountered: