Proposal for new R interface (discussion thread) #9734
Comments
I am the person who added the section
The sklearn interface of XGBoost extracts the necessary key-value pairs from the object attributes, serializes them as a JSON string, and stores the JSON string inside the booster. Admittedly, this approach is not foolproof. This is why we ask users to save the model using XGBoost's native serializer. It appears that the expectations are somewhat different in the R community. After seeing lots of bug reports arising from calling saveRDS on old model files (#5794), I had to add the big warning about saveRDS.
Checking the version of the C Booster is sufficient, because we always bump the version of the native package and the R package in lockstep. For example, when the native package had its version bumped from
You can extract the |
Let me try to clear some of the questions before making any comments. ;-)
No, and it's unlikely in the near future. XGBoost mostly works with row-major data with
No. The
Yes. Internally, XGBoost marks the maximum coded (like ordinal-encoding) categorical value and
No. Internally, the decision tree is always split using f32. In the long run, we can probably
You can query the objective using
Mostly by hand, unfortunately. "Mostly" since we have the native interface, which accepts
No need to explicitly set the default value. Default values are set inside libxgboost
I don't know, you will have to reach out for questions. @ExpandingMan :-)
We insist that one should not use pickle for long-term storage. An update to the cPython
However, what happens in R/Python is beyond our scope. Please take a look at our
Hopefully, the previous two answers clear this up as well.
Yes, Python scikit-learn is using attributes. But it's not important, we can remove it
No. It took some time before we got where we are, to gradually formalize what should be If you think a new field should work across bindings, please share, we can always make
At the moment, it's the row order, we can have extra fields later. It's a new feature in
I removed it very recently. (if something is missing, it's me to be blamed. -.- ). I'm |
Comments unrelated to questions. I will update this comment when I can put together some more thoughts.
CSR is well supported for all purposes.
Depending on the use case, one can fix the intercept to a constant (like 0) value or let xgboost find one based on MLE. |
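In parameter terms, that choice looks roughly like this (a small sketch using the existing `base_score` parameter; how the revamped R interface will expose it is still open):

```r
# Fix the intercept to a constant value:
params_fixed <- list(objective = "reg:squarederror", base_score = 0)
# Or omit base_score so that xgboost estimates the intercept from the data:
params_auto <- list(objective = "reg:squarederror")
```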
Thanks for the answers. So it sounds like a good way to approach serialization and transfers between interfaces would be to keep as many attributes as possible inside of the C booster, then have a function that default-initializes them, and then fills them with info from the booster if available at the moment it gets deserialized. This unfortunately wouldn't work for things like R formulas, so one would strictly need to use As a preliminary step, I think it would make sense to extend the the current optional attributes that are kept inside of the C booster to include categories of both Scikit-learn currently accepts I think it could even be handy to keep levels of categorical features as part of the metadata when using the scikit-learn interface and passing (I've opened a separate issue for the Note that this might not be as straightforward as it sounds, as |
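One possible shape for that approach, sketched with the existing `xgb.attr()` accessor (the attribute name, helper names, and stored fields below are made up for illustration, not an agreed-upon design):

```r
library(xgboost)
library(jsonlite)

# Store R-level metadata inside the C booster so it survives xgb.save()/xgb.load():
set_r_metadata <- function(booster, metadata) {
  xgb.attr(booster, "r_metadata") <- as.character(toJSON(metadata, auto_unbox = TRUE))
  booster
}

# Default-initialize the R-side fields, then overwrite with whatever the booster carries:
restore_r_metadata <- function(booster,
                               defaults = list(feature_names = NULL, factor_levels = NULL)) {
  stored <- xgb.attr(booster, "r_metadata")
  if (is.null(stored)) return(defaults)
  utils::modifyList(defaults, fromJSON(stored))
}
```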
I didn't bother to read most of this thread, but I was pinged concerning 1-based indexing, as I am one of the maintainers of the Julia wrapper, Julia being a 1-based indexed language. Nearly all Julia packages, whether wrappers or not, stick with purely 1-based index semantics. In the case of wrappers like XGBoost.jl, this simply means that the wrapper functions need to translate the indexing. There is also a relatively low-level direct wrapper of the C library.
I'm not sure what you're referring to, but no, that's not the case. The aforementioned low level C wrapper should not show up in the docs, |
If the object is R-specific, and R does such a good job of keeping
Thank you for clearing that up @ExpandingMan ! That helps a lot. |
@mayer79 would be nice to hear some comments about the proposed interface here, especially around handling of formulas. |
@david-cortes : I was a couple of days afk and will give my feedback soon. Thank you so much for starting this thread. We might get some inspiration from https://github.com/MathiasAmbuehl/modeltuner (no possibility to select how to encode categoricals - it just uses OHE), but otherwise it looks neat.
Curiously, I don't know the answers to the questions directed at me. But I think they are less important. My current mood:
Formula parser (dropping all transformations), just extracting column names:

formula_parser <- function(form, data) {
form <- as.character(form)
y <- form[2L]
all_vars <- all.vars(
stats::terms.formula(stats::reformulate(form[3L]), data = data[1L, ])
)
list(y = y, x = setdiff(all_vars, y))
}
formula_parser(Sepal.Width ~ ., data = iris)
# $y
# [1] "Sepal.Width"
#
# $x
# [1] "Sepal.Length" "Petal.Length" "Petal.Width" "Species" |
Regarding point (1), under the proposal here, some of the implications are that:
Thus, it might not be necessary to create a new package, but then it'd be necessary to work with reverse dependencies to adapt. Regarding point (2), I do not think that withholding categorical support for now would be beneficial. At some point, the core xgboost library will have wider support for things that involve categorical variables and will hopefully have it enabled by default. We haven't even started work on the R interface yet and we don't know when it will be ready. Also, in my view, I would not consider SHAP support as a blocker - there's plenty of use-cases for xgboost that benefit from better handling of categorical features and which do not involve calculating SHAP values; and if needed, an external library could be used for it even if it might not be optimal. Regarding the proposed parser for formulas: I do not feel that such an interface would bring value to users if it's only used to select variables. In my opinion, one of the most useful aspects of formulas is that they can act as quick data preparation pipelines, which are helpful when e.g. prototyping models in an interactive session - for example:
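(The snippets below are hypothetical illustrations of what is meant by that - the formula-accepting `xgboost()` call and the column names are assumptions about the proposed interface, not existing functionality:)

```r
# Formulas acting as quick data-prep pipelines (illustrative only):
xgboost(log(y) ~ x1 + I(x2^2) + x3:x4, data = df)  # transformed response and derived features
xgboost(y ~ . - id_column, data = df)               # use every column except one
```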
In these cases, the extra processing would be lost, and the selection of variables could be achieved just as easily through other means. What's more, if the formula were to be used to one-hot-encode categorical variables, it may be less efficient than what a user could achieve by e.g. using a sparse encoder instead of base R's dense one when appropriate. Also, choosing the way of handling categorical variables like that would then raise the question of what happens when the related parameter is also passed. As such, I am thinking that if a formula parser is too complex to implement and there isn't any readily available parser with a compatible license, it might be better to not develop a formula interface, or maybe leave it for a future version.
As a side note, we support SHAP values with categorical data for the original SHAP algorithm (including interactions). The issue is that the plotting library needs some additional work to read the model properly. We can work on the SHAP package after clearing up some priorities.
@trivialfis Nice! So we need categorical support in R indeed at higher priority. In R, shapviz already works with categoricals, e.g., for LightGBM. So at least the plotting of SHAP values is ready in R. |
I just realized that the ALTREP system in newer R versions can be used (albeit in a rather hacky way) to manage custom serialization of arbitrary objects, provided that the user doesn't start interacting with and modifying those objects through R functions. Would be ideal if xgboost handles could be made to only trigger a serialization when calling a method like Here I've composed a small example of how to do this with Note that this will require changing the class and the structure of the current handles, so a first step would be to make these classes internal-only as mentioned in the first message. |
Thank you for sharing! That's quite interesting! The flexibility is a bit scary though, probably just due to my very limited understanding. I can learn more about the framework. I think if we were to use it, some documentation for future developers would be nice.
Indeed, that would be extremely helpful in keeping the states consistent. As for error handling, I think the only concern is that XGB is not great at forward compatibility, loading a new model with an older version of XGB might result in errors (C API returning |
As we haven't received any further comments about these proposals, I guess it's reasonable to start with a plan and checklist of things to do based on the current proposal. As I see it, the following would be a high-level overview of what needs to be done:
Other topics from the earlier issue which aren't mentioned in the list above:
Would like to hear some comments about it, particularly around the points above. After that, I guess we can create a roadmap issue to keep track of progress and discuss plans for working on these issues from different people. I think there's a GitHub feature for this, but I've never used it and don't know how, so perhaps maintainers could help here.
Thank you for writing a comprehensive list! I'm really excited for the proposed changes here. Let me try to comment on some of the items considered to be high priority:
One can run
If memory serves, I think the call to set types is already implemented in the current code base. We need additional support for recognizing factors automatically. I can double check later and share how to set the types.
I will debug it, do you have a specific example?
I will work on it, but would like to have some feedback from you as well. For example, is there any difference in the underlying arrays between data.table and data.frame?
Feel free to pick whichever fits your workflow best. One can use to-do items, multiple issues, or a GitHub project to track progress. Tip: this is a to-do item
I think there's a button in GitHub to split an issue with multiple items into multiple issues. I can help create a GitHub classical project to keep track of issues if needed. Feel free to ping me. |
Thanks for the hint. Can confirm that
Actually, this seems to have been already fixed when creating the model this way:

library(xgboost)
library(Matrix)
data("mtcars")
y <- mtcars$mpg
x <- mtcars[, -1]
x.sp <- as(as.matrix(x), "RsparseMatrix")
model <- xgboost(data=x.sp, label=y)

Nevertheless, since we want to remove the code behind this function, I guess it's not worth fixing at this point.
There is no difference in the arrays; both are just a list (an R object which in C has internal type `VECSXP`) of column arrays. Objects from … In terms of creating … I will create a separate roadmap issue with checklists to discuss work on these issues and progress.
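A quick illustration of that shared structure (plain base R, nothing xgboost-specific):

```r
# Both data.frame and data.table are ordinary R lists of column vectors:
typeof(iris)               # "list"  (VECSXP at the C level)
typeof(iris$Sepal.Length)  # "double" - each column is a regular atomic vector
length(iris)               # number of columns, i.e. list elements
```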
A couple of things I am wondering:

A. xgboost supports the sorted-indices splitting algorithm (…). Under this algorithm, infinite values should simply be treated as greater or smaller than every other value, and since the raw values aren't used in any other statistic, it should be possible to fit a model with them. But it seems that xgboost will not accept infinite values in the DMatrix construction regardless. Are infinite non-missing values never accepted in input features?

B. I see that there are functions like … Does that kind of function allocate arrays inside of the booster pointer? If so, does it mean that predicting on larger inputs makes the model heavier in memory?

C. I see there's some parallelization in the R wrapper for things like converting signed integers to unsigned. Part of the current operations involves retrieving a pointer to the data in every iteration of the loop, but if that part were removed by pre-retrieving the pointer, would that parallelization actually result in faster operations? As far as I understand, for a regular amd64 platform, that sort of operation should be so fast that it's basically limited by the speed at which memory is retrieved from RAM, and having multiple threads retrieving from RAM shouldn't make it faster.
I disabled inf inputs due to a lack of tests and other algorithms running into issues. To properly support it, we need to have test cases for all potential uses, including things like SHAP, categorical features, etc. At the moment, it's not a very high priority. A use case would help.
It's using thread-local static memory as a return buffer for thread-safe inference. The system will free it:

static thread_local int myvar;
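A slightly fuller sketch of that pattern (illustrative only, not the actual XGBoost code; the function name is made up):

```cpp
#include <vector>

// One buffer per thread: concurrent calls from different threads don't clash,
// and the storage is released automatically when each thread exits.
const float* PredictIntoThreadLocalBuffer(const std::vector<float>& results) {
  static thread_local std::vector<float> buffer;
  buffer = results;       // copy this call's results into the calling thread's buffer
  return buffer.data();   // pointer remains valid until the next call on the same thread
}
```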
I agree. Feel free to remove them.
I think we can entirely remove the data copies by using the array interface, which is also supported by the
Another question: are there any guidelines for using SIMD code in xgboost? I see, for example, there's a function in the R wrapper that subtracts 1 from elements in an array: xgboost/R-package/src/xgboost_R.cc, line 286 (at 95af5c0).
I was thinking of adding |
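For reference, a minimal sketch of what a vectorized subtract-one could look like with SSE2 intrinsics (an illustration of the idea only; the function name and signature are assumptions, not existing code):

```cpp
#include <emmintrin.h>  // SSE2
#include <cstddef>
#include <cstdint>

// Subtract 1 from every element, four at a time, with a scalar tail loop.
void SubtractOne(int32_t* data, std::size_t n) {
  const __m128i kOnes = _mm_set1_epi32(1);
  std::size_t i = 0;
  for (; i + 4 <= n; i += 4) {
    __m128i v = _mm_loadu_si128(reinterpret_cast<__m128i*>(data + i));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(data + i), _mm_sub_epi32(v, kOnes));
  }
  for (; i < n; ++i) data[i] -= 1;
}
```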
No, we don't have those. Feel free to add it. I thought about having an aligned allocator, but dropped the work as my attempt at using SSE did not speed up the histogram kernel. Another attempt was to use it to speed up allreduce on CPU; the difference was quite insignificant in my benchmark, so it was dropped again.
Hi @david-cortes, are you familiar with the error from the CRAN check caused by calling
I'm not sure how to fix it without removing the use of the https://cran.r-project.org/web/checks/check_results_xgboost.html Update |
We have a CI job there that replicates the Debian clang r-devel build from CRAN. |
Yes, that seems to be the usual clang formatting warning, which is solved by using |
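If it is indeed the usual format-string warning, the typical fix looks something like this (a guess at the pattern being referred to, not taken from the xgboost sources):

```c
#include <R.h>

void fail_with_message(const char *msg) {
    /* Passing a user-supplied string through an explicit "%s" format
       avoids the "format string is not a string literal" complaint. */
    Rf_error("%s", msg);   /* instead of Rf_error(msg) */
}
```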
@trivialfis In the previous thread, you mentioned this as one of the desired things to implement for the R interface:
Is this supported in the python interface? Looks like it errors out if you try to pass multiple quantiles. If this is supported in python, could you provide an example of what this means? |
Thank you for sharing! I appreciate the reference. As long as CRAN is comfortable with the use of CMake, I'm sure there's a way to sort out all the technical details.
The CRAN mirror on GitHub is also a good reference for this. https://github.com/search?q=cmake+org%3Acran+path%3ADESCRIPTION&type=code Everything you see on there has been accepted to CRAN and met its requirements. Those might not be the best ways to use CMake in a package, but they at least provide some precedents of approaches that CRAN accepted and which pass the CRAN checks. |
Somehow saw this thread today: https://www.mail-archive.com/[email protected]/msg09395.html about the use of cmake. |
@trivialfis You previously mentioned that you wanted to remove support for I see there is also support for CSV (which I assume is also meant to be removed) and for a binary DMatrix format. I am wondering how these plans would play out with the binary format for DMatrices, which I assume is not meant to be removed. They'd still need to accept string/character types as constructors, expand user paths, make sure files exist, etc. Is the idea to remove function |
I don't think I mentioned I want to remove the predict method? Maybe there's a miscommunication? Yes, the text file reading capability will be deprecated, including CSV. No, the binary format will not be removed at the moment; we don't have any plan to change it yet. We haven't been able to deprecate URI loading yet since the work-in-progress federated learning is using it at the moment and there's no alternative. Since R doesn't have such legacy, I thought it might ease the task if we don't need to implement it in the DMatrix class. Initially we will just deprecate the text loading, but the function will continue to be useful for loading the binary format. So, yes, many of the file path routines will continue to be needed. On the other hand, if you think the high-level interface is going to take over and there's no need for a user to interact with a normal DMatrix (non-external-memory), then feel free to remove the binary format in R. I don't have a strong opinion that it must be present in R.
It currently supports all DMatrix constructor inputs, so it also accepts files as inputs, and has docs around handling of file inputs.
I'd prefer to keep the functionality for saving DMatrices in binary format if it's not deprecated.
Then I think it should suffice to just change the current reader to the URI reader, and then once the C functions remove or deprecate this functionality, it will be applied to the R interface without needing to change anything. |
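For reference, the binary DMatrix round trip mentioned above, using the function names from the current R package (a minimal sketch; the file name is made up):

```r
library(xgboost)
data(mtcars)
dm <- xgb.DMatrix(data = as.matrix(mtcars[, -1]), label = mtcars$mpg)
xgb.DMatrix.save(dm, "mtcars.dmatrix")  # save in the binary DMatrix format
dm2 <- xgb.DMatrix("mtcars.dmatrix")    # load it back through the file-path constructor
```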
That's unfortunate. Let's remove it.
Yes. I will be on holiday next week for roughly two weeks; I would really appreciate it if @hcho3 could continue the discussion around R here. Please let me know if there's anything I can help with within the next couple of days.
In that case, let's continue this after the PRs for in-place prediction and the URI loader have been merged.
@trivialfis In the other issue you mentioned:
What kind of tests did you have in mind for this topic? |
Any unit test that can show the behavior as described in the docs should work. I take tests as a form of documentation.
Aren't there already tests for most of those? Only things I can think of from the list that are currently missing are tests checking categorical features in text/json/table dumps. The rest (e.g. iteration numbers, class encodings, etc.) should already be covered in current tests. |
On a deeper look, I noticed that removing on-the-fly DMatrix creation in There are cases like |
@hcho3 @trivialfis I see in the AFT tutorial: It mentions that left-censored data should be passed as having a lower bound of zero. Does it support data that is censored at exactly zero or less than zero? If so, how should it be passed? From some experiments, it seems to accept I'm particularly wondering if |
XGBoost expects the survival time to be nonnegative, since it's not on the log scale. I think a 0 should be acceptable.
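A small sketch of how censored AFT labels are passed in the current R interface (field names as in the AFT tutorial; the data here is made up):

```r
library(xgboost)
set.seed(1)
X <- matrix(rnorm(300), nrow = 100)
lower <- runif(100, 1, 5)        # lower bounds of the survival time
upper <- lower + runif(100, 0, 5)
upper[1:20] <- Inf               # right-censored rows: no upper bound
lower[21:30] <- 0                # left-censored rows: lower bound of zero

dtrain <- xgb.DMatrix(X)
setinfo(dtrain, "label_lower_bound", lower)
setinfo(dtrain, "label_upper_bound", upper)
bst <- xgb.train(
  params = list(objective = "survival:aft", eval_metric = "aft-nloglik"),
  data = dtrain, nrounds = 10
)
```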
@trivialfis Something that's not quite clear to me after reading the docs about I see the R docs mention for
But this What does it refer to? Is it ever not equal to 1? |
That's the |
Can we come up with a good dataframe representation for vector-leaf trees? I plan to support it in this release, but I haven't got an efficient representation format yet. It would be great if the format is consistent between Python and R. |
Do you mean the function |
Thank you for sharing. @david-cortes I know this has been mentioned before, but do let us know if you would like to become a maintainer for XGBoost around the R package. Your work has been really impressive. |
I see there are many R demos for XGBoost of the kind that can be triggered through I am wondering whether anyone actually checks such examples (I'd say demos are a very obscure R functionality nowadays after the introduction of rmarkdown and similar), and whether it makes sense to have them when there are already vignettes (which are much more likely to be seen by users). I'm thinking it'd be a good idea to either remove all those demos, or turn them into vignettes. Turning them into vignettes would be quite a lot of work, so I'm leaning towards simply deleting them. @trivialfis @hcho3 What do you guys think? |
Good idea! I am never checking such demos. I think we could delete them and slightly extend existing vignettes or examples. |
Yes, let's remove the demos. |
I just realized that back when we removed the .Rnw files due to being auto-generated, the repository didn't actually include the source file (.Rmd) that is meant to generate the .Rnw file - either the file was missing, or the .Rnw might have been manually written without a .Rmd. As a result, the vignette "xgboost: eXtreme Gradient Boosting" is effectively missing from the repository now. I was also just going to suggest removing it, because everything it shows is already covered by the other vignettes and I do not see much of a point in having two introductory vignettes that have a large overlap with each other; but given that it's already deleted, I propose now to leave it deleted and keep the newer vignette "XGBoost presentation" in its place.
@mayer79 (switching the discussion back to this thread) Regarding the CRAN release, I think the decision hasn't been made yet. Here are my concerns about reusing the existing CRAN package:
Regarding the second item, my suggestion is to simply remove all functions that haven't been revamped yet and don't have a consistent interface with the new |
Hi @david-cortes @mayer79, would you like to share what changes are in the pipeline for the next major release? I would like to help as much as possible. In addition, now seems to be the right time to discuss how to create a new package. For other parts of XGBoost, #11088 and the JVM version of external memory (cc @wbo4958 ) are the remaining changes. It might take some time due to the end of the year, but no more new features are planned for the 3.0 release. We revised the CI with the help of @hcho3 , so we can't make any patch release for older versions and don't want to stall the new release for too long.
Only remaining must-haves (after current PRs get merged) are around docs, examples, and vignettes. Should hopefully get finished soon, so no help is needed for now on that side. |
@trivialfis I see python has |
@david-cortes It's the external memory version of It's part of ongoing work on the external memory. The remaining major change is the CV optimization we previously discussed, which will happen after the next release. |
ref #9475
I'm opening this issue in order to propose a high-level overview of how an idiomatic R interface for `xgboost` should ideally look, along with some thoughts on how it might be implemented (note: I'm not very familiar with xgboost internals, so I have several questions). It would be ideal to hear comments from other people regarding these ideas. I'm not aware of everything that `xgboost` can do, so perhaps I'm missing something important here. (Perhaps a GitHub issue is not the most convenient format for this, but I don't know where else to put it.)
Looking at the python and C interfaces, I see that there's a low-level interface with objects like `Booster` and associated functions/methods like `xgb.train` that more or less reflects how the C interface works, and then there's a higher-level scikit-learn interface that wraps the lower-level functions into a friendlier interface that works more or less the same way as scikit-learn objects, which is the most common ML framework for python.

In the current R interface, there's to some degree a division into lower/higher-level interfaces with `xgb.train()` and `xgboost()`, but there's less of a separation, as e.g. both return objects of the same class, and the high-level interface is not very high-level, as it doesn't take the same kinds of arguments as base R's `stats::glm()` or other popular R packages - e.g. it takes neither formulas nor data frames.

In my opinion, a more ideal R interface could consist of a mixture of a low-level interface reflecting the way the underlying C API works (which ideally most reverse dependencies and very latency-sensitive applications should be using), just like the current python interface, plus a high-level interface that would make it ergonomic to use at the cost of some memory and performance overhead, behaving similarly to core and popular R packages for statistical modeling. Especially important in this higher-level interface would be to make use of all the rich metadata that R objects have (e.g. column names, levels of factor/categorical variables, specialized classes like `survival::Surv`, etc.), for both inputs and outputs.

In the python scikit-learn interface, there are different classes depending on the task (regression / classification / ranking) and the algorithm mode (boosting / random forest). This is rather uncommon in R packages - for example, `randomForest()` will switch between regression and classification according to the type of the response variable - and I think it could be better to keep everything in the high-level interface under a single function `xgboost()` and a single class returned from that function, at the expense of more arguments and more internal metadata in the returned objects.

In my view, a good separation between the two would involve:
- Having the low-level interface work only with `DMatrix` (which could be created from different sources like files, in-memory objects, etc.), and the high-level interface work only with common R objects (e.g. `data.frame`, `matrix`, `dgCMatrix`, `dgRMatrix`, etc.), but not with `DMatrix` (IMO this would not only make the codebase and docs simpler, but also, given that both have different metadata, would make it easier to guess what to expect as outputs from other functions).
- Having the `predict` function for the object returned by `xgb.train()` mimic the prediction function from the C interface and be mindful of details like not automatically transposing row-major outputs, while the `predict` function for the object returned from `xgboost()` would mimic base R and popular R packages and be able to e.g. return named factors for classification objectives.
- Having separate `cv` functions for the low-level and the high-level interfaces (e.g. `xgboost.cv()` or `cv.xgboost()`, like there is `cv.glmnet()`; and another `xgb.cv()` that would work with `DMatrix`).
- Keeping helpers like `xgboost::normalize` and internal classes/methods like `Booster.handle` internal-only.

A couple of extra details **about the high-level interface** that in my opinion would make it nicer to use (a rough sketch of what such calls could look like follows this list). I think many people might disagree with some of the points here though, so I would like to hear opinions on these ideas:
- It should accept `x` in the format of common R classes like `data.frame`, `matrix`, `dgCMatrix`, perhaps `arrow` (although I wouldn't consider it a priority), and perhaps other classes like `float32` (from the `float` package). I don't think it'd be a good idea to support more obscure potential classes like `dist` that aren't typically used in data pipelines.
- `predict` should additionally take `dgRMatrix` and `sparseVector` like it does currently.
- It should accept different classes for `y` depending on the objective, and by default, if no objective is passed, should select one based on the type of `y`:
  - `factor` with 2 levels -> `binary:logistic`.
  - `factor` with >2 levels -> `multi:softmax`.
  - `survival::Surv` -> `survival:aft` (with normal link), or maybe `survival:cox` if only right-censored.
  - anything else -> `reg:squarederror`.

  That is, it should accept `factor` types for classification, `Surv` for survival, and numeric/integer for regression (taking `logical` (boolean) as 0/1 numeric).
- Inputs passed as `data.frame` or `matrix` should keep their column names as metadata if they had any.
- If `x` is a `data.frame`, it should automatically recognize `factor` columns as categorical types, and (controversial) also take the `character` (string) class as categorical, converting it to `factor` on the fly, just like R's formula interface for GLMs does. Note that not all popular R packages doing the former would do the latter.
- See also the question below about sparse types and categorical types, which relates to `x` being a `data.frame` with `factor`/`character` types.
- If `x` is either a `data.frame` or otherwise has column names, then arguments that reference columns in the data should be able to reference them by name. For example, if `x` has 4 columns `[c1, c2, c3, c4]`, it should allow passing `monotone_constraints` as either `c(-1, 0, 0, 1)`, or as `c("c1" = -1, "c4" = 1)`, or as `list(c1 = -1, c4 = 1)`; but not as e.g. `c("1" = -1, "4" = -1)`.
- If `x` is a `matrix`/`dgCMatrix` without column names, it should accept them as `c(-1, 0, 0, 1)`, or as `c("1" = -1, "4" = -1)`, erroring out on non-integer names, not allowing negative numbers, and not allowing a `list` with length < ncols, as the matching would be ambiguous.
- If `x` has column names, column-vector inputs like `qid` should be taken from what's passed on the arguments, and not guessed from the column names of `x` like the python scikit-learn interface does.
- Parameters could perhaps be passed through a function like `xgb.train_control()` or similar (like `C50::C5.0Control`) that would return a `list`, or perhaps they should all be top-level arguments. I don't have a particular preference here, but:
  - top-level arguments should include `objective`, `nthread`, `verbosity`, `seed` (if not using R's - see the next point below), and `booster`; plus arguments that reference column names or indices, such as `monotone_constraints` and `interaction_constraints`; and data inputs like `base_score` or `qid`.
  - `nrounds` I guess is most logical to put in the list of additional parameters, but given that it's a required piece of metadata for e.g. determining output dimensions, perhaps it should also be top-level.
  - If using a function like `xgb.train_control(eta=0.01)`, it could return only `list(eta = 0.01)` instead of a full list of parameters. In this case, it might not be strictly required to know the default values of every parameter for the purposes of developing this interface, but it would be nice if they were easily findable from the signature anyway.
- It could take its own `seed` argument and only rely on R's PRNG if `seed` is not passed.
- It should have `print`/`show` and `summary` methods that would display info such as the objective, booster type, rounds, dimensionality of the data to which it was fitted, whether it had categorical columns, etc. Perhaps `summary` and the `print` method could both print the exact same information. Not sure if there's anything additional worth putting into the `summary` method that wouldn't be suitable for `print`. Perhaps the evaluation metric or objective function per round could be shown (head + tail) for `summary`, but maybe it's not useful.
- `print` should not only print but also return the model object as invisible; methods like `predict` should take the same named arguments as other packages, like `predict(object, newdata, type, ...)`, return an output with one row per input in the data, keep row names in the output if there were any in the input, use the name `(Intercept)` instead of `BIAS`, etc.
- Several parameters and outputs involve indices - e.g. `interaction_constraints`, `iterationrange` for prediction, node indices from `predict`, etc. Perhaps the high-level interface could do the conversion internally, but I'm not sure if I'm missing more potential places where indices about something might be getting used, and probably more such parameters and outputs will be added in the future.
- The `predict` function should:
  - have a `type` argument with potential values like "response", "class", "raw", "leaf", "contrib", "interaction", "approx.contrib", "approx.interaction" (not sure what to name the last 2 though).
  - return `factor` types for the `class` type in classification objectives.
  - use the column names of `x` for outputs of `contrib`/`interaction`, and factor levels of `y` as column names from e.g. `response` in multi-class classification and class-specific `contrib`/`interaction` (named as `level:column`, like base R).
- Functions should not modify their inputs in-place, whether R objects, `DMatrix` or handles - e.g. calling the plotting function should not leave the input with an additional `Importance` column after the call is done, nor change anything else that `x` had.
- The `predict` function for the x/y interface should have a parameter to determine whether it should subset/re-order columns of `newdata` according to the column names that were used in training, and in this case, it should also recode factor levels. Note that this subsetting and reordering is always done for base R's `predict.glm`, but factors are not re-coded there (so e.g. if the levels in factor columns of `newdata` differ from the training data, `predict.glm` would instead output something that might not make sense).
- For custom objectives with classification data, perhaps `y` could be passed to the objective function as a `factor` instead of the binarized numeric vector that xgboost will use internally.
- Perhaps custom objectives could also allow passing a custom function for `predict` that would be kept inside the output, etc. Don't have a strong preference here though, and I expect most users of custom objectives would be using the low-level interface anyway.
- It could support R's family objects (like `stats::poisson(link = "log")`) and the structure they involve as custom objectives (like `glmnet` does), so that one could pass a family-compliant object from other packages as the objective. I see an issue here though in that these families are meant for Fisher scoring, which means that in the case of non-canonical link functions like `binomial(link = "probit")`, they wouldn't calculate the true Hessian function, but I guess Fisher's should work just as fine with gradient boosting. Not an expert in this topic though.
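To make the points above more concrete, here is a rough sketch of what calls to such a high-level interface might look like; the signature, argument names, and behaviors shown are assumptions for discussion based on this list, not an existing API:

```r
# Hypothetical usage of the proposed high-level interface (not existing functionality);
# 'df' and 'df_new' are assumed to be data.frames with columns c1..c4 and a factor 'outcome'.
model <- xgboost(
  x = df[, c("c1", "c2", "c3", "c4")],        # factor columns recognized as categorical
  y = df$outcome,                             # factor response -> classification objective chosen automatically
  monotone_constraints = c(c1 = -1, c3 = 1),  # numeric columns referenced by name
  nthread = 4,
  nrounds = 100
)
print(model)   # objective, booster type, rounds, data dimensionality, ...

pred_class <- predict(model, df_new, type = "class")    # factor with the original levels of 'outcome'
pred_shap  <- predict(model, df_new, type = "contrib")  # columns named after the columns of x
```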
The low-level interface in turn should support everything that the C interface offers - e.g. creating a `DMatrix` from libsvm files, specifying categorical columns from a non-data-frame array, etc.

As per the formula interface, this is quite tricky to implement in a good way for xgboost. In the case of linear models, it's quite handy to create these features on-the-fly to find out good transformations and e.g. be able to call stepwise feature selectors on the result, but:

- Transformations like `log(x1)` have no effect in decision trees, transformations like `x1:x2` don't have the same practical effect as decision trees implicitly create interactions, and transformations like `x^2` the way R does them do not make a difference for decision trees compared to the simpler `I(x^2)+x`, for example.
- Transformations of the response like `log(y)` aren't any harder to do with an x/y interface than with a formula interface.

Nevertheless, a formula interface can still be handy for calls like `y ~ x1 + x2 + x3` when there are many columns, and packages like `ranger` offer such an interface, so perhaps it might be worth having one, even if it's not the recommended way to use the package.

Some nuances in terms of formulas though:
- By default, formulas add a column `(Intercept)` that would have the same value for every row.
- Using a `formula` for processing the training data also implies using it for `predict` - so, for example, formulas do not recode levels of factor variables when calling `predict`, which the x/y interface could potentially do, leading to differences in behavior between the two interfaces.
- Some packages interpret formulas like `y ~ x | z` differently from what base R would do (for example, `lme4` would interpret `z` here as something that controls mixed effects for `x`, while base R would interpret this as a feature "x or z"), and in some cases `xgboost()` would also need a different interpretation of formulas (e.g. for the parameter `qid`, which doesn't fit on either side of the formula).
- Packages like `randomForest` don't use the base R formula parser, taking it instead (by copying the code) from a different library, `e1071`, which is GPLv2 licensed - incompatible with xgboost's Apache license.
- Handling parameters that reference columns, like `monotone_constraints`, could be tricky - e.g. if we remove the auto-added `(Intercept)` columns, should the numbers re-adjust?

Hence, supporting a formula interface for `xgboost()` would be tricky:

- One could add a `-1` at the end of the formula (which means "don't add an intercept") by converting it to string and back, in order to get one-hot encoding of factors and avoid adding `(Intercept)`, but I can foresee cases in which this might not work depending on how the formula is inputted.
- There's no obvious place to specify something like `qid` in these formulas.
- One could interpret `|` differently, but what should happen in this case if the user requests something like `xnum*xcat` or `f(xcat)` (with `f` being some arbitrary base function)?

Unless we can find some other package that would better handle formula parsing and that could reasonably be used as a dependency (I'm not aware of any), I think the best way forward here would be to:

- Not support parameters like `monotone_constraints` or `qid` in the formula interface.
- Convert the formula to `<curr formula> - 1` and error out if this doesn't succeed.
- Not do any extra `predict` post-processing, regardless of how the x/y interface does it.

A couple of questions for xgboost developers (@hcho3 @trivialfis ?) I have here:
- Does `xgboost` support passing a mixture of dense and sparse data together? `pandas` supports sparse types, but if I understand it correctly from a brief look at the source code of the python interface, it will cast them to dense before creating the `DMatrix` object. If it supports mixtures of both, I think it'd again be ideal if the R interface could also have a way to create such a `DMatrix` for the low-level interface. Not sure if there'd be an idiomatic way to incorporate this into the high-level interface though, unless allowing passing something like a `list`.
- Is there a `DMatrix` equivalent of `cbind`/`np.c_`? I don't see any, but this would make things easier.
- I see there's a C function `XGDMatrixCreateFromDT`, but if I understand it correctly from a look at the pandas processing functions in the scikit-learn interface, if the input involves types like `int64`, these would be cast to a floating-point type. In R, especially when using `data.table`, it's oftentimes common to have 64-bit integers from the package `bit64`, which, if cast to `float64`, would lose integral precision (or `int32`, which could lose precision when cast to `float32`) - I am wondering whether there should be a way to support these without loss of precision, and whether it's possible to efficiently create a `DMatrix` from a structure like an R `data.frame`, which is a list of arrays that aren't contiguous in memory and might have different dtypes.
- What should happen when someone adds a new objective, e.g. `binary:cauchit`, adds it to the python `XGBClassifier` class, but overlooks the R code as it might be unknown to the contributor, so the R interface then won't act properly on receiving this objective? I guess this is something to watch for in `xgboost` regardless.
- If R-side attributes are kept outside of the C `Booster`, this in theory could lead to issues when updating package versions and loading objects from an earlier version - for example, if a new field is added to the object class returned from `xgboost()`, an earlier object saved with `saveRDS` will not have such a field, which might lead to issues if it is assumed to exist. It'd be theoretically possible to add some function to auto-fill newly added fields as they are added to the R interface when e.g. restoring the booster handle, but this could potentially translate into a lot of maintenance work and be quite hard to test and easy to miss when adding new features.
- How does the python interface deal with `pickle` and object attributes that aren't part of the C `Booster`? How does it deal with converting between `Booster` and the scikit-learn classes?
- Regarding `saveRDS` not maintaining compatibility with future versions - is this compatibility meant to be left to the user to check? From a quick look at the code, I guess it only checks the version of the C `Booster`, but there could be cases in which the R interface changes independently of the C struct.
- When loading a serialized `Booster` into the high-level interface and there's no metadata to take, I guess the most logical thing would be to fill with generic names "V1..N", "1..N", etc. like base R does, but this would not lead to nice inter-op with the python scikit-learn interface. Does the python interface or other interfaces keep extra metadata that the `Booster` wouldn't? Is it somehow standardized?
- How does `xgboost` determine position in position-aware learning-to-rank? Is it just based on the row order of the input data, or does it look for something else like a specific column name for it?
- Is there a way to create a `DMatrix` directly from `arrow` data? If so, I guess support for `arrow` in R could be delayed until such a route is implemented. I guess the python interface would also benefit from such functionality being available at the C level.

Some other questions for R users (@mayer79 ?):
- Are there any popular R packages for GPU dataframes (like `cuDF` for python) or for distributed computing (other than spark) that I might perhaps be missing in this post?
- Are there any popular R packages/classes for ranking data that carry the `qid` and position information that `xgboost` uses? I am not aware of any, but I'm no expert in this area.
- … `factor` / `numeric` / etc.?