Add a SAMPLING clause that can sample based on probability #4700
@YolandaLyj Thanks for the contribution! Please check the commits; the PR includes some unrelated ones.
(force-pushed from 896e139 to 256d9b2)
@Sophie-Xie I removed the unused commits.
void SamplingExecutor::executeBinarySample(Iterator *iter, size_t index,
                                           size_t count, DataSet &list) {
  auto uIter = static_cast<U *>(iter);
  std::vector<WeightType> accumulate_weights;
Please use camelCase naming here.
while (it != uIter->end()) {
  v = 1.0;
  if ((*it)[index].type() == Value::Type::NULLVALUE) {
    LOG(WARNING) << "Sampling type is nullvalue";
If the dataset has many nulls, this may print a lot of WARNING logs. If that is the case, I advise against using WARNING logs here.
  } else if ((*it)[index].type() == Value::Type::INT) {
    v = (float)((*it)[index].getInt());
  } else {
    LOG(WARNING) << "Sampling type is wrong, must be int or float.";
ditto
Thanks for the contribution. It is an excellent feature for those who want to integrate GNN with Nebula!
@codesigner I removed the unused warnings and switched to the camelCase format.
@YolandaLyj we use the Google style. I fixed some lint problems that have nothing to do with the code logic, but there are some that are related to the code; please fix those and make the checks pass. Currently it is blocked here: https://github.com/vesoft-inc/nebula/actions/runs/3881147321/jobs/6619841055 BTW, the latest code has master merged in; if you plan to fix, do
@codesigner I modified the code so that the checks pass.
impressive!
| KW_RETURN KW_DISTINCT match_return_items match_order_by match_skip match_limit {
    $$ = new MatchReturn($3, $4, $5, $6, true);
| KW_RETURN KW_DISTINCT match_return_items match_sampling match_order_by match_skip match_limit {
    $$ = new MatchReturn($3, $4, $5, $6, $7, true);
Could you give some test examples of the sampling feature in a MATCH statement?
Here's an example of what I often use:
Go from 1 over e0 yield id($$) as id, toFloat(properties(edge).weight) as weight | limit 100 | sampling $-.weight 2
+-----+--------+
| id | weight |
+-----+--------+
| 654 | 1.0 |
| 1 | 1.0 |
+-----+--------+
Got 2 rows (time spent 1514/2528 us)
This gets the neighbors of the node with ID 1 together with the weights of the connecting edges, samples the neighbors according to those weights, and returns two rows.
LGTM
What type of PR is this?
What problem(s) does this PR solve?
Issue(s) number:
4699
Description:
The ability to sample based on probability
How do you solve it?
I added a clause named SAMPLING. The syntax is
... | SAMPLING <expression> <sample_number> [BINARY|ALIAS]
The BINARY implementation is based on binary search over the accumulated weights.
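As a rough illustration of the BINARY idea (a sketch only, not the code in this PR): build prefix sums of the weights, draw a uniform number below the total, and binary-search for the first prefix sum above it. The function name `sampleByBinarySearch`, the use of `double` weights, sampling with replacement, and treating negative weights as zero are all assumptions made for this example.

```cpp
#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper, not taken from the PR: draws `count` indices with
// replacement, picking index i with probability weights[i] / sum(weights).
std::vector<std::size_t> sampleByBinarySearch(const std::vector<double>& weights,
                                              std::size_t count,
                                              std::mt19937_64& rng) {
  // Prefix sums: accumulated[i] = weights[0] + ... + weights[i].
  std::vector<double> accumulated;
  accumulated.reserve(weights.size());
  double total = 0.0;
  for (double w : weights) {
    total += (w > 0.0 ? w : 0.0);  // assumption: negative weights count as zero
    accumulated.push_back(total);
  }

  std::vector<std::size_t> picked;
  if (total <= 0.0) {
    return picked;  // nothing can be sampled
  }
  picked.reserve(count);

  std::uniform_real_distribution<double> uniform(0.0, total);
  for (std::size_t i = 0; i < count; ++i) {
    const double r = uniform(rng);
    // First prefix sum strictly greater than r; zero-weight rows are skipped
    // because their prefix sum equals the previous one.
    auto it = std::upper_bound(accumulated.begin(), accumulated.end(), r);
    if (it == accumulated.end()) {
      // Guard for r rounding up to total: fall back to the last row with
      // positive weight.
      it = std::lower_bound(accumulated.begin(), accumulated.end(), total);
    }
    picked.push_back(static_cast<std::size_t>(it - accumulated.begin()));
  }
  return picked;
}
```

Building the prefix sums costs O(n) and each draw costs O(log n), which suits cases where only a few samples are taken from a result set.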
The implementation of ALIAS is based on alias sampling.
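For comparison, here is a minimal sketch of the alias method (Vose's variant), again only an illustration and not the code in this PR. The names `AliasTable`, `buildAliasTable`, and `sampleAlias` are hypothetical, and the sketch assumes a non-empty list of non-negative weights with a positive sum.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical alias table, not taken from the PR.
struct AliasTable {
  std::vector<double> prob;        // chance of keeping column i when it is drawn
  std::vector<std::size_t> alias;  // index returned otherwise
};

// Assumes weights is non-empty, all entries are >= 0, and the sum is > 0.
AliasTable buildAliasTable(const std::vector<double>& weights) {
  const std::size_t n = weights.size();
  AliasTable table{std::vector<double>(n, 1.0), std::vector<std::size_t>(n, 0)};

  double total = 0.0;
  for (double w : weights) total += w;

  // Scale the weights so that the average column height is exactly 1.
  std::vector<double> scaled(n);
  std::vector<std::size_t> small, large;
  for (std::size_t i = 0; i < n; ++i) {
    table.alias[i] = i;
    scaled[i] = weights[i] * n / total;
    (scaled[i] < 1.0 ? small : large).push_back(i);
  }

  // Pair each under-full column with an over-full one that tops it up.
  while (!small.empty() && !large.empty()) {
    const std::size_t s = small.back();
    small.pop_back();
    const std::size_t l = large.back();
    large.pop_back();
    table.prob[s] = scaled[s];
    table.alias[s] = l;
    scaled[l] -= 1.0 - scaled[s];
    (scaled[l] < 1.0 ? small : large).push_back(l);
  }
  // Columns left over keep prob 1.0, which absorbs floating-point rounding.
  return table;
}

// Draws one index in O(1): pick a column uniformly, then flip a biased coin.
std::size_t sampleAlias(const AliasTable& table, std::mt19937_64& rng) {
  std::uniform_int_distribution<std::size_t> column(0, table.prob.size() - 1);
  std::uniform_real_distribution<double> coin(0.0, 1.0);
  const std::size_t i = column(rng);
  return coin(rng) < table.prob[i] ? i : table.alias[i];
}
```

The table takes O(n) to build, after which every draw is O(1), so this trade-off tends to pay off when many samples are drawn from the same set of weights.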
Special notes for your reviewer, ex. impact of this fix, design document, etc:
Checklist:
Tests:
Affects:
Release notes:
Added a clause named SAMPLING, which lets you sample results according to probabilities. You specify the relative probability of each row, for example through an edge weight.