Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a clause SAMPLING, has the ability to sample based on probability #4700

Open
wants to merge 12 commits into
base: master
Choose a base branch
from

Conversation

YolandaLyj
Copy link

@YolandaLyj YolandaLyj commented Oct 9, 2022

What type of PR is this?

  • feature

What problem(s) does this PR solve?

Issue(s) number:

4699

Description:

The ability to sample based on probability

How do you solve it?

I added a clause named SAMPLING. The syntax is ...| SAMPLING <expression> <sample_number> [BINARY|ALIAS].

The implementation of BINARY is based on binary lookups.
The implementation of ALIAS is based on alias sampling.

Special notes for your reviewer, ex. impact of this fix, design document, etc:

Checklist:

Tests:

  • Unit test(positive and negative cases)
  • Function test
  • Performance test
  • N/A

Affects:

  • Documentation affected (Please add the label if documentation needs to be modified.)
  • Incompatibility (If it breaks the compatibility, please describe it and add the label.)
  • If it's needed to cherry-pick (If cherry-pick to some branches is required, please label the destination version(s).)
  • Performance impacted: Consumes more CPU/Memory

Release notes:

Added a clause named SAMPLING. Based on SAMPLING, you can sample the results based on probabilities. The magnitude of the probability is specified by you, for example according to the weight.

@CLAassistant
Copy link

CLAassistant commented Oct 9, 2022

CLA assistant check
All committers have signed the CLA.

@YolandaLyj YolandaLyj changed the title Sampling dev Add a clause SAMPLING which has the ability to sample based on probability Oct 9, 2022
@YolandaLyj YolandaLyj changed the title Add a clause SAMPLING which has the ability to sample based on probability Add a clause SAMPLING, has the ability to sample based on probability Oct 9, 2022
@Sophie-Xie
Copy link
Contributor

@YolandaLyj Thanks for contribution! Pls check the commit, you added some others.

@YolandaLyj YolandaLyj force-pushed the sampling_dev branch 2 times, most recently from 896e139 to 256d9b2 Compare October 13, 2022 13:20
@Sophie-Xie Sophie-Xie requested a review from MuYiYong October 14, 2022 02:27
@Sophie-Xie Sophie-Xie added the doc affected PR: improvements or additions to documentation label Oct 14, 2022
@YolandaLyj
Copy link
Author

@YolandaLyj Thanks for contribution! Pls check the commit, you added some others.

@Sophie-Xie I removed the unused commits.

@abby-cyber abby-cyber self-assigned this Nov 25, 2022
void SamplingExecutor::executeBinarySample(Iterator *iter, size_t index,
size_t count, DataSet &list) {
auto uIter = static_cast<U *>(iter);
std::vector<WeightType> accumulate_weights;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please use the CamelCase format

while (it != uIter->end()) {
v = 1.0;
if ((*it)[index].type() == Value::Type::NULLVALUE) {
LOG(WARNING) << "Sampling type is nullvalue";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the dataset have many null, seem this may print a lot WARNING logs? If that is the condition, I advise not use WARNING logs

} else if ((*it)[index].type() == Value::Type::INT) {
v = (float)((*it)[index].getInt());
} else {
LOG(WARNING) << "Sampling type is wrong, must be int or float.";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

@Shylock-Hg Shylock-Hg added the ready-for-testing PR: ready for the CI test label Dec 9, 2022
@codesigner
Copy link
Contributor

codesigner commented Dec 9, 2022

Thinks for the contribution, It is excellent feature for those who want integrate GNN with Nebula!

@YolandaLyj
Copy link
Author

YolandaLyj commented Jan 4, 2023

@codesigner I removed the unused warning and use the CamelCase format.

@codesigner
Copy link
Contributor

@YolandaLyj we use the google stype, I fixed some lint problem which has nothing to do with code logic, but there some that related to code, please fix that and make the Checks pass;

currently, it is blocked at here: https://github.com/vesoft-inc/nebula/actions/runs/3881147321/jobs/6619841055

BTW, latest code have merged master, if you plan to fix, do git pull to fetch the latest change from remote before do the change.

@YolandaLyj
Copy link
Author

@codesigner I modified the code so that the check passes.

Copy link
Contributor

@yixinglu yixinglu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

impressive!

| KW_RETURN KW_DISTINCT match_return_items match_order_by match_skip match_limit {
$$ = new MatchReturn($3, $4, $5, $6, true);
| KW_RETURN KW_DISTINCT match_return_items match_sampling match_order_by match_skip match_limit {
$$ = new MatchReturn($3, $4, $5, $6, $7, true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you give some test examples about the sampling feature in match statement?

Copy link
Author

@YolandaLyj YolandaLyj Apr 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yixinglu

Here's an example of what I often use

Go from 1 over e0 yield id($$) as id, toFloat(properties(edge).weight) as weight | limit 100 | sampling $-.weight 2
+-----+--------+
| id | weight |
+-----+--------+
| 654 | 1.0 |
| 1 | 1.0 |
+-----+--------+
Got 2 rows (time spent 1514/2528 us)

I get the neighbors of the node with ID 1, and the weights of the connecting edges, and sample the neighbors through the weights, and finally get two values.

Copy link
Contributor

@yixinglu yixinglu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
doc affected PR: improvements or additions to documentation ready-for-testing PR: ready for the CI test
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants