Add a clause SAMPLING, has the ability to sample based on probability #4700

YolandaLyj · 2022-10-09T10:49:07Z · edited by nevermore3

What type of PR is this?

feature

What problem(s) does this PR solve?

Issue(s) number:

4699

Description:

Adds the ability to sample query results based on probability, for example in proportion to a weight.

How do you solve it?

I added a clause named SAMPLING. The syntax is ... | SAMPLING <expression> <sample_number> [BINARY|ALIAS].

The BINARY implementation is based on binary search over the accumulated weights.
The ALIAS implementation is based on the alias sampling method.
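
To make the two strategies concrete, here is a minimal standalone sketch, not the code from this PR. The names binarySample, AliasTable, buildAliasTable and aliasDraw are hypothetical, weights are assumed to be non-negative with a positive sum, and sampling is assumed to be with replacement.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// BINARY idea: build prefix sums once, then binary-search a uniform draw.
// Each draw costs O(log n). Hypothetical helper, not the executor's API.
std::vector<std::size_t> binarySample(const std::vector<double>& weights,
                                      std::size_t count, std::mt19937& rng) {
  std::vector<std::size_t> picked;
  if (weights.empty()) return picked;
  std::vector<double> cumulative(weights.size());
  std::partial_sum(weights.begin(), weights.end(), cumulative.begin());
  if (cumulative.back() <= 0.0) return picked;  // nothing to sample from
  std::uniform_real_distribution<double> dist(0.0, cumulative.back());
  picked.reserve(count);
  for (std::size_t i = 0; i < count; ++i) {
    // The first prefix sum strictly greater than the draw marks the chosen row.
    auto it = std::upper_bound(cumulative.begin(), cumulative.end(), dist(rng));
    if (it == cumulative.end()) --it;  // clamp if rounding yields draw == total
    picked.push_back(static_cast<std::size_t>(it - cumulative.begin()));
  }
  return picked;
}

// ALIAS idea: O(n) table construction, then O(1) per draw.
struct AliasTable {
  std::vector<double> prob;        // chance of keeping column i itself
  std::vector<std::size_t> alias;  // fallback row for column i
};

AliasTable buildAliasTable(const std::vector<double>& weights) {
  const std::size_t n = weights.size();
  const double total = std::accumulate(weights.begin(), weights.end(), 0.0);
  AliasTable table{std::vector<double>(n, 1.0), std::vector<std::size_t>(n, 0)};
  std::vector<double> scaled(n);
  std::vector<std::size_t> small, large;
  for (std::size_t i = 0; i < n; ++i) {
    table.alias[i] = i;
    scaled[i] = weights[i] * static_cast<double>(n) / total;  // mean becomes 1
    (scaled[i] < 1.0 ? small : large).push_back(i);
  }
  while (!small.empty() && !large.empty()) {
    const std::size_t s = small.back();
    small.pop_back();
    const std::size_t l = large.back();
    large.pop_back();
    table.prob[s] = scaled[s];
    table.alias[s] = l;            // the large row fills the rest of column s
    scaled[l] -= 1.0 - scaled[s];
    (scaled[l] < 1.0 ? small : large).push_back(l);
  }
  return table;  // leftover columns keep prob 1.0
}

std::size_t aliasDraw(const AliasTable& table, std::mt19937& rng) {
  std::uniform_int_distribution<std::size_t> column(0, table.prob.size() - 1);
  std::uniform_real_distribution<double> coin(0.0, 1.0);
  const std::size_t i = column(rng);
  return coin(rng) < table.prob[i] ? i : table.alias[i];
}
```

The trade-off is the usual one: the prefix-sum approach is cheap to set up and suits a small number of draws, while the alias table costs more to build but makes each draw constant time.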

Special notes for your reviewer, ex. impact of this fix, design document, etc:

Checklist:

Tests:

Unit test (positive and negative cases)
Function test
Performance test
N/A

Affects:

Documentation affected (Please add the label if documentation needs to be modified.)
Incompatibility (If it breaks the compatibility, please describe it and add the label.)
If it's needed to cherry-pick (If cherry-pick to some branches is required, please label the destination version(s).)
Performance impacted: Consumes more CPU/Memory

Release notes:

Added a clause named SAMPLING, which samples result rows based on probability. You specify the expression that determines each row's probability, for example a weight.

CLAassistant · 2022-10-09T10:49:12Z

All committers have signed the CLA.

Sophie-Xie · 2022-10-13T11:55:54Z

@YolandaLyj Thanks for the contribution! Please check the commits; some unrelated ones were included.

YolandaLyj · 2022-10-14T03:11:08Z

> @YolandaLyj Thanks for the contribution! Please check the commits; some unrelated ones were included.

@Sophie-Xie I removed the unused commits.

codesigner · 2022-12-09T05:59:42Z

src/graph/executor/query/SamplingExecutor.cpp

+void SamplingExecutor::executeBinarySample(Iterator *iter, size_t index,
+                                           size_t count, DataSet &list) {
+  auto uIter = static_cast<U *>(iter);
+  std::vector<WeightType> accumulate_weights;


Please use the CamelCase format.

codesigner · 2022-12-09T06:03:27Z

src/graph/executor/query/SamplingExecutor.cpp

+  while (it != uIter->end()) {
+    v = 1.0;
+    if ((*it)[index].type() == Value::Type::NULLVALUE) {
+      LOG(WARNING) << "Sampling type is nullvalue";


If the dataset has many nulls, this may print a lot of WARNING logs. If that is the case, I advise against using WARNING logs here.
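
One way to act on this suggestion, sketched under assumptions rather than taken from the PR: count NULL rows during the scan and report them once at the end instead of logging per row. Here std::optional<double> stands in for a nullable column value, the names WeightScan, scanWeights and nullSummary are hypothetical, and NULL rows are assumed to keep the default weight of 1.0 that the diff assigns before the type check.

```cpp
#include <cstddef>
#include <optional>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical result of one pass over the weight column.
struct WeightScan {
  std::vector<double> weights;
  std::size_t nullRows = 0;  // how many rows had no value
};

// Collects weights and counts NULLs instead of logging each one.
WeightScan scanWeights(const std::vector<std::optional<double>>& column,
                       double defaultWeight = 1.0) {
  WeightScan scan;
  scan.weights.reserve(column.size());
  for (const auto& cell : column) {
    if (cell.has_value()) {
      scan.weights.push_back(*cell);
    } else {
      ++scan.nullRows;
      scan.weights.push_back(defaultWeight);  // assumed fallback weight
    }
  }
  return scan;
}

// Builds at most one message per scan; returns an empty string when there
// were no NULL rows, so the caller can skip logging entirely.
std::string nullSummary(const WeightScan& scan) {
  if (scan.nullRows == 0) return "";
  std::ostringstream out;
  out << "Sampling column had " << scan.nullRows
      << " NULL value(s); default weight used";
  return out.str();
}
```

The caller would then emit the summary once, at whatever log level the project considers appropriate.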

codesigner · 2022-12-09T06:03:46Z

src/graph/executor/query/SamplingExecutor.cpp

+    } else if ((*it)[index].type() == Value::Type::INT) {
+      v = (float)((*it)[index].getInt());
+    } else {
+      LOG(WARNING) << "Sampling type is wrong, must be int or float.";


codesigner · 2022-12-09T06:11:10Z

Thanks for the contribution. It is an excellent feature for those who want to integrate GNN with Nebula!

YolandaLyj · 2023-01-04T07:30:17Z

@codesigner I removed the unneeded warning and switched to the CamelCase format.

codesigner · 2023-01-10T07:36:23Z

@YolandaLyj We use the Google style. I fixed some lint problems that have nothing to do with the code logic, but some remaining ones are related to the code. Please fix those so that the Checks pass.

Currently, it is blocked here: https://github.com/vesoft-inc/nebula/actions/runs/3881147321/jobs/6619841055

BTW, the latest code has been merged with master. If you plan to fix this, run git pull to fetch the latest changes from the remote before making your changes.

YolandaLyj · 2023-01-11T07:18:58Z

@codesigner I modified the code so that the check passes.

yixinglu

impressive!

yixinglu · 2023-03-30T10:46:55Z

src/parser/parser.yy

-    | KW_RETURN KW_DISTINCT match_return_items match_order_by match_skip match_limit {
-        $$ = new MatchReturn($3, $4, $5, $6, true);
+    | KW_RETURN KW_DISTINCT match_return_items match_sampling match_order_by match_skip match_limit {
+        $$ = new MatchReturn($3, $4, $5, $6, $7, true);


Could you give some test examples of the sampling feature in a MATCH statement?

YolandaLyj · 2023-04-07

@yixinglu Here's an example of what I often use:

Go from 1 over e0 yield id($$) as id, toFloat(properties(edge).weight) as weight | limit 100 | sampling $-.weight 2
+-----+--------+
| id  | weight |
+-----+--------+
| 654 | 1.0    |
| 1   | 1.0    |
+-----+--------+
Got 2 rows (time spent 1514/2528 us)

I get the neighbors of the node with ID 1 and the weights of the connecting edges, sample the neighbors by those weights, and finally get two rows.

yixinglu

LGTM

YolandaLyj requested review from dutor and codesigner as code owners October 9, 2022 10:49

YolandaLyj changed the title ~~Sampling dev~~ Add a clause SAMPLING which has the ability to sample based on probability Oct 9, 2022

YolandaLyj changed the title ~~Add a clause SAMPLING which has the ability to sample based on probability~~ Add a clause SAMPLING, has the ability to sample based on probability Oct 9, 2022

Add SAMPLING clause, probability-based sample

256d9b2

YolandaLyj force-pushed the sampling_dev branch 2 times, most recently from 896e139 to 256d9b2 Compare October 13, 2022 13:20

Rename sampler.h to Sampler.h

d720e12

Sophie-Xie requested a review from MuYiYong October 14, 2022 02:27

Sophie-Xie added the doc affected PR: improvements or additions to documentation label Oct 14, 2022

abby-cyber self-assigned this Nov 25, 2022

codesigner reviewed Dec 9, 2022

Shylock-Hg added the ready-for-testing PR: ready for the CI test label Dec 9, 2022

fix: remove unused warning

96fd97f

codesigner added 3 commits January 9, 2023 16:26

Merge branch 'master' into sampling_dev

3ad90fc

fix compile

4ad0257

Merge branch 'master' into sampling_dev

1145b9a

fix bugs found when checking cpplint code style

7804f35

codesigner and others added 4 commits January 11, 2023 15:20

Merge branch 'master' into sampling_dev

3536434

format change using clang-format

241b253

format change using clang-format

de461ae

Merge branch 'master' into sampling_dev

809cd49

yixinglu reviewed Mar 30, 2023

yixinglu approved these changes May 22, 2023

Merge branch 'master' into sampling_dev

70c9c87
