
very slow on initializing gpmodel for non-consecutive group_data #4

Closed
aprilffff opened this issue Jun 12, 2020 · 10 comments
Comments

@aprilffff

As in #3, initialization is fast for consecutive group_data, but non-consecutive group_data does not seem to be handled properly.
I tested this on a powerful machine: initialization takes 0.4 s with consecutive group data and more than 1800 s with non-consecutive group data.

Please help with this, because the lgb model requires the data in its query order, which can be very different from the order of the group_data required by the Gaussian model.

@fabsig
Owner

fabsig commented Jun 12, 2020

Just to double check, by non-consecutive you mean e.g. [1, 1, 5, 60, 60] instead of [1, 1, 2, 3, 3]? Can you provide a minimal working example?

@aprilffff
Author

Like [1,2,3,1,2,3].

@fabsig
Owner

fabsig commented Jun 12, 2020

Which setting do you use? Number of samples and number of different groups?

@aprilffff
Author

What do you mean by setting?
[1,2,3,1,2,3] means 6 samples: the first and fourth belong to group 1, the second and fifth to group 2, and the third and last to group 3.
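
To make the layout concrete, here is a minimal sketch in plain Python that maps each group label to the positions of its samples. The helper name `indices_by_group` is hypothetical and is not part of gpboost; it only illustrates what "non-consecutive" group_data looks like, not how the library handles it internally.

```python
from collections import defaultdict


def indices_by_group(group_data):
    """Map each group label to the zero-based positions of its samples.

    Hypothetical helper for illustration only; not a gpboost function.
    A single pass over the data, so the order of the labels does not matter.
    """
    groups = defaultdict(list)
    for position, label in enumerate(group_data):
        groups[label].append(position)
    return dict(groups)


# Non-consecutive labels: samples of the same group are interleaved.
group_data = [1, 2, 3, 1, 2, 3]
print(indices_by_group(group_data))
# prints {1: [0, 3], 2: [1, 4], 3: [2, 5]}
```

With zero-based positions, group 1 holds the first and fourth samples (0 and 3), group 2 the second and fifth, and group 3 the third and last.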

@fabsig
Owner

fabsig commented Jun 12, 2020

[1,2,3,1,2,3] means a total of 6 samples and 3 different groups. I guess this gives no problem :-). When do you experience performance issues?

@aprilffff
Author

That happens every time I initialize gpboost. My dataset has about 10 million samples and 2000+ groups, and the samples in each group are not consecutive, just like in the example above.

@fabsig
Owner

fabsig commented Jun 12, 2020

I see. 10'000'000 samples and 2'000 groups. I will investigate this.

@aprilffff
Author

By the way, I cannot save the gp_model either. I'm using the Python wrapper of gpboost, but gpb.save_model() only saves the GBDT part, not the Gaussian model.

@fabsig
Owner

fabsig commented Jun 12, 2020

Yes, that is correct. Saving of the GPModel is not implemented. But this is another topic. Please open another issue if this feature is desirable for you.

@fabsig
Owner

fabsig commented Jun 12, 2020

I have fixed this now. Initializing a GPModel with group_data that is not ordered now takes approximately the same time as in the ordered case (see #3 (comment)).

@aprilffff : Many thanks for raising this issue!

@fabsig fabsig closed this as completed Jun 16, 2020