Recent advances in scaling-down sampling methods in machine learning. (English) Zbl 07914938

MSC:

62-08 Computational methods for problems pertaining to statistics
Full Text: DOI

References:

[1] IBM. What Is Big Data: Bring Big Data to the Enterprise. 2012. [online] Available at: http://www-01.ibm.com/software/data/bigdata/.
[2] HilbertM, LópezP. The world’s technological capacity to store, communicate, and compute information. Science2011, 332:60-65.
[3] IDC. 2014. Available at: https://www.emc.com/leadership/digital-universe/2014iview/executive-summary.htm. (Accessed March 2017)
[4] SettlesB. Active learning. Synth Lect Artif Intell Mach Learn2012, 6:1-114. · Zbl 1270.68006
[5] TomanekK, OlssonF. A web survey on the use of active learning to support annotation of text data. In: Proceedings of the NAACL HLT 2009 Workshop on Active Learning for Natural Language Processing, pp. 45-48. Association for Computational Linguistics, 2009.
[6] HesabiZR, TariZ, GoscinskiA, FahadA, KhalilI, QueirozC. Data summarization techniques for big data—a survey. In: Handbook on Data Centers. New York: Springer; 2015, 1109-1152.
[7] VitterJS. Random sampling with a reservoir. ACM Trans Math Softw1985, 11:37-57. · Zbl 0562.68028
[8] MichalskiRS. On the selection of representative samples from large relational tables for inductive inference. University of Illinois (Chicago Circle) Tech. Report, 1975.
[9] WaldA. On the efficient design of statistical investigations. Ann Math Stat1943, 14:134-140. · Zbl 0060.30109
[10] LiuH, MotodaH, eds. Instance Selection and Construction for Data Mining, vol. 608. US: Springer Science & Business Media; 2013.
[11] AntalE, TilléY. A direct bootstrap method for complex sampling designs from a finite population. J Am Stat Assoc2011, 106:534-543. · Zbl 1232.62030
[12] AfshartousD. Sample size determination for binomial proportion confidence intervals: an alternative perspective motivated by a legal case. Am Stat2008, 62:27-31.
[13] O’NeillB. Some useful moment results in sampling problems. Am Stat2014, 68:282-296. · Zbl 07653670
[14] ZhangL. Sample mean and sample variance: their covariance and their (in) dependence. Am Stat2007, 61:159-160.
[15] GregoireTG, AffleckDLR. Estimating desired sample size for simple random sampling of a skewed population. Am Stat. In press. · Zbl 07663938
[16] FedorovVV. Theory of Optimal Experiments. Philadelphia, PA: Elsevier; 1972.
[17] CochranWG. Sampling Techniques. 3rd ed.New York: John Wiley & Sons; 1977. · Zbl 0353.62011
[18] HedayatAS, Kumar SinhaB. Design and Inference in Finite Population Sampling. New York: Wiley; 1991. · Zbl 0850.62160
[19] GuB, HuF, LiuH. Sampling and its application in data mining: a survey. Singapore: National University of Singapore; 2000.
[20] HannekeS. Theory of disagreement‐based active learning. Found Trends Mach Learn2014, 7:131-309. · Zbl 1327.68193
[21] ZhuX, LaffertyJ, GhahramaniZ. Combining active learning and semi‐supervised learning using Gaussian fields and harmonic functions. In: Proceedings of the ICML Workshop on the Continuum from Labeled to Unlabeled Data, pp. 58-65, 2003.
[22] ZhangJ, XuJ, LiaoS. Sampling methods for summarizing unordered vehicle‐to‐vehicle data streams. Transp Res Part C Emerg Technol2012, 23:56-67.
[23] DashM, NgW. Efficient reservoir sampling for transactional data streams. In: Proceedings of IEEE International Conference on Data Mining (ICDM), pp. 662-666, 2006.
[24] AggarwalCC. On biased reservoir sampling in the presence of stream evolution. In: Proceedings of the 32nd International Conference on Very Large Data Bases (VLDB), pp. 607-618, 2006.
[25] GhoshD, VogtA. A modification of Poisson sampling. In: Proceedings of the American Statistical Association, Survey Research Methods Section, pp. 198-199, 1999.
[26] BabcockB, DatarM, MotwaniR. Sampling from a moving window over streaming data. In: Proceedings of the 13th Annual ACM‐SIAM Symposium on Discrete Algorithms (SODA). Society for Industrial and Applied Mathematics, Philadelphia, pp. 633-634, 2002. · Zbl 1093.68571
[27] Hua‐HuiC, LiaoK‐L. Weighted random sampling based hierarchical amnesic synopses for data streams. In: 2010 5th International Conference on Computer Science and Education (ICCSE), pp. 1816-1820, IEEE, 2010.
[28] AcharyaS, PoosalaV, RamaswamyS. Selectivity estimation in spatial databases. In: Proceedings of SIGMOD, June 1999.
[29] Al‐KatebM, LeeBS. Adaptive stratified reservoir sampling over heterogeneous data streams. Inf Syst2014, 39:199-216.
[30] LiuT, WangF, AgrawalG. Stratified sampling for data mining on the deep web. Front Comp Sci2012, 6:179-196. · Zbl 1251.68182
[31] KurantM, GjokaM, ButtsCT, MarkopoulouA. Walking on a graph with a magnifying glass: stratified sampling via weighted random walks. In: Proceedings of the ACM SIGMETRICS Joint International Conference on Measurement and Modeling of Computer Systems, pp. 281-292. ACM, 2011.
[32] YeY, WuQ, Zhexue HuangJ, NgMK, LiX. Stratified sampling for feature subspace selection in random forests for high dimensional data. Pattern Recogn2013, 46:769-787.
[33] HollandPW. Statistics and causal inference. J Am Stat Assoc1986, 81:945-960. · Zbl 0607.62001
[34] AlemiF, ElRafeyA, AvramovicI. Covariate balancing through naturally occurring strata. Health Serv Res2016. https://doi.org/10.1111/1475-6773.12628 · doi:10.1111/1475-6773.12628
[35] MiratrixLW, SekhonJS, YuB. Adjusting treatment effect estimates by post‐stratification in randomized experiments. J R Stat Soc Ser B Stat Methodol2013, 75:369-396. · Zbl 07555452
[36] NeymanJ. Contribution to the theory of sampling human populations. J Am Stat Assoc1938, 33:101-116. · Zbl 0018.22603
[37] BreslowNE, HolubkovR. Maximum likelihood estimation of logistic regression parameters under two‐phase, outcome‐dependent sampling. J R Stat Soc Ser B Stat Methodol1997, 59:447-461. · Zbl 0886.62071
[38] ChatterjeeN, ChenY‐H. Maximum likelihood inference on a mixed conditionally and marginally specified regression model for genetic epidemiologic studies with two‐phase sampling. J R Stat Soc Ser B Stat Methodol2007, 69:123-142. · Zbl 1120.62096
[39] BreslowNE, WellnerJA. Weighted likelihood for semiparametric models and two‐phase stratified samples, with application to Cox regression. Scand J Stat2007, 34:86-102. · Zbl 1142.62014
[40] YamaneT. Elementary Sampling Theory. Englewood Cliffs, NJ: Prentice Hall; 1967. · Zbl 0147.38002
[41] NguyenTT, SongI. Centrality clustering‐based sampling for big data visualization. In: 2016 International Joint Conference on Neural Networks (IJCNN), pp. 1911-1917. IEEE, 2016.
[42] SharmaS, KhanMGM. Determining optimum cluster size and sampling unit for multivariate study. In: 2015 2nd Asia‐Pacific World Congress on Computer Science and Engineering (APWC on CSE), pp. 1-4. IEEE, 2015.
[43] InoueT, KrishnaA, GopalanRP. Multidimensional cluster sampling view on large databases for approximate query processing. In: 2015 IEEE 19th International Enterprise Distributed Object Computing Conference (EDOC), pp. 104-111. IEEE, 2015.
[44] ThompsonSK. Adaptive cluster sampling. J Am Stat Assoc1990, 85:1050-1059. · Zbl 1330.62070
[45] FieldCA, WelshAH. Bootstrapping clustered data. J R Stat Soc Ser B Stat Methodol2007, 69:369-390. · Zbl 07555357
[46] FieldCA, PangZ, WelshAH. Bootstrapping data with multiple levels of variation. Can J Stat2008, 36:521-539. · Zbl 1166.62026
[47] SamantaM, WelshAH. Bootstrapping for highly unbalanced clustered data. Comput Stat Data Anal2013, 59:70-81. · Zbl 1400.62134
[48] ChatterjeeS, BoseA. Generalized bootstrap for estimating equations. Ann Stat2005, 33:414-436. · Zbl 1065.62073
[49] Salibián‐BarreraM, Van AelstS, WillemsG. Principal components analysis based on multivariate MM estimators with fast and robust bootstrap. J Am Stat Assoc2006, 101:1198-1211. · Zbl 1120.62319
[50] MacKinnonJG, WebbMD. Wild bootstrap inference for wildly different cluster sizes. J Appl Econ2016, 32:233-254.
[51] ParentePMDC, SilvaS. Quantile regression with clustered data. J Econ Methods2016, 5:1-15. · Zbl 1345.62182
[52] PalmerCR, FaloutsosC. Density biased sampling: an improved method for data mining and clustering. ACM SIGMOD Rec2000, 29:82-92.
[53] PoosalaV, IoannidisY. Selectivity estimation without the attribute value independence assumption. In: Proceedings of Very Large Data Bases Conference, pp. 486-495, 1997.
[54] ChaudhuriS, MotwaniR, NarasayyaV. On random sampling over joins. In: Proceedings of SIGMOD, pp. 263-274, June 1999.
[55] KornF, JohnsonT, JagadishH. Range selectivity estimation for continuous attributes. In: Proceedings of 11th Intl Conf. SSDBMs, 1999.
[56] VitterJS, WangM, IyerBR. Data cube approximation and histograms via wavelets. In: Proceedings of 1998 ACM CIKM International Conference on Information and Knowledge Management, 1998.
[57] MatiasY, VitterJS, WangM. Wavelet‐based histograms for selectivity estimation. In: Proceedings of 1998 ACM SIGMOD International Conference on Management of Data, 1998.
[58] LeeJ, KimD, ChungC. Multi‐dimensional selectivity estimation using compressed histogram information. In: Proceedings of 1999 ACM SIGMOD International Conference on Management of Data, 1999.
[59] BlohsfeldB, KorusD, SeegerB. A comparison of selectivity estimators for range queries on metric attributes. In: Proceedings of 1999 ACM SIGMOD International Conference on Management of Data, 1999.
[60] ScottD. Multivariate Density Estimation: Theory, Practice and Visualization. Hoboken, NJ: John Wiley & Sons; 1992. · Zbl 0850.62006
[61] SilvermanBW. Density estimation for statistics and data analysis. In: Monographs on Statistics and Applied Probability. Boca Raton, FL: Chapman & Hall; 1986. · Zbl 0617.62042
[62] KolliosG, GunopulosD, KoudasN, BerchtoldS. Efficient biased sampling for approximate clustering and outlier detection in large data sets. IEEE Trans Knowl Data Eng2003, 15:1170-1187.
[63] IversenTF, EllekildeL‐P. Kernel density estimation based self‐learning sampling strategy for motion planning of repetitive tasks. In: 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1380-1387. IEEE, 2016.
[64] PejoskiS, KafedziskiV. Wavelet image decomposition based variable density compressive sampling in MRI. In: 2011 19th Telecommunications Forum (TELFOR), pp. 635-638. IEEE, 2011.
[65] LewisD, CatlettJ. Heterogeneous uncertainty sampling for supervised learning. In: Proceedings of the International Conference on Machine Learning (ICML), pp. 148-156. Morgan Kaufmann, 1994.
[66] SharmaM, BilgicM. Evidence‐based uncertainty sampling for active learning. Data Mining Knowl Discov2017, 31:164-202. · Zbl 1411.68121
[67] BilgicM, MihalkovaL, GetoorL. Active learning for networked data. In: Proceedings of the 27th International Conference on Machine Learning, pp. 79-86, 2010.
[68] ChaoC, CakmakM, ThomazAL. Transparent active learning for robots. In: 5th ACM/IEEE International Conference on Human-Robot Interaction (HRI), IEEE, pp. 317-324, 2010.
[69] StanitsasP, CherianA, MorellasV, PapanikolopoulosN. Active constrained clustering via non‐iterative uncertainty sampling. In: IROS, 2016, pp. 4027-4033.
[70] PrudêncioRBC, SoaresC, Bernarda LudermirT. Uncertainty sampling‐based active selection of datasetoids for meta‐learning. in: ICANN (2), pp. 454-461, 2011.
[71] BhattN, ThakkarA, GanatraA, BhattN. The multi‐criteria ranking approach to classification algorithms using uncertainty sampling method of active meta learning; 2014.
[72] MinakawaM, RaytchevB, TamakiT, KanedaK. Image sequence recognition with active learning using uncertainty sampling. In: The 2013 International Joint Conference on Neural Networks (IJCNN), pp. 1-6. IEEE, 2013.
[73] LughoferE, PratamaM. On‐line active learning in data stream regression using uncertainty sampling based on evolving generalized fuzzy models. IEEE Trans Fuzzy Syst2017.
[74] NguforC, WojtusiakJ. Learning from large distributed data: a scaling down sampling scheme for efficient data processing. Int J Mach Learn Comput2014, 4:216-224.
[75] ZhangT, OlesF. A probability analysis on the value of unlabeled data for classification problems. In: Proceedings of the International Conference on Machine Learning, 2000.
[76] BrinkerK. Incorporating diversity in active learning with support vector machines. In: ICML, 2003.
[77] HoiSCH, JinR, ZhuJ, LyuMR. Batch mode active learning and its application to medical image classification. In: ICML, 2006.
[78] AzimiJ, FernA, Zhang‐FernX, BorradaileG, HeeringaB. Batch active learning via coordinated matching. arXiv preprint arXiv:1206.6458, 2012.
[79] WangZ, YeJ. Querying discriminative and representative samples for batch mode active learning. ACM Trans Knowl Discov Data2015, 9:17.
[80] WeiK, IyerRK, BilmesJA. Submodularity in data subset selection and active learning. In: ICML, pp. 1954-1963, 2015.
[81] ChattopadhyayR, FanW, DavidsonI, PanchanathanS, YeJ. Joint transfer and batch‐mode active learning. In: ICML (3), pp. 253-261, 2013.
[82] MitchellT. Generalization as search. Artif Intell1982, 18:203-226. https://doi.org/10.1016/0004-3702(82)90040-6. · doi:10.1016/0004-3702(82)90040-6
[83] DasguptaS. Two faces of active learning. Theor Comput Sci2011, 412:1767-1781. · Zbl 1209.68408
[84] HannekeS. Theory of active learning. Version 1.1, 2014. Available at: http://www.stevehanneke.com.
[85] CohnD, AtlasL, LadnerR. Improving generalization with active learning. Mach Learn1994, 15:201-221. https://doi.org/10.1007/BF00993277. · doi:10.1007/BF00993277
[86] SeungHS, OpperM, SompolinskyH. Query by committee. In: Proceedings of the ACM Workshop on Computational Learning Theory, pp. 287-294. ACM, 1992. https://doi.org/10.1145/130385.130417
[87] FreundY, SeungHS, ShamirE, TishbyN. Selective sampling using the query by committee algorithm. Mach Learn1997, 28:133-168. https://doi.org/10.1023/A:1007330508534. · Zbl 0881.68093 · doi:10.1023/A:1007330508534
[88] OlssonF. A literature survey of active machine learning in the context of natural language processing; 2009.
[89] BreimanL. Bagging predictors. Mach Learn1996, 24:123-140. · Zbl 0858.68080
[90] FreundY, SchapireRE. A decision‐theoretic generalization of on‐line learning and application to boosting. J Comput Syst Sci1997, 55:119-139. · Zbl 0880.68103
[91] MelvilleP, MooneyRJ. Diverse ensembles for active learning. In: Proceedings of the 21st International Conference on Machine Learning (ICML‐2004), pp. 584-591. Banff, Canada, 2004.
[92] StefanowskiJ, PachockiM. Comparing performance of committee based approaches to active learning. In: Recent Advances in Intelligent Information Systems. Warszawa: Wydawnictwo EXIT; 2009, 457-470.
[93] DzeroskiS, ZenkoB. Is combining classifiers with stacking better than selecting the best one? Mach Learn2004, 54:255-273. · Zbl 1101.68077
[94] CaruanaR, MunsonA, Niculescu‐MizilA. Getting the most out of ensemble selection. In: Proceedings of International Conference on Data Mining (ICDM), pp. 828-833, 2006.
[95] LuZ, WuX, BongardJC. Active learning through adaptive heterogeneous ensembling. IEEE Trans Knowl Data Eng2015, 27:368-381.
[96] BalcanM‐F, BeygelzimerA, LangfordJ. Agnostic active learning. J Comput Syst Sci2009, 75:78-89. · Zbl 1162.68516
[97] HannekeS. A bound on the label complexity of agnostic active learning. In: Proceedings of the 24th International Conference on Machine Learning, 2007.
[98] DasguptaS, HsuD, MonteleoniC. A general agnostic active learning algorithm. In: Advances in Neural Information Processing Systems 20, 2007.
[99] BalcanM‐F, BroderA, ZhangT. Margin based active learning. In: Proceedings of the 20th Conference on Learning Theory, 2007. · Zbl 1203.68136
[100] BeygelzimerA, DasguptaS, LangfordJ. Importance weighted active learning. In: Proceedings of the 26th International Conference on Machine Learning, 2009.
[101] FriedmanE. Active learning for smooth problems. In: Proceedings of the 22nd Conference on Learning Theory, 2009.
[102] BalcanM‐F, HannekeS, VaughanJW. The true sample complexity of active learning. Mach Learn2010, 80:111-139. · Zbl 1470.68078
[103] HannekeS. Rates of convergence in active learning. Ann Stat2011, 39:333-361. · Zbl 1274.62510
[104] KoltchinskiiV. Rademacher complexities and bounding the excess risk in active learning. J Mach Learn Res2010, 11:2457-2485. · Zbl 1242.62088
[105] BeygelzimerA, HsuD, LangfordJ, ZhangT. Agnostic active learning without constraints. In: Advances in Neural Information Processing Systems 23, 2010.
[106] HsuD. Algorithms for active learning. PhD Thesis, Department of Computer Science and Engineering, School of Engineering, University of California, San Diego, 2010.
[107] HannekeS. Activized learning: transforming passive to active with improved label complexity. J Mach Learn Res2012, 13:1469-1587. · Zbl 1303.68103
[108] El‐YanivR, WienerY. Active learning via perfect selective classification. J Mach Learn Res2012, 13:255-279. · Zbl 1283.68287
[109] HannekeS, YangL. Surrogate losses in passive and active learning. arXiv:1207.3772, 2012.
[110] HannekeS. Teaching dimension and the complexity of active learning. In: Proceedings of the 20th Conference on Learning Theory, 2007. · Zbl 1203.68151
[111] El‐YanivR, WienerY. On the foundations of noise‐free selective classification. J Mach Learn Res2010, 11:1605-1641. · Zbl 1242.68218
[112] WienerY. Theoretical foundations of selective prediction. PhD Thesis, The Technion — Israel Institute of Technology, 2013.
[113] KornerC, WrobelS. Multi‐class ensemble‐based active learning. In: Proceedings of the 17th European Conference on Machine Learning and the 10th European Conference on Principles and Practice of Knowledge Discovery in Databases, pp. 687-694. Berlin: Springer‐Verlag, 2006.
[114] LinJ. Divergence measures based on the Shannon entropy. IEEE Trans Inf Theory1991, 37:145-151. · Zbl 0712.94004
[115] PereiraFCN, TishbyN, LeeL. Distributional clustering of English words. In: Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, pp. 183-190. Columbus, OH: ACL, 1993.
[116] KullbackS, LeiblerRA. On information and sufficiency. Ann Math Stat1951, 22:79-86. · Zbl 0042.38403
[117] EngelsonSP, DaganI. Minimizing manual annotation cost in supervised training from corpora. In: Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, pp. 319-326. Santa Cruz, CA: ACL, 1996.
[118] NgaiG, YarowskyD. Rule writing or annotation: cost‐efficient resource usage for base noun phrase chunking. In: Proceedings of the 38th Annual Meeting on Association for Computational Linguistics, pp. 117-125. Hong Kong: ACL, 2000.
[119] ChalonerK, VerdinelliI. Bayesian experimental design: a review. Stat Sci1995, 10:273-304. · Zbl 0955.62617
[120] SmithK. On the standard deviations of adjusted and interpolated values of an observed polynomial function and its constants and the guidance they give towards a proper choice of the distribution of observations. Biometrika1918, 12:1-85.
[121] FedorovV. Optimal experimental design. WIREs Comput Stat2010, 2:581-589.
[122] DetteH, StuddenWJ. Geometry of E‐optimality. Ann Stat1993, 21:416-433. · Zbl 0780.62057
[123] ElfvingG. Optimum allocation in linear regression theory. Ann Math Stat1952, 23:255-262. · Zbl 0047.13403
[124] SacksJ, YlvisakerD. Designs for regression problems with correlated errors. Ann Math Stat1966, 37:66-89. · Zbl 0152.17503
[125] MacKayDJC. Information‐based objective functions for active data selection. Neural Comput1992, 4:590-604.
[126] ScheinAI, UngarLH. Active learning for logistic regression: an evaluation. Mach Learn2007, 68:235-265. · Zbl 1470.68170
[127] HoiSCH, JinR, LyuMR. Large‐scale text categorization by batch mode active learning. In: Proceedings of the International Conference on the World Wide Web, pp. 633-642. ACM, 2006. https://doi.org/10.1145/1135777.1135870
[128] Ramirez‐LoaizaME, SharmaM, KumarG, BilgicM. Active learning: an empirical study of common baselines. Data Mining Knowl Discov2016, 31:287-313.
[129] RoyN, McCallumA. Toward optimal active learning through sampling estimation of error reduction. In: Proceedings of the International Conference on Machine Learning (ICML), pp. 441-448. Morgan Kaufmann; 2001.
[130] dos SantosDP, de CarvalhoACPLF. Comparison of active learning strategies and proposal of a multiclass hypothesis space search. In: International Conference on Hybrid Artificial Intelligence Systems, pp. 618-629. Springer International Publishing, 2014.
[131] SettlesB, CravenM. An analysis of active learning strategies for sequence labeling tasks. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 1070-1079; 2008.
[132] ZhuJ, WangH, TsouBK, MaM. Active learning with sampling by uncertainty and density for data annotations. IEEE Trans Audio Speech Lang Process2010, 18:1323-1331.
[133] IencoD, ZliobaiteI, PfahringerB. High density‐focused uncertainty sampling for active learning over evolving stream data. In: BigMine, pp. 133-148, 2014.
[134] FuY, ZhuX, LiB. A survey on instance selection for active learning. Knowl Inf Syst2013, 35:1-35.
[135] BouneffoufD. Exponentiated gradient exploration for active learning. Computers2016, 5:1.
[136] LuoC, JiY, DaiX, ChenJ. Active learning with transfer learning. In: Proceedings of ACL 2012 Student Research Workshop, pp. 13-18. Association for Computational Linguistics, 2012.
[137] ShaoH, TaoF, RuiX. Transfer active learning by querying committee. J Zhejiang Univ Sci C2014, 15:107-118.
[138] HannekeS, YangL. Minimax analysis of active learning. J Mach Learn Res2015, 16:3487-3602. · Zbl 1351.68210
[139] ProvostF, JensenD, OatesT. Efficient progressive sampling. In: Proceedings of the Fifth International Conference on Knowledge Discovery and Data Mining, pp. 23-32, 1999.
[140] MeekC, ThiessonB, HeckermanD. The learning‐curve sampling method applied to model‐based clustering. J Mach Learn Res2002, 2:397-418. · Zbl 1007.68082
[141] JohnGH, LangleyP. Static versus dynamic sampling for data mining. In: KDD‐96, pp. 367-370, 1996.
[142] SatyanarayanaA. Intelligent sampling for big data using bootstrap sampling and Chebyshev inequality. In: 2014 IEEE 27th Canadian Conference on Electrical and Computer Engineering (CCECE), pp. 1-6. IEEE, 2014.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases these data have been complemented or enhanced by data from zbMATH Open. The list attempts to reflect the references in the original paper as accurately as possible without claiming completeness or perfect matching.