Python | scikit-learn
scikit-learn
Machine Learning in Python
- Simple and efficient tools for data mining and data analysis
- Built on NumPy, SciPy, and matplotlib
Structure[^1]
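In the figure, the blue circles hold decision conditions and the green boxes hold the algorithms you can choose. Pick a route that matches your data's characteristics and your task goal, then follow it step by step.
As the figure shows, the library's algorithms fall into four main categories: classification, regression, clustering, and dimensionality reduction. Among them:
- Common regression: linear, decision tree, SVM, KNN; ensemble regression: random forest, AdaBoost, GradientBoosting, Bagging, ExtraTrees
- Common classification: linear, decision tree, SVM, KNN, naive Bayes; ensemble classification: random forest, AdaBoost, GradientBoosting, Bagging, ExtraTrees
- Common clustering: k-means, hierarchical clustering, DBSCAN
- Common dimensionality reduction: LinearDiscriminantAnalysis, PCA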
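The four categories above share the same estimator interface, which a small sketch can make concrete. The example below is only an illustration: the iris dataset, the particular estimators, and the hyperparameters are assumptions chosen for brevity, not recommendations from the original post.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression

# Assumed toy data: 150 samples, 4 features, 3 classes.
X, y = load_iris(return_X_y=True)

# Classification: predict the species label from the four measurements.
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
class_pred = clf.predict(X[:5])

# Regression: predict petal width (column 3) from the other three columns.
reg = LinearRegression().fit(X[:, :3], X[:, 3])
width_pred = reg.predict(X[:5, :3])

# Clustering: group the samples without looking at the labels.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
cluster_ids = km.labels_

# Dimensionality reduction: project the four features onto two components.
X_2d = PCA(n_components=2).fit_transform(X)

print(class_pred.shape, width_pred.shape, cluster_ids.shape, X_2d.shape)
```

Supervised estimators take `fit(X, y)` and then `predict`, while the clustering and decomposition estimators are fitted on `X` alone, so swapping one algorithm for another in the same category usually means changing only the constructor line.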
Modules[^2]
Modules | Purpose | Notes |
---|---|---|
sklearn.base | base classes for all estimators | |
sklearn.calibration | calibration of predicted probabilities | |
sklearn.cluster | gathers popular unsupervised clustering algorithms | |
sklearn.cluster.bicluster | spectral biclustering algorithms | |
sklearn.compose | building composite models with transformers | |
sklearn.covariance | includes methods and algorithms to robustly estimate the covariance of features given a set of points | |
sklearn.cross_decomposition | cross decomposition algorithms, including PLS and CCA | |
sklearn.datasets | utilities to load datasets, including methods to load and fetch popular reference datasets, also features some artificial data generators | |
sklearn.decomposition | matrix decomposition algorithms, including among others PCA, NMF or ICA | |
sklearn.discriminant_analysis | Linear Discriminant Analysis and Quadratic Discriminant Analysis | classification algorithms |
sklearn.dummy | Dummy estimators | |
sklearn.ensemble | ensemble-based methods for classification, regression and anomaly detection | |
sklearn.exceptions | all custom warnings and error classes used across scikit-learn | |
sklearn.experimental | provides importable modules that enable the use of experimental features or estimators. | |
sklearn.feature_extraction | deals with feature extraction from raw data. It currently includes methods to extract features from text and images. | |
sklearn.feature_selection | implements feature selection algorithms. It currently includes univariate filter selection methods and the recursive feature elimination algorithm. | |
sklearn.gaussian_process | implements Gaussian Process based regression and classification. | |
sklearn.isotonic | Isotonic regression | |
sklearn.impute | Transformers for missing value imputation | |
sklearn.kernel_approximation | implements several approximate kernel feature maps based on Fourier transforms. | |
sklearn.kernel_ridge | implements kernel ridge regression. | |
sklearn.linear_model | implements generalized linear models. It includes Ridge regression, Bayesian Regression, Lasso and Elastic Net estimators computed with Least Angle Regression and coordinate descent. It also implements Stochastic Gradient Descent related algorithms. | |
sklearn.manifold | implements data embedding techniques. | |
sklearn.metrics | score functions, performance metrics, pairwise metrics and distance computations | confusion_matrix, roc_auc_score |
sklearn.mixture | implements mixture modeling algorithms. | |
sklearn.model_selection | data splitters and validation tools | train_test_split, StratifiedKFold |
sklearn.multiclass | implements multiclass learning algorithms. | |
sklearn.multioutput | implements multioutput regression and classification. | |
sklearn.naive_bayes | implements Naive Bayes algorithms. These are supervised learning methods based on applying Bayes’ theorem with strong (naive) feature independence assumptions. | |
sklearn.neighbors | implements the k-nearest neighbors algorithm | |
sklearn.neural_network | includes models based on neural networks. | |
sklearn.pipeline | implements utilities to build a composite estimator, as a chain of transforms and estimators. | |
sklearn.inspection | includes tools for model inspection. | |
sklearn.preprocessing | includes scaling, centering, normalization, binarization and imputation methods. | |
sklearn.random_projection | Random Projections are a simple and computationally efficient way to reduce the dimensionality of the data by trading a controlled amount of accuracy (as additional variance) for faster processing times and smaller model sizes. | |
sklearn.semi_supervised | implements semi-supervised learning algorithms. These algorithms utilize small amounts of labeled data and large amounts of unlabeled data for classification tasks. This module includes Label Propagation. | |
sklearn.svm | includes Support Vector Machine algorithms. | |
sklearn.tree | includes decision tree-based models for classification and regression. | |
sklearn.utils | includes various utilities. | |
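The Notes column names `train_test_split`, `StratifiedKFold`, `confusion_matrix`, and `roc_auc_score`. A minimal sketch of how they might fit together follows; the breast cancer dataset, the logistic regression model, and the split sizes are assumptions made for illustration only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumed binary classification data.
X, y = load_breast_cancer(return_X_y=True)

# Hold out 25% of the samples, keeping the class ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# sklearn.pipeline chains the scaler and the classifier into one estimator.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Rows of the confusion matrix are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, model.predict(X_test))

# ROC AUC needs a score per sample: the predicted probability of class 1.
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

# StratifiedKFold yields train/test index pairs with balanced class ratios.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_sizes = [len(test_idx) for _, test_idx in skf.split(X, y)]

print(cm)
print(round(auc, 3), fold_sizes)
```

Fitting the scaler inside the pipeline means it only ever sees the training portion, which avoids leaking test-set statistics into the model.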
sklearn.preprocessing
The `sklearn.preprocessing` package provides several common utility functions and transformer classes to change raw feature vectors into a representation that is more suitable for the downstream estimators.
- Standardization, or mean removal and variance scaling
  - standardization: `scale`, `StandardScaler` (transformer class)
  - scale to a range: `MinMaxScaler`, `MaxAbsScaler`
- Non-linear transformation
  - Uniform distribution: `QuantileTransformer`, `quantile_transform`
  - Gaussian distribution: `PowerTransformer`
- Normalization: `normalize`, `Normalizer` (transformer class)
- Encoding categorical features: `OrdinalEncoder`, `OneHotEncoder`
- Discretization: `KBinsDiscretizer`, `Binarizer`
- Generating polynomial features: `PolynomialFeatures`
- Custom transformers: `FunctionTransformer`
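A few of the transformers listed above can be tried on a tiny array to see what each one does. This is a sketch under assumed inputs: the numeric matrix, the color labels, and the chosen parameters are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import (
    FunctionTransformer,
    KBinsDiscretizer,
    MinMaxScaler,
    OneHotEncoder,
    PolynomialFeatures,
    StandardScaler,
)

# Assumed toy data: two numeric columns on very different scales.
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])

# Standardization: each column gets zero mean and unit variance.
X_std = StandardScaler().fit_transform(X)

# Scaling to a range: each column is mapped onto [0, 1].
X_rng = MinMaxScaler().fit_transform(X)

# Encoding categorical features: one binary column per category.
colors = np.array([["red"], ["green"], ["red"], ["blue"]])
onehot = OneHotEncoder().fit_transform(colors).toarray()

# Discretization: split the first column into two equal-width bins.
binner = KBinsDiscretizer(n_bins=2, encode="ordinal", strategy="uniform")
bins = binner.fit_transform(X[:, :1])

# Polynomial features: expand x into the columns 1, x, x^2.
poly = PolynomialFeatures(degree=2).fit_transform(X[:, :1])

# Custom transformer: wrap an ordinary function as a transformer.
X_log = FunctionTransformer(np.log1p).fit_transform(X)

print(X_std.round(2))
print(X_rng)
print(onehot)
```

All of these follow the same `fit` / `transform` pattern, so any of them can be dropped into a `sklearn.pipeline` ahead of an estimator.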
[^2]: scikit-learn/modules
[^3]: scikit-learn/user_guide
Unless otherwise stated, all posts on this blog are licensed under CC BY-NC-SA 4.0. Please credit 琴韵居 when reposting.