scikit-learn

Machine Learning in Python

  • Simple and efficient tools for data mining and data analysis
  • Built on NumPy, SciPy, and matplotlib

Structure[^1]

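In the figure, the blue circles hold the decision conditions and the green boxes hold the candidate algorithms. Starting from the characteristics of your data and the goal of your task, trace a path through the chart and follow it step by step.

The library's algorithms fall into four main categories: classification, regression, clustering, and dimensionality reduction. Within these:

  • Common regressors: linear models, decision trees, SVM, KNN; ensemble regressors: random forest, AdaBoost, GradientBoosting, Bagging, ExtraTrees
  • Common classifiers: linear models, decision trees, SVM, KNN, naive Bayes; ensemble classifiers: random forest, AdaBoost, GradientBoosting, Bagging, ExtraTrees
  • Common clustering: k-means, hierarchical clustering, DBSCAN
  • Common dimensionality reduction: LinearDiscriminantAnalysis, PCA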
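The four categories share the same estimator interface, which a short sketch can make concrete. The choice of the iris dataset and of one estimator per category below is an assumption made for illustration, not a recommendation taken from the chart.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)  # 150 samples, 4 features, 3 classes

# Classification: predict the class label from the features.
clf = KNeighborsClassifier(n_neighbors=5).fit(X, y)
class_pred = clf.predict(X[:3])

# Regression: predict a continuous value
# (here the 4th feature from the first three, purely as a toy target).
reg = RandomForestRegressor(n_estimators=50, random_state=0).fit(X[:, :3], X[:, 3])
value_pred = reg.predict(X[:3, :3])

# Clustering: group the samples without looking at the labels.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
cluster_ids = km.labels_

# Dimensionality reduction: project the 4 features onto 2 components.
X_2d = PCA(n_components=2).fit_transform(X)

print(class_pred.shape, value_pred.shape, cluster_ids.shape, X_2d.shape)
```

Supervised estimators take `fit(X, y)` and expose `predict`, while the clustering and decomposition estimators are fitted on `X` alone.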

Modules[^2]

| Module | Purpose | Notes |
|---|---|---|
| sklearn.base | Base classes for all estimators | |
| sklearn.calibration | Calibration of predicted probabilities | |
| sklearn.cluster | Gathers popular unsupervised clustering algorithms | |
| sklearn.cluster.bicluster | Spectral biclustering algorithms | |
| sklearn.compose | Building composite models with transformers | |
| sklearn.covariance | Methods and algorithms to robustly estimate the covariance of features given a set of points | |
| sklearn.cross_decomposition | | |
| sklearn.datasets | Utilities to load datasets, including methods to load and fetch popular reference datasets; also features some artificial data generators | |
| sklearn.decomposition | Matrix decomposition algorithms, including among others PCA, NMF or ICA | |
| sklearn.discriminant_analysis | Linear Discriminant Analysis and Quadratic Discriminant Analysis classification algorithms | |
| sklearn.dummy | Dummy estimators | |
| sklearn.ensemble | Ensemble-based methods for classification, regression and anomaly detection | |
| sklearn.exceptions | All custom warnings and error classes used across scikit-learn | |
| sklearn.experimental | Importable modules that enable the use of experimental features or estimators | |
| sklearn.feature_extraction | Feature extraction from raw data. It currently includes methods to extract features from text and images. | |
| sklearn.feature_selection | Feature selection algorithms. It currently includes univariate filter selection methods and the recursive feature elimination algorithm. | |
| sklearn.gaussian_process | Gaussian Process based regression and classification | |
| sklearn.isotonic | Isotonic regression | |
| sklearn.impute | Transformers for missing value imputation | |
| sklearn.kernel_approximation | Several approximate kernel feature maps based on Fourier transforms | |
| sklearn.kernel_ridge | Kernel ridge regression | |
| sklearn.linear_model | Generalized linear models. It includes Ridge regression, Bayesian Regression, Lasso and Elastic Net estimators computed with Least Angle Regression and coordinate descent. It also implements Stochastic Gradient Descent related algorithms. | |
| sklearn.manifold | Data embedding techniques | |
| sklearn.metrics | Score functions, performance metrics, pairwise metrics, and distance computations | confusion_matrix, roc_auc_score [note] |
| sklearn.mixture | Mixture modeling algorithms | |
| sklearn.model_selection | Data splitters and model validation tools | train_test_split, StratifiedKFold |
| sklearn.multiclass | Multiclass learning algorithms | |
| sklearn.multioutput | Multioutput regression and classification | |
| sklearn.naive_bayes | Naive Bayes algorithms. These are supervised learning methods based on applying Bayes' theorem with strong (naive) feature independence assumptions. | |
| sklearn.neighbors | The k-nearest neighbors algorithm | |
| sklearn.neural_network | Models based on neural networks | |
| sklearn.pipeline | Utilities to build a composite estimator, as a chain of transforms and estimators | |
| sklearn.inspection | Tools for model inspection | |
| sklearn.preprocessing | Scaling, centering, normalization and binarization methods | |
| sklearn.random_projection | Random projections are a simple and computationally efficient way to reduce the dimensionality of the data by trading a controlled amount of accuracy (as additional variance) for faster processing times and smaller model sizes. | |
| sklearn.semi_supervised | Semi-supervised learning algorithms. These algorithms utilize small amounts of labeled data and large amounts of unlabeled data for classification tasks. This module includes Label Propagation. | |
| sklearn.svm | Support Vector Machine algorithms | |
| sklearn.tree | Decision tree-based models for classification and regression | |
| sklearn.utils | Various utilities | |
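The helpers singled out in the Notes column (train_test_split, StratifiedKFold, confusion_matrix, roc_auc_score) are typically used together. The following is a minimal sketch of one such workflow; the breast cancer dataset, the logistic regression model, and all parameter values are assumptions chosen only to keep the example small.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)  # binary classification data

# Hold out 25% of the samples, keeping the class ratio equal in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Scaling and the classifier are chained so both are fitted on training data only.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# confusion_matrix takes hard labels; roc_auc_score takes scores or probabilities.
cm = confusion_matrix(y_test, model.predict(X_test))
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

# Stratified 5-fold cross-validation on the training part.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="roc_auc")

print(cm)
print(auc, cv_scores.mean())
```

Passing `stratify=y` and using StratifiedKFold both preserve the class proportions, which matters when the classes are imbalanced.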

sklearn.preprocessing

Preprocessing data

The sklearn.preprocessing package provides several common utility functions and transformer classes that change raw feature vectors into a representation more suitable for the downstream estimators.
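As a rough illustration of what a transformer class does, the sketch below fits a StandardScaler on training data and reuses the learned statistics on new data. The toy arrays are invented for this example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical values): two features on very different scales.
X_train = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
X_test = np.array([[2.0, 500.0]])

scaler = StandardScaler()

# fit_transform learns the per-column mean and standard deviation from the
# training data and returns the standardized training data.
X_train_std = scaler.fit_transform(X_train)

# transform reuses the statistics learned above; it does not refit on X_test.
X_test_std = scaler.transform(X_test)

print(scaler.mean_)   # per-column means learned from X_train
print(X_train_std)
print(X_test_std)
```

Fitting on the training split only and then calling `transform` on other data keeps information from the test set out of the preprocessing step.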

  • Standardization, or mean removal and variance scaling
    • Standardization: scale (function), StandardScaler (class implementing the Transformer API)
    • scale to a range: MinMaxScaler, MaxAbsScaler
  • Non-linear transformation
    • Uniform distribution: QuantileTransformer, quantile_transform
    • Gaussian distribution: PowerTransformer
  • Normalization: normalize (function), Normalizer (class implementing the Transformer API)
  • Encoding categorical features: OrdinalEncoder, OneHotEncoder
  • Discretization: KBinsDiscretizer, Binarizer
  • Generating polynomial features: PolynomialFeatures
  • Custom transformers: FunctionTransformer
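The remaining entries of the list above can be tried on small inputs. This is a sketch rather than a complete tour: every array is made up for the example, and the parameter values (number of bins, threshold, polynomial degree, category order) are assumptions.

```python
import numpy as np
from sklearn.preprocessing import (
    Binarizer, FunctionTransformer, KBinsDiscretizer, MinMaxScaler, Normalizer,
    OneHotEncoder, OrdinalEncoder, PolynomialFeatures, PowerTransformer,
    QuantileTransformer,
)

# Scale to a range: each column is mapped to [0, 1].
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 40.0]])
X_minmax = MinMaxScaler().fit_transform(X)

# Non-linear transformations, applied to a skewed (log-normal) sample.
rng = np.random.RandomState(0)
skewed = rng.lognormal(size=(200, 1))
X_uniform = QuantileTransformer(
    n_quantiles=100, output_distribution="uniform"
).fit_transform(skewed)
X_gauss = PowerTransformer().fit_transform(skewed)  # Yeo-Johnson, then standardized

# Normalization: each row (sample) is rescaled to unit L2 norm.
X_unit = Normalizer(norm="l2").fit_transform(np.array([[3.0, 4.0]]))  # [[0.6, 0.8]]

# Encoding categorical features.
colors = np.array([["red"], ["green"], ["red"]])
X_onehot = OneHotEncoder().fit_transform(colors).toarray()  # columns: green, red
sizes = np.array([["low"], ["high"], ["medium"]])
X_ordinal = OrdinalEncoder(
    categories=[["low", "medium", "high"]]  # explicit order: low=0, medium=1, high=2
).fit_transform(sizes)

# Discretization: bin into 3 equal-width intervals, or threshold at 4.0.
values = np.array([[0.0], [1.0], [5.0], [9.0], [10.0]])
X_binned = KBinsDiscretizer(
    n_bins=3, encode="ordinal", strategy="uniform"
).fit_transform(values)
X_binary = Binarizer(threshold=4.0).fit_transform(values)

# Polynomial features of degree 2 for (a, b): 1, a, b, a^2, a*b, b^2.
X_poly = PolynomialFeatures(degree=2).fit_transform(np.array([[2.0, 3.0]]))

# Custom transformer: wrap any function, here log(1 + x).
X_log = FunctionTransformer(np.log1p).fit_transform(np.array([[0.0], [np.e - 1.0]]))

print(X_minmax, X_onehot, X_ordinal, X_binned, X_binary, X_poly, X_log, sep="\n")
```

Calling `.toarray()` on the OneHotEncoder output converts its default sparse result to a dense array for printing.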

[^2]: scikit-learn/modules
[^3]: scikit-learn/user_guide