scikit-learn

Machine Learning in Python

Simple and efficient tools for data mining and data analysis

Built on NumPy, SciPy, and matplotlib

Structure[^1]

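In the figure, the blue circles contain decision conditions and the green boxes contain the algorithms you can choose. Pick the path that matches the characteristics of your data and the goal of your task, then follow it step by step.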

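The library's algorithms fall into four main categories: classification, regression, clustering, and dimensionality reduction. Among them: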

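Common regression: linear, decision tree, SVM, KNN; ensemble regression: random forest, AdaBoost, GradientBoosting, Bagging, ExtraTrees
Common classification: linear, decision tree, SVM, KNN, naive Bayes; ensemble classification: random forest, AdaBoost, GradientBoosting, Bagging, ExtraTrees
Common clustering: k-means, hierarchical clustering, DBSCAN
Common dimensionality reduction: LinearDiscriminantAnalysis, PCA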
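
As a rough illustration of the four families listed above, the sketch below fits one estimator from each on the bundled iris dataset. The choice of dataset, estimators, and parameters (for example `n_clusters=3`, or predicting the fourth column from the other three) is an assumption made for this example, not a recommendation from the flowchart.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression

X, y = load_iris(return_X_y=True)  # 150 samples, 4 numeric features

# Classification: predict the species label from the four measurements.
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
labels_pred = clf.predict(X)

# Regression: predict the last column from the first three (illustrative target).
reg = LinearRegression().fit(X[:, :3], X[:, 3])
width_pred = reg.predict(X[:, :3])

# Clustering: group the samples without looking at the labels.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
clusters = km.labels_

# Dimensionality reduction: project the data onto two principal components.
X_2d = PCA(n_components=2).fit_transform(X)

print(labels_pred.shape, width_pred.shape, len(set(clusters)), X_2d.shape)
```

All four follow the same pattern: construct the estimator, call `fit`, then call `predict`, `transform`, or read a fitted attribute such as `labels_`.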

Modules[^2]

Modules	Purpose	Notes
sklearn.base	base classes for all estimators
sklearn.calibration	calibration of predicted probabilities
sklearn.cluster	gathers popular unsupervised clustering algorithms
sklearn.cluster.bicluster	spectral biclustering algorithms
sklearn.compose	building composite models with transformers
sklearn.covariance	includes methods and algorithms to robustly estimate the covariance of features given a set of points
sklearn.cross_decomposition
sklearn.datasets	utilities to load datasets, including methods to load and fetch popular reference datasets, also features some artificial data generators
sklearn.decomposition	matrix decomposition algorithms, including among others PCA, NMF or ICA
sklearn.discriminant_analysis	Linear Discriminant Analysis and Quadratic Discriminant Analysis	classification algorithms
sklearn.dummy	Dummy estimators
sklearn.ensemble	ensemble-based methods for classification, regression and anomaly detection
sklearn.exceptions	all custom warnings and error classes used across scikit-learn
sklearn.experimental	provides importable modules that enable the use of experimental features or estimators.
sklearn.feature_extraction	deals with feature extraction from raw data. It currently includes methods to extract features from text and images.
sklearn.feature_selection	implements feature selection algorithms. It currently includes univariate filter selection methods and the recursive feature elimination algorithm.
sklearn.gaussian_process	implements Gaussian Process based regression and classification.
sklearn.isotonic	Isotonic regression
sklearn.impute	Transformers for missing value imputation
sklearn.kernel_approximation	implements several approximate kernel feature maps based on Fourier transforms.
sklearn.kernel_ridge	implements kernel ridge regression.
sklearn.linear_model	implements generalized linear models. It includes Ridge regression, Bayesian Regression, Lasso and Elastic Net estimators computed with Least Angle Regression and coordinate descent. It also implements Stochastic Gradient Descent related algorithms.
sklearn.manifold	implements data embedding techniques.
sklearn.metrics	score functions, performance metrics and pairwise metrics and distance computations	confusion_matrix roc_auc_score [note]
sklearn.mixture	implements mixture modeling algorithms.
sklearn.model_selection	splitter, validation	train_test_split StratifiedKFold
sklearn.multiclass	This module implements multiclass learning algorithms
sklearn.multioutput	implements multioutput regression and classification.
sklearn.naive_bayes	implements Naive Bayes algorithms. These are supervised learning methods based on applying Bayes’ theorem with strong (naive) feature independence assumptions.
sklearn.neighbors	implements the k-nearest neighbors algorithm
sklearn.neural_network	includes models based on neural networks.
sklearn.pipeline	implements utilities to build a composite estimator, as a chain of transforms and estimators.
sklearn.inspection	includes tools for model inspection.
sklearn.preprocessing	includes scaling, centering, normalization, binarization and imputation methods.
sklearn.random_projection	Random Projections are a simple and computationally efficient way to reduce the dimensionality of the data by trading a controlled amount of accuracy (as additional variance) for faster processing times and smaller model sizes.
sklearn.semi_supervised	implements semi-supervised learning algorithms. These algorithms utilize small amounts of labeled data and large amounts of unlabeled data for classification tasks. This module includes Label Propagation.
sklearn.svm	includes Support Vector Machine algorithms.
sklearn.tree	includes decision tree-based models for classification and regression.
sklearn.utils	includes various utilities.
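
Several of the modules in the table are usually combined in one workflow. The sketch below is one possible arrangement, assuming the bundled breast cancer dataset and a logistic regression model (both chosen only for illustration); it uses `train_test_split` and `StratifiedKFold` from `sklearn.model_selection`, `confusion_matrix` and `roc_auc_score` from `sklearn.metrics`, and `make_pipeline` from `sklearn.pipeline`.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)  # binary classification data

# Hold out a test set, keeping the class ratio the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Chain a transformer and an estimator so scaling is learned on training data only.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Evaluate on the held-out data.
cm = confusion_matrix(y_test, model.predict(X_test))
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

# Cross-validate the whole pipeline with stratified folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)

print(cm)
print(auc, scores.mean())
```

Wrapping the scaler and the classifier in a pipeline means each cross-validation fold refits the scaler on its own training portion, which avoids leaking test statistics into training.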

sklearn.preprocessing

Preprocessing data

The sklearn.preprocessing package provides several common utility functions and transformer classes to change raw feature vectors into a representation that is more suitable for the downstream estimators.
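
A minimal sketch of the transformer pattern, assuming a small hand-written array as input: `StandardScaler` learns the per-column mean and standard deviation in `fit` and applies them in `transform`, while `MinMaxScaler` rescales each column to the range [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy data: 3 samples, 3 features (values are illustrative only).
X = np.array([[1.0, -1.0, 2.0],
              [2.0, 0.0, 0.0],
              [0.0, 1.0, -1.0]])

# Standardization: zero mean and unit variance per column.
scaler = StandardScaler().fit(X)
X_std = scaler.transform(X)

# Scaling to a range: each column is mapped onto [0, 1].
X_01 = MinMaxScaler().fit_transform(X)

print(X_std.mean(axis=0), X_std.std(axis=0))
print(X_01.min(axis=0), X_01.max(axis=0))
```

The fitted `scaler` can be reused on new data with `scaler.transform`, so test data is scaled with the statistics learned from the training data.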

Standardization, or mean removal and variance scaling
- standardization: scale, StandardScaler, Transformer
- scale to a range: MinMaxScaler, MaxAbsScaler
Non-linear transformation
- Uniform distribution: QuantileTransformer, quantile_transform
- Gaussian distribution: PowerTransformer
Normalization: normalize, Normalizer, Transformer
Encoding categorical features: OrdinalEncoder, OneHotEncoder
Discretization: KBinsDiscretizer, Binarizer
Generating polynomial features: PolynomialFeatures
Custom transformers: FunctionTransformer
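
The remaining categories in the list above can be sketched the same way. The inputs below are tiny made-up arrays, and the parameter choices (two uniform bins, degree 2, `np.log1p` as the custom function) are assumptions for illustration only.

```python
import numpy as np
from sklearn.preprocessing import (
    FunctionTransformer,
    KBinsDiscretizer,
    Normalizer,
    OneHotEncoder,
    PolynomialFeatures,
)

# Encoding categorical features: one column per category (sorted: green, red).
colors = np.array([["red"], ["green"], ["red"]])
onehot = OneHotEncoder().fit_transform(colors).toarray()

# Discretization: split a continuous feature into 2 equal-width bins.
ages = np.array([[3.0], [25.0], [40.0], [70.0]])
bins = KBinsDiscretizer(
    n_bins=2, encode="ordinal", strategy="uniform"
).fit_transform(ages)

# Normalization: scale each sample (row) to unit L2 norm.
unit = Normalizer(norm="l2").fit_transform(np.array([[3.0, 4.0]]))

# Polynomial features of degree 2 for [a, b]: 1, a, b, a^2, a*b, b^2.
poly = PolynomialFeatures(degree=2).fit_transform(np.array([[2.0, 3.0]]))

# Custom transformer: wrap an ordinary function so it can sit in a pipeline.
logged = FunctionTransformer(np.log1p).fit_transform(np.array([[0.0, 1.0]]))

print(onehot, bins.ravel(), unit, poly, logged, sep="\n")
```

Note that `Normalizer` works per sample (row), whereas the scalers work per feature (column).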

[^2]: scikit-learn/modules
[^3]: scikit-learn/user_guide