Hyperparameter optimization for sklearn

・7 minute read

  • Source name: hyperopt-sklearn
  • Source URL: https://www.github.com/hyperopt/hyperopt-sklearn
  • hyperopt-sklearn documentation
  • hyperopt-sklearn source code download
  • Git URL:
    git://www.github.com/hyperopt/hyperopt-sklearn.git
  • Clone with Git:
    git clone https://www.github.com/hyperopt/hyperopt-sklearn
  • Check out with Subversion:
    $ svn co --depth empty https://www.github.com/hyperopt/hyperopt-sklearn
    Checked out revision 1.
    $ cd repo
    $ svn up trunk

  • hyperopt-sklearn

    Hyperopt-sklearn is Hyperopt-based model selection for the machine learning algorithms in scikit-learn.

    See how to use hyperopt-sklearn through the examples or the older notebooks.

    Installation

    Installation via pip from a git clone is supported:


    git clone git@github.com:hyperopt/hyperopt-sklearn.git
    (cd hyperopt-sklearn && pip install -e .)

    Usage

    If you are familiar with sklearn, adding a hyperparameter search with hyperopt-sklearn changes only one line compared to a standard pipeline.


    from hpsklearn import HyperoptEstimator, svc
    from sklearn import svm

    # Load data
    # ...

    if use_hpsklearn:
        estim = HyperoptEstimator(classifier=svc('mySVC'))
    else:
        estim = svm.SVC()

    estim.fit(X_train, y_train)

    print(estim.score(X_test, y_test))
    # <<show score here>>

    A complete example using the Iris dataset:


    from hpsklearn import HyperoptEstimator, any_classifier, any_preprocessing
    from sklearn.datasets import load_iris
    from hyperopt import tpe
    import numpy as np

    # Download the data and split into training and test sets
    iris = load_iris()

    X = iris.data
    y = iris.target

    test_size = int(0.2 * len(y))
    np.random.seed(13)
    indices = np.random.permutation(len(X))
    X_train = X[indices[:-test_size]]
    y_train = y[indices[:-test_size]]
    X_test = X[indices[-test_size:]]
    y_test = y[indices[-test_size:]]

    # Instantiate a HyperoptEstimator with the search space and number of evaluations
    estim = HyperoptEstimator(classifier=any_classifier('my_clf'),
                              preprocessing=any_preprocessing('my_pre'),
                              algo=tpe.suggest,
                              max_evals=100,
                              trial_timeout=120)

    # Search the hyperparameter space based on the data
    estim.fit(X_train, y_train)

    # Show the results
    print(estim.score(X_test, y_test))
    # 1.0

    print(estim.best_model())
    # {'learner': ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
    #            max_depth=3, max_features='log2', max_leaf_nodes=None,
    #            min_impurity_decrease=0.0, min_impurity_split=None,
    #            min_samples_leaf=1, min_samples_split=2,
    #            min_weight_fraction_leaf=0.0, n_estimators=13, n_jobs=1,
    #            oob_score=False, random_state=1, verbose=False,
    #            warm_start=False), 'preprocs': (), 'ex_preprocs': ()}
    
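    The manual permutation split in the example above is equivalent to scikit-learn's train_test_split; a minimal sketch of that equivalence (assuming scikit-learn and NumPy are installed; the random_state value is chosen arbitrarily):

    ```python
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split

    iris = load_iris()
    X, y = iris.data, iris.target

    # 20% held out for testing, matching test_size = int(0.2 * len(y)) above
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=13, shuffle=True)

    print(X_train.shape, X_test.shape)  # (120, 4) (30, 4)
    ```

    Either split works with HyperoptEstimator.fit; the manual version is only shown in the examples to keep the dependency surface small.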

    Here is an example using MNIST that is more specific about the classifier and preprocessing.


    from hpsklearn import HyperoptEstimator, extra_trees
    from sklearn.datasets import fetch_openml  # fetch_mldata has been removed from scikit-learn
    from hyperopt import tpe
    import numpy as np

    # Download the data and split into training and test sets
    digits = fetch_openml('mnist_784', as_frame=False)

    X = digits.data
    y = digits.target.astype(int)

    test_size = int(0.2 * len(y))
    np.random.seed(13)
    indices = np.random.permutation(len(X))
    X_train = X[indices[:-test_size]]
    y_train = y[indices[:-test_size]]
    X_test = X[indices[-test_size:]]
    y_test = y[indices[-test_size:]]

    # Instantiate a HyperoptEstimator with the search space and number of evaluations
    estim = HyperoptEstimator(classifier=extra_trees('my_clf'),
                              preprocessing=[],
                              algo=tpe.suggest,
                              max_evals=10,
                              trial_timeout=300)

    # Search the hyperparameter space based on the data
    estim.fit(X_train, y_train)

    # Show the results
    print(estim.score(X_test, y_test))
    # 0.962785714286

    print(estim.best_model())
    # {'learner': ExtraTreesClassifier(bootstrap=True, class_weight=None, criterion='entropy',
    #            max_depth=None, max_features=0.959202875857,
    #            max_leaf_nodes=None, min_impurity_decrease=0.0,
    #            min_impurity_split=None, min_samples_leaf=1,
    #            min_samples_split=2, min_weight_fraction_leaf=0.0,
    #            n_estimators=20, n_jobs=1, oob_score=False, random_state=3,
    #            verbose=False, warm_start=False), 'preprocs': (), 'ex_preprocs': ()}

    Available Components

    Not all classifiers/regressors/preprocessing from sklearn have been implemented; the currently available ones are listed below. The source code implementing them can be found here.

    Classifiers


    svc
    svc_linear
    svc_rbf
    svc_poly
    svc_sigmoid
    liblinear_svc

    knn

    ada_boost
    gradient_boosting

    random_forest
    extra_trees
    decision_tree

    sgd

    xgboost_classification

    multinomial_nb
    gaussian_nb

    passive_aggressive

    linear_discriminant_analysis
    quadratic_discriminant_analysis

    rbm

    colkmeans

    one_vs_rest
    one_vs_one
    output_code

    For a simple generic search space across many classifiers, use any_classifier. If your data is in sparse matrix format, use any_sparse_classifier.
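    As an illustration of what "sparse matrix format" means here: data such as bag-of-words counts is usually stored as a SciPy CSR matrix rather than a dense array, since only the nonzero entries are kept. A minimal sketch (assuming SciPy and NumPy are installed; the toy matrix is purely illustrative):

    ```python
    import numpy as np
    from scipy.sparse import csr_matrix

    # A mostly-zero feature matrix, e.g. word counts per document
    dense = np.array([[0, 0, 3],
                      [1, 0, 0],
                      [0, 2, 0]])
    X_sparse = csr_matrix(dense)

    # Only the nonzero entries are stored
    print(X_sparse.nnz)    # 3
    print(X_sparse.shape)  # (3, 3)
    ```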

    Regressors


    svr
    svr_linear
    svr_rbf
    svr_poly
    svr_sigmoid

    knn_regression

    ada_boost_regression
    gradient_boosting_regression

    random_forest_regression
    extra_trees_regression

    sgd_regression

    xgboost_regression

    For a simple generic search space across many regressors, use any_regressor. If your data is in sparse matrix format, use any_sparse_regressor.

    Preprocessing


    pca

    one_hot_encoder

    standard_scaler
    min_max_scaler
    normalizer

    ts_lagselector

    tfidf

    For a simple generic search space across many preprocessing algorithms, use any_preprocessing. If you are working with raw text data, use any_text_preprocessing; currently only TFIDF is used for text, but more may be added in the future. Note that the preprocessing parameter of HyperoptEstimator expects a list, because various preprocessing steps can be chained together. The generic search space functions any_preprocessing and any_text_preprocessing already return a list, but the other functions do not, so they should be wrapped in a list. If you do not want to do any preprocessing, pass in an empty list [].
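    The list semantics of the preprocessing parameter mirror chaining steps in a scikit-learn Pipeline, where each step's output feeds the next. A minimal sketch of the same chaining idea in plain scikit-learn (the step names and the PCA-then-scaler order are chosen purely for illustration):

    ```python
    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)

    # Preprocessing steps chain in order, like hyperopt-sklearn's
    # preprocessing=[...] list; an empty list would mean no preprocessing.
    pipe = Pipeline([
        ('pca', PCA(n_components=2)),
        ('scale', StandardScaler()),
        ('clf', SVC()),
    ])
    pipe.fit(X, y)
    print(pipe.score(X, y))
    ```

    hyperopt-sklearn builds the analogous chain internally from the preprocessing list, which is why a single step like pca('my_pca') must still be wrapped as [pca('my_pca')].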
