Machine Learning Project in Practice: Credit Card Fraud Detection

韭浪 · Published 2020-05-15 22:43:54

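Model evaluation metrics

Recall: $Recall = \frac{TP}{TP+FN}$

TP (true positives): positive samples classified as positive
FP (false positives): negative samples classified as positive
FN (false negatives): positive samples classified as negative
TN (true negatives): negative samples classified as negative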
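
The recall formula can be sketched in a few lines of plain Python. The counts used here are made-up illustrative numbers, not results from the credit card dataset, and the helper name `recall` is hypothetical.

```python
def recall(tp, fn):
    """Recall = TP / (TP + FN): the share of actual positives that were caught."""
    if tp + fn == 0:
        raise ValueError("recall is undefined when there are no positive samples")
    return tp / (tp + fn)


# Hypothetical counts: 90 frauds detected, 10 frauds missed.
example_recall = recall(tp=90, fn=10)
print(example_recall)  # 0.9
```

Recall ignores TN entirely, which is why it is a more useful signal than accuracy when almost every sample is negative.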

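Regularization penalty

Regularization tries to keep the model's weights from fluctuating too much, because large fluctuations tend to cause overfitting (overfitting: the model performs well on the training set but poorly on the test set). It reduces the fluctuation by heavily penalizing models whose weights vary widely.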

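Regularization modifies the loss function. L1 regularization:

$C = C_0 + \frac{\lambda}{n} \sum\limits_{w} |w|$

$C_0$: the original loss function; $C$: the loss function after the regularization penalty; $w$: the weight parameters; $n$: the number of training samples. The parameter $\lambda$ controls how strongly the weights are penalized.

L2 regularization:

$C = C_0 + \frac{\lambda}{2n} \sum\limits_{w} w^2$

The right penalty strength is not known in advance, so start with C_param_range = [0.01, 0.1, 1, 10, 100], run cross-validation for each of the five values, and compare the results. Note that in scikit-learn's LogisticRegression, C is the inverse of the regularization strength, so a smaller C means a stronger penalty.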
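
To make the difference between the two penalty terms concrete, here is a small sketch in plain Python. The weight vectors, `lam`, and `n` are arbitrary illustrative values, and the helper names `l1_penalty` and `l2_penalty` are hypothetical.

```python
def l1_penalty(weights, lam, n):
    """L1 term: (lambda / n) * sum(|w|)."""
    return lam / n * sum(abs(w) for w in weights)


def l2_penalty(weights, lam, n):
    """L2 term: (lambda / (2n)) * sum(w^2)."""
    return lam / (2 * n) * sum(w * w for w in weights)


# Two hypothetical weight vectors with the same sum of absolute values (4.0).
smooth = [1.0, 1.0, 1.0, 1.0]   # evenly spread weights
spiky = [4.0, 0.0, 0.0, 0.0]    # one large weight
lam, n = 1.0, 4

l1_smooth, l1_spiky = l1_penalty(smooth, lam, n), l1_penalty(spiky, lam, n)
l2_smooth, l2_spiky = l2_penalty(smooth, lam, n), l2_penalty(spiky, lam, n)
print(l1_smooth, l1_spiky)  # 1.0 1.0
print(l2_smooth, l2_spiky)  # 0.5 2.0
```

Under these assumptions the L1 term treats both vectors the same, while the L2 term charges the vector with one large weight four times as much.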

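Cross-validation

Split the dataset into a training set (80%) and a test set (20%). The training set can be further divided into three equal parts A, B, and C: each round trains on two parts and validates on the third, giving three rounds of cross-validation whose scores are averaged.

This helps select the parameter value while reducing the influence of erroneous values and outliers.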
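
The procedure can be sketched with scikit-learn as follows. This is only an illustration under stated assumptions: the data is synthetic (generated with `make_classification`) rather than the credit card dataset, and the model uses scikit-learn's default L2 penalty.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic, mildly imbalanced data standing in for the training set (assumption).
X, y = make_classification(n_samples=600, n_features=5,
                           weights=[0.8, 0.2], random_state=0)

c_values = [0.01, 0.1, 1, 10, 100]
folds = KFold(n_splits=3)  # three parts: train on two, validate on the third

mean_recall = {}
for c in c_values:
    model = LogisticRegression(C=c, solver="liblinear")
    # One recall score per fold, then averaged
    scores = cross_val_score(model, X, y, cv=folds, scoring="recall")
    mean_recall[c] = float(np.mean(scores))

best_c = max(mean_recall, key=mean_recall.get)
print(mean_recall)
print("Best C:", best_c)
```

The value of C with the highest mean recall across the three folds is the one carried forward to the final model.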

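Confusion matrix

The matrix is drawn on a pair of axes: the x-axis shows the predicted label and the y-axis shows the actual label.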
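
A tiny example helps pin down the layout. With scikit-learn's `confusion_matrix`, rows correspond to the actual labels and columns to the predicted labels. The label lists here are made up for illustration.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels (1 = fraud, 0 = normal).
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[1 1]
#  [1 2]]

# For binary labels, ravel() returns the cells in the order TN, FP, FN, TP.
tn, fp, fn, tp = (int(v) for v in cm.ravel())
print(tp / (tp + fn))  # recall, taken from the second row of the matrix
```

Here `cm[1, 1]` is TP and `cm[1, 0]` is FN, so recall can be read directly off the second row.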

Full code

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

data = pd.read_csv("creditcard.csv")

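def Class列的种类和数量():
    # Bar chart of how many rows fall into each Class value (0 = normal, 1 = fraud)
    # pd.value_counts is deprecated in recent pandas; use the Series method instead
    count_classes = data['Class'].value_counts().sort_index()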
    count_classes.plot(kind='bar')
    plt.title('Fraud class histogram')
    plt.xlabel('Class')
    plt.ylabel('Frequency')
    plt.show()


from sklearn.preprocessing import StandardScaler
# Standardize the Amount column
data['normAmount'] = StandardScaler().fit_transform(data['Amount'].values.reshape(-1, 1))
# Drop the features that are no longer needed
data = data.drop(['Time', 'Amount'], axis=1)


# Split into features and labels
X = data.loc[:, data.columns != 'Class']
y = data.loc[:, data.columns == 'Class']


#-- Undersampling: higher recall, but prone to false positives --
# Index labels of the rows with Class 0 (normal)
normal_indices = np.array(data[data.Class == 0].index)
# Index labels of the rows with Class 1 (fraud)
fraud_indices = np.array(data[data.Class == 1].index)

# Randomly pick as many normal rows as there are fraud rows
random_normal_indices = np.random.choice(normal_indices, len(fraud_indices), replace=False)  # ndarray

# Build the undersampled dataset
us_indices = np.concatenate([fraud_indices, random_normal_indices])
us_data = data.loc[us_indices, :]  # us_indices holds index labels, so select with .loc

# Split into features and labels
us_X = us_data.loc[:, us_data.columns != 'Class']
us_y = us_data.loc[:, us_data.columns == 'Class']
#-- end --


#-- Split into training and test sets --
from sklearn.model_selection import train_test_split
# test_size: fraction of the data held out as the test set
# shuffle=True by default; random_state fixes the random seed
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
us_X_train, us_X_test, us_y_train, us_y_test = train_test_split(us_X, us_y, test_size=0.3, random_state=0)
#-- end --


from sklearn.linear_model import LogisticRegression  # logistic regression model
from sklearn.model_selection import KFold, cross_val_score  # KFold splits the data for cross-validation
from sklearn.metrics import confusion_matrix, recall_score, classification_report

def printing_Kfold_scores(x_train_data, y_train_data):
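    fold = KFold(5, shuffle=False)  # split into 5 folds for cross-validation

    # Candidate values for C (inverse of the regularization strength)
    c_param_range = [0.01, 0.1, 1, 10, 100]

    result_table = pd.DataFrame(columns=['C_parameter', 'Mean recall score'])
    result_table['C_parameter'] = c_param_range
    
    j=0  # index of the current C value
    # Loop over the candidates to find the best C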
    for c_param in c_param_range:
        print('-------------------------------------------')
        print('C parameter:', c_param)
        print('-------------------------------------------\n')
        
        recall_accs = []
        for iteration, indices in enumerate(fold.split(x_train_data)):
            # fold.split(x_train_data) yields (train_indices, test_indices)
            
            # Build a logistic regression model with the current C
            lr = LogisticRegression(C = c_param, penalty='l1', solver='liblinear',max_iter=10000)
            # Warning seen: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
            # Fix: use solver='liblinear' and raise max_iter to 10000 (the default is 100)

            # Fit on the training folds; select rows by position with .iloc and flatten y to 1-D
            lr.fit(x_train_data.iloc[indices[0], :], y_train_data.iloc[indices[0], :].values.ravel())
             
            # Predict on the validation fold (kept as a DataFrame so feature names match the fit)
            y_pred = lr.predict(x_train_data.iloc[indices[1], :])
             
            # Recall on the validation fold, which reflects the current C
            recall_acc = recall_score(y_train_data.iloc[indices[1], :].values.ravel(), y_pred)
            recall_accs.append(recall_acc)
            print('Iteration: {}, recall score = {}'.format(iteration, recall_acc))
             
        # Mean score across the cross-validation folds
        result_table.loc[j, 'Mean recall score'] = np.mean(recall_accs)
        j += 1
        print('')
        print('Mean recall score ', np.mean(recall_accs))
        print('')
         
    # Note: the original source code raised an error here because it lacked astype('float64')
    best_c = result_table.loc[result_table['Mean recall score'].astype('float64').idxmax()]['C_parameter']
    # Finally, we can check which C parameter is the best amongst the chosen.
    print('*********************************************************************************')
    print('Best model to choose from cross validation is with C parameter', best_c)
    print('*********************************************************************************')
     
    return best_c


def plot_confusion_matrix(cm, classes, title='Confusion matrix', cmap=plt.cm.Blues):
    """
    Plot a confusion matrix.
    cm: the confusion matrix (array returned by confusion_matrix)
    classes: class labels, e.g. [0, 1]
    cmap: colormap; plt.cm.Blues gives a blue palette
    """

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=0)
    plt.yticks(tick_marks, classes)
 
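    # Text color inside each cell:
    # cells above half of the maximum are drawn dark blue, so use white text;
    # the remaining cells are light, so use black text
    import itertools
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        color="white" if cm[i, j] > cm.max() / 2 else "black"
        plt.text(j, i, cm[i, j], horizontalalignment="center", color=color)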
 
    plt.tight_layout()
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
    

# Find the best C by cross-validation
best_c = printing_Kfold_scores(us_X_train, us_y_train)
# Train the final model with the best C
lr = LogisticRegression(C=best_c, penalty='l1', solver='liblinear', max_iter=10000)
lr.fit(us_X_train, us_y_train.values.ravel())

# -- 1. Confusion matrix with the default 0.5 threshold --
# Predict labels for the test set
us_y_pred = lr.predict(us_X_test)
# Build the confusion matrix from the true and predicted labels
cnf_matrix = confusion_matrix(us_y_test, us_y_pred)
# Print the recall score: recall = TP / (FN + TP)
np.set_printoptions(precision=2)  # print two decimal places
recall_acc = cnf_matrix[1, 1] / (cnf_matrix[1, 0] + cnf_matrix[1, 1])
print("Recall metric in the testing dataset:", recall_acc)
# Plot the confusion matrix
class_names = [0, 1]
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names)
plt.show()
# -- end --

# -- 2. Compare confusion matrices across classification thresholds --
# Predicted class probabilities
us_y_pred_proba = lr.predict_proba(us_X_test)
thresholds = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9] # list of thresholds
plt.figure(figsize=(10, 10))  # figure size

j = 1
for i in thresholds:
    # Flag a sample as fraud when its predicted fraud probability exceeds the threshold
    us_y_pred = us_y_pred_proba[:, 1] > i
    plt.subplot(3, 3, j)  # 3x3 grid of subplots; draw in position j
    j += 1

    cnf_matrix = confusion_matrix(us_y_test, us_y_pred)

    # Print the recall score: recall = TP / (FN + TP)
    np.set_printoptions(precision=2)  # print two decimal places
    recall_acc = cnf_matrix[1, 1] / (cnf_matrix[1, 0] + cnf_matrix[1, 1])
    print("Recall metric in the testing dataset:", recall_acc)

    # Plot the confusion matrix, labeled with its threshold
    class_names = [0, 1]
    plot_confusion_matrix(cnf_matrix, classes=class_names, title='Threshold > {}'.format(i))
plt.show()
# -- end --

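The model above was built with undersampling. It scores reasonably well on the undersampled test set, but it uses only a small fraction of the available data. Evaluated on the original test set instead, it would flag a large number of normal transactions as fraud (false positives).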
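
A quick back-of-the-envelope calculation shows why false positives pile up on the original, highly imbalanced test set. Every number below is hypothetical and chosen only to illustrate the effect. None of them are measured results from this model.

```python
# All numbers below are hypothetical, chosen only to illustrate the effect.
n_normal = 85_000            # normal transactions in the full test set
n_fraud = 150                # fraudulent transactions in the full test set
recall = 0.90                # share of frauds the model catches
false_positive_rate = 0.03   # share of normal transactions wrongly flagged

tp = round(recall * n_fraud)                # frauds correctly flagged
fp = round(false_positive_rate * n_normal)  # normal transactions wrongly flagged
precision = tp / (tp + fp)                  # share of flagged transactions that are fraud

print(tp, fp, round(precision, 3))  # 135 2550 0.05
```

Under these assumptions, even a small false positive rate means that only about one in twenty flagged transactions is actually fraudulent, because normal transactions vastly outnumber fraudulent ones.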

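With imbalanced data, the more of the data you can use, the better. In this case, oversampling gives somewhat better results. Full oversampling code: https://blog.csdn.net/weixin_43326122/article/details/106264721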
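
For orientation, the simplest form of oversampling is to duplicate minority-class rows at random until the classes are balanced. The sketch below shows that idea with NumPy on toy data. It is not the code from the linked article, which may use a different technique such as SMOTE, and the helper name `random_oversample` is hypothetical.

```python
import numpy as np


def random_oversample(X, y, minority_label=1, seed=0):
    """Duplicate random minority rows until both classes have the same size.

    Assumes binary labels and that the minority class is the smaller one.
    """
    rng = np.random.default_rng(seed)
    minority_idx = np.flatnonzero(y == minority_label)
    majority_idx = np.flatnonzero(y != minority_label)

    # Number of extra minority rows needed to match the majority class
    n_extra = len(majority_idx) - len(minority_idx)
    extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)

    all_idx = np.concatenate([majority_idx, minority_idx, extra_idx])
    rng.shuffle(all_idx)
    return X[all_idx], y[all_idx]


# Toy data: 8 normal rows and 2 fraud rows.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0] * 8 + [1] * 2)

X_os, y_os = random_oversample(X_toy, y_toy)
print(X_os.shape, int((y_os == 0).sum()), int((y_os == 1).sum()))  # (16, 2) 8 8
```

Oversampling should be applied to the training split only, so that the test set keeps the original class ratio and duplicated rows never leak into evaluation.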
