Machine Learning Project in Practice: Credit Card Fraud Detection
Model evaluation
Recall: $Recall = \frac{TP}{TP+FN}$
- TP (true positives): positive samples predicted as positive
- FP (false positives): negative samples predicted as positive
- FN (false negatives): positive samples predicted as negative
- TN (true negatives): negative samples predicted as negative
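As a quick check of the formula, the sketch below computes recall from the counts. The function name `recall_from_counts` and the example counts are made up for illustration.

```python
def recall_from_counts(tp: int, fn: int) -> float:
    """Recall = TP / (TP + FN): the share of actual positives that were caught."""
    if tp + fn == 0:
        raise ValueError("recall is undefined when there are no actual positives")
    return tp / (tp + fn)


# Hypothetical counts: 100 fraud cases, of which 90 were flagged and 10 were missed.
example_recall = recall_from_counts(tp=90, fn=10)
print(example_recall)  # 0.9
```

Recall ignores TN and FP, so it can stay high even when many normal transactions are wrongly flagged.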
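Regularization penalty
Keep the model's weights from fluctuating too much: large fluctuations tend to cause overfitting (the model performs well on the training set but poorly on the test set). Regularization reduces the fluctuation by heavily penalizing models whose weights fluctuate widely.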
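To add regularization, modify the loss function. L1 regularization:
$C = C_0 + \frac{\lambda}{n} \sum_{w} |w|$
$C_0$: the original loss function; $C$: the loss function after the regularization penalty; $w$: the weight parameters; $n$: the number of training samples; $\lambda$: the parameter that controls how strongly the weights are penalized.
L2 regularization:
$C = C_0 + \frac{\lambda}{2n} \sum_{w} w^2$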
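The right penalty strength is not known in advance, so start with c_param_range = [0.01, 0.1, 1, 10, 100], run cross-validation for each of these five values, and compare the results. Note that in scikit-learn's LogisticRegression, the parameter C is the inverse of the regularization strength, so a smaller C means a stronger penalty.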
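To make the two penalties concrete, here is a minimal sketch that evaluates both penalized losses for two made-up weight vectors. The function names, the weights, and the values of λ, n, and C0 are assumptions chosen only for illustration.

```python
def l1_penalized_loss(base_loss, weights, lam, n):
    """C = C0 + (lambda / n) * sum(|w|)"""
    return base_loss + (lam / n) * sum(abs(w) for w in weights)


def l2_penalized_loss(base_loss, weights, lam, n):
    """C = C0 + (lambda / (2n)) * sum(w^2)"""
    return base_loss + (lam / (2 * n)) * sum(w * w for w in weights)


# Two hypothetical models with the same unpenalized loss C0 = 1.0.
smooth_weights = [0.5, 0.5, 0.5, 0.5]  # weights spread evenly
spiky_weights = [2.0, 0.0, 0.0, 0.0]   # one large weight

for name, w in [("smooth", smooth_weights), ("spiky", spiky_weights)]:
    print(name,
          l1_penalized_loss(1.0, w, lam=1.0, n=4),
          l2_penalized_loss(1.0, w, lam=1.0, n=4))
# smooth 1.5 1.125
# spiky 1.5 1.5
```

Both weight vectors have the same sum of absolute values, so the L1 term treats them equally, while the L2 term charges more for the single large weight.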
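Cross-validation
Split the dataset 80%/20% into a training set and a test set. The training set can be further split into three equal parts A, B, and C; each round trains on two parts and validates on the remaining one, giving three rounds of cross-validation whose scores are then averaged. (The code below uses a 70%/30% split and 5 folds.)
This makes it possible to choose the parameter value while reducing the influence of erroneous values and outliers.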
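The splitting scheme can be sketched in plain Python as follows. This is only an illustration of contiguous, unshuffled folds; `k_fold_indices` is a hypothetical helper, not a scikit-learn API, and the per-fold scores are placeholder numbers.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, validation_indices) for contiguous, unshuffled folds."""
    # Spread the remainder over the first folds so fold sizes differ by at most one.
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        validation = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, validation
        start += size


# Nine samples split into three parts A, B, C; each part is the validation set once.
folds = list(k_fold_indices(n_samples=9, n_folds=3))
for train, validation in folds:
    print("train:", train, "validation:", validation)

# Placeholder per-fold scores (made-up numbers) averaged into one estimate.
fold_scores = [0.90, 0.80, 0.85]
mean_score = sum(fold_scores) / len(fold_scores)
```

Averaging over folds means a single unlucky split, for example one that happens to contain most of the outliers, has less influence on which parameter value is chosen.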
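Confusion matrix
A confusion matrix is laid out as a grid: the x-axis shows the predicted label and the y-axis shows the true label.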
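The layout can be illustrated with a small pure-Python sketch that uses the same orientation (rows are true labels, columns are predicted labels). The helper name `confusion_counts` and the labels are made up for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Return a 2x2 matrix m where m[true_label][predicted_label] is a count."""
    matrix = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix


# Made-up labels: 1 = fraud, 0 = normal.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 0]

cm = confusion_counts(y_true, y_pred)
tn, fp = cm[0]  # row 0: samples that are actually normal
fn, tp = cm[1]  # row 1: samples that are actually fraud
print(cm)              # [[3, 1], [1, 3]]
print(tp / (tp + fn))  # 0.75
```

With this orientation, recall is read off the bottom row: the bottom-right cell divided by the sum of the bottom row.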
Full code
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
data = pd.read_csv("creditcard.csv")
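def Class列的种类和数量():
    """Plot the kinds and counts of values in the Class column (0 = normal, 1 = fraud)."""
    # Series.value_counts replaces the deprecated top-level pd.value_counts
    count_classes = data['Class'].value_counts(sort=True).sort_index()
    count_classes.plot(kind='bar')
    plt.title('Fraud class histogram')
    plt.xlabel('Class')
    plt.ylabel('Frequency')
    plt.show()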
from sklearn.preprocessing import StandardScaler
# Standardize the Amount column
data['normAmount'] = StandardScaler().fit_transform(data['Amount'].values.reshape(-1, 1))
# Drop the features that are no longer needed
data = data.drop(['Time', 'Amount'], axis=1)
# Split into features X and label y
X = data.loc[:, data.columns != 'Class']
y = data.loc[:, data.columns == 'Class']
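#-- Undersampling: recall is high, but normal transactions are easily flagged as fraud --
# Index labels of rows with Class == 0
normal_indices = np.array(data[data.Class == 0].index)
# Index labels of rows with Class == 1
fraud_indices = np.array(data[data.Class == 1].index)
# Randomly pick as many normal rows as there are fraud rows
random_normal_indices = np.random.choice(normal_indices, len(fraud_indices), replace=False) # ndarray
# Build the undersampled data
us_indices = np.concatenate([fraud_indices, random_normal_indices])
us_data = data.loc[us_indices, :] # these are index labels, so select with .loc
# Split into features and label
us_X = us_data.loc[:, us_data.columns != 'Class']
us_y = us_data.loc[:, us_data.columns == 'Class']
#-- end --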
#-- Split into training and test sets --
from sklearn.model_selection import train_test_split
# test_size: fraction of the data held out as the test set
# shuffle=True by default; random_state fixes the random seed
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
us_X_train, us_X_test, us_y_train, us_y_test = train_test_split(us_X, us_y, test_size=0.3, random_state=0)
#-- end --
from sklearn.linear_model import LogisticRegression # logistic regression model
from sklearn.model_selection import KFold, cross_val_score # KFold splits the data for cross-validation
from sklearn.metrics import confusion_matrix, recall_score, classification_report
def printing_Kfold_scores(x_train_data, y_train_data):
    fold = KFold(5, shuffle=False) # split into 5 folds for cross-validation
    # Candidate penalty strengths
    c_param_range = [0.01, 0.1, 1, 10, 100]
    result_table = pd.DataFrame(columns=['C_parameter', 'Mean recall score'])
    result_table['C_parameter'] = c_param_range
    j = 0 # index of the current penalty strength
    # Loop to find the best penalty strength
    for c_param in c_param_range:
        print('-------------------------------------------')
        print('C parameter:', c_param)
        print('-------------------------------------------\n')
        recall_accs = []
        for iteration, indices in enumerate(fold.split(x_train_data)):
            # fold.split(x_train_data) yields (train_indices, test_indices)
            # Build a logistic regression model with this value of C
            lr = LogisticRegression(C=c_param, penalty='l1', solver='liblinear', max_iter=10000)
            # Warning seen otherwise: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
            # Fix: set solver='liblinear' and raise max_iter to 10000 (the default is 100)
            # Fit on the training folds; y is a DataFrame, so select rows with .iloc and flatten with .ravel()
            lr.fit(x_train_data.iloc[indices[0], :], y_train_data.iloc[indices[0], :].values.ravel())
            # Predict on the held-out fold
            y_pred = lr.predict(x_train_data.iloc[indices[1], :])
            # Recall on the held-out fold, which reflects how well this C performs
            recall_acc = recall_score(y_train_data.iloc[indices[1], :].values.ravel(), y_pred)
            recall_accs.append(recall_acc)
            print('Iteration: {}, recall score = {}'.format(iteration, recall_acc))
        # Mean score over all folds
        result_table.loc[j, 'Mean recall score'] = np.mean(recall_accs)
        j += 1
        print('')
        print('Mean recall score ', np.mean(recall_accs))
        print('')
    # Note: this line raised an error in the original tutorial code, which lacked astype('float64')
    best_c = result_table.loc[result_table['Mean recall score'].astype('float64').idxmax()]['C_parameter']
    # Finally, we can check which C parameter is the best amongst the chosen.
    print('*********************************************************************************')
    print('Best model to choose from cross validation is with C parameter', best_c)
    print('*********************************************************************************')
    return best_c
def plot_confusion_matrix(cm, classes, title='Confusion matrix', cmap=plt.cm.Blues):
    """
    Plot a confusion matrix.
    cm: the confusion matrix returned by confusion_matrix
    classes: class labels, e.g. [0, 1]
    cmap: colormap; plt.cm.Blues gives a blue palette
    """
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=0)
    plt.yticks(tick_marks, classes)
    # Text color inside each cell:
    # cells above half the maximum count are dark blue, so use white text
    # the remaining cells are light, so use black text
    import itertools
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        color = "white" if cm[i, j] > cm.max() / 2 else "black"
        plt.text(j, i, cm[i, j], horizontalalignment="center", color=color)
    plt.tight_layout()
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
# Find the best parameter
best_c = printing_Kfold_scores(us_X_train, us_y_train)
# Create the lr object
lr = LogisticRegression(C=best_c, penalty='l1', solver='liblinear', max_iter=10000)
lr.fit(us_X_train, us_y_train.values.ravel())
# -- 1. Plot the confusion matrix with the default 0.5 threshold --
# Predict labels
us_y_pred = lr.predict(us_X_test)
# Compute the confusion matrix from the true and predicted labels
cnf_matrix = confusion_matrix(us_y_test, us_y_pred)
# Print the evaluation score recall = TP / (FN + TP)
np.set_printoptions(precision=2) # print two decimal places
recall_acc = cnf_matrix[1, 1] / (cnf_matrix[1, 0] + cnf_matrix[1, 1])
print("Recall metric in the testing dataset:", recall_acc)
# Plot the confusion matrix
class_names = [0, 1]
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names)
plt.show()
# -- end --
# -- 2. Choose the classification threshold --
# Predict class probabilities
us_y_pred_proba = lr.predict_proba(us_X_test)
thresholds = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9] # list of thresholds
plt.figure(figsize=(10, 10)) # figure size
j = 1
for i in thresholds:
    us_y_pred = us_y_pred_proba[:, 1] > i
    plt.subplot(3, 3, j) # 3x3 grid of subplots; draw in position j
    j += 1
    cnf_matrix = confusion_matrix(us_y_test, us_y_pred)
    # Print the evaluation score recall = TP / (FN + TP)
    np.set_printoptions(precision=2) # print two decimal places
    recall_acc = cnf_matrix[1, 1] / (cnf_matrix[1, 0] + cnf_matrix[1, 1])
    print("Recall metric in the testing dataset:", recall_acc)
    # Plot the confusion matrix
    class_names = [0, 1]
    plot_confusion_matrix(cnf_matrix, classes=class_names)
plt.show()
# -- end --
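The effect of the threshold sweep above can be seen on a toy example. The sketch below uses made-up labels and probabilities, and `recall_and_false_positives` is a hypothetical helper written for this illustration; it assumes the data contains at least one positive sample.

```python
def recall_and_false_positives(y_true, fraud_probability, threshold):
    """Classify as fraud when probability > threshold; return (recall, false positives)."""
    tp = fp = fn = 0
    for actual, p in zip(y_true, fraud_probability):
        predicted = 1 if p > threshold else 0
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 1 and predicted == 0:
            fn += 1
        elif actual == 0 and predicted == 1:
            fp += 1
    return tp / (tp + fn), fp


# Made-up labels and predicted fraud probabilities.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
fraud_probability = [0.05, 0.25, 0.45, 0.65, 0.35, 0.55, 0.75, 0.95]

for threshold in [0.1, 0.5, 0.9]:
    recall, false_positives = recall_and_false_positives(y_true, fraud_probability, threshold)
    print(threshold, recall, false_positives)
# 0.1 1.0 3
# 0.5 0.75 1
# 0.9 0.25 0
```

A low threshold catches every fraud case at the cost of more false alarms, and a high threshold does the opposite, so the threshold has to be chosen with both numbers in view.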
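The model above was built with undersampling. Its test results are reasonable, but it uses only a small fraction of the data, and evaluating it on the original test set produces a large number of false positives (normal transactions flagged as fraud).
With imbalanced data, the more of the data you can use, the better. In this case, oversampling gives somewhat better results. Full oversampling code: https://blog.csdn.net/weixin_43326122/article/details/106264721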
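For orientation only, the simplest form of oversampling is to duplicate randomly chosen minority rows until the classes are balanced. The sketch below shows that idea in plain Python; it is not necessarily the method used in the linked post, and `random_oversample` and the toy data are assumptions made for illustration.

```python
import random


def random_oversample(rows, labels, minority_label=1, seed=0):
    """Duplicate random minority rows until both classes have the same count."""
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    majority = [i for i, y in enumerate(labels) if y != minority_label]
    if not minority:
        raise ValueError("no minority samples to oversample")
    # Sample minority indices with replacement to close the gap.
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    chosen = majority + minority + extra
    return [rows[i] for i in chosen], [labels[i] for i in chosen]


# Toy data: five normal rows and one fraud row.
rows = [[0.1], [0.2], [0.3], [0.4], [0.5], [9.9]]
labels = [0, 0, 0, 0, 0, 1]

balanced_rows, balanced_labels = random_oversample(rows, labels)
print(balanced_labels.count(0), balanced_labels.count(1))  # 5 5
```

Oversampling is normally applied to the training split only, so that duplicated rows do not leak into the data used for evaluation.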