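Using Label Transforms¶

In this guide, you will learn how to use the transforms that are available on LabelTimes. Each transform returns a copy of the label times. This is useful for trying out multiple transforms in different settings without having to recalculate the labels. As a result, you can see more quickly which labels give better performance.

Generate Labels¶

Start by generating labels on a mock dataset of transactions. Each label is defined as the total amount a customer spent, given one hour of transactions.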

[1]:

import composeml as cp

def total_spent(df):
    return df['amount'].sum()

label_maker = cp.LabelMaker(
    labeling_function=total_spent,
    target_dataframe_index='customer_id',
    time_index='transaction_time',
    window_size='1h',
)

labels = label_maker.search(
    cp.demos.load_transactions(),
    num_examples_per_instance=10,
    minimum_data='2h',
    gap='2min',
    verbose=True,
)

Elapsed: 00:00 | Remaining: 00:00 | Progress: 100%|██████████| customer_id: 50/50
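
To make the labeling function concrete, the sketch below applies total_spent to a single hand-made data slice. The rows are hypothetical and do not come from the demo dataset; only the amount column matters for the result.

```python
import pandas as pd

def total_spent(df):
    # Sum the transaction amounts found in one data slice.
    return df['amount'].sum()

# Hypothetical one-hour window of transactions for a single customer.
window = pd.DataFrame({
    'transaction_time': pd.to_datetime([
        '2014-01-01 00:05:00',
        '2014-01-01 00:20:00',
        '2014-01-01 00:55:00',
    ]),
    'amount': [20.50, 35.25, 10.00],
})

label = total_spent(window)
print(label)
```

The label maker calls the labeling function once per window, so every label in the output is one such sum.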

To get an idea of what the labels look like, preview the data frame.

[2]:

labels.head()

[2]:

	customer_id	time	total_spent
0	1	2014-01-01 02:45:30	217.94
1	1	2014-01-01 02:47:30	217.94
2	1	2014-01-01 02:49:30	217.94
3	1	2014-01-01 02:51:30	217.94
4	1	2014-01-01 02:53:30	217.94

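Threshold on Labels¶

LabelTimes.threshold() creates binary labels by testing whether each label value is above a threshold. In this example, a threshold is applied to determine which customers spent more than 100.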

[3]:

labels.threshold(100).head()

[3]:

	customer_id	time	total_spent
0	1	2014-01-01 02:45:30	True
1	1	2014-01-01 02:47:30	True
2	1	2014-01-01 02:49:30	True
3	1	2014-01-01 02:51:30	True
4	1	2014-01-01 02:53:30	True
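
The effect of a threshold can be sketched in plain pandas. This is an illustration only, not the Compose implementation, and it assumes a strict greater-than comparison; the sample values are chosen by hand.

```python
import pandas as pd

# Hand-picked label values, including one that sits exactly on the threshold.
total_spent = pd.Series([217.94, 53.22, 100.00, 343.69], name='total_spent')

threshold = 100

# Assumption: a value equal to the threshold is NOT counted as above it.
is_high = total_spent > threshold
print(is_high.tolist())
```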

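Lead Label Times¶

LabelTimes.apply_lead() shifts the label times to an earlier moment. This is useful for training a model to predict in advance. In this example, a one-hour lead is applied to the label times.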

[4]:

labels.apply_lead('1h').head()

[4]:

	customer_id	time	total_spent
0	1	2014-01-01 01:45:30	217.94
1	1	2014-01-01 01:47:30	217.94
2	1	2014-01-01 01:49:30	217.94
3	1	2014-01-01 01:51:30	217.94
4	1	2014-01-01 01:53:30	217.94
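
As a rough pandas equivalent, a lead amounts to subtracting a fixed time offset from every label time. The sketch below is illustrative and does not use Compose; the two timestamps are copied from the preview above.

```python
import pandas as pd

# Two label times from the preview of the original labels.
times = pd.Series(pd.to_datetime([
    '2014-01-01 02:45:30',
    '2014-01-01 02:47:30',
]))

# A one-hour lead moves each label time one hour earlier.
lead = pd.Timedelta('1h')
shifted = times - lead
print(shifted.tolist())
```

Only the time moves; the label value stays attached to its row, so a model trained on the shifted times has to predict the value one hour ahead.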

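Bin Labels¶

LabelTimes.bin() bins the labels into discrete intervals. There are two types of bins: value-based and quantile-based. For either type, the bin widths can be user-defined or equal.

Value Based¶

To use bins based on values, quantiles should be set to False, which is the default.

Equal Width¶

To group values into bins of equal width, set bins to a scalar value giving the number of bins. In this example, total_spent is grouped into four bins of equal width.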

[5]:

labels.bin(4, quantiles=False).head()

[5]:

	customer_id	time	total_spent
0	1	2014-01-01 02:45:30	(198.455, 271.072]
1	1	2014-01-01 02:47:30	(198.455, 271.072]
2	1	2014-01-01 02:49:30	(198.455, 271.072]
3	1	2014-01-01 02:51:30	(198.455, 271.072]
4	1	2014-01-01 02:53:30	(198.455, 271.072]
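
Equal-width binning can be illustrated with pandas.cut. The sketch below does not call Compose; it uses five values taken from the descriptive statistics of the labels (minimum, quartiles, and maximum) so that the range matches the example above.

```python
import pandas as pd

# Minimum, quartiles, and maximum of total_spent from this guide.
values = pd.Series([53.22, 196.25, 217.94, 290.39, 343.69])

# Four bins of equal width spanning the range from min to max.
intervals = pd.cut(values, bins=4)

# The same bins expressed as integer positions (0 is the lowest bin).
codes = pd.cut(values, bins=4, labels=False)

print(intervals.tolist())
print(codes.tolist())
```

Each bin covers one quarter of the distance between the smallest and largest value, so the number of labels per bin can differ widely.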

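Custom Widths¶

To group values into bins of custom widths, set bins to an array of values that define the bin edges. In this example, total_spent is grouped into bins of custom widths.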

[6]:

inf = float('inf')
edges = [-inf, 34, 50, 67, inf]
labels.bin(edges, quantiles=False).head()

[6]:

	customer_id	time	total_spent
0	1	2014-01-01 02:45:30	(67.0, inf]
1	1	2014-01-01 02:47:30	(67.0, inf]
2	1	2014-01-01 02:49:30	(67.0, inf]
3	1	2014-01-01 02:51:30	(67.0, inf]
4	1	2014-01-01 02:53:30	(67.0, inf]
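
The same edges can be tried out with pandas.cut to see where individual values land. This is a sketch in plain pandas with hand-picked values, not a call into Compose.

```python
import pandas as pd

inf = float('inf')
edges = [-inf, 34, 50, 67, inf]

# Hand-picked values; 34.00 sits exactly on an edge.
values = pd.Series([20.00, 34.00, 53.22, 217.94])

# Integer position of the bin each value falls into.
codes = pd.cut(values, bins=edges, labels=False)
print(codes.tolist())
```

With pandas defaults, intervals are closed on the right, so a value equal to an edge belongs to the bin that ends at that edge.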

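Quantile Based¶

To use bins based on quantiles, quantiles should be set to True.

Equal Width¶

To group values into quantile bins of equal width, set bins to the number of quantiles as a scalar value (for example, 4 for quartiles or 10 for deciles). In this example, the total spent is grouped into bins based on the quartiles.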

[7]:

labels.bin(4, quantiles=True).head()

[7]:

	customer_id	time	total_spent
0	1	2014-01-01 02:45:30	(196.25, 217.94]
1	1	2014-01-01 02:47:30	(196.25, 217.94]
2	1	2014-01-01 02:49:30	(196.25, 217.94]
3	1	2014-01-01 02:51:30	(196.25, 217.94]
4	1	2014-01-01 02:53:30	(196.25, 217.94]

To verify the quartile values, check the descriptive statistics.

[8]:

stats = labels.total_spent.describe()
stats = stats.round(3).to_string()
print(stats)

count     50.000
mean     215.182
std       90.518
min       53.220
25%      196.250
50%      217.940
75%      290.390
max      343.690
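
Quantile binning can be sketched with pandas.qcut. The example below uses a small hypothetical series of distinct values rather than the labels from this guide, and shows that quartile bins each receive the same number of values.

```python
import pandas as pd

# Hypothetical values, all distinct, so the quartile edges are unique.
values = pd.Series([10, 20, 30, 40, 50, 60, 70, 80])

# Four quantile bins: the edges are the 0%, 25%, 50%, 75%, and 100% quantiles.
codes = pd.qcut(values, q=4, labels=False)

# Count how many values landed in each bin.
counts = codes.value_counts().sort_index()
print(codes.tolist())
print(counts.tolist())
```

Unlike value-based bins, the intervals here can have very different numeric widths; what is equal is the share of the data in each bin.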

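Custom Widths¶

To group values into quantile bins of custom widths, set bins to an array of quantiles. In this example, the total spent is grouped into quantile bins of custom widths.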

[9]:

quantiles = [0, .34, .5, .67, 1]
labels.bin(quantiles, quantiles=True).head()

[9]:

	customer_id	time	total_spent
0	1	2014-01-01 02:45:30	(196.25, 217.94]
1	1	2014-01-01 02:47:30	(196.25, 217.94]
2	1	2014-01-01 02:49:30	(196.25, 217.94]
3	1	2014-01-01 02:51:30	(196.25, 217.94]
4	1	2014-01-01 02:53:30	(196.25, 217.94]
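
To see how an array of quantiles splits the data, the sketch below applies the same quantiles with pandas.qcut to the integers 1 through 100. The data is hypothetical and chosen so the share in each bin is easy to read off.

```python
import pandas as pd

# Hypothetical data: the integers 1 to 100.
values = pd.Series(range(1, 101))

quantiles = [0, .34, .5, .67, 1]

# Each pair of neighboring quantiles defines one bin.
codes = pd.qcut(values, q=quantiles, labels=False)

# Number of values per bin, from the lowest bin to the highest.
counts = codes.value_counts().sort_index().tolist()
print(counts)
```

The counts track the gaps between neighboring quantiles: roughly 34%, 16%, 17%, and 33% of the values.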

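Label Bins¶

To assign custom labels to the bins, set labels to an array of values. The number of labels must match the number of bins. In this example, the total spent is grouped into bins with custom labels.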

[10]:

values = ['low', 'medium', 'high']
labels.bin(3, labels=values).head()

[10]:

	customer_id	time	total_spent
0	1	2014-01-01 02:45:30	medium
1	1	2014-01-01 02:47:30	medium
2	1	2014-01-01 02:49:30	medium
3	1	2014-01-01 02:51:30	medium
4	1	2014-01-01 02:53:30	medium
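
Named bins can be reproduced in plain pandas by passing labels to pandas.cut. The sketch below uses the minimum, median, and maximum of total_spent from this guide and is not a call into Compose.

```python
import pandas as pd

# Minimum, median, and maximum of total_spent from this guide.
values = pd.Series([53.22, 217.94, 343.69])

names = ['low', 'medium', 'high']

# Three equal-width bins, each replaced by its name.
binned = pd.cut(values, bins=3, labels=names)
print(binned.tolist())
```

In pandas, passing a number of labels that differs from the number of bins raises a ValueError.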

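Describe Labels¶

LabelTimes.describe() prints the label distribution together with the settings and transforms that you used to make the labels. This is useful as a reference for understanding how the labels were generated from raw data. The label distribution also helps to determine whether the labels are imbalanced. In this example, a description of the labels is printed after transforming the labels into discrete values.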

[11]:

labels.threshold(100).describe()

Label Distribution
------------------
False      8
True      42
Total:    50


Settings
--------
gap                                 2min
maximum_data                        None
minimum_data                          2h
num_examples_per_instance             10
target_column                total_spent
target_dataframe_index       customer_id
target_type                     discrete
window_size                           1h


Transforms
----------
1. threshold
  - value:    100
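
The imbalance in a distribution like the one above can be quantified with a few lines of pandas. The sketch rebuilds a binary series with the same counts (42 True, 8 False) by hand; it does not read the values from Compose.

```python
import pandas as pd

# Binary labels with the same counts as the distribution printed above.
binary = pd.Series([True] * 42 + [False] * 8, name='total_spent')

# Number of labels per class.
distribution = binary.value_counts().to_dict()

# Share of positive labels; values far from 0.5 indicate imbalance.
positive_share = binary.mean()
print(distribution, positive_share)
```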

Sample Labels¶

LabelTimes.sample() samples the labels based on a number or a fraction. Samples can be reproduced by fixing random_state to an integer.

To sample 10 labels, set n to 10.

[12]:

labels.sample(n=10, random_state=0)

[12]:

	customer_id	time	total_spent
2	1	2014-01-01 02:49:30	217.94
4	1	2014-01-01 02:53:30	217.94
10	2	2014-01-01 02:00:00	290.39
11	2	2014-01-01 02:02:00	290.39
22	3	2014-01-01 03:49:05	196.25
27	3	2014-01-01 03:59:05	196.25
28	3	2014-01-01 04:01:05	196.25
31	4	2014-01-01 02:41:00	343.69
38	4	2014-01-01 02:55:00	225.18
41	5	2014-01-01 03:48:25	53.22

Similarly, to sample 10% of the labels, set frac to 0.1.

[13]:

labels.sample(frac=.1, random_state=0)

[13]:

	customer_id	time	total_spent
2	1	2014-01-01 02:49:30	217.94
10	2	2014-01-01 02:00:00	290.39
11	2	2014-01-01 02:02:00	290.39
28	3	2014-01-01 04:01:05	196.25
41	5	2014-01-01 03:48:25	53.22
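
The role of random_state can be checked with pandas.DataFrame.sample on a stand-in data frame. The frame below is hypothetical (50 rows with made-up values), and the result is sorted by index only to make the comparison easy to read.

```python
import pandas as pd

# Hypothetical stand-in for the labels: 50 rows, 10 per customer.
labels_df = pd.DataFrame({
    'customer_id': [i // 10 + 1 for i in range(50)],
    'total_spent': [float(i) for i in range(50)],
})

# Sample by number and by fraction with a fixed seed.
by_count = labels_df.sample(n=10, random_state=0).sort_index()
by_frac = labels_df.sample(frac=.1, random_state=0).sort_index()

# Repeating the call with the same seed selects the same rows.
repeat = labels_df.sample(n=10, random_state=0).sort_index()
print(len(by_count), len(by_frac), by_count.equals(repeat))
```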

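Categorical Labels¶

When working with categorical labels, you can use a dictionary to sample a number or a fraction of labels for each category. First, bin the labels into 4 bins to make them categorical.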

[14]:

categorical = labels.bin(4, labels=['A', 'B', 'C', 'D'])

To sample 2 labels per category, map each category to the number 2.

[15]:

n = {'A': 2, 'B': 2, 'C': 2, 'D': 2}
categorical.sample(n=n, random_state=0)

[15]:

	customer_id	time	total_spent
6	1	2014-01-01 02:57:30	C
11	2	2014-01-01 02:02:00	D
16	2	2014-01-01 02:12:00	D
26	3	2014-01-01 03:57:05	B
38	4	2014-01-01 02:55:00	C
42	5	2014-01-01 03:50:25	A
46	5	2014-01-01 03:58:25	A
48	5	2014-01-01 04:02:25	B

Similarly, to sample 10% of the labels per category, map each category to 0.1.

[16]:

frac = {'A': .1, 'B': .1, 'C': .1, 'D': .1}
categorical.sample(frac=frac, random_state=0)

[16]:

	customer_id	time	total_spent
6	1	2014-01-01 02:57:30	C
11	2	2014-01-01 02:02:00	D
16	2	2014-01-01 02:12:00	D
26	3	2014-01-01 03:57:05	B
46	5	2014-01-01 03:58:25	A
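
Per-category sampling can be approximated in plain pandas by sampling each category separately and concatenating the parts. This is a sketch of the idea on a hypothetical data frame; it is not how Compose implements sample(), and it will not reproduce the exact rows shown above.

```python
import pandas as pd

# Hypothetical categorical labels: three rows in each of four categories.
df = pd.DataFrame({
    'customer_id': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    'total_spent': list('AAABBBCCCDDD'),
})

# Number of labels to draw from each category.
n = {'A': 2, 'B': 2, 'C': 2, 'D': 2}

# Sample every category on its own, then stitch the parts back together.
parts = [
    df[df['total_spent'] == category].sample(n=count, random_state=0)
    for category, count in n.items()
]
sampled = pd.concat(parts).sort_index()

per_category = sampled['total_spent'].value_counts().to_dict()
print(per_category)
```

Sampling an equal number from each category is one way to rebalance labels before training.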
