python 字符串方法

原文：Python String Methods

校验：Kitty Du

自豪地采用谷歌翻译

# HIDDEN
# Clear previously defined variables
%reset -f

# Set directory for data loading to work properly
import os
os.chdir(os.path.expanduser('~/notebooks/08'))

# HIDDEN
import warnings
# Ignore numpy dtype warnings. These warnings are caused by an interaction
# between numpy and Cython and can be safely ignored.
# Reference: https://stackoverflow.com/a/40846742
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
import nbinteract as nbi

sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.options.display.max_rows = 7
pd.options.display.max_columns = 8
pd.set_option('precision', 2)
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)

python 为基本的字符串操作提供了多种方法。虽然这些方法简单，但是作为基础组合在一起形成了更加复杂的字符串操作。我们将在“处理文本：数据清洗”的通用用例中介绍 Python 的字符串方法。

清洗文本数据

数据通常来自几个不同的来源，每个来源都实现了自己的信息编码方式。在下面的示例中，我们用一个表记录一个县(County)所属的州(State)，另一个表记录该县的人口(Population)。

# HIDDEN
state = pd.DataFrame({
    'County': [
        'De Witt County',
        'Lac qui Parle County',
        'Lewis and Clark County',
        'St John the Baptist Parish',
    ],
    'State': [
        'IL',
        'MN',
        'MT',
        'LA',
    ]
})
population = pd.DataFrame({
    'County': [
        'DeWitt  ',
        'Lac Qui Parle',
        'Lewis & Clark',
        'St. John the Baptist',
    ],
    'Population': [
        '16,798',
        '8,067',
        '55,716',
        '43,044',
    ]
})

state

	County	State
0	De Witt County	IL(伊利诺伊州)
1	Lac qui Parle County	MN(明尼苏达州)
2	Lewis and Clark County	MT(蒙大拿州)
3	St John the Baptist Parish	LA(路易斯安那州)

population

	County	Population
0	DeWitt	16,798
1	Lac Qui Parle	8,067
2	Lewis & Clark	55,716
3	St. John the Baptist	43,044

我们当然希望使用County列连接state和population表。不幸的是，两张表中没有一个县的拼写相同。此示例说明了文本数据中存在以下常见问题：

大写：qui 对应 Qui
不同的标点符号习惯：St. 对应 St
缺少单词：在population表中缺少单词County/Parish
空白的使用：DeWitt 对应 De Witt
不同的缩写习惯：& 对应 and

字符串方法

python 的字符串方法允许我们着手解决这些问题。这些方法在所有 python 字符串上都被方便地定义，因此不需要导入其他模块。虽然有必要熟悉一下字符串方法的完整列表，但我们在下表中描述了一些最常用的方法。

方法	说明
`str[x:y]`	切片`str`，返回索引 x（包含）到 y（不包含）
`str.lower()`	返回字符串的副本，所有字母都转换为小写
`str.replace(a, b)`	用子字符串`b`替换`str`中子字符串`a`的所有实例
`str.split(a)`	返回在子字符串`a`处拆分的`str`子字符串
`str.strip()`	从`str`中删除前导空格和尾随空格

我们从state和population表中选择St. John the Baptist parish的字符串，并应用字符串方法去除大写、标点符号和county/parish的出现。

john1 = state.loc[3, 'County']
john2 = population.loc[3, 'County']

(john1
 .lower()
 .strip()
 .replace(' parish', '')
 .replace(' county', '')
 .replace('&', 'and')
 .replace('.', '')
 .replace(' ', '')
)

'stjohnthebaptist'

将同一组方法应用于john2，这样我们就能验证两个字符串现在是否相同。

(john2
 .lower()
 .strip()
 .replace(' parish', '')
 .replace(' county', '')
 .replace('&', 'and')
 .replace('.', '')
 .replace(' ', '')
)

'stjohnthebaptist'

满意的是，我们创建了一个名为clean_county的方法来规范化输入的county。

def clean_county(county):
    return (county
            .lower()
            .strip()
            .replace(' county', '')
            .replace(' parish', '')
            .replace('&', 'and')
            .replace(' ', '')
            .replace('.', ''))

现在，我们可以验证clean_county方法为两个表中的所有的county生成匹配的county：

([clean_county(county) for county in state['County']],
 [clean_county(county) for county in population['County']]
)

(['dewitt', 'lacquiparle', 'lewisandclark', 'stjohnthebaptist'],
 ['dewitt', 'lacquiparle', 'lewisandclark', 'stjohnthebaptist'])

因为两个表中的每个county都有相同的转换表示，所以我们可以使用转换后的county成功地连接两个表。

pandas 中的字符串方法

在上面的代码中，我们使用一个循环来转换每个county。pandas的Series对象提供了一种将字符串方法应用于序列中每个项的方便方法。首先，state表中的county序列：

state['County']

0                De Witt County
1          Lac qui Parle County
2        Lewis and Clark County
3    St John the Baptist Parish
Name: County, dtype: object

pandas的Series的.str属性提供了和原生Python 中相同的字符串方法。对.str属性调用方法会对序列中的每个项调用该方法。

state['County'].str.lower()

0                de witt county
1          lac qui parle county
2        lewis and clark county
3    st john the baptist parish
Name: County, dtype: object

这允许我们在不使用循环的情况下转换序列中的每个字符串。

(state['County']
 .str.lower()
 .str.strip()
 .str.replace(' parish', '')
 .str.replace(' county', '')
 .str.replace('&', 'and')
 .str.replace('.', '')
 .str.replace(' ', '')
)

0              dewitt
1         lacquiparle
2       lewisandclark
3    stjohnthebaptist
Name: County, dtype: object

我们将转换后的county保存回其原始表：

state['County'] = (state['County']
 .str.lower()
 .str.strip()
 .str.replace(' parish', '')
 .str.replace(' county', '')
 .str.replace('&', 'and')
 .str.replace('.', '')
 .str.replace(' ', '')
)

population['County'] = (population['County']
 .str.lower()
 .str.strip()
 .str.replace(' parish', '')
 .str.replace(' county', '')
 .str.replace('&', 'and')
 .str.replace('.', '')
 .str.replace(' ', '')
)

现在，这两个表包含了county的相同字符串表示：

state

	County	State
0	dewitt	IL
1	lacquiparle	MN
2	lewisandclark	MT
3	stjohnthebaptist	LA

population

	County	Population
0	dewitt	16,798
1	lacquiparle	8,067
2	lewisandclark	55,716
3	stjohnthebaptist	43,044

一旦county匹配，就很容易连接这些表了。

state.merge(population, on='County')

	County	State	Population
0	dewitt	IL	16,798
1	lacquiparle	MN	8,067
2	lewisandclark	MT	55,716
3	stjohnthebaptist	LA	43,044

摘要

python 的字符串方法形成了一组简单而有用的字符串操作。pandas的Series实现了相同的方法，将底层 python 方法应用于序列中的每个字符串。

您可以在这里找到关于 python 的string方法的完整文档。还可以在这里找到关于 pandas 的文档str方法的完整文档。

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

8.1.md

8.1.md

python 字符串方法

清洗文本数据

字符串方法

pandas 中的字符串方法

摘要

Files

8.1.md

Latest commit

History

8.1.md

File metadata and controls

python 字符串方法

清洗文本数据

字符串方法

pandas 中的字符串方法

摘要