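Preface

Let's get straight to the scenario:

The editor-in-chief hands you a freshly revised issue of the magazine and asks you to compile a table listing each article's title, author, word count, and author's fee.

On reflection, doing this with Python might well be simpler and more efficient.

This post works through the task step by step. Putting the pieces together is a bit of a hassle, because a single issue contains many articles plus all sorts of miscellaneous columns and images, so a once-and-for-all solution has to be built up gradually.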

Environment

Python 3.9

The docx package (python-docx on PyPI)

Extracting headings

Level-1 headings

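PY
from docx import Document

doc = Document("test2.docx")  # placeholder path; point this at the magazine file
for p in doc.paragraphs:
    if p.style.name=='Heading 1':
        print(p.text)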

Level-2 headings

PY
for p in doc.paragraphs:
    if p.style.name=='Heading 2':
        print(p.text)

All headings

PY
import re
for p in doc.paragraphs:
    if re.match(r"^Heading \d+$", p.style.name):
        print(p.text)
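
When filling in the table it may help to know each heading's level, not just its text. The following is a minimal sketch under stated assumptions: `heading_level` is a hypothetical helper (it is not part of python-docx), it works on the style name string alone, and it assumes the document uses Word's built-in English style names such as "Heading 1".

```python
import re

# Matches Word's built-in heading style names, e.g. "Heading 1", "Heading 12".
HEADING_RE = re.compile(r"^Heading (\d+)$")


def heading_level(style_name):
    """Return the heading level as an int, or None if the style is not a heading.

    Hypothetical helper for illustration; not part of python-docx.
    """
    m = HEADING_RE.match(style_name)
    return int(m.group(1)) if m else None


# Example style names; with python-docx these would come from p.style.name.
for name in ["Heading 1", "Heading 2", "Normal", "Heading"]:
    print(name, "->", heading_level(name))
```

With a loaded document, `heading_level(p.style.name)` could stand in for the `re.match` test in the loop above.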

Extracting body text

PY
for p in doc.paragraphs:
    if p.style.name=='Normal':
        print(p.text)
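
Headings and body text can be combined to split an issue into articles. The sketch below is only an illustration: `group_by_heading` is a hypothetical helper, the sample data is made up, and it assumes every article starts with a "Heading 1" paragraph and that its body uses the "Normal" style, which may not hold for a real magazine file.

```python
def group_by_heading(paragraphs, heading_style="Heading 1"):
    """Group (style_name, text) pairs into articles.

    Each paragraph whose style equals heading_style starts a new article.
    Paragraphs styled "Normal" are appended to the current article's body.
    Paragraphs before the first heading, or with other styles, are ignored.
    """
    articles = []
    for style_name, text in paragraphs:
        if style_name == heading_style:
            articles.append({"title": text, "body": []})
        elif style_name == "Normal" and articles:
            articles[-1]["body"].append(text)
    return articles


# Made-up sample standing in for
# [(p.style.name, p.text) for p in doc.paragraphs]
sample = [
    ("Normal", "Editor's note"),
    ("Heading 1", "First article"),
    ("Normal", "Paragraph one."),
    ("Normal", "Paragraph two."),
    ("Heading 1", "Second article"),
    ("Normal", "Only paragraph."),
]
for article in group_by_heading(sample):
    print(article["title"], len(article["body"]))
```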

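With all of the magazine's headings sorted out, the next step is to learn how to work with Excel spreadsheets from Python.

Using Python with Excel spreadsheets

Reference article: https://blog.csdn.net/weixin_41261833/article/details/106028038

Writing

1) Modifying the contents of a sheet

① Write a value to a cell and save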

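PY
from openpyxl import load_workbook

workbook = load_workbook(filename = "test.xlsx")
sheet = workbook.active
print(sheet)
sheet["A1"] = "哈喽" # equivalent to: cell = sheet["A1"]; cell.value = "哈喽"
workbook.save(filename = "哈喽.xlsx")
"""
Note: the value of cell "A1" is changed to "哈喽" and the workbook is saved under a new name, "哈喽.xlsx".
If the file name is left unchanged when saving, the source file itself is modified.
"""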

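② .append(): add rows of data to a sheet

.append() adds the given values after the data already in the sheet, one row per call.
This is very handy: data collected by a web scraper, for example, can be saved to an Excel file this way.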
PY
from openpyxl import load_workbook

workbook = load_workbook(filename = "test.xlsx")
sheet = workbook.active
print(sheet)
data = [
    ["魈","男","165cm"],
    ["个子真矮真君","男","165cm"],
    ["凯亚","男","175cm"],
    ["凝冰渡海真君","男","176cm"],
]
# Each append() call adds one row below the existing data
for row in data:
    sheet.append(row)
workbook.save(filename = "test.xlsx")
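
The same `.append()` pattern could produce the table the editor-in-chief asked for. The sketch below is a rough illustration, not a finished tool: `build_rows` and `save_rows` are hypothetical helpers, the column names and sample values are made up, and the output file name is a placeholder. The fee is taken as a ready-made input, because how it is calculated is not covered here.

```python
HEADER = ["Title", "Author", "Word count", "Fee"]


def build_rows(articles):
    """Turn a list of article dicts into spreadsheet rows, header first."""
    rows = [HEADER]
    for a in articles:
        rows.append([a["title"], a["author"], a["word_count"], a["fee"]])
    return rows


def save_rows(rows, path):
    """Write the rows to a new workbook with openpyxl's Workbook and append()."""
    # Imported here so that build_rows can be used without openpyxl installed.
    from openpyxl import Workbook

    workbook = Workbook()
    sheet = workbook.active
    for row in rows:
        sheet.append(row)
    workbook.save(filename=path)


# Made-up sample values, for illustration only
articles = [
    {"title": "First article", "author": "Author A", "word_count": 3200, "fee": 320},
    {"title": "Second article", "author": "Author B", "word_count": 1500, "fee": 150},
]
rows = build_rows(articles)
print(rows)
# save_rows(rows, "summary.xlsx")  # placeholder file name; uncomment to write it
```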

③ Extract a table from a Word document and save it to Excel

CODE
from docx import Document
from openpyxl import Workbook

doc = Document(r"G:\6Tipdm\7python办公自动化\concat_word\test2.docx")
t0 = doc.tables[0]

workbook = Workbook()
sheet = workbook.active

for i in range(len(t0.rows)):
    list1 = []
    for j in range(len(t0.columns)):
        list1.append(t0.cell(i,j).text)
    sheet.append(list1)
workbook.save(filename = r"G:\6Tipdm\7python办公自动化\concat_word\来自word中的表.xlsx")

Counting the characters in an article

CODE
import sys
print(sys.getdefaultencoding())

# characters: list of every distinct character seen, in order of first appearance
# stat: dict with the character as key and its number of occurrences as value
characters = []
stat = {}

# Brute-force list of punctuation and whitespace to skip when counting
skip = [' ','\n','\t','，','。','？','《','》','！','、','：','“','”','；']

# Open the file explicitly as UTF-8 instead of relying on the platform default
with open('data.txt.utf8', 'r', encoding='utf-8') as f:
    for line in f:
        line = line.strip()
        # Skip lines that are empty once surrounding whitespace is stripped
        if len(line) == 0:
            continue

        for ch in line:
            if ch in skip:
                continue

            # First time this character is seen: record it and start its count at 0
            # (the Python 2 version of this test was: if not stat.has_key(ch))
            if ch not in stat:
                characters.append(ch)
                stat[ch] = 0

            # Increment the count for this character
            stat[ch] += 1

# Print the result
print(characters)
for key, value in stat.items():
    print(key, value)

# Lengths of characters and stat, i.e. the number of distinct characters
print('Length of the characters list: ' + str(len(characters)))
print('Length of the stat dict: ' + str(len(stat)))
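
The script above reports how many distinct characters occur and how often each one appears. For the table, the figure of interest is more likely the total number of characters counted. The following is a minimal sketch of one way to get both numbers with `collections.Counter` from the standard library. `count_characters` is a hypothetical helper, the sample string is made up, and the skip list is the same brute-force one as above.

```python
from collections import Counter

# Same brute-force list of whitespace and punctuation as the script above
SKIP = set(" \n\t，。？《》！、：“”；")


def count_characters(text):
    """Return a Counter of every character in text that is not in SKIP."""
    return Counter(ch for ch in text if ch not in SKIP)


sample = "你好，世界！你好。"  # made-up sample text
stat = count_characters(sample)
total = sum(stat.values())  # total number of characters counted
distinct = len(stat)        # number of different characters
print(total, distinct)
```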

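Next, a filtering step still needs to be written to skip content that should not count toward the word count.

Another big problem is how to make the program stop counting when it reaches the end of each article. The approach I use here is to stop as soon as a line containing the text "作者简介" (author bio) is recognized.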
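
The stopping rule described above can be sketched as follows. This is an illustration under assumptions, not the final implementation: `count_until_marker` is a hypothetical helper, the sample paragraphs are made up, and it assumes the marker "作者简介" appears only in the paragraph that ends an article.

```python
def count_until_marker(paragraphs, marker="作者简介", skip=" \n\t，。？《》！、：“”；"):
    """Count the non-skipped characters in paragraphs, stopping at the marker.

    Counting ends at the first paragraph that contains marker; that paragraph
    and everything after it are not counted.
    """
    total = 0
    for text in paragraphs:
        if marker in text:
            break
        total += sum(1 for ch in text if ch not in skip)
    return total


# Made-up sample paragraphs, for illustration only
sample = [
    "第一段，正文。",
    "第二段。",
    "作者简介：某某",
    "这一段不计入。",
]
print(count_until_marker(sample))
```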

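In the next post I am considering a different approach. For now I am thinking of using font size, but that seems to have even more loopholes.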