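I often come across high-quality books built with gitbook and want to save them for offline reading. However, the PDFs that gitbook generates do not allow copying text and are very large, and some sites do not offer a download option at all. So a friend and I built a tool that crawls a gitbook-generated site, parses the pages, and uses weasyprint to generate the file.
Asynchronous crawling: pages are fetched with aiohttp, so crawling a site's content basically finishes in seconds
Copyable text
Preserves the original table-of-contents structure
Keeps the original links
Project: gitbook2pdf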
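For readers curious what the asynchronous crawl step might look like, here is a minimal sketch. It is not the project's actual code: `crawl_all`, `crawl_with_aiohttp`, and the `limit` parameter are hypothetical names chosen for illustration. It assumes the chapter URLs are already known and relies on `asyncio.gather` returning results in argument order, so chapters stay in their original order.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List

# A fetcher is any coroutine function that maps a URL to its HTML text.
Fetcher = Callable[[str], Awaitable[str]]


async def crawl_all(urls: List[str], fetch: Fetcher, limit: int = 10) -> Dict[str, str]:
    """Fetch every URL concurrently, with at most `limit` requests in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def fetch_one(url: str) -> str:
        async with semaphore:
            return await fetch(url)

    # gather() returns results in the order of its arguments, so the
    # chapter order is kept no matter which download finishes first.
    pages = await asyncio.gather(*(fetch_one(u) for u in urls))
    return dict(zip(urls, pages))


async def crawl_with_aiohttp(urls: List[str]) -> Dict[str, str]:
    """Same crawl, using a shared aiohttp session as the fetcher."""
    import aiohttp  # imported here so crawl_all can be tried without aiohttp

    async with aiohttp.ClientSession() as session:

        async def fetch(url: str) -> str:
            async with session.get(url) as resp:
                resp.raise_for_status()
                return await resp.text()

        return await crawl_all(urls, fetch)
```

Passing the fetcher in as a parameter keeps the ordering logic testable offline; `asyncio.run(crawl_with_aiohttp(urls))` would run it against a real site.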
1
fuergaosi OP Stars appreciated
|
2
magicZ 2019-03-07 10:28:11 +08:00
Could you post a link?
|
3
fuergaosi OP Forgot to include the link
gitbook2pdf: https://github.com/fuergaosi233/gitbook2pdf |
4
22k 2019-03-07 10:32:00 +08:00
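Just yesterday I was wondering whether there was a way to download gitbook books. Bookmarking this. OP, if you can share it, please update the original post. Thanks a lot!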
|
6
changjiangzzZ 2019-03-07 11:22:48 +08:00
Starred :)
|
7
newmind 2019-03-07 11:27:17 +08:00
Works really well, upvoted.
|
8
newmind 2019-03-07 11:28:13 +08:00
It would be even better if there were an online version.
|
9
jasonslyvia 2019-03-07 11:55:25 +08:00
Nice. I've wanted a tool like this for a long time; I hope you keep polishing it!
|
10
FakeLeung 2019-03-07 11:59:18 +08:00
Is there no usage section?
From the code, it looks like you just edit the URL passed to run in main? PS: you can append the GitHub address to the original post. |
11
fffflyfish 2019-03-07 12:19:46 +08:00
Thumbs up! Finally someone built this.
|
12
mseasons 2019-03-07 14:31:23 +08:00
aiohttp.client_exceptions.ClientConnectorCertificateError: Cannot connect to host wizardforcel.gitbooks.io:443 ssl:True [SSLCertVerificationError: (1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1056)')]
|
13
d5 2019-03-07 14:34:09 +08:00
OP, consider making an online version, with the backend on a host outside the region~
|
14
privil 2019-03-07 16:32:32 +08:00
...It seems fairly memory-hungry; the process got killed.
|
15
tongdongdong 2019-03-07 18:59:15 +08:00
C:\Users\TDD\Desktop>python -m weasyprint https://ts.xcatliu.com ts.pdf
WARNING: Ignored `text-rendering:auto` at 4:620, unknown property.
WARNING: Ignored `filter:none` at 4:2882, unknown property.
WARNING: Expected a media type, got (max-width:600px)
WARNING: Invalid media type " (max-width:600px)" the whole @media rule was ignored at 9:83.
WARNING: Expected a media type, got (max-width:600px)
WARNING: Invalid media type " (max-width:600px)" the whole @media rule was ignored at 9:669.
WARNING: Ignored `box-shadow:none` at 9:1092, unknown property.
WARNING: Ignored `text-overflow:ellipsis` at 9:1686, unknown property.
WARNING: Expected a media type, got (max-width:1000px)
WARNING: Invalid media type " (max-width:1000px)" the whole @media rule was ignored at 9:1805.
WARNING: Ignored `box-shadow:0 6px 12px rgba(0,0,0,.175)` at 9:2336, unknown property.
WARNING: Ignored `overflow-y:auto` at 9:3908, unknown property.
WARNING: Ignored `text-overflow:ellipsis` at 9:4934, unknown property.
WARNING: Expected a media type, got (max-width:600px)
WARNING: Invalid media type " (max-width:600px)" the whole @media rule was ignored at 9:5254.
WARNING: Expected a media type, got (min-width:600px)
WARNING: Invalid media type " (min-width:600px)" the whole @media rule was ignored at 9:5583.
WARNING: Expected a media type, got (max-width:600px)
WARNING: Invalid media type " (max-width:600px)" the whole @media rule was ignored at 9:5650.
WARNING: Ignored `overflow-y:auto` at 9:6180, unknown property.
WARNING: Ignored `overflow-y:auto` at 9:6418, unknown property.
WARNING: Expected a media type, got (max-width:1240px)
WARNING: Invalid media type " (max-width:1240px)" the whole @media rule was ignored at 9:6434.
WARNING: Ignored `text-size-adjust:100%` at 9:7377, unknown property.
WARNING: Expected a media type, got (max-width:1240px)
WARNING: Invalid media type " (max-width:1240px)" the whole @media rule was ignored at 9:11595.
WARNING: Ignored `box-shadow:none` at 9:12111, unknown property.
WARNING: Ignored `text-size-adjust:100%` at 9:12512, unknown property.
WARNING: Ignored `text-rendering:optimizeLegibility` at 9:20972, unknown property.
WARNING: Ignored `font-smoothing:antialiased` at 9:21006, unknown property.
WARNING: Ignored `text-size-adjust:100%` at 9:21124, unknown property.
WARNING: Ignored `box-shadow: none` at 235:3, unknown property.
WARNING: Ignored `box-shadow: none` at 272:3, unknown property.
And then only the home page was converted successfully!!! |
16
changjiangzzZ 2019-03-07 19:02:54 +08:00
@tongdongdong Mate, please read the docs first~
|
17
changjiangzzZ 2019-03-07 19:04:38 +08:00
@mseasons Network conditions in mainland China aren't great, so the connection timed out. Try adding a proxy.
|
18
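fuergaosi OP @privil The memory usage is a `weasyprint` problem; I'm trying to output the PDF in chunks.
@tongdongdong Please head over to the `weasyprint` issues page.
@mseasons I can't access that URL and don't know how you reached it. Please post the problem and the URL you crawled in the `issues` section.
@FakeLeung Thanks for the reminder; I couldn't find the append button earlier ╮(╯_╰)╭ Also, for now the tool is used by editing the URL. I'll update the usage instructions shortly. I had always tested it this way, so I never paid attention to that. |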
19
Ahs 2019-03-07 19:14:26 +08:00 via Android
Starred
|
21
aWangami 2019-03-07 19:27:16 +08:00
(Python3) ➜ gitbook2pdf python gitbook.py
Traceback (most recent call last):
  File "gitbook.py", line 5, in <module>
    import weasyprint
  File "/Users/Python3/lib/python3.7/site-packages/weasyprint/__init__.py", line 393, in <module>
    from .css import preprocess_stylesheet # noqa
  File "/Users/Python3/lib/python3.7/site-packages/weasyprint/css/__init__.py", line 26, in <module>
    from . import computed_values
  File "/Users/Python3/lib/python3.7/site-packages/weasyprint/css/computed_values.py", line 17, in <module>
    from .. import text
  File "/Users/Python3/lib/python3.7/site-packages/weasyprint/text.py", line 14, in <module>
    import cairocffi as cairo
  File "/Users/Python3/lib/python3.7/site-packages/cairocffi/__init__.py", line 39, in <module>
    cairo = dlopen(ffi, 'cairo', 'cairo-2', 'cairo-gobject-2', 'cairo.so.2')
  File "/Users/Python3/lib/python3.7/site-packages/cairocffi/__init__.py", line 36, in dlopen
    raise OSError("dlopen() failed to load a library: %s" % ' / '.join(names))
OSError: dlopen() failed to load a library: cairo / cairo-2 / cairo-gobject-2 / cairo.so.2
What's going on here? |
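The traceback in reply 21 says that cairocffi could not load the native cairo library, which usually means the system library is not installed or not on the loader path. As a rough diagnostic only, the sketch below probes for the library names listed in the error using the standard library. `find_cairo` is a hypothetical helper, and `ctypes.util.find_library` is an approximation of the lookup cairocffi performs, not a copy of it.

```python
import ctypes.util
from typing import Optional, Sequence

# Library names taken from the OSError message in the traceback.
CAIRO_NAMES = ("cairo", "cairo-2", "cairo-gobject-2")


def find_cairo(names: Sequence[str] = CAIRO_NAMES) -> Optional[str]:
    """Return the first cairo library the system loader can locate, else None."""
    for name in names:
        located = ctypes.util.find_library(name)
        if located is not None:
            return located
    return None


if __name__ == "__main__":
    located = find_cairo()
    if located is None:
        print("cairo not found: install the system cairo package, then retry")
    else:
        print("cairo found:", located)
```

If this reports that cairo is missing, installing cairo through the operating system's package manager (rather than pip) is the usual remedy.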
22
privil 2019-03-07 19:28:17 +08:00
@fuergaosi #18 It also threw an error while crawling. Then again, my VPS has very little memory, only 512 MB, so crawling the original k8s handbook doesn't work on it.
https://funhacks.gitbooks.io/explore-python
crawling : https://funhacks.gitbooks.io/explore-python/Conclusion/reference_material.html
Traceback (most recent call last):
  File "gitbook.py", line 298, in <module>
    Gitbook2PDF("https://funhacks.gitbooks.io/explore-python/").run()
  File "gitbook.py", line 190, in run
    loop.run_until_complete(self.crawl_main_content(content_urls))
  File "/usr/local/python3.7.2/lib/python3.7/asyncio/base_events.py", line 584, in run_until_complete
    return future.result()
  File "gitbook.py", line 212, in crawl_main_content
    await asyncio.gather(*tasks)
  File "gitbook.py", line 233, in gettext
    text = ChapterParser(metatext, level).parser()
  File "gitbook.py", line 95, in parser
    if len(context.find('footer')):
TypeError: object of type 'NoneType' has no len() |
23
privil 2019-03-07 19:30:23 +08:00
|
24
hooych 2019-03-07 19:38:27 +08:00
|
26
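fuergaosi OP @privil I can't reproduce it. This error comes from following the official recommendation: I originally didn't write len, but when I ran the code today the official warning said that a bare `if` check on the result may not be allowed in the future and recommended writing it this way. That turned into a bug. I'll fix it right away.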
|
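The bug discussed in replies 22 and 26 is that `find()` returns `None` when a page has no footer, and `len(None)` raises `TypeError`. A minimal sketch of a safer check follows. It uses the standard library's `xml.etree.ElementTree` as a stand-in, on the assumption that the project's parser exposes a similar `find()`; `strip_footer` is a hypothetical helper, not the project's code.

```python
import xml.etree.ElementTree as ET


def strip_footer(context: ET.Element) -> bool:
    """Remove a direct <footer> child if present; report whether one was removed."""
    footer = context.find("footer")
    # Compare against None explicitly: find() returns None when nothing
    # matches, and truth-testing or len() on the result is what broke.
    if footer is not None:
        context.remove(footer)
        return True
    return False
```

The explicit `is not None` test handles both cases: a missing footer is skipped, and an empty `<footer/>` element (whose length is 0) is still removed.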
27
mseasons 2019-03-07 22:15:15 +08:00
@changjiangzzZ It's not a timeout problem; it seems to be an HTTPS certificate verification problem. I added verify=False to all the get requests and it worked.
|
28
mseasons 2019-03-07 22:18:23 +08:00
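@fuergaosi I didn't change the URL; I ran the source code straight from a git clone. I later checked the docs, added the parameter verify=False to all the get requests, and it went through.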
|
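A note on the workaround in replies 27 and 28: `verify=False` is the parameter name used by the requests library, while aiohttp (which raised the error in reply 12) takes `ssl=False` on its request methods. The thread does not show which call was changed, so the sketch below is only an illustration. `ssl_option` is a hypothetical helper, and the `cafile` path is whatever CA bundle the reader chooses to supply.

```python
import ssl
from typing import Optional, Union


def ssl_option(verify: bool = True, cafile: Optional[str] = None) -> Union[ssl.SSLContext, bool]:
    """Build the value to pass as aiohttp's `ssl=` request argument.

    verify=False returns False, which makes aiohttp skip certificate checks.
    Otherwise a verifying context is returned, optionally with an extra CA bundle.
    """
    if not verify:
        return False
    return ssl.create_default_context(cafile=cafile)


# Usage with an aiohttp session (not executed here):
#     async with session.get(url, ssl=ssl_option(verify=False)) as resp:
#         ...
```

Disabling verification removes protection against tampered responses, so pointing `cafile` at a valid CA bundle is the safer fix when the failure is "unable to get local issuer certificate".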
29
dyxang 2019-03-07 22:24:18 +08:00 via Android
I'd love to just use it directly. Why not package it with py2exe?
|
30
leesymbol 2019-03-08 08:22:04 +08:00 via iPhone
Bump
|
31
cye3s 2019-03-08 11:25:50 +08:00
I tried one, and the table-of-contents structure wasn't preserved, for example this one:
https://go.tanglei.name/content |
32
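fuergaosi OP @cye3s I tested it and the table-of-contents structure was preserved, but two pages returned 404, so two chapters are missing. ![kz37f1.png]( https://s2.ax1x.com/2019/03/08/kz37f1.png) Also, if you run into problems, please post them directly in the issues section.
@dyxang Because I don't have Windows ┑( ̄Д  ̄)┍ |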
33
soulteary 2019-05-07 23:51:07 +08:00
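@fuergaosi Your little tool works great, but I saw that some people couldn't get the environment set up, so I packaged it as a container image. The code is here: https://github.com/soulteary/docker-gitbook-pdf-generator
If you're willing to adjust the project's directory structure a little and tag releases, later upgrades and maintenance would be easier, for example customizing the e-book style, etc... |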