Python article scraping example (crawling http://infoq.com)

qepwqnp


Blog category: Original

python xml sql orm html

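I wrote a small program that collects articles from http://infoq.com. How it works: it reads the RSS feed that infoq.com provides, then follows the links in the feed to download the corresponding articles.
RSS URL: http://www.infoq.com/cn/rss/rss.action?token=v4oeyqexg7ltwopp5iph34ky6wdtpxqz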

   
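  Personalized InfoQ RSS feed for unregistered users - please register to upgrade!
  http://www.infoq.com/cn/
  This RSS feed is a personalized feed that is unique to your account on infoq.com, whether registered or not. You can choose the communities you are interested in from the "Your Communities" box in the left sidebar of the InfoQ site, and you can also filter out content you do not care about by turning off subtopics and tags. Your choices affect the news shown in this RSS feed: it will match what you see in the news column in the center of the site's home page. If your RSS feed does not reflect this, the feed link you are using may not be associated with your InfoQ account. To make sure you are using the right feed, register on InfoQ first, then get a new RSS feed URL from the "Personalized RSS" link in the site's left menu. Enjoy!

  PetaPoco: a micro-ORM for .NET
  http://www.infoq.com/cn/news/2011/06/petapoco
  PetaPoco is an object-relational mapper (ORM) for .NET applications. Unlike full-featured ORMs such as NHibernate or Entity Framework, PetaPoco focuses on ease of use and performance rather than rich functionality. Using PetaPoco only requires adding a single C# file; it works with strongly typed POCOs and supports
.........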
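
As a rough illustration of how the fields of such a feed can be pulled out, here is a minimal Python 3 sketch using `xml.dom.minidom`. The sample document and its `item`/`title`/`link`/`description` element names follow the common RSS layout and are an assumption; the real InfoQ feed may use different or additional elements. The helper names `text_of` and `parse_items` are made up for this example.

```python
from xml.dom.minidom import parseString

# Hypothetical sample feed; the element names are assumed, not copied from InfoQ.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Sample feed</title>
    <link>http://www.infoq.com/cn/</link>
    <item>
      <title>PetaPoco: a micro-ORM for .NET</title>
      <link>http://www.infoq.com/cn/news/2011/06/petapoco</link>
      <description>PetaPoco is an object-relational mapper for .NET.</description>
    </item>
  </channel>
</rss>"""


def text_of(parent, tag):
    """Return the stripped text of the first <tag> element under parent, or ''."""
    for node in parent.getElementsByTagName(tag):
        return "".join(
            child.data
            for child in node.childNodes
            if child.nodeType in (child.TEXT_NODE, child.CDATA_SECTION_NODE)
        ).strip()
    return ""


def parse_items(xml_text):
    """Turn every <item> element into a dict of title, link and description."""
    dom = parseString(xml_text)
    return [
        {
            "title": text_of(item, "title"),
            "link": text_of(item, "link"),
            "description": text_of(item, "description"),
        }
        for item in dom.getElementsByTagName("item")
    ]


items = parse_items(SAMPLE_FEED)
for entry in items:
    print(entry["title"], entry["link"])
```

Each dict produced this way carries the link that the next step would download.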

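The request returns a standard RSS XML document. The program parses the XML to get each article's metadata, fetches and parses each article page, downloads its images, and finally saves the article to a MySQL database.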
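
The final storage step can be sketched as follows. To keep the example self-contained it uses the standard-library `sqlite3` module with an in-memory database as a stand-in for MySQL; the `article` table, its columns, and the `create_schema` and `save_article` helpers are hypothetical and not the author's schema. With MySQLdb the query placeholder would be `%s` rather than `?`.

```python
import sqlite3


def create_schema(conn):
    """Create the (hypothetical) article table if it does not exist yet."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS article (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               title TEXT NOT NULL,
               link TEXT NOT NULL UNIQUE,
               description TEXT,
               content TEXT
           )"""
    )


def save_article(conn, article):
    """Insert one article; return True if stored, False if the link already exists."""
    with conn:  # commits on success, rolls back if an exception is raised
        cursor = conn.execute(
            "INSERT OR IGNORE INTO article (title, link, description, content) "
            "VALUES (?, ?, ?, ?)",
            (
                article["title"],
                article["link"],
                article.get("description", ""),
                article.get("content", ""),
            ),
        )
    return cursor.rowcount == 1


conn = sqlite3.connect(":memory:")
create_schema(conn)
sample_article = {
    "title": "PetaPoco: a micro-ORM for .NET",
    "link": "http://www.infoq.com/cn/news/2011/06/petapoco",
}
first = save_article(conn, sample_article)
second = save_article(conn, sample_article)  # same link, so it is skipped
stored = conn.execute("SELECT COUNT(*) FROM article").fetchone()[0]
print(first, second, stored)
```

Passing the values as query parameters instead of building the SQL string by hand avoids quoting problems when article text contains quotes. The unique constraint on `link` is one possible way to keep repeated runs from storing the same article twice.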

Here is the code:

#! /usr/bin/env python
# -*- coding: utf-8 -*-
# Note: this script targets Python 2 (print statement, sgmllib, reload(sys)).
import urllib
import re, sys
import string
from xml.dom.minidom import parseString
from sgmllib import SGMLParser
import MySQLdb
reload(sys)
sys.setdefaultencoding('utf8')

class constants():
    # site
    html_site = "http://www.infoq.com"
    # RSS feed
    html_resource = html_site + "/cn/rss/rss.action?token=v4oeyqexg7ltwopp5iph34ky6wdtpxqz"
    # database host
    db_host = "localhost"
    # database user name
    db_user = "root"
    # database password
    db_password = "root"
    # database name
    db_database = "test"
    # character set of the database connection
    charset = "utf8"
    # proxy server
    proxy_adress = ""
    # proxy user name
    proxy_username = ""
    # proxy password
    proxy_password = ""
    # local directory where images are saved
    img_localdstdir = "e:/image/"


# collects the src attribute of every <img> tag
class listurls(SGMLParser):
    def reset(self):
        self.imgs = []
        SGMLParser.reset(self)
    def start_img(self, attrs):
        src = [v for k, v in attrs if k == 'src']
        if src:
            self.imgs.extend(src)

# database utility class
class dbutil():
    def getconnectiondb(self):
        try:
            conn = MySQLdb.connect(host=constants.db_host, user=constants.db_user, passwd=constants.db_password, db=constants.db_database, charset=constants.charset)
            return conn
        except Exception:
            print "error: failed to get database connection"

# article object: holds the data crawled from the site before it is stored in the db
class actrict():
    title = ''
    link = ''
    description = ''
    creator = ''
    createdate = ''
    identifier = ''
    content = ''

class webcrawlerhttp:
    # fetch the html content
    def geturlinfo(self, weburl):
        # initialised here so that the finally block is safe when urlopen raises
        information = None
        try:
            #proxyconfig = 'http://%s:%s@%s' % (constants.proxy_username, constants.proxy_password, constants.proxy_adress)
            #information = urllib.urlopen(weburl, proxies={'http':proxyconfig})
            information = urllib.urlopen(weburl)
            #header = information.info()
            #contenttype = header.getheader('content-type')
            status = information.getcode()
            if status == 200:
                html = information.readlines()
                return html
            else:
                return 'error: get web %s is fail and status=%s' % (weburl, status)
        except Exception:
            print 'error: get web %s is fail' % (weburl)
        finally:
            if information is not None:
                information.close()

    # parse the html
    def parsehtml(self, html, link):
        try:
            # html is a list of lines and needs to be joined into one string
            document = ""
            for line in html:
                if line.split():
                    document = document + line
            # title: the text between <title> and </title>
            title = document[re.search("title>", document).end():]
            title = title[:re.search("title>", title).end() - 8]

            # content
            content = document[re.search("box-content-5", document).end():]
            content = content[:re.search("bottom-corners", content).end()]
            content = document[re.search("", document).end():]  
            content = content[:re.search("
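
`sgmllib` and `urllib.urlopen` exist only in Python 2. For readers on Python 3, the page-fetching and image-collecting steps can be sketched with `urllib.request` and `html.parser` from the standard library. This is an illustrative rewrite under that assumption, not the author's code; `fetch_text`, `ImageCollector`, and `list_images` are hypothetical names, and the sample HTML is invented.

```python
import urllib.error
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin


def fetch_text(url, timeout=10):
    """Return the decoded body of url, or None if the request fails."""
    try:
        # The with block closes the response even when reading fails.
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except (urllib.error.URLError, OSError) as exc:
        print("error: fetching %s failed: %s" % (url, exc))
        return None


class ImageCollector(HTMLParser):
    """Collect the src of every <img> tag, resolved against base_url."""

    def __init__(self, base_url=""):
        super().__init__()
        self.base_url = base_url
        self.imgs = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <img .../> also end up here by default.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.imgs.append(urljoin(self.base_url, value))


def list_images(html, base_url=""):
    """Return the image URLs found in an HTML string."""
    parser = ImageCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.imgs


sample_html = (
    '<div><img src="/resource/a.png"><img alt="no source">'
    '<img src="http://example.com/b.jpg"/></div>'
)
image_urls = list_images(sample_html, "http://www.infoq.com/cn/news/2011/06/petapoco")
print(image_urls)
```

Resolving each `src` against the article URL matters because pages often use site-relative image paths, which cannot be downloaded as they stand.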
