
htmlparser Study Notes


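Because of a requirement at work, I developed a crawler that scrapes data from the web, for example city and industry information from Dianping (点评网), Alibaba and HC360 (慧聪网). The technology used is htmlparser. This article briefly introduces commonly used htmlparser scraping code; for details, see the API documentation in the htmlparser download package.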

The following sorts out how the Node types relate to one another and lists all implementations of NodeFilter.

Interface Node

All Known Subinterfaces:

Remark: RemarkNode

Tag: AppletTag, BaseHrefTag, BodyTag, Bullet, BulletList, CompositeTag, DefinitionList, DefinitionListBullet, Div, DoctypeTag, FormTag, FrameSetTag, FrameTag, HeadingTag, HeadTag, Html, ImageTag, InputTag, JspTag, LabelTag, LinkTag, MetaTag, ObjectTag, OptionTag, ParagraphTag, ProcessingInstructionTag, ScriptTag, SelectTag, Span, StyleTag, TableColumn, TableHeader, TableRow, TableTag, TagNode, TextareaTag, TitleTag

Text: TextNode

 

Interface NodeFilter

All Known Implementing Classes:

AndFilter, AndFilterWrapper, CssSelectorNodeFilter, Filter, HasAttributeFilter, HasAttributeFilterWrapper, HasChildFilter, HasChildFilterWrapper, HasParentFilter, HasParentFilterWrapper, HasSiblingFilter, HasSiblingFilterWrapper, IsEqualFilter, LinkRegexFilter, LinkStringFilter, NodeClassFilter, NodeClassFilterWrapper, NotFilter, NotFilterWrapper, OrFilter, OrFilterWrapper, RegexFilter, RegexFilterWrapper, StringFilter, StringFilterWrapper, TagNameFilter, TagNameFilterWrapper

 

 

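Basic approach: the prerequisite is to analyze the whole HTML source, especially the part of the HTML you want to scrape.

Step 1: Create the Parser object and set its encoding, e.g. parser.setEncoding("UTF-8"); // must match the encoding used by the HTML file

Step 2: Create a suitable Filter.

Step 3: Parse to obtain a NodeList. Calling toHtml() on that object returns a string, from which a new Parser can be created. Locating the target content in a single pass is best; if that is not possible, narrow the scope step by step.

Step 4: Process the scraped content (string handling, database operations, and so on). NodeList's toNodeArray() method returns a Node[] array, e.g. LinkTag link = (LinkTag) node[0]; link.getLinkText(); // link text   link.getLink(); // link URL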

 

Detail

1. Creating a Parser object (a network exception is sometimes thrown; the three approaches below can be tried to work around it)

1.1 The usual way

Parser(String resource)

          Creates a Parser object with the location of the resource (URL or file).

 

Parser(URLConnection connection)

          Construct a parser using the provided URLConnection.

 

static Parser createParser(String html, String charset)

          Creates the parser on an input string.

 

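1.2 Using a Java network connection with a User-Agent header

    import java.io.IOException;
    import java.net.HttpURLConnection;
    import java.net.MalformedURLException;
    import java.net.URL;
    import java.net.URLConnection;

    public static URLConnection getUrlAgent(String strUrl) {
        HttpURLConnection connection = null;
        try {
            URL url = new URL(strUrl);
            connection = (HttpURLConnection) url.openConnection();
            // Send a browser-style User-Agent instead of the default Java one
            connection.setRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)");
        } catch (MalformedURLException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return connection; // null if the connection could not be opened
    }

    Parser parser = new Parser(getUrlAgent(strUrl));

    // When the URL contains percent-encoded Chinese characters, decode it first
    String url = "http://localhost:8081/company/kw/%CB%FE%B5%F5.html";
    url = java.net.URLDecoder.decode(url, "gb2312");
    System.out.println(url);
    URLConnection conn = getUrlAgent(url);
    Parser parser = new Parser(conn);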
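The decoding step above only works when the charset passed to URLDecoder matches the one used to produce the escapes. The following is a minimal sketch using only the JDK; the class name Gb2312UrlDemo and its two helper methods are hypothetical names chosen for illustration, and the escape sequence is the path segment from the URL above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

class Gb2312UrlDemo {
    // Decode percent-escapes whose underlying bytes are GB2312.
    static String decode(String encoded) {
        try {
            return URLDecoder.decode(encoded, "gb2312");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("gb2312 charset not available", e);
        }
    }

    // Re-encode text with the same charset to recover the escaped form.
    static String encode(String text) {
        try {
            return URLEncoder.encode(text, "gb2312");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("gb2312 charset not available", e);
        }
    }

    public static void main(String[] args) {
        String segment = "%CB%FE%B5%F5";    // path segment from the article's URL
        String text = decode(segment);
        System.out.println(text.length());  // 2: four bytes decode to two characters
        System.out.println(encode(text));   // %CB%FE%B5%F5
    }
}
```

If the round trip through encode does not give back the original escapes, the charset is probably not the one the site uses.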

 

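1.3 Fetching the page content as a stream with HttpClient

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.io.UnsupportedEncodingException;
    import org.apache.commons.httpclient.HttpClient;
    import org.apache.commons.httpclient.HttpException;
    import org.apache.commons.httpclient.methods.GetMethod;

    // Read the whole stream as text; "gbk" must match the page's encoding
    public static String convertStreamToString(InputStream is)
            throws UnsupportedEncodingException {
        BufferedReader reader = new BufferedReader(new InputStreamReader(is, "gbk"));
        StringBuilder sb = new StringBuilder();
        String line = null;
        try {
            while ((line = reader.readLine()) != null) {
                sb.append(line).append("\n");
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            try {
                is.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        return sb.toString();
    }

    // Download the content
    public static String urlContent(String urlString) throws HttpException,
            IOException {
        HttpClient client = new HttpClient();
        GetMethod get = new GetMethod(urlString);
        try {
            client.executeMethod(get);
            // System.out.print("aaaaa:"+get.getResponseCharSet()); //GBK
            InputStream iStream = get.getResponseBodyAsStream();
            return convertStreamToString(iStream);
        } finally {
            get.releaseConnection(); // release even if the request fails
        }
    }

    String url = "http://localhost:8081/company/c-1031646_province-%B9%E3%B6%AB_n-y.html/";
    // urlContent returns the HTML text itself, so use createParser (listed in 1.1),
    // which parses an input string; Parser(String) expects a URL or file location
    Parser parser = Parser.createParser(urlContent(url), "gbk");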
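convertStreamToString hard-codes "gbk", which ties in with the earlier advice to keep the encoding consistent with the page. As a rough illustration of what goes wrong otherwise, the sketch below reads the same bytes with a matching and a non-matching charset. It uses only the JDK; CharsetReadDemo and readAll are hypothetical names, and the sample markup is made up for the demonstration.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetReadDemo {
    // Read the whole stream as text with the given charset; closes the stream.
    static String readAll(InputStream in, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK");
        String page = "<title>\u4e0a\u6d77</title>";   // sample markup with two Chinese characters
        byte[] bytes = page.getBytes(gbk);             // what a GBK-encoded server would send

        String right = readAll(new ByteArrayInputStream(bytes), gbk);
        String wrong = readAll(new ByteArrayInputStream(bytes), StandardCharsets.UTF_8);

        System.out.println(right.equals(page));  // true
        System.out.println(wrong.equals(page));  // false: the Chinese text is garbled
    }
}
```

Passing the charset as a parameter, rather than fixing it inside the method, makes it easier to reuse the helper for sites with different encodings.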

 

 

2. NodeList objects

2.1 Filtering on a single tag itself

    TagNameFilter filter = new TagNameFilter(tag);
    NodeList nodeList = parser.parse(filter);

2.2 Filtering by a sibling tag (the tags sit side by side at the same level)

    TagNameFilter filter = new TagNameFilter(tag);
    HasSiblingFilter hasSiblingFilter = new HasSiblingFilter(filter);
    NodeList nodeList = parser.parse(hasSiblingFilter);

2.3 Filtering for the parent of a tag (parent-child relationship): accepts nodes that have a child matching the tag

    TagNameFilter filter = new TagNameFilter(tag);
    HasChildFilter hasChildFilter = new HasChildFilter(filter);
    NodeList nodeList = parser.parse(hasChildFilter);

2.4 Filtering for the children of a tag (parent-child relationship): accepts nodes whose parent matches the tag

    TagNameFilter filter = new TagNameFilter(tag);
    HasParentFilter hasParentFilter = new HasParentFilter(filter);
    NodeList nodeList = parser.parse(hasParentFilter);
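It is easy to mix up which side of the relationship each filter in section 2 returns. The sketch below is a plain-Java model of the three relationships, not htmlparser code: El, tagName, hasChild, hasParent, hasSibling and select are all hypothetical names invented here. It only looks at direct parents and children, and it does not count a node as its own sibling; that last point is a modelling choice, so check the library's API documentation for the exact behavior.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

class RelationFilterModel {
    // Minimal stand-in for an HTML element; NOT an htmlparser class.
    static final class El {
        final String name;
        El parent;
        final List<El> children = new ArrayList<>();

        El(String name, El... kids) {
            this.name = name;
            for (El k : kids) {
                k.parent = this;
                children.add(k);
            }
        }
    }

    static Predicate<El> tagName(String n) {
        return e -> e.name.equalsIgnoreCase(n);
    }

    // Accepts elements with at least one direct child accepted by f.
    static Predicate<El> hasChild(Predicate<El> f) {
        return e -> e.children.stream().anyMatch(f);
    }

    // Accepts elements whose direct parent is accepted by f.
    static Predicate<El> hasParent(Predicate<El> f) {
        return e -> e.parent != null && f.test(e.parent);
    }

    // Accepts elements that have another child of the same parent accepted by f.
    static Predicate<El> hasSibling(Predicate<El> f) {
        return e -> e.parent != null
                && e.parent.children.stream().anyMatch(s -> s != e && f.test(s));
    }

    // Depth-first walk collecting the names of accepted elements.
    static List<String> select(El root, Predicate<El> f) {
        List<String> out = new ArrayList<>();
        collect(root, f, out);
        return out;
    }

    private static void collect(El e, Predicate<El> f, List<String> out) {
        if (f.test(e)) out.add(e.name);
        for (El c : e.children) collect(c, f, out);
    }

    // Models <div><ul><li/><li/></ul><p/></div>
    static El sample() {
        return new El("div", new El("ul", new El("li"), new El("li")), new El("p"));
    }

    public static void main(String[] args) {
        El root = sample();
        System.out.println(select(root, hasChild(tagName("li"))));   // [ul]
        System.out.println(select(root, hasParent(tagName("ul"))));  // [li, li]
        System.out.println(select(root, hasSibling(tagName("p"))));  // [ul]
    }
}
```

In this model, hasChild(li) returns the parent (ul), while hasParent(ul) returns the children (li), which mirrors the descriptions in 2.3 and 2.4.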

3. Combining two tags. Combinations are built with AndFilter, OrFilter and NotFilter and, as above, can apply to the tag itself, to siblings (HasSiblingFilter), to parents (HasChildFilter) and to children (HasParentFilter).

    // Each declaration below is an alternative; use one at a time.

    // AndFilter: both conditions must hold for the same node
    AndFilter filter = new AndFilter(
            new TagNameFilter(tag),
            new TagNameFilter(tagother));

    AndFilter filter = new AndFilter(
            new HasSiblingFilter(new TagNameFilter(tag)),
            new HasSiblingFilter(new TagNameFilter(tagother)));

    AndFilter filter = new AndFilter(
            new HasChildFilter(new TagNameFilter(tag)),
            new HasChildFilter(new TagNameFilter(tagother)));

    AndFilter filter = new AndFilter(
            new HasParentFilter(new TagNameFilter(tag)),
            new HasParentFilter(new TagNameFilter(tagother)));

    // OrFilter: either condition is enough
    OrFilter filter = new OrFilter(
            new TagNameFilter(tag),
            new TagNameFilter(tagother));

    OrFilter filter = new OrFilter(
            new HasSiblingFilter(new TagNameFilter(tag)),
            new HasSiblingFilter(new TagNameFilter(tagother)));

    OrFilter filter = new OrFilter(
            new HasChildFilter(new TagNameFilter(tag)),
            new HasChildFilter(new TagNameFilter(tagother)));

    OrFilter filter = new OrFilter(
            new HasParentFilter(new TagNameFilter(tag)),
            new HasParentFilter(new TagNameFilter(tagother)));

    // NotFilter: the first condition holds and the node is not a tagother tag
    AndFilter filter = new AndFilter(
            new TagNameFilter(tag),
            new NotFilter(new TagNameFilter(tagother)));

    AndFilter filter = new AndFilter(
            new HasSiblingFilter(new TagNameFilter(tag)),
            new NotFilter(new TagNameFilter(tagother)));

    AndFilter filter = new AndFilter(
            new HasChildFilter(new TagNameFilter(tag)),
            new NotFilter(new TagNameFilter(tagother)));

    AndFilter filter = new AndFilter(
            new HasParentFilter(new TagNameFilter(tag)),
            new NotFilter(new TagNameFilter(tagother)));

    NodeList nodeList = parser.parse(filter);
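The and/or/not combinations in section 3 follow ordinary Boolean logic, which can be tried out without the library. The sketch below models each node as just its tag name and uses java.util.function.Predicate in place of the filter classes; BooleanFilterModel, tagName and select are hypothetical names, and nothing here is htmlparser API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

class BooleanFilterModel {
    // A "node" in this model is only its tag name.
    static Predicate<String> tagName(String n) {
        return t -> t.equalsIgnoreCase(n);
    }

    // Keep the tags accepted by the predicate, in document order.
    static List<String> select(List<String> tags, Predicate<String> f) {
        List<String> out = new ArrayList<>();
        for (String t : tags) {
            if (f.test(t)) out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tags = Arrays.asList("div", "a", "span", "a");

        // AND of two different tag names: no single node can satisfy both
        System.out.println(select(tags, tagName("a").and(tagName("span"))));          // []
        // OR: either tag name is accepted
        System.out.println(select(tags, tagName("a").or(tagName("span"))));           // [a, span, a]
        // NOT: everything except the named tag
        System.out.println(select(tags, tagName("a").negate()));                      // [div, span]
        // AND with NOT, as in the last group above
        System.out.println(select(tags, tagName("a").and(tagName("span").negate()))); // [a, a]
    }
}
```

One thing the model makes visible: combining two different tag-name conditions with "and" selects nothing, so the "and" form is mainly useful when the two conditions describe different properties of the same node, such as its own name and the name of its parent.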

 

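4. Filtering by attribute, or by attribute and attribute value

    // Tags that carry the attribute, whatever its value
    HasAttributeFilter filter = new HasAttributeFilter(attribute);

    // Tags whose attribute has the given value
    HasAttributeFilter filter = new HasAttributeFilter(attribute, value);

    NodeList nodeList = parser.parse(filter);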

5. Filtering by node class

    NodeFilter filter = new NodeClassFilter(LinkTag.class);  // e.g. link tags

    NodeFilter filter = new NodeClassFilter(TextNode.class); // e.g. text nodes

    NodeList nodeList = parser.parse(filter);
    Node[] nodes = nodeList.toNodeArray();  // get the result as a Node[] array

6. Filtering and reading a table

    NodeClassFilter filter = new NodeClassFilter(TableTag.class);
    NodeList nodeList = parser.parse(filter);
    TableTag tableTag = (TableTag) nodeList.elementAt(0); // first table found
    TableRow[] rows = tableTag.getRows();

    for (int j = 0; j < rows.length; j++) {
        TableRow tr = rows[j];
        TableColumn[] td = tr.getColumns();
        for (int k = 0; k < td.length; k++) {
            // Assumes the first child of every cell is a link; the cast fails otherwise
            LinkTag lt = (LinkTag) td[k].getFirstChild();
            …… // string processing, database operations
        }
    }

 

                 
