Parsing HTML on iOS
Published: 2019-06-28

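There are plenty of libraries for parsing XML and JSON, but how do we parse HTML?

TFHpple is a small wrapper around libxml that can be used to parse HTML; queries are written in XPath.
Today I came across an article that parses HTML with libxml directly, see: http://www.cocoanetics.com/2011/09/taming-html-parsing-with-libxml-1/#comment-3090 The diagram there is clear at a glance and well worth bookmarking. The source code in that article does not traverse the whole HTML document, so I modified it slightly so that it walks the document and prints every node: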
// Required at the top of the file:
// #import <Foundation/Foundation.h>
// #import <libxml/HTMLparser.h>

// data:     NSData containing the document data
// encoding: the NSStringEncoding of the data
// baseURL:  the document's base URL, i.e. its location
CFStringEncoding cfenc = CFStringConvertNSStringEncodingToEncoding(encoding);
CFStringRef cfencstr = CFStringConvertEncodingToIANACharSetName(cfenc);
// May be NULL; libxml2 then tries to detect the encoding on its own.
const char *enc = cfencstr ? CFStringGetCStringPtr(cfencstr, 0) : NULL;

// htmlReadMemory takes an explicit length, so the NSData bytes
// do not have to be NUL-terminated (htmlReadDoc expects a C string).
htmlDocPtr _htmlDocument = htmlReadMemory((const char *)[data bytes],
                                          (int)[data length],
                                          [[baseURL absoluteString] UTF8String],
                                          enc,
                                          HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING);

xmlNodePtr rootNode = (xmlNodePtr)_htmlDocument;
xmlNodePtr currentNode = rootNode;

while (currentNode) {
    // output the node on the way down
    if (currentNode->type == XML_ELEMENT_NODE) {
        NSMutableArray *attrArray = [NSMutableArray array];
        for (xmlAttrPtr attrNode = currentNode->properties; attrNode; attrNode = attrNode->next) {
            xmlNodePtr contents = attrNode->children;
            // attributes without a value have no child text node
            const char *value = (contents && contents->content) ? (const char *)contents->content : "";
            [attrArray addObject:[NSString stringWithFormat:@"%s='%s'", attrNode->name, value]];
        }
        NSString *attrString = [attrArray componentsJoinedByString:@" "];
        if ([attrString length]) {
            attrString = [@" " stringByAppendingString:attrString];
        }
        NSLog(@"<%s%@>", currentNode->name, attrString);
    } else if (currentNode->type == XML_TEXT_NODE) {
        if (currentNode->content) {
            NSLog(@"%@", [NSString stringWithCString:(const char *)currentNode->content encoding:NSUTF8StringEncoding]);
        }
    } else if (currentNode->type == XML_COMMENT_NODE) {
        // the comment text is in content; name is always "comment"
        NSLog(@"/* %s */", currentNode->content);
    }

    // go down first
    if (currentNode->children) {
        currentNode = currentNode->children;
        continue;
    }

    // no children: close nodes while climbing until a sibling is found
    while (currentNode) {
        if (currentNode->type == XML_ELEMENT_NODE) {
            NSLog(@"</%s>", currentNode->name);
        }
        if (currentNode == rootNode) {
            currentNode = NULL; // back at the starting node: done
        } else if (currentNode->next) {
            currentNode = currentNode->next;
            break;
        } else {
            currentNode = currentNode->parent;
        }
    }
}

// free the document only after the traversal has finished
if (_htmlDocument) {
    xmlFreeDoc(_htmlDocument);
}
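
The traversal above depends on only three links of each libxml2 node: children, next and parent. As a minimal sketch of that idea in plain C, assuming a hypothetical `Node` struct in place of `xmlNode` and writing the tags into a buffer instead of calling NSLog (the names `Node`, `emit`, `walk` and `add_child` are made up for illustration and are not part of libxml2), the same iterative depth-first walk might look like this:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for xmlNode: only the links the walk relies on. */
typedef struct Node {
    const char *name;
    struct Node *children; /* first child */
    struct Node *next;     /* next sibling */
    struct Node *parent;
} Node;

/* Append s to out if it still fits into cap bytes (terminator included). */
static void emit(char *out, size_t cap, const char *s) {
    size_t used = strlen(out);
    if (used + strlen(s) + 1 <= cap) {
        strcat(out, s);
    }
}

/* Iterative depth-first walk: "<name>" on entry, "</name>" on exit. */
static void walk(Node *root, char *out, size_t cap) {
    assert(out != NULL && cap > 0);
    out[0] = '\0';
    Node *cur = root;
    while (cur) {
        emit(out, cap, "<");
        emit(out, cap, cur->name);
        emit(out, cap, ">");
        if (cur->children) {          /* go down first */
            cur = cur->children;
            continue;
        }
        /* Leaf reached: close nodes while climbing until a sibling exists. */
        while (cur) {
            emit(out, cap, "</");
            emit(out, cap, cur->name);
            emit(out, cap, ">");
            if (cur == root) {        /* back at the start: done */
                cur = NULL;
            } else if (cur->next) {   /* go sideways */
                cur = cur->next;
                break;
            } else {                  /* go up */
                cur = cur->parent;
            }
        }
    }
}

/* Link child as the last child of parent. */
static void add_child(Node *parent, Node *child) {
    child->parent = parent;
    if (!parent->children) {
        parent->children = child;
        return;
    }
    Node *last = parent->children;
    while (last->next) {
        last = last->next;
    }
    last->next = child;
}
```

For a tree built with `add_child` in which html contains head and body, and body contains p, `walk` fills the buffer with `<html><head></head><body><p></p></body></html>`. Stopping when the climb returns to the start node is what keeps the loop from running off the top of the tree or into the start node's siblings.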

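Still, I prefer TFHpple because it is simple and easy to use, although its feature set is incomplete. For example, it cannot return child nodes, so I wrote two methods: one that returns the child nodes and one that returns all the contents. In addition, the content of a node's attribute is stored under the same key as the node's own content, @"nodeContent"; the attribute's content should properly be stored under @"attributeContent".
So I wrote the following function, which also changes the content key used for node attributes.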
NSDictionary *DictionaryForNode2(xmlNodePtr currentNode, NSMutableDictionary *parentResult)
{
    NSMutableDictionary *resultForNode = [NSMutableDictionary dictionary];

    if (currentNode->name) {
        NSString *currentNodeContent =
            [NSString stringWithCString:(const char *)currentNode->name encoding:NSUTF8StringEncoding];
        [resultForNode setObject:currentNodeContent forKey:@"nodeName"];
    }

    if (currentNode->content) {
        NSString *currentNodeContent =
            [NSString stringWithCString:(const char *)currentNode->content encoding:NSUTF8StringEncoding];
        // stringWithCString:encoding: returns nil for bytes that are not valid UTF-8
        if (currentNodeContent && currentNode->type == XML_TEXT_NODE) {
            if (currentNode->parent->type == XML_ELEMENT_NODE) {
                // text inside an element is stored on the parent as its content
                [parentResult setObject:currentNodeContent forKey:@"nodeContent"];
                return nil;
            }
            if (currentNode->parent->type == XML_ATTRIBUTE_NODE) {
                // text inside an attribute gets its own key, trimmed
                [parentResult
                    setObject:
                        [currentNodeContent
                            stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceAndNewlineCharacterSet]]
                       forKey:@"attributeContent"];
                return nil;
            }
        }
    }

    xmlAttr *attribute = currentNode->properties;
    if (attribute) {
        NSMutableArray *attributeArray = [NSMutableArray array];
        while (attribute) {
            NSMutableDictionary *attributeDictionary = [NSMutableDictionary dictionary];
            NSString *attributeName =
                [NSString stringWithCString:(const char *)attribute->name encoding:NSUTF8StringEncoding];
            if (attributeName) {
                [attributeDictionary setObject:attributeName forKey:@"attributeName"];
            }

            if (attribute->children) {
                // a text child writes "attributeContent" into attributeDictionary and returns nil
                NSDictionary *childDictionary = DictionaryForNode2(attribute->children, attributeDictionary);
                if (childDictionary) {
                    [attributeDictionary setObject:childDictionary forKey:@"attributeContent"];
                }
            }

            if ([attributeDictionary count] > 0) {
                [attributeArray addObject:attributeDictionary];
            }
            attribute = attribute->next;
        }

        if ([attributeArray count] > 0) {
            [resultForNode setObject:attributeArray forKey:@"nodeAttributeArray"];
        }
    }

    xmlNodePtr childNode = currentNode->children;
    if (childNode) {
        NSMutableArray *childContentArray = [NSMutableArray array];
        while (childNode) {
            NSDictionary *childDictionary = DictionaryForNode2(childNode, resultForNode);
            if (childDictionary) {
                [childContentArray addObject:childDictionary];
            }
            childNode = childNode->next;
        }

        if ([childContentArray count] > 0) {
            [resultForNode setObject:childContentArray forKey:@"nodeChildArray"];
        }
    }

    return resultForNode;
}

I added two key constants to TFHppleElement.m:

NSString * const TFHppleNodeAttributeContentKey = @"attributeContent"; 
NSString * const TFHppleNodeChildArrayKey = @"nodeChildArray";

and changed the method that returns the attributes to:

- (NSDictionary *) attributes
{
    NSMutableDictionary * translatedAttributes = [NSMutableDictionary dictionary];
    for (NSDictionary * attributeDict in [node objectForKey:TFHppleNodeAttributeArrayKey]) {
        NSString * name = [attributeDict objectForKey:TFHppleNodeAttributeNameKey];
        NSString * content = [attributeDict objectForKey:TFHppleNodeAttributeContentKey];
        // skip attributes without a name or value; setObject:forKey: raises on nil
        if (name && content) {
            [translatedAttributes setObject:content forKey:name];
        }
    }
    return translatedAttributes;
}

and added methods for getting the child nodes:

- (BOOL) hasChildren
{
    NSArray *childs = [node objectForKey: TFHppleNodeChildArrayKey];
    if (childs) {
        return YES;
    }
    return NO;
}

- (NSArray *) children
{
    if ([self hasChildren]) {
        return [node objectForKey: TFHppleNodeChildArrayKey];
    }
    return nil;
}

Reposted from: http://jbhal.baihongyu.com/
