How can I extract web page content with regular expressions in C#?
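For example, take the page http://stockdata.stock.hexun.com/2009_zxcwzb_002024.shtml: I need to extract its financial data quarter by quarter and store it in a database. Any help from the experts here would be greatly appreciated!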
For comparison, the page http://stockdata.stock.hexun.com/002024.shtml is scraped as shown below. Can a similar approach be used for the page above? Please provide code.
Match ProfitEachMatch = Regex.Match(IntegratedStaticHtmlString, "每股收益\\(元\\)</td><td class='tb2_new'>(?<ProfitEach>[-]?\\d+\\W\\d+)</td>", RegexOptions.IgnoreCase);
Match IPOTimeMatch = Regex.Match(IntegratedStaticHtmlString, "上市时间</td><td class='tb2_new'>(?<IPOTime>\\d{4}-\\d{2}-\\d{2})</td>", RegexOptions.IgnoreCase);
Match NetWorthEachMatch = Regex.Match(IntegratedStaticHtmlString, "每股净资产\\(元\\)</td><td class='tb2_new'>(?<NetWorthEach>[-]?\\d+\\W\\d+)</td>", RegexOptions.IgnoreCase);
2 answers
<div>最新财务指标</div>
<table width="100%" border="0" cellspacing="0" cellpadding="0" class="web2">
<span id="ControlEx1_lbl"><tr><td class='dotborder' width='35%'><div class='tishi'><strong>会计年度</strong></div></td><td class='dotborder'><div class='tishi'>2010-03-15</div></td><td class='dotborder'><div class='tishi'>2009-12-31</div></td><td><div class='tishi'>2009-09-30</div></td></tr><tr><td class='dotborder' width='35%'><div class='tishi'><strong>营业收入</strong></div></td><td class='dotborder'><div class='tishi'>16,711,983,000.00</div></td><td class='dotborder'><div class='tishi'>58,300,149,000.00</div></td><td><div class='tishi'>41,573,938,000.00</div></td></tr><tr><td class='dotborder' width='35%'><div class='tishi'><strong>净利润(元)</strong></div></td><td class='dotborder'><div class='tishi'>883,809,000.00</div></td><td class='dotborder'><div class='tishi'>2,889,956,000.00</div></td><td><div class='tishi'>1,969,508,000.00</div></td></tr><tr><td class='dotborder' width='35%'><div class='tishi'><strong>利润总额(元)</strong></div></td><td class='dotborder'><div class='tishi'>1,194,506,000.00</div></td><td class='dotborder'><div class='tishi'>3,926,367,000.00</div></td><td><div class='tishi'>2,666,614,000.00</div></td></tr><tr><td class='dotborder' width='35%'><div class='tishi'><strong>扣除非经常性损益后的净利润(元)</strong></div></td><td class='dotborder'><div class='tishi'>--</div></td><td class='dotborder'><div class='tishi'>2,852,724,000.00</div></td><td><div class='tishi'>--</div></td></tr><tr><td class='dotborder' width='35%'><div class='tishi'><strong>总资产(元)</strong></div></td><td class='dotborder'><div class='tishi'>36,803,812,000.00</div></td><td class='dotborder'><div class='tishi'>35,839,832,000.00</div></td><td><div class='tishi'>32,383,657,000.00</div></td></tr><tr><td class='dotborder' width='35%'><div class='tishi'><strong>股东权益(元)</strong></div></td><td class='dotborder'><div class='tishi'>15,420,979,000.00</div></td><td class='dotborder'><div class='tishi'>14,540,346,000.00</div></td><td><div class='tishi'>10,655,729,000.00</div></td></tr><tr><td class='dotborder' 
width='35%'><div class='tishi'><strong>经营活动产生的现金流量净额(元)</strong></div></td><td class='dotborder'><div class='tishi'>--</div></td><td class='dotborder'><div class='tishi'>--</div></td><td><div class='tishi'>4,625,304,000.00</div></td></tr><tr><td class='dotborder' width='35%'><div class='tishi'><strong>每股收益(摊薄)</strong></div></td><td class='dotborder'><div class='tishi'>0.13</div></td><td class='dotborder'><div class='tishi'>0.64</div></td><td><div class='tishi'>0.44</div></td></tr><tr><td class='dotborder' width='35%'><div class='tishi'><strong>净资产收益率(摊薄)(%)</strong></div></td><td class='dotborder'><div class='tishi'>5.73</div></td><td class='dotborder'><div class='tishi'>19.88</div></td><td><div class='tishi'>18.48</div></td></tr><tr><td class='dotborder' width='35%'><div class='tishi'><strong>每股经营活动产生的现金流量净额(元)</strong></div></td><td class='dotborder'><div class='tishi'>-0.12</div></td><td class='dotborder'><div class='tishi'>1.19</div></td><td><div class='tishi'>1.03</div></td></tr><tr><td class='dotborder' width='35%'><div class='tishi'><strong>每股净资产(元)</strong></div></td><td class='dotborder'><div class='tishi'>3.31</div></td><td class='dotborder'><div class='tishi'>3.12</div></td><td><div class='tishi'>2.38</div></td></tr><tr><td class='dotborder' width='35%'><div class='tishi'><strong>调整后每股净资产(元)</strong></div></td><td class='dotborder'><div class='tishi'>--</div></td><td class='dotborder'><div class='tishi'>--</div></td><td><div class='tishi'>--</div></td></tr><tr><td class='dotborder' width='35%'><div class='tishi'><strong>境外会计准则净利润(元)</strong></div></td><td class='dotborder'><div class='tishi'>--</div></td><td class='dotborder'><div class='tishi'>--</div></td><td><div class='tishi'>--</div></td></tr><tr><td class='dotborder' width='35%'><div class='tishi'><strong>扣除非经常性损益后的每股收益(元)</strong></div></td><td class='dotborder'><div class='tishi'>--</div></td><td class='dotborder'><div class='tishi'>0.64</div></td><td><div class='tishi'>--</div></td></tr><tr><td 
class='dotborder' width='35%'><div class='tishi'><strong>报告起始时间</strong></div></td><td class='dotborder'><div class='tishi'>2010-01-01</div></td><td class='dotborder'><div class='tishi'>2009-01-01</div></td><td><div class='tishi'>2009-01-01</div></td></tr><tr><td class='dotborder' width='35%'><div class='tishi'><strong>报告终止时间</strong></div></td><td class='dotborder'><div class='tishi'>2010-03-31</div></td><td class='dotborder'><div class='tishi'>2009-12-31</div></td><td><div class='tishi'>2009-09-30</div></td></tr></span>
</table>
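Look closely: the block above is the data.
I don't think this is easy to do with a single regular expression, but ordinary string handling in code should get you most of the way there.
1. Fetch the HTML of the page.
2. Find the start of the useful information, the text "最新财务指标" (latest financial indicators).
3. A table follows it. Use <table as the start of the substring to cut out and </table> as the end.
4. If you only need to display the information, you can save that fragment to the database as it is.
5. If you need every value individually, it takes a bit more work: cut the string apart by tr and td, then strip the HTML tags with replacements. What remains is the data itself.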
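Steps 2, 3 and 5 above could be sketched roughly as follows. This is only a minimal illustration, not tested against the live page: it assumes the page HTML has already been fetched into a string (as with IntegratedStaticHtmlString in the question), and the names FinancialTableParser, ExtractTable, ParseRows, PivotByPeriod and ParseAmount are made up for this example. It also assumes the table keeps the layout shown above, with the first row holding the period dates and "--" standing for a missing value.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper class; all names here are illustrative only.
public static class FinancialTableParser
{
    // Steps 2-3: find the marker text, then cut out the first
    // <table ...>...</table> that follows it. Returns null if not found.
    public static string ExtractTable(string html, string marker)
    {
        int markerPos = html.IndexOf(marker, StringComparison.Ordinal);
        if (markerPos < 0) return null;
        int start = html.IndexOf("<table", markerPos, StringComparison.OrdinalIgnoreCase);
        if (start < 0) return null;
        int end = html.IndexOf("</table>", start, StringComparison.OrdinalIgnoreCase);
        if (end < 0) return null;
        return html.Substring(start, end - start + "</table>".Length);
    }

    // Step 5: split the table into rows (tr) and cells (td),
    // then delete the remaining tags inside each cell.
    public static List<string[]> ParseRows(string tableHtml)
    {
        var rows = new List<string[]>();
        RegexOptions opts = RegexOptions.IgnoreCase | RegexOptions.Singleline;
        foreach (Match tr in Regex.Matches(tableHtml, @"<tr\b[^>]*>(.*?)</tr>", opts))
        {
            var cells = new List<string>();
            foreach (Match td in Regex.Matches(tr.Groups[1].Value, @"<td\b[^>]*>(.*?)</td>", opts))
            {
                cells.Add(Regex.Replace(td.Groups[1].Value, "<[^>]+>", "").Trim());
            }
            if (cells.Count > 0) rows.Add(cells.ToArray());
        }
        return rows;
    }

    // Regroup the rows by reporting period: period date -> (item name -> raw text).
    // Assumes rows[0] is the header row whose cells 1..n are the period dates.
    public static Dictionary<string, Dictionary<string, string>> PivotByPeriod(List<string[]> rows)
    {
        var byPeriod = new Dictionary<string, Dictionary<string, string>>();
        if (rows.Count == 0) return byPeriod;
        string[] header = rows[0];
        for (int r = 1; r < rows.Count; r++)
        {
            string[] row = rows[r];
            for (int c = 1; c < row.Length && c < header.Length; c++)
            {
                Dictionary<string, string> items;
                if (!byPeriod.TryGetValue(header[c], out items))
                {
                    items = new Dictionary<string, string>();
                    byPeriod[header[c]] = items;
                }
                items[row[0]] = row[c];
            }
        }
        return byPeriod;
    }

    // Convert a cell such as "16,711,983,000.00" or "-0.12" to a decimal.
    // Returns null for "--" or anything else that is not a number.
    public static decimal? ParseAmount(string cell)
    {
        decimal value;
        bool ok = decimal.TryParse(cell, NumberStyles.Number, CultureInfo.InvariantCulture, out value);
        return ok ? value : (decimal?)null;
    }
}
```

To use it, call ExtractTable(html, "最新财务指标"), pass the result to ParseRows, and then PivotByPeriod gives one dictionary per period that can be written to the database as one record per quarter. Keeping the cell text raw until ParseAmount is called means the date rows (report start and end) can still be read as strings.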
This answer was accepted by the asker.
Here to learn.