How do I extract the link code from a web page's source?

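Build a form in VB with two text boxes, text1 and text2 (MultiLine set to True on both), and one button, command1. The page source is typed or pasted into text1. Clicking command1 should write the hyperlink code found in that source into text2, without any of the attribute definitions that were in the original links and with all other code removed. Each link must occupy exactly one line and must not wrap.
My initial idea, for reference: the goal is to pull only the plain link code out of a full page of markup, that is, to extract code of the form "<A href="http://www.aaa.cn/">大学校园</A>" from a mass of other code. All sorts of code may appear before and after a link, and the link itself may contain font, color, and similar attribute markup. So could the extraction rely on the characteristic shape of a link, considering only <A href="http://www.aaa.cn/">, </A>, and the link name between them that does not sit inside a <> or </> tag?
Note: each extracted link must occupy exactly one line and must not wrap.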

haokeyy
2010-04-15


Private Sub Command1_Click()

    Dim s As String

    ' Remove all CR/LF line breaks so that each link ends up on a single line
    s = Replace(Text1.Text, vbCrLf, "")

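    ' Early-bound alternative (needs a project reference to
    ' "Microsoft VBScript Regular Expressions 5.5"):
    'Dim oRegEx As RegExp
    'Set oRegEx = New RegExp
    'Dim oMatches As MatchCollection
    'Dim oMatch   As Match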
    
    Dim oRegEx As Object
    Set oRegEx = CreateObject("VBScript.RegExp")
    Dim oMatches As Object
    Dim oMatch As Object
    
    With oRegEx
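        .Global = True ' find every match, not just the first
        .IgnoreCase = True ' case-insensitive, so <A> and <a> both match
        ' Pattern for all A tags. Submatch 0 is the href value (quotes optional,
        ' no spaces inside); submatch 1 is everything between the opening tag and </a>.
        .Pattern = "<a\b[^>]*?href\s*=\s*[""']?([^""'\s>]*)[""']?[^>]*>([\s\S]*?)</a>"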
        Set oMatches = .Execute(s)

        Text2.Text = "" ' clear old output even when nothing matches

        If oMatches.Count >= 1 Then

            Dim sHref    As String, sInnerText As String

            Dim i        As Integer

            Dim sLink    As String

            ' Early-bound alternative (needs a project reference to "Microsoft Scripting Runtime"):
            'Dim colLinks As Scripting.Dictionary
            'Set colLinks = New Scripting.Dictionary

            Dim colLinks As Object
            Set colLinks = CreateObject("Scripting.Dictionary")
            
            For Each oMatch In oMatches

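                sHref = oMatch.SubMatches(0) ' first group: the href value
                sInnerText = oMatch.SubMatches(1) ' second group: content between the opening tag and </a>
                sInnerText = RemoveTags(sInnerText) ' strip nested tags (font, color, etc.) from the link text
                sInnerText = Replace(sInnerText, " ", "") ' remove all spaces from the link text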
                sLink = "<A href=""" & sHref & """>" & sInnerText & "</A>"

                If Not colLinks.Exists(sLink) Then
                    colLinks.Add sLink, sLink
                    Text2.Text = Text2.Text & sLink & vbNewLine
                End If

            Next

        End If

    End With

    Set oMatches = Nothing
    Set oMatch = Nothing
    Set oRegEx = Nothing
    Set colLinks = Nothing
End Sub

' Strips every HTML tag from the given string and returns what is left
Function RemoveTags(ByVal html As String) As String

    'Dim oRegEx As RegExp
    'Set oRegEx = New RegExp
    Dim oRegEx As Object

    Set oRegEx = CreateObject("VBScript.RegExp")

    With oRegEx
        .Global = True
        .IgnoreCase = True
        .Pattern = "<[^>]*>"
        RemoveTags = .Replace(html, "")
    End With

    Set oRegEx = Nothing
End Function
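
To try the approach without building the form, a sketch along the following lines can be pasted into a standard module and run from the Immediate window. The names ExtractLinksDemo and TestExtractLinksDemo and the sample HTML are made up for illustration and are not part of the answer above. The pattern is a simplified one that assumes the href value contains no spaces.

```vb
' Hypothetical helper: returns one normalized <A href="...">text</A> per line.
Public Function ExtractLinksDemo(ByVal html As String) As String
    Dim re As Object, tagRe As Object, m As Object
    Dim result As String, inner As String

    Set re = CreateObject("VBScript.RegExp")
    re.Global = True
    re.IgnoreCase = True
    ' Group 1: href value (quotes optional); group 2: everything up to </a>
    re.Pattern = "<a\b[^>]*?href\s*=\s*[""']?([^""'\s>]*)[""']?[^>]*>([\s\S]*?)</a>"

    ' Second expression: matches any single tag so it can be stripped from the link text
    Set tagRe = CreateObject("VBScript.RegExp")
    tagRe.Global = True
    tagRe.Pattern = "<[^>]*>"

    For Each m In re.Execute(Replace(html, vbCrLf, ""))
        inner = tagRe.Replace(m.SubMatches(1), "")
        result = result & "<A href=""" & m.SubMatches(0) & """>" & inner & "</A>" & vbCrLf
    Next

    ExtractLinksDemo = result
End Function

Public Sub TestExtractLinksDemo()
    Dim html As String
    ' Assumed sample input: a link with an extra attribute and a nested font tag
    html = "<p><A class=""x"" href=""http://www.aaa.cn/""><font color=""red"">Campus</font></A></p>"
    Debug.Print ExtractLinksDemo(html) ' prints <A href="http://www.aaa.cn/">Campus</A>
End Sub
```

The class attribute and the font tag are both dropped, leaving only the href and the link name, which is what the question asks for.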


This answer was accepted by the asker.






mooey122
2010-04-13

Use a regular expression and look for whatever sits between <a> and </a>. It is quite simple.
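
Read literally, that suggestion might look like the sketch below. LinkTextsDemo is a made-up name, and the pattern is only one possible reading of "between <a> and </a>".

```vb
' Hypothetical helper: returns the raw content between each <a ...> and </a>, one per line.
Public Function LinkTextsDemo(ByVal html As String) As String
    Dim re As Object, m As Object, result As String

    Set re = CreateObject("VBScript.RegExp")
    re.Global = True
    re.IgnoreCase = True
    ' Group 1 captures everything from the end of the opening tag up to the nearest </a>
    re.Pattern = "<a\b[^>]*>([\s\S]*?)</a>"

    For Each m In re.Execute(html)
        result = result & m.SubMatches(0) & vbCrLf
    Next

    LinkTextsDemo = result
End Function
```

For example, `Debug.Print LinkTextsDemo("<a href=""x.htm"">One</a> and <a href=""y.htm"">Two</a>")` prints One and Two on separate lines. This version discards the href and leaves any nested tags in place, so producing the full link code the question asks for still needs a capture group for the href and a tag-stripping step.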
