Crawler of SAP Wiki

Background:

In daily work on the ServiceNow platform, our employees often need to check documents in the SAP wiki or stories on ServiceNow when they have trouble with a process or are unsure of the exact operations. But there are so many documents and stories that finding the target information is difficult.

So we are going to develop a search engine to solve this pain point. Since we don't have a database of the SAP wiki, we need to create a crawler to collect the data. The target wiki directory is: 51. HCSM Documentation (internal) - SSI Support Application Landscape - Wiki@SAP

Design thought:

ALG: The SAP wiki directory is a tree structure: a parent node contains many child nodes, and many of those child nodes contain another batch of child nodes, and so on. To visit all of them, I chose a depth-first traversal. The crawler walks down the tree and backtracks whenever the current node has no child nodes.

Crawler: Because of network security restrictions, we can't simply send requests to fetch the HTML source. So we use Selenium with a WebDriver to simulate browser operations, which means a browser will be started when you run this program. A minimal sketch of the depth-first crawl is shown below.
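The sketch below assumes each wiki page lists its child pages as links in the Confluence page-tree sidebar; the element id and CSS selector are assumptions about the page structure, not necessarily what the actual program uses:

```python
from selenium.webdriver.common.by import By

def crawl(chrome, url, results):
    """Depth-first traversal: visit the page, record it, then recurse into its children."""
    chrome.get(url)
    title = chrome.find_element(by=By.TAG_NAME, value='h1').text
    content = chrome.find_element(by=By.ID, value='main-content').text  # assumed content container id
    results.append({'Title': title, 'Content': content, 'URL': url, 'Type': 'wiki'})

    # Collect the child page URLs first (as plain strings), then descend into each one.
    child_urls = [a.get_attribute('href')
                  for a in chrome.find_elements(by=By.CSS_SELECTOR, value='.plugin_pagetree_children a')]
    for child_url in child_urls:
        crawl(chrome, child_url, results)
    # When a page has no child links, the loop body never runs and the call returns: this is the backtracking step.
```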

Prerequisite of running:

* Modify exec_path to point to your chromedriver's path (old method, no longer used):

# not used anymore
exec_path = r'C:\Program Files\Google\Chrome\Application\chromedriver.exe'
chrome = webdriver.Chrome(executable_path=exec_path)

**NEW!** We now use ChromeDriverManager().install() to download and set up the Chrome web driver automatically (see the install note at the end of this section).

service = Service(executable_path=ChromeDriverManager().install())
chrome = webdriver.Chrome(service=service)
* Make sure you are using SAP's network.
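A note on the **NEW!** snippet above: webdriver_manager is a separate package, not part of Selenium itself. Assuming a standard pip setup, the install step and the imports the snippet relies on look like this:

```python
# Install once (assuming a standard pip environment):
#   pip install selenium webdriver-manager
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# Same two lines as above, now with their imports in place.
service = Service(executable_path=ChromeDriverManager().install())
chrome = webdriver.Chrome(service=service)
```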

Running Result:

**NEW!** We now crawl all the documents from the wiki.

We will get the title, content, and URL of each document, and a file named WikiJson.json will be generated in the same directory as this program. Each record in this file looks like:

{"id":0, "Title":"51.1 Customer", "Content":"51.1 Customer Page sections for this module Process / Product overview: Requirements, Solution blueprint, Architecture Implementation details User Interface: Agent Workspace, UI16, headers, field definitions, UI actions, UI policies, tabs and related lists, notifications, dashboards, lists, attachment handling, templates, actions plans, removed OOTB functions Process model: Workflow, status values, routing, scripts, business rules, data flow between tables in the same module, removed OOTB functions, SLA definitions Technical dependencies: Data pre-requisites, user roles and permissions Integrations: integrations to other processes / records, Roles and responsibilities: SME / contact person / team Process integrations: Related lists, data exchange and dependencies with other processes, workflow handovers Guidelines for Support & Operations: Error handling / Troubleshooting, logging information Related feature and references: Feature number, attachments, further links, other KBAs for details", "URL":"https://wiki.wdf.sap.corp/wiki/display/wikissisal/51.1+Customer-facing+portal?src=sidebar", "Type":"wiki"}

But the data is not clean. It contains useless characters such as stray double quotes, \xa0 (non-breaking spaces), and so on. When we post this data to Elasticsearch, these characters cause errors, so I use regular expressions to extract and clean the fields:


import re

# 'line' is one JSON record read from WikiJson.json
title = re.findall(r'"Title":"(.*)", "Content"', line)[0]
title_new = title.replace('"', "'")  # turn embedded double quotes into single quotes
content = re.findall(r'"Content":"(.*)", "URL', line)[0]
content_new = content.replace('"', ' ').replace('\xa0', ' ')  # drop quotes and non-breaking spaces
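For context, a sketch of the surrounding loop that reads WikiJson.json line by line and writes the cleaned records to a new file; the clean_line helper and the output file name are hypothetical:

```python
import re

def clean_line(line):
    """Apply the replacements shown above to one JSON record."""
    title = re.findall(r'"Title":"(.*)", "Content"', line)[0]
    content = re.findall(r'"Content":"(.*)", "URL', line)[0]
    return (line.replace(title, title.replace('"', "'"))
                .replace(content, content.replace('"', ' ').replace('\xa0', ' ')))

# Hypothetical clean-up pass over the crawled file.
with open('WikiJson.json', encoding='utf-8') as fin, \
     open('WikiJsonClean.json', 'w', encoding='utf-8') as fout:
    for line in fin:
        fout.write(clean_line(line))
```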

Other Uses:

This program can also crawl other directories in the SAP wiki. You only need to modify the root URL here:
chrome.get('https://wiki.wdf.sap.corp/wiki/pages/viewpage.action?pageId=2301363095')

Data Range

1. Source Data

Story of PPM: data structure of a story.

| No. | Elasticsearch Key Name | Field Name | Annotation |
|-----|------------------------|------------|------------|
| 1 | sys_id | sys_id | primary key |
| 2 | number | number | |
| 3 | description | description | |
| 4 | short_description | short_description | |
| 5 | u_update_set | u_update_set | |
| 6 | task | acceptance_criteria | |
| 7 | sn_safe_feature | feature | |
| 8 | acceptance_criteria | acceptance_criteria | |
| 9 | sys_updated_on | sys_updated_on | |
| 10 | _index = snowdata | | name of index |
| 11 | _type = doc | | type of index |
| 12 | _id | | id of index (self-defined) |
| 13 | url | | link of a story |
| 14 | time | | sorted by time order |
| 15 | Type = snow | | |

Context of Wiki: all data under 51. HCSM Documentation (internal)

| No. | Elasticsearch Key Name | Field Name | Annotation |
|-----|------------------------|------------|------------|
| 1 | Title | Title | primary key |
| 2 | Content | Content | |
| 3 | URL | URL | link of Wiki |
| 4 | _index = wikidata | | name of index |
| 5 | _type = doc | | type of index |
| 6 | _id | | id of index (self-defined) |
| 7 | Type | | |
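To make the wiki table concrete, indexing one record into the wikidata index could look roughly like this via the Elasticsearch REST API; the host is a placeholder, and the index/type/id URL path follows the pre-7.x Elasticsearch convention implied by _type = doc:

```python
import requests

# Placeholder host and document id; the index and type names follow the table above.
doc = {
    'Title': '51.1 Customer',
    'Content': '...cleaned page text...',
    'URL': 'https://wiki.wdf.sap.corp/wiki/display/wikissisal/51.1+Customer-facing+portal',
    'Type': 'wiki',
}
resp = requests.put('http://localhost:9200/wikidata/doc/0', json=doc)
print(resp.status_code, resp.json())
```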

Updated on 12/11/2023:

selenium Method Update Notice

Selenium has deprecated some methods that we used in the past, so we have changed those outdated functions to their new versions.

1. Update of the startup method

Methods Used in the past:

In the past, webdriver.Chrome() could be given the path to chromedriver.exe directly through its executable_path argument.


from selenium import webdriver
exec_path = r'C:\Program Files\Google\Chrome\Application\chromedriver.exe' # path to your chromedriver
chrome = webdriver.Chrome(executable_path=exec_path)
chrome.get('https://wiki.wdf.sap.corp/wiki/pages/viewpage.action?pageId=2301363095')

Methods Available now:

Now webdriver.Chrome() needs to receive a Service object, which is created from the driver path in the line just above it.

Notice: the package we have to import is selenium.webdriver.chrome.service, not selenium.webdriver.common.service.

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

exec_path = r'C:\Program Files\Google\Chrome\Application\chromedriver.exe'
service = Service(exec_path)
chrome = webdriver.Chrome(service=service)
chrome.get('https://wiki.wdf.sap.corp/wiki/pages/viewpage.action?pageId=2301363095')

2. Update of searching methods

You should first run from selenium.webdriver.common.by import By, and then use the functions that are available now.

PAST:
chrome.find_elements_by_xxx(XXX)
chrome.find_element_by_xxx(XXX)

NOW:
chrome.find_element(by=By.XXX, value=XXX)
chrome.find_elements(by=By.XXX, value=XXX)

EXAMPLES

PAST: chrome.find_elements_by_xpath(xpath=XXX)

NOW: chrome.find_elements(by=By.XPATH, value=XXX)
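As a concrete illustration of the new locator API, collecting all links inside a page's content area might look like this; the XPath is an assumption about the page layout, and chrome is the WebDriver instance created earlier:

```python
from selenium.webdriver.common.by import By

# Illustrative XPath: grab all links inside the main wiki content area.
links = chrome.find_elements(by=By.XPATH, value='//div[@id="main-content"]//a')
urls = [link.get_attribute('href') for link in links]
```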