Crawler of SAP Wiki

Background:

In daily work on the ServiceNow platform, our employees often need to check documents in the SAP wiki or stories on ServiceNow when they have trouble with a process or are unsure of the exact operations. But there are so many documents and stories that finding the target information is difficult.

So we are going to develop a search engine to solve this pain point. Since we don't have a database of the SAP wiki, we need to create a crawler to collect the data. The target wiki directory is: 51. HCSM Documentation (internal) - SSI Support Application Landscape - Wiki@SAP

Design thought:

ALG: The SAP wiki directory is a tree structure: a parent node contains many child nodes, and many of those child nodes contain another batch of child nodes, and so on. To visit all of them, I chose a depth-first traversal. The crawler walks down the tree and backtracks whenever the current node has no child nodes.

Crawler: Because of network security restrictions, we can't simply send requests to fetch the HTML source. So we use Selenium with a WebDriver to simulate browser operations, which means a browser will be started when you run this program. A minimal sketch of the depth-first crawl is shown below.
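The sketch below assumes each wiki page lists its child pages as links in the Confluence page-tree sidebar; the element id and CSS selector are assumptions about the page structure, not necessarily what the actual program uses:

```python
from selenium.webdriver.common.by import By

def crawl(chrome, url, results):
    """Depth-first traversal: visit the page, record it, then recurse into its children."""
    chrome.get(url)
    title = chrome.find_element(by=By.TAG_NAME, value='h1').text
    content = chrome.find_element(by=By.ID, value='main-content').text  # assumed content container id
    results.append({'Title': title, 'Content': content, 'URL': url, 'Type': 'wiki'})

    # Collect the child page URLs first (as plain strings), then descend into each one.
    child_urls = [a.get_attribute('href')
                  for a in chrome.find_elements(by=By.CSS_SELECTOR, value='.plugin_pagetree_children a')]
    for child_url in child_urls:
        crawl(chrome, child_url, results)
    # When a page has no child links, the loop body never runs and the call returns: this is the backtracking step.
```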

Prerequisite of running:

* Modify exec_path to point to your chromedriver's path (old method, no longer used):

# not used anymore
exec_path = r'C:\Program Files\Google\Chrome\Application\chromedriver.exe'
chrome = webdriver.Chrome(executable_path=exec_path)

**NEW!** We now use ChromeDriverManager().install() to download and set up the Chrome web driver automatically (see the install note at the end of this section).

service = Service(executable_path=ChromeDriverManager().install())
chrome = webdriver.Chrome(service=service)
* Make sure you are using SAP's network.
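A note on the **NEW!** snippet above: webdriver_manager is a separate package, not part of Selenium itself. Assuming a standard pip setup, the install step and the imports the snippet relies on look like this:

```python
# Install once (assuming a standard pip environment):
#   pip install selenium webdriver-manager
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# Same two lines as above, now with their imports in place.
service = Service(executable_path=ChromeDriverManager().install())
chrome = webdriver.Chrome(service=service)
```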

Running Result:

**NEW!** We now crawl all the documents from the wiki.

We will get the title, content, and URL of each document, and a file named WikiJson.json will be generated in the same directory as this program. Each record in this file looks like:

{"id":0, "Title":"51.1 Customer", "Content":"51.1 Customer Page sections for this module Process / Product overview: Requirements, Solution blueprint, Architecture Implementation details User Interface: Agent Workspace, UI16, headers, field definitions, UI actions, UI policies, tabs and related lists, notifications, dashboards, lists, attachment handling, templates, actions plans, removed OOTB functions Process model: Workflow, status values, routing, scripts, business rules, data flow between tables in the same module, removed OOTB functions, SLA definitions Technical dependencies: Data pre-requisites, user roles and permissions Integrations: integrations to other processes / records, Roles and responsibilities: SME / contact person / team Process integrations: Related lists, data exchange and dependencies with other processes, workflow handovers Guidelines for Support & Operations: Error handling / Troubleshooting, logging information Related feature and references: Feature number, attachments, further links, other KBAs for details", "URL":"https://wiki.wdf.sap.corp/wiki/display/wikissisal/51.1+Customer-facing+portal?src=sidebar", "Type":"wiki"}

But the data is not clean. It contains useless characters such as stray double quotes, \xa0 (non-breaking spaces), and so on. When we post this data to Elasticsearch, these characters cause errors, so I use regular expressions to extract and clean the fields:


import re

# 'line' is one JSON record read from WikiJson.json
title = re.findall(r'"Title":"(.*)", "Content"', line)[0]
title_new = title.replace('"', "'")  # turn embedded double quotes into single quotes
content = re.findall(r'"Content":"(.*)", "URL', line)[0]
content_new = content.replace('"', ' ').replace('\xa0', ' ')  # drop quotes and non-breaking spaces
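For context, a sketch of the surrounding loop that reads WikiJson.json line by line and writes the cleaned records to a new file; the clean_line helper and the output file name are hypothetical:

```python
import re

def clean_line(line):
    """Apply the replacements shown above to one JSON record."""
    title = re.findall(r'"Title":"(.*)", "Content"', line)[0]
    content = re.findall(r'"Content":"(.*)", "URL', line)[0]
    return (line.replace(title, title.replace('"', "'"))
                .replace(content, content.replace('"', ' ').replace('\xa0', ' ')))

# Hypothetical clean-up pass over the crawled file.
with open('WikiJson.json', encoding='utf-8') as fin, \
     open('WikiJsonClean.json', 'w', encoding='utf-8') as fout:
    for line in fin:
        fout.write(clean_line(line))
```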

Other Uses:

This program can also crawl other directories in the SAP wiki. You only need to modify the root URL here:
chrome.get('https://wiki.wdf.sap.corp/wiki/pages/viewpage.action?pageId=2301363095')

Data Range

1. Source Data

Story of PPM: data structure of a story.

| No. | Elasticsearch Key Name | Field Name | Annotation |
|-----|------------------------|------------|------------|
| 1 | sys_id | sys_id | primary key |
| 2 | number | number | |
| 3 | description | description | |
| 4 | short_description | short_description | |
| 5 | u_update_set | u_update_set | |
| 6 | task | acceptance_criteria | |
| 7 | sn_safe_feature | feature | |
| 8 | acceptance_criteria | acceptance_criteria | |
| 9 | sys_updated_on | sys_updated_on | |
| 10 | _index = snowdata | | name of index |
| 11 | _type = doc | | type of index |
| 12 | _id | | id of index (self-defined) |
| 13 | url | | link of a story |
| 14 | time | | sorted by time order |
| 15 | Type = snow | | |

Context of Wiki: all data under 51. HCSM Documentation (internal)

| No. | Elasticsearch Key Name | Field Name | Annotation |
|-----|------------------------|------------|------------|
| 1 | Title | Title | primary key |
| 2 | Content | Content | |
| 3 | URL | URL | link of Wiki |
| 4 | _index = wikidata | | name of index |
| 5 | _type = doc | | type of index |
| 6 | _id | | id of index (self-defined) |
| 7 | Type | | |
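To make the wiki table concrete, indexing one record into the wikidata index could look roughly like this via the Elasticsearch REST API; the host is a placeholder, and the index/type/id URL path follows the pre-7.x Elasticsearch convention implied by _type = doc:

```python
import requests

# Placeholder host and document id; the index and type names follow the table above.
doc = {
    'Title': '51.1 Customer',
    'Content': '...cleaned page text...',
    'URL': 'https://wiki.wdf.sap.corp/wiki/display/wikissisal/51.1+Customer-facing+portal',
    'Type': 'wiki',
}
resp = requests.put('http://localhost:9200/wikidata/doc/0', json=doc)
print(resp.status_code, resp.json())
```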

Updated on 12/11/2023:

selenium Method Update Notice

Selenium has deprecated some methods that we used in the past, so we have changed those outdated functions to their new versions.

1. Update of the startup method

Methods Used in the past:

In the past, webdriver.Chrome() could be given the path to chromedriver.exe directly through its executable_path argument.


from selenium import webdriver
exec_path = r'C:\Program Files\Google\Chrome\Application\chromedriver.exe' # path to your chromedriver
chrome = webdriver.Chrome(executable_path=exec_path)
chrome.get('https://wiki.wdf.sap.corp/wiki/pages/viewpage.action?pageId=2301363095')

Methods Available now:

Now webdriver.Chrome() needs to receive a Service object, which is created from the driver path in the line just above it.

Notice: the package we have to import is selenium.webdriver.chrome.service, not selenium.webdriver.common.service.

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

exec_path = r'C:\Program Files\Google\Chrome\Application\chromedriver.exe'
service = Service(exec_path)
chrome = webdriver.Chrome(service=service)
chrome.get('https://wiki.wdf.sap.corp/wiki/pages/viewpage.action?pageId=2301363095')

2. Update of searching methods

You should first run from selenium.webdriver.common.by import By, and then use the functions that are available now.

PAST:
chrome.find_elements_by_xxx(XXX)
chrome.find_element_by_xxx(XXX)

NOW:
chrome.find_element(by=By.XXX, value=XXX)
chrome.find_elements(by=By.XXX, value=XXX)

EXAMPLES

PAST: chrome.find_elements_by_xpath(xpath=XXX)

NOW: chrome.find_elements(by=By.XPATH, value=XXX)
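As a concrete illustration of the new locator API, collecting all links inside a page's content area might look like this; the XPath is an assumption about the page layout, and chrome is the WebDriver instance created earlier:

```python
from selenium.webdriver.common.by import By

# Illustrative XPath: grab all links inside the main wiki content area.
links = chrome.find_elements(by=By.XPATH, value='//div[@id="main-content"]//a')
urls = [link.get_attribute('href') for link in links]
```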