SAP Crawler
Crawler of SAP Wiki
Background:
In daily work on the ServiceNow platform, our employees often need to check documents in the SAP wiki or stories in ServiceNow when they run into trouble with a process or are unsure how an operation works. However, there are so many documents and stories that finding the target information is difficult.
So we are going to develop a search engine to solve this pain point. Because we do not have a database of the SAP wiki, we need a crawler to collect the data. The target wiki directory is: 51. HCSM Documentation (internal) - SSI Support Application Landscape - Wiki@SAP
Design:
ALG: The SAP wiki directory is a tree structure: a parent node contains many child nodes, and many of those child nodes contain further batches of child nodes, and so on. To reach all of them, I chose a depth-first algorithm: the crawler traverses the tree and backtracks whenever the current node has no child nodes (a minimal sketch is shown below).
Crawler: Because of network security restrictions, we cannot simply send HTTP requests to fetch the HTML source code, so we use selenium and a webdriver to simulate browser operations. Therefore, a browser will be started when you run this program.
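The following is only a minimal sketch of the depth-first traversal driven by selenium, not the actual crawler code; the `collect_page` helper, the `main-content` element id, and the page-tree XPath are assumptions made for illustration.

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

def collect_page(chrome, results):
    # Hypothetical helper: record the title, body text and URL of the current page.
    results.append({
        "Title": chrome.title,
        "Content": chrome.find_element(by=By.ID, value="main-content").text,
        "URL": chrome.current_url,
        "Type": "wiki",
    })

def dfs(chrome, url, visited, results):
    # Depth-first traversal: visit the node, then recurse into each child page.
    if url in visited:
        return
    visited.add(url)
    chrome.get(url)
    collect_page(chrome, results)
    # Collect the child links before navigating away; the XPath is an assumption
    # and must be adjusted to the real sidebar / page-tree markup.
    child_links = [a.get_attribute("href")
                   for a in chrome.find_elements(by=By.XPATH, value="//div[@class='plugin_pagetree']//a")]
    for child in child_links:
        dfs(chrome, child, visited, results)  # backtracks automatically when a node has no children
```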
Prerequisites for running:
* Modify `exec_path` with your chromedriver's path:

```python
# not used anymore
```
**NEW!** We now use `ChromeDriverManager().install()` to install the Chrome webdriver automatically (a full setup sketch follows this list):

```python
service = Service(executable_path=ChromeDriverManager().install())
```
* Make sure you are using SAP's network.
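`ChromeDriverManager` comes from the third-party `webdriver_manager` package (installed, for example, with `pip install webdriver-manager`). A minimal sketch of the full driver setup under that assumption:

```python
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# Download (or reuse a cached) chromedriver and start Chrome with it.
service = Service(executable_path=ChromeDriverManager().install())
chrome = webdriver.Chrome(service=service)
```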
Running Result:
We will get the title, content, and URL of each document, and a file named `WikiJson.json` will be generated in the same directory as this program.

**NEW!** We now crawl all the documents from the wiki. The content of this file may look like:
1 | {"id":0, "Title":"51.1 Customer", "Content":"51.1 Customer Page sections for this module Process / Product overview: Requirements, Solution blueprint, Architecture Implementation details User Interface: Agent Workspace, UI16, headers, field definitions, UI actions, UI policies, tabs and related lists, notifications, dashboards, lists, attachment handling, templates, actions plans, removed OOTB functions Process model: Workflow, status values, routing, scripts, business rules, data flow between tables in the same module, removed OOTB functions, SLA definitions Technical dependencies: Data pre-requisites, user roles and permissions Integrations: integrations to other processes / records, Roles and responsibilities: SME / contact person / team Process integrations: Related lists, data exchange and dependencies with other processes, workflow handovers Guidelines for Support & Operations: Error handling / Troubleshooting, logging information Related feature and references: Feature number, attachments, further links, other KBAs for details", "URL":"https://wiki.wdf.sap.corp/wiki/display/wikissisal/51.1+Customer-facing+portal?src=sidebar", "Type":"wiki"} |
However, the data is not clean: it contains useless characters such as `\xa0` and other non-printable characters. When we post this data to Elasticsearch, these characters cause errors, so I use regular expressions to clean them.
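The original cleaning snippet is not reproduced here; the following is a minimal sketch of this kind of regular-expression cleanup, with the exact character ranges being an assumption:

```python
import re

def clean_text(text):
    # Replace non-breaking spaces with plain spaces.
    text = text.replace("\xa0", " ")
    # Drop remaining control / non-printable characters that Elasticsearch rejects.
    text = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f]", "", text)
    # Collapse runs of whitespace into a single space.
    return re.sub(r"\s+", " ", text).strip()
```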
Other Uses:
This program can also crawl the data of other directories in the SAP wiki. You only need to modify the root URL here:

```python
chrome.get('https://wiki.wdf.sap.corp/wiki/pages/viewpage.action?pageId=2301363095')
```
Data Range
1. Source Data
Story of PPM: data structure of a story.
| No. | Elasticsearch Key Name | Field Name | Annotation |
|---|---|---|---|
| 1 | sys_id | sys_id | primary key |
| 2 | number | number | |
| 3 | description | description | |
| 4 | short_description | short_description | |
| 5 | u_update_set | u_update_set | |
| 6 | task | acceptance_criteria | |
| 7 | sn_safe_feature | feature | |
| 8 | acceptance_criteria | acceptance_criteria | |
| 9 | sys_updated_on | sys_updated_on | |
| 10 | _index = snowdata | | name of index |
| 11 | _type = doc | | type of index |
| 12 | _id | | id of index (self-defined) |
| 13 | url | | link of a story |
| 14 | time | | sorted by time order |
| 15 | Type = snow | | |
Content of Wiki: all data under 51. HCSM Documentation (internal)
| No. | Elasticsearch Key Name | Field Name | Annotation |
|---|---|---|---|
| 1 | Title | Title | primary key |
| 2 | Content | Content | |
| 3 | URL | URL | link of Wiki |
| 4 | _index = wikidata | | name of index |
| 5 | _type = doc | | type of index |
| 6 | _id | | id of index (self-defined) |
| 7 | Type | | |
Updated on 12/11/2023:
- I uploaded the Python files used for uploading data to Elasticsearch (readJson_wiki.py, readJson_snow.py).
- The ServiceNow data can be downloaded through the ServiceNow web page (https://sapppm.service-now.com/nav_to.do?uri=%2Fsn_safe_story_list.do%3Fsysparm_query%3DstateIN3%255Eassignment_group%253Dff3c91d1dbf9c090e5cd4cb11596197e%255EORassignment_group%253Dd9881444dbb50c5020627d78f49619bc%255EORassignment_group%253Deb3c2134dbfedc10fd0ee03cd39619cb%26sysparm_first_row%3D1%26sysparm_view%3Dscrum).
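The upload scripts themselves are not shown here; the following is only a minimal sketch of how the crawled wiki JSON could be bulk-indexed into the `wikidata` index with the official `elasticsearch` Python client, assuming `WikiJson.json` contains a JSON array of documents and using a placeholder host:

```python
import json
from elasticsearch import Elasticsearch, helpers

# Placeholder host; replace with the real Elasticsearch endpoint and credentials.
es = Elasticsearch("http://localhost:9200")

with open("WikiJson.json", encoding="utf-8") as f:
    docs = json.load(f)

actions = [
    {
        "_index": "wikidata",   # name of index (see table above)
        "_id": doc["id"],       # self-defined id
        # _type is omitted: recent Elasticsearch versions no longer use mapping types.
        "_source": {
            "Title": doc["Title"],
            "Content": doc["Content"],
            "URL": doc["URL"],
            "Type": doc["Type"],
        },
    }
    for doc in docs
]

helpers.bulk(es, actions)
```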
selenium Method Update Notice
Selenium has deprecated some methods that could be used in the past, so we changed those outdated functions to their new versions.
1. Update of the startup method
Methods used in the past:
In line 6, `webdriver.Chrome()` could take the chromedriver.exe path directly.
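The original snippet is not preserved in this README; the deprecated call looked roughly like this (the driver path is a placeholder):

```python
from selenium import webdriver

# Deprecated style (Selenium 3 / early Selenium 4): pass the driver path directly.
chrome = webdriver.Chrome(executable_path=r'C:\path\to\chromedriver.exe')
```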
Methods available now:
In line 6, `webdriver.Chrome()` needs to receive a Service object, which we create in line 5 (the `service = Service(...)` line shown in the Prerequisites section).
Notice: the package to import is `selenium.webdriver.chrome.service`, not `selenium.webdriver.common.service`.

```python
from selenium.webdriver.chrome.service import Service
```
2. Update of searching methods
You should first import `from selenium.webdriver.common.by import By`, and then use the functions that are available now.
PAST:
`chrome.find_elements_by_xxx(XXX)`
`chrome.find_element_by_xxx(XXX)`
NOW:
`chrome.find_element(by=By.XXX, value=XXX)`
`chrome.find_elements(by=By.XXX, value=XXX)`
EXAMPLE:
PAST: `chrome.find_elements_by_xpath(xpath=XXX)`
NOW: `chrome.find_elements(by=By.XPATH, value=XXX)`
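For example, collecting the title and child links of the current page with the new API could look like this; the `title-text` id and the page-tree XPath are illustrative assumptions, and `chrome` is the driver started earlier:

```python
from selenium.webdriver.common.by import By

# New-style lookups: a single element, or a list of elements.
title_element = chrome.find_element(by=By.ID, value="title-text")
child_links = chrome.find_elements(by=By.XPATH, value="//div[@class='plugin_pagetree']//a")
print(title_element.text, len(child_links))
```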