Using Wikidata

Wikidata is a collaboratively maintained knowledge-base that is part of the larger Wikimedia universe and that includes more than 120,000,000 data entities. This data is also used to cross-link pages about the same subject across Wiki projects (like different language versions).

Main concepts

The knowledge base of Wikidata consists of items, which are described by properties that can take different values (these values can in turn be items too). Together, properties and values make up statements about a given item.

The Introduction to Wikidata makes this less abstract with an example, using the British science fiction writer Douglas Adams:

The highlighted statement (green) is about Adams’ education, shown as the educated_at property (violet) and gives two values for this, St John's College (demonstrated in orange) and Brentwood School. In this case, both of these values are Wikidata items themselves. But in other cases, this could also just be a simple value, e.g. a date for the birth date.

Each of those statements can also be further qualified and supported by references, depending on the property in question. Qualifiers, as seen here, could be start/end dates, degrees, titles and more.
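To make this structure concrete, here is a minimal sketch of how an item, its statements, qualifiers and references relate to each other, written as plain Python data classes. The class and field names are invented for illustration and are not part of Wikidata or pywikibot, and the qualifier values are illustrative placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class Statement:
    """One property/value pair attached to an item (hypothetical model)."""
    prop: str                                        # property, e.g. "educated at"
    value: str                                       # simple value or label of another item
    qualifiers: dict = field(default_factory=dict)   # e.g. start/end dates
    references: list = field(default_factory=list)   # sources backing the statement


@dataclass
class Item:
    """A Wikidata-style item with a label and a list of statements."""
    label: str
    statements: list = field(default_factory=list)

    def values_of(self, prop):
        """Return all values recorded for one property (zero, one or many)."""
        return [s.value for s in self.statements if s.prop == prop]


adams = Item("Douglas Adams")
adams.statements.append(Statement(
    "educated at", "St John's College",
    qualifiers={"start time": "1971", "end time": "1974"},  # illustrative only
))
adams.statements.append(Statement("educated at", "Brentwood School"))

print(adams.values_of("educated at"))
```

Note that one property can carry several values, which is why `values_of` returns a list.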

Wikidata can be queried using the SPARQL query language, which allows joining together required values for different properties to search for items.

Motivation

When we tried to find all Kenyan politicians, we used the Category:Kenyan politicians category of the English-language Wikipedia as our starting point. But this is not ideal, as it will include false positives, in the form of pages that are not actually articles about politicians, and false negatives, that is, pages about Kenyan politicians that are missing because they aren’t listed in the category.

As an example of a false positive when using the Category namespace: as there are many sub-categories for our category, we used recursion=True on Category:Kenyan politicians to make sure we get all the pages of politicians. But our resulting list of pages surprisingly also includes the page about the Jomo Kenyatta International Airport!

Why? It’s because the page of the airport is part of the Category:Jomo Kenyatta, which in turn is part of Category:Presidents of Kenya, which in turn is part of Category:Political office-holders in Kenya, which - finally - is part of Category:Kenyan politicians.
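The chain above can be reproduced with a toy category graph. This is a sketch only: the dictionaries model just the four categories named here, the page assigned to Category:Presidents of Kenya is an assumption of the toy, and `collect_pages` is a hypothetical stand-in for a recursive category listing rather than a pywikibot call.

```python
# Toy category graph: sub-categories and pages per category.
SUBCATEGORIES = {
    "Category:Kenyan politicians": ["Category:Political office-holders in Kenya"],
    "Category:Political office-holders in Kenya": ["Category:Presidents of Kenya"],
    "Category:Presidents of Kenya": ["Category:Jomo Kenyatta"],
    "Category:Jomo Kenyatta": [],
}
PAGES = {
    "Category:Presidents of Kenya": ["Jomo Kenyatta"],  # assumed for the toy
    "Category:Jomo Kenyatta": ["Jomo Kenyatta International Airport"],
}


def collect_pages(category, seen=None):
    """Recursively gather all pages at or below a category."""
    seen = set() if seen is None else seen
    if category in seen:  # guard against loops in the category graph
        return []
    seen.add(category)
    pages = list(PAGES.get(category, []))
    for sub in SUBCATEGORIES.get(category, []):
        pages.extend(collect_pages(sub, seen))
    return pages


print(collect_pages("Category:Kenyan politicians"))
```

Every step down the graph is plausible on its own, but recursion follows all of them, so the airport ends up in the result.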

Wikidata queries to find all pages of a type

By writing a narrower query for Wikidata, we can avoid such false positives and also get a whole range of cross-links between the different Wiki projects, e.g. to find the English and Swahili editions of the same article.

Writing a query

What properties and values could we use to find only Kenyan politicians?

The articles should be about people, not airports. So we can filter to find items that describe humans only. If we search for this in Wikidata, we find that the corresponding property/value translates to P31 (instance of) and that humans are described by item Q5.
Now, to find politicians, we can filter by finding items that additionally give the occupation P106 as politician (Q82955).
Lastly, we are looking for politicians that are from Kenya. One potential way to narrow it down to this could be filtering for the country of citizenship (P27) to be Kenya (Q114).
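These three filters can be written down as property/value pairs and joined into a SPARQL query. The sketch below assumes the usual `wdt:` and `wd:` prefixes for properties and items; `build_query` is a hypothetical helper for illustration, not part of pywikibot.

```python
def build_query(filters, limit=5):
    """Assemble a SPARQL query from (property ID, item ID) pairs."""
    patterns = [f"wdt:{prop} wd:{value}" for prop, value in filters]
    body = " ;\n          ".join(patterns)  # ';' chains patterns on the same ?item
    return (
        "SELECT DISTINCT ?item\n"
        "WHERE {\n"
        f"    ?item {body} .\n"
        "}\n"
        f"LIMIT {limit}"
    )


FILTERS = [
    ("P31", "Q5"),       # instance of: human
    ("P106", "Q82955"),  # occupation: politician
    ("P27", "Q114"),     # country of citizenship: Kenya
]

print(build_query(FILTERS))
```

Keeping the filters in a list makes it easy to add or drop a condition without editing the query text by hand.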

If we look at the Wikidata item for Jomo Kenyatta, we find those properties showing up like this:

Querying using `pywikibot`

Let’s now translate this into a SPARQL query that pywikibot can run for us.

First, we load the necessary library again, and then we write out the query using the SPARQL syntax:

We use SELECT DISTINCT to avoid returning duplicate items, and use ?item to get the whole Wikidata item returned for further processing.

In the WHERE clause, we add our three distinct property/value pairs, all of which must be present, and finally give a LIMIT for how many Wikidata items we’d like to get returned.

Getting help writing queries

Wikimedia runs its own Query Service for Wikidata. You can use it to explore example queries and to test your own queries in a simpler interface, to directly see whether your query is well formed and what it returns.
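The Query Service can also be reached from a script over HTTP. The following is a minimal sketch using only the standard library; it assumes the public endpoint at https://query.wikidata.org/sparql with its `query` and `format` parameters, and the User-Agent string is a placeholder you should replace with your own. The helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://query.wikidata.org/sparql"  # public Wikidata Query Service endpoint


def build_request(query):
    """Build a GET request that asks the endpoint for JSON results."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    return urllib.request.Request(
        f"{ENDPOINT}?{params}",
        # Placeholder User-Agent; use your own tool name and contact details.
        headers={"User-Agent": "wikidata-tutorial-example/0.1 (placeholder)"},
    )


def run_sparql(query):
    """Send the query and return the list of result rows (needs network access)."""
    with urllib.request.urlopen(build_request(query)) as response:
        return json.load(response)["results"]["bindings"]


# Example call, left commented out because it needs network access:
# for row in run_sparql("SELECT ?item WHERE { ?item wdt:P27 wd:Q114 } LIMIT 3"):
#     print(row["item"]["value"])
```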

import pywikibot # load the whole library
# then we specifically import the page generator so we can iterate over all returned items
from pywikibot import pagegenerators as pg 

# we query Wikidata directly, not the English-language Wikipedia
wikidata_site = pywikibot.Site("wikidata", "wikidata") 

# let's translate our query into SPARQL
QUERY = """
SELECT DISTINCT ?item
WHERE {
    ?item wdt:P31 wd:Q5;  # Any instance of a human;
          wdt:P106 wd:Q82955; # occupation politician;
          wdt:P27 wd:Q114; # citizen of kenya;
}
LIMIT 5
"""

Once we have written this query, we can turn it into a generator that allows us to loop over all returned elements.

We will already look at some of the data we get back:

generator = pg.WikidataSPARQLPageGenerator(QUERY, site=wikidata_site)

for item in generator:
    print("item label (en): {}".format(
        item.labels['en']
    ))
    if 'en' in item.descriptions:
        print("item description (en): {}".format(
            item.descriptions['en']
        ))
    print("sitelinks: {}".format(list(
        item.sitelinks
    )))
    if 'enwiki' in item.sitelinks:
        print("sitelink to enwiki: {}".format(
            item.sitelinks["enwiki"]))
    print("----------")

item label (en): Elijah Lagat
item description (en): Kenyan marathon runner and politician
sitelinks: ['dewiki', 'enwiki', 'eswiki', 'itwiki', 'nlwiki', 'swwiki']
sitelink to enwiki: [[Elijah Lagat]]
----------
item label (en): Grace Emily Akinyi Ogot
item description (en): Kenyan politician and writer (1930-2015)
sitelinks: ['arwiki', 'arzwiki', 'dewiki', 'enwiki', 'enwikiquote', 'eowiki', 'fiwiki', 'frwiki', 'hawiki', 'hewiki', 'hywikisource', 'igwiki', 'igwikiquote', 'itwiki', 'lawiki', 'ptwiki', 'rowiki', 'sowiki', 'swwiki', 'tawiki', 'ukwiki', 'urwiki']
sitelink to enwiki: [[Grace Ogot]]
----------
item label (en): Newton Kulundu
item description (en): Kenyan politician (1948-2010)
sitelinks: ['dewiki', 'enwiki', 'ptwiki']
sitelink to enwiki: [[Newton Kulundu]]
----------
item label (en): Wesley Korir
item description (en): Kenyan marathon runner and politician
sitelinks: ['arwiki', 'arzwiki', 'commonswiki', 'dewiki', 'enwiki', 'fawiki', 'frwiki', 'mlwiki', 'nlwiki', 'nowiki', 'ptwiki', 'ruwiki']
sitelink to enwiki: [[Wesley Korir]]
----------
item label (en): David Lelei
item description (en): Kenyan distance runner (1971-2010)
sitelinks: ['arzwiki', 'dewiki', 'enwiki', 'eswiki', 'frwiki', 'nlwiki', 'ptwiki']
sitelink to enwiki: [[David Lelei]]
----------

For many items, the labels and descriptions will be available in multiple languages. Here, we printed only the English one.

And as you see in the sitelinks, Wikidata also returns the cross-links to other Wiki projects, like the English (enwiki) and Swahili (swwiki) language editions of Wikipedia. This allows us now to switch to Wikipedia and read the articles as before.

Exploring what the most common language editions are

Let’s say we’re interested in exploring how different politicians are represented (or not) across language editions. While there’s a large number of languages, let’s pick a small subset to work with for this example. If there are languages you’re interested in, you can add them to the list in the same format, using the corresponding 2-letter ISO 639 language code followed by wiki.

languages = [
    "enwiki",
    "swwiki",
    "dewiki",
    "frwiki",
    "eswiki",
]

Let’s now expand our query a bit and get 20 entries for politicians instead of just 5, and then create a dataframe that contains the wiki-links for each of those combinations of politician & language-edition if present:

import pandas as pd # we'll need pandas to create the DF, your pandas version should be 2.1 or newer

QUERY = """
SELECT DISTINCT ?item
WHERE {
    ?item wdt:P31 wd:Q5;  # Any instance of a human;
          wdt:P106 wd:Q82955; # occupation politician;
          wdt:P27 wd:Q114; # citizen of kenya;
}
LIMIT 20
"""

generator = pg.WikidataSPARQLPageGenerator(QUERY, site=wikidata_site)

politicians = {} # create an empty dictionary that will hold the names as keys, and another dictionary with {'enwiki': '$pagename', …}

for item in generator:
    label = item.labels['en']
    politicians[label] = {}
    print('get languages for {}'.format(label))
    sitelinks = item.sitelinks
    for language in languages:
        if language in sitelinks.keys():
            politicians[label][language] = sitelinks[language].canonical_title()
        else:
            politicians[label][language] = ""

df_language_links = pd.DataFrame(politicians).T # use .T to transpose, so that politicians are rows and languages are columns
df_language_links

get languages for Jomo Kenyatta
get languages for Wangari Muta Maathai
get languages for Elijah Lagat
get languages for William Ruto
get languages for Moses Wetangula
get languages for Harry Thuku
get languages for Grace Emily Akinyi Ogot
get languages for Wesley Korir
get languages for Daniel arap Moi
get languages for Moses Mudavadi
get languages for Uhuru Kenyatta
get languages for Raila A Odinga
get languages for McDonald Mariga
get languages for Richard Leakey
get languages for Newton Kulundu
get languages for Gitobu Imanyara
get languages for David Lelei
get languages for Abdilatif Abdalla
get languages for Mwai Kibaki
get languages for Kalonzo Musyoka

	enwiki	swwiki	dewiki	frwiki	eswiki
Jomo Kenyatta	Jomo Kenyatta	Jomo Kenyatta	Jomo Kenyatta	Jomo Kenyatta	Jomo Kenyatta
Wangari Muta Maathai	Wangarĩ Maathai	Wangari Maathai	Wangari Maathai	Wangari Muta Maathai	Wangari Maathai
Elijah Lagat	Elijah Lagat	Elijah Lagat	Elijah Lagat		Elijah Lagat
William Ruto	William Ruto	William Ruto	William Ruto	William Ruto	William Ruto
Moses Wetangula	Moses Wetang'ula	Moses Wetangula	Moses Wetangula	Moses Wetangula
Harry Thuku	Harry Thuku		Harry Thuku	Harry Thuku
Grace Emily Akinyi Ogot	Grace Ogot	Grace Ogot	Grace Ogot	Grace Ogot
Wesley Korir	Wesley Korir		Wesley Korir	Wesley Korir
Daniel arap Moi	Daniel arap Moi	Daniel Arap Moi	Daniel arap Moi	Daniel arap Moi	Daniel arap Moi
Moses Mudavadi	Moses Mudavadi		Moses Mudavadi
Uhuru Kenyatta	Uhuru Kenyatta	Uhuru Kenyatta	Uhuru Kenyatta	Uhuru Kenyatta	Uhuru Kenyatta
Raila A Odinga	Raila Odinga	Raila Odinga	Raila Odinga	Raila Odinga	Raila Odinga
McDonald Mariga	McDonald Mariga	McDonald Mariga	McDonald Mariga	McDonald Mariga	McDonald Mariga
Richard Leakey	Richard Leakey	Richard Leakey	Richard Leakey	Richard Leakey	Richard Leakey
Newton Kulundu	Newton Kulundu		Newton Kulundu
Gitobu Imanyara	Gitobu Imanyara	Gitobu Imanyara		Gitobu Imanyara
David Lelei	David Lelei		David Lelei	David Lelei	David Lelei
Abdilatif Abdalla	Abdilatif Abdalla	Abdilatif Abdalla	Abdilatif Abdalla	Abdilatif Abdalla	Abdilatif Abdalla
Mwai Kibaki	Mwai Kibaki	Mwai Kibaki	Mwai Kibaki	Mwai Kibaki	Mwai Kibaki
Kalonzo Musyoka	Kalonzo Musyoka	Kalonzo Musyoka	Kalonzo Musyoka	Kalonzo Musyoka

This gives us a nice data frame that includes all the links to the different pages across the target languages. Why did we decide to keep the actual links as values instead of directly making a binary “yes/no” evaluation?

Because the actual names of the pages can differ between language editions, e.g. they can include middle names or add (politician) to the page name to disambiguate multiple persons.

This means that if we want to keep working with the actual pages, we should keep those names to be able to query them.

But let’s now make a small presence/absence matrix:

pam_df = df_language_links.map(lambda x: int(bool(x))) # create presence/absence matrix

pam_df.style.background_gradient(cmap ='viridis',axis=None) # use axis=None to display heatmap calculated across whole DF

	enwiki	swwiki	dewiki	frwiki	eswiki
Jomo Kenyatta	1	1	1	1	1
Wangari Muta Maathai	1	1	1	1	1
Elijah Lagat	1	1	1	0	1
William Ruto	1	1	1	1	1
Moses Wetangula	1	1	1	1	0
Harry Thuku	1	0	1	1	0
Grace Emily Akinyi Ogot	1	1	1	1	0
Wesley Korir	1	0	1	1	0
Daniel arap Moi	1	1	1	1	1
Moses Mudavadi	1	0	1	0	0
Uhuru Kenyatta	1	1	1	1	1
Raila A Odinga	1	1	1	1	1
McDonald Mariga	1	1	1	1	1
Richard Leakey	1	1	1	1	1
Newton Kulundu	1	0	1	0	0
Gitobu Imanyara	1	1	0	1	0
David Lelei	1	0	1	1	1
Abdilatif Abdalla	1	1	1	1	1
Mwai Kibaki	1	1	1	1	1
Kalonzo Musyoka	1	1	1	1	0

This shows us some differences, both between individual politicians and the language editions, e.g. the Spanish Wikipedia has fewer articles for this set of politicians.
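The per-language totals behind this observation are simply the column sums of the matrix. As a small sketch, here is that sum in plain Python for five rows copied from the matrix above (a subset, chosen only to keep the example short):

```python
LANGUAGE_COLUMNS = ["enwiki", "swwiki", "dewiki", "frwiki", "eswiki"]

# Presence/absence rows copied from the matrix above (subset for brevity).
PRESENCE = {
    "Jomo Kenyatta":   [1, 1, 1, 1, 1],
    "Elijah Lagat":    [1, 1, 1, 0, 1],
    "Harry Thuku":     [1, 0, 1, 1, 0],
    "Moses Mudavadi":  [1, 0, 1, 0, 0],
    "Gitobu Imanyara": [1, 1, 0, 1, 0],
}

# Sum each column to count how many of these politicians have an article.
coverage = {
    language: sum(row[i] for row in PRESENCE.values())
    for i, language in enumerate(LANGUAGE_COLUMNS)
}

print(coverage)
```

With the full data frame, `pam_df.sum()` should give the same kind of column totals for all 20 politicians.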

But how long are those articles? Let’s convert this into looking at length, instead of presence/absence. First we need to set up the different Site objects for the different languages, so that we can query those.

And then we re-use our existing politicians dictionary to query the WP API for the corresponding page lengths, and save those in a dictionary that we can ultimately convert into a DataFrame.

Note: Querying all of those pages across language editions can take some time due to API limits.

language_sites = {} # let's store the site objects in a dict, so we can use them for our queries

for language in languages:
    language_sites[language] = pywikibot.Site(language[:2], 'wikipedia') # [:2] to get the language prefix

# for now we store the article lengths in a dict

article_lengths = {}

for politician in politicians:
    article_lengths[politician] = {}
    for language in politicians[politician]:
        if politicians[politician][language]:
            page = pywikibot.Page(language_sites[language], politicians[politician][language])
            article_lengths[politician][language] = len(page.text)
        else: # don't have a page, so length is zero
            article_lengths[politician][language] = 0

df_article_length = pd.DataFrame(article_lengths).T
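The `[:2]` slice used above works for the five two-letter codes in our list, but it would cut off longer codes such as arzwiki, which showed up in the sitelinks earlier. A slightly more general sketch strips the suffix instead. `wikipedia_language` is a hypothetical helper; it assumes the code names a Wikipedia edition, and that sitelink codes use underscores where language codes use hyphens.

```python
def wikipedia_language(site_code):
    """Turn a sitelink code such as 'swwiki' into a language code ('sw')."""
    if not site_code.endswith("wiki"):
        # Sister projects such as 'enwikiquote' are out of scope here.
        raise ValueError(f"not a Wikipedia sitelink code: {site_code}")
    # Drop the 'wiki' suffix; map underscores to hyphens (assumed convention).
    return site_code[: -len("wiki")].replace("_", "-")


for code in ["enwiki", "swwiki", "arzwiki"]:
    print(code, "->", wikipedia_language(code))
```

Codes like commonswiki also end in wiki without being a language edition, so a real script would need an explicit check for those.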

Now that we have the data frame with article lengths, we can plot it as a heat map again.

The heatmap above was done over all cells. But to better compare the lengths within a language, we will now assign colors by column instead, i.e. language edition:

df_article_length.style.background_gradient(cmap ='viridis',axis=0)

	enwiki	swwiki	dewiki	frwiki	eswiki
Jomo Kenyatta	154884	5528	25027	13234	15096
Wangari Muta Maathai	76445	3211	22759	27246	61486
Elijah Lagat	12564	1091	4761	0	5463
William Ruto	87708	8643	18334	19112	26486
Moses Wetangula	19496	2758	4878	2891	0
Harry Thuku	13230	0	3704	12453	0
Grace Emily Akinyi Ogot	16636	4621	4682	8000	0
Wesley Korir	16759	0	3764	9470	0
Daniel arap Moi	55812	10665	9839	8431	10622
Moses Mudavadi	985	0	1033	0	0
Uhuru Kenyatta	120687	8931	24740	26056	11713
Raila A Odinga	123275	10775	30587	53369	6078
McDonald Mariga	19953	4250	25325	7607	15887
Richard Leakey	44530	3165	10229	8184	20618
Newton Kulundu	4786	0	3386	0	0
Gitobu Imanyara	4593	4758	0	1856	0
David Lelei	5151	0	1779	3847	4818
Abdilatif Abdalla	4321	1564	3818	1056	3112
Mwai Kibaki	89924	10254	13499	9307	7643
Kalonzo Musyoka	28477	2510	2659	5503	0
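The difference between coloring over the whole table (axis=None) and per column (axis=0) comes down to which minimum and maximum each cell is scaled against. The sketch below illustrates the idea with plain min-max scaling on three rows copied from the table above; it is an approximation of what the styling does, not its exact implementation.

```python
# Article lengths for Jomo Kenyatta, Wangari Muta Maathai and Elijah Lagat.
LENGTHS = {
    "enwiki": [154884, 76445, 12564],
    "swwiki": [5528, 3211, 1091],
}


def scale(values, low, high):
    """Min-max scale values to the range 0..1 using the given bounds."""
    return [round((v - low) / (high - low), 2) for v in values]


# Whole-table scaling: one shared minimum and maximum for all cells.
all_values = [v for column in LENGTHS.values() for v in column]
shared = {
    name: scale(column, min(all_values), max(all_values))
    for name, column in LENGTHS.items()
}

# Per-column scaling: each language edition gets its own minimum and maximum.
per_column = {
    name: scale(column, min(column), max(column))
    for name, column in LENGTHS.items()
}

print(shared["swwiki"])      # all values close to 0, dwarfed by enwiki
print(per_column["swwiki"])  # values spread over the full 0..1 range
```

Scaling per column makes differences within the much shorter Swahili articles visible, at the cost of colors no longer being comparable across languages.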

Working with claims

Now that we have seen how we can bridge between Wikidata items and Wikipedia pages, let us see how we can explore the Wikidata knowledge graph a bit more beyond just getting the objects and cross-links.

One thing that many Wikidata items about people contain is the gender of the person they describe, which we need if we want to explore whether Wikipedia biographies differ by the gender of the person.

Inferring a person’s gender is a fraught effort, and misclassifications are not evenly distributed across race or ethnicity. In the Wikidata knowledge graph, gender is represented by property P21 (technically, Wikidata labels it sex or gender), which allows us to look up the recorded value instead of inferring it ourselves.

To explore this, let’s query a single politician using the same query as before and use that Wikidata item to explore the knowledge graph:

QUERY = """
SELECT DISTINCT ?item
WHERE {
    ?item wdt:P31 wd:Q5;  # Any instance of a human;
          wdt:P106 wd:Q82955; # occupation politician;
          wdt:P27 wd:Q114; # citizen of kenya;
}
LIMIT 1
"""

generator = pg.WikidataSPARQLPageGenerator(QUERY, site=wikidata_site)

wikidata_item = next(generator) # to just emit the single response

print('got {} as politician'.format(wikidata_item.labels['en']))

got Moses Mudavadi as politician

Now that we have our Wikidata item, let’s find out if it contains a claim about the person’s gender:

# p21: gender
# [0] get the first claim for this property - for some properties there could be multiple claims!
# target: Get the value of that claim - can be an item too
# label: get english name of item, i.e. gender marker in english

wikidata_item.claims['P21'][0].target.labels['en']

'male'

Let’s look at this step by step:

.claims['P21'] asks to get all values of this property for our Wikidata item of interest. Items can have zero to many claims for a given property. We need to consider this when doing error handling.
[0] - we are interested in only looking at the first value for this property, in case there’d be multiple. Of course, depending on the property, we could be interested at looking at all of them!
.target gives us the actual value of that claim. For gender this will be another Wikidata item, which has translatable labels; .labels['en'] then gives us its English name.
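Since an item can have zero claims for a property, a chained lookup like the one above can fail. The sketch below shows one defensive way to walk the same steps. `first_claim_label` is a hypothetical helper that relies only on the attributes used above (.claims, .target, .labels); the handling of a missing target is an assumption, and the stand-in objects merely mimic the attribute layout so the example runs without network access.

```python
from types import SimpleNamespace


def first_claim_label(item, prop, lang="en", default="unknown"):
    """Return the label of the first claim's target, or a default."""
    claims = item.claims.get(prop, [])
    if not claims:            # the item has no claim for this property
        return default
    target = claims[0].target
    if target is None:        # assumed: a claim recorded without a concrete value
        return default
    return target.labels.get(lang, default)  # the label may be missing in this language


# Stand-in objects that mimic the attribute layout of a Wikidata item.
male = SimpleNamespace(labels={"en": "male"})
with_gender = SimpleNamespace(claims={"P21": [SimpleNamespace(target=male)]})
without_gender = SimpleNamespace(claims={})

print(first_claim_label(with_gender, "P21"))
print(first_claim_label(without_gender, "P21"))
```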

Bringing it together

Now, let’s try to get a set of politicians and their gender according to Wikidata, and then get the page lengths for the different language editions. We can then do a simple comparison between article length and gender in those languages:

QUERY = """
SELECT DISTINCT ?item
WHERE {
    ?item wdt:P31 wd:Q5;  # Any instance of a human;
          wdt:P106 wd:Q82955; # occupation politician;
          wdt:P27 wd:Q114; # citizen of kenya;
}
LIMIT 10
"""


LANGUAGES = [
    "enwiki",
    "swwiki",
    "dewiki",
    "frwiki",
    "eswiki",
]

LANGUAGE_SITES = {} # let's store the site objects in a dict, so we can use them for our queries

for language in LANGUAGES:
    LANGUAGE_SITES[language] = pywikibot.Site(language[:2], 'wikipedia') # [:2] to get the language prefix



# Let's wrap some of the code we wrote above into functions, to make it easier to call and ultimately merge it into a dataframe at the end!

def get_gender(wikidata_item):
    if "P21" in wikidata_item.claims.keys():
        return wikidata_item.claims['P21'][0].target.labels['en']
    else:
        return "unknown"


def run_query(query):
    generator = pg.WikidataSPARQLPageGenerator(query, site=wikidata_site)

    politician_data = {}

    for item in generator:
        qid = item.getID() # get unique ID
        politician_data[qid] = {} # initialize dictionary 
        politician_data[qid]['gender'] = get_gender(item) # add gender
        politician_data[qid]['label'] = item.labels['en'] # add english label
        print('get languages for {}'.format(qid))
        sitelinks = item.sitelinks
        for language in LANGUAGES:
            if language in sitelinks.keys():
                page_title = sitelinks[language].canonical_title()
                page = pywikibot.Page(LANGUAGE_SITES[language], page_title)
                politician_data[qid][language] = len(page.text)
            else: # don't have a page, so length is zero
                politician_data[qid][language] = 0
    return politician_data

politician_data = run_query(QUERY)
pd.DataFrame(politician_data).T

get languages for Q173563
get languages for Q193492
get languages for Q308189
get languages for Q313893
get languages for Q46795
get languages for Q202077
get languages for Q318960
get languages for Q195725
get languages for Q121708
get languages for Q196070

	gender	label	enwiki	swwiki	dewiki	frwiki	eswiki
Q173563	male	Jomo Kenyatta	154884	5528	25027	13234	15096
Q193492	male	Daniel arap Moi	55812	10665	9839	8431	10622
Q308189	male	Abdilatif Abdalla	4321	1564	3818	1056	3112
Q313893	male	McDonald Mariga	19953	4250	25325	7607	15887
Q46795	female	Wangari Muta Maathai	76445	3211	22759	27246	61486
Q202077	male	Harry Thuku	13230	0	3704	12453	0
Q318960	male	Richard Leakey	44530	3165	10229	8184	20618
Q195725	male	William Ruto	87708	8643	18334	19112	26486
Q121708	male	Moses Mudavadi	985	0	1033	0	0
Q196070	male	Uhuru Kenyatta	120687	8931	24740	26056	11713

While this form of querying works, it shows quite a heavy gender bias (which could be in politics, in Wikipedia, or both?). Either way, by querying without gender, we end up with an overwhelmingly male data set.

To be able to actually make a comparison, let’s do two queries instead, one for male and one for female politicians:

male_query = """
SELECT DISTINCT ?item
WHERE {
    ?item wdt:P31 wd:Q5;  # Any instance of a human;
          wdt:P106 wd:Q82955; # occupation politician;
          wdt:P27 wd:Q114; # citizen of kenya;
          wdt:P21 wd:Q6581097; # male
}
LIMIT 20
"""

female_query = """
SELECT DISTINCT ?item
WHERE {
    ?item wdt:P31 wd:Q5;  # Any instance of a human;
          wdt:P106 wd:Q82955; # occupation politician;
          wdt:P27 wd:Q114; # citizen of kenya;
          wdt:P21 wd:Q6581072; # female
}
LIMIT 20
"""


male_politicians = run_query(male_query)
female_politicians = run_query(female_query)

get languages for Q173563
get languages for Q710549
get languages for Q346046
get languages for Q1441663
get languages for Q195725
get languages for Q733817
get languages for Q202077
get languages for Q593552
get languages for Q447119
get languages for Q193492
get languages for Q121708
get languages for Q196070
get languages for Q457713
get languages for Q1372633
get languages for Q645228
get languages for Q456466
get languages for Q308189
get languages for Q743818
get languages for Q733180
get languages for Q1395018
get languages for Q46795
get languages for Q16886393
get languages for Q47489052
get languages for Q47490030
get languages for Q47490021
get languages for Q47486849
get languages for Q469641
get languages for Q16734504
get languages for Q16886147
get languages for Q16842239
get languages for Q47490034
get languages for Q539821
get languages for Q16885582
get languages for Q565671
get languages for Q44090439
get languages for Q47489991
get languages for Q47490033
get languages for Q16866530
get languages for Q47483883
get languages for Q16886400

Now we can merge the two dictionaries and create one dataframe:

all_politicians = female_politicians | male_politicians
df_all_politicians = pd.DataFrame(all_politicians).T
df_all_politicians

	gender	label	enwiki	swwiki	dewiki	frwiki	eswiki
Q46795	female	Wangari Muta Maathai	76445	3211	22759	27246	61486
Q16886393	female	Anne Waiguru	16354	4402	0	0	0
Q47489052	female	Gladys Wanga	5150	1417	0	0	0
Q47490030	female	Catherine Nanjala Wambilianga	2536	690	0	0	0
Q47490021	female	Joyce Chepkoech Korir	4077	645	0	0	0
Q47486849	female	Gertrude Mbeyu Mwanyanje	1293	0	0	0	0
Q469641	female	Grace Emily Akinyi Ogot	16636	4621	4682	8000	0
Q16734504	female	Phyllis Kandie	9499	0	0	0	0
Q16886147	female	Peris Tobiko	5914	5274	0	0	0
Q16842239	female	Sonia Birdi	6330	1647	0	0	0
Q47490034	female	Jane Wanjuki Njiru	4486	933	0	0	0
Q539821	female	Charity Ngilu	23356	1772	2740	10607	1800
Q16885582	female	Naomi Shaban	3471	793	0	0	0
Q565671	female	Anne Nyokabi Muhoho	2144	0	0	0	0
Q44090439	female	Roslyne Akombe	11692	613	0	0	0
Q47489991	female	Gladwell Jesire Cheruiyot	4350	1342	0	0	0
Q47490033	female	Jane Jepkorir Kiptoo Chebaibai	4073	0	0	0	0
Q16866530	female	Margaret Gakuo Kenyatta	5542	0	8273	5127	0
Q47483883	female	Elsie Busihile Muhanda	1930	0	0	0	0
Q16886400	female	Judy Wakhungu	17234	1778	0	0	0
Q173563	male	Jomo Kenyatta	154884	5528	25027	13234	15096
Q710549	male	Arthur Magugu	1125	920	1201	0	1835
Q346046	male	Elijah Lagat	12564	1091	4761	0	5463
Q1441663	male	Francis Muthaura	11033	0	2896	1359	0
Q195725	male	William Ruto	87708	8643	18334	19112	26486
Q733817	male	George Saitoti	50502	5149	3978	6445	1868
Q202077	male	Harry Thuku	13230	0	3704	12453	0
Q593552	male	Michael Kijana Wamalwa	8368	5503	3166	0	0
Q447119	male	Wesley Korir	16759	0	3764	9470	0
Q193492	male	Daniel arap Moi	55812	10665	9839	8431	10622
Q121708	male	Moses Mudavadi	985	0	1033	0	0
Q196070	male	Uhuru Kenyatta	120687	8931	24740	26056	11713
Q457713	male	Newton Kulundu	4786	0	3386	0	0
Q1372633	male	Gitobu Imanyara	4593	4758	0	1856	0
Q645228	male	Jaramogi Oginga Odinga	14768	4219	3711	15220	2581
Q456466	male	David Lelei	5151	0	1779	3847	4818
Q308189	male	Abdilatif Abdalla	4321	1564	3818	1056	3112
Q743818	male	David Kimutai Too	4976	0	1562	7618	0
Q733180	male	Tom Mboya	24597	10552	6843	18019	0
Q1395018	male	Kalonzo Musyoka	28477	2510	2659	5503	0

Now let’s try plotting this as a boxplot, grouping by gender for each of the languages:

df_all_politicians.plot(kind='box',column=LANGUAGES, by=["gender"],layout=(5, 1),figsize=(10,15))

dewiki    Axes(0.125,0.747241;0.775x0.132759)
enwiki    Axes(0.125,0.587931;0.775x0.132759)
eswiki    Axes(0.125,0.428621;0.775x0.132759)
frwiki     Axes(0.125,0.26931;0.775x0.132759)
swwiki        Axes(0.125,0.11;0.775x0.132759)
dtype: object
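As a numeric companion to the boxplot, the medians per group can be computed directly. The sketch below uses the English Wikipedia lengths copied from the merged table above and only the standard library. Keep in mind that both groups come from queries capped with LIMIT 20, so this describes our sample rather than all Kenyan politicians.

```python
from statistics import median

# English Wikipedia article lengths copied from the merged table above.
ENWIKI_LENGTHS = {
    "female": [76445, 16354, 5150, 2536, 4077, 1293, 16636, 9499, 5914, 6330,
               4486, 23356, 3471, 2144, 11692, 4350, 4073, 5542, 1930, 17234],
    "male": [154884, 1125, 12564, 11033, 87708, 50502, 13230, 8368, 16759, 55812,
             985, 120687, 4786, 4593, 14768, 5151, 4321, 4976, 24597, 28477],
}

for gender, lengths in ENWIKI_LENGTHS.items():
    print(gender, "n =", len(lengths), "median length =", median(lengths))
```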

Exploring the difference in number of pages by gender

Let’s also investigate how large the difference in number of recorded items is for male & female politicians in Wikidata. For that we can run two queries to just get the number of items:

SELECT (COUNT(DISTINCT ?item) AS ?count) # get count of items instead of actual pages
WHERE {
    ?item wdt:P31 wd:Q5;  # Any instance of a human;
          wdt:P106 wd:Q82955; # occupation politician;
          wdt:P27 wd:Q114; # citizen of kenya;
          wdt:P21 wd:Q6581097; # male
}

and

SELECT (COUNT(DISTINCT ?item) AS ?count) # get count of items instead of actual pages
WHERE {
    ?item wdt:P31 wd:Q5;  # Any instance of a human;
          wdt:P106 wd:Q82955; # occupation politician;
          wdt:P27 wd:Q114; # citizen of kenya;
          wdt:P21 wd:Q6581072; # female
}

We can run these directly using the Wikidata Query Service, without having to write code (e.g. for male and female politicians). This shows us that there are 224 items for female politicians and 915 items for male politicians.
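To put these two counts in proportion, a few lines of arithmetic are enough. The sketch uses only the two numbers reported above; items with other or missing gender values are not part of either count.

```python
# Item counts reported by the two COUNT queries above.
female_items = 224
male_items = 915

total = female_items + male_items
female_share = female_items / total
male_per_female = male_items / female_items

print(f"male + female items: {total}")
print(f"share of female politicians: {female_share:.1%}")
print(f"male items per female item: {male_per_female:.2f}")
```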

Using some very basic NLP to identify a gender from a Wikipedia page

So far, we have used the gender information as directly supplied by Wikidata (and even explicitly filtered for the presence of the information). But potentially this information is not present in all Wikidata items describing politicians.

Plus, if we think back to just querying category pages, we might also want to use Wikipedia pages directly, without expecting or needing a corresponding Wikidata item. How could we assign gender in that case? A very simple approach would be to count the pronouns in each page and use those to assign gender.

We can try to explore how well such an approach would work by using our existing set of politicians, loading the (English) articles and giving it a try:

wikidata_site = language_sites['enwiki'].data_repository() # setup wikidata site object

pronouns_female = ['she', 'her', 'herself', 'hers']
pronouns_male = ['he', 'him', 'himself', 'his']

pronoun_female_count = []
pronoun_male_count = []

for qid in df_all_politicians.index: # iterate over all qid which we so far used as df index
    sitelinks = pywikibot.ItemPage(wikidata_site, qid).sitelinks # load the item's sitelinks only once
    if 'enwiki' in sitelinks.keys(): # not all wikidata items have english language articles
        article_title = sitelinks['enwiki'].canonical_title() # get english page title
        page = pywikibot.Page(LANGUAGE_SITES['enwiki'], article_title) # get article text 
        page_text = page.text.lower() # make lower case to simplify matching
    
        # sum up female pronoun counts, add to list 
        counts_female = 0
        for p in pronouns_female:
            counts_female += page_text.count(" {} ".format(p)) # require space before/after pronoun
        pronoun_female_count.append(counts_female)
    
        # sum up male pronoun counts, add to list
        counts_male = 0
        for p in pronouns_male:
            counts_male += page_text.count(" {} ".format(p))
        pronoun_male_count.append(counts_male)
    else:
        pronoun_female_count.append(0)
        pronoun_male_count.append(0)


# add pronouns to DF
df_all_politicians['pronoun_female_count'] = pronoun_female_count
df_all_politicians['pronoun_male_count'] = pronoun_male_count

df_all_politicians

	gender	label	enwiki	swwiki	dewiki	frwiki	eswiki	pronoun_female_count	pronoun_male_count
Q46795	female	Wangari Muta Maathai	76445	3211	22759	27246	61486	193	13
Q16886393	female	Anne Waiguru	16354	4402	0	0	0	37	1
Q47489052	female	Gladys Wanga	5150	1417	0	0	0	10	0
Q47490030	female	Catherine Nanjala Wambilianga	2536	690	0	0	0	3	0
Q47490021	female	Joyce Chepkoech Korir	4077	645	0	0	0	6	0
Q47486849	female	Gertrude Mbeyu Mwanyanje	1293	0	0	0	0	1	0
Q469641	female	Grace Emily Akinyi Ogot	16636	4621	4682	8000	0	40	2
Q16734504	female	Phyllis Kandie	9499	0	0	0	0	14	1
Q16886147	female	Peris Tobiko	5914	5274	0	0	0	47	2
Q16842239	female	Sonia Birdi	6330	1647	0	0	0	12	0
Q47490034	female	Jane Wanjuki Njiru	4486	933	0	0	0	10	0
Q539821	female	Charity Ngilu	23356	1772	2740	10607	1800	45	5
Q16885582	female	Naomi Shaban	3471	793	0	0	0	10	0
Q565671	female	Anne Nyokabi Muhoho	2144	0	0	0	0	7	0
Q44090439	female	Roslyne Akombe	11692	613	0	0	0	32	7
Q47489991	female	Gladwell Jesire Cheruiyot	4350	1342	0	0	0	13	0
Q47490033	female	Jane Jepkorir Kiptoo Chebaibai	4073	0	0	0	0	14	0
Q16866530	female	Margaret Gakuo Kenyatta	5542	0	8273	5127	0	9	0
Q47483883	female	Elsie Busihile Muhanda	1930	0	0	0	0	8	0
Q16886400	female	Judy Wakhungu	17234	1778	0	0	0	26	0
Q173563	male	Jomo Kenyatta	154884	5528	25027	13234	15096	12	490
Q710549	male	Arthur Magugu	1125	920	1201	0	1835	0	2
Q346046	male	Elijah Lagat	12564	1091	4761	0	5463	0	31
Q1441663	male	Francis Muthaura	11033	0	2896	1359	0	0	28
Q195725	male	William Ruto	87708	8643	18334	19112	26486	1	129
Q733817	male	George Saitoti	50502	5149	3978	6445	1868	0	86
Q202077	male	Harry Thuku	13230	0	3704	12453	0	0	27
Q593552	male	Michael Kijana Wamalwa	8368	5503	3166	0	0	0	38
Q447119	male	Wesley Korir	16759	0	3764	9470	0	0	56
Q193492	male	Daniel arap Moi	55812	10665	9839	8431	10622	5	88
Q121708	male	Moses Mudavadi	985	0	1033	0	0	0	3
Q196070	male	Uhuru Kenyatta	120687	8931	24740	26056	11713	1	147
Q457713	male	Newton Kulundu	4786	0	3386	0	0	0	11
Q1372633	male	Gitobu Imanyara	4593	4758	0	1856	0	0	16
Q645228	male	Jaramogi Oginga Odinga	14768	4219	3711	15220	2581	0	42
Q456466	male	David Lelei	5151	0	1779	3847	4818	0	8
Q308189	male	Abdilatif Abdalla	4321	1564	3818	1056	3112	0	17
Q743818	male	David Kimutai Too	4976	0	1562	7618	0	1	9
Q733180	male	Tom Mboya	24597	10552	6843	18019	0	0	65
Q1395018	male	Kalonzo Musyoka	28477	2510	2659	5503	0	0	56
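Counting only pronouns that have a space on both sides misses cases such as a pronoun at the start of the text or directly before a comma. One possible variation, sketched here rather than used for the table above, is to split the text into word tokens with a regular expression first. The function name and the sample sentence are made up for illustration.

```python
import re
from collections import Counter

PRONOUNS_FEMALE = {"she", "her", "herself", "hers"}
PRONOUNS_MALE = {"he", "him", "himself", "his"}


def count_pronouns(text):
    """Count female and male pronouns using word tokens instead of spaces."""
    words = Counter(re.findall(r"[a-z]+", text.lower()))
    female = sum(words[p] for p in PRONOUNS_FEMALE)
    male = sum(words[p] for p in PRONOUNS_MALE)
    return female, male


sample = "She studied biology. Her work, she said, was for everyone."
print(count_pronouns(sample))
```

On this sample the token-based count finds three female pronouns, while the space-delimited approach would find only two, because the leading "She" has no space in front of it.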

Now we have the absolute counts. For demonstration purposes, let’s now make a very simple classifier, in which we assign the gender based on simple majority:

def assign_gender(row):
    female = row['pronoun_female_count']
    male = row['pronoun_male_count']
    if female > male: # more female than male pronouns
        return 'female'
    elif male > female: # more male than female pronouns
        return 'male'
    else: # tie, or no pronouns found at all
        return 'unknown'
    


df_all_politicians['gender_nlp'] = df_all_politicians.apply(assign_gender,axis=1)

We can now compare whether the gender recorded in Wikidata and our prediction agree:

df_all_politicians['gender'] == df_all_politicians['gender_nlp']

Q46795       True
Q16886393    True
Q47489052    True
Q47490030    True
Q47490021    True
Q47486849    True
Q469641      True
Q16734504    True
Q16886147    True
Q16842239    True
Q47490034    True
Q539821      True
Q16885582    True
Q565671      True
Q44090439    True
Q47489991    True
Q47490033    True
Q16866530    True
Q47483883    True
Q16886400    True
Q173563      True
Q710549      True
Q346046      True
Q1441663     True
Q195725      True
Q733817      True
Q202077      True
Q593552      True
Q447119      True
Q193492      True
Q121708      True
Q196070      True
Q457713      True
Q1372633     True
Q645228      True
Q456466      True
Q308189      True
Q743818      True
Q733180      True
Q1395018     True
dtype: bool

We see that, in this case, the naive classification agrees with Wikidata for every politician in our sample, but of course that is far from guaranteed!
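To summarise such a comparison in a single number, we can compute the share of matching predictions. The sketch below does this in plain Python on toy values that are not taken from the data above; `agreement` is a hypothetical helper name.

```python
def agreement(labels, predictions):
    """Return the share of predictions that match the reference labels."""
    pairs = list(zip(labels, predictions))
    matches = sum(1 for label, predicted in pairs if label == predicted)
    return matches / len(pairs)


# Toy values for illustration only.
wikidata_gender = ["female", "female", "male", "male"]
predicted_gender = ["female", "male", "male", "male"]

print(agreement(wikidata_gender, predicted_gender))
```

On the data frame itself, taking `.mean()` of the boolean comparison shown above should give the same kind of figure.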